Note: this page shows the Feature-Based Change Log for a release
These features were completed when this image was assembled
This outcome tracks the overall CoreOS Layering story as well as the technical items needed to converge CoreOS with RHEL image mode. This will provide operational consistency across the platforms.
ROADMAP for this Outcome: https://docs.google.com/document/d/1K5uwO1NWX_iS_la_fLAFJs_UtyERG32tdt-hLQM8Ow8/edit?usp=sharing
This work describes the tech preview state of On Cluster Builds. Major interfaces should be agreed upon at the end of this state.
Description of problem:
When we activate the on-cluster-build functionality in a pool with yum based RHEL nodes, the pool is degraded reporting this error:
- lastTransitionTime: "2023-09-20T15:14:44Z"
  message: 'Node ip-10-0-57-169.us-east-2.compute.internal is reporting: "error running rpm-ostree --version: exec: \"rpm-ostree\": executable file not found in $PATH"'
  reason: 1 nodes are reporting degraded status on sync
  status: "True"
  type: NodeDegraded
Version-Release number of selected component (if applicable):
4.14.0-0.nightly-2023-09-15-233408
How reproducible:
Always
Steps to Reproduce:
1. Create a cluster and add a yum based RHEL node to the worker pool (we used RHEL8)
2. Create the necessary resources to enable the OCB functionality: pull and push secrets and the on-cluster-build-config configmap. For example we can use this if we want to use the internal registry:

cat << EOF | oc create -f -
apiVersion: v1
data:
  baseImagePullSecretName: $(oc get secret -n openshift-config pull-secret -o json | jq "del(.metadata.namespace, .metadata.creationTimestamp, .metadata.resourceVersion, .metadata.uid, .metadata.name)" | jq '.metadata.name="pull-copy"' | oc -n openshift-machine-config-operator create -f - &> /dev/null; echo -n "pull-copy")
  finalImagePushSecretName: $(oc get -n openshift-machine-config-operator sa builder -ojsonpath='{.secrets[0].name}')
  finalImagePullspec: "image-registry.openshift-image-registry.svc:5000/openshift-machine-config-operator/ocb-image"
  imageBuilderType: ""
kind: ConfigMap
metadata:
  name: on-cluster-build-config
  namespace: openshift-machine-config-operator
EOF

The configuration doesn't matter as long as the OCB functionality can work.
3. Label the worker pool so that the OCB functionality is enabled:

$ oc label mcp/worker machineconfiguration.openshift.io/layering-enabled=
Actual results:
The RHEL node shows this log:

I0920 15:14:42.852742 1979 daemon.go:760] Preflight config drift check successful (took 17.527225ms)
I0920 15:14:42.852763 1979 daemon.go:2150] Performing layered OS update
I0920 15:14:42.868723 1979 update.go:1970] Starting transition to "image-registry.openshift-image-registry.svc:5000/openshift-machine-config-operator/tc-67566@sha256:24ea4b12acf93095732ba457fc3e8c7f1287b669f2aceec65a33a41f7e8ceb01"
I0920 15:14:42.871625 1979 update.go:1970] drain is already completed on this node
I0920 15:14:42.874305 1979 rpm-ostree.go:307] Running captured: rpm-ostree --version
E0920 15:14:42.874388 1979 writer.go:226] Marking Degraded due to: error running rpm-ostree --version: exec: "rpm-ostree": executable file not found in $PATH
I0920 15:15:37.570503 1979 daemon.go:670] Transitioned from state: Working -> Degraded
I0920 15:15:37.570529 1979 daemon.go:673] Transitioned from degraded/unreconcilable reason -> error running rpm-ostree --version: exec: "rpm-ostree": executable file not found in $PATH
I0920 15:15:37.574942 1979 daemon.go:2300] Not booted into a CoreOS variant, ignoring target OSImageURL quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:e3128a8e42fb70ab6fc276f7005e3c0839795e4455823c8ff3eca9b1050798b9
I0920 15:15:37.591529 1979 daemon.go:760] Preflight config drift check successful (took 16.588912ms)
I0920 15:15:37.591549 1979 daemon.go:2150] Performing layered OS update
I0920 15:15:37.591562 1979 update.go:1970] Starting transition to "image-registry.openshift-image-registry.svc:5000/openshift-machine-config-operator/tc-67566@sha256:24ea4b12acf93095732ba457fc3e8c7f1287b669f2aceec65a33a41f7e8ceb01"
I0920 15:15:37.594534 1979 update.go:1970] drain is already completed on this node
I0920 15:15:37.597261 1979 rpm-ostree.go:307] Running captured: rpm-ostree --version
E0920 15:15:37.597315 1979 writer.go:226] Marking Degraded due to: error running rpm-ostree --version: exec: "rpm-ostree": executable file not found in $PATH
I0920 15:16:37.613270 1979 daemon.go:2300] Not booted into a CoreOS variant, ignoring target OSImageURL quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:e3128a8e42fb70ab6fc276f7005e3c0839795e4455823c8ff3eca9b1050798b9

And the worker pool is degraded with this error:
- lastTransitionTime: "2023-09-20T15:14:44Z"
  message: 'Node ip-10-0-57-169.us-east-2.compute.internal is reporting: "error running rpm-ostree --version: exec: \"rpm-ostree\": executable file not found in $PATH"'
  reason: 1 nodes are reporting degraded status on sync
  status: "True"
  type: NodeDegraded
Expected results:
The pool should not be degraded.
Additional info:
In the initial delivery of CoreOS Layering, it is required that administrators provide their own build environment to customize RHCOS images. That could be a traditional RHEL environment or potentially an enterprising administrator with some knowledge of OCP Builds could set theirs up on-cluster.
The primary virtue of an on-cluster build path is to continue using the cluster to manage the cluster. No external dependency, batteries-included.
On-cluster, automated RHCOS Layering builds are important for multiple reasons:
The goal of this effort is to leverage OVN Kubernetes SDN to satisfy networking requirements of both traditional and modern virtualization. This Feature describes the envisioned outcome and tracks its implementation.
In its current state, OpenShift Virtualization provides a flexible toolset allowing customers to connect VMs to the physical network. It also has limited secondary overlay network capabilities and Pod network support.
It suffers from several gaps: the topology of the default pod network is not suitable for typical VM workloads, so we miss out on many of the advanced capabilities of OpenShift networking, and we also don't have a good solution for public cloud. Another problem is that while we provide plenty of tools to build a network solution, we are not very good at guiding cluster administrators through configuring their network, leaving them to rely on their account team.
Provide:
... while maintaining networking expectations of a typical VM workload:
Additionally, make our networking configuration more accessible to newcomers by providing a finite list of user stories mapped to recommended solutions.
You can find more info about this effort in https://docs.google.com/document/d/1jNr0E0YMIHsHu-aJ4uB2YjNY00L9TpzZJNWf3LxRsKY/edit
Provide IPAM to customers connecting VMs to OVN Kubernetes secondary networks.
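For illustration, a minimal sketch of a NetworkAttachmentDefinition for an OVN Kubernetes layer2 secondary network with built-in IPAM (names and the subnet are hypothetical; the subnets key is what provides IPAM on the secondary network):

apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: tenant-blue          # hypothetical name
  namespace: vm-workloads    # hypothetical namespace
spec:
  config: |
    {
      "cniVersion": "0.3.1",
      "name": "tenant-blue",
      "type": "ovn-k8s-cni-overlay",
      "topology": "layer2",
      "subnets": "192.168.100.0/24",
      "netAttachDefName": "vm-workloads/tenant-blue"
    }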
Who | What | Reference |
---|---|---|
DEV | Upstream roadmap issue | <link to GitHub Issue> |
DEV | Upstream code and tests merged | <link to meaningful PR> |
DEV | Upstream documentation merged | <link to meaningful PR> |
DEV | gap doc updated | <name sheet and cell> |
DEV | Upgrade consideration | <link to upgrade-related test or design doc> |
DEV | CEE/PX summary presentation | label epic with cee-training and add a <link to your support-facing preso> |
QE | Test plans in Polarion | https://polarion.engineering.redhat.com/polarion/#/project/CNV/workitem?id=CNV-10864 |
QE | Automated tests merged | <link or reference to automated tests> |
DOC | Downstream documentation merged | <link to meaningful PR> |
Add a knob to CNO to control the installation of the IPAMClaim CRD.
Requires a new OpenShift feature gate, only allowing the feature to be installed in Dev / Tech Preview.
Placeholder feature for ccx-ocp-core maintenance tasks.
This epic tracks "business as usual" requirements / enhancements / bug fixing of the Insights Operator.
Description of problem:
InsightsRecommendationActive is firing, and the link in the alert description results in "Invalid parameter: redirect_uri" on sso.redhat.com. The alert description: Insights recommendation "OpenShift cluster with more or less than 3 control plane node replicas is not supported by Red Hat" with total risk "Moderate" was detected on the cluster. More information is available at https://console.redhat.com/openshift/insights/advisor/clusters/<UID>?first=ccx_rules_ocp.external.rules.control_plane_replicas|CONTROL_PLANE_NODE_REPLICAS.
Version-Release number of selected component (if applicable):
4.15.14
How reproducible:
unknown
Steps to Reproduce:
1. Install 4.15.14 on a cluster that triggers this alert
2. Log out of Red Hat SSO
3. Click the link in the alert description
Actual results:
"Invalid parameter: redirect_uri" on sso.redhat.com
Expected results:
Link successfully navigates through SSO
Additional info:
Description of problem:
We have a test, test_cluster_base_domain_obfuscation, that checks that when we set the insights-config configmap with the obfuscation parameter set to "networking", the archive does not contain any instance of the api_url or the base_hostname of the cluster. This is currently not happening in HyperShift hosted clusters.
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. Run test_cluster_base_domain_obfuscation
Or
1. Create the insights-config configmap in the openshift-insights namespace, with dataReporting: obfuscation: Networking
2. Wait until the obfuscation creation table exists
3. Download the archive
4. Check every path in the archive and search for instances of the api_url or the base_hostname (easier to do with automation than manually)
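For step 1, a minimal sketch of the configmap, assuming the insights-config settings live under a config.yaml key (the obfuscation value is an assumption and may differ by version):

apiVersion: v1
kind: ConfigMap
metadata:
  name: insights-config
  namespace: openshift-insights
data:
  config.yaml: |
    dataReporting:
      obfuscation:
        - networking    # assumed value; enables networking obfuscation in the gathered archive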
Actual results:
Instances are found.
Expected results:
No instances are found since they've all been obfuscated.
Additional info:
This epic tracks "business as usual" requirements / enhancements / bug fixing of the Insights Operator.
The Insights operator should replace the %s placeholder in https://console.redhat.com/api/gathering/v2/%s/gathering_rules error messages like the failed-to-bootstrap example below:
$ jq -r .content osd-ccs-gcp-ad-install.log | sed 's/\\n/\n/g' | grep 'Cluster operator insights'
time="2024-09-05T08:12:51Z" level=info msg="Cluster operator insights ClusterTransferAvailable is False with Unauthorized: failed to pull cluster transfer: OCM API https://api.openshift.com/api/accounts_mgmt/v1/cluster_transfers/?search=cluster_uuid+is+%REDACTED%27+and+status+is+%27accepted%27 returned HTTP 401: REDACTED"
time="2024-09-05T08:12:51Z" level=info msg="Cluster operator insights Disabled is False with AsExpected: "
time="2024-09-05T08:12:51Z" level=info msg="Cluster operator insights RemoteConfigurationAvailable is False with HttpStatus401: received HTTP 401 Unauthorized from https://console.redhat.com/api/gathering/v2/%s/gathering_rules"
time="2024-09-05T08:12:51Z" level=info msg="Cluster operator insights RemoteConfigurationValid is Unknown with NoValidationYet: "
time="2024-09-05T08:12:51Z" level=info msg="Cluster operator insights SCAAvailable is False with Unauthorized: Failed to pull SCA certs from https://api.openshift.com/api/accounts_mgmt/v1/certificates: OCM API https://api.openshift.com/api/accounts_mgmt/v1/certificates returned HTTP 401: REDACTED
level=info msg=Cluster operator insights ClusterTransferAvailable is False with Unauthorized: failed to pull cluster transfer: OCM API https://api.openshift.com/api/accounts_mgmt/v1/cluster_transfers/?search=cluster_uuid+is+%27REDACTED%27+and+status+is+%27accepted%27 returned HTTP 401: REDACTED
level=info msg=Cluster operator insights Disabled is False with AsExpected:
level=info msg=Cluster operator insights RemoteConfigurationAvailable is False with HttpStatus401: received HTTP 401 Unauthorized from https://console.redhat.com/api/gathering/v2/%s/gathering_rules
level=info msg=Cluster operator insights RemoteConfigurationValid is Unknown with NoValidationYet:
level=info msg=Cluster operator insights SCAAvailable is False with Unauthorized: Failed to pull SCA certs from https://api.openshift.com/api/accounts_mgmt/v1/certificates: OCM API https://api.openshift.com/api/accounts_mgmt/v1/certificates returned HTTP 401: REDACTED
level=info msg=Cluster operator insights UploadDegraded is True with NotAuthorized: Reporting was not allowed: your Red Hat account is not enabled for remote support or your token has expired: {\"errors\":[{\"meta\":{\"response_by\":\"gateway\"},\"detail\":\"UHC services authentication failed\",\"status\":401}]}
Seen in 4.17 RCs. Also in this comment.
Unknown
Unknown.
ClusterOperator conditions talking about https://console.redhat.com/api/gathering/v2/%s/gathering_rules
URIs we expose in customer-oriented messaging should not contain %s placeholders.
Seems like the template is coming in as conditionalGathererEndpoint here. Seems like insights-operator#964 introduced the %s, but I'm not finding the logic that's supposed to populate that placeholder.
Rapid recommendations enhancement defines this built-in configuration when the operator cannot reach the remote endpoint.
The issue is that the built-in configuration (though currently empty) is not taken into account, i.e. the data requested in the built-in configuration is not gathered.
Goal:
Track Insights Operator Data Enhancements epic in 2024
INSIGHTOCP-1557 is a rule to check for any custom Prometheus instances that may impact the management of corresponding resources.
Resource to gather: Prometheus and Alertmanager in all namespaces
apiVersion: monitoring.coreos.com/v1
kind: Prometheus

apiVersion: monitoring.coreos.com/v1
kind: Alertmanager
Backport: OCP 4.12.z; 4.13.z; 4.14.z; 4.15.z
Additional info:
1) Get the Prometheus and Alertmanager in all namespaces
$ oc get prometheus -A
NAMESPACE              NAME                VERSION   DESIRED   READY   RECONCILED   AVAILABLE   AGE
openshift-monitoring   k8s                 2.39.1    2         1       True         Degraded    712d
test                   custom-prometheus             1         0       True         False       25d

$ oc get alertmanager -A
NAMESPACE              NAME                  VERSION   DESIRED   READY   RECONCILED   AVAILABLE   AGE
openshift-monitoring   main                  2.39.1    2         1       True         Degraded    712d
test                   custom-alertmanager             1         0       True         False       25d
Business requirement:
We had a recommendation that checks the certificate of the default ingress controller for expiration, but only after it has already expired. From the referenced KCS, it seems that many customers (hundreds) hit this issue, so Oscar Arribas Arribas suggests adding a recommendation that alerts customers before certificate expiration.
Gathering method:
1. Gather all the ingresscontroller objects (we already gather the default ingresscontroller) with commands:
oc get ingresscontrollers -n openshift-ingress-operator
2. Gather the operator auto-generated certificate's validity dates with commands:
$ oc get ingresscontrollers -n openshift-ingress-operator -o yaml | grep -A1 defaultCertificate
#### empty output here when certificate created by the operator

$ oc get secret router-ca -n openshift-ingress-operator -o yaml | grep crt | awk '{print $2}' | base64 -d | openssl x509 -noout -dates
notBefore=Dec 28 00:00:00 2022 GMT
notAfter=Jan 22 23:59:59 2024 GMT

$ oc get secret router-certs-default -n openshift-ingress -o yaml | grep crt | awk '{print $2}' | base64 -d | openssl x509 -noout -dates
notBefore=Dec 28 00:00:00 2022 GMT
notAfter=Jan 22 23:59:59 2024 GMT
3. Gather custom certificates' validity dates with commands:
$ oc get ingresscontrollers -n openshift-ingress-operator -o yaml | grep -A1 defaultCertificate
  defaultCertificate:
    name: [custom-cert-secret-1]

#### for each [custom-cert-secret] above
$ oc get secret [custom-cert-secret-1] -n openshift-ingress -o yaml | grep crt | awk '{print $2}' | base64 -d | openssl x509 -noout -dates
notBefore=Dec 28 00:00:00 2022 GMT
notAfter=Jan 22 23:59:59 2024 GMT
Other Information:
An RFE to create a cluster alert is under review: https://issues.redhat.com/browse/RFE-4269
As a result of Hashicorp's license change to BSL, Red Hat OpenShift needs to remove the use of Hashicorp's Terraform from the installer – specifically for IPI deployments which currently use Terraform for setting up the infrastructure.
To avoid an increased support overhead once the license changes at the end of the year, we want to provision PowerVS infrastructure without the use of Terraform.
Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.
Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.
High-level list of items that are out of scope. Initial completion during Refinement status.
Provide any additional context that is needed to frame the feature. Initial completion during Refinement status.
Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.
Provide information that needs to be considered and planned so that documentation will meet customer needs. If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.
Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.
OCP/Telco Definition of Done
Epic Template descriptions and documentation.
Customer has escalated the following issues where ports don't have TLS support. This Feature request lists all the component ports raised by the customer.
Details here https://docs.google.com/document/d/1zB9vUGB83xlQnoM-ToLUEBtEGszQrC7u-hmhCnrhuXM/edit
Currently, we are serving the metrics over plain HTTP on port 9537; we need to upgrade the endpoint to use TLS.
Related to https://docs.google.com/document/d/1zB9vUGB83xlQnoM-ToLUEBtEGszQrC7u-hmhCnrhuXM/edit
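Once delivered, a quick probe from a node could confirm the endpoint answers over TLS (a sketch; the node name is a placeholder and port 9537 is taken from the description above):

$ oc debug node/<node-name> -- chroot /host curl -sk https://localhost:9537/metrics | head
# today the endpoint is served over plain HTTP, per the description above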
GA support for a generic interface for administrators to define custom reboot/drain suppression rules.
Follow-up epic to https://issues.redhat.com/browse/MCO-507, aiming to graduate the feature from Tech Preview and GA the functionality.
For tech preview we only allow files/units/etc. There are two potential use cases for directories:
This would allow the MCO to apply the policy to anything under a given path. We should adapt the API and MCO logic to also allow paths, as sketched below.
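As a sketch of what the extended API might look like, reusing the existing node disruption policy shape on the MachineConfiguration object (the directory-style path is the hypothetical addition this epic proposes):

apiVersion: operator.openshift.io/v1
kind: MachineConfiguration
metadata:
  name: cluster
spec:
  nodeDisruptionPolicy:
    files:
      - path: /etc/example-app/    # hypothetical: a directory rather than a single file
        actions:
          - type: None             # suppress reboot/drain for anything under this path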
Add e2e tests that fit the guidelines mentioned in the openshift/api docs https://github.com/openshift/api/blob/master/README.md#defining-featuregate-e2e-tests
As a result of Hashicorp's license change to BSL, Red Hat OpenShift needs to remove the use of Hashicorp's Terraform from the installer – specifically for IPI deployments which currently use Terraform for setting up the infrastructure.
To avoid an increased support overhead once the license changes at the end of the year, we want to provision OpenShift on the existing supported providers' infrastructure without the use of Terraform.
This feature will be used to track all the CAPI preparation work that is common for all the supported providers
Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.
Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.
High-level list of items that are out of scope. Initial completion during Refinement status.
Provide any additional context that is needed to frame the feature. Initial completion during Refinement status.
Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.
Provide information that needs to be considered and planned so that documentation will meet customer needs. If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.
Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.
This feature is about providing workloads within an HCP KubeVirt cluster access to GPU devices. This is an important use case that expands usage of HCP KubeVirt to AI and ML workloads.
GOAL:
Support running workloads within HCP KubeVirt clusters which need access to GPUs.
Accomplishing this involves multiple efforts
Diagram of multiple nvidia operator layers
https://docs.google.com/document/d/1HwXVL_r9tUUwqDct8pl7Zz4bhSRBidwvWX54xqXaBwk/edit
1. Design and implement an API at the NodePool (platform.kubevirt) that will allow exposing GPU passthrough or vGPU slicing from the infra cluster to the guest cluster.
2. Implement logic that sets up the GPU resources to be available to the guest cluster's workloads (by using nvidia-gpu-operator?)
Currently the maximum number of snapshots per volume in vSphere CSI is set to 3 and cannot be configured. Customers find this default limit too low and are asking us to make this setting configurable.
The maximum number of snapshots is 32 per volume.
Customers can override the default (three) value and set it to a custom value.
Make sure we document (or link) the VMware recommendations in terms of performance.
https://kb.vmware.com/s/article/1025279
The setting can be easily configurable by the OCP admin and the configuration is automatically updated. Test that the setting is indeed applied and the maximum number of snapshots per volume is indeed changed.
No change in the default
As an OCP admin I would like to change the maximum number of snapshots per volume.
Anything outside of
The default value can't be overwritten, reconciliation prevents it.
Make sure the customers understand the impact of increasing the number of snapshots per volume.
https://kb.vmware.com/s/article/1025279
Document how to change the value as well as a link to the best practice. Mention that there is a hard limit of 32. Document other limitations if any.
N/A
Epic Goal*
The goal of this epic is to allow admins to configure the maximum number of snapshots per volume in vSphere CSI and find a way to add such an extension to the OCP API.
Possible future candidates:
Why is this important? (mandatory)
Currently the maximum number of snapshots per volume in vSphere CSI is set to 3 and cannot be configured. Customers find this default limit too low and are asking us to make this setting configurable.
The maximum number of snapshots is 32 per volume.
https://kb.vmware.com/s/article/1025279
Scenarios (mandatory)
Provide details for user scenarios including actions to be performed, platform specifications, and user personas.
Dependencies (internal and external) (mandatory)
1) Write OpenShift enhancement (STOR-1759)
2) Extend ClusterCSIDriver API (TechPreview) (STOR-1803)
3) Update vSphere operator to use the new snapshot options (STOR-1804)
4) Promote feature from Tech Preview to Accessible-by-default (STOR-1839)
Contributing Teams(and contacts) (mandatory)
Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.
Acceptance Criteria (optional)
Configure the maximum number of snapshots to a higher value. Check the config has been updated and verify that the maximum number of snapshots per volume maps to the new setting value.
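A minimal sketch of such a configuration, assuming the extension is exposed as a vSphere driver option on the ClusterCSIDriver object (field name illustrative):

$ oc patch clustercsidriver csi.vsphere.vmware.com --type=merge \
    -p '{"spec":{"driverConfig":{"driverType":"vSphere","vSphere":{"globalMaxSnapshotsPerBlockVolume":10}}}}'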
Drawbacks or Risk (optional)
Setting this config option to a high value can introduce performance issues. This needs to be documented.
https://kb.vmware.com/s/article/1025279
Done - Checklist (mandatory)
The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.
The etcd CA must be rotatable both on demand and automatically when expiry approaches.
Requirements (aka. Acceptance Criteria):
Deliver rotation and recovery requirements from OCPSTRAT-714
Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed. Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.
Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.
<your text here>
Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.
<your text here>
High-level list of items that are out of scope. Initial completion during Refinement status.
<your text here>
Provide any additional context that is needed to frame the feature. Initial completion during Refinement status.
<your text here>
Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.
<your text here>
Provide information that needs to be considered and planned so that documentation will meet customer needs. If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.
<your text here>
Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.
<your text here>
As cluster admin I would like to configure machinesets to allocate instances from pre-existing Capacity Reservation in Azure.
I want to create a pool of reserved resources that can be shared between clusters of different teams based on their priorities. I want this pool of resources to remain available for my company and not get allocated to another Azure customer.
Additional background on the feature for considering additional use cases
Machine API support for Azure Capacity Reservation Groups
The customer would like to configure machinesets to allocate instances from pre-existing Capacity Reservation Groups, see Azure docs below
This would allow the customer to create a pool of reserved resources which can be shared between clusters of different priorities. Imagine a test and prod cluster where the demands of the prod cluster suddenly grow. The test cluster is scaled down freeing resources and the prod cluster is scaled up with assurances that those resources remain available, not allocated to another Azure customer.
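A sketch of how a MachineSet's providerSpec might reference a pre-existing reservation group (the resource ID is a placeholder; the field mirrors the CapacityReservationGroupID API addition tracked below):

providerSpec:
  value:
    apiVersion: machine.openshift.io/v1beta1
    kind: AzureMachineProviderSpec
    # placeholder ARM resource ID of an existing capacity reservation group
    capacityReservationGroupID: /subscriptions/<subscription-id>/resourceGroups/<resource-group>/providers/Microsoft.Compute/capacityReservationGroups/<group-name>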
MAPI/CAPI Azure
In this use case, there's no immediate need for install time support to designate reserved capacity group for control plane resources, however we should consider whether that's desirable from a completeness standpoint. We should also consider whether or not this should be added as an attribute for the installconfig compute machinepool or whether altering generated MachineSet manifests is sufficient, this appears to be a relatively new Azure feature which may or may not see wider customer demand. This customer's primary use case is centered around scaling up and down existing clusters, however others may have different uses for this feature.
Additional background on the feature for considering additional use cases
As a developer I want to add the field "CapacityReservationGroupID" to "AzureMachineProviderSpec" in openshift/api so that Azure capacity reservation can be supported.
CFE-1036 adds the support of Capacity Reservation in upstream CAPZ (PR). The same support needs to be added downstream as well. Please refer to the upstream PR for adding support downstream.
Slack discussion regarding the same: https://redhat-internal.slack.com/archives/CBZHF4DHC/p1713249202780119?thread_ts=1712582367.529309&cid=CBZHF4DHC_
Update the vendored dependencies in the cluster-control-plane-machine-set-operator repository for the capacity reservation changes.
As a developer I want to add support of capacity reservation group in openshift/machine-api-provider-azure so that azure VMs can be associated to a capacity reservation group during the VM creation.
CFE-1036 adds the support of Capacity Reservation in upstream CAPZ (PR). The same support needs to be added downstream as well. Please refer to the upstream PR for adding support downstream.
As a developer I want to add the webhook validation for the "CapacityReservationGroupID" field of "AzureMachineProviderSpec" in openshift/machine-api-operator so that Azure capacity reservation can be supported.
CFE-1036 adds the support of Capacity Reservation in upstream CAPZ (PR). The same support needs to be added downstream as well. Please refer to the upstream PR for adding support downstream.
Slack discussion regarding the same: https://redhat-internal.slack.com/archives/CBZHF4DHC/p1713249202780119?thread_ts=1712582367.529309&cid=CBZHF4DHC_
Add support for standalone secondary networks for HCP kubevirt.
Advanced multus integration involves the following scenarios
1. Secondary network as single interface for VM
2. Multiple Secondary Networks as multiple interfaces for VM
Users of HCP KubeVirt should be able to create a guest cluster that is completely isolated on a secondary network outside of the default pod network.
A list of specific needs or objectives that a feature must deliver in order to be considered complete. Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc. Initial completion during Refinement status.
<enter general Feature acceptance here>
Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed. Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.
Deployment considerations | List applicable specific needs (N/A = not applicable) |
Self-managed, managed, or both | self-managed |
Classic (standalone cluster) | na |
Hosted control planes | yes |
Multi node, Compact (three node), or Single node (SNO), or all | na |
Connected / Restricted Network | yes |
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) | x86 |
Operator compatibility | na |
Backport needed (list applicable versions) | na |
UI need (e.g. OpenShift Console, dynamic plugin, OCM) | na |
Other (please specify) | na |
ACM documentation should include how to configure secondary standalone networks.
This is a continuation of CNV-33392.
Multus Integration for HCP KubeVirt has three scenarios.
1. Secondary network as single interface for VM
2. Multiple Secondary Networks as multiple interfaces for VM
3. Secondary network + pod network (default for kubelet) as multiple interfaces for VM
Item 3 is the simplest use case because it does not require any additional considerations for ingress and load balancing. This scenario [item 3] is covered by CNV-33392.
Items [1,2] are what this epic is tracking, which we are considering advanced use cases.
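A sketch of how this might surface on a NodePool, assuming the kubevirt platform exposes an additionalNetworks list and an attachDefaultNetwork toggle (field names are assumptions drawn from this epic, not a confirmed API):

apiVersion: hypershift.openshift.io/v1beta1
kind: NodePool
metadata:
  name: example-nodepool
  namespace: clusters
spec:
  platform:
    type: KubeVirt
    kubevirt:
      attachDefaultNetwork: false        # assumed field: skip the default pod network (scenarios 1 and 2)
      additionalNetworks:                # assumed field: NADs providing the standalone secondary network(s)
        - name: vm-workloads/tenant-blue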
Develop hypershift e2e test that exercises attaching a secondary network to a HCP KubeVirt nodepool
The default OpenShift installation on AWS uses multiple IPv4 public IPs which Amazon will start charging for starting in February 2024. As a result, there is a requirement to find an alternative path for OpenShift to reduce the overall cost of a public cluster while this is deployed on AWS public cloud.
Provide an alternative path to reduce the new costs associated with public IPv4 addresses when deploying OpenShift on AWS public cloud.
There is a new path for "external" OpenShift deployments on AWS public cloud where the new costs associated with public IPv4 addresses have a minimum impact on the total cost of the required infrastructure on AWS.
Ongoing discussions on this topic are happening in Slack in the #wg-aws-ipv4-cost-mitigation private channel
Usual documentation will be required in case there are any new user-facing options available as a result of this feature.
Resources which consume public IPv4 addresses: bootstrap node, public API NLB, NAT gateways
OCP 4 clusters still maintain pinned boot images. We have numerous clusters installed that have boot media pinned to first boot images as early as 4.1. In the future these boot images may not be certified by the OEM and may fail to boot on updated datacenter or cloud hardware platforms. These "pinned" boot images should be updateable so that customers can avoid this problem and better still scale out nodes with boot media that matches the running cluster version.
Phase 1 provided Tech Preview support for GCP.
In phase 2, GCP support goes to GA. Support for other IPI footprints is new and tech preview.
This would involve updating the AMIs stored within the AWSProviderConfig (encapsulated within a MachineSet). The "updated" AMI values should be available in the golden configmap.
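For reference, a minimal sketch of opting MachineSets into managed boot image updates through the MachineConfiguration operator object, mirroring the shape used for the GCP tech preview (exact fields may differ for AWS):

apiVersion: operator.openshift.io/v1
kind: MachineConfiguration
metadata:
  name: cluster
spec:
  managedBootImages:
    machineManagers:
      - resource: machinesets
        apiGroup: machine.openshift.io
        selection:
          mode: All    # let the MCO keep boot images (AMIs on AWS) in sync for all MachineSets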
This involves creating a new feature gate for AWS boot image updates in openshift/api.
This will pick up stories left off from the initial Tech Preview(Phase 1): https://issues.redhat.com/browse/MCO-589
Done when:
PR to API to remove featuregate
Alert docs team
Currently errors are propagated via a prometheus alert. Before GA, we will need to make sure that we are placing a condition on the configuration object in addition to the current Prometheus mechanism. This will be done by the MSBIC, but it should be mindful as to not stomp on the operator, which updates the MachineConfiguration Status as well.
We currently have a kube version string used in the call to set up envtest. We should either get rid of this reference and grab it from elsewhere or update it with every kube bump we do.
In addition, the setup call now requires an additional argument to account for openshift/api's kubebuilder divergence. So the value being used here may not be valid for every kube bump, as the archive is not generated for every kube version. (Doing a bootstrap test run should be able to suss this out; if it doesn't error with the new version you should be OK.)
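For context, a sketch of checking that a given kube version has a published envtest archive before bumping the pinned version (setup-envtest is the upstream controller-runtime tool; the version string is illustrative):

# list the envtest binary archives that are actually published (not every kube version has one)
$ go run sigs.k8s.io/controller-runtime/tools/setup-envtest@latest list
# fetch a specific version into a local bin dir
$ go run sigs.k8s.io/controller-runtime/tools/setup-envtest@latest use 1.30.x --bin-dir /tmp/envtest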
This was a temporary change and should be reverted. This was done because setup-envtest did a go bump that the MCO will probably not do in time for 4.16.
PR that pinned the tag: https://github.com/openshift/machine-config-operator/pull/4280
Slack thread: https://redhat-internal.slack.com/archives/GH7G2MANS/p1711372261617039?thread_ts=1711372068.123039&cid=GH7G2MANS
Please review the following PR: https://github.com/openshift/machine-config-operator/pull/4380
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Epic Goal*
Drive the technical part of the Kubernetes 1.30 upgrade, including rebasing the openshift/kubernetes repository and coordination across the OpenShift organization to get e2e tests green for the OCP release.
Why is this important? (mandatory)
OpenShift 4.17 cannot be released without Kubernetes 1.30
Scenarios (mandatory)
Dependencies (internal and external) (mandatory)
What items must be delivered by other teams/groups to enable delivery of this epic.
Contributing Teams(and contacts) (mandatory)
Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.
Acceptance Criteria (optional)
Provide some (testable) examples of how we will know if we have achieved the epic goal.
Drawbacks or Risk (optional)
Reasons we should consider NOT doing this such as: limited audience for the feature, feature will be superseded by other work that is planned, resulting feature will introduce substantial administrative complexity or user confusion, etc.
Done - Checklist (mandatory)
The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.
PRs:
Networking Definition of Planned
Epic Template descriptions and documentation
Additional information on each of the above items can be found here: Networking Definition of Planned
Networking Definition of Planned
Epic Template descriptions and documentation
make sure we deliver a 1.30 kube-proxy standalone image
Additional information on each of the above items can be found here: Networking Definition of Planned
Networking Definition of Planned
Epic Template descriptions and documentation
Bump kube to 1.30 in CNCC
Additional information on each of the above items can be found here: Networking Definition of Planned
OCP/Telco Definition of Done
Epic Template descriptions and documentation.
To align with the 4.17 release, dependencies need to be updated to 1.30. This should be done by rebasing/updating as appropriate for the repository
To align with the 4.17 release, dependencies need to be updated to 1.30. This should be done by rebasing/updating as appropriate for the repository
To align with the 4.17 release, dependencies need to be updated to 1.30. This should be done by rebasing/updating as appropriate for the repository
To align with the 4.17 release, dependencies need to be updated to 1.30. This should be done by rebasing/updating as appropriate for the repository
To align with the 4.17 release, dependencies need to be updated to 1.30. This should be done by rebasing/updating as appropriate for the repository
To align with the 4.17 release, dependencies need to be updated to 1.30. This should be done by rebasing/updating as appropriate for the repository
To align with the 4.17 release, dependencies need to be updated to 1.30. This should be done by rebasing/updating as appropriate for the repository
To align with the 4.17 release, dependencies need to be updated to 1.30. This should be done by rebasing/updating as appropriate for the repository
To align with the 4.17 release, dependencies need to be updated to 1.30. This should be done by rebasing/updating as appropriate for the repository
To align with the 4.17 release, dependencies need to be updated to 1.30. This should be done by rebasing/updating as appropriate for the repository
To align with the 4.17 release, dependencies need to be updated to 1.30. This should be done by rebasing/updating as appropriate for the repository
To align with the 4.17 release, dependencies need to be updated to 1.30. This should be done by rebasing/updating as appropriate for the repository
To align with the 4.17 release, dependencies need to be updated to 1.30. This should be done by rebasing/updating as appropriate for the repository
To align with the 4.17 release, dependencies need to be updated to 1.30. This should be done by rebasing/updating as appropriate for the repository
To align with the 4.17 release, dependencies need to be updated to 1.30. This should be done by rebasing/updating as appropriate for the repository
To align with the 4.17 release, dependencies need to be updated to 1.30. This should be done by rebasing/updating as appropriate for the repository
To align with the 4.17 release, dependencies need to be updated to 1.30. This should be done by rebasing/updating as appropriate for the repository
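These per-repository bumps generally follow the same pattern; a minimal sketch (module versions are illustrative):

# bump the k8s.io modules to the 1.30 line, then refresh the vendor tree
go get k8s.io/api@v0.30.2 k8s.io/apimachinery@v0.30.2 k8s.io/client-go@v0.30.2
go mod tidy
go mod vendor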
Upgrade all OpenShift and Kubernetes components that the cloud-credential-operator uses to v1.30, which keeps it on par with the rest of the OpenShift components and the underlying cluster version.
OCP/Telco Definition of Done
Epic Template descriptions and documentation.
To align with the 4.17 release, dependencies need to be updated to 1.29. This should be done by rebasing/updating as appropriate for the repository
To align with the 4.17 release, dependencies need to be updated to 1.29. This should be done by rebasing/updating as appropriate for the repository
To align with the 4.17 release, dependencies need to be updated to 1.29. This should be done by rebasing/updating as appropriate for the repository
To align with the 4.17 release, dependencies need to be updated to 1.29. This should be done by rebasing/updating as appropriate for the repository
To align with the 4.17 release, dependencies need to be updated to 1.29. This should be done by rebasing/updating as appropriate for the repository
To align with the 4.17 release, dependencies need to be updated to 1.29. This should be done by rebasing/updating as appropriate for the repository
To align with the 4.17 release, dependencies need to be updated to 1.29. This should be done by rebasing/updating as appropriate for the repository
Other teams:
If needed this card can be broken down into more cards with sublists, each card assigned to a different assignee.
While installing OpenShift on AWS, add support for using existing IAM instance profiles.
Allow a user to use an existing IAM instance profile while deploying OpenShift on AWS.
When using an existing IAM role, the Installer tries to create a new IAM instance profile. As of today, the installation will fail if the user does not have permission to create instance profiles.
The Installer will provide an option for the user to use an existing IAM instance profile instead of trying to create a new one if one is provided.
This work is important not only for self-manage customers who want to reduce the required permissions needed for the IAM accounts but also for the IC regions and ROSA customers.
https://github.com/dmc5179/installer/commit/8699caa952d4a9ce5012cca3f86aeca70c499db4
The Installer will provide an option to the user to use an existing IAM instance profile instead of trying to create a new one if this is provided.
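A sketch of how such an option might be expressed in the install-config machine pools (the iamProfile field name is an assumption used for illustration only):

apiVersion: v1
baseDomain: example.com
controlPlane:
  name: master
  platform:
    aws:
      iamProfile: existing-master-instance-profile   # assumed field: name of a pre-created IAM instance profile
compute:
  - name: worker
    platform:
      aws:
        iamProfile: existing-worker-instance-profile # assumed field
platform:
  aws:
    region: us-east-1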
As a (user persona), I want to be able to:
so that I can achieve
Description of criteria:
Detail about what is specifically not being delivered in the story
This requires/does not require a design proposal.
This requires/does not require a feature gate.
Graduate the etcd tuning profiles feature delivered in https://issues.redhat.com/browse/ETCD-456 to GA
Remove the feature gate flag and make the feature accessible to all customers.
Requires fixes to apiserver to handle etcd client retries correctly
Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed. Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.
Deployment considerations | List applicable specific needs (N/A = not applicable) |
Self-managed, managed, or both | yes |
Classic (standalone cluster) | yes |
Hosted control planes | no |
Multi node, Compact (three node), or Single node (SNO), or all | Multi node and compact clusters |
Connected / Restricted Network | Yes |
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) | Yes |
Operator compatibility | N/A |
Backport needed (list applicable versions) | N/A |
UI need (e.g. OpenShift Console, dynamic plugin, OCM) | N/A |
Other (please specify) | N/A |
Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.
<your text here>
Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.
<your text here>
High-level list of items that are out of scope. Initial completion during Refinement status.
<your text here>
Provide any additional context that is needed to frame the feature. Initial completion during Refinement status.
<your text here>
Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.
<your text here>
Provide information that needs to be considered and planned so that documentation will meet customer needs. If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.
<your text here>
Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.
<your text here>
Epic Goal*
Graduate the etcd tuning profiles feature delivered in https://issues.redhat.com/browse/ETCD-456 to GA
https://github.com/openshift/api/pull/1538
https://github.com/openshift/enhancements/pull/1447
Why is this important? (mandatory)
Graduating the feature to GA makes it accessible to all customers and not hidden behind a feature gate.
As further outlined in the linked stories the major roadblock for this feature to GA is to ensure that the API server has the necessary capability to configure its etcd client for longer retries on platforms with slower latency profiles. See: https://issues.redhat.com/browse/OCPBUGS-18149
Scenarios (mandatory)
Provide details for user scenarios including actions to be performed, platform specifications, and user personas.
Dependencies (internal and external) (mandatory)
What items must be delivered by other teams/groups to enable delivery of this epic.
Contributing Teams(and contacts) (mandatory)
Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.
Acceptance Criteria (optional)
Once the cluster is installed, we should be able to change the default latency profile on the API to a slower one and verify that etcd is rolled out with the updated leader election and heartbeat timeouts. During this rollout there should be no disruption or unavailability to the control-plane.
Drawbacks or Risk (optional)
Reasons we should consider NOT doing this such as: limited audience for the feature, feature will be superseded by other work that is planned, resulting feature will introduce substantial administrative complexity or user confusion, etc.
Done - Checklist (mandatory)
The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.
Once https://issues.redhat.com/browse/ETCD-473 is done this story will track the work required to move the "operator/v1 etcd spec.hardwareSpeed" field from behind the feature gate to GA.
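A sketch of the GA flow, assuming the field is exposed as spec.controlPlaneHardwareSpeed on the operator/v1 etcd resource (referred to as hardwareSpeed in the story above):

$ oc patch etcd/cluster --type=merge -p '{"spec":{"controlPlaneHardwareSpeed":"Slower"}}'
# watch the etcd pods roll out with the updated leader election and heartbeat timeouts
$ oc get pods -n openshift-etcd -w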
Note: There is no work pending from the OTA team. This Jira tracks the work pending from other teams.
We started the feature with the assumption that the CVO has to implement sigstore key verification like we do with GPG keys.
After investigation we found that sigstore key verification is done at the node level and there is no CVO work. From that point this feature became a tracking feature for us to help other teams with their "sigstore key verification" tasks, specifically the Node team. The "sigstore key verification" roadmap is here: https://docs.google.com/presentation/d/16dDwALKxT4IJm7kbEU4ALlQ4GBJi14OXDNP6_O2F-No/edit#slide=id.g547716335e_0_2075
Add sigstore signatures to the core OCP payload and enable verification. Verification is now done via CRI-O.
There is no CVO work in this feature and this is a Tech Preview change.
OpenShift Release Engineering can leverage a mature signing and signature verification stack instead of relying on simple signing
Customers can leverage OpenShift to create trust relationships for running OCP core container images
Specifically, customers can trust signed images from a Red Hat registry and OCP can verify those signatures
A list of specific needs or objectives that a feature must deliver in order to be considered complete. Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc. Initial completion during Refinement status.
<enter general Feature acceptance here>
– Kubelet/CRIO to verify RH images & release payload sigstore signatures
– ART will add sigstore signatures to core OCP images
Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed. Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.
These acceptance criteria are for all deployment flavors of OpenShift.
Deployment considerations | List applicable specific needs (N/A = not applicable) |
Self-managed, managed, or both | both |
Classic (standalone cluster) | yes |
Hosted control planes | yes |
Multi node, Compact (three node), or Single node (SNO), or all | |
Connected / Restricted Network | |
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) | |
Operator compatibility | |
Backport needed (list applicable versions) | Not Applicable |
UI need (e.g. OpenShift Console, dynamic plugin, OCM) | none |
Other (please specify) | |
Add documentation for sigstore verification and gpg verification
For folks mirroring release images (e.g. disconnected/restricted-network):
OCP clusters need to add the ability to validate Sigstore signatures for OpenShift release images.
This is part of Red Hat's overall Sigstore strategy.
Today, Red Hat uses "simple signing" which uses an OpenPGP/GPG key and a separate file server to host signatures for container images.
Cosign is on track to be an industry standard container signing technique. The main difference is that, instead of signatures being stored in a separate file server, the signature is stored in the same registry that hosts the image.
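For example, with Cosign-style signatures the signature can be fetched and verified directly from the registry (a sketch, assuming the release public key is saved locally as ocp-release.pub and using a placeholder digest):

$ cosign verify --key ocp-release.pub \
    quay.io/openshift-release-dev/ocp-release@sha256:<digest>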
Design document / discussion from software production: https://docs.google.com/document/d/1EPCHL0cLFunBYBzjBPcaYd-zuox1ftXM04aO6dZJvIE/edit
Demo video: https://drive.google.com/file/d/1bpccVLcVg5YgoWnolQxPu8gXSxoNpUuQ/view
Software production will be migrating to the cosign over the course of 2024.
ART will continue to sign using simple signing in combination with sigstore signatures until SP stops using it and product documentation exists to help customers migrate from the simple signing signature verification.
Currently this epic is primarily supporting the Node implementation work in OCPNODE-2231. There's a minor CVO UX tweak planned in OTA-1307 that's definitely OTA work. There's also the enhancement proposal in OTA-1294 and the cluster-update-keys in OTA-1304, which Trevor happens to be doing for inertial reasons, but which he's happy to hand off to OCPNODE and/or shift under OCPNODE-2231.
We are seeing various MCO issues across different platforms and repos on TechPreview.
@David Joshy found that it might be https://github.com/openshift/cluster-update-keys/pull/58
@jerzhang found diffs in sigstore-registries and policy.json. Which may be coming from this manifest. Is this available during bootstrap?
As described in the OTA-1294 enhancement. The cluster-update-keys repository isn't actually managed by the OTA team, but I expect it will be me opening the pull, and there isn't a dedicated Jira project covering cluster-update-keys, so I'm creating this ticket under the OTA Epic just because I can't think of a better place to put it.
OCP/Telco Definition of Done
Epic Template descriptions and documentation.
The goal of this EPIC is to either ship a cluster wide policy (not enabled by default) to verify OpenShift release/payload images or document how end users can create their own policy to verify them.
We shipped cluster wide policy support in OCPNODE-1628 which should be used for internal components as well.
As Miloslav Trmač reported upstream, when a ClusterImagePolicy is set on a scope to accept sigstore signatures, the underlying registry needs to be configured with use-sigstore-attachments: true. The current code:
func generateSigstoreRegistriesdConfig(clusterScopePolicies map[string]signature.PolicyRequirements) ([]byte, error) {
does do that for the configured scope; but the use-sigstore-attachments option applies not to the "logical name", but to each underlying mirror individually.
I.e. the option needs to be on every mirror of the scope. Without that, if the image is found on one of such mirrors, the c/image code will not be looking for signatures on the mirror, and policy enforcement is likely to fail.
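In registries.d terms, the generated drop-in needs an entry per mirror host, not just for the logical scope; a sketch (the mirror hostname is illustrative):

docker:
  quay.io/openshift-release-dev/ocp-release:
    use-sigstore-attachments: true
  # additionally required for every configured mirror of that scope
  mirror.example.com/openshift-release-dev/ocp-release:
    use-sigstore-attachments: true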
Seen in 4.17.0-0.nightly-2024-06-25-162526, but likely all releases which implement ClusterImagePolicy so far, because this is unlikely to be a regression.
Every time.
Apply the ClusterImagePolicy suggested in OTA-1294's enhancements#1633:
$ cat <<EOF >policy.yaml
apiVersion: config.openshift.io/v1alpha1
kind: ClusterImagePolicy
metadata:
  name: openshift
  annotations:
    kubernetes.io/description: Require Red Hat signatures for quay.io/openshift-release-dev/ocp-release container images.
    exclude.release.openshift.io/internal-openshift-hosted: "true"
    include.release.openshift.io/self-managed-high-availability: "true"
    release.openshift.io/feature-set: TechPreviewNoUpgrade
spec:
  scopes:
  - quay.io/openshift-release-dev/ocp-release
  policy:
    rootOfTrust:
      policyType: PublicKey
      publicKey:
        keyData: LS0tLS1CRUdJTiBQVUJMSUMgS0VZLS0tLS0KTUlJQ0lqQU5CZ2txaGtpRzl3MEJBUUVGQUFPQ0FnOEFNSUlDQ2dLQ0FnRUEzQzJlVGdJQUo3aGxveDdDSCtIcE1qdDEvbW5lYXcyejlHdE9NUmlSaEgya09ZalRadGVLSEtnWUJHcGViajRBcUpWYnVRaWJYZTZKYVFHQUFER0VOZXozTldsVXpCby9FUUEwaXJDRnN6dlhVbTE2cWFZMG8zOUZpbWpsVVovaG1VNVljSHhxMzR2OTh4bGtRbUVxekowR0VJMzNtWTFMbWFEM3ZhYmd3WWcwb3lzSTk1Z1V1Tk81TmdZUHA4WDREaFNoSmtyVEl5dDJLTEhYWW5BMExzOEJlbG9PWVJlTnJhZmxKRHNzaE5VRFh4MDJhQVZSd2RjMXhJUDArRTlZaTY1ZE4zKzlReVhEOUZ6K3MrTDNjZzh3bDdZd3ZZb1Z2NDhndklmTHlJbjJUaHY2Uzk2R0V6bXBoazRjWDBIeitnUkdocWpyajU4U2hSZzlteitrcnVhR0VuVGcyS3BWR0gzd3I4Z09UdUFZMmtqMnY1YWhnZWt4V1pFN05vazNiNTBKNEpnYXlpSnVSL2R0cmFQMWVMMjlFMG52akdsMXptUXlGNlZnNGdIVXYwaktrcnJ2QUQ4c1dNY2NBS00zbXNXU01uRVpOTnljTTRITlNobGNReG5xU1lFSXR6MGZjajdYamtKbnAxME51Z2lVWlNLeVNXOHc0R3hTaFNraGRGbzByRDlkVElRZkJoeS91ZHRQWUkrK2VoK243QTV2UVV4Wk5BTmZqOUhRbC81Z3lFbFV6TTJOekJ2RHpHellSNVdVZEVEaDlJQ1I4ZlFpMVIxNUtZU0h2Tlc3RW5ucDdZT2d5dmtoSkdwRU5PQkF3c1pLMUhhMkJZYXZMMk05NDJzSkhxOUQ1eEsrZyszQU81eXp6V2NqaUFDMWU4RURPcUVpY01Ud05LOENBd0VBQVE9PQotLS0tLUVORCBQVUJMSUMgS0VZLS0tLS0K
EOF
$ oc apply -f policy.yaml
Set up an ImageContentSourcePolicy such as the ones Cluster Bot jobs have by default:
$ cat <<EOF >mirror.yaml
apiVersion: operator.openshift.io/v1alpha1
kind: ImageContentSourcePolicy
metadata:
  name: pull-through-mirror
spec:
  repositoryDigestMirrors:
  - mirrors:
    - quayio-pull-through-cache-us-west-2-ci.apps.ci.l2s4.p1.openshiftapps.com
    source: quay.io
EOF
$ oc apply -f mirror.yaml
Set CRI-O debug logs, following these docs:
$ cat <<EOF >custom-loglevel.yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: ContainerRuntimeConfig
metadata:
  name: custom-loglevel
spec:
  machineConfigPoolSelector:
    matchLabels:
      pools.operator.machineconfiguration.openshift.io/master: ''
  containerRuntimeConfig:
    logLevel: debug
EOF
$ oc create -f custom-loglevel.yaml
Wait for that to roll out, as described in docs:
$ oc get machineconfigpool master
Launch a Sigstore-signed quay.io/openshift-release-dev/ocp-release image, by asking the cluster to update to 4.16.1:
$ oc adm upgrade --allow-explicit-upgrade --to-image quay.io/openshift-release-dev/ocp-release@sha256:c17d4489c1b283ee71c76dda559e66a546e16b208a57eb156ef38fb30098903a
Check the debug CRI-O logs:
$ oc adm node-logs --role=master -u crio | grep -i1 sigstore | tail -n5
The logs contain "Not looking for sigstore attachments: disabled by configuration" entries like:
$ oc adm node-logs --role=master -u crio | grep -i1 sigstore | tail -n5
--
Jun 28 19:06:34.317335 ip-10-0-43-59 crio[2154]: time="2024-06-28 19:06:34.317169116Z" level=debug msg=" Using transport \"docker\" specific policy section quay.io/openshift-release-dev/ocp-release" file="signature/policy_eval.go:150"
Jun 28 19:06:34.317335 ip-10-0-43-59 crio[2154]: time="2024-06-28 19:06:34.317207897Z" level=debug msg="Reading /var/lib/containers/sigstore/openshift-release-dev/ocp-release@sha256=c17d4489c1b283ee71c76dda559e66a546e16b208a57eb156ef38fb30098903a/signature-1" file="docker/docker_image_src.go:479"
Jun 28 19:06:34.317335 ip-10-0-43-59 crio[2154]: time="2024-06-28 19:06:34.317240227Z" level=debug msg="Not looking for sigstore attachments: disabled by configuration" file="docker/docker_image_src.go:556"
Jun 28 19:06:34.317335 ip-10-0-43-59 crio[2154]: time="2024-06-28 19:06:34.317277208Z" level=debug msg="Requirement 0: denied, done" file="signature/policy_eval.go:285"
Something about "we're going to look for Sigstore signatures on quayio-pull-through-cache-us-west-2-ci.apps.ci.l2s4.p1.openshiftapps.com, since that's where we found the quay.io/openshift-release-dev/ocp-release@sha256:c17d4489c1b283ee71c76dda559e66a546e16b208a57eb156ef38fb30098903a image". At this point, it doesn't matter whether the retrieved signature is accepted or not, just that a signature lookup is attempted.
Remove the openshift/api ClusterImagePolicy documentation about the restriction of the scope field on the OpenShift release image repository, and install the updated manifest in the MCO.
enhancements#1633 is still in flight, but there seems to be some consensus around its API Extensions proposal to drop the following Godocs from ClusterImagePolicy and ImagePolicy:
// Please be aware that the scopes should not be nested under the repositories of OpenShift Container Platform images.
// If configured, the policies for OpenShift Container Platform repositories will not be in effect.
The backing implementation will also be removed. This guard was initially intended to protect cluster administrators from breaking their clusters by configuring policies that blocked critical images. Before Red Hat was publishing signatures for quay.io/openshift-release-dev/ocp-release releases, that made sense. But now that Red Hat is almost (OTA-1267) publishing Sigstore signatures for those release images, it makes sense to allow policies covering them. And even if a cluster administrator creates a policy that blocks critical image pulls, PodDisruptionBudgets should keep the Kubernetes API server and related core workloads running for long enough for the cluster administrator to use the Kube API to remove or adjust the problematic policy.
There's a possibility that we replace the guard with some kind of pre-rollout validation, but that doesn't have to be part of the initial work.
We want this change in place to unblock testing of enhancements#1633's proposed ClusterImagePolicy, so we can decide whether it works as expected or needs tweaks before being committed as a cluster-update-keys manifest. We want that testing to establish confidence in the approach before we start in on the installer's internalTestingImagePolicy and installer-caller work.
Update the manifest through the MCO script: https://github.com/openshift/machine-config-operator/blob/master/hack/crds-sync.sh
Make sure the ClusterImagePolicy documentation and feature gate in the MCO manifest are up to date with the API-generated CRD.
These seem like duplicates; we should remove ImagePolicy and consolidate around SigstoreImageVerification for clarity.
Enable Hosted Control Planes guest clusters to support up to 500 worker nodes. This enables customers to run clusters with a large number of worker nodes.
Max cluster size 250+ worker nodes (mainly about control plane). See XCMSTRAT-371 for additional information.
Service components should not be overwhelmed by additional customer workloads: they should use larger cloud instances when the worker node count is above the threshold and smaller cloud instances when it is below it.
Deployment considerations | List applicable specific needs (N/A = not applicable) |
Self-managed, managed, or both | Managed |
Classic (standalone cluster) | N/A |
Hosted control planes | Yes |
Multi node, Compact (three node), or Single node (SNO), or all | N/A |
Connected / Restricted Network | Connected |
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) | x86_64 ARM |
Operator compatibility | N/A |
Backport needed (list applicable versions) | N/A |
UI need (e.g. OpenShift Console, dynamic plugin, OCM) | N/A |
Other (please specify) |
Check with OCM and CAPI requirements to expose larger worker node count.
As a service provider, I want to be able to:
so that I can achieve
Description of criteria:
This does not require a design proposal.
This does not require a feature gate.
Description of problem:
In some instances the request serving node autoscaler helper fails to kick off a reconcile when there are pending pods.
Version-Release number of selected component (if applicable):
4.17
How reproducible:
sometimes
Steps to Reproduce:
1. Setup a mgmt cluster with size tagging
2. Create a hosted cluster and get it scheduled
3. Scale down the machinesets of the request serving nodes for the hosted cluster
4. Wait for the hosted cluster to recover
Actual results:
A placeholder pod is created for the missing node of the hosted cluster, but does not cause a scale up of the corresponding machineset.
Expected results:
The hosted cluster recovers by scaling up the corresponding machinesets.
Additional info:
Description of problem:
While running PerfScale tests on staging sectors, the script creates 1 HC per minute to load a Management Cluster up to its maximum capacity (64 HCs). Two clusters tried to use the same serving node pair and got into a deadlock.
# oc get nodes -l osd-fleet-manager.openshift.io/paired-nodes=serving-12
NAME                                        STATUS   ROLES    AGE   VERSION
ip-10-0-4-127.us-east-2.compute.internal    Ready    worker   34m   v1.27.11+d8e449a
ip-10-0-84-196.us-east-2.compute.internal   Ready    worker   34m   v1.27.11+d8e449a

The two nodes got assigned to 2 different clusters:

# oc get nodes -l hypershift.openshift.io/cluster=ocm-staging-2bcimf68iudmq2pctkj11os571ahutr1-mukri-dysn-0017
NAME                                       STATUS   ROLES    AGE   VERSION
ip-10-0-4-127.us-east-2.compute.internal   Ready    worker   33m   v1.27.11+d8e449a

# oc get nodes -l hypershift.openshift.io/cluster=ocm-staging-2bcind28698qgrugl87laqerhhb0u2c2-mukri-dysn-0019
NAME                                        STATUS   ROLES    AGE   VERSION
ip-10-0-84-196.us-east-2.compute.internal   Ready    worker   36m   v1.27.11+d8e449a

Taints were missing on those nodes, so metrics-forwarder pods from other hosted clusters got scheduled on the serving nodes:

# oc get pods -A -o wide | grep ip-10-0-84-196.us-east-2.compute.internal
ocm-staging-2bcind28698qgrugl87laqerhhb0u2c2-mukri-dysn-0019   kube-apiserver-86d4866654-brfkb      5/5   Running   0   40m   10.128.48.6   ip-10-0-84-196.us-east-2.compute.internal   <none>   <none>
ocm-staging-2bcins06s2acm59sp85g4qd43g9hq42g-mukri-dysn-0020   metrics-forwarder-6d787d5874-69bv7   1/1   Running   0   40m   10.128.48.7   ip-10-0-84-196.us-east-2.compute.internal   <none>   <none>
... and a few more
Version-Release number of selected component (if applicable):
MC Version 4.14.17 HC version 4.15.10 HO Version quay.io/acm-d/rhtap-hypershift-operator:c698d1da049c86c2cfb4c0f61ca052a0654e2fb9
How reproducible:
Not Always
Steps to Reproduce:
1. Create an MC with prod config (non-dynamic serving nodes)
2. Create HCs on it at 1 HCP per minute
3. Clusters get stuck at installing for more than 30 minutes
Actual results:
Only one replica of the kube-apiserver pods was up, and the second was stuck in the Pending state. Upon checking, the Machine API had scaled up both nodes in that machineset (serving-12), but only one got assigned (labelled). Further checking showed that the node from one zone (serving-12a) was assigned to one hosted cluster (0017), while the other (serving-12b) was assigned to a different hosted cluster (0019).
Expected results:
The kube-apiserver replicas should land on nodes from the same machineset, and those nodes should be tainted.
Additional info: Slack
Description of problem:
When creating a cluster with OCP < 4.16 and nodepools with a number of workers larger than the smallest size, the placeholder pods for the hosted cluster get continually recycled and the hosted cluster is never scheduled.
Version-Release number of selected component (if applicable):
4.17.0
How reproducible:
Always
Steps to Reproduce:
0. Install hypershift with size tagging enabled
1. Create a hosted cluster with nodepools large enough to not be able to use the default placeholder pods (or simply configure no placeholder pods)
Actual results:
The cluster never schedules
Expected results:
The cluster is scheduled and the control plane can come up
Additional info:
CRI-O wipe is an existing feature in OpenShift. When a node reboots, CRI-O wipe clears the node of all images so that the node boots clean. When the node comes back up, it needs access to the image registry to pull all images again, which takes time. In telco and edge situations the node might not have access to the image registry, so it takes longer to come up.
The goal of this feature is to adjust CRI-O wipe to wipe only the images that have been corrupted by the sudden reboot, not all images.
Phase 2 of the enclave support for oc-mirror with the following goals
For 4.17 timeframe
Currently the operator catalog image is always deleted by the delete feature. This can lead to broken catalogs in the clusters.
It is necessary to change the implementation to skip the deletion of the operator catalog image according to the following conditions:
It is necessary to create a data structure that records, for each related image, which operators it is encountered in; this information can be gathered from the loop already present in the collection phase.
Having this data structure will make it possible to tell customers which operators failed based on an image that failed during the mirroring.
For example: related image X failed during the mirroring, and this related image is present in operators a, b and c, so the mirroring errors file already being generated will include the names of the operators instead of only the name of the related image.
oc-mirror should notify users when minVersion and maxVersion differ by more than one minor version (e.g. 4.14 to 4.16), to advise users to include the interim version in the channels to be mirrored.
This is required when planning to allow upgrades between Extended Upgrade Support (EUS) releases, which require the interim version between the two (e.g. 4.15 is required in the mirrored content to upgrade 4.14 to 4.16).
oc-mirror will clearly inform users about this requirement via the command line so that they can select the appropriate versions for their upgrade plans.
When doing an OCP upgrade between EUS versions, it is sometimes required to go through an intermediate version between the current and target versions.
For example:
current OCP version 4.14
target OCP version 4.16
Sometimes, in order to upgrade from 4.14 to 4.16, an intermediate version such as 4.15.8 is required, and this version needs to be included in the ImageSetConfiguration when using oc-mirror.
The current algorithm in oc-mirror is not accurate enough to give this information, so the proposal is to add a warning in the command line and in the docs about using the Cincinnati graph web page to check whether intermediate versions are needed when upgrading between OCP EUS versions, and adding them to the ImageSetConfiguration.
oc-mirror needs to identify when an OCP EUS version is trying to upgrade while skipping one version (for example going from 4.12.14 to 4.14.18).
When this condition is identified, oc-mirror needs to show a warning in the log telling the customer to use the Cincinnati web tool (upgrade tool) to identify the intermediate versions required.
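For illustration, a sketch of an ImageSetConfiguration that mirrors the interim EUS hop alongside the current and target versions (version numbers and channel names are examples, and the v2alpha1 apiVersion is an assumption about the oc-mirror v2 schema):

kind: ImageSetConfiguration
apiVersion: mirror.openshift.io/v2alpha1
mirror:
  platform:
    channels:
    - name: stable-4.14
      minVersion: 4.14.30
      maxVersion: 4.14.30
    - name: stable-4.15   # interim version required on the 4.14 -> 4.16 path
      minVersion: 4.15.8
      maxVersion: 4.15.8
    - name: stable-4.16
      minVersion: 4.16.1
      maxVersion: 4.16.1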
Adding nodes to on-prem clusters in OpenShift is, in general, a complex task. We have numerous methods, and the field keeps adding automation around these methods with a variety of solutions, sometimes unsupported (see "Why is this important" below). Making cluster expansion easier will let users add nodes often and fast, leading to a much improved UX.
This feature adds nodes to any on-prem clusters, regardless of their installation method (UPI, IPI, Assisted, Agent), by booting an ISO image that will add the node to the cluster specified by the user, regardless of how the cluster was installed.
1. Create image:
$ export KUBECONFIG=kubeconfig-of-target-cluster
$ oc adm node-image -o agent.iso --network-data=worker-n.nmstate --role=worker
2. Boot image
3. Check progress
$ oc adm add-node
An important goal of this feature is to unify and eliminate some of the existing options to add nodes, aiming to provide much simpler experience (See "Why is this important below"). We have official and field-documented ways to do this, that could be removed once this feature is in place, simplifying the experience, our docs and the maintenance of said official paths:
With this proposed workflow we eliminate the need of using the UPI method in the vast majority of the cases. We also eliminate the field-documented methods that keep popping up trying to solve this in multiple formats, and the need to recommend using MCE to all on-prem users, and finally we add a simpler option for IPI-deployed clusters.
In addition, all the built-in validations in the assisted service would be run, improving the installation success rate and overall UX.
This work would have an initial impact on bare metal, vSphere, Nutanix and platform-agnostic clusters, regardless of how they were installed.
This feature is essential for several reasons. Firstly, it enables easy day2 installation without burdening the user with additional technical knowledge. This simplifies the process of scaling the cluster resources with new nodes, which today is overly complex and presents multiple options (https://docs.openshift.com/container-platform/4.13/post_installation_configuration/cluster-tasks.html#adding-worker-nodes_post-install-cluster-tasks).
Secondly, it establishes a unified experience for expanding clusters, regardless of their installation method. This streamlines the deployment process and enhances user convenience.
Another advantage is the elimination of the requirement to install the Multicluster Engine and Infrastructure Operator, which, besides demanding additional system resources, are overkill for use cases where the user simply wants to add nodes to their existing cluster but isn't managing multiple clusters yet. This results in a more efficient and lightweight cluster scaling experience.
Additionally, in the case of IPI-deployed bare metal clusters, this feature eradicates the need for nodes to have a Baseboard Management Controller (BMC) available, simplifying the expansion of bare metal clusters.
Lastly, this problem is often brought up in the field, where examples of different custom solutions have been put in place by redhatters working with customers trying to solve the problem with custom automations, adding to inconsistent processes to scale clusters.
This feature will solve the problem of cluster expansion for OCI. OCI doesn't have MAPI, and CAPI isn't in the mid-term plans. Mitsubishi shared feedback that makes solving the lack of cluster expansion a requirement for Red Hat and Oracle.
We already have the basic technologies to do this with the assisted-service and the agent-based installer, which already do this work for new clusters, and from which we expect to leverage the foundations for this feature.
Day 2 node addition with agent image.
Yet Another Day 2 Node Addition Commands Proposal
Enable day2 add node using agent-install: AGENT-682
This warning message is displayed by client-go library when running in cluster. Code: https://github.com/kubernetes/client-go/blob/e4e31fd32c91e1584fd774f58b2d39f135d23571/tools/clientcmd/client_config.go#L659
This message should be disabled by either passing in the kubeconfig file correctly or building the config in a different way.
By default (and if available) the same public SSH key used for the installation is reused to add a new node.
It could be useful to allow the user to (optionally) specify a different key in the nodes-config.yaml configuration file, as sketched below.
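A sketch of how such an optional key could look in nodes-config.yaml (the sshKey field is the proposed addition and is hypothetical; the hosts entry only loosely follows the existing format):

hosts:
- hostname: extra-worker-0
  interfaces:
  - name: eth0
    macAddress: 00:ef:44:21:e6:a5
# hypothetical optional field proposed by this card
sshKey: |
  ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAA...example user@example.com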
Allow monitoring simultaneously more than one node.
It may be necessary to coalesce the different assisted-service pre-flight validations output accordingly
Currently Assisted Service, when adding a new node, configures the bootstrap ignition to fetch the node ignition from the insecure port (22624), even though it would be possible to use the secure one (22623). This could be an issue for existing users who didn't want to use the insecure port for the add-node operation.
Implementation notes
Extend the ClusterInfo asset to retrieve the initial ignition details (URL and CA certificate) from the openshift-machine-api/worker-user-data-managed secret, if available in the target cluster. This information will then be used by the agent-installer-client when importing a new cluster, to configure the cluster ignition_endpoint.
(see more context in comment https://github.com/openshift/installer/pull/8242#discussion_r1571664023)
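For reference, the ignition URL and CA bundle can be inspected in that secret roughly as follows (the jq paths are assumptions about the stub ignition layout inside the secret's userData):

$ oc -n openshift-machine-api get secret worker-user-data-managed \
    -o jsonpath='{.data.userData}' | base64 -d \
    | jq '{url: .ignition.config.merge[0].source, ca: .ignition.security.tls.certificateAuthorities[0].source}'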
To allow running the node-joiner in a pod in a cluster where FIPS is enabled:
Performs all the required preliminary validations for adding new nodes:
Epic Goal*
Provide simple commands for almost all users to add a node to a cluster where scaling up a MachineSet isn't an option, whether they have installed using UPI, Assisted or the agent-based installer, or can't use MachineSets for some other reason.
Why is this important? (mandatory)
Scenarios (mandatory)
Provide details for user scenarios including actions to be performed, platform specifications, and user personas.
Dependencies (internal and external) (mandatory)
Contributing Teams(and contacts) (mandatory)
The installer team is developing the main body of the feature, which will run in the cluster to be expanded, as well as a prototype client-side script in AGENT-682. They will then be able to translate the client-side into native oc-adm subcommands.
Acceptance Criteria (optional)
Provide some (testable) examples of how we will know if we have achieved the epic goal.
Drawbacks or Risk (optional)
Reasons we should consider NOT doing this such as: limited audience for the feature, feature will be superseded by other work that is planned, resulting feature will introduce substantial administrative complexity or user confusion, etc.
Done - Checklist (mandatory)
The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.
Allow running the node-image commands only for 4.17+ clusters.
Full support for 4.16 clusters will be implemented via https://issues.redhat.com/browse/OCPSTRAT-1528
This is the command responsible for monitoring the activity when adding new node(s) to the target cluster.
Similarly to the add-nodes-image command, this one will be a simple wrapper around the node-joiner's monitor command.
A list of the expected operations can be found in https://github.com/openshift/installer/blob/master/docs/user/agent/add-node/node-joiner-monitor.sh
Move to using the upstream Cluster API (CAPI) in place of the current implementation of the Machine API for standalone Openshift
prerequisite work Goals completed in OCPSTRAT-1122
Complete the design of the Cluster API (CAPI) architecture and build the core operator logic needed for Phase-1, incorporating the assets from different repositories to simplify asset management.
Phase 1 & 2 covers implementing base functionality for CAPI.
There must be no negative effect on customers/users of the MAPI; this API must continue to be accessible to them, though how it is implemented "under the covers", and whether that implementation leverages CAPI, is open.
On clusters where we use Cluster API instead of Machine API we need to create an empty file to report that the bootstrapping was successful. The file should be placed at "/run/cluster-api/bootstrap-success.complete".
Normally there is a dedicated controller for this, but in OpenShift we use MCO to bootstrap machines, so we have to create this file directly.
ToDo:
Links:
https://cluster-api.sigs.k8s.io/developer/providers/bootstrap.html
https://github.com/kubernetes-sigs/cluster-api-provider-azure/blob/main/azure/defaults.go#L81-L83
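One possible sketch (not the agreed design) of having MCO create the sentinel file via a MachineConfig-managed systemd unit; the object name, role label, and unit name are placeholders:

apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: 99-master-capi-bootstrap-sentinel
  labels:
    machineconfiguration.openshift.io/role: master
spec:
  config:
    ignition:
      version: 3.2.0
    systemd:
      units:
      - name: capi-bootstrap-sentinel.service
        enabled: true
        contents: |
          [Unit]
          Description=Report successful CAPI bootstrap
          [Service]
          Type=oneshot
          ExecStartPre=/usr/bin/mkdir -p /run/cluster-api
          ExecStart=/usr/bin/touch /run/cluster-api/bootstrap-success.complete
          [Install]
          WantedBy=multi-user.target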
As an OpenShift engineer I want the CAPI Providers repositories to use the new generator tool so that they can independently generate CAPI Provider transport ConfigMaps
Once the new CAPI manifests generator tool is ready, we want to make use of that directly from the CAPI Providers repositories so we can avoid storing the generated configuration centrally and independently apply that based on the running platform.
Console enhancements based on customer RFEs that improve customer user experience.
Requirement | Notes | isMvp? |
---|---|---|
CI - MUST be running successfully with test automation | This is a requirement for ALL features. | YES |
Release Technical Enablement | Provide necessary release enablement details and documents. | YES |
This Section:
This Section: What does the person writing code, testing, documenting need to know? What context can be provided to frame this feature.
Questions to be addressed:
Description of problem:
Port exposed in Dockerfile not observed in the Ports Dropdown in Git Import Form
Version-Release number of selected component (if applicable):
4.15
How reproducible:
Always
Steps to Reproduce:
Use https://github.com/Lucifergene/knative-do-demo/ to create an application with the Dockerfile strategy
Actual results:
Ports not displayed
Expected results:
The port should be displayed
Additional info:
As a user, I don't want to see a jumping UI when the Import from Git flow (or other forms) waits for a network call to be handled.
The PatternFly Button components have a progress indicator for that: https://www.patternfly.org/components/button/#progress-indicators
This should be preferred over the manually displayed indicator that is shown sometimes (esp on the Import from Git page) in the next line.
As a user, I want to quickly select a Git Type if Openshift cannot identify it. Currently, the UI involves opening a Dropdown and selecting the desired Git type.
But it would be great if I could select the Git Type directly without opening any dropdown. This will reduce the number of clicks required to complete the action.
The OpenShift Developer Console supports an easy way to import source code from a Git repository and automatically creates a BuildConfig or a Pipeline for the user.
Gitea is an open-source alternative to GitHub, similar to GitLab. Customers who use or try Gitea will see warnings while importing their Git repository. We already got the first bug around missing Gitea support: OCPBUGS-31093
None
Not required
None
As a developer, I want to create a new Gitea service to be able to perform all kinds of import operations on repositories hosted in Gitea.
OLM users can easily see in the console if an installed operator package is deprecated and learn how to stay within the support boundary by viewing the alerts/notifications that OLM emits, or by reviewing the operator status rendered by the console with visual representation.
As a user, I would like to customize the modal displayed when a 'Create Project' button is clicked in the UI.
Acceptance criteria
Provide a simplified view of the config files belonging to MachineConfig objects, to provide a more convenient user experience and simpler management.
Current state:
' | sed "s@+@ @g;s@%@\\\\x@g" | xargs -0 printf "%b\n
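The fragment above is the tail of the URL-decoding pipeline users currently have to run by hand; for context, a complete version might look like the following (the MachineConfig name and file index are hypothetical, and the pipeline only applies to URL-encoded, non-base64 data URLs):

# MachineConfig name and file index below are illustrative only
$ oc get mc 99-worker-example \
    -o jsonpath='{.spec.config.storage.files[0].contents.source}' \
    | sed 's@^data:,@@' \
    | sed "s@+@ @g;s@%@\\\\x@g" | xargs -0 printf "%b\n"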
Desired state:
AC:
OLM users can easily see in the console if an installed operator package is deprecated and learn how to stay within the support boundary by viewing the alerts/notifications that OLM emits, or by reviewing the operator status rendered by the console with visual representation.
If there is any PDB in the SufficientPods condition whose allowed disruptions equal 0, then when the cluster admin tries to drain a node it will throw an error because the Pod disruption budget would be violated. To avoid this, add a warning message on the Topology page, similar to the resource quota warning message, to let the user know about this violation.
Internal doc for reference - https://docs.google.com/document/d/1pa1jaYXPPMc-XhHt_syKecbrBaozhFg2_gKOz7E2JWw/edit
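For context, one way to spot the PDBs that would block a drain is to list those with zero allowed disruptions; a minimal sketch:

$ oc get pdb -A -o json \
    | jq -r '.items[] | select(.status.disruptionsAllowed == 0) | "\(.metadata.namespace)/\(.metadata.name)"'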
This is about GAing the work we started with OCPSTRAT-1040
The goal is to remove the experimental tag from the command and document it.
This is about GA-ing the work we started in OCPSTRAT-1040.
Goal is to remove the experimental keyword from the new command flag and document this.
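For reference, while the command is still experimental it is gated behind an environment variable, so current usage looks roughly like the sketch below (the gate variable name reflects our understanding of the current upstream implementation and may change as part of this work):

$ OC_ENABLE_CMD_UPGRADE_STATUS=true oc adm upgrade status

Once the experimental keyword is removed, the plain invocation should suffice:

$ oc adm upgrade status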
Acceptance Criteria
As a customer of self-managed OpenShift or an SRE managing a fleet of OpenShift clusters, I should be able to determine the progress and state of an OCP upgrade and only be alerted if the cluster is unable to progress. Support a CLI status command and status API which can be used by the cluster-admin to monitor the progress. The status command/API should also contain data to alert users about potential issues which can make the update problematic.
Here are common update improvements from customer interactions on Update experience
oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.12.0    True        True          16s     Working towards 4.12.4: 9 of 829 done (1% complete)
Update docs for UX and CLI changes
Reference : https://docs.google.com/presentation/d/1cwxN30uno_9O8RayvcIAe8Owlds-5Wjr970sDxn7hPg/edit#slide=id.g2a2b8de8edb_0_22
Epic Goal*
Add a new command `oc adm upgrade status` command which is backed by an API. Please find the mock output of the command output attached in this card.
Why is this important? (mandatory)
Scenarios (mandatory)
Provide details for user scenarios including actions to be performed, platform specifications, and user personas.
Dependencies (internal and external) (mandatory)
What items must be delivered by other teams/groups to enable delivery of this epic.
Contributing Teams(and contacts) (mandatory)
Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.
Acceptance Criteria (optional)
Provide some (testable) examples of how we will know if we have achieved the epic goal.
Drawbacks or Risk (optional)
Reasons we should consider NOT doing this such as: limited audience for the feature, feature will be superseded by other work that is planned, resulting feature will introduce substantial administrative complexity or user confusion, etc.
Done - Checklist (mandatory)
The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.
An update is in progress for 28m42s: Working towards 4.14.1: 700 of 859 done (81% complete), waiting on network

= Control Plane =
...
Completion: 91%
1. Inconsistent info: CVO message says "700 of 859 done (81% complete)" but control plane section says "Completion: 91%"
2. Unclear measure of completion: the CVO message counts manifests applied, while the control plane section's "Completion: 91%" counts upgraded COs, and neither message states what it counts. Manifest count is an internal implementation detail which users likely do not understand. COs are less so, but we should be clearer about what the completion means.
3. We could take advantage of this line and communicate progress with more details
We'll only remove CVO message once the rest of the output functionally covers it, so the inconsistency stays until OTA-1154. Otherwise:
= Control Plane =
...
Completion: 91% (30 operators upgraded, 1 upgrading, 2 waiting)
Upgraded operators are COs that have updated their version, no matter their conditions
Upgrading operators are COs that haven't updated their version and are Progressing=True
Waiting operators are COs that haven't updated their version and are Progressing=False
=Control Plane Upgrade=
...
Completion: 45% (Est Time Remaining: 35m)
                ^^^^^^^^^^^^^^^^^^^^^^^^^
Do not worry too much about the precision, we can make this more precise in the future. I am thinking of
1. Assigning a fixed amount of time per CO remaining for COs that do not have daemonsets
2. Assign an amount of time proportional to # of workers to each remaining CO that has daemonsets (network, dns)
3. Assign a special amount of time proportional to # of workers to MCO
We can probably take into account the "how long are we upgrading this operator right now" exposed by CVO in OTA-1160
Discovered by Evgeni Vakhonin during OTA-1245, the 4.16 code did not take nodes that are members of multiple pools into account. This surfaced in several ways:
Duplicate insights (we iterate over nodes per pool, so we see problematic edges in each pool the node is a member of):
= Update Health =
SINCE   LEVEL   IMPACT           MESSAGE
-       Error   Update Stalled   Node ip-10-0-26-198.us-east-2.compute.internal is degraded
-       Error   Update Stalled   Node ip-10-0-26-198.us-east-2.compute.internal is degraded
Such node is present in all pool listings, and in some cases such as paused pools the output is confusing (paused-ness is a property of a pool, so we list a node as paused in one pool but outdated pending in another):
= Worker Pool =
Worker Pool: mcpfoo
Assessment: Excluded
...
Worker Pool Nodes
NAME                                        ASSESSMENT   PHASE    VERSION   EST   MESSAGE
ip-10-0-26-198.us-east-2.compute.internal   Excluded     Paused   4.15.12   -

= Worker Pool =
Worker Pool: worker
...
Worker Pool Nodes
NAME                                        ASSESSMENT   PHASE     VERSION   EST   MESSAGE
ip-10-0-26-198.us-east-2.compute.internal   Outdated     Pending   4.15.12   ?
It is not clear to me what would be the correct presentation of this case. Because this is an update status (and not node or cluster status) command, and only a single pool drives an update of a node, I'm thinking that maybe the best course of action would be to show only nodes whose version is driven by a given pool, or maybe come up with a "externally driven"-like assessment or whatever.
As an OTA engineer,
I would like to make sure the node in a single-node cluster is handled correctly in the upgrade-status command.
Context:
According to the discussion with the MCO team,
the node is in the MCP/master pool but not in worker.
This card is to make sure that the node is displayed that way too. My feeling is that the current code probably does the job already. In that case, we should add test coverage for the case to avoid regression in the future.
AC:
An elevator pitch (value statement) that describes the Feature in a clear, concise way. Complete during New status.
Enable GCP Workload Identity Webhook
The observable functionality that the user now has as a result of receiving this feature. Include the anticipated primary user type/persona and which existing features, if any, will be expanded. Complete during New status.
Provide the GCP workload identity webhook so a customer can more easily configure their applications to use the service account tokens minted by clusters that use GCP Workload Identity.
A list of specific needs or objectives that a feature must deliver in order to be considered complete. Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc. Initial completion during Refinement status.
<enter general Feature acceptance here>
Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed. Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.
Deployment considerations | List applicable specific needs (N/A = not applicable) |
Self-managed, managed, or both | Both, the scope of this is for self-managed |
Classic (standalone cluster) | Classic |
Hosted control planes | N/A |
Multi node, Compact (three node), or Single node (SNO), or all | All |
Connected / Restricted Network | All |
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) | x86_x64 |
Operator compatibility | TBD |
Backport needed (list applicable versions) | N/A |
UI need (e.g. OpenShift Console, dynamic plugin, OCM) | TBD |
Other (please specify) |
Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.
<your text here>
Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.
<your text here>
High-level list of items that are out of scope. Initial completion during Refinement status.
<your text here>
Provide any additional context is needed to frame the feature. Initial completion during Refinement status.
Just like AWS STS and ARO Entra Workload ID, we want to provide the GCP workload identity webhook so a customer can more easily configure their applications to use the service account tokens minted by clusters that use GCP Workload Identity.
Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.
<your text here>
Provide information that needs to be considered and planned so that documentation will meet customer needs. If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.
<your text here>
Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.
<your text here>
Will require the following:
Background
As a result of Hashicorp's license change to BSL, Red Hat OpenShift needs to remove the use of Hashicorp's Terraform from the installer - specifically for IPI deployments which currently use Terraform for setting up the infrastructure.
To avoid an increased support overhead once the license changes at the end of the year, we want to provision GCP infrastructure without the use of Terraform.
Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.
Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.
High-level list of items that are out of scope. Initial completion during Refinement status.
Provide any additional context is needed to frame the feature. Initial completion during Refinement status.
Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.
Provide information that needs to be considered and planned so that documentation will meet customer needs. If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.
Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.
OCP/Telco Definition of Done
Epic Template descriptions and documentation.
Description of problem:
failed to create control-plane machines using GCP marketplace image
Version-Release number of selected component (if applicable):
4.16.0-0.nightly-multi-2024-06-11-205940 / 4.16.0-0.nightly-2024-06-10-211334
How reproducible:
Always
Steps to Reproduce:
1. "create install-config", and then edit it to insert osImage settings (see [1]) 2. "create cluster" (see [2])
Actual results:
1. The bootstrap machine and the control-plane machines are not created.
2. Although it says "Waiting up to 15m0s (until 10:07AM CST)" for control-plane machines being provisioned, it did not time out until around 10:35AM CST.
Expected results:
The installation should succeed.
Additional info:
FYI a PROW CI test also has the issue: https://qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gs/qe-private-deck/pr-logs/pull/openshift_release/52816/rehearse-52816-periodic-ci-openshift-verification-tests-master-installer-rehearse-4.16-installer-rehearse-debug/1800431930391400448
Once support for private/internal clusters is added to CAPG in CORS-3252, we will need to integrate those changes into the installer:
Description of problem:
Installing into a Shared VPC gets stuck waiting for the network infrastructure to become ready
Version-Release number of selected component (if applicable):
4.17.0-0.nightly-2024-06-10-225505
How reproducible:
Always
Steps to Reproduce:
1. "create install-config" and then insert Shared VPC settings (see [1]) 2. activate the service account which has the minimum permissions in the host project (see [2]) 3. "create cluster" FYI The GCP project "openshift-qe" is the service project, and the GCP project "openshift-qe-shared-vpc" is the host project.
Actual results:
1. Getting stuck waiting for the network infrastructure to become ready, until Ctrl+C is pressed.
2. 2 firewall-rules are created in the service project unexpectedly (see [3]).
Expected results:
The installation should succeed, and no firewall-rule should be created in either the service project or the host project.
Additional info:
When installconfig.platform.gcp.userTags is specified, all taggable resources should have the specified user tags.
This requires setting TechPreviewNoUpgrade featureSet to configure tags.
Use CAPG by default and remove it from TechPreview in 4.17.
Once https://github.com/openshift/installer/pull/8359 merges, which adds a call to CAPI DestroyBootstrap, the GCP bootstrap firewall rule should be removed. This rule was added in https://github.com/openshift/installer/pull/8374.
When creating a Private Cluster with CAPG the cloud-controller-manager generates an error when the instance-group is created:
I0611 00:04:34.998546 1 event.go:376] "Event occurred" object="openshift-ingress/router-default" fieldPath="" kind="Service" apiVersion="v1" type="Warning" reason="SyncLoadBalancerFailed" message="Error syncing load balancer: failed to ensure load balancer: googleapi: Error 400: Resource 'projects/openshift-dev-installer/zones/us-east1-b/instances/bfournie-capg-test-6vn69-worker-b-rghf7' is expected to be in the subnetwork 'projects/openshift-dev-installer/regions/us-east1/subnetworks/bfournie-capg-test-6vn69-master-subnet' but is in the subnetwork 'projects/openshift-dev-installer/regions/us-east1/subnetworks/bfournie-capg-test-6vn69-worker-subnet'., wrongSubnetwork"
Three "k8s-ig" instance-groups were created for the Internal LoadBlancer. Of the 3, the first one is using the master subnet
subnetwork: https://www.googleapis.com/compute/v1/projects/openshift-dev-installer/regions/us-east1/subnetworks/bfournie-capg-test-6vn69-master-subnet
while the other two are using the worker-subnet. Since this IG uses the master-subnet and the instance is using the worker-subnet it results in this mismatch.
This looks similar to an issue tracked (and closed) in cloud-provider-gcp:
https://github.com/kubernetes/cloud-provider-gcp/pull/605
Once a new version of CAPG is released we'll need to pick it up.
Related to https://issues.redhat.com/browse/CORS-3445, we need to ensure that pre-created ServiceAccounts can be passed in and used by a CAPG installation.
We are occasionally seeing this error when using GCP with TechPreview, i.e. using CAPG.
waiting for api to be available
level=warning msg=FeatureSet "TechPreviewNoUpgrade" is enabled. This FeatureSet does not allow upgrades and may affect the supportability of the cluster.
level=info msg=Creating infrastructure resources...
level=error msg=failed to fetch Cluster: failed to generate asset "Cluster": failed to create cluster: failed during pre-provisioning: failed to add worker roles: failed to set project IAM policy: googleapi: Error 409: There were concurrent policy changes. Please retry the whole read-modify-write with exponential backoff. The request's ETag '\007\006\033\255\347+\335\210' did not match the current policy's ETag '\007\006\033\255\347>%\332'., aborted
Installer exit with code 4
Install attempt 3 of 3
This is a clone of issue OCPBUGS-38152. The following is the description of the original issue:
—
Description of problem:
Shared VPC installation using a service account that has all required permissions failed because the cluster operator ingress degraded, reporting the error "error getting load balancer's firewall: googleapi: Error 403: Required 'compute.firewalls.get' permission for 'projects/openshift-qe-shared-vpc/global/firewalls/k8s-fw-a5b1f420669b3474d959cff80e8452dc'"
Version-Release number of selected component (if applicable):
4.17.0-0.nightly-multi-2024-08-07-221959
How reproducible:
Always
Steps to Reproduce:
1. "create install-config", then insert the interested settings (see [1]) 2. "create cluster" (see [2])
Actual results:
Installation failed, because cluster operator ingress degraded (see [2] and [3]).

$ oc get co ingress
NAME      VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
ingress             False       True          True       113m    The "default" ingress controller reports Available=False: IngressControllerUnavailable: One or more status conditions indicate unavailable: LoadBalancerReady=False (SyncLoadBalancerFailed: The service-controller component is reporting SyncLoadBalancerFailed events like: Error syncing load balancer: failed to ensure load balancer: error getting load balancer's firewall: googleapi: Error 403: Required 'compute.firewalls.get' permission for 'projects/openshift-qe-shared-vpc/global/firewalls/k8s-fw-a5b1f420669b3474d959cff80e8452dc', forbidden...
$

In fact the mentioned k8s firewall-rule doesn't exist in the host project (see [4]), and the given service account does have enough permissions (see [6]).
Expected results:
Installation succeeds, and all cluster operators are healthy.
Additional info:
Customers have requested the ability to apply tolerations to the HCP control plane pods. This provides the flexibility to have the HCP pods scheduled to nodes with taints applied to them that are not tolerated by default.
API
Add new field to HostedCluster. hc.Spec.Tolerations
Tolerations []corev1.Toleration `json:"tolerations,omitempty"`
Implementation
In support/config/deployment.go, add hc.spec.tolerations from hc when generating the default config. This will cause the toleration to naturally get spread to the deployments and statefulsets.
CLI
Add a new CLI argument called --tolerations to the hcp CLI tool during cluster creation. This argument should be able to be set multiple times. The syntax of the field should follow the convention set by the kubectl client tool when setting a taint on a node.
For example, the kubectl client tool can be used to set the following taint on a node.
kubectl taint nodes node1 key1=value1:NoSchedule
And then the hcp cli tool should be able to add a toleration for this taint during creation with the following cli arg.
hcp cluster create kubevirt --toleration "key1=value1:NoSchedule" ...
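For illustration, a minimal sketch of what the resulting field on the HostedCluster could look like after creation (the exact rendering is an assumption; effect capitalization follows the core/v1 Toleration type):

apiVersion: hypershift.openshift.io/v1beta1
kind: HostedCluster
metadata:
  name: example
  namespace: clusters
spec:
  tolerations:
  - key: key1
    operator: Equal
    value: value1
    effect: NoSchedule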
A list of specific needs or objectives that a feature must deliver in order to be considered complete. Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc. Initial completion during Refinement status.
<enter general Feature acceptance here>
Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed. Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.
Deployment considerations | List applicable specific needs (N/A = not applicable) |
Self-managed, managed, or both | |
Classic (standalone cluster) | |
Hosted control planes | |
Multi node, Compact (three node), or Single node (SNO), or all | |
Connected / Restricted Network | |
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) | |
Operator compatibility | |
Backport needed (list applicable versions) | |
UI need (e.g. OpenShift Console, dynamic plugin, OCM) | |
Other (please specify) |
Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.
<your text here>
Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.
<your text here>
High-level list of items that are out of scope. Initial completion during Refinement status.
<your text here>
Provide any additional context is needed to frame the feature. Initial completion during Refinement status.
<your text here>
Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.
<your text here>
Provide information that needs to be considered and planned so that documentation will meet customer needs. If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.
<your text here>
Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.
<your text here>
Customers have requested the ability to apply tolerations to the HCP control plane pods. This provides the flexibility to have the HCP pods scheduled to nodes with taints applied to them that are not tolerated by default.
API
Add new field to HostedCluster. hc.Spec.Tolerations
Tolerations []corev1.Toleration `json:"tolerations,omitempty"`
Implementation
In support/config/deployment.go, add hc.spec.tolerations from hc when generating the default config. This will cause the toleration to naturally get spread to the deployments and statefulsets.
CLI
Add a new CLI argument called --tolerations to the hcp CLI tool during cluster creation. This argument should be able to be set multiple times. The syntax of the field should follow the convention set by the kubectl client tool when setting a taint on a node.
For example, the kubectl client tool can be used to set the following taint on a node.
kubectl taint nodes node1 key1=value1:NoSchedule
And then the hcp cli tool should be able to add a toleration for this taint during creation with the following cli arg.
hcp cluster create kubevirt --toleration "key1=value1:NoSchedule" ...
The OCP snapshot controller needs to be updated in the pkg/operator/starter.go file to account for HCP tolerations
The cluster-network-operator needs to be HCP tolerations aware, otherwise controllers (like multus and ovn) won't be deployed by the CNO with the correct tolerations.
The code that looks at the HostedControlPlane within the CNO can be found in pkg/hypershift/hypershift.go. https://github.com/openshift/cluster-network-operator/blob/33070b57aac78118eea34060adef7f2fb7b7b4bf/pkg/hypershift/hypershift.go#L134
API
Add new field to HostedCluster. hc.Spec.Tolerations
Tolerations []corev1.Toleration `json:"tolerations,omitempty"`
Implementation
In support/config/deployment.go, add hc.spec.tolerations from hc when generating the default config. This will cause the toleration to naturally get spread to the deployments and statefulsets.
The objective is to create a comprehensive backup and restore mechanism for HCP OpenShift Virtualization Provider. This feature ensures both the HCP state and the worker node state are backed up and can be restored efficiently, addressing the unique requirements of KubeVirt environments.
The HCP team has delivered OADP backup and restore steps for the Agent and AWS provider here. We need to add the steps necessary to make these steps work for HCP KubeVirt clusters.
The etcd-operator should automatically rotate the etcd-signer and etcd-metrics-signer certs as they approach expiry.
Requirements (aka. Acceptance Criteria):
Deliver rotation and recovery requirements from OCPSTRAT-714
Epic Goal*
The etcd cert rotation controller should automatically rotate the etcd-signer and etcd-metrics-signer certs (and re-sign leaf certs) as they approach expiry.
Why is this important? (mandatory)
Automatic rotation of the signer certs will reduce the operational burden of having to manually rotate the signer certs.
Scenarios (mandatory)
etcd-signer and etcd-metrics-signer certs are rotated as they approach the end of their validity period. For the signer certs this is 4.5 years.
https://github.com/openshift/cluster-etcd-operator/blob/d8f87ecf9b3af3cde87206762a8ca88d12bc37f5/pkg/tlshelpers/tlshelpers.go#L32
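As a quick way to see how far a cluster's current signer certificate is from that window, one might inspect the signer secret directly (the namespace shown is the historical location of the signer; treat both the namespace and the secret name as assumptions, since this epic may move or rename them):

$ oc -n openshift-config get secret etcd-signer \
    -o jsonpath='{.data.tls\.crt}' | base64 -d \
    | openssl x509 -noout -subject -enddate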
Dependencies (internal and external) (mandatory)
None
Contributing Teams(and contacts) (mandatory)
Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.
Acceptance Criteria (optional)
Provide some (testable) examples of how we will know if we have achieved the epic goal.
Drawbacks or Risk (optional)
Reasons we should consider NOT doing this such as: limited audience for the feature, feature will be superseded by other work that is planned, resulting feature will introduce substantial administrative complexity or user confusion, etc.
Done - Checklist (mandatory)
The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.
We shall never allow new leaf certificates to be generated when a revision rollout is in progress AND when the bundle was just changed.
From ETCD-606 we know when a bundle has changed, so we can save the current revision in the operator status and only allow leaf updates on the next higher revision.
NOTE: this assumes etcd rolls out more slowly than the apiserver in practice. We should also think about how we can incorporate the revision rollout on the apiserver static pods.
In ETCD-565 we added tests to manually rotate certificates.
In the recovery test suite, depending on the order of execution we have the following failures:
1. : [sig-etcd][Feature:CertRotation][Suite:openshift/etcd/recovery] etcd can recreate trust bundle [Timeout:15m]
Here the tests usually time out waiting for a revision rollout - couldn't find a deeper cause, maybe the timeout is not large enough.
2. : [sig-etcd][Feature:CertRotation][Suite:openshift/etcd/recovery] etcd can recreate dynamic certificates [Timeout:15m]
The recovery test suite creates several new nodes. When choosing a peer secret, we sometimes choose one that has no member/node anymore and thus it will never be recreated.
3. after https://github.com/openshift/cluster-etcd-operator/pull/1269
After the leaf gating has merged, some certificates are not in their original place anymore, which invalidates the manual rotation procedure
For backward compatibility we tried to keep the previous named certificates the way they were:
https://github.com/openshift/cluster-etcd-operator/blob/master/pkg/operator/starter.go#L614-L639
Many of those are currently merely copied by the ResourceSyncController and could be replaced with their source configmap/secret.
This should make the codebase easier to understand and reduce its mental load.
Some replacement suggestions:
AC:
All openshift TLS artifacts (secrets and configmaps) now have a requirement to have an annotation for user facing descriptions per the metadata registry for TLS artifacts.
https://github.com/openshift/origin/tree/master/tls
There is a guideline for how these descriptions must be written:
https://github.com/openshift/origin/blob/master/tls/descriptions/descriptions.md#how-to-meet-the-requirement
The descriptions for etcd's TLS artifacts don't meet that requirement and should be updated to point out the required details, e.g. hostnames, subjects, and what kind of certificates the signer is signing.
https://github.com/openshift/origin/blob/8ffdb0e38af1319da4a67e391ee9c973d865f727/tls/descriptions/descriptions.md#certificates-22-1
https://github.com/openshift/cluster-etcd-operator/blob/master/pkg/tlshelpers/tlshelpers.go#L74
See also:
https://github.com/openshift/origin/blob/master/tls/descriptions/descriptions.md#Certificates-85
Currently a new revision is created when the ca bundle configmaps (etcd-signer / metrics-signer) have changed.
As of today, this change is not transactional across invocations of EnsureConfigMapCABundle, meaning that four revisions (at most, one for each function call) could be created.
For gating the leaf cert generation on a fixed revision number, it's important to ensure that any bundle change will only ever result in exactly one revision change.
We currently ensure this for leaf certificates by a single update to "etcd-all-certs"; we can use the exact same trick again.
AC:
additional salvage/refactoring from previously reverted ETCD-579
This feature request proposes adding the Partition Number within a Placement Group for OpenShift MachineSets & in CAPI. Currently, OCP 4.14 supports pre-created Placement Groups (RFE-2194). But the feature to define the Partition Number within those groups is missing.
Partition placement groups offer a more granular approach to instance allocation within an Availability Zone on AWS, particularly beneficial for deployments on AWS Outpost (https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/placement-groups-outpost.html). It also allows users to further enhance high availability by distributing instances across isolated hardware units within the chosen Placement Group. This improves fault tolerance by minimizing the impact of hardware failures to only instances within the same partition.
Some Benefits are listed below.
Update MAPI to Support AWS Placement Group Partition Number
Based on RFE-2194, support for pre-created Placement Groups was added in OCP. Following that, it is requested in RFE-4965 to have the ability to specify the Partition Number of the Placement Group as this allows more precise allocation.
NOTE: Placement Group (and Partition) will be pre-created by the user. User should be able to specify Partition Number along with PlacementGroupName on EC2 level to improve availability.
References
Upstream changes: CFE-1041
Add a new field (PlacementGroupPartition) in AWSMachineProviderConfig to allow users to specify the partition number for an AWSMachine, as sketched below.
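For illustration, a minimal sketch of the new field on a MachineSet's providerSpec (surrounding fields trimmed; the placement group name and partition number are examples):

apiVersion: machine.openshift.io/v1beta1
kind: MachineSet
spec:
  template:
    spec:
      providerSpec:
        value:
          apiVersion: machine.openshift.io/v1beta1
          kind: AWSMachineProviderConfig
          placementGroupName: pgpartition
          placementGroupPartition: 2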
Description of problem:
When a machineset is created with an invalid placementGroupPartition of 0, the value is cleared in the machineset and the machine is created successfully, placed inside an auto-chosen partition number; instead, machine creation should fail.
Version-Release number of selected component (if applicable):
4.17.0-0.nightly-2024-07-01-124741
How reproducible:
Always
Steps to Reproduce:
1. Create a machineset with a pre-created partition placement group and placementGroupPartition = 0:
   placementGroupName: pgpartition
   placementGroupPartition: 0
2. The placementGroupPartition is cleared in the machineset, and the machine is created successfully in the pgpartition placement group, placed inside an auto-chosen partition number.
Actual results:
The machine is created and placed inside an auto-chosen partition number.
Expected results:
Machine creation should fail with an error message, as happens for other invalid values, e.g.: errorMessage: 'error launching instance: Value ''8'' is not a valid value for PartitionNumber. Use a value between 1 and 7.'
Additional info:
It's a new feature test for https://issues.redhat.com/browse/CFE-1066
Implement changes in machine-api-provider-aws to support the partition number when creating instances.
Acceptance criteria:
An elevator pitch (value statement) that describes the Feature in a clear, concise way. Complete during New status.
We have multiple customers asking us to disable the vSphere CSI driver/operator as a day-2 operation. The goal of this epic is to provide a safe API that removes the vSphere CSI driver/operator. It will also silence the VPD alerts, as we have received several complaints about VPD raising too many of them.
IMPORTANT: As disabling a storage driver from a running environment can be risky, the use of this API will only be allowed through a RH customer support case. Support will ensure that it is safe to proceed and guide the customer through the process.
The observable functionality that the user now has as a result of receiving this feature. Include the anticipated primary user type/persona and which existing features, if any, will be expanded. Complete during New status.
Customers want to disable vSphere CSI because it requires several vSphere permissions that they don't want to grant at the OCP level; not setting these permissions results in constant, recurring alerting. These customers usually don't want to use vSphere CSI because they use another storage solution.
The goal is to provide an API that disables the vSphere storage integration as well as the VPD alerts, which will still be present in the logs but will not fire (no alerts, lower frequency of checks, lower severity).
A list of specific needs or objectives that a feature must deliver in order to be considered complete. Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc. Initial completion during Refinement status.
The vSphere CSI driver/operator is disabled and no VPD alerts are raised. Logs are still present.
Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed. Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.
Deployment considerations | List applicable specific needs (N/A = not applicable) |
Self-managed, managed, or both | self managed |
Classic (standalone cluster) | yes vsphere only |
Hosted control planes | N/A |
Multi node, Compact (three node), or Single node (SNO), or all | vsphere only usually not SNO |
Connected / Restricted Network | both |
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) | all applicable to vsphere |
Operator compatibility | |
Backport needed (list applicable versions) | |
UI need (e.g. OpenShift Console, dynamic plugin, OCM) | none |
Other (please specify) | Available only through RH support case |
Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.
As an admin I want to disable the vsphere CSI driver because I am using another storage solution.
As an admin I want to disable the vSphere CSI driver because it requires too many vSphere permissions and keeps raising OCP alerts.
Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.
What do we do if there are existing PVs?
How do we manage general alerts if VPD alerts are silenced?
What do we do if a customer tries to install the upstream vSphere CSI driver?
High-level list of items that are out of scope. Initial completion during Refinement status.
Replace the Red Hat vSphere CSI with the vmware upstream driver. We can consider this use case in a second phase if there is an actual demand.
Public availability. To begin with, this will only be possible through an RH support case.
Provide any additional context needed to frame the feature. Initial completion during Refinement status.
Several customer requests ask for the ability to disable the vSphere CSI driver.
see https://issues.redhat.com/browse/RFE-3821
Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.
Understand why the customer wants to disable it in the first place. Be extra careful with pre-flight checks in order to make sure that it is safe to proceed.
Provide information that needs to be considered and planned so that documentation will meet customer needs. If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.
No public doc; we need detailed documentation for our support organisation that includes pre-flight checks, the different steps, confirmation that everything works as expected, and a basic troubleshooting guide. Likely an internal KB article or whatever works best for support.
Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.
Applies to vsphere only
Epic Goal*
Provide a nice and user-friendly API to disable integration between OCP storage and vSphere.
Why is this important? (mandatory)
This is a continuation of STOR-1766, which provides a quick-and-dirty way to disable the vSphere CSI driver in the older releases (4.12 - 4.16).
This epic provides a nice and explicit API to disable the CSI driver in 4.17 (or whichever release implements this epic), and ensures the cluster can be upgraded to any future OCP version.
Scenarios (mandatory)
Dependencies (internal and external) (mandatory)
Contributing Teams(and contacts) (mandatory)
Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.
Acceptance Criteria (optional)
Provide some (testable) examples of how we will know if we have achieved the epic goal.
Drawbacks or Risk (optional)
Reasons we should consider NOT doing this such as: limited audience for the feature, feature will be superseded by other work that is planned, resulting feature will introduce substantial administrative complexity or user confusion, etc.
Done - Checklist (mandatory)
The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.
An elevator pitch (value statement) that describes the Feature in a clear, concise way. Complete during New status.
Allow customers to enable EFS CSI usage metrics.
The observable functionality that the user now has as a result of receiving this feature. Include the anticipated primary user type/persona and which existing features, if any, will be expanded. Complete during New status.
OCP already supports exposing CSI usage metrics; however, the EFS metrics are not enabled by default. The goal of this feature is to allow customers to optionally turn on EFS CSI usage metrics in order to see them in the OCP console.
The EFS metrics are not enabled by default for a good reason: they can potentially impact performance. They are disabled in OCP because the CSI driver would walk through the whole volume, which can be very slow on large volumes. For this reason, the default will remain the same (no metrics); customers need to explicitly opt in.
A list of specific needs or objectives that a feature must deliver in order to be considered complete. Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc. Initial completion during Refinement status.
Clear procedure on how to enable it as a day-2 operation. The default remains no metrics. Once enabled, the metrics should be available for visualisation.
We should also have a way to disable metrics.
Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed. Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.
Deployment considerations | List applicable specific needs (N/A = not applicable) |
Self-managed, managed, or both | both |
Classic (standalone cluster) | yes |
Hosted control planes | yes |
Multi node, Compact (three node), or Single node (SNO), or all | AWS only |
Connected / Restricted Network | both |
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) | all AWS/EFS supported |
Operator compatibility | EFS CSI operator |
Backport needed (list applicable versions) | No |
UI need (e.g. OpenShift Console, dynamic plugin, OCM) | Should appear in OCP UI automatically |
Other (please specify) | OCP on AWS only |
Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.
As an OCP user I want to be able to visualise the EFS CSI metrics.
Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.
<your text here>
High-level list of items that are out of scope. Initial completion during Refinement status.
Additional metrics
Enabling metrics by default.
Provide any additional context needed to frame the feature. Initial completion during Refinement status.
Customer request as per
https://issues.redhat.com/browse/RFE-3290
Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.
We need to be extra clear on the potential performance impact
Provide information that needs to be considered and planned so that documentation will meet customer needs. If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.
Document how to enable CSI metrics + warning about the potential performance impact.
Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.
It can benefit any cluster on AWS using EFS CSI including ROSA
Epic Goal*
The goal of this epic is to provide a way for admins to turn on EFS CSI usage metrics. Since this could degrade performance (the CSI driver would walk through the whole volume), the option will not be enabled by default; admins will need to explicitly opt in.
Why is this important? (mandatory)
Turning on EFS metrics allows users to monitor how much EFS space is being used by OCP.
Scenarios (mandatory)
Provide details for user scenarios including actions to be performed, platform specifications, and user personas.
Dependencies (internal and external) (mandatory)
None
Contributing Teams(and contacts) (mandatory)
Acceptance Criteria (optional)
Enable CSI metrics via the operator - ensure the driver is started with the proper cmdline options. Verify that the metrics are sent and exposed to the users.
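As a rough illustration only: the upstream aws-efs-csi-driver exposes its volume-metrics behaviour through command-line flags, so the operator wiring would amount to adding something along these lines to the driver container (flag names are assumptions recalled from the upstream driver and should be verified against its documentation):
  # Illustrative fragment of the driver container spec (assumed upstream flag names)
  containers:
  - name: csi-driver
    image: example.registry/aws-efs-csi-driver:latest   # placeholder image
    args:
    - --endpoint=$(CSI_ENDPOINT)
    - --vol-metrics-opt-in=true        # opt in to volume usage metrics (walks the volume tree)
    - --vol-metrics-refresh-period=240 # how often metrics are refreshed (see upstream docs for units)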
Drawbacks or Risk (optional)
Metrics are calculated by walking through the whole volume, which can impact performance. For this reason, enabling CSI metrics will need an explicit opt-in from the admin. This risk needs to be explicitly documented.
Done - Checklist (mandatory)
The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.
Description of problem:
The original PR had all the labels, but it didn't merge in time for code freeze due to CI flakes.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Some customers have expressed the need to have a control plane with nodes of a different architecture from the compute nodes in a cluster. This may be to realise cost or power savings, or simply to benefit from cloud providers' in-house hardware.
While this configuration can be achieved with Hosted Control Planes, customers also want to use Multi-architecture compute to achieve it, ideally at install time.
This feature is to track the implementation of this offering with Arm nodes running in the AWS cloud.
Customers will be able to install OpenShift clusters that contain control plane and compute nodes of different architectures
Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed. Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.
Deployment considerations | List applicable specific needs (N/A = not applicable) |
Self-managed, managed, or both | Yes |
Classic (standalone cluster) | Yes |
Hosted control planes | Yes |
Multi node, Compact (three node), or Single node (SNO), or all | Yes |
Connected / Restricted Network | Yes |
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) | x86 and Arm |
Operator compatibility | n/a |
Backport needed (list applicable versions) | n/a |
UI need (e.g. OpenShift Console, dynamic plugin, OCM) | n/a |
Other (please specify) |
Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.
<your text here>
Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.
<your text here>
High-level list of items that are out of scope. Initial completion during Refinement status.
<your text here>
Provide any additional context needed to frame the feature. Initial completion during Refinement status.
<your text here>
Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.
<your text here>
Provide information that needs to be considered and planned so that documentation will meet customer needs. If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.
<your text here>
Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.
<your text here>
Epic Goal
Acceptance Criteria
Allow mixing Control Plane and Compute CPU archs, bypass with warnings if the user overrides the release image. Put behind a feature gate.
Add validation in the Installer to not allow install with multi-arch nodes using a single-arch release payload.
The validation needs to be skipped (or just a warning) when the release payload architecture cannot be determined (e.g. when using OPENSHIFT_INSTALL_RELEASE_IMAGE_OVERRIDE).
Currently the installer doesn't expose if the release payload is multi or single arch:
./openshift-install version
./openshift-install 4.y.z
built from commit xx
release image quay.io/openshift-release-dev/ocp-release@sha256:xxx
release architecture amd64
Or delete the feature gate altogether if we are allowed to do so.
Some customers have expressed the need to have a control plane with nodes of a different architecture from the compute nodes in a cluster. This may be to realise cost or power savings, or simply to benefit from cloud providers' in-house hardware.
While this configuration can be achieved with Hosted Control Planes, customers also want to use Multi-architecture compute to achieve it, ideally at install time.
This feature is to track the implementation of this offering with Arm nodes running in the AWS cloud.
Customers will be able to install OpenShift clusters that contain control plane and compute nodes of different architectures
Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed. Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.
Deployment considerations | List applicable specific needs (N/A = not applicable) |
Self-managed, managed, or both | Yes |
Classic (standalone cluster) | Yes |
Hosted control planes | Yes |
Multi node, Compact (three node), or Single node (SNO), or all | Yes |
Connected / Restricted Network | Yes |
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) | x86 and Arm |
Operator compatibility | n/a |
Backport needed (list applicable versions) | n/a |
UI need (e.g. OpenShift Console, dynamic plugin, OCM) | n/a |
Other (please specify) |
Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.
<your text here>
Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.
<your text here>
High-level list of items that are out of scope. Initial completion during Refinement status.
<your text here>
Provide any additional context needed to frame the feature. Initial completion during Refinement status.
<your text here>
Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.
<your text here>
Provide information that needs to be considered and planned so that documentation will meet customer needs. If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.
<your text here>
Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.
<your text here>
Epic Goal
Why is this important?
Scenarios
1. …
Acceptance Criteria
Dependencies (internal and external)
1. …
Previous Work (Optional):
1. …
Open questions:
1.
Done Checklist
Or delete the feature gate altogether if we are allowed to do so.
Enable CPU manager on s390x.
Why is this important?
CPU manager is an important component for managing OpenShift performance and fully utilizing the underlying platform.
Enable CPU manager on s390x.
CPU manager works on s390x.
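Enabling the CPU manager on s390x workers should use the same KubeletConfig mechanism as on other architectures; a minimal sketch, where the pool label is illustrative:
  apiVersion: machineconfiguration.openshift.io/v1
  kind: KubeletConfig
  metadata:
    name: cpumanager-enabled
  spec:
    machineConfigPoolSelector:
      matchLabels:
        custom-kubelet: cpumanager-enabled   # label added to the target MachineConfigPool
    kubeletConfig:
      cpuManagerPolicy: static               # pins guaranteed-QoS pods with integer CPU requests to dedicated CPUs
      cpuManagerReconcilePeriod: 5s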
Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed. Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.
Deployment considerations | List applicable specific needs (N/A = not applicable) |
Self-managed, managed, or both | Y |
Classic (standalone cluster) | Y |
Hosted control planes | Y |
Multi node, Compact (three node), or Single node (SNO), or all | Y |
Connected / Restricted Network | Y |
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) | IBM Z |
Operator compatibility | n/a |
Backport needed (list applicable versions) | n/a |
UI need (e.g. OpenShift Console, dynamic plugin, OCM) | n/a |
Other (please specify) |
Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.
<your text here>
Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.
<your text here>
High-level list of items that are out of scope. Initial completion during Refinement status.
<your text here>
Provide any additional context needed to frame the feature. Initial completion during Refinement status.
<your text here>
Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.
<your text here>
Provide information that needs to be considered and planned so that documentation will meet customer needs. If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.
<your text here>
Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.
<your text here>
oc-mirror to include the RHCOS image for HyperShift KubeVirt provider when mirroring the OCP release payload
When using the KubeVirt (OpenShift Virtualization) provider for HyperShift, the KubeVirt VMs that serve as nodes for the hosted clusters consume an RHCOS image shipped as a container-disk image for KubeVirt.
In order for this to work in disconnected/air-gapped environments, that image must be part of the mirroring process.
Overview
Refer to RFE-5468
The coreos image is needed to ensure seamless deployment of HyperShift KubeVirt functionality in disconnected/air-gapped environments.
Solution
This story will address this issue, in that oc-mirror will include the CoreOS KubeVirt container image in the release payload.
The image is found in the file release-manifests/0000_50_installer_coreos-bootimages.yaml
A field kubeVirtContainer (default false) will be added to the current v2 ImageSetConfig. If set to true, the release collector will read and parse the YAML file to extract the "DigestRef" (digest) and add it to the release payload (a hedged example follows).
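A hedged example of the proposed knob in an oc-mirror v2 ImageSetConfiguration; the apiVersion string and the exact placement under platform are assumptions based on this story:
  apiVersion: mirror.openshift.io/v2alpha1   # assumed v2 config API version
  kind: ImageSetConfiguration
  mirror:
    platform:
      kubeVirtContainer: true                # new field: also mirror the CoreOS KubeVirt container disk image
      channels:
      - name: stable-4.17                    # illustrative channel
        minVersion: 4.17.0
        maxVersion: 4.17.0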
As a product manager or business owner of OpenShift Lightspeed, I want to track who is using which feature of OLS and why. I also want to track the product adoption rate so that I can make decisions about the product (add/remove features, add new investment).
Enable monitoring of OLS by default when a user installs the OLS operator ---> check the box by default.
Users will have the ability to disable monitoring ----> by unchecking the box.
Refer to this Slack conversation: https://redhat-internal.slack.com/archives/C068JAU4Y0P/p1723564267962489
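A hedged sketch of where the fix likely lands, assuming the console pre-selects the monitoring checkbox based on a ClusterServiceVersion annotation; the annotation key and CSV name below are assumptions, not confirmed API:
  apiVersion: operators.coreos.com/v1alpha1
  kind: ClusterServiceVersion
  metadata:
    name: lightspeed-operator.v0.1.0                      # illustrative CSV name
    annotations:
      operatorframework.io/cluster-monitoring: "true"     # assumed key that pre-checks the monitoring box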
Description of problem:
When installing the OpenShift Lightspeed operator, cluster monitoring should be enabled by default.
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. Click OpenShift Lightspeed in operator catalog 2. Click Install
Actual results:
"Enable Operator recommended cluster monitoring on this Namespace" checkbox is not selected by default.
Expected results:
"Enable Operator recommended cluster monitoring on this Namespace" checkbox should be selected by default.
Additional info:
This ticket focuses on a reduced scope compared to the initial Tech Preview outlined in OCPSTRAT-1327.
Specifically, the console in the 4.17 Tech Preview release allows customers to:
1) Pre-installation:
2) Post-installation:
All the expected user outcomes and the acceptance criteria in the engineering epics are covered.
Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.
<your text here>
Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.
<your text here>
High-level list of items that are out of scope. Initial completion during Refinement status.
<your text here>
Our customers will experience a streamlined approach to managing layered capabilities and workloads delivered through operators, operators packaged in Helm charts, or even plain Helm charts. The next-generation OLM will power this central distribution mechanism within OpenShift in the future.
Customers will be able to explore and discover the layered capabilities or workloads, and then install those offerings and make them available on their OpenShift clusters. Similar to the experience with the current OperatorHub, customers will be able to sort and filter the available offerings based on the delivery mechanism (i.e., operator-backed or plain Helm charts), source type (i.e., from Red Hat or ISVs), valid subscriptions, infrastructure features, etc. Once they click on a specific offering, they see the details, which include the description, usage, and requirements of the offering, the provided services and APIs, and the rest of the relevant metadata needed to make a decision.
The next-gen OLM aims to unify workload management. This includes operators packaged for current OLM, operators packaged in Helm charts, and even plain Helm charts for workloads. We want to leverage the current support for managing plain Helm charts within OpenShift and the console for leveraging our investment over the years.
Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.
<your text here>
Refer to the “Documentation Considerations” section of the OLM v1 GA feature.
This epic contains all the OLM related stories for OCP release-4.17. This was cloned from the original epic which contained a spike to create stories and a plan to support the OLM v1 api
Some docs that detail the OLM v1 Upgrade: https://docs.google.com/document/d/1D--lL8gnoDvs0vl72WZl675T23XcsYIfgW4yhuneCug/edit#heading=h.3l98owr87ve
AC: Implement a catalog view, which lists available OLM v1 packages
This ticket outlines the scope of the Tech Preview release for OCP 4.17
This Tech Preview release grants early access to upcoming features in the next-generation Operator Lifecycle Manager (OLM v1). Customers can now test these functionalities and provide valuable feedback during development.
Highlights of OLM v1 Phase 4 Preview:
All the expected user outcomes and the acceptance criteria in the engineering epics are covered.
Leveraging learnings and customer feedback since OCP 4's inception, OLM v1 is designed to be a major overhaul.
With OpenShift 4.17, we are one step closer to the highly anticipated general availability (GA) of the next-generation OLM.
See the OCPSTRAT feature for OLM v1 GA:
OCP/Telco Definition of Done
Epic Template descriptions and documentation.
<--- Cut-n-Paste the entire contents of this description into your new Epic --->
Remove Rukpak from cluster-olm-operator
Stop building rukpak for OCP 4.17
OCP/Telco Definition of Done
Epic Template descriptions and documentation.
<--- Cut-n-Paste the entire contents of this description into your new Epic --->
operator-controller manifests will need to be updated to create a configmap for service-ca-operator to inject the CA bundle into. In order to not break the payload, cluster-olm-operator will need to be updated to have create, update, patch permissions for the configmap we are creating. Following the principle of least privilege, the permissions should be scoped to the resource name "operator-controller-openshift-ca" (this will be the name of the created configmap)
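A minimal sketch of the two pieces described above, assuming the standard service-ca injection annotation and a namespace name chosen for illustration; note that Kubernetes RBAC cannot scope create by resourceNames, so only update/patch are effectively name-scoped:
  apiVersion: v1
  kind: ConfigMap
  metadata:
    name: operator-controller-openshift-ca
    namespace: openshift-operator-controller              # namespace is an assumption
    annotations:
      service.beta.openshift.io/inject-cabundle: "true"   # service-ca-operator injects the CA bundle
  ---
  apiVersion: rbac.authorization.k8s.io/v1
  kind: Role
  metadata:
    name: operator-controller-ca-configmap                # illustrative Role name
    namespace: openshift-operator-controller
  rules:
  - apiGroups: [""]
    resources: ["configmaps"]
    verbs: ["create"]                                     # create cannot be restricted by resourceNames
  - apiGroups: [""]
    resources: ["configmaps"]
    resourceNames: ["operator-controller-openshift-ca"]
    verbs: ["update", "patch"]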
OVN-Kubernetes Developer's Preview of BGP as a routing protocol, providing pod and VM addressability for User Defined Networks (Segmentation) via common data center networking and removing the need to negotiate NAT at the cluster's edge.
OVN-Kubernetes currently has no native routing protocol integration, and relies on a Geneve overlay for east/west traffic, as well as third party operators to handle external network integration into the cluster. The purpose of this Developer's Preview enhancement is to introduce BGP as a supported routing protocol with OVN-Kubernetes. The extent of this support will allow OVN-Kubernetes to integrate into different BGP user environments, enabling it to dynamically expose cluster scoped network entities into a provider’s network, as well as program BGP learned routes from the provider’s network into OVN. In a follow-on release, this enhancement will provide support for EVPN, which is a common data center networking fabric that relies on BGP.
Deployment considerations | List applicable specific needs (N/A = not applicable) |
Self-managed, managed, or both | |
Classic (standalone cluster) | |
Hosted control planes | |
Multi node, Compact (three node), or Single node (SNO), or all | |
Connected / Restricted Network | |
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) | |
Operator compatibility | |
Backport needed (list applicable versions) | |
UI need (e.g. OpenShift Console, dynamic plugin, OCM) | |
Other (please specify) |
Importing Routes from the Provider Network
Today in OpenShift there is no API for a user to configure routes into OVN. To change how egress cluster traffic is routed, the user leverages local gateway mode, which forces egress traffic to hop through the Linux host's networking stack, where routes can be configured via NMState. This manual configuration would need to be performed and maintained across nodes and VRFs within each node.
Additionally, if a user chooses to not manage routes within the host and use local gateway mode, then by default traffic is always sent to the default gateway. The only other way to affect egress routing is by using the Multiple External Gateways (MEG) feature. With this feature the user may choose to have multiple different egress gateways per namespace to send traffic to.
As an alternative, configuring BGP peers and which route-targets to import would eliminate the need to manually configure routes in the host, and would allow dynamic routing updates based on changes in the provider’s network.
Exporting Routes into the Provider Network
There exists a need for provider networks to learn routes directly to services and pods today in Kubernetes. MetalLB is already one solution whereby load balancer IPs are advertised over BGP to provider networks, and this feature development does not intend to duplicate or replace the function of MetalLB. MetalLB should be able to interoperate with OVN-Kubernetes, and be responsible for advertising services to a provider's network.
However, there is an alternative need to advertise pod IPs on the provider network. One use case is integration with 3rd party load balancers, where they terminate a load balancer and then send packets directly to OCP nodes with the destination IP address being the pod IP itself. Today these load balancers rely on custom operators to detect which node a pod is scheduled to and then add routes into its load balancer to send the packet to the right node.
By integrating BGP and advertising the pod subnets/addresses directly on the provider network, load balancers and other entities on the network would be able to reach the pod IPs directly.
Extending OVN-Kubernetes VRFs into the Provider Network
This is the most powerful motivation for bringing support of EVPN into OVN-Kubernetes. A previous development effort enabled the ability to create a network per namespace (VRF) in OVN-Kubernetes, allowing users to create multiple isolated networks for namespaces of pods. However, the VRFs terminate at node egress, and routes are leaked from the default VRF so that traffic is able to route out of the OCP node. With EVPN, we can now extend the VRFs into the provider network using a VPN. This unlocks the ability to have L3VPNs that extend across the provider networks.
Utilizing the EVPN Fabric as the Overlay for OVN-Kubernetes
In addition to extending VRFs to the outside world for ingress and egress, we can also leverage EVPN to handle extending VRFs into the fabric for east/west traffic. This is useful in EVPN DC deployments where EVPN is already being used in the TOR network, and there is no need to use a Geneve overlay. In this use case, both layer 2 (MAC-VRFs) and layer 3 (IP-VRFs) can be advertised directly to the EVPN fabric. One advantage of doing this is that with Layer 2 networks, broadcast, unknown-unicast and multicast (BUM) traffic is suppressed across the EVPN fabric. Therefore the flooding domain in L2 networks for this type of traffic is limited to the node.
Multi-homing, Link Redundancy, Fast Convergence
Extending the EVPN fabric to OCP nodes brings other added benefits that are not present in OCP natively today. In this design there are at least 2 physical NICs and links leaving the OCP node to the EVPN leaves. This provides link redundancy, and when coupled with BFD and mass withdrawal, it can also provide fast failover. Additionally, the links can be used by the EVPN fabric to utilize ECMP routing.
OVN Kubernetes support for BGP as a routing protocol.
Additional information on each of the above items can be found here: Networking Definition of Planned
...
1.
...
1. …
1. …
MetalLB or a cluster admin will set this flag so that CNO deploys FRR-K8S and activates BGP support in OVN-K.
When the OCP API flag to enable BGP support in the cluster is set, CNO should deploy FRR-K8S. Depends on SDN-5086.
enhancement ref: https://github.com/openshift/enhancements/pull/1636
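A rough illustration of what the opt-in could look like on the cluster network operator configuration; the field names follow the linked enhancement proposal and should be treated as assumptions until the API merges:
  apiVersion: operator.openshift.io/v1
  kind: Network
  metadata:
    name: cluster
  spec:
    additionalRoutingCapabilities:       # assumed field: asks CNO to deploy FRR-K8s
      providers:
      - FRR
    defaultNetwork:
      ovnKubernetesConfig:
        routeAdvertisements: Enabled     # assumed knob that activates BGP support in OVN-Kubernetes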
An elevator pitch (value statement) that describes the Feature in a clear, concise way. Complete during New status.
Introduce snapshot support for Azure File as Tech Preview.
The observable functionality that the user now has as a result of receiving this feature. Include the anticipated primary user type/persona and which existing features, if any, will be expanded. Complete during New status.
After introducing cloning support in 4.17, the goal of this epic is to add the last remaining piece: snapshot support as Tech Preview.
A list of specific needs or objectives that a feature must deliver in order to be considered complete. Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc. Initial completion during Refinement status.
Should pass all the regular CSI snapshot tests. All failing or known issues should be documented in the release notes. Since this feature is TP we can still introduce it with known issues.
Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed. Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.
Deployment considerations | List applicable specific needs (N/A = not applicable) |
Self-managed, managed, or both | both |
Classic (standalone cluster) | yes |
Hosted control planes | yes |
Multi node, Compact (three node), or Single node (SNO), or all | all with Azure |
Connected / Restricted Network | all |
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) | all |
Operator compatibility | Azure File CSI |
Backport needed (list applicable versions) | |
UI need (e.g. OpenShift Console, dynamic plugin, OCM) | Already covered |
Other (please specify) |
Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.
As an OCP on Azure user I want to perform snapshots of my PVC and be able to restore them as a new PVC.
Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.
Are there any known issues? If so, they should be documented.
High-level list of items that are out of scope. Initial completion during Refinement status.
N/A
Provide any additional context needed to frame the feature. Initial completion during Refinement status.
We have snapshot support for other cloud providers' CSI drivers; we need to align Azure File CSI capabilities with them. Upstream support has lagged.
Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.
User experience should be the same as other CSI drivers.
Provide information that needs to be considered and planned so that documentation will meet customer needs. If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.
Add snapshot support in the CSI driver table, if there is any specific information to add, include it in the Azure File CSI driver doc. Any known issue should be documented in the RN.
Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.
Can be leveraged by ARO or OSD on Azure.
Epic Goal*
Add support for snapshots in Azure File.
Why is this important? (mandatory)
We should track upstream issues and ensure enablement in OpenShift. Snapshots are a standard CSI feature, and the reason we did not support them until now was the lack of upstream support for snapshot restoration.
Snapshot restore feature was added recently in upstream driver 1.30.3 which we rebased to in 4.17 - https://github.com/kubernetes-sigs/azurefile-csi-driver/pull/1904
Furthermore, we already include the azcopy CLI, which is a dependency of cloning (and snapshots). Enabling snapshots in 4.17 is therefore just a matter of adding a sidecar, a VolumeSnapshotClass, and RBAC in csi-operator, which is cheap compared to the gain.
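For reference, the VolumeSnapshotClass piece is standard external-snapshotter wiring; a minimal sketch with an illustrative name:
  apiVersion: snapshot.storage.k8s.io/v1
  kind: VolumeSnapshotClass
  metadata:
    name: azurefile-csi-snapclass        # illustrative name
  driver: file.csi.azure.com
  deletionPolicy: Delete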
However, we've observed a few issues with cloning that might need further fixes before it can graduate to GA, so we intend to release the cloning feature as Tech Preview in 4.17. Since snapshots are implemented with azcopy too, we expect similar issues and suggest releasing the snapshot feature as Tech Preview first in 4.17 as well.
Scenarios (mandatory)
Users should be able to create a snapshot and restore PVC from snapshots.
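A minimal example of that flow using the standard snapshot API; the PVC, class, and snapshot names are illustrative:
  apiVersion: snapshot.storage.k8s.io/v1
  kind: VolumeSnapshot
  metadata:
    name: azurefile-snap
  spec:
    volumeSnapshotClassName: azurefile-csi-snapclass   # class from the sketch above
    source:
      persistentVolumeClaimName: my-azurefile-pvc      # existing Azure File PVC
  ---
  apiVersion: v1
  kind: PersistentVolumeClaim
  metadata:
    name: my-azurefile-pvc-restored
  spec:
    storageClassName: azurefile-csi                    # illustrative storage class
    accessModes: ["ReadWriteMany"]
    resources:
      requests:
        storage: 100Gi
    dataSource:
      name: azurefile-snap
      kind: VolumeSnapshot
      apiGroup: snapshot.storage.k8s.io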
Dependencies (internal and external) (mandatory)
azcopy - already added in scope of cloning epic
upstream driver support for snapshot restore - already added via 4.17 rebase
Contributing Teams(and contacts) (mandatory)
Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.
Acceptance Criteria (optional)
Provide some (testable) examples of how we will know if we have achieved the epic goal.
Drawbacks or Risk (optional)
Reasons we should consider NOT doing this such as: limited audience for the feature, feature will be superseded by other work that is planned, resulting feature will introduce substantial administrative complexity or user confusion, etc.
Done - Checklist (mandatory)
The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.
An elevator pitch (value statement) that describes the Feature in a clear, concise way. Complete during New status.
Introduce snapshot support for Azure File as Tech Preview.
The observable functionality that the user now has as a result of receiving this feature. Include the anticipated primary user type/persona and which existing features, if any, will be expanded. Complete during New status.
After introducing cloning support in 4.17, the goal of this epic is to add the last remaining piece: snapshot support as Tech Preview.
A list of specific needs or objectives that a feature must deliver in order to be considered complete. Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc. Initial completion during Refinement status.
Should pass all the regular CSI snapshot tests. All failing or known issues should be documented in the release notes. Since this feature is TP we can still introduce it with known issues.
Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed. Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.
Deployment considerations | List applicable specific needs (N/A = not applicable) |
Self-managed, managed, or both | both |
Classic (standalone cluster) | yes |
Hosted control planes | yes |
Multi node, Compact (three node), or Single node (SNO), or all | all with Azure |
Connected / Restricted Network | all |
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) | all |
Operator compatibility | Azure File CSI |
Backport needed (list applicable versions) | |
UI need (e.g. OpenShift Console, dynamic plugin, OCM) | Already covered |
Other (please specify) |
Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.
As an OCP on Azure user I want to perform snapshots of my PVC and be able to restore them as a new PVC.
Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.
Are there any known issues? If so, they should be documented.
High-level list of items that are out of scope. Initial completion during Refinement status.
N/A
Provide any additional context needed to frame the feature. Initial completion during Refinement status.
We have snapshot support for other cloud providers' CSI drivers; we need to align Azure File CSI capabilities with them. Upstream support has lagged.
Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.
User experience should be the same as other CSI drivers.
Provide information that needs to be considered and planned so that documentation will meet customer needs. If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.
Add snapshot support in the CSI driver table, if there is any specific information to add, include it in the Azure File CSI driver doc. Any known issue should be documented in the RN.
Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.
Can be leveraged by ARO or OSD on Azure.
An elevator pitch (value statement) that describes the Feature in a clear, concise way. Complete during New status.
Develop tooling to support migrating CNS volumes between datastores in a safe way for Openshift users.
This tool relies on a new VMware CNS API and requires vSphere 8.0.2 or 7.0 Update 3o at minimum.
https://docs.vmware.com/en/VMware-vSphere/8.0/rn/vsphere-vcenter-server-802-release-notes/index.html
The observable functionality that the user now has as a result of receiving this feature. Include the anticipated primary user type/persona and which existing features, if any, will be expanded. Complete during New status.
Often our customers are looking to migrate volumes between datastores because they are running out of space in the current datastore or want to move to a more performant one. Previously this was almost impossible, or required modifying PV specs by hand, and it was very error prone.
As a first version, we develop a CLI tool that is shipped as part of the vSphere CSI operator. We keep this tooling internal for now; support can guide customers on a per-request basis. This is to handle current urgent customer requests: a CLI tool is easier and faster to develop, and it can also easily be used in previous OCP releases.
Ultimately we want to develop an operator that would take care of migrating CNS between datastores.
A list of specific needs or objectives that a feature must deliver in order to be considered complete. Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc. Initial completion during Refinement status.
Tool is able to take a list of volumes and migrate from one datastore to another. It also performs the necessary pre-flight tests to ensure that the volume is safe to migrate.
Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed. Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.
Deployment considerations | List applicable specific needs (N/A = not applicable) |
Self-managed, managed, or both | both |
Classic (standalone cluster) | Yes |
Hosted control planes | |
Multi node, Compact (three node), or Single node (SNO), or all | Yes |
Connected / Restricted Network | both |
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) | |
Operator compatibility | vsphere CSI operator |
Backport needed (list applicable versions) | no |
UI need (e.g. OpenShift Console, dynamic plugin, OCM) | no |
Other (please specify) | OCP on vsphere only |
Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.
As an admin, I want to migrate all my PVs, or optionally the PVCs belonging to a certain namespace, to a different datastore within the cluster without requiring extended downtime.
Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.
How to ship the binary?
Which versions of OCP can this tool support?
High-level list of items that are out of scope. Initial completion during Refinement status.
This feature tracks the implementation with a CLI binary. The operator implementation will be tracked by another Jira feature.
Provide any additional context needed to frame the feature. Initial completion during Refinement status.
We had a lot of requests to migrate volumes between datastores for multiple reasons. Up until now it was not natively supported by VMware. In 8.0.2 they added a CNS API and a vSphere UI feature to perform volume migration.
We want to avoid customers using the feature directly from the vSphere UI, so we have to develop a wrapper for them. It's easier to ship a CLI tool first to cover the current requests and then take some time to develop an official operator-led approach.
Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.
Given this tool is shipped as part of the vSphere CSI operator and requires extraction and careful manipulation, we are not going to document it publicly.
Provide information that needs to be considered and planned so that documentation will meet customer needs. If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.
Will be documented as an internal KCS
Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.
OCP on vSphere only
Epic Goal*
Develop tooling to support migrating CNS volumes between datastores in a safe way for Openshift users.
As a first version, we develop a CLI tool that is shipped as part of the vSphere CSI operator. We keep this tooling internal for now; support can guide customers on a per-request basis. This is to handle current urgent customer requests: a CLI tool is easier and faster to develop, and it can also easily be used in previous OCP releases.
Ultimately we want to develop an operator that would take care of migrating CNS between datastores.
Why is this important? (mandatory)
Often our customers are looking to migrate volumes between datastores because they are running out of space in the current datastore or want to move to a more performant one. Previously this was almost impossible, or required modifying PV specs by hand, and it was very error prone.
Scenarios (mandatory)
As an admin, I want to migrate all my PVs, or optionally the PVCs belonging to a certain namespace, to a different datastore within the cluster without requiring extended downtime.
Dependencies (internal and external) (mandatory)
What items must be delivered by other teams/groups to enable delivery of this epic.
Contributing Teams(and contacts) (mandatory)
Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.
Acceptance Criteria (optional)
Provide some (testable) examples of how we will know if we have achieved the epic goal.
Drawbacks or Risk (optional)
Reasons we should consider NOT doing this such as: limited audience for the feature, feature will be superseded by other work that is planned, resulting feature will introduce substantial administrative complexity or user confusion, etc.
Done - Checklist (mandatory)
The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.
Placeholder epic to capture all Azure tickets.
TODO: review.
As an end user of a hypershift cluster, I want to be able to:
so that I can achieve
From slack thread: https://redhat-external.slack.com/archives/C075PHEFZKQ/p1722615219974739
We need 4 different certs:
Add e2e tests to openshift/origin to test the improvement in integration between CoreDNS and EgressFirewall as proposed in the enhancement https://github.com/openshift/enhancements/pull/1335.
As the feature is currently targeted for Tech-Preview, the e2e tests should enable the feature set to test the feature.
The e2e test should create EgressFirewall with DNS rules after enabling Tech-Preview. The EgressFirewall rules should work correctly. E.g. https://github.com/openshift/origin/blob/master/test/extended/networking/egress_firewall.go
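A hedged sketch of the kind of object such an e2e test would create; the namespace and dnsName are placeholders:
  apiVersion: k8s.ovn.org/v1
  kind: EgressFirewall
  metadata:
    name: default                        # EgressFirewall objects must be named "default"
    namespace: egressfirewall-e2e        # illustrative test namespace
  spec:
    egress:
    - type: Allow
      to:
        dnsName: www.example.com         # DNS rule exercising the CoreDNS/EgressFirewall integration
    - type: Deny
      to:
        cidrSelector: 0.0.0.0/0          # deny everything else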
Goal
As an OpenShift installer I want to update the firmware of the hosts I use for OpenShift on day 1 and day 2.
As an OpenShift installer I want to integrate the firmware update in the ZTP workflow.
Description
The firmware updates are required in BIOS, GPUs, NICs, DPUs, on hosts that will often be used as DUs in Edge locations (commonly installed with ZTP).
Acceptance criteria
Out of Scope
Description of problem:
After running a firmware update the new version is not displayed in the status of the HostFirmwareComponents
Version-Release number of selected component (if applicable):
How reproducible:
100%
Steps to Reproduce:
1. Execute a firmware update; after it succeeds, check the Status to find the information about the new version installed.
Actual results:
The Status only shows the initial information about the firmware components.
Expected results:
The Status should show the updated information about the firmware components.
Additional info:
When executing a firmware update for a BMH, there is a problem updating the Status of the HostFirmwareComponents CRD, causing the BMH to repeat the update multiple times since it stays in the Preparing state.
As a customer of self-managed OpenShift, or an SRE managing a fleet of OpenShift clusters, I should be able to determine the progress and state of an OCP upgrade and only be alerted if the cluster is unable to progress. Support a CLI status command and a status API which a cluster-admin can use to monitor progress. The status command/API should also contain data to alert users about potential issues which can make updates problematic.
Here are common update improvements from customer interactions on Update experience
oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.12.0    True        True          16s     Working towards 4.12.4: 9 of 829 done (1% complete)
Update docs for UX and CLI changes
Reference : https://docs.google.com/presentation/d/1cwxN30uno_9O8RayvcIAe8Owlds-5Wjr970sDxn7hPg/edit#slide=id.g2a2b8de8edb_0_22
Epic Goal*
Add a new `oc adm upgrade status` command which is backed by an API. Please find the mock output of the command attached in this card.
Why is this important? (mandatory)
Scenarios (mandatory)
Provide details for user scenarios including actions to be performed, platform specifications, and user personas.
Dependencies (internal and external) (mandatory)
What items must be delivered by other teams/groups to enable delivery of this epic.
Contributing Teams(and contacts) (mandatory)
Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.
Acceptance Criteria (optional)
Provide some (testable) examples of how we will know if we have achieved the epic goal.
Drawbacks or Risk (optional)
Reasons we should consider NOT doing this such as: limited audience for the feature, feature will be superseded by other work that is planned, resulting feature will introduce substantial administrative complexity or user confusion, etc.
Done - Checklist (mandatory)
The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.
Description of problem:
The cluster version is not updating (Progressing=False). Reason: <none> Message: Cluster version is 4.16.0-0.nightly-2024-05-08-222442
When the cluster is not in the middle of an update, the command shows Failing=True condition content, which is potentially confusing. I think we can just show "The cluster version is not updating".
Description of problem:
The newly available TP upgrade status command has a formatting issue when expanding update health using the --details flag: a plural "s:<resource>" is displayed. According to the developer, the "s" is supposed to be appended to group.kind, but only the plural suffix itself is displayed instead, e.g. "Resources: s: version".
Version-Release number of selected component (if applicable):
4.16.0-0.nightly-2024-05-08-222442
How reproducible:
100%
Steps to Reproduce:
Run `oc adm upgrade status --details=all` while there is any health issue with the cluster.
Actual results:
Resources: s: ip-10-0-76-83.us-east-2.compute.internal Description: Node is unavailable
Resources: s: version Description: Cluster operator control-plane-machine-set is not available
Resources: s: ip-10-0-58-8.us-east-2.compute.internal Description: failed to set annotations on node: unable to update node "&Node{ObjectMeta:{ 0 0001-01-01 00:00:00 +0000 UTC <nil> <nil> map[] map[] [] [] []},Spec:NodeSpec{PodCIDR:,DoNotUseExternalID:,ProviderID:,Unschedulable:false,Taints:[]Taint{},ConfigSource:nil,PodCIDRs:[],},Status:NodeStatus{Capacity:ResourceList{},Allocatable:ResourceList{},Phase:,Conditions:[]NodeCondition{},Addresses:[]NodeAddress{},DaemonEndpoints:NodeDaemonEndpoints{KubeletEndpoint:DaemonEndpoint{Port:0,},},NodeInfo:NodeSystemInfo{MachineID:,SystemUUID:,BootID:,KernelVersion:,OSImage:,ContainerRuntimeVersion:,KubeletVersion:,KubeProxyVersion:,OperatingSystem:,Architecture:,},Images:[]ContainerImage{},VolumesInUse:[],VolumesAttached:[]AttachedVolume{},Config:nil,},}": Patch "https://api-int.evakhoni-1514.qe.devcluster.openshift.com:6443/api/v1/nodes/ip-10-0-58-8.us-east-2.compute.internal": read tcp 10.0.58.8:48328->10.0.27.41:6443: read: connection reset by peer
Expected results:
It should presumably mention the correct <group.kind>s:<resource>.
Additional info:
OTA-1246
Slack thread
Using the alerts-in-CLI PoC from OTA-1080, show relevant firing alerts in the OTA-1087 section. Probably do not show all firing alerts.
I propose showing
Impact can probably be a simple alertname -> impact type classifier. Message can be "Alert name: Alert message":
=Update Health=
SINCE   LEVEL     IMPACT             MESSAGE
3h      Warning   API Availability   KubeDaemonSetRolloutStuck: DaemonSet openshift-ingress-canary/ingress-canary has not finished or progressed for at least 30 minutes.
Support deploying an OpenShift cluster across multiple vSphere clusters, i.e. configuring multiple vCenter servers in one OpenShift cluster.
Multiple vCenter support in the Cloud Provider Interface (CPI) and the Cloud Storage Interface (CSI).
Customers want to deploy OpenShift across multiple vSphere clusters (vCenters) primarily for high availability.
This section contains all the test cases that we need to make sure work as part of the done^3 criteria.
This section contains all scenarios that are considered out of scope for this enhancement that will be done via a separate epic / feature / story.
As an OpenShift administrator, I would like the vSphere CSI Driver Operator to not become degraded due to the vSphere Multi vCenter feature gate being enabled, so that I can begin to install my cluster across multiple vCenters and create PVs.
The purpose of this story is to perform the needed changes to get the vSphere CSI Driver Operator allowing the configuration of the new Feature Gate for vSphere Multi vCenter support. By default, the operator will still only allow one vCenter definition and support that config; however, once the feature gate for vSphere Multi vCenter is enabled, we will allow more than one vCenter. Initially, the plan is to only allow a max of 3 vCenter definitions, which will be controlled via the CRD for the vSphere infrastructure definitions.
The vSphere CSI Driver Operator after install must not fail due to the number of vCenters configured. The operator will also need to allow the creation of PVs. Any other failure reported based on issues performing operator tasks is valid and should be addressed via a new story.
As an OpenShift administrator, I would like the Machine API Operator (MAO) to not become degraded due to the vSphere Multi vCenter feature gate being enabled, so that I can begin to install my cluster across multiple vCenters.
The purpose of this story is to perform the needed changes to get MAO allowing the configuration of the new Feature Gate for vSphere Multi vCenter support. By default, the operator will still only allow one vCenter definition and support that config; however, once the feature gate for vSphere Multi vCenter is enabled, we will allow more than one vCenter. Initially, the plan is to only allow a max of 3 vCenter definitions, which will be controlled via the CRD for the vSphere infrastructure definitions. Also, this operator will need to be enhanced to handle the new YAML-format cloud config.
The vSphere CSI Driver Operator after install must not fail due to the number of vCenters configured. The operator will also need to allow the creation of PVs. Any other failure reported based on issues performing operator tasks is valid and should be addressed via a new story.
As a cluster administrator, I would like to enhance the CAPI installer to support multiple vCenters for installation of the cluster, so that I can spread my cluster across several vCenters.
The purpose of this story is to enhance the installer to support multiple vCenters. Today OCP only allows the use of one vCenter to install the cluster into. With the development of this feature, cluster admins will be able to configure multiple vCenters via the install-config and allow creation of VMs in all specified vCenter instances. Failure Domains will encapsulate the vCenter definitions (see the illustrative install-config sketch at the end of this story).
This will require changes in the API to provide the new feature gate. Once the feature gate is created, the installer can be enhanced to leverage this new feature gate to allow the user to install the VMs of the cluster across multiple vCenters.
We will need to verify how we are handling unit testing CAPI in the installer. The unit tests should cover the cases of checking for the new FeatureGate.
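A hedged sketch of what a multi-vCenter install-config could look like, assuming the capability ships behind the TechPreviewNoUpgrade feature set; server names, credentials, datacenters, and failure-domain values are placeholders, not taken from this document:
featureSet: TechPreviewNoUpgrade
platform:
  vsphere:
    vcenters:
    - server: vcenter1.example.com
      user: administrator@vsphere.local
      password: <password>
      datacenters:
      - dc1
    - server: vcenter2.example.com
      user: administrator@vsphere.local
      password: <password>
      datacenters:
      - dc2
    failureDomains:
    - name: fd-vc1
      region: region-1
      zone: zone-1
      server: vcenter1.example.com
      topology:
        datacenter: dc1
        computeCluster: /dc1/host/cluster1
        datastore: /dc1/datastore/datastore1
        networks:
        - vm-network-1
    - name: fd-vc2
      region: region-2
      zone: zone-2
      server: vcenter2.example.com
      topology:
        datacenter: dc2
        computeCluster: /dc2/host/cluster2
        datastore: /dc2/datastore/datastore1
        networks:
        - vm-network-2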
As an OpenShift administrator, I would like MCO to not become degraded due to the vSphere Multi vCenter feature gate being enabled, so that I can begin to install my cluster across multiple vCenters.
The purpose of this story is to perform the needed changes to get MCO allowing the configuration of the new Feature Gate for vSphere Multi vCenter support. There will be other stories created to track the functional improvements of MCO. By default, the operator will still only allow one vCenter definition; however, once the feature gate for vSphere Multi vCenter is enabled, we will allow more than one vCenter. Initially, the plan is to only allow a max of 3 vCenter definitions, which will be controlled via the CRD for the vSphere infrastructure definitions.
The MCO after install must not fail due to the number of vCenters configured. Any other failure reported based on issues performing operator tasks is valid and should be addressed via a new story.
We will need to enhance all logic that hard-codes the vCenter count to check whether the vSphere Multi vCenter feature gate is enabled. If it is enabled, the vCenter count may be larger than 1; otherwise it must still fail with the error message that the vCenter count may not be greater than 1.
As an OpenShift administrator, I would like CSO to not become degraded due to the vSphere Multi vCenter feature gate being enabled, so that I can begin to install my cluster across multiple vCenters.
The purpose of this story is to perform the needed changes to get CSO allowing the configuration of the new Feature Gate for vSphere Multi vCenter support. There will be other stories created to track the functional improvements of CSO. By default, the operator will still only allow one vCenter definition; however, once the feature gate for vSphere Multi vCenter is enabled, we will allow more than one vCenter. Initially, the plan is to only allow a max of 3 vCenter definitions, which will be controlled via the CRD for the vSphere infrastructure definitions.
The CSO after install must not fail due to the number of vCenters configured. Any other failure reported based on issues performing operator tasks is valid and should be addressed via a new story.
We will need to enhance all logic that hard-codes the vCenter count to check whether the vSphere Multi vCenter feature gate is enabled. If it is enabled, the vCenter count may be larger than 1; otherwise it must still fail with the error message that the vCenter count may not be greater than 1.
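A quick, hedged way to confirm on a live cluster which feature set is active and whether the multi-vCenter gate is listed; the exact gate name (e.g. VSphereMultiVCenters) is an assumption about the openshift/api naming:
$ oc get featuregate cluster -o jsonpath='{.spec.featureSet}'
$ oc get featuregate cluster -o yaml | grep -i vcenter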
As an OpenShift administrator, I would like the vSphere Problem Detector (VPD) to not log error messages related to the new YAML config, so that I can begin to install my cluster across multiple vCenters, create PVs, and have VPD verify all vCenters and their configs.
The purpose of this story is to perform the needed changes to get the vSphere Problem Detector (VPD) allowing the configuration of the new Feature Gate for vSphere Multi vCenter support. This involves a new YAML config that needs to be supported. Also, we need to make sure the VPD checks all vCenters / failure domains for any and all checks that it performs.
The VPD after install must not fail due to the number of vCenters configured. The VPD may be logging error messages that are not causing the storage operator to become degraded. We should verify the logs and make sure all vCenters / FDs are checked as we expect.
Add authentication to the internal components of the Agent Installer so that the cluster install is secure.
Requirements
Are there any requirements specific to the auth token?
Actors:
Do we need more than one auth scheme?
Agent-admin - agent-read-write
Agent-user - agent-read
Options for Implementation:
Once the new auth type is implemented, update assisted-service-env.template from AUTH_TYPE:none to AUTH_TYPE: agent-installer-local
Read the generated local JWT token and ECDSA public key from the asset store and pass them to the newly implemented auth type in assisted service to perform authentication in assisted service. Use separate auth headers for the API requests in the places where we make curl requests from systemd services.
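A hedged sketch of what such an authenticated curl from a systemd unit might look like; the rendezvous host address, port, token variable, and header format are assumptions, not the final implementation:
$ curl -s -H "Authorization: ${AGENT_AUTH_TOKEN}" \
    "http://<rendezvous-host>:8090/api/assisted-install/v2/infra-envs"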
As a user using agent installer on day2 to add a new node to the cluster, I want to be able to:
so that I can achieve
Description of criteria:
Detail about what is specifically not being delivered in the story
This requires/does not require a design proposal.
This requires/does not require a feature gate.
As a user, I want to be able to:
so that I can achieve
Description of criteria:
Detail about what is specifically not being delivered in the story
This requires/does not require a design proposal.
This requires/does not require a feature gate.
When running `./openshift-install agent wait-for bootstrap-complete` and `./openshift-install agent wait-for install-complete`, read the generated local JWT token and ECDSA public key from the asset store and pass them to the newly implemented auth type in assisted service to perform authentication in assisted service.
As an ABI user responsible for day-2 operations, I want to be able to:
so that I can
A new systemd service will be introduced to check and display the status of the authentication token—whether it is valid or expired. This service will run immediately after the agent-interactive-console systemd service. If the authentication token is expired, cluster installation or adding new nodes will be halted until a new node ISO is generated.
Description of criteria:
Detail about what is specifically not being delivered in the story
This requires/does not require a design proposal.
This requires/does not require a feature gate.
Read the generated local JWT token and ECDSA public key from the asset store and pass them to the newly implemented auth type in assisted service to perform authentication in assisted service. Use separate auth headers for agent API requests, similar to the wait-for commands and internal systemd services.
As a user, I want to be able to
so that I can achieve
Description of criteria:
Detail about what is specifically not being delivered in the story
This requires/does not require a design proposal.
This requires/does not require a feature gate.
In the initial delivery of CoreOS Layering, it is required that administrators provide their own build environment to customize RHCOS images. That could be a traditional RHEL environment or potentially an enterprising administrator with some knowledge of OCP Builds could set theirs up on-cluster.
The primary virtue of an on-cluster build path is to continue using the cluster to manage the cluster. No external dependency, batteries-included.
On-cluster, automated RHCOS Layering builds are important for multiple reasons:
Create a GCP cloud-specific spec.resourceTags entry in the infrastructure CRD. This should create and update tags (or labels in GCP) on any OpenShift cloud resource that we create and manage. The behaviour should also tag existing resources that do not have the tags yet, and once the tags in the infrastructure CRD are changed, all the resources should be updated accordingly.
Tag deletes continue to be out of scope, as the customer can still have custom tags applied to the resources that we do not want to delete.
Due to the ongoing intree/out of tree split on the cloud and CSI providers, this should not apply to clusters with intree providers (!= "external").
Once confident we have all components updated, we should introduce an end2end test that makes sure we never create resources that are untagged.
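For context, a hedged sketch of how user-defined labels and tags can be supplied at install time and then surfaced to operators through the Infrastructure object; the values below are placeholders and the exact schema should be verified against the install-config reference for the target release:
platform:
  gcp:
    projectID: <project-id>
    region: us-central1
    userLabels:
    - key: team
      value: example-team
    userTags:
    - parentID: "<org-or-project-id>"
      key: environment
      value: test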
Goals
Requirement | Notes | isMvp? |
---|---|---|
CI - MUST be running successfully with test automation | This is a requirement for ALL features. | YES |
Release Technical Enablement | Provide necessary release enablement details and documents. | YES |
List any affected packages or components.
Dependent on https://issues.redhat.com/browse/CFE-918. Once the driver is updated to support tags, the operator should be extended to pass the user-defined tags found in Infrastructure as an arg to the driver.
https://issues.redhat.com/browse/CFE-918 is for enabling tag functionality in the driver. The driver will have the provision to accept user-defined tags, to be added to the resources it manages, as process args; the operator should read the user-defined tags found in the Infrastructure object and pass them as CSV to the driver.
Acceptance Criteria
The TechPreview featureSet check added in the machine-api-provider-gcp operator for userLabels and userTags should be removed.
The new featureGate added in openshift/api should also be removed.
Acceptance Criteria
The installer would validate the existence of tags and fail the installation if the tags defined are not present. But if a tag processed by the installer is removed later, an operator referencing these tags through Infrastructure would fail.
Enhance checks to distinguish between non-existent tags and insufficient-permissions errors, as GCP doesn't differentiate between them.
Epic Goal*
GCP Filestore instances are not automatically deleted when the cluster is destroyed.
Why is this important? (mandatory)
The need to delete GCP Filestore instances manually is documented. This is, however, inconsistent with other storage resources (GCP PD), which get removed automatically, and it may also lead to resource leaks.
Scenarios (mandatory)
Provide details for user scenarios including actions to be performed, platform specifications, and user personas.
Dependencies (internal and external) (mandatory)
This requires changes in the GCP Filestore Operator, GCP Filestore Driver and the OpenShift Installer
Contributing Teams(and contacts) (mandatory)
Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.
Acceptance Criteria (optional)
Provide some (testable) examples of how we will know if we have achieved the epic goal.
Drawbacks or Risk (optional)
Reasons we should consider NOT doing this such as: limited audience for the feature, feature will be superseded by other work that is planned, resulting feature will introduce substantial administrative complexity or user confusion, etc.
Done - Checklist (mandatory)
The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.
As a GCP Filestore user, I would like all the resources belonging to the cluster to be automatically deleted upon cluster destruction. This currently only works for GCP PD volumes but has to be done manually for GCP Filestore ones:
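As an illustration only, a hedged sketch of the manual cleanup with gcloud; any label filter used to identify cluster-owned instances is an assumption, and flag names may differ across gcloud versions:
$ gcloud filestore instances list --project <project> --format='table(name,labels)'
$ gcloud filestore instances delete <instance-name> --project <project> --zone <zone> --quiet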
Exit criteria:
We need to ensure we have parity with OCP and support heterogeneous clusters
https://github.com/openshift/enhancements/pull/1014
Using a multi-arch Node requires the HC to be multi-arch as well. This is a good recipe for letting users shoot themselves in the foot. We need to automate the required input via the CLI for multi-arch NodePools to work, e.g. enabling a multi-arch flag on HC creation which sets the right release image (see the illustrative sketch below).
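As an illustration of the intended UX only; the flag names below are assumptions, not a confirmed CLI surface:
$ hcp create cluster aws --multi-arch --release-image <multi-arch-payload-pullspec> ...
$ hcp create nodepool aws --cluster-name <cluster> --name <nodepool> --arch arm64 ...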
Acceptance Criteria:
Description of criteria:
Detail about what is specifically not being delivered in the story
This requires/does not require a design proposal.
This requires/does not require a feature gate.
As a HyperShift/HCP CLI user, I want:
so that
Description of criteria:
N/A
This requires/does not require a design proposal.
This requires/does not require a feature gate.
As a result of Hashicorp's license change to BSL, Red Hat OpenShift needs to remove the use of Hashicorp's Terraform from the installer – specifically for IPI deployments which currently use Terraform for setting up the infrastructure.
To avoid an increased support overhead once the license changes at the end of the year, we want to provision Azure infrastructure without the use of Terraform.
Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.
Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.
High-level list of items that are out of scope. Initial completion during Refinement status.
Provide any additional context is needed to frame the feature. Initial completion during Refinement status.
Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.
Provide information that needs to be considered and planned so that documentation will meet customer needs. If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.
Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.
OCP/Telco Definition of Done
Epic Template descriptions and documentation.
<--- Cut-n-Paste the entire contents of this description into your new Epic --->
As an administrator, I want to be able to:
Description of criteria:
Storage account should be encrypted with installconfig.platform.azure.CustomerManagedKey
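A hedged sketch of what that install-config stanza might look like; only the top-level customerManagedKey field is taken from the criterion above, and the nested field names are assumptions to be checked against the actual schema:
platform:
  azure:
    customerManagedKey:
      keyVault:
        name: <key-vault-name>
        keyName: <key-name>
        resourceGroup: <key-vault-resource-group>
      userAssignedIdentityKey: <user-assigned-identity>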
Description of problem:
Launched CAPI based installation on azure platform, the default HyperVGeneration on each master node is V1, the expected value should be V2 if instance type supports HyperVGeneration V2. $ az vm get-instance-view --name jimadisk01-xphq8-master-0 -g jimadisk01-xphq8-rg --query 'instanceView.hyperVGeneration' -otsv V1 Also, if setting instance type to Standard_DC4ds_v3 that only supports HyperVGeneration V2, install-config: ======================== controlPlane: architecture: amd64 hyperthreading: Enabled name: master platform: azure: type: Standard_DC4ds_v3 continued to create cluster, installer failed and was timeout when waiting for machine provision. INFO Waiting up to 15m0s (until 6:46AM UTC) for machines [jimadisk-nmkzj-bootstrap jimadisk-nmkzj-master-0 jimadisk-nmkzj-master-1 jimadisk-nmkzj-master-2] to provision... ERROR failed to fetch Cluster: failed to generate asset "Cluster": failed to create cluster: control-plane machines were not provisioned within 15m0s: client rate limiter Wait returned an error: context deadline exceeded INFO Shutting down local Cluster API control plane... INFO Stopped controller: Cluster API WARNING process cluster-api-provider-azure exited with error: signal: killed INFO Stopped controller: azure infrastructure provider INFO Stopped controller: azureaso infrastructure provider In openshift-install.log, got below error: time="2024-06-25T06:42:57Z" level=debug msg="I0625 06:42:57.090269 1377336 recorder.go:104] \"failed to reconcile AzureMachine: failed to reconcile AzureMachine service virtualmachine: failed to create or update resource jimadisk-nmkzj-rg/jimadisk-nmkzj-master-2 (service: virtualmachine): PUT https://management.azure.com/subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/jimadisk-nmkzj-rg/providers/Microsoft.Compute/virtualMachines/jimadisk-nmkzj-master-2\\n--------------------------------------------------------------------------------\\nRESPONSE 400: 400 Bad Request\\nERROR CODE: BadRequest\\n--------------------------------------------------------------------------------\\n{\\n \\\"error\\\": {\\n \\\"code\\\": \\\"BadRequest\\\",\\n \\\"message\\\": \\\"The selected VM size 'Standard_DC4ds_v3' cannot boot Hypervisor Generation '1'. If this was a Create operation please check that the Hypervisor Generation of the Image matches the Hypervisor Generation of the selected VM Size. If this was an Update operation please select a Hypervisor Generation '1' VM Size. 
For more information, see https://aka.ms/azuregen2vm\\\"\\n }\\n}\\n--------------------------------------------------------------------------------\\n\" logger=\"events\" type=\"Warning\" object={\"kind\":\"AzureMachine\",\"namespace\":\"openshift-cluster-api-guests\",\"name\":\"jimadisk-nmkzj-master-2\",\"uid\":\"c2cdabed-e19a-4e88-96d9-3f3026910403\",\"apiVersion\":\"infrastructure.cluster.x-k8s.io/v1beta1\",\"resourceVersion\":\"1600\"} reason=\"ReconcileError\"" time="2024-06-25T06:42:57Z" level=debug msg="E0625 06:42:57.090701 1377336 controller.go:329] \"Reconciler error\" err=<" time="2024-06-25T06:42:57Z" level=debug msg="\tfailed to reconcile AzureMachine: failed to reconcile AzureMachine service virtualmachine: failed to create or update resource jimadisk-nmkzj-rg/jimadisk-nmkzj-master-2 (service: virtualmachine): PUT https://management.azure.com/subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/jimadisk-nmkzj-rg/providers/Microsoft.Compute/virtualMachines/jimadisk-nmkzj-master-2" time="2024-06-25T06:42:57Z" level=debug msg="\t--------------------------------------------------------------------------------" time="2024-06-25T06:42:57Z" level=debug msg="\tRESPONSE 400: 400 Bad Request" time="2024-06-25T06:42:57Z" level=debug msg="\tERROR CODE: BadRequest" time="2024-06-25T06:42:57Z" level=debug msg="\t--------------------------------------------------------------------------------" time="2024-06-25T06:42:57Z" level=debug msg="\t{" time="2024-06-25T06:42:57Z" level=debug msg="\t \"error\": {" time="2024-06-25T06:42:57Z" level=debug msg="\t \"code\": \"BadRequest\"," time="2024-06-25T06:42:57Z" level=debug msg="\t \"message\": \"The selected VM size 'Standard_DC4ds_v3' cannot boot Hypervisor Generation '1'. If this was a Create operation please check that the Hypervisor Generation of the Image matches the Hypervisor Generation of the selected VM Size. If this was an Update operation please select a Hypervisor Generation '1' VM Size. For more information, see https://aka.ms/azuregen2vm\"" time="2024-06-25T06:42:57Z" level=debug msg="\t }" time="2024-06-25T06:42:57Z" level=debug msg="\t}" time="2024-06-25T06:42:57Z" level=debug msg="\t--------------------------------------------------------------------------------"
Version-Release number of selected component (if applicable):
4.17.0-0.nightly-2024-06-23-145410
How reproducible:
Always
Steps to Reproduce:
1. Set the instance type to Standard_DC4ds_v3 (which only supports HyperVGeneration V2), or leave the instance type unset in install-config
2. Launch the installation
Actual results:
1. Without an instance type setting, the default HyperVGeneration on each master instance is V1
2. Creation of master instances fails with instance type Standard_DC4ds_v3
Expected results:
1. Without an instance type setting, the default HyperVGeneration on each master instance is V2
2. Cluster creation succeeds with instance type Standard_DC4ds_v3
Additional info:
Remove vendored terraform-provider-azure and not all the terraform code for Azure installs.
As a (user persona), I want to be able to:
so that I can achieve
Description of criteria:
Detail about what is specifically not being delivered in the story
This requires/does not require a design proposal.
This requires/does not require a feature gate.
Description of problem:
Enable diskEncryptionSet under defaultMachinePlatform in install-config: ============= platform: azure: defaultMachinePlatform: encryptionAtHost: true osDisk: diskEncryptionSet: resourceGroup: jimades01-rg name: jimades01-des subscriptionId: 53b8f551-f0fc-4bea-8cba-6d1fefd54c8a Created cluster, checked diskEncryptionSet on each master instance's osDisk, all of them are empty. $ az vm list -g jimades01-8ktkn-rg --query '[].[name, storageProfile.osDisk.managedDisk.diskEncryptionSet]' -otable Column1 Column2 ------------------------------------ --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- jimades01-8ktkn-master-0 jimades01-8ktkn-master-1 jimades01-8ktkn-master-2 jimades01-8ktkn-worker-eastus1-9m8p5 {'id': '/subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/jimades01-rg/providers/Microsoft.Compute/diskEncryptionSets/jimades01-des', 'resourceGroup': 'jimades01-rg'} jimades01-8ktkn-worker-eastus2-cmcn7 {'id': '/subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/jimades01-rg/providers/Microsoft.Compute/diskEncryptionSets/jimades01-des', 'resourceGroup': 'jimades01-rg'} jimades01-8ktkn-worker-eastus3-nknss {'id': '/subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/jimades01-rg/providers/Microsoft.Compute/diskEncryptionSets/jimades01-des', 'resourceGroup': 'jimades01-rg'} same situation when setting diskEncryptionSet under controlPlane in install-config, no des setting in cluster api manifests 10_inframachine_jima24c-2cmlf_*.yaml. $ yq-go r 10_inframachine_jima24c-2cmlf-bootstrap.yaml 'spec.osDisk' cachingType: ReadWrite diskSizeGB: 1024 managedDisk: storageAccountType: Premium_LRS osType: Linux $ yq-go r 10_inframachine_jima24c-2cmlf-master-0.yaml 'spec.osDisk' cachingType: ReadWrite diskSizeGB: 1024 managedDisk: storageAccountType: Premium_LRS osType: Linux
Version-Release number of selected component (if applicable):
4.17.0-0.nightly-2024-06-23-145410
How reproducible:
Always
Steps to Reproduce:
1. Configure disk encryption set under controlPlane or defaultMachinePlatform in install-config
2. Create cluster
Actual results:
DES does not take effect on master instances
Expected results:
DES should be configured on all master instances
Additional info:
As a (user persona), I want to be able to:
so that I can achieve
Description of criteria:
Detail about what is specifically not being delivered in the story
This requires/does not require a design proposal.
This requires/does not require a feature gate.
We see that the MachinePool feature gate has become default=true in a recent version of CAPZ. See https://github.com/openshift/installer/pull/8627#issuecomment-2178061050 for more context.
We should probably disable this feature gate. Here's an example of disabling a feature gate using a flag for the aws controller:
https://github.com/openshift/installer/blob/master/pkg/clusterapi/system.go#L153
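The resulting provider invocation would then carry the gate explicitly; the same --feature-gates=MachinePool=false argument can be seen in the installer process listings quoted later in this document:
$ cluster-api-provider-azure -v=2 --feature-gates=MachinePool=false <remaining controller args>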
Existing VNets can be specified in the cluster spec. Ensure that they are still tagged appropriately, which is handled in the PreTerraform code: https://github.com/openshift/installer/blob/master/pkg/asset/cluster/cluster.go#L111
Description of problem:
created Azure IPI cluster by using CAPI, interrupted the installer when running at the stage of waiting for bootstrapping to complete, then ran command "openshift-installer gather bootstrap --dir <install_dir>" to gather bootstrap log. $ ./openshift-install gather bootstrap --dir ipi --log-level debug DEBUG OpenShift Installer 4.17.0-0.test-2024-07-25-014817-ci-ln-rcc2djt-latest DEBUG Built from commit 91618bc6507416492d685c11540efb9ae9a0ec2e ... DEBUG Looking for machine manifests in ipi/.clusterapi_output DEBUG bootstrap manifests found: [ipi/.clusterapi_output/Machine-openshift-cluster-api-guests-jima25-m-4sq6j-bootstrap.yaml] DEBUG found bootstrap address: 10.0.0.7 DEBUG master machine manifests found: [ipi/.clusterapi_output/Machine-openshift-cluster-api-guests-jima25-m-4sq6j-master-0.yaml ipi/.clusterapi_output/Machine-openshift-cluster-api-guests-jima25-m-4sq6j-master-1.yaml ipi/.clusterapi_output/Machine-openshift-cluster-api-guests-jima25-m-4sq6j-master-2.yaml] DEBUG found master address: 10.0.0.4 DEBUG found master address: 10.0.0.5 DEBUG found master address: 10.0.0.6 ... DEBUG Added /home/fedora/.ssh/openshift-qe.pem to installer's internal agent DEBUG Added /home/fedora/.ssh/id_rsa to installer's internal agent DEBUG Added /home/fedora/.ssh/openshift-dev.pem to installer's internal agent DEBUG Added /tmp/bootstrap-ssh2769549403 to installer's internal agent INFO Failed to gather bootstrap logs: failed to connect to the bootstrap machine: dial tcp 10.0.0.7:22: connect: connection timed out ... Checked Machine-openshift-cluster-api-guests-jima25-m-4sq6j-bootstrap.yaml under capi artifact folder, only private IP is there. $ yq-go r Machine-openshift-cluster-api-guests-jima25-m-4sq6j-bootstrap.yaml status.addresses - type: InternalDNS address: jima25-m-4sq6j-bootstrap - type: InternalIP address: 10.0.0.7 From https://github.com/openshift/installer/pull/8669/, it creates an inbound nat rule that forwards port 22 on the public load balancer to the bootstrap host instead of creating public IP directly for bootstrap, and I tried and it was succeeded to ssh login bootstrap server by using frontend IP of public load balancer. But as no public IP saved in bootstrap machine CAPI artifact, installer failed to connect bootstrap machine with private ip.
Version-Release number of selected component (if applicable):
4.17 nightly build
How reproducible:
Always
Steps to Reproduce:
1. Create Azure IPI cluster by using CAPI
2. Interrupt installer when waiting for bootstrap complete
3. Gather bootstrap logs
Actual results:
Only serial console logs and local CAPI artifacts are collected; logs from the bootstrap and control plane nodes fail to be collected because the SSH connection to the bootstrap node times out.
Expected results:
Bootstrap logs are gathered successfully.
Additional info:
Install with marketplace images specified in the install config.
Attach identities to VMs so that service principals are not placed on the VMs. CAPZ is issuing warnings about this when creating VMs.
Add configuration to support SSH'ing to bootstrap node.
Description of problem:
When creating cluster with service principal certificate, as known issues OCPBUGS-36360, installer exited with error. # ./openshift-install create cluster --dir ipi6 INFO Credentials loaded from file "/root/.azure/osServicePrincipal.json" WARNING Using client certs to authenticate. Please be warned cluster does not support certs and only the installer does. INFO Consuming Install Config from target directory WARNING FeatureSet "CustomNoUpgrade" is enabled. This FeatureSet does not allow upgrades and may affect the supportability of the cluster. INFO Creating infrastructure resources... INFO Started local control plane with envtest INFO Stored kubeconfig for envtest in: /tmp/jima/ipi6/.clusterapi_output/envtest.kubeconfig WARNING Using client certs to authenticate. Please be warned cluster does not support certs and only the installer does. INFO Running process: Cluster API with args [-v=2 --diagnostics-address=0 --health-addr=127.0.0.1:36847 --webhook-port=38905 --webhook-cert-dir=/tmp/envtest-serving-certs-941163289 --kubeconfig=/tmp/jima/ipi6/.clusterapi_output/envtest.kubeconfig] INFO Running process: azure infrastructure provider with args [-v=2 --health-addr=127.0.0.1:44743 --webhook-port=35373 --webhook-cert-dir=/tmp/envtest-serving-certs-3807817663 --feature-gates=MachinePool=false --kubeconfig=/tmp/jima/ipi6/.clusterapi_output/envtest.kubeconfig] INFO Running process: azureaso infrastructure provider with args [-v=0 -metrics-addr=0 -health-addr=127.0.0.1:45179 -webhook-port=37401 -webhook-cert-dir=/tmp/envtest-serving-certs-1364466879 -crd-pattern= -crd-management=none] ERROR failed to fetch Cluster: failed to generate asset "Cluster": failed to create cluster: failed to run cluster api system: failed to run controller "azureaso infrastructure provider": failed to start controller "azureaso infrastructure provider": timeout waiting for process cluster-api-provider-azureaso to start successfully (it may have failed to start, or stopped unexpectedly before becoming ready) INFO Shutting down local Cluster API control plane... INFO Local Cluster API system has completed operations From output, local cluster API system is shut down. But when checking processes, only parent process installer exit, CAPI related processes are still running. 
When local control plane is running: # ps -ef|grep cluster | grep -v grep root 13355 6900 39 08:07 pts/1 00:00:13 ./openshift-install create cluster --dir ipi6 root 13365 13355 2 08:08 pts/1 00:00:00 ipi6/cluster-api/etcd --advertise-client-urls=http://127.0.0.1:41341 --data-dir=ipi6/.clusterapi_output/etcd --listen-client-urls=http://127.0.0.1:41341 --listen-peer-urls=http://127.0.0.1:34081 --unsafe-no-fsync=true root 13373 13355 55 08:08 pts/1 00:00:10 ipi6/cluster-api/kube-apiserver --allow-privileged=true --authorization-mode=RBAC --bind-address=127.0.0.1 --cert-dir=/tmp/k8s_test_framework_50606349 --client-ca-file=/tmp/k8s_test_framework_50606349/client-cert-auth-ca.crt --disable-admission-plugins=ServiceAccount --etcd-servers=http://127.0.0.1:41341 --secure-port=38483 --service-account-issuer=https://127.0.0.1:38483/ --service-account-key-file=/tmp/k8s_test_framework_50606349/sa-signer.crt --service-account-signing-key-file=/tmp/k8s_test_framework_50606349/sa-signer.key --service-cluster-ip-range=10.0.0.0/24 root 13385 13355 0 08:08 pts/1 00:00:00 ipi6/cluster-api/cluster-api -v=2 --diagnostics-address=0 --health-addr=127.0.0.1:36847 --webhook-port=38905 --webhook-cert-dir=/tmp/envtest-serving-certs-941163289 --kubeconfig=/tmp/jima/ipi6/.clusterapi_output/envtest.kubeconfig root 13394 13355 6 08:08 pts/1 00:00:00 ipi6/cluster-api/cluster-api-provider-azure -v=2 --health-addr=127.0.0.1:44743 --webhook-port=35373 --webhook-cert-dir=/tmp/envtest-serving-certs-3807817663 --feature-gates=MachinePool=false --kubeconfig=/tmp/jima/ipi6/.clusterapi_output/envtest.kubeconfig After installer exited: # ps -ef|grep cluster | grep -v grep root 13365 1 1 08:08 pts/1 00:00:01 ipi6/cluster-api/etcd --advertise-client-urls=http://127.0.0.1:41341 --data-dir=ipi6/.clusterapi_output/etcd --listen-client-urls=http://127.0.0.1:41341 --listen-peer-urls=http://127.0.0.1:34081 --unsafe-no-fsync=true root 13373 1 45 08:08 pts/1 00:00:35 ipi6/cluster-api/kube-apiserver --allow-privileged=true --authorization-mode=RBAC --bind-address=127.0.0.1 --cert-dir=/tmp/k8s_test_framework_50606349 --client-ca-file=/tmp/k8s_test_framework_50606349/client-cert-auth-ca.crt --disable-admission-plugins=ServiceAccount --etcd-servers=http://127.0.0.1:41341 --secure-port=38483 --service-account-issuer=https://127.0.0.1:38483/ --service-account-key-file=/tmp/k8s_test_framework_50606349/sa-signer.crt --service-account-signing-key-file=/tmp/k8s_test_framework_50606349/sa-signer.key --service-cluster-ip-range=10.0.0.0/24 root 13385 1 0 08:08 pts/1 00:00:00 ipi6/cluster-api/cluster-api -v=2 --diagnostics-address=0 --health-addr=127.0.0.1:36847 --webhook-port=38905 --webhook-cert-dir=/tmp/envtest-serving-certs-941163289 --kubeconfig=/tmp/jima/ipi6/.clusterapi_output/envtest.kubeconfig root 13394 1 0 08:08 pts/1 00:00:00 ipi6/cluster-api/cluster-api-provider-azure -v=2 --health-addr=127.0.0.1:44743 --webhook-port=35373 --webhook-cert-dir=/tmp/envtest-serving-certs-3807817663 --feature-gates=MachinePool=false --kubeconfig=/tmp/jima/ipi6/.clusterapi_output/envtest.kubeconfig Another scenario, ran capi-based installer on the small disk, and installer stuck there and didn't exit until interrupted until <Ctrl> + C. Then checked that all CAPI related processes were still running, only installer process was killed. 
[root@jima09id-vm-1 jima]# ./openshift-install create cluster --dir ipi4 INFO Credentials loaded from file "/root/.azure/osServicePrincipal.json" INFO Consuming Install Config from target directory WARNING FeatureSet "CustomNoUpgrade" is enabled. This FeatureSet does not allow upgrades and may affect the supportability of the cluster. INFO Creating infrastructure resources... INFO Started local control plane with envtest INFO Stored kubeconfig for envtest in: /tmp/jima/ipi4/.clusterapi_output/envtest.kubeconfig INFO Running process: Cluster API with args [-v=2 --diagnostics-address=0 --health-addr=127.0.0.1:42017 --webhook-port=41085 --webhook-cert-dir=/tmp/envtest-serving-certs-1774658110 --kubeconfig=/tmp/jima/ipi4/.clusterapi_output/envtest.kubeconfig] INFO Running process: azure infrastructure provider with args [-v=2 --health-addr=127.0.0.1:38387 --webhook-port=37783 --webhook-cert-dir=/tmp/envtest-serving-certs-1319713198 --feature-gates=MachinePool=false --kubeconfig=/tmp/jima/ipi4/.clusterapi_output/envtest.kubeconfig] FATAL failed to extract "ipi4/cluster-api/cluster-api-provider-azureaso": write ipi4/cluster-api/cluster-api-provider-azureaso: no space left on device ^CWARNING Received interrupt signal ^C[root@jima09id-vm-1 jima]# [root@jima09id-vm-1 jima]# ps -ef|grep cluster | grep -v grep root 12752 1 0 07:38 pts/1 00:00:00 ipi4/cluster-api/etcd --advertise-client-urls=http://127.0.0.1:38889 --data-dir=ipi4/.clusterapi_output/etcd --listen-client-urls=http://127.0.0.1:38889 --listen-peer-urls=http://127.0.0.1:38859 --unsafe-no-fsync=true root 12760 1 4 07:38 pts/1 00:00:09 ipi4/cluster-api/kube-apiserver --allow-privileged=true --authorization-mode=RBAC --bind-address=127.0.0.1 --cert-dir=/tmp/k8s_test_framework_3790461974 --client-ca-file=/tmp/k8s_test_framework_3790461974/client-cert-auth-ca.crt --disable-admission-plugins=ServiceAccount --etcd-servers=http://127.0.0.1:38889 --secure-port=44429 --service-account-issuer=https://127.0.0.1:44429/ --service-account-key-file=/tmp/k8s_test_framework_3790461974/sa-signer.crt --service-account-signing-key-file=/tmp/k8s_test_framework_3790461974/sa-signer.key --service-cluster-ip-range=10.0.0.0/24 root 12769 1 0 07:38 pts/1 00:00:00 ipi4/cluster-api/cluster-api -v=2 --diagnostics-address=0 --health-addr=127.0.0.1:42017 --webhook-port=41085 --webhook-cert-dir=/tmp/envtest-serving-certs-1774658110 --kubeconfig=/tmp/jima/ipi4/.clusterapi_output/envtest.kubeconfig root 12781 1 0 07:38 pts/1 00:00:00 ipi4/cluster-api/cluster-api-provider-azure -v=2 --health-addr=127.0.0.1:38387 --webhook-port=37783 --webhook-cert-dir=/tmp/envtest-serving-certs-1319713198 --feature-gates=MachinePool=false --kubeconfig=/tmp/jima/ipi4/.clusterapi_output/envtest.kubeconfig root 12851 6900 1 07:41 pts/1 00:00:00 ./openshift-install destroy cluster --dir ipi4
Version-Release number of selected component (if applicable):
4.17 nightly build
How reproducible:
Always
Steps to Reproduce:
1. Run capi-based installer
2. Installer failed to start some capi process and exited
Actual results:
Installer process exited, but capi related processes are still running
Expected results:
Both the installer and all CAPI-related processes exit.
Additional info:
As a (user persona), I want to be able to:
so that I can achieve
Description of criteria:
Detail about what is specifically not being delivered in the story
This requires/does not require a design proposal.
This requires/does not require a feature gate.
Currently, CAPZ only allows a single API load balancer to be specified. OpenShift requires both a public and private load balancer. It is desirable to allow multiple load balancers to be specified in the API load balancer field.
We need to modify the Azure CAPI NetworkSpec to add support for an array of load balancers. For each LB in the array, the existing behavior of a load balancer needs to be implemented (adding VMs into the backend pool).
Description of problem:
Created VM instances on Azure, and assign managed identity to it, then created cluster in this VM, installer got error as below: # ./openshift-install create cluster --dir ipi --log-level debug ... time="2024-07-01T00:52:43Z" level=info msg="Waiting up to 15m0s (until 1:07AM UTC) for network infrastructure to become ready..." ... time="2024-07-01T00:52:58Z" level=debug msg="I0701 00:52:58.528931 7149 recorder.go:104] \"failed to create scope: failed to configure azure settings and credentials for Identity: failed to create credential: secret can't be empty string\" logger=\"events\" type=\"Warning\" object={\"kind\":\"AzureCluster\",\"namespace\":\"openshift-cluster-api-guests\",\"name\":\"jima0701-hxtzd\",\"uid\":\"63aa5b17-9063-4b33-a471-1f58c146da8a\",\"apiVersion\":\"infrastructure.cluster.x-k8s.io/v1beta1\",\"resourceVersion\":\"1083\"} reason=\"CreateClusterScopeFailed\""
Version-Release number of selected component (if applicable):
4.17 nightly build
How reproducible:
Always
Steps to Reproduce:
1. Create a VM and assign a managed identity to it
2. Create the cluster in this VM
Actual results:
Cluster creation fails.
Expected results:
cluster is installed successfully
Additional info:
These are install-config fields that have existing counterparts in CAPZ. All we should need to do is plumb the values through, so it should be relatively low effort.
Description of problem:
Specify controlPlane.architecture as arm64 in install-config === controlPlane: architecture: arm64 name: master platform: azure: type: null compute: - architecture: arm64 name: worker replicas: 3 platform: azure: type: Standard_D4ps_v5 Launch installer to create cluster, installer exit with below error: time="2024-07-26T06:11:00Z" level=debug msg="\tfailed to reconcile AzureMachine: failed to reconcile AzureMachine service virtualmachine: failed to create or update resource ci-op-wtm3h6km-72f4b-fdwtz-rg/ci-op-wtm3h6km-72f4b-fdwtz-bootstrap (service: virtualmachine): PUT https://management.azure.com/subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/ci-op-wtm3h6km-72f4b-fdwtz-rg/providers/Microsoft.Compute/virtualMachines/ci-op-wtm3h6km-72f4b-fdwtz-bootstrap" time="2024-07-26T06:11:00Z" level=debug msg="\t--------------------------------------------------------------------------------" time="2024-07-26T06:11:00Z" level=debug msg="\tRESPONSE 400: 400 Bad Request" time="2024-07-26T06:11:00Z" level=debug msg="\tERROR CODE: BadRequest" time="2024-07-26T06:11:00Z" level=debug msg="\t--------------------------------------------------------------------------------" time="2024-07-26T06:11:00Z" level=debug msg="\t{" time="2024-07-26T06:11:00Z" level=debug msg="\t \"error\": {" time="2024-07-26T06:11:00Z" level=debug msg="\t \"code\": \"BadRequest\"," time="2024-07-26T06:11:00Z" level=debug msg="\t \"message\": \"Cannot create a VM of size 'Standard_D8ps_v5' because this VM size only supports a CPU Architecture of 'Arm64', but an image or disk with CPU Architecture 'x64' was given. Please check that the CPU Architecture of the image or disk is compatible with that of the VM size.\"" time="2024-07-26T06:11:00Z" level=debug msg="\t }" time="2024-07-26T06:11:00Z" level=debug msg="\t}" time="2024-07-26T06:11:00Z" level=debug msg="\t--------------------------------------------------------------------------------" time="2024-07-26T06:11:00Z" level=debug msg=" > controller=\"azuremachine\" controllerGroup=\"infrastructure.cluster.x-k8s.io\" controllerKind=\"AzureMachine\" AzureMachine=\"openshift-cluster-api-guests/ci-op-wtm3h6km-72f4b-fdwtz-bootstrap\" namespace=\"openshift-cluster-api-guests\" name=\"ci-op-wtm3h6km-72f4b-fdwtz-bootstrap\" reconcileID=\"60b1d513-07e4-4b34-ac90-d2a33ce156e1\"" Checked that gallery image definitions (Gen1 & Gen2), the architecture is still x64. 
$ az sig image-definition show --gallery-image-definition ci-op-wtm3h6km-72f4b-fdwtz -g ci-op-wtm3h6km-72f4b-fdwtz-rg --gallery-name gallery_ci_op_wtm3h6km_72f4b_fdwtz { "architecture": "x64", "hyperVGeneration": "V1", "id": "/subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/ci-op-wtm3h6km-72f4b-fdwtz-rg/providers/Microsoft.Compute/galleries/gallery_ci_op_wtm3h6km_72f4b_fdwtz/images/ci-op-wtm3h6km-72f4b-fdwtz", "identifier": { "offer": "rhcos", "publisher": "RedHat", "sku": "basic" }, "location": "southcentralus", "name": "ci-op-wtm3h6km-72f4b-fdwtz", "osState": "Generalized", "osType": "Linux", "provisioningState": "Succeeded", "resourceGroup": "ci-op-wtm3h6km-72f4b-fdwtz-rg", "tags": { "kubernetes.io_cluster.ci-op-wtm3h6km-72f4b-fdwtz": "owned" }, "type": "Microsoft.Compute/galleries/images" } $ az sig image-definition show --gallery-image-definition ci-op-wtm3h6km-72f4b-fdwtz-gen2 -g ci-op-wtm3h6km-72f4b-fdwtz-rg --gallery-name gallery_ci_op_wtm3h6km_72f4b_fdwtz { "architecture": "x64", "hyperVGeneration": "V2", "id": "/subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/ci-op-wtm3h6km-72f4b-fdwtz-rg/providers/Microsoft.Compute/galleries/gallery_ci_op_wtm3h6km_72f4b_fdwtz/images/ci-op-wtm3h6km-72f4b-fdwtz-gen2", "identifier": { "offer": "rhcos-gen2", "publisher": "RedHat-gen2", "sku": "gen2" }, "location": "southcentralus", "name": "ci-op-wtm3h6km-72f4b-fdwtz-gen2", "osState": "Generalized", "osType": "Linux", "provisioningState": "Succeeded", "resourceGroup": "ci-op-wtm3h6km-72f4b-fdwtz-rg", "tags": { "kubernetes.io_cluster.ci-op-wtm3h6km-72f4b-fdwtz": "owned" }, "type": "Microsoft.Compute/galleries/images" }
Version-Release number of selected component (if applicable):
4.17 nightly build
How reproducible:
Always
Steps to Reproduce:
1. Configure controlPlane.architecture as arm64
2. Create cluster by using multi nightly build
Actual results:
Installation fails as unable to create bootstrap/master machines
Expected results:
Installation succeeds.
Additional info:
Outbound Type defines how egress is provided for the cluster. Currently three options are supported: Load Balancer (default), User Defined Routing, and NAT Gateway (tech preview).
As part of the move away from terraform, the `UserDefinedRouting` outboundType needs to be supported.
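A minimal install-config sketch selecting this outbound type, assuming a pre-existing network (which UserDefinedRouting requires); the network field names also appear in the shared-vnet install-configs later in this document, and the other values are placeholders:
platform:
  azure:
    region: <region>
    networkResourceGroupName: <existing-vnet-rg>
    virtualNetwork: <existing-vnet>
    controlPlaneSubnet: <master-subnet>
    computeSubnet: <worker-subnet>
    outboundType: UserDefinedRouting
publish: Internal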
Description of problem:
Failed to create second cluster in shared vnet, below error is thrown out during creating network infrastructure when creating 2nd cluster, installer timed out and exited. ============== 07-23 14:09:27.315 level=info msg=Waiting up to 15m0s (until 6:24AM UTC) for network infrastructure to become ready... ... 07-23 14:16:14.900 level=debug msg= failed to reconcile cluster services: failed to reconcile AzureCluster service loadbalancers: failed to create or update resource jima0723b-1-x6vpp-rg/jima0723b-1-x6vpp-internal (service: loadbalancers): PUT https://management.azure.com/subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/jima0723b-1-x6vpp-rg/providers/Microsoft.Network/loadBalancers/jima0723b-1-x6vpp-internal 07-23 14:16:14.900 level=debug msg= -------------------------------------------------------------------------------- 07-23 14:16:14.901 level=debug msg= RESPONSE 400: 400 Bad Request 07-23 14:16:14.901 level=debug msg= ERROR CODE: PrivateIPAddressIsAllocated 07-23 14:16:14.901 level=debug msg= -------------------------------------------------------------------------------- 07-23 14:16:14.901 level=debug msg= { 07-23 14:16:14.901 level=debug msg= "error": { 07-23 14:16:14.901 level=debug msg= "code": "PrivateIPAddressIsAllocated", 07-23 14:16:14.901 level=debug msg= "message": "IP configuration /subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/jima0723b-1-x6vpp-rg/providers/Microsoft.Network/loadBalancers/jima0723b-1-x6vpp-internal/frontendIPConfigurations/jima0723b-1-x6vpp-internal-frontEnd is using the private IP address 10.0.0.100 which is already allocated to resource /subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/jima0723b-49hnw-rg/providers/Microsoft.Network/loadBalancers/jima0723b-49hnw-internal/frontendIPConfigurations/jima0723b-49hnw-internal-frontEnd.", 07-23 14:16:14.902 level=debug msg= "details": [] 07-23 14:16:14.902 level=debug msg= } 07-23 14:16:14.902 level=debug msg= } 07-23 14:16:14.902 level=debug msg= -------------------------------------------------------------------------------- Install-config for 1st cluster: ========= metadata: name: jima0723b platform: azure: region: eastus baseDomainResourceGroupName: os4-common networkResourceGroupName: jima0723b-rg virtualNetwork: jima0723b-vnet controlPlaneSubnet: jima0723b-master-subnet computeSubnet: jima0723b-worker-subnet publish: External Install-config for 2nd cluster: ======== metadata: name: jima0723b-1 platform: azure: region: eastus baseDomainResourceGroupName: os4-common networkResourceGroupName: jima0723b-rg virtualNetwork: jima0723b-vnet controlPlaneSubnet: jima0723b-master-subnet computeSubnet: jima0723b-worker-subnet publish: External shared master subnet/worker subnet: $ az network vnet subnet list -g jima0723b-rg --vnet-name jima0723b-vnet -otable AddressPrefix Name PrivateEndpointNetworkPolicies PrivateLinkServiceNetworkPolicies ProvisioningState ResourceGroup --------------- ----------------------- -------------------------------- ----------------------------------- ------------------- --------------- 10.0.0.0/24 jima0723b-master-subnet Disabled Enabled Succeeded jima0723b-rg 10.0.1.0/24 jima0723b-worker-subnet Disabled Enabled Succeeded jima0723b-rg internal lb frontedIPConfiguration on 1st cluster: $ az network lb show -n jima0723b-49hnw-internal -g jima0723b-49hnw-rg --query 'frontendIPConfigurations' [ { "etag": "W/\"7a7531ca-fb02-48d0-b9a6-d3fb49e1a416\"", "id": 
"/subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/jima0723b-49hnw-rg/providers/Microsoft.Network/loadBalancers/jima0723b-49hnw-internal/frontendIPConfigurations/jima0723b-49hnw-internal-frontEnd", "inboundNatRules": [ { "id": "/subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/jima0723b-49hnw-rg/providers/Microsoft.Network/loadBalancers/jima0723b-49hnw-internal/inboundNatRules/jima0723b-49hnw-master-0", "resourceGroup": "jima0723b-49hnw-rg" }, { "id": "/subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/jima0723b-49hnw-rg/providers/Microsoft.Network/loadBalancers/jima0723b-49hnw-internal/inboundNatRules/jima0723b-49hnw-master-1", "resourceGroup": "jima0723b-49hnw-rg" }, { "id": "/subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/jima0723b-49hnw-rg/providers/Microsoft.Network/loadBalancers/jima0723b-49hnw-internal/inboundNatRules/jima0723b-49hnw-master-2", "resourceGroup": "jima0723b-49hnw-rg" } ], "loadBalancingRules": [ { "id": "/subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/jima0723b-49hnw-rg/providers/Microsoft.Network/loadBalancers/jima0723b-49hnw-internal/loadBalancingRules/LBRuleHTTPS", "resourceGroup": "jima0723b-49hnw-rg" }, { "id": "/subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/jima0723b-49hnw-rg/providers/Microsoft.Network/loadBalancers/jima0723b-49hnw-internal/loadBalancingRules/sint-v4", "resourceGroup": "jima0723b-49hnw-rg" } ], "name": "jima0723b-49hnw-internal-frontEnd", "privateIPAddress": "10.0.0.100", "privateIPAddressVersion": "IPv4", "privateIPAllocationMethod": "Static", "provisioningState": "Succeeded", "resourceGroup": "jima0723b-49hnw-rg", "subnet": { "id": "/subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/jima0723b-rg/providers/Microsoft.Network/virtualNetworks/jima0723b-vnet/subnets/jima0723b-master-subnet", "resourceGroup": "jima0723b-rg" }, "type": "Microsoft.Network/loadBalancers/frontendIPConfigurations" } ] From above output, privateIPAllocationMethod is static and always allocate privateIPAddress to 10.0.0.100, this might cause the 2nd cluster installation failure. Checked the same on cluster created by using terraform, privateIPAllocationMethod is dynamic. 
=============== $ az network lb show -n wxjaz723-pm99k-internal -g wxjaz723-pm99k-rg --query 'frontendIPConfigurations' [ { "etag": "W/\"e6bec037-843a-47ba-a725-3f322564be58\"", "id": "/subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/wxjaz723-pm99k-rg/providers/Microsoft.Network/loadBalancers/wxjaz723-pm99k-internal/frontendIPConfigurations/internal-lb-ip-v4", "loadBalancingRules": [ { "id": "/subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/wxjaz723-pm99k-rg/providers/Microsoft.Network/loadBalancers/wxjaz723-pm99k-internal/loadBalancingRules/api-internal-v4", "resourceGroup": "wxjaz723-pm99k-rg" }, { "id": "/subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/wxjaz723-pm99k-rg/providers/Microsoft.Network/loadBalancers/wxjaz723-pm99k-internal/loadBalancingRules/sint-v4", "resourceGroup": "wxjaz723-pm99k-rg" } ], "name": "internal-lb-ip-v4", "privateIPAddress": "10.0.0.4", "privateIPAddressVersion": "IPv4", "privateIPAllocationMethod": "Dynamic", "provisioningState": "Succeeded", "resourceGroup": "wxjaz723-pm99k-rg", "subnet": { "id": "/subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/wxjaz723-rg/providers/Microsoft.Network/virtualNetworks/wxjaz723-vnet/subnets/wxjaz723-master-subnet", "resourceGroup": "wxjaz723-rg" }, "type": "Microsoft.Network/loadBalancers/frontendIPConfigurations" }, ... ]
Version-Release number of selected component (if applicable):
4.17 nightly build
How reproducible:
Always
Steps to Reproduce:
1. Create shared vnet / master subnet / worker subnet
2. Create 1st cluster in shared vnet
3. Create 2nd cluster in shared vnet
Actual results:
2nd cluster installation failed
Expected results:
Both clusters are installed successfully.
Additional info:
Description of problem:
Whatever vmNetworkingType setting under ControlPlane in install-config, "Accelerated networking" on master instances are always disabled. In install-config.yaml, set controlPlane.platform.azure.vmNetworkingType to 'Accelerated' or without such setting on controlPlane ======================= controlPlane: architecture: amd64 hyperthreading: Enabled name: master platform: azure: vmNetworkingType: 'Accelerated' create cluster, and checked "Accelerated networking" on master instances, all are disabled. $ az network nic show --name jima24c-tp7lp-master-0-nic -g jima24c-tp7lp-rg --query 'enableAcceleratedNetworking' false After creating manifests, checked capi manifests, acceleratedNetworking is set as false. $ yq-go r 10_inframachine_jima24c-qglff-master-0.yaml 'spec.networkInterfaces' - acceleratedNetworking: false
Version-Release number of selected component (if applicable):
4.17.0-0.nightly-2024-06-23-145410
How reproducible:
Always
Steps to Reproduce:
1. Set vmNetworkingType to 'Accelerated', or leave vmNetworkingType unset under controlPlane in install-config
2. Create cluster
Actual results:
AcceleratedNetworking on all master instances is always disabled.
Expected results:
1. Without a vmNetworkingType setting in install-config, AcceleratedNetworking on all master instances should be enabled by default, which keeps the same behavior as a terraform-based installation.
2. AcceleratedNetworking on all master instances should be consistent with the setting in install-config.
Additional info:
CAPZ expects a VM extension to report back that CAPI bootstrapping is successful, but RHCOS does not support extensions (because it, by design, does not support the Azure Linux agent).
We need https://github.com/kubernetes-sigs/cluster-api-provider-azure/pull/4792 to be able to disable the default extension in capz.
Much like core OpenShift operators, a standardized flow exists for OLM-managed operators to interact with the cluster in a specific way to leverage GCP Workload Identity Federation-based authorization when using GCP APIs, as opposed to insecure static, long-lived credentials. OLM-managed operators can implement integration with the CloudCredentialOperator in a well-defined way to support this flow.
Enable customers to easily leverage OpenShift's capabilities around GCP WIF with layered products, for an increased security posture. Enable OLM-managed operators to implement support for this in a well-defined pattern.
See Operators & STS slide deck.
The CloudCredentialOperator already provides a powerful API for OpenShift's cluster core operators to request credentials and acquire them via short-lived tokens for other cloud providers like AWS. This capability is now also being implemented for GCP as part of CCO-1898 and CCO-285. The support should be expanded to OLM-managed operators, specifically to Red Hat layered products that interact with GCP APIs. The process today ranges from cumbersome to non-existent depending on the operator in question, and is seen as an adoption blocker of OpenShift on GCP.
This is particularly important for OSD on GCP customers. Customers are expected to be asked to pre-create the required IAM roles outside of OpenShift, which is deemed acceptable.
CCO needs to support the CloudCredentialRequestAPI with GCP Workload Identity (just like we did for AWS STS and Azure Entra Workload ID) to enable OCPSTRAT-922 (CloudCredentialOperator-based workflows for OLM-managed operators and GCP WIF).
Similar to the AWS and Azure actuators, we need to add the STS OLM functionality to the GCP actuator.
When the internal oauth-server and oauth-apiserver are removed and replaced with an external OIDC issuer (like azure AD), the console must work for human users of the external OIDC issuer.
An end user can use the openshift console without a notable difference in experience. This must eventually work on both hypershift and standalone, but hypershift is the first priority if it impacts delivery
OCP/Telco Definition of Done
Epic Template descriptions and documentation.
<--- Cut-n-Paste the entire contents of this description into your new Epic --->
Description of problem:
When a cluster is configured for direct OIDC configuration (authentication.config/cluster .spec.type=OIDC), console pods will be in crashloop until an OIDC client is configured for the console.
Version-Release number of selected component (if applicable):
4.15.0
How reproducible:
100% in Hypershift; 100% in TechPreviewNoUpgrade featureset on standalone OpenShift
Steps to Reproduce:
1. Update authentication.config/cluster so that Type=OIDC
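For illustration, a minimal sketch of step 1 (the OIDC provider details under spec.oidcProviders are omitted here, and on standalone clusters the relevant feature set must also be enabled):
$ oc patch authentication.config/cluster --type merge -p '{"spec":{"type":"OIDC"}}'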
Actual results:
The console operator tries to create a new console rollout, but the pods crashloop. This is because the operator sets the console pods to "disabled". This would normally mean a privilege escalation; fortunately, the configuration prevents a successful deploy.
Expected results:
Console pods are healthy, they show a page which says that no authentication is currently configured.
Additional info:
This feature is now re-opened because we want to run z-rollback CI. This feature doesn't block the release of 4.17. It is not going to be exposed as a customer-facing feature and will not be documented within the OpenShift documentation. It is strictly going to be covered as a RH Support guided solution, with a KCS article providing guidance. A public-facing KCS will basically point to contacting Support for help with z-stream rollback; y-stream rollback is not supported.
NOTE:
Previously this was closed as "won't do" because we didn't have a plan to support y-stream and z-stream rollbacks in standalone OpenShift.
For Single Node OpenShift please check TELCOSTRAT-160. The "won't do" decision was made after further discussion with leadership.
The e2e tests are tracked in https://docs.google.com/spreadsheets/d/1mr633YgQItJ0XhbiFkeSRhdLlk6m9vzk1YSKQPHgSvw/edit?gid=0#gid=0 . We have identified a few bugs that need to be resolved before the General Availability (GA) release. Ideally, these should be addressed in the final month before GA, when all features are development complete. However, asking component teams to commit to fixing critical rollback bugs during this time could potentially delay the GA date.
------
An elevator pitch (value statement) that describes the Feature in a clear, concise way. Complete during New status.
Red Hat Support assisted z-stream rollback from 4.16+
The observable functionality that the user now has as a result of receiving this feature. Include the anticipated primary user type/persona and which existing features, if any, will be expanded. Complete during New status.
Red Hat Support may, at their discretion, assist customers with z-stream rollback once it’s determined to be the best option for restoring a cluster to the desired state whenever a z-stream rollback compromises cluster functionality.
Engineering will take a “no regressions, no promises” approach, ensuring there are no major regressions between z-streams, but not testing specific combinations or addressing case-specific bugs.
A list of specific needs or objectives that a feature must deliver in order to be considered complete. Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc. Initial completion during Refinement status.
Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed. Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.
Deployment considerations | List applicable specific needs (N/A = not applicable) |
Self-managed | all |
Multi node, Compact (three node) | all |
Connected and Restricted Network | all |
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) | all |
Release payload only | all |
Starting with 4.16, including all future releases | all |
Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.
As an admin who has determined that a z-stream update has compromised cluster functionality I have clear documentation that explains that unassisted rollback is not supported and that I should consult with Red Hat Support on the best path forward.
As a support engineer I have a clear plan for responding to problems which occur during or after a z-stream upgrade, including the process for rolling back specific components, applying workarounds, or rolling the entire cluster back to the previously running z-stream version.
Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.
Should we allow rollbacks whenever an upgrade doesn't complete? No, not without fully understanding the root cause. If it's simply a situation where workers are in the process of updating but stalled, that should never yield a rollback without credible evidence that a rollback will fix it.
Similar to our “foolproof command” to initiate rollback to the previous z-stream, should we also craft a foolproof command to override select operators to previous z-stream versions? Part of the goal of the foolproof command is to avoid the potential of moving to an unintended version. The same risk may apply at the single-operator level; though the impact would be smaller, it could still be catastrophic.
High-level list of items that are out of scope. Initial completion during Refinement status.
Non-HA clusters, Hosted Control Planes – those may be handled via separately scoped features
Provide any additional context is needed to frame the feature. Initial completion during Refinement status.
Occasionally clusters either upgrade successfully and encounter issues after the upgrade or may run into problems during the upgrade. Many customers assume that a rollback will fix their concerns but without understanding the root cause we cannot assume that’s the case. Therefore, we recommend anyone who has encountered a negative outcome associated with a z-stream upgrade contact support for guidance.
Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.
It’s expected that customers should have adequate testing and rollout procedures to protect against most regressions, i.e. roll out a z-stream update in pre-production environments where it can be adequately tested prior to updating production environments.
Provide information that needs to be considered and planned so that documentation will meet customer needs. If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.
This is largely a documentation effort, i.e. we should create either a KCS article or new documentation section which describes how customers should respond to loss of functionality during or after an upgrade.
KCS Solution : https://access.redhat.com/solutions/7083335
Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.
Given we test as many upgrade configurations as possible and for whatever reason the upgrade still encounters problems, we should not strive to comprehensively test all configurations for rollback success. We will only test a limited set of platforms and configurations necessary to ensure that we believe the platform is generally able to roll back a z-stream update.
KCS : https://access.redhat.com/solutions/7089715
OTA-941 landed a rollback guard in 4.14 that blocked all rollbacks. OCPBUGS-24535 drilled a hole in that guard to allow limited rollbacks to the previous release the cluster had been aiming at, as long as that previous release was part of the same 4.y z-stream. We decided to block that hole back up in OCPBUGS-35994. And now folks want the hole re-opened in this bug. We also want to bring back the oc adm upgrade rollback ... subcommand. Hopefully this new plan sticks.
Folks want the guard-hole and rollback subcommand restored for 4.16 and 4.17.
Every time.
Try to perform the rollbacks that OCPBUGS-24535 allowed.
They stop working, with reasonable ClusterVersion conditions explaining that even those rollback requests will not be accepted.
They work, as verified in OCPBUGS-24535.
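For reference, a rough sketch of the intended flow once the subcommand is restored, assuming the behavior verified in OCPBUGS-24535:
$ oc adm upgrade rollback
# then watch the ClusterVersion conditions to confirm the rollback request was accepted
$ oc get clusterversion version -o yaml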
Networking Definition of Planned
Epic Template descriptions and documentation
openshift-sdn is no longer part of OCP in 4.17, so CNO must stop referring to its image
Additional information on each of the above items can be found here: Networking Definition of Planned
This feature is to track automation in ODC, related packages, upgrades and some tech debts
Requirement | Notes | isMvp? |
---|---|---|
CI - MUST be running successfully with test automation | This is a requirement for ALL features. | YES |
Release Technical Enablement | Provide necessary release enablement details and documents. | No |
This Section: What does the person writing code, testing, documenting need to know? What context can be provided to frame this feature.
This won't impact documentation; this feature is mostly to enhance end-to-end tests and job runs on CI.
Questions to be addressed:
Here is our overall tech debt backlog: ODC-6711
See included tickets, we want to clean up in 4.16.
This is a follow-up on https://github.com/openshift/console/pull/13931.
We should move test-cypress.sh from the root of the console project into the frontend folder or a new frontend integration-tests folder to allow more people to approve changes in the test-cypress.sh script.
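A minimal sketch of the move, assuming no other layout changes (any CI config or docs referencing the old path would need to be updated):
$ git mv test-cypress.sh frontend/test-cypress.sh
$ git commit -m "Move test-cypress.sh into frontend/"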
Provide mechanisms for the builder service account to be made optional in core OpenShift.
< Who benefits from this feature, and how? What is the difference between today’s current state and a world with this feature? >
Requirements | Notes | IS MVP |
Disable service account controller related to Build/BuildConfig when Build capability is disabled | When the API is marked as removed or disabled, stop creating the "builder" service account and its associated RBAC | Yes |
Option to disable the "builder" service account | Even if the Build capability is enabled, allow admins to disable the "builder" service account generation. Admins will need to bring their own service accounts/RBAC for builds to work | Yes |
< What are we making, for who, and why/what problem are we solving?>
<Defines what is not included in this story>
< Link or at least explain any known dependencies. >
Background, and strategic fit
< What does the person writing code, testing, documenting need to know? >
< Are there assumptions being made regarding prerequisites and dependencies?>
< Are there assumptions about hardware, software or people resources?>
< Are there specific customer environments that need to be considered (such as working with existing h/w and software)?>
< What educational or reference material (docs) is required to support this product feature? For users/admins? Other functions (security officers, etc)? >
< Does this feature have doc impact? Possible values are: New Content, Updates to existing content, Release Note, or No Doc Impact?>
< Are there assumptions being made regarding prerequisites and dependencies?>
< Are there assumptions about hardware, software or people resources?>
< If the feature is ordered with other work, state the impact of this feature on the other work>
Description of problem:
When a cluster is deployed with no capabilities enabled, and the Build capability is later enabled, its related cluster configuration CRD is not installed. This prevents admins from fine-tuning builds and prevents ocm-o from fully reconciling its state.
Version-Release number of selected component (if applicable):
4.16.0
How reproducible:
Always
Steps to Reproduce:
1. Launch a cluster with no capabilities enabled (via cluster-bot: launch 4.16.0.0-ci aws,no-capabilities) 2. Edit the clusterversion to enable the Build capability: oc patch clusterversion/version --type merge -p '{"spec":{"capabilities":{"additionalEnabledCapabilities":["Build"]}}}' 3. Wait for the openshift-apiserver and openshift-controller-manager to roll out
Actual results:
APIs for BuildConfig (build.openshift.io) are enabled. Cluster configuration API for build system is not: $ oc api-resources | grep "build" buildconfigs bc build.openshift.io/v1 true BuildConfig builds build.openshift.io/v1 true Build
Expected results:
Cluster configuration API is enabled. $ oc api-resources | grep "build" buildconfigs bc build.openshift.io/v1 true BuildConfig builds build.openshift.io/v1 true Build builds config.openshift.io/v1 true Build
Additional info:
This causes list errors in openshift-controller-manager-operator, breaking the controller that reconciles state for builds and the image registry. W0523 18:23:38.551022 1 reflector.go:539] k8s.io/client-go@v0.29.0/tools/cache/reflector.go:229: failed to list *v1.Build: the server could not find the requested resource (get builds.config.openshift.io) E0523 18:23:38.551334 1 reflector.go:147] k8s.io/client-go@v0.29.0/tools/cache/reflector.go:229: Failed to watch *v1.Build: failed to list *v1.Build: the server could not find the requested resource (get builds.config.openshift.io)
As a cluster admin trying to disable the Build, DeploymentConfig, and Image Registry capabilities I want the RBAC controllers for the builder and deployer service accounts and default image-registry rolebindings disabled when their respective capability is disabled.
<Describes high level purpose and goal for this story. Answers the questions: Who is impacted, what is it and why do we need it? How does it improve the customer's experience?>
<Describes the context or background related to this story>
In WRKLDS-695, ocm-o was enhanced to disable the Build and DeploymentConfig controllers when the respective capability was disabled. This logic should be extended to include the controllers that set up the service accounts and role bindings for these respective features.
<Defines what is not included in this story>
<Description of the general technical path on how to achieve the goal of the story. Include details like json schema, class definitions>
<Describes what this story depends on. Dependent Stories and EPICs should be linked to the story.>
Dependencies identified
Blockers noted and expected delivery timelines set
Design is implementable
Acceptance criteria agreed upon
Story estimated
Unknown
Verified
Unsatisfied
In OCP 4.16.0, the default role bindings for image puller, image pusher, and deployer are created, even if the respective capabilities are disabled on the cluster.
When this image was assembled, these features were not yet completed. Therefore, only the Jira Cards included here are part of this release
Design a secure device lifecycle from provisioning and on-boarding of devices, attesting their integrity, rotating device certificates, to decommissioning with a frictionless user experience in a way that facilitates later IEC 62443 certification.
Implement the MVP parts of it, namely the secure device enrollment, certificate rotation, and decommissioning, preparing the way to use a hardware root-of-trust for device identity where possible. Certificates must be sign-able by a user-provided, production-grade CA.
User stories:
Assisted installer
team's contact Riccardo Piccoli
spreadsheet template: https://docs.google.com/spreadsheets/d/1ft6REuaFaA-6fJw93BjjxLJvhLpFCR4rhPrFjf_20eM/edit#gid=0
As a stakeholder aiming to adopt KubeSaw as a Namespace-as-a-Service solution, I want the project to provide streamlined tooling and a clear code-base, ensuring seamless adoption and integration into my clusters.
Efficient adoption of KubeSaw, especially as a Namespace-as-a-Service solution, relies on intuitive tooling and a transparent codebase. Improving these aspects will empower stakeholders to effortlessly integrate KubeSaw into their Kubernetes clusters, ensuring a smooth transition to enhanced namespace management.
As a Stakeholder, I want a streamlined setup of the KubeSaw project and a fully automated way of upgrading this setup along with updates of the installation.
The expected outcome within the market is both growth and retention. The improved tooling and codebase will attract new stakeholders (growth) and enhance the experience for existing users (retention) by providing a straightforward path to adopting KubeSaw's Namespace-as-a-Service features in their clusters.
This epic is to track all the unplanned work related to security incidents, fixing flaky e2e tests, and other urgent and unplanned efforts that may arise during the sprint.
“In order to have a consistent metrics UI/UX, we as the Observability UI Team need to reuse the metrics code from the admin console in the dev console”
The metrics page is the one that receives a promQL query and is able to display a line chart with the results
Product Requirements:
In order to keep the dev and admin console metrics consistent, users need to be able to select a predefined query from a list. The dev perspective metrics page is scoped to the currently selected namespace, so we should adjust the code so that the current namespace is used in the soft-tenancy requests to Thanos Querier.
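As an illustrative sketch of the soft-tenancy request shape involved (assuming the Thanos Querier tenancy endpoint and its namespace query parameter; exact endpoint details may differ):
GET https://thanos-querier.openshift-monitoring.svc:9092/api/v1/query?namespace=<current-namespace>&query=<PromQL>
Authorization: Bearer <user token>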
The admin console's alert details page is provided by https://github.com/openshift/monitoring-plugin, but the dev console's equivalent page is still provided by code in the console codebase.
The UX of the two pages differs somewhat, so we will need to decide whether we can change the dev console to use the same UX as the admin page or whether we need to keep some differences. This is an opportunity to bring the improved PromQL editing UX from the admin console to the dev console.
—
Duplicate issue of https://issues.redhat.com/browse/CONSOLE-4187
To pass the CI/CD requirements of openshift/console, each PR needs to have an issue in an OCP-owned Jira board.
This issue migrates the rendering of the Developer Perspective > Observe > Metrics page from openshift/console to openshift/monitoring-plugin.
openshift/console PR#4187: Removes the Metrics Page.
openshift/monitoring-plugin PR#138: Add the Metrics Page & consolidates the code to use the same components as the Administrative > Observe > Metrics Page.
—
Testing
Both openshift/console PR#4187 & openshift/monitoring-plugin PR#138 need to be launched to see the full feature. After launching both the PRs you should see a page like the screenshot attached below.
—
The admin console's alerts list page is provided by https://github.com/openshift/monitoring-plugin, but the dev console's equivalent page is still provided by code in the console codebase.
There are also some UX differences between the two pages, but we want to change the dev console to have the same UX as the admin console.
That dev console page is loaded from monitoring-plugin and the code for the page is removed from the console codebase.
The dev console version of the page has the project selector dropdown, but the admin console page doesn't, so monitoring-plugin will need to be changed to support that difference.
The admin console's alerts list page is provided by https://github.com/openshift/monitoring-plugin, but the dev console's equivalent page is still provided by code in the console codebase.
There are also some UX differences between the two pages, but we want to change the dev console to have the same UX as the admin console.
The admin console's alert details page is provided by https://github.com/openshift/monitoring-plugin, but the dev console's equivalent page is still provided by code in the console codebase.
That dev console page is loaded from monitoring-plugin and the code for the page is removed from the console codebase.
The admin console's alert details page is provided by https://github.com/openshift/monitoring-plugin, but the dev console's equivalent page is still provided by code in the console codebase.
That dev console page is loaded from monitoring-plugin and the code for the page is removed from the console codebase.
Proposed title of this feature request
Fleet / Multicluster Alert Management User Interface
What is the nature and description of the request?
Large enterprises are drowning in cluster alerts.
side note: Just within my demo RHACM Hub environment, across 12 managed clusters (OCP, SNO, ARO, ROSA, self-managed HCP, xKS), I have 62 alerts being reported! And I have no idea what to do about them!
Customers need the ability to interact with alerts in a meaningful way, to leverage a user interface that can filter, display, multi-select, sort, etc. To multi-select and take actions, for example:
Why does the customer need this? (List the business requirements)
Platform engineering (sys admin; SRE etc) must maintain the health of the cluster and ensure that the business applications are running stable. There might indeed be another tool and another team which focuses on the Application health itself, but for sure the platform team is interested to ensure that the platform is running optimally and all critical alerts are responded to.
As of TODAY, what the customer must do is perform alert management via CLI. This is tedious, ad-hoc, and error prone. see blog link
The requirements are:
List any affected packages or components.
OCP console Observe dynamic plugin
ACM Multicluster observability (MCO operator)
"In order to provide ACM with the same monitoring capabilities OCP has, we as the Observability UI Team need to allow the monitoring plugin to be installed and work in ACM environments."
Product Requirements:
UX Requirements:
In order to enable/disable features for monitoring in different OpenShift flavors, the monitoring plugin should support feature flags
Proposed title of this feature request
Q2 - Rapid Recommendations Iteration 1 - Containers/Pod logs gathering
What is the nature and description of the request?
As an Insights/Observability user I'd like the collection mechanism to be more dynamic in order to cover more scenarios and provide recommendations faster.
Why does the customer need this? (List the business requirements)
Rapid Recommendations is a set of collection-mechanism changes that enables Insights rule development and analytics functions to request data-collection enhancements, trimming the time to implement a rule/dashboard from a month+ to days.
List any affected packages or components.
Insights Operator
The main goal is to implement Rapid Recommendations Iteration 1 (Containers/Pod logs gathering) in Insights Operator as per Openshift Enhancement Proposal - PR link
In more details:
This improvement has huge potential in terms of the additional value IO data could bring to the table.
The conditional data gathering can be considered as the previous work. This idea generally builds on it.
Scoping is done in previous epic and described in the Openshift Enhancement Proposal - PR link
There are several unknowns:
Even if the scope was reduced to Pod logs only, it remains a sizable chunk of work and might require project management involvement (not necessarily a project manager).
By having the OpenShift Insights Operator gather data from OSP CRDs, this data can be ingested into Insights and plugged into the existing tools and processes under the Insights/CCX teams.
This will allow us to create dedicated Superset dashboards or query the data using SQL via the Trino API to for example:
Besides, customers will benefit from any Insights rules that we'll be adding over time to for example anticipate issues or detect misconfigurations, suggest parameter tunings, etcetera.
Examples of how OCP Insights uses this data can be seen in the "Let's Do The Numbers" series of monthly presentations.
This epic is targeted only at RHOSO (so OSP18 and newer). There are no changes nor support planned for OSP-17.1 or older.
It is implementation of the solution 1 from the document https://docs.google.com/document/d/1r3sC_7ZU7qkxvafpEkAJKMTmtcWAwGOI6W_SZGkvP6s/edit#heading=h.kfjcs2uvui3g
Based on Yatin Karel's patch, we need to properly integrate our CRs with the insights-operator. It needs to collect data from the 'OpenstackControlPlane', 'OpenstackDataPlaneDeployment' and 'OpenstackDataPlaneNodeSet' CRs, with proper anonymization of data such as IP addresses. It also needs to set a "good" ID to identify the OpenStack cluster, as we cannot rely on the OpenShift clusterID because we may have more than one OpenStack cluster on the same OCP cluster.
To identify the OpenStack cluster, the UUID of the OpenstackControlPlane CR can perhaps be used. If not, we will need to figure out something else.
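For illustration, one way to read that candidate identifier (the namespace is an assumption; adjust to wherever the OpenstackControlPlane CR lives):
$ oc get openstackcontrolplane -n openstack -o jsonpath='{.items[0].metadata.uid}'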
Definition of done:
As a cluster-admin, I want to run update in discrete steps. Update control plane and worker nodes independently.
I also want to back-up and restore incase of a problematic upgrade.
Background:
This Feature is a continuation of https://issues.redhat.com/browse/OCPSTRAT-180.
Customers are asking for improvements to the upgrade experience (both over-the-air and disconnected). This is a feature tracking the epics required to get that work done. Below is the list of done tasks.
These are alarming conditions which may frighten customers, and we don't want to see them in our own, controlled, repeatable update CI. This example job had logs like:
: [bz-Image Registry] clusteroperator/image-registry should not change condition/Available expand_less Run #0: Failed expand_less 1h58m17s { 0 unexpected clusteroperator state transitions during e2e test run, as desired. 3 unwelcome but acceptable clusteroperator state transitions during e2e test run. These should not happen, but because they are tied to exceptions, the fact that they did happen is not sufficient to cause this test-case to fail: Jan 09 12:43:04.348 E clusteroperator/image-registry condition/Available reason/NoReplicasAvailable status/False Available: The deployment does not have available replicas\nNodeCADaemonAvailable: The daemon set node-ca has available replicas\nImagePrunerAvailable: Pruner CronJob has been created (exception: We are not worried about Available=False or Degraded=True blips for stable-system tests yet.) Jan 09 12:43:04.348 - 56s E clusteroperator/image-registry condition/Available reason/NoReplicasAvailable status/False Available: The deployment does not have available replicas\nNodeCADaemonAvailable: The daemon set node-ca has available replicas\nImagePrunerAvailable: Pruner CronJob has been created (exception: We are not worried about Available=False or Degraded=True blips for stable-system tests yet.) Jan 09 12:44:00.860 W clusteroperator/image-registry condition/Available reason/Ready status/True Available: The registry is ready\nNodeCADaemonAvailable: The daemon set node-ca has available replicas\nImagePrunerAvailable: Pruner CronJob has been created (exception: Available=True is the happy case) }
And the job still passed.
Definition of done:
In order to continue with the evolution of rpm-ostree in RHEL CoreOS, we should adopt bootc. This will keep us aligned with future development work in bootc and provide operational consistence between RHEL image mode and RHEL CoreOS.
Deployment considerations | List applicable specific needs (N/A = not applicable) |
Operator compatibility | Does an operator use `rpm-ostree install`? |
High-level list of items that are out of scope. Initial completion during Refinement status.
<your text here>
Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.
<your text here>
Provide information that needs to be considered and planned so that documentation will meet customer needs. If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.
<your text here>
The whole bootc integration into the MCO can be divided into three epics, each announced in an enhancement
Epic 1: Bootc Update Path
Summary: An additional image update path added to the backend of the MCO. Keeping in sync with CoreOS, the MCO currently utilizes an rpm-ostree-based system for base image specification and layering. By adding a bootc update path, the MCO will, in the future, embrace Image Mode for RHEL and be ready to support an image-based upgrade process.
Proposal:
Phase 1 Get bootc working in the MCO
Phase 2 Get bootc working with OCL
Phase 3 GA graduation criteria
Epic 2: Unified Update Interface @ https://issues.redhat.com/browse/MCO-1189
Summary: Currently, there are three main update paths built in parallel within the MCO. They separately take care of non-image updates, image updates, and updates for pools that have opted in to On Cluster Layering. As a new bootc update path will be added with the introduction of this enhancement, the MCO is looking for a way to better manage these four update paths, which handle different types of updates but also share a lot in common (e.g. checking reconcilability). Interest and proposals around refactoring the MCD functions and creating a unified update interface have been raised several times in previous discussions:
Epic 3: Bootc Day-2 Configuration Tools @ https://issues.redhat.com/browse/MCO-1190
Summary: Bootc has opened a door for disk image customization via ConfigMap. Switching from rpm-ostree to bootc, the MCO should not only make sure all existing functionality remains, but also proactively extend its support so that all the customization power brought in by bootc is put in the user's hands, allowing them to fully maximize these advantages. This will involve creating a new user-side API for fetching admin-defined configuration and pairing the MCO with a bootc day-2 configuration tool for applying the customizations. Interest and proposals for this have been raised several times in previous discussions:
Several discussions have happened on the question: is a Go binding for bootc a necessity for adopting a bootc update path in the MCO?
Action item:
(1) Mimic the rpmostree.go we have for shelling out rpm-ostree commands and have something similar for bootc, so we can call bootc commands for OS updates (a rough sketch of the relevant bootc CLI calls follows after this list)
(2) This is a merging card for the result in https://issues.redhat.com/browse/MCO-1191
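As referenced in (1), a rough sketch of the bootc CLI surface such a wrapper would shell out to (the pullspec is illustrative; drain and reboot coordination would still be handled by the MCD):
$ bootc status        # inspect the currently booted and any staged image
$ bootc switch quay.io/example/custom-rhcos:latest   # point the host at a new base image
$ bootc upgrade       # stage an update from the currently tracked image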
Enable installation and lifecycle support of OpenShift 4 on Oracle Cloud Infrastructure (OCI) Bare metal
Use scenarios
Why is this important
Requirement | Notes |
---|---|
OCI Bare Metal Shapes must be certified with RHEL | It must also work with RHCOS (see iSCSI boot notes), as OCI BM standard shapes require RHCOS iSCSI to boot. (Certified shapes: https://catalog.redhat.com/cloud/detail/249287) |
Successfully passing the OpenShift Provider conformance testing – this should be fairly similar to the results from the OCI VM test results. | Oracle will do these tests. |
Updating Oracle Terraform files | |
Making the Assisted Installer modifications needed to address the CCM changes and surface the necessary configurations. | Support Oracle Cloud in Assisted-Installer CI: |
RFEs:
Any bare metal Shape to be supported with OCP has to be certified with RHEL.
From the certified Shapes, those that have local disks will be supported. This is due to the current lack of support in RHCOS for the iSCSI boot feature. OCPSTRAT-749 is tracking adding this support and remove this restriction in the future.
As of Aug 2023 this excludes at least all the Standard shapes, BM.GPU2.2 and BM.GPU3.8, from the published list at: https://docs.oracle.com/en-us/iaas/Content/Compute/References/computeshapes.htm#baremetalshapes
Please describe what this feature is going to do.
Please describe what conditions must be met in order to mark this feature as "done".
If the answer is "yes", please make sure to check the corresponding option.
Support network isolation and multiple primary networks (with the possibility of overlapping IP subnets) without having to use Kubernetes Network Policies.
Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed. Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.
Deployment considerations | List applicable specific needs (N/A = not applicable) |
Self-managed, managed, or both | |
Classic (standalone cluster) | |
Hosted control planes | |
Multi node, Compact (three node), or Single node (SNO), or all | |
Connected / Restricted Network | |
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) | |
Operator compatibility | |
Backport needed (list applicable versions) | |
UI need (e.g. OpenShift Console, dynamic plugin, OCM) | |
Other (please specify) |
OVN-Kubernetes today allows multiple different types of networks per secondary network: layer 2, layer 3, or localnet. Pods can be connected to different networks without discretion. For the primary network, OVN-Kubernetes only supports all pods connecting to the same layer 3 virtual topology.
As users migrate from OpenStack to Kubernetes, there is a need to provide network parity for those users. In OpenStack, each tenant (analog to a Kubernetes namespace) by default has a layer 2 network, which is isolated from any other tenant. Connectivity to other networks must be specified explicitly as network configuration via a Neutron router. In Kubernetes the paradigm is the opposite; by default all pods can reach other pods, and security is provided by implementing Network Policy.
Network Policy has its issues:
With all these factors considered, there is a clear need to address network security in a native fashion, by using networks per user to isolate traffic instead of using Kubernetes Network Policy.
Therefore, the scope of this effort is to bring the same flexibility of the secondary network to the primary network and allow pods to connect to different types of networks that are independent of networks that other pods may connect to.
Test scenarios:
See https://github.com/ovn-org/ovn-kubernetes/pull/4276#discussion_r1628111584 for more details
In order for the network API related CRDs to be installed and usable out-of-the-box, the new CRD manifests should be replicated to the CNO repository in a way that installs them along with the other OVN-K CRDs.
Example https://github.com/openshift/cluster-network-operator/pull/1765
The goal of this task is simply to add a feature gate, both upstream in OVN-K and downstream in ocp/api, to then leverage via CNO once the entire feature merges. This is going to be a huge epic, so with the breakdown, this card intentionally tracks ONLY the glue work to have the feature gate piece done in both places.
This card DOES NOT HAVE TO USE THE FEATURE GATE. It is meant to allow other cards to use this.
crun has been GA as a non-default runtime since OCP 4.14. We want to make it the default in 4.18 while still supporting runc as a non-default option.
The benefits of crun are covered here: https://github.com/containers/crun
FAQ.: https://docs.google.com/document/d/1N7tik4HXTKsXS-tMhvnmagvw6TE44iNccQGfbL_-eXw/edit
***Note -> making crun the default does not mean we will remove support for runc, nor do we have any plans in the foreseeable future to do that
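For context, a sketch of how an admin can explicitly pin a pool to a runtime via a ContainerRuntimeConfig (the name and pool selector are illustrative):
cat << EOF | oc create -f -
apiVersion: machineconfiguration.openshift.io/v1
kind: ContainerRuntimeConfig
metadata:
  name: keep-runc-workers
spec:
  machineConfigPoolSelector:
    matchLabels:
      pools.operator.machineconfiguration.openshift.io/worker: ""
  containerRuntimeConfig:
    defaultRuntime: runc
EOF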
OCP/Telco Definition of Done
Epic Template descriptions and documentation.
<--- Cut-n-Paste the entire contents of this description into your new Epic --->
Description of problem:
Upgrading from 4.17 to 4.18 results in crun as the default runtime. If a user didn't have a ContainerRuntimeConfig that explicitly sets crun, they should continue to use runc.
Version-Release number of selected component (if applicable):
4.17.z
How reproducible:
100%
Steps to Reproduce:
1. upgrade 4.17 to 4.18
Actual results:
crun is the default
Expected results:
runc should be the default
Additional info:
Check with ACS team; see if there are external repercussions.
An elevator pitch (value statement) that describes the Feature in a clear, concise way. Complete during New status.
Support kubevirt csi volume cloning.
As an application developer on a KubeVirt-HCP hosted cluster I want to use CSI clone workflows when provisioning PVCs when my infrastructure storage supports it.
The observable functionality that the user now has as a result of receiving this feature. Include the anticipated primary user type/persona and which existing features, if any, will be expanded. Complete during New status.
<your text here>
A list of specific needs or objectives that a feature must deliver in order to be considered complete. Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc. Initial completion during Refinement status.
<enter general Feature acceptance here>
Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed. Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.
Deployment considerations | List applicable specific needs (N/A = not applicable) |
Self-managed, managed, or both | |
Classic (standalone cluster) | |
Hosted control planes | |
Multi node, Compact (three node), or Single node (SNO), or all | |
Connected / Restricted Network | |
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) | |
Operator compatibility | |
Backport needed (list applicable versions) | |
UI need (e.g. OpenShift Console, dynamic plugin, OCM) | |
Other (please specify) |
Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.
<your text here>
Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.
<your text here>
High-level list of items that are out of scope. Initial completion during Refinement status.
<your text here>
Provide any additional context is needed to frame the feature. Initial completion during Refinement status.
<your text here>
Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.
<your text here>
Provide information that needs to be considered and planned so that documentation will meet customer needs. If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.
<your text here>
Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.
<your text here>
Support kubevirt csi volume cloning.
As an application developer on a KubeVirt-HCP hosted cluster I want to use CSI clone workflows when provisioning PVCs when my infrastructure storage supports it.
In order to implement csi-clone in the tenant cluster, we can simply pass the csi-clone request to the infra cluster (if it supports csi-clone) and have that take care of creating the csi-clone.
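For illustration, the kind of tenant-side request this covers is a standard CSI clone PVC; the names and storage class below are illustrative:
cat << EOF | oc create -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: cloned-pvc
spec:
  storageClassName: kubevirt-csi-infra-default
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
  dataSource:
    kind: PersistentVolumeClaim
    name: source-pvc
EOF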
The goals of this feature are:
Given Microsoft's constraints on IPv4 usage, there is a pressing need to optimize IP allocation and management within Azure-hosted environments.
Interoperability Considerations
There are currently multiple ingress strategies we support for hosted cluster service endpoints (kas, nodePort, router...).
In a context of uncertainty about which use cases would be most critical to support, we initially exposed this in a flexible API that makes it possible to choose any combination of ingress strategies and endpoints.
ARO has internal restrictions on IPv4 usage. Because of this, to simplify the above and to be more cost-effective in terms of infra, we'd want to have a common shared ingress solution for the whole fleet of hosted clusters.
Current implementation reproduces https://docs.google.com/document/d/1o6kd61gBVvUtYAqTN6JqJGlUAsatmlq2mmoBkQThEFg/edit for simplicity.
This has a caveat: when the management KAS SVC changes, it would require changes to the .sh files that bind the IP to the lo interface and to the haproxies running in the dataplane.
We should find a way to either
current implementation reproduces https://docs.google.com/document/d/1o6kd61gBVvUtYAqTN6JqJGlUAsatmlq2mmoBkQThEFg/edit for simplicity
We should explore if it's possible to use one single proxy on the data-plane side. E.g.:
We could change the kube endpoint IP advertised address in the dataplane to match the management KAS SVC IP, or rely on the source IP instead of the destination IP to discriminate.
As a consumer (managed services) I want hypershift to support sharedingress
As a dev I want to transition towards a prod ready solution iteratively
A shared ingress solution PoC is merged and lets the Azure e2e pass: https://docs.google.com/document/d/1o6kd61gBVvUtYAqTN6JqJGlUAsatmlq2mmoBkQThEFg/edit#heading=h.hsujpqw67xkr
The initial goal is to start transitioning the code structure (reconcilers, APIs, helpers...) towards a shared ingress oriented solution using haproxy as presented above for simplicity.
Then we can iteratively follow up to harden security, evaluate more sophisticated alternatives and progress towards being prod ready.
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-autoscaler-priority-expander
  namespace: kube-system
data:
  priorities: |-
    10:
      - .*t2\.large.*
      - .*t3\.large.*
    50:
      - .*m4\.4xlarge.*
Documentation will need to be updated to point out the new maximum for ROSA HCP clusters, and any expectations to set with customers.
The following flag needs to be set on the cluster-autoscaler: `--expander=priority`.
The configuration is based on the values stored in a ConfigMap called `cluster-autoscaler-priority-expander`, which will be created by the user/OCM.
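A hedged sketch of wiring this together on the OpenShift ClusterAutoscaler resource, assuming its expanders field is used to pass the flag through (otherwise the flag is set directly on the autoscaler deployment); the ConfigMap shown above must exist in the namespace the autoscaler reads it from:
apiVersion: autoscaling.openshift.io/v1
kind: ClusterAutoscaler
metadata:
  name: default
spec:
  expanders:
    - Priority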
Goal:
Graduate to GA (full support) Gateway API with Istio to unify the management of cluster ingress with a common, open, expressive, and extensible API.
Description:
Gateway API is the evolution of upstream Kubernetes Ingress APIs. The upstream project is part of Kubernetes, working under SIG-NETWORK. OpenShift is contributing to the development, building a leadership position, and preparing OpenShift to support Gateway API, with Istio as our supported implementation.
The plug-able nature of the implementation of Gateway API enables support for additional and optional 3rd-party Ingress technologies.
Continue work for E2E tests for the implementation of "gatewaycontroller" in https://issues.redhat.com/browse/NE-1206
Some ideas for the test include:
Create E2E tests for the implementation of "gatewaycontroller" in https://issues.redhat.com/browse/NE-1206
Some ideas for the test include:
Networking Definition of Planned
Epic Template descriptions and documentation
Track the stories that cannot be completed before live migration GA.
These tasks shall not block the live migration GA, but we still need to get them done.
Additional information on each of the above items can be found here: Networking Definition of Planned
As the live migration process may take hours for a large cluster, the workload in the cluster may trigger cluster expansion by adding new nodes. We need to support adding new nodes while an SDN live migration is in progress.
We need to backport this to 4.15.
The SDN live migration cannot work properly in a cluster with specific configurations. CNO shall refuse to proceed with the live migration in such cases. We need to add pre-migration validation to CNO.
The live migration shall be blocked for clusters with the following configuration
The SD team manages many clusters. Metrics can help them monitor the status of many clusters at a time. Something similar has already been done for cluster upgrades; we may want to follow the same recipe.
This is Image mode on OpenShift. It uses the rpm-ostree native containers interface and not bootc but that is an implementation detail.
In the initial delivery of CoreOS Layering, it is required that administrators provide their own build environment to customize RHCOS images. That could be a traditional RHEL environment or potentially an enterprising administrator with some knowledge of OCP Builds could set theirs up on-cluster.
The primary virtue of an on-cluster build path is to continue using the cluster to manage the cluster. No external dependency, batteries-included.
On-cluster, automated RHCOS Layering builds are important for multiple reasons:
This work describes the tech preview state of On Cluster Builds. Major interfaces should be agreed upon at the end of this state.
As a cluster admin of user provided infrastructure,
when I apply the machine config that opts a pool into On Cluster Layering,
I want to also be able to remove that config and have the pool revert back to its non-layered state with the previously applied config.
As a cluster admin using on cluster layering,
when an image build has failed,
I want it to retry 3 times automatically without my intervention and show me where to find the log of the failure.
As a cluster admin,
when I enable On Cluster Layering,
I want to know that the builder image I am building with is stable and will not change unless I change it
so that I keep the same API promises as we do elsewhere in the platform.
To test:
As a cluster admin using on cluster layering,
when I try to upgrade my cluster and the Cluster Version Operator is not available,
I want the upgrade operation to be blocked.
As a cluster admin,
when I use a disconnected environment,
I want to still be able to use On Cluster Layering.
As a cluster admin using On Cluster layering,
When there has been config drift of any sort that degrades a node and I have resolved the issue,
I want it to resync without forcing a reboot.
As a cluster admin using on cluster layering,
when a pool is using on cluster layering and references an internal registry
I want that registry available on the host network so that the pool can successfully scale up
(MCO-770, MCO-578, MCO-574 )
As a cluster admin using on cluster layering,
when a pool is using on cluster layering and I want to scale up nodes,
the nodes should have the same config as the other nodes in the pool.
Maybe:
Entitlements: MCO-1097, MCO-1099
Not Likely:
As a cluster admin using on cluster layering,
when I try to upgrade my cluster,
I want the upgrade operation to succeed at the same rate as non-OCL upgrades do.
Occasionally, transient network issues will cause image builds to fail. This can be remedied by adding retry capabilities to the Buildah build and push operations within the build phase. This will allow these operations to be automatically retried without cluster admin intervention.
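A minimal sketch of what the retry could look like as a shell wrapper around the existing push step (variable names and the backoff are illustrative, not the actual build-pod script):
for attempt in 1 2 3; do
  buildah push --authfile "$AUTHFILE" "$TAG" "docker://$FINAL_IMAGE_PULLSPEC" && break
  echo "push attempt $attempt failed; retrying" >&2
  sleep 10
done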
Done When:
Description of problem:
The CurrentImagePullSecret field on the MachineOSConfig is not being consumed by the rollout process. This is evident when the designated image registry is private and the only way to pull an image is to present a secret.
How reproducible:
Always
Steps to Reproduce:
Actual results:
The node and MachineConfigPool will degrade because rpm-ostree is unable to pull the newly-built image because it does not have access to the credentials even though the MachineOSConfig has a field for them.
Expected results:
Rolling out the newly-built OS image should succeed.
Additional info:
It looks like we'll need to make the getImageRegistrySecrets() function aware of all MachineOSConfigs and pull the secrets from there. Where this could be problematic is when there are two image registries with different secrets, because the secrets are merged based on the image registry hostname. Instead, what we may want to do is have the MCD write only the contents of the referenced secret to the node's filesystem before calling rpm-ostree to consume it. This could potentially also reduce or eliminate the overall complexity introduced by getImageRegistrySecrets() while simultaneously resolving the concerns found under https://issues.redhat.com//browse/OCPBUGS-33803.
It is worth mentioning that even though we use a private image registry to test the rollout process in OpenShift CI, the reason it works is that it uses an ImageStream with which the machine-os-puller service account and its image pull secret are associated. This secret is surfaced to all of the cluster nodes by the getImageRegistrySecrets() process. So in effect, it may appear to be working when it does not work as intended. A way to test this would be to create an ImageStream in a separate namespace along with a separate pull secret and then attempt to use that ImageStream and pull secret within a MachineOSConfig.
Finally, to add another wrinkle to this problem: if a cluster admin wants to use a different final image pull secret for each MachineConfigPool, merging those will get more difficult. Assuming the image registries share the same hostname, the last secret merged wins, and that last-merged secret is the one that gets used; it may be the incorrect secret.
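A hypothetical sketch of that per-node approach (the secret name is a placeholder, and the auth.json path is an assumption about where rpm-ostree's container backend looks for registry credentials; in practice the MCD would write this file on the node directly):
$ oc get secret -n openshift-machine-config-operator <machineosconfig-pull-secret> -o go-template='{{index .data ".dockerconfigjson" | base64decode}}' > /etc/ostree/auth.json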
For the custom pod builder, we have a hardcoded dependency upon the Buildah container image (quay.io/buildah/stable:latest). This causes two problems: 1) It breaks the guarantees that OpenShift makes about everything being built and stable together since that image can change at any time. 2) This makes on-cluster builds not work in a disconnected environment since that image cannot be pulled.
It is worth mentioning that we cannot simply delegate to the image that the OpenShift Image Builder uses and use a Buildah binary there. To remedy this, we'll need to decide on and implement an approach as defined below:
As part of our container build process, we'll need to install Buildah. Overall, this seems pretty straightforward since the package registry we'd be installing from (the default package registry for a given OCP release) has the appropriate package versions for a given OCP release.
This has the downside that the MCO image size will increase as a result.
The OS base image technically has Buildah within it albeit embedded inside Podman. By using this base image, we can effectively lifecycle Buildah in lockstep with the OCP release without significant cognitive or process overhead. Basically, we'd have an e2e test that would perform a build using this image and if it passes, we can be reasonably confident that it will continue to work.
However, it is worth mentioning that I encountered significant difficulty while attempting to make this work in an unprivileged pod.
Done When:
Today VMs for a single nodepool can “clump” together on a single node after the infra cluster is updated. This is due to live migration shuffling around the VMs in ways that can result in VMs from the same nodepool being placed next to each other.
Through a combination of TopologySpreadConstraints and the De-Scheduler, it should be possible to continually redistribute VMs in a nodepool (via live migration) when clumping occurs. This will provide stronger HA guarantees for nodepools.
VMs within a nodepool should re-distribute via live migration in order to best satisfy topology spread constraints.
A list of specific needs or objectives that a feature must deliver in order to be considered complete. Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc. Initial completion during Refinement status.
<enter general Feature acceptance here>
Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed. Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.
Deployment considerations | List applicable specific needs (N/A = not applicable) |
Self-managed, managed, or both | |
Classic (standalone cluster) | |
Hosted control planes | |
Multi node, Compact (three node), or Single node (SNO), or all | |
Connected / Restricted Network | |
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) | |
Operator compatibility | |
Backport needed (list applicable versions) | |
UI need (e.g. OpenShift Console, dynamic plugin, OCM) | |
Other (please specify) |
Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.
<your text here>
Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.
<your text here>
High-level list of items that are out of scope. Initial completion during Refinement status.
<your text here>
Provide any additional context is needed to frame the feature. Initial completion during Refinement status.
<your text here>
Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.
<your text here>
Provide information that needs to be considered and planned so that documentation will meet customer needs. If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.
<your text here>
Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.
<your text here>
Today VMs for a single nodepool can “clump” together on a single node after the infra cluster is updated. This is due to live migration shuffling around the VMs in ways that can result in VMs from the same nodepool being placed next to each other.
Through a combination of TopologySpreadConstraints and the De-Scheduler, it should be possible to continually redistribute VMs in a nodepool (via live migration) when clumping occurs. This will provide stronger HA guarantees for nodepools.
VMs within a nodepool should re-distribute via live migration in order to best satisfy topology spread constraints.
ETCD backup API was delivered behind a feature gate in 4.14. This feature is to complete the work for allowing any OCP customer to benefit from the automatic etcd backup capability.
The feature introduces automated backups of the etcd database and cluster resources in OpenShift clusters, eliminating the need for user-supplied configuration. This feature ensures that backups are taken and stored on each master node from the day of cluster installation, enhancing disaster recovery capabilities.
The current method of backing up etcd and cluster resources relies on user-configured CronJobs, which can be cumbersome and prone to errors. This new feature addresses the following key issues:
Complete work to auto-provision internal PVCs when using the local PVC backup option (right now, the user needs to create the PVC before enabling the service).
Out of Scope
The feature does not include saving cluster backups to remote cloud storage (e.g., S3 Bucket), automating cluster restoration, or providing automated backups for non-self-hosted architectures like Hypershift. These could be future enhancements (see OCPSTRAT-464)
Epic Goal*
Provide automated backups of etcd saved locally on the cluster on Day 1 with no additional config from the user.
Why is this important? (mandatory)
The current etcd automated backups feature requires some configuration on the user's part to save backups to a user specified PersistentVolume.
See: https://github.com/openshift/api/blob/ba11c1587003dc84cb014fd8db3fa597a3faaa63/config/v1alpha1/types_backup.go#L46
Before the feature can be shipped as GA, we would require the capability to save backups automatically by default without any configuration. This would help all customers have an improved disaster recovery experience by always having a somewhat recent backup.
Scenarios (mandatory)
Implementation details:
One issue we need to figure out during the design of this feature is how the current API might change as it is inherently tied to the configuration of the PVC name.
See:
https://github.com/openshift/api/blob/ba11c1587003dc84cb014fd8db3fa597a3faaa63/config/v1alpha1/types_backup.go#L99
and
https://github.com/openshift/api/blob/ba11c1587003dc84cb014fd8db3fa597a3faaa63/operator/v1alpha1/types_etcdbackup.go#L44
Additionally we would need to figure out how the etcd-operator knows about the available space on local storage of the host so it can prune and spread backups accordingly.
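For reference, a minimal sketch of the tech-preview EtcdBackup resource whose PVCName field (linked above) is at the center of this question; the names used here are illustrative:
apiVersion: operator.openshift.io/v1alpha1
kind: EtcdBackup
metadata:
  name: example-backup
  namespace: openshift-etcd
spec:
  # Today a backup is tied to a user-provided PVC name; default local
  # backups would need this to become optional or auto-provisioned.
  pvcName: etcd-backup-pvc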
Dependencies (internal and external) (mandatory)
Depends on changes to the etcd-operator and the tech preview APIs
Contributing Teams(and contacts) (mandatory)
Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.
Acceptance Criteria (optional)
Upon installing a tech-preview cluster backups must be saved locally and their status and path must be visible to the user e.g on the operator.openshift.io/v1 Etcd cluster object.
An e2e test to verify that the backups are being saved locally with some default retention policy.
Drawbacks or Risk (optional)
Reasons we should consider NOT doing this such as: limited audience for the feature, feature will be superseded by other work that is planned, resulting feature will introduce substantial administrative complexity or user confusion, etc.
Done - Checklist (mandatory)
The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.
As a developer, I want to implement the logic of etcd backup so that backups are taken without configuration.
Feature description
oc-mirror v2 focuses on major enhancements that include making oc-mirror faster and more robust, introducing caching, and addressing more complex air-gapped scenarios. oc-mirror v2 is a rewritten version with three goals:
Currently in oc-mirror v2 the version of distribution/distribution being used is v3, which is not a stable release. This sub-task is to check the feasibility of using a stable version.
Customers who deploy a large number of OpenShift on OpenStack clusters want to minimise the resource requirements of their cluster control planes.
Customers deploying RHOSO (OpenShift services for OpenStack, i.e. OpenStack control plane on bare metal OpenShift) already have a bare metal management cluster capable of serving Hosted Control Planes.
We should enable self-hosted (i.e. on-prem) Hosted Control Planes to serve Hosted Control Planes to OpenShift on OpenStack clusters, with a specific focus on serving Hosted Control Planes from the RHOSO management cluster.
As an enterprise IT department and OpenStack customer, I want to provide self-managed OpenShift clusters to my internal customers with minimum cost to the business.
As an internal customer of said enterprise, I want to be able to provision an OpenShift cluster for myself using the business's existing OpenStack infrastructure.
A list of specific needs or objectives that a feature must deliver in order to be considered complete. Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc. Initial completion during Refinement status.
TBD
Anyone reviewing this Feature needs to know which deployment configurations the Feature will apply to (or not) once it's been completed. Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to-be-supported configuration) as well.
Deployment considerations | List applicable specific needs (N/A = not applicable) |
Self-managed, managed, or both | |
Classic (standalone cluster) | |
Hosted control planes | |
Multi node, Compact (three node), or Single node (SNO), or all | |
Connected / Restricted Network | |
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) | |
Operator compatibility | |
Backport needed (list applicable versions) | |
UI need (e.g. OpenShift Console, dynamic plugin, OCM) | |
Other (please specify) |
Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.
<your text here>
Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.
<your text here>
High-level list of items that are out of scope. Initial completion during Refinement status.
<your text here>
Provide any additional context that is needed to frame the feature. Initial completion during Refinement status.
<your text here>
Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.
<your text here>
Provide information that needs to be considered and planned so that documentation will meet customer needs. If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.
<your text here>
Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.
<your text here>
cloud-provider-openstack is not the only service needing access to the cloud credentials. The list also includes:
Normally this is solved by cloud-credentials-operator, but in HyperShift we don't have it. hosted-control-plane-operator needs to take care of this alone. The code goes here: https://github.com/openshift/hypershift/blob/1af078fe4b9ebd63a9b6e506f03abc9ae6ed4edd/control-plane-operator/hostedclusterconfigoperator/controllers/resources/resources.go#L1156
We also need to pass the CA here; it might be non-trivial.
This will allow configuring CAPO to create the nodepool ports that allow Ingress.
Note: this is a first iteration and we might improve how we do things later.
For example, I plan to work in openshift/release for this one. In the future we might add it directly to the HCP e2e framework.
CAPO is also responsible for creating worker nodes. Worker nodes are represented by NodePool CRDs in HyperShift. We need a way to translate NodePools into MachineTemplates. It goes here: https://github.com/openshift/hypershift/pull/3243/files#diff-c666a2a9dc48b0d3de33cc8fa3a054f853671f0428f9d56b263750dfc6d26e00R2536-R2538
This is a container Epic for tasks which we know need to be done for Tech Preview but which we don't intend to do now. It needs to be groomed before it is useful for planning.
When the management cluster runs on AWS, make sure we update the DNS record for *.apps, so ingress can work out of the box.
We don't need to create another service for Ingress, so we can save a FIP.
Currently, CCM is not configured with a floating IP network:
oc get cm openstack-cloud-config -n clusters-openstack -o yaml
We need to change that because if a cloud has multiple external networks, CCM needs to know where to create the Floating IPs, especially since the user can specify which external network they want to use with the --openstack-external-network argument of the cluster create CLI.
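A hedged sketch of the kind of change needed, assuming the standard cloud-provider-openstack [LoadBalancer] option and treating the data key and UUID as illustrative:
apiVersion: v1
kind: ConfigMap
metadata:
  name: openstack-cloud-config
  namespace: clusters-openstack
data:
  cloud.conf: |
    [Global]
    # ...existing auth/credential settings...
    [LoadBalancer]
    # Tell CCM which external network to allocate Floating IPs from,
    # matching the network passed via --openstack-external-network.
    floating-network-id = 71c3c866-1111-2222-3333-444444444444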
HyperShift should be able to deploy the minimum useful OpenShift cluster on OpenStack. This is the minimum requirement to be able to test it. It is not sufficient for GA.
CAPO has a dependency on ipam from cluster-api, so we must inject these assets as well.
Only the beta1 version is used in CAPO.
Combining the different tasks in this EPIC, we use that Jira task to track the first PR that is being submitted to Hypershift.
In HyperShift, cluster-api is responsible for deploying cluster resources on the cloud and the Machines for worker nodes. This means we need to configure and deploy CAPO.
As CAPO is responsible for deploying OpenStack resources for each Hosted Cluster we need a way to translate HostedCluster into an OpenStackCluster, so that CAPO will take care of creating everything for the cluster. This goes here: https://github.com/openshift/hypershift/pull/3243/files#diff-0de67b02a8275f7a8227b3af5101786e70e5c651b852a87f6160f6563a9071d6R28-R31
The probably tedious part is SGs (security groups); let's make sure we understand this.
Support using a proxy in oc-mirror.
We have customers who want to use Docker Registries as proxies / pull-through caches.
This means that customers would need a way to get the ICSP/IDMS/ITMS and image list, which seems relevant to the "generating mapping files" for "V2 tooling". We would like to make sure this is addressed in your use cases.
From our IBM sync
"We have customers who want to use Docker Registries as proxies / pull-through cache's. This means that customers would need a way to get the ICSP/IDMS/ITMS and image list which seems relevant to the "generating mapping files" for “V2 tooling”. Would like to make sure this is addressed in your use cases."
Description of problem:
When recovering signatures for releases, the http connection doesn't use the system proxy configuration
Version-Release number of selected component (if applicable):
WARNING: This version information is deprecated and will be replaced with the output from --short. Use --output=yaml|json to get the full version.
Client Version: version.Info
How reproducible:
Always
Image set config:
kind: ImageSetConfiguration
apiVersion: mirror.openshift.io/v2alpha1
mirror:
  platform:
    graph: true
    channels:
    - name: stable-4.16
    - name: stable-4.15
Steps to Reproduce:
1. Run oc-mirror with above imagesetconfig in mirror to mirror in an env that requires proxy setup
Actual results:
2024/07/15 14:02:11 [WARN] : ⚠️ --v2 flag identified, flow redirected to the oc-mirror v2 version. This is Tech Preview, it is still under development and it is not production ready. 2024/07/15 14:02:11 [INFO] : 👋 Hello, welcome to oc-mirror 2024/07/15 14:02:11 [INFO] : ⚙️ setting up the environment for you... 2024/07/15 14:02:11 [INFO] : 🔀 workflow mode: mirrorToMirror 2024/07/15 14:02:11 [INFO] : 🕵️ going to discover the necessary images... 2024/07/15 14:02:11 [INFO] : 🔍 collecting release images... I0715 14:02:11.770186 2475426 core-cincinnati.go:477] Using proxy 10.18.7.5:3128 to request updates from https://api.openshift.com/api/upgrades_info/v1/graph?arch=amd64&channel=stable-4.16&id=3808de83-dfe4-42f6-8d5b-196ed1b5bbc6 2024/07/15 14:02:12 [INFO] : detected minimum version as 4.16.1 2024/07/15 14:02:12 [INFO] : detected minimum version as 4.16.1 I0715 14:02:12.321748 2475426 core-cincinnati.go:477] Using proxy 10.18.7.5:3128 to request updates from https://api.openshift.com/api/upgrades_info/v1/graph?arch=amd64&arch=amd64&channel=stable-4.16&channel=stable-4.16&id=3808de83-dfe4-42f6-8d5b-196ed1b5bbc6&id=3808de83-dfe4-42f6-8d5b-196ed1b5bbc6 I0715 14:02:12.485330 2475426 core-cincinnati.go:477] Using proxy 10.18.7.5:3128 to request updates from https://api.openshift.com/api/upgrades_info/v1/graph?arch=amd64&arch=amd64&arch=amd64&channel=stable-4.16&channel=stable-4.16&channel=stable-4.15&id=3808de83-dfe4-42f6-8d5b-196ed1b5bbc6&id=3808de83-dfe4-42f6-8d5b-196ed1b5bbc6&id=3808de83-dfe4-42f6-8d5b-196ed1b5bbc6 2024/07/15 14:02:12 [INFO] : detected minimum version as 4.15.20 2024/07/15 14:02:12 [INFO] : detected minimum version as 4.15.20 I0715 14:02:12.844366 2475426 core-cincinnati.go:477] Using proxy 10.18.7.5:3128 to request updates from https://api.openshift.com/api/upgrades_info/v1/graph?arch=amd64&arch=amd64&arch=amd64&arch=amd64&channel=stable-4.16&channel=stable-4.16&channel=stable-4.15&channel=stable-4.15&id=3808de83-dfe4-42f6-8d5b-196ed1b5bbc6&id=3808de83-dfe4-42f6-8d5b-196ed1b5bbc6&id=3808de83-dfe4-42f6-8d5b-196ed1b5bbc6&id=3808de83-dfe4-42f6-8d5b-196ed1b5bbc6 I0715 14:02:13.115004 2475426 core-cincinnati.go:477] Using proxy 10.18.7.5:3128 to request updates from https://api.openshift.com/api/upgrades_info/v1/graph?arch=amd64&channel=stable-4.15&id=00000000-0000-0000-0000-000000000000 I0715 14:02:13.784795 2475426 core-cincinnati.go:477] Using proxy 10.18.7.5:3128 to request updates from https://api.openshift.com/api/upgrades_info/v1/graph?arch=amd64&arch=amd64&channel=stable-4.15&channel=stable-4.15&id=00000000-0000-0000-0000-000000000000&id=00000000-0000-0000-0000-000000000000&version=4.15.20 I0715 14:02:13.965936 2475426 core-cincinnati.go:477] Using proxy 10.18.7.5:3128 to request updates from https://api.openshift.com/api/upgrades_info/v1/graph?arch=amd64&arch=amd64&arch=amd64&channel=stable-4.15&channel=stable-4.15&channel=stable-4.16&id=00000000-0000-0000-0000-000000000000&id=00000000-0000-0000-0000-000000000000&id=00000000-0000-0000-0000-000000000000&version=4.15.20 I0715 14:02:14.136625 2475426 core-cincinnati.go:477] Using proxy 10.18.7.5:3128 to request updates from https://api.openshift.com/api/upgrades_info/v1/graph?arch=amd64&arch=amd64&arch=amd64&arch=amd64&channel=stable-4.15&channel=stable-4.15&channel=stable-4.16&channel=stable-4.16&id=00000000-0000-0000-0000-000000000000&id=00000000-0000-0000-0000-000000000000&id=00000000-0000-0000-0000-000000000000&id=00000000-0000-0000-0000-000000000000&version=4.15.20&version=4.16.1 W0715 14:02:14.301982 2475426 
core-cincinnati.go:282] No upgrade path for 4.15.20 in target channel stable-4.16 2024/07/15 14:02:14 [ERROR] : http request Get "https://mirror.openshift.com/pub/openshift-v4/signatures/openshift/release/sha256=c17d4489c1b283ee71c76dda559e66a546e16b208a57eb156ef38fb30098903a/signature-1": dial tcp: lookup mirror.openshift.com: no such host panic: runtime error: invalid memory address or nil pointer dereference [signal SIGSEGV: segmentation violation code=0x1 addr=0x10 pc=0x3d02c56] goroutine 1 [running]: github.com/openshift/oc-mirror/v2/internal/pkg/release.SignatureSchema.GenerateReleaseSignatures({{0x54bb930, 0xc000c80738}, {{{0x4c6edb1, 0x15}, {0xc00067ada0, 0x1c}}, {{{...}, {...}, {...}, {...}, ...}, ...}}, ...}, ...) /go/src/github.com/openshift/oc-mirror/vendor/github.com/openshift/oc-mirror/v2/internal/pkg/release/signature.go:96 +0x656 github.com/openshift/oc-mirror/v2/internal/pkg/release.(*CincinnatiSchema).GetReleaseReferenceImages(0xc000fdc000, {0x54aef28, 0x74cf1c0}) /go/src/github.com/openshift/oc-mirror/vendor/github.com/openshift/oc-mirror/v2/internal/pkg/release/cincinnati.go:208 +0x70a github.com/openshift/oc-mirror/v2/internal/pkg/release.(*LocalStorageCollector).ReleaseImageCollector(0xc000184e00, {0x54aef28, 0x74cf1c0}) /go/src/github.com/openshift/oc-mirror/vendor/github.com/openshift/oc-mirror/v2/internal/pkg/release/local_stored_collector.go:62 +0x47f github.com/openshift/oc-mirror/v2/internal/pkg/cli.(*ExecutorSchema).CollectAll(0xc000ace000, {0x54aef28, 0x74cf1c0}) /go/src/github.com/openshift/oc-mirror/vendor/github.com/openshift/oc-mirror/v2/internal/pkg/cli/executor.go:942 +0x115 github.com/openshift/oc-mirror/v2/internal/pkg/cli.(*ExecutorSchema).RunMirrorToMirror(0xc000ace000, 0xc0007a5800, {0xc000f3f038?, 0x17dcbb3?, 0x2000?}) /go/src/github.com/openshift/oc-mirror/vendor/github.com/openshift/oc-mirror/v2/internal/pkg/cli/executor.go:748 +0x73 github.com/openshift/oc-mirror/v2/internal/pkg/cli.(*ExecutorSchema).Run(0xc000ace000, 0xc0004f9730?, {0xc0004f9730?, 0x0?, 0x0?}) /go/src/github.com/openshift/oc-mirror/vendor/github.com/openshift/oc-mirror/v2/internal/pkg/cli/executor.go:443 +0x1b6 github.com/openshift/oc-mirror/v2/internal/pkg/cli.NewMirrorCmd.func1(0xc000ad0e00?, {0xc0004f9730, 0x1, 0x7}) /go/src/github.com/openshift/oc-mirror/vendor/github.com/openshift/oc-mirror/v2/internal/pkg/cli/executor.go:203 +0x32a github.com/spf13/cobra.(*Command).execute(0xc0007a5800, {0xc000052110, 0x7, 0x7}) /go/src/github.com/openshift/oc-mirror/vendor/github.com/spf13/cobra/command.go:987 +0xaa3 github.com/spf13/cobra.(*Command).ExecuteC(0xc0007a5800) /go/src/github.com/openshift/oc-mirror/vendor/github.com/spf13/cobra/command.go:1115 +0x3ff github.com/spf13/cobra.(*Command).Execute(0x72d7738?) /go/src/github.com/openshift/oc-mirror/vendor/github.com/spf13/cobra/command.go:1039 +0x13 main.main() /go/src/github.com/openshift/oc-mirror/cmd/oc-mirror/main.go:10 +0x18
Expected results:
Additional info:
Stop using the openshift/installer-aro repo during installation of ARO cluster. installer-aro is a fork of openshift/installer with carried patches. Currently it is vendored into openshift/installer-aro-wrapper in place of the upstream installer.
Maintaining this fork requires considerable resources from the ARO team and results in delays in offering new OCP releases through ARO. Removing the fork will eliminate the work involved in keeping it up to date.
https://docs.google.com/document/d/1xBdl2rrVv0EX5qwhYhEQiCLb86r5Df6q0AZT27fhlf8/edit?usp=sharing
It appears that the only work required to complete this is to move the additional assets that installer-aro adds for the purpose of adding data to the ignition files. These changes can be directly added to the ignition after it is generated by the wrapper. This is the same thing that would be accomplished by OCPSTRAT-732, but that ticket involves adding a Hive API to do this in a generic way.
The OCP Installer team will contribute code changes to installer-aro-wrapper necessary to eliminate the fork. The ARO team will review and test changes.
The fork repo is no longer vendored in installer-aro-wrapper.
Add results here once the Initiative is started. Recommend discussions & updates once per quarter in bullets.
The installer makes heavy use of its data/data directory, which contains hundreds of files in various subdirectories that are mostly used for inserting into ignition files. From these files, autogenerated code is created that embeds their contents in the installer binary.
Unfortunately, subdirectories that do not contain .go files are not regarded as Go packages and are therefore not included when building the installer as a library: https://go.dev/wiki/Modules#some-needed-files-may-not-be-present-in-populated-vendor-directory
This is currently handled in the installer fork repo by deleting the compile-time autogeneration and instead doing a one-time autogeneration that is checked in to the repo: https://github.com/openshift/installer-aro/pull/27/commits/26a5ed5afe4df93b6dde8f0b34a1f6b8d8d3e583
Since this does not exist in the upstream installer, we will need some way to copy the data/data associated with the current installer version into the wrapper repo - we should probably encapsulate this in a make vendor target. The wiki page above links to https://github.com/goware/modvendor which unfortunately doesn't work, because it assumes you know the file extensions of all of the files (e.g. .c, .h), and it can't handle directory names matching the glob. We could probably easily fix this by forking the tool and teaching it to ignore directories in the source. Alternatively, John Hixson has a script that can do something similar.
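A rough sketch of what such a make vendor step could do, assuming we copy from the Go module cache (which, unlike vendor/, keeps non-Go files); the destination path and flags are illustrative:
# Locate the installer module on disk and copy its data/data tree into the
# wrapper repo so it can be embedded at build time.
installer_dir=$(GOFLAGS=-mod=mod go list -m -f '{{.Dir}}' github.com/openshift/installer)
rm -rf pkg/installer/data
cp -R "${installer_dir}/data/data" pkg/installer/data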
Currently the Azure client can only be mocked in unit tests of the pkg/asset/installconfig/azure package. Using the mockable interface consistently and adding a public interface to set it up will allow other packages to write unit tests for code involving the Azure client.
Add support for the GCP N4 Machine Series to be used as Control Plane and Compute Nodes when deploying OpenShift on Google Cloud.
As a user, I want to deploy OpenShift on Google Cloud using N4 Machine Series for the Control Plane and Compute Node so I can take advantage of these new Machine types
OpenShift can be deployed in Google Cloud using the new N4 Machine Series for the Control Plane and Compute Nodes
Deployment considerations | List applicable specific needs (N/A = not applicable) |
Self-managed, managed, or both | both |
Classic (standalone cluster) | |
Hosted control planes | |
Multi node, Compact (three node), or Single node (SNO), or all | all |
Connected / Restricted Network | |
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) | |
Operator compatibility | |
Backport needed (list applicable versions) | |
UI need (e.g. OpenShift Console, dynamic plugin, OCM) | |
Other (please specify) |
Google has made the N4 Machine Series available on their cloud offering. These Machine Series use "hyperdisk-balanced" disks for the boot device, which are not currently supported.
The documentation will be updated to add the new disk type that needs to be supported as part of this enablement. The N4 Machine Series will also be added to the tested Machine types for Google Cloud when deploying OpenShift.
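An illustrative install-config.yaml fragment for this enablement (the machine type and disk type values below are examples, not recommendations):
controlPlane:
  platform:
    gcp:
      type: n4-standard-4
      osDisk:
        # hyperdisk-balanced is the boot disk type required by N4
        diskType: hyperdisk-balanced
compute:
- name: worker
  platform:
    gcp:
      type: n4-standard-2
      osDisk:
        diskType: hyperdisk-balanced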
An elevator pitch (value statement) that describes the Feature in a clear, concise way. Complete during New status.
Merge the CSI driver operator into csi-operator repo and re-use asset generator and CSI operator code there.
The observable functionality that the user now has as a result of receiving this feature. Include the anticipated primary user type/persona and which existing features, if any, will be expanded. Complete during New status.
Maintaining a separate CSI driver operator repo is hard, especially when dealing with CVEs and library bumps. In addition, we could share even more code when moving all CSI driver operators into a single repo. Having a common repo across drivers will ease the maintenance burden.
A list of specific needs or objectives that a feature must deliver in order to be considered complete. Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc. Initial completion during Refinement status.
As cluster admin, I upgrade my cluster to a version with this epic implemented and I do not see any change; the CSI driver works the same as before. (Some pods, their containers or services may get renamed during the upgrade process).
As OCP developer, I have 1 less repo to worry about when fixing a CVE / bumping library-go or Kubernetes libraries.
Anyone reviewing this Feature needs to know which deployment configurations the Feature will apply to (or not) once it's been completed. Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to-be-supported configuration) as well.
Deployment considerations | List applicable specific needs (N/A = not applicable) |
Self-managed, managed, or both | yes |
Classic (standalone cluster) | yes |
Hosted control planes | all |
Multi node, Compact (three node), or Single node (SNO), or all | all |
Connected / Restricted Network | all |
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) | all |
Operator compatibility | |
Backport needed (list applicable versions) | no |
UI need (e.g. OpenShift Console, dynamic plugin, OCM) | no |
Other (please specify) |
Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.
As cluster admin, I upgrade my cluster to a version with this epic implemented and I do not see any change; the CSI driver works the same as before. (Some pods, their containers or services may get renamed during the upgrade process).
As OCP developer, I have 1 less repo to worry about when fixing a CVE / bumping library-go or Kubernetes libraries.
Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.
<your text here>
High-level list of items that are out of scope. Initial completion during Refinement status.
N/A includes all the CSI operators Red Hat manages as part of OCP
Provide any additional context that is needed to frame the feature. Initial completion during Refinement status.
This effort started with the CSI operators that we included for HCP; we want to align all CSI operators to use the same approach in order to limit maintenance efforts.
Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.
Not customer facing, this should not introduce any regression.
Provide information that needs to be considered and planned so that documentation will meet customer needs. If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.
No doc needed
Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.
N/A, it's purely tech debt / internal
Epic Goal*
Merge the CSI driver operator into csi-operator repo and re-use asset generator and CSI operator code there.
Why is this important? (mandatory)
Maintaining a separate CSI driver operator repo is hard, especially when dealing with CVEs and library bumps. In addition, we could share even more code when moving all CSI driver operators into a single repo.
Scenarios (mandatory)
As cluster admin, I upgrade my cluster to a version with this epic implemented and I do not see any change; the CSI driver works the same as before. (Some pods, their containers or services may get renamed during the upgrade process).
As OCP developer, I have 1 less repo to worry about when fixing a CVE / bumping library-go or Kubernetes libraries.
Note: we do not plan to do any changes for HyperShift. The EFS CSI driver will still fully run in the guest cluster, including its control plane.
Dependencies (internal and external) (mandatory)
None, this can be done just by the storage team and independently of other operators / features.
Contributing Teams(and contacts) (mandatory)
Acceptance Criteria (optional)
Provide some (testable) examples of how we will know if we have achieved the epic goal.
Drawbacks or Risk (optional)
Reasons we should consider NOT doing this such as: limited audience for the feature, feature will be superseded by other work that is planned, resulting feature will introduce substantial administrative complexity or user confusion, etc.
Done - Checklist (mandatory)
The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.
Generally speaking, customers and partners should not be installing packages client-side, i.e. `rpm-ostree install $pkg` directly on the nodes. It's not officially supported outside of troubleshooting situations, but the documentation is not very explicit on this point and we have anecdotal data that customers and partners do in fact install packages directly on hosts.
Adding some telemetry to help understand how common this is among data-reporting clusters. Hopefully such data will help us understand how important it is to preserve this ability in the bootc-world. While it's not a pattern we want to encourage, we should be careful about dropping it without considering how to avoid breaking users' clusters in unexpected ways.
Understand what % of machines (or a proxy thereof) have locally layered packages which aren't CoreOS extensions.
This needs to be backported to 4.14 so we have a better sense of the fleet as it is.
4.12 might be useful as well, but is optional.
Why not simply block upgrades if there are locally layered packages?
That is indeed an option. This card is only about gathering data.
Some customers are known to layer packages locally but it's worse if the issue is a third party integration. In such a case, if the add-on breaks, the customer will call the 3rd party first because that's what appears to be broken. It may be a long, undelightful trip to get to a satisfying resolution. If they are blocked on upgrade due to that 3rd party integration they may not be able to upgrade the OCP y-version. That could be a lengthy delay.
Description copied from attached feature card: https://issues.redhat.com/browse/OCPSTRAT-1521
Generally speaking, customers and partners should not be installing packages client-side, i.e. `rpm-ostree install $pkg` directly on the nodes. It's not officially supported outside of troubleshooting situations, but the documentation is not very explicit on this point and we have anecdotal data that customers and partners do in fact install packages directly on hosts.
Adding some telemetry to help understand how common this is among data-reporting clusters. Hopefully such data will help us understand how important it is to preserve this ability in the bootc-world. While it's not a pattern we want to encourage, we should be careful about dropping it without considering how to avoid breaking users' clusters in unexpected ways.
Understand what % of machines (or a proxy thereof) have locally layered packages which aren't CoreOS extensions.
This needs to be backported to 4.14 so we have a better sense of the fleet as it is.
4.12 might be useful as well, but is optional.
Why not simply block upgrades if there are locally layered packages?
That is indeed an option. This card is only about gathering data.
Some customers are known to layer packages locally but it's worse if the issue is a third party integration. In such a case, if the add-on breaks, the customer will call the 3rd party first because that's what appears to be broken. It may be a long, undelightful trip to get to a satisfying resolution. If they are blocked on upgrade due to that 3rd party integration they may not be able to upgrade the OCP y-version. That could be a lengthy delay.
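As a rough illustration of the signal being gathered, this is one way to inspect a node by hand, assuming rpm-ostree's JSON status exposes the layered package list (filtering out CoreOS extensions is omitted here):
# List packages layered on the booted deployment; a non-empty list that is
# not just CoreOS extensions is what the telemetry is meant to count.
rpm-ostree status --json \
  | jq '[.deployments[] | select(.booted)][0]["requested-packages"]'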
Improve the cluster expansion with the agent workflow added in OpenShift 4.16 (TP) and OpenShift 4.17 (GA) with:
Improve the user experience and functionality of the commands to add nodes to clusters using the image creation functionality.
In order to cover simpler scenarios (i.e. adding just one node without any static networking configuration), it could be useful for the user to provide the minimum required input via option flags on the command line rather than providing the full-fledged nodes-config.yaml file.
Internally, the oc command will take care of always generating the required nodes-config.yaml to be passed to the node-joiner tool.
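For illustration, the kind of minimal nodes-config.yaml the command could generate from flags for a single node without static networking (host values are placeholders):
hosts:
- hostname: extra-worker-0
  interfaces:
  - name: eth0
    macAddress: "00:ef:44:21:e6:a5"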
An elevator pitch (value statement) that describes the Feature in a clear, concise way. Complete during New status.
A set of capabilities needs to be added to the HyperShift Operator to enable AWS Shared-VPC deployment for ROSA w/ HCP.
The observable functionality that the user now has as a result of receiving this feature. Include the anticipated primary user type/persona and which existing features, if any, will be expanded. Complete during New status.
Build capabilities into HyperShift Operator to enable AWS Shared-VPC deployment for ROSA w/ HCP.
A list of specific needs or objectives that a feature must deliver in order to be considered complete. Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc. Initial completion during Refinement status.
Antoni Segura Puimedon Please help with providing what Hypershift will need on the OCPSTRAT side.
Anyone reviewing this Feature needs to know which deployment configurations the Feature will apply to (or not) once it's been completed. Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to-be-supported configuration) as well.
Deployment considerations | List applicable specific needs (N/A = not applicable) |
Self-managed, managed, or both | (perhaps) both |
Classic (standalone cluster) | |
Hosted control planes | yes |
Multi node, Compact (three node), or Single node (SNO), or all | |
Connected / Restricted Network | |
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) | x86_64 and Arm |
Operator compatibility | |
Backport needed (list applicable versions) | 4.14+ |
UI need (e.g. OpenShift Console, dynamic plugin, OCM) | no (this is an advanced feature not being exposed via web-UI elements) |
Other (please specify) | ROSA w/ HCP |
Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.
<your text here>
Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.
<your text here>
High-level list of items that are out of scope. Initial completion during Refinement status.
<your text here>
Provide any additional context that is needed to frame the feature. Initial completion during Refinement status.
<your text here>
Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.
<your text here>
Provide information that needs to be considered and planned so that documentation will meet customer needs. If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.
<your text here>
Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.
<your text here>
"Shared VPCs" are a unique AWS infrastructure design: https://docs.aws.amazon.com/vpc/latest/userguide/vpc-sharing.html
See prior work/explanations/etc here: https://issues.redhat.com/browse/SDE-1239
Summary is that in a Shared VPC environment, a VPC is created in Account A and shared to Account B. The owner of Account B wants to create a ROSA cluster, however Account B does not have permissions to create a private hosted zone in the Shared VPC. So they have to ask Account A to create the private hosted zone and link it to the Shared VPC. OpenShift then needs to be able to accept the ID of that private hosted zone for usage instead of creating the private hosted zone itself.
QE should have some environments or testing scripts available to test the Shared VPC scenario
The AWS endpoint controller in the CPO currently uses the control plane operator role to create the private link endpoint for the hosted cluster as well as the corresponding dns records in the hypershift.local hosted zone. If a role is created to allow it to create that vpc endpoint in the vpc owner's account, the controller would have to explicitly assume the role so it can create the vpc endpoint, and potentially a separate role for populating dns records in the hypershift.local zone.
The users would need to create a custom policy to enable this
Add the necessary API fields to support a Shared VPC infrastructure, and enable development/testing of Shared VPC support by adding the Shared VPC capability to the hypershift CLI.
Currently the same SG is used for both workers and VPC endpoint. Create a separate SG for the VPC endpoint and only open the ports necessary on each.
This feature aims to comprehensively refactor and standardize various components across HCP, ensuring consistency, maintainability, and reliability. The overarching goal is to increase customer satisfaction by increasing speed to market and to save engineering budget by reducing incidents/bugs. This will be achieved by reducing technical debt, improving code quality, and simplifying the developer experience across multiple areas, including CLI consistency, NodePool upgrade mechanisms, networking flows, and more. By addressing these areas holistically, the project aims to create a more sustainable and scalable codebase that is easier to maintain and extend.
Over time, the HyperShift project has grown organically, leading to areas of redundancy, inconsistency, and technical debt. This comprehensive refactor and standardization effort is a response to these challenges, aiming to improve the project's overall health and sustainability. By addressing multiple components in a coordinated way, the goal is to set a solid foundation for future growth and development.
Ensure all relevant project documentation is updated to reflect the refactored components, new abstractions, and standardized workflows.
This overarching feature is designed to unify and streamline the HCP project, delivering a more consistent, maintainable, and reliable platform for developers, operators, and users.
Improve the consistency and reliability of APIs by enforcing immutability and clarifying service publish strategy support.
As an HC consumer, I expect the API UX to be meaningful and coherent.
i.e. attempts to change immutable fields should fail at the API level when possible
ServicePublishingStrategy Name and type should be made immutable via CEL.
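A minimal sketch of the CRD-level CEL transition rule that would back this, using the standard x-kubernetes-validations mechanism (the message text is illustrative):
# Attached to the servicePublishingStrategy name/type schema properties.
x-kubernetes-validations:
- rule: "self == oldSelf"
  message: "servicePublishingStrategy name and type are immutable"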
This requires/does not require a design proposal.
This requires/does not require a feature gate.
Placeholder for linking all stories aimed at improving HyperShift CI:
Description of problem:
HyperShift e2e tests will start out with jUnit available but end up without it, making it hard to read the results.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
We can improve the robustness, maintainability, and developer experience of asynchronous assertions in end-to-end tests by adhering to a couple rules:
Description of problem:
HyperShift E2E tests have so many files in the artifacts buckets in GCS that the pages in Deck load super slowly. Using a .tar.gz for must-gather content like the OCP E2Es do will improve this significantly.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Our tests are pretty chatty today. This makes it exceedingly difficult to determine what's going on and, when something failed, what that was. We need to do a full pass through them to enact changes to the following ends:
As a dev I want the base code to be easier to read, maintain and test
If devs don't have a healthy dev environment, the project won't go anywhere and the business won't make $$.
A common concern with dealing with escalations/incidents in Managed OpenShift Hosted Control Planes is the resolution time incurred when the fix needs to be delivered in a component of the solution that ships within the OpenShift release payload. This is because OpenShift's release payloads:
This feature seeks to provide mechanisms that bring the upper time bound for delivering such fixes in line with the current HyperShift Operator <24h expectation.
A list of specific needs or objectives that a feature must deliver in order to be considered complete. Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc. Initial completion during Refinement status.
Anyone reviewing this Feature needs to know which deployment configurations the Feature will apply to (or not) once it's been completed. Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to-be-supported configuration) as well.
Deployment considerations | List applicable specific needs (N/A = not applicable) |
Self-managed, managed, or both | managed (ROSA and ARO) |
Classic (standalone cluster) | No |
Hosted control planes | Yes |
Multi node, Compact (three node), or Single node (SNO), or all | All supported ROSA/HCP topologies |
Connected / Restricted Network | All supported ROSA/HCP topologies |
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) | All supported ROSA/HCP topologies |
Operator compatibility | CPO and Operators depending on it |
Backport needed (list applicable versions) | TBD |
UI need (e.g. OpenShift Console, dynamic plugin, OCM) | No |
Other (please specify) | No |
Discussed previously during incident calls. Design discussion document
SOP needs to be defined for:
Acceptance criteria:
The OpenShift IPsec implementation will be enhanced for a growing set of enterprise use cases, and for larger scale deployments.
The OpenShift IPsec implementation was originally built for purpose-driven use cases from telco NEPs, but was also useful for a specific set of other customer use cases outside of that context. As customer adoption grew and it was adopted by some of the largest (by number of cluster nodes) deployments in the field, it became obvious that some redesign is necessary in order to continue to deliver enterprise-grade IPsec, for both East-West and North-South traffic, and for some of our most demanding customer deployments.
Key enhancements include observability and blocked traffic across paths if IPsec encryption is not functioning properly.
Requirement | Notes | isMvp? |
---|---|---|
CI - MUST be running successfully with test automation | This is a requirement for ALL features. | YES |
Release Technical Enablement | Provide necessary release enablement details and documents. | YES |
The OpenShift IPsec feature is fundamental to customer deployments for ensuring that all traffic between cluster nodes (East-West) and between cluster nodes and external-to-the-cluster entities that also are configured for IPsec (North-South) is encrypted by default. This encryption must scale to the largest of deployments.
Questions to be addressed:
Have Hosted Cluster's cloud resource cleanup be directly managed by HyperShift instead of delegating to the operators that run in the Hosted Control Plane so that we can achieve better SLO performance and more control over what fails to delete.
Anyone reviewing this Feature needs to know which deployment configurations the Feature will apply to (or not) once it's been completed. Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to-be-supported configuration) as well.
Deployment considerations | List applicable specific needs (N/A = not applicable) |
Self-managed, managed, or both | both |
Classic (standalone cluster) | no |
Hosted control planes | yes |
Multi node, Compact (three node), or Single node (SNO), or all | All supported Hosted Control Planes topologies and configurations |
Connected / Restricted Network | All supported Hosted Control Planes topologies and configurations |
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) | All supported Hosted Control Planes topologies and configurations |
Operator compatibility | N/A |
Backport needed (list applicable versions) | No |
UI need (e.g. OpenShift Console, dynamic plugin, OCM) | Maybe the failure to delete resources could be shown in the console. |
Other (please specify) |
The OpenShift installer and Hive manage it this way
We need to come up with the right level of granularity for the emitted metrics and the right UX to show it
The metrics and the UX need to be documented. An SOP for tracking failures should be written.
ROSA/HCP and ARO/HCP
As a Hosted Cluster admin, I want to be able to:
so that I can achieve
Service provider achieves
Description of criteria:
Cloud resource deletion throttling detection
OpenShift relies on internal certificates for communication between components, with automatic rotations ensuring security. For critical components like the API server, rotations occur via a rollout process, replacing certificates one instance at a time.
In clusters with high transaction rates and in SNO deployments, this can lead to transient errors for in-flight transactions during the transition.
This feature ensures seamless TLS certificate rotations in OpenShift, eliminating downtime for the Kubernetes API server during certificate updates, even under heavy loads or in SNO deployments.
Epic Goal*
What is our purpose in implementing this? What new capability will be available to customers?
Why is this important? (mandatory)
What are the benefits to the customer or Red Hat? Does it improve security, performance, supportability, etc.? Why is this work a priority?
Scenarios (mandatory)
Provide details for user scenarios including actions to be performed, platform specifications, and user personas.
Dependencies (internal and external) (mandatory)
What items must be delivered by other teams/groups to enable delivery of this epic.
Contributing Teams(and contacts) (mandatory)
Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.
Acceptance Criteria (optional)
Provide some (testable) examples of how we will know if we have achieved the epic goal.
Drawbacks or Risk (optional)
Reasons we should consider NOT doing this such as: limited audience for the feature, feature will be superseded by other work that is planned, resulting feature will introduce substantial administrative complexity or user confusion, etc.
Done - Checklist (mandatory)
The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.
Provide a way to install NodePools with varied architectures on the Agent platform.
This feature is important to enable workloads with different architectures (IBM Power/Z in this case) within the same Hosted Cluster.
Development (DEV):
Continuous Integration (CI):
Quality Engineering (QE):
Provide a way to install NodePools with varied architectures on the Agent platform.
This feature is important to enable workloads with different architectures (IBM Power/Z in this case) within the same Hosted Cluster.
Development (DEV):
Continuous Integration (CI):
Quality Engineering (QE):
Add a doc to explain the steps for how we can have heterogeneous node pools on the agent platform.
Goal:
Provide a Technical Preview of Gateway API with Istio to unify the management of cluster ingress with a common, open, expressive, and extensible API.
Description:
Gateway API is the evolution of upstream Kubernetes Ingress APIs. The upstream project is part of Kubernetes, working under SIG-NETWORK. OpenShift is contributing to the development, building a leadership position, and preparing OpenShift to support Gateway API, with Istio as our supported implementation.
The pluggable nature of the implementation of Gateway API enables support for additional, optional 3rd-party Ingress technologies.
At its core, OpenShift's implementation of Gateway API will be based on the existing Cluster Ingress Operator and OpenShift Service Mesh (OSSM). The Ingress Operator will manage the Gateway API CRDs (gatewayclasses, gateways, httproutes), install and configure OSSM, and configure DNS records for gateways. OSSM will manage the Istio and Envoy deployments for gateways and configure them based on the associated httproutes. Although OSSM in its normal configuration does support service mesh, the Ingress Operator will configure OSSM without service mesh features enabled; for example, using Gateway API will not require the use of sidecar proxies. Istio will be configured specifically to support Gateway API for cluster ingress. See the gateway-api-with-cluster-ingress-operator enhancement proposal for more details.
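For orientation, a minimal sketch of the Gateway API objects involved (the gatewayclass name, namespaces, and hostname are placeholders, not product defaults):
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: example-gateway
  namespace: openshift-ingress
spec:
  gatewayClassName: example-gatewayclass   # assumed name
  listeners:
  - name: http
    protocol: HTTP
    port: 80
    allowedRoutes:
      namespaces:
        from: All
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: example-route
spec:
  parentRefs:
  - name: example-gateway
    namespace: openshift-ingress
  hostnames:
  - "app.apps.example.com"
  rules:
  - backendRefs:
    - name: example-app
      port: 8080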
Problem: As an administrator, I would like to securely expose cluster resources to remote clients and services while providing a self-service experience to application developers.
Tech Preview: A feature is implemented as Tech Preview so that developers can issue an update to the Dev Preview MVP and:
Dependencies (internal and external)
As a user, I want to be able to automatically recover from a GWAPI CRD being deleted, instead of manually re-adding the CRD.
Currently, if you delete one or more of the GWAPI CRDs, they do not get recreated until you restart the ingress operator.
Additional information on each of the above items can be found here: Networking Definition of Planned
This involves grabbing the cluster's global tlsSecurityProfile from the operator pod and then storing it locally in the render config so it can be applied to the MCO's templates and the MCS.
Note: The kubelet config controller already fetches this object and updates the kubeletconfig with it. So this can probably refactored into a common function.
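For reference, the cluster-wide profile both controllers read lives on the APIServer config object; a hedged sketch of that object:
apiVersion: config.openshift.io/v1
kind: APIServer
metadata:
  name: cluster
spec:
  tlsSecurityProfile:
    # Old, Intermediate, or Custom; the MCO would render this into its
    # templates and the MCS serving configuration.
    type: Intermediate
    intermediate: {}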
Done when:
This epic will encompass work that wasn't required for the MVP of tlsSecurityProfile for the MCO.
Upstream K8s deprecated PodSecurityPolicy and replaced it with a new built-in admission controller that enforces the Pod Security Standards (see here for the motivations for deprecation). There is an OpenShift-specific dedicated pod admission system called Security Context Constraints. Our aim is to keep the Security Context Constraints pod admission system while also allowing users to have access to the Kubernetes Pod Security Admission.
With OpenShift 4.11, we turned on Pod Security Admission with global "privileged" enforcement. Additionally, we set the "restricted" profile for warnings and audit. This configuration made it possible for users to opt their namespaces in to Pod Security Admission with the per-namespace labels. We also introduced a new mechanism that automatically synchronizes the Pod Security Admission "warn" and "audit" labels.
With OpenShift 4.15, we intend to move the global configuration to enforce the "restricted" pod security profile globally. With this change, the label synchronization mechanism will also switch into a mode where it synchronizes the "enforce" Pod Security Admission label rather than the "audit" and "warn".
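For context, these are the per-namespace Pod Security Admission labels involved; a minimal sketch of an opted-in namespace:
apiVersion: v1
kind: Namespace
metadata:
  name: example
  labels:
    # Today the syncer keeps warn/audit in sync; the 4.15 plan switches it
    # to synchronizing enforce instead.
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/audit: restricted
    pod-security.kubernetes.io/warn: restricted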
Epic Goal
Get Pod Security admission to be run in "restricted" mode globally by default alongside SCC admission.
Add an annotation that shows what the label syncer would set.
If a customer takes ownership of the audit and warn labels, it is unclear what the label syncer would enforce without evaluating all the SCCs of all users in the namespace.
This:
When creating a custom SCC, it is possible to assign a priority that is higher than existing SCCs. This means that any SA with access to all SCCs might use the higher priority custom SCC, and this might mutate a workload in an unexpected/unintended way.
To protect platform workloads from such an effect (which, combined with PSa, might result in rejecting the workload once we start enforcing the "restricted" profile) we must pin the required SCC to all workloads in platform namespaces (openshift-, kube-, default).
Each workload should pin the SCC with the least-privilege, except workloads in runlevel 0 namespaces that should pin the "privileged" SCC (SCC admission is not enabled on these namespaces, but we should pin an SCC for tracking purposes).
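A hedged sketch of what pinning looks like on a workload's pod template, using the openshift.io/required-scc annotation (the SCC value shown is illustrative):
spec:
  template:
    metadata:
      annotations:
        # Fail admission if the pod cannot be admitted under exactly this SCC.
        openshift.io/required-scc: restricted-v2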
The following tables track progress.
| 4.19 | 4.18 | 4.17 | 4.16 | 4.15 | 4.14 |
---|---|---|---|---|---|---|
monitored | 82 | 82 | 82 | 82 | 82 | 82 |
fix needed | 68 | 68 | 68 | 68 | 68 | 68 |
fixed | 39 | 39 | 35 | 32 | 39 | 1 |
remaining | 29 | 29 | 33 | 36 | 29 | 67 |
~ remaining non-runlevel | 8 | 8 | 12 | 15 | 8 | 46 |
~ remaining runlevel (low-prio) | 21 | 21 | 21 | 21 | 21 | 21 |
~ untested | 2 | 2 | 2 | 2 | 82 | 82 |
# | namespace | 4.19 | 4.18 | 4.17 | 4.16 | 4.15 | 4.14 |
---|---|---|---|---|---|---|---|
1 | oc debug node pods | #1763 | #1816 | #1818 | |||
2 | openshift-apiserver-operator | #573 | #581 | ||||
3 | openshift-authentication | #656 | #675 | ||||
4 | openshift-authentication-operator | #656 | #675 | ||||
5 | openshift-catalogd | #50 | #58 | ||||
6 | openshift-cloud-credential-operator | #681 | #736 | ||||
7 | openshift-cloud-network-config-controller | #2282 | #2490 | #2496 | |||
8 | openshift-cluster-csi-drivers | #6 #118 | #524 #131 #306 #265 #75 | #170 #459 | #484 | ||
9 | openshift-cluster-node-tuning-operator | #968 | #1117 | ||||
10 | openshift-cluster-olm-operator | #54 | n/a | n/a | |||
11 | openshift-cluster-samples-operator | #535 | #548 | ||||
12 | openshift-cluster-storage-operator | #516 | #459 #196 | #484 #211 | |||
13 | openshift-cluster-version | #1038 | #1068 | ||||
14 | openshift-config-operator | #410 | #420 | ||||
15 | openshift-console | #871 | #908 | #924 | |||
16 | openshift-console-operator | #871 | #908 | #924 | |||
17 | openshift-controller-manager | #336 | #361 | ||||
18 | openshift-controller-manager-operator | #336 | #361 | ||||
19 | openshift-e2e-loki | #56579 | #56579 | #56579 | #56579 | ||
20 | openshift-image-registry | #1008 | #1067 | ||||
21 | openshift-ingress | #1032 | |||||
22 | openshift-ingress-canary | #1031 | |||||
23 | openshift-ingress-operator | #1031 | |||||
24 | openshift-insights | #1033 | #1041 | #1049 | #915 | #967 | |
25 | openshift-kni-infra | #4504 | #4542 | #4539 | #4540 | ||
26 | openshift-kube-storage-version-migrator | #107 | #112 | ||||
27 | openshift-kube-storage-version-migrator-operator | #107 | #112 | ||||
28 | openshift-machine-api | #1308 #1317 | #1311 | #407 | #315 #282 #1220 #73 #50 #433 | #332 #326 #1288 #81 #57 #443 | |
29 | openshift-machine-config-operator | #4636 | #4219 | #4384 | #4393 | ||
30 | openshift-manila-csi-driver | #234 | #235 | #236 | |||
31 | openshift-marketplace | #578 | #561 | #570 | |||
32 | openshift-metallb-system | #238 | #240 | #241 | |||
33 | openshift-monitoring | #2298 #366 | #2498 | #2335 | #2420 | ||
34 | openshift-network-console | #2545 | |||||
35 | openshift-network-diagnostics | #2282 | #2490 | #2496 | |||
36 | openshift-network-node-identity | #2282 | #2490 | #2496 | |||
37 | openshift-nutanix-infra | #4504 | #4539 | #4540 | |||
38 | openshift-oauth-apiserver | #656 | #675 | ||||
39 | openshift-openstack-infra | #4504 | #4539 | #4540 | |||
40 | openshift-operator-controller | #100 | #120 | ||||
41 | openshift-operator-lifecycle-manager | #703 | #828 | ||||
42 | openshift-route-controller-manager | #336 | #361 | ||||
43 | openshift-service-ca | #235 | #243 | ||||
44 | openshift-service-ca-operator | #235 | #243 | ||||
45 | openshift-sriov-network-operator | #995 | #999 | #1003 | |||
46 | openshift-user-workload-monitoring | #2335 | #2420 | ||||
47 | openshift-vsphere-infra | #4504 | #4542 | #4539 | #4540 | ||
48 | (runlevel) kube-system | ||||||
49 | (runlevel) openshift-cloud-controller-manager | ||||||
50 | (runlevel) openshift-cloud-controller-manager-operator | ||||||
51 | (runlevel) openshift-cluster-api | ||||||
52 | (runlevel) openshift-cluster-machine-approver | ||||||
53 | (runlevel) openshift-dns | ||||||
54 | (runlevel) openshift-dns-operator | ||||||
55 | (runlevel) openshift-etcd | ||||||
56 | (runlevel) openshift-etcd-operator | ||||||
57 | (runlevel) openshift-kube-apiserver | ||||||
58 | (runlevel) openshift-kube-apiserver-operator | ||||||
59 | (runlevel) openshift-kube-controller-manager | ||||||
60 | (runlevel) openshift-kube-controller-manager-operator | ||||||
61 | (runlevel) openshift-kube-proxy | ||||||
62 | (runlevel) openshift-kube-scheduler | ||||||
63 | (runlevel) openshift-kube-scheduler-operator | ||||||
64 | (runlevel) openshift-multus | ||||||
65 | (runlevel) openshift-network-operator | ||||||
66 | (runlevel) openshift-ovn-kubernetes | ||||||
67 | (runlevel) openshift-sdn | ||||||
68 | (runlevel) openshift-storage |
We should be able to correlate flows with network policies:
PoC doc: https://docs.google.com/document/d/14Y3YYFxuOs3o-Lkipf-d7ZZp5gpbk6-01ZT_fTraCu8/edit
There are two possible approaches in terms of implementation:
The PoC describes the former, however it is probably most interesting to aim for the latter (95% of the PoC is valid in both cases, i.e. all the "low level" parts: OvS, OVN). The latter involves more work in FLP.
An elevator pitch (value statement) that describes the Feature in a clear, concise way. Complete during New status.
Users/customers of OpenShift on AWS (ROSA) want to use static IPs (and therefore AWS Elastic IPs) so that they can configure appropriate firewall rules. They want the default AWS Load Balancer that they use (NLB) for their router to use these EIPs.
Kubernetes does define a service annotation for configuring EIP
allocations, which should work in OCP:
// ServiceAnnotationLoadBalancerEIPAllocations is the annotation used on the
// service to specify a comma separated list of EIP allocations to use as
// static IP addresses for the NLB. Only supported on elbv2 (NLB)
const ServiceAnnotationLoadBalancerEIPAllocations = "service.beta.kubernetes.io/aws-load-balancer-eip-allocations"
We do not provide an API field on the IngressController API to configure
this annotation.
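For illustration, a minimal sketch of that annotation on a LoadBalancer Service, assuming pre-allocated EIPs (the allocation IDs below are placeholders; on OpenShift the router Service is managed by the ingress operator, which is exactly why an API field is being requested):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: example-nlb
  annotations:
    # NLB-only annotation; one pre-allocated EIP allocation ID per subnet/AZ
    service.beta.kubernetes.io/aws-load-balancer-type: "nlb"
    service.beta.kubernetes.io/aws-load-balancer-eip-allocations: "eipalloc-0123456789abcdef0,eipalloc-0fedcba9876543210"
spec:
  type: LoadBalancer
  selector:
    app: example
  ports:
  - port: 443
    targetPort: 443
```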
This is a feature request to enhance the IngressController API to be able to support static IPs from install time and upon reconfiguration of the router (may require destroy/recreate LB)
The observable functionality that the user now has as a result of receiving this feature. Complete during New status.
A list of specific needs or objectives that a feature must deliver in order to be considered complete. Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc. Initial completion during Refinement status.
Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.
Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.
High-level list of items that are out of scope. Initial completion during Refinement status.
Provide any additional context that is needed to frame the feature. Initial completion during Refinement status.
Allowing those EIPs to be provisioned externally and to survive cluster reconfiguration, or even creation/deletion, helps support our "don't treat clusters as pets" philosophy. It also removes the additional burden of wrapping the cluster or our managed service with yet another global IP service that should be unnecessary and would bring more complexity. That aligns precisely with their interest in the functionality, and we should pursue making this seamless.
Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.
Provide information that needs to be considered and planned so that documentation will meet customer needs. Initial completion during Refinement status.
Which other projects and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.
See Parent Feature for details.
This enhancement allows users to set AWS EIPs for the default or a custom NLB ingress controller.
This is a feature request to enhance the IngressController API to be able to support static IPs during installation and upon reconfiguration of the router.
R&D spike for 4.14 and implementation in 4.15
The end-to-end test will cover the scenario where the user sets an eipAllocations field in the IngressController CR and will verify that the LoadBalancer-type service has the service.beta.kubernetes.io/aws-load-balancer-eip-allocations annotation set with the value of the eipAllocations field from the IngressController CR that was created.
The end-to-end test will also cover the scenario where the user updates the eipAllocations field in the IngressController CR and will verify that the LoadBalancer-type service's service.beta.kubernetes.io/aws-load-balancer-eip-allocations annotation is updated with the new value of eipAllocations from the IngressController CR.
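A rough sketch of the IngressController such a test could create, assuming a hypothetical eipAllocations field under the NLB provider parameters (the final API shape is decided in the openshift/api review; allocation IDs are placeholders):

```yaml
apiVersion: operator.openshift.io/v1
kind: IngressController
metadata:
  name: eip-test
  namespace: openshift-ingress-operator
spec:
  domain: eip-test.apps.example.com
  endpointPublishingStrategy:
    type: LoadBalancerService
    loadBalancer:
      scope: External
      providerParameters:
        type: AWS
        aws:
          type: NLB
          networkLoadBalancer:
            # hypothetical field; placeholder allocation IDs
            eipAllocations:
            - eipalloc-0123456789abcdef0
            - eipalloc-0fedcba9876543210
```

The test would then assert that the corresponding router-eip-test LoadBalancer Service carries the matching service.beta.kubernetes.io/aws-load-balancer-eip-allocations annotation.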
Add the logic to the ingress operator to create a load balancer service with a specific subnet as provided by the API. The Ingress Operator will need to set a platform-specific annotation on the load balancer type service to specify a subnet.
Develop the API updates and create a PR against the openshift/api repo for allowing users to select a subnet for Ingress Controllers.
Get review from API team.
After the implementation https://github.com/openshift/cluster-ingress-operator/pull/1046 is merged, and there is a 95+% pass rate over at least 14 CI runs (slack ref), we will need to open a PR to promote the feature from the TechPreview to the Default feature set.
Phase 2 Goal: as for Phase 1, incorporate the assets from different repositories to simplify asset management.
Overarching Goal
Move to using the upstream Cluster API (CAPI) in place of the current implementation of the Machine API for standalone OpenShift.
Phase 1 & 2 covers implementing base functionality for CAPI.
Phase 2 also covers migrating MAPI resources to CAPI.
There must be no negative effect on customers/users of the MAPI; this API must continue to be accessible to them, though how it is implemented "under the covers", and whether that implementation leverages CAPI, is open.
OCP/Telco Definition of Done
Epic Template descriptions and documentation.
<--- Cut-n-Paste the entire contents of this description into your new Epic --->
To enable a quick start in CAPI, we want to allow the users to provide just the Machines/MachineSets and the relevant configuration for the Machines. The cluster infrastructure is either not required to be populated, or something they should not care about.
To enable the quick start, we should create and where applicable, populate required fields for the infrastructure cluster.
This will go alongside a generated Cluster object and should mean that the `openshift-cluster-api` Cluster is now infrastructure ready.
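A minimal sketch of the generated objects this implies, using AWS for illustration (resource names and exact fields are assumptions; the operator would populate them from the existing cluster infrastructure):

```yaml
apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  name: example-cluster          # placeholder; generated by the operator
  namespace: openshift-cluster-api
spec:
  infrastructureRef:
    apiVersion: infrastructure.cluster.x-k8s.io/v1beta2
    kind: AWSCluster
    name: example-cluster
---
apiVersion: infrastructure.cluster.x-k8s.io/v1beta2
kind: AWSCluster
metadata:
  name: example-cluster
  namespace: openshift-cluster-api
spec:
  region: us-east-1              # populated from the existing infrastructure
```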
Implement Migration core for MAPI to CAPI for AWS
When customers use CAPI, there must be no negative effect when switching over to using CAPI: migration of Machine resources must be seamless, and the fields in MAPI/CAPI should reconcile from both CRDs.
OCP/Telco Definition of Done
Epic Template descriptions and documentation.
<--- Cut-n-Paste the entire contents of this description into your new Epic --->
We would like new code going into the CAPI operator repository to be linted based on the teams linting standards.
The CPMS is a good source for an initial configuration, though it may be a touch strict; we can make a judgement about whether we want to disable some of the linters it uses for this project.
Some linters enabled may be easy to bulk fix in the existing code, others may be harder.
We can make a judgement as to whether to fix the issues or set the CI configuration to only check new code.
To implement https://github.com/openshift/enhancements/pull/1465, we need to include an `authoritativeAPI` field on Machines, MachineSets, ControlPlaneMachineSets and MachineHealthChecks.
This field will control which version of a resource is considered authoritative, and therefore, which controllers should implement the functionality.
The field details are outlined in the enhancement.
The status of each resource should also be updated to indicate the authority of the API.
APIs should be added behind a ClusterAPIMigration feature gate.
When the Machine and MachineSet MAPI resources are non-authoritative, the Machine and MachineSet controllers should observe this condition and exit, pausing the reconciliation.
When they pause, they should acknowledge this pause by adding a paused condition to the status and ensuring it is set to true.
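A rough sketch of what this could look like on a MachineSet, assuming the authoritativeAPI field and Paused condition described in the enhancement (exact names and values may differ):

```yaml
apiVersion: machine.openshift.io/v1beta1
kind: MachineSet
metadata:
  name: example-machineset
  namespace: openshift-machine-api
spec:
  authoritativeAPI: ClusterAPI      # assumed values: MachineAPI | ClusterAPI
  replicas: 2
status:
  authoritativeAPI: ClusterAPI
  conditions:
  - type: Paused
    status: "True"
    message: The MAPI resource is non-authoritative; reconciliation is paused.
```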
Currently, the existing procedure for full rotation of all cluster CAs/certs/keys is not suitable for Hypershift. Several oc helper commands added for this flow are not functional in Hypershift. Therefore, a separate and tailored procedure is required specifically for Hypershift post its General Availability (GA) stage.
Most of the rotation procedure can be performed on the management side, given the decoupling between the control-plane and workers in the HyperShift architecture.
That said, it is important to ensure and assess the potential impacts on customers and guests during the rotation process, especially on how they affect SLOs and disruption budgets.
As a hypershift QE, I want to be able to:
so that I can achieve
This does not require a design proposal.
This does not require a feature gate.
As an engineer I would like to customize the self-signed certificate rotation used in the HCP components using an annotation on the HostedCluster object.
As an engineer I would like to customize the self-signed certificate expiration used in the HCP components using an annotation on the HostedCluster object.
Note: This feature will be a TechPreview in 4.16 since the newly introduced API must graduate to v1.
Overarching Goal
Customers should be able to update and boot a cluster without a container registry in disconnected environments. This feature is for bare metal disconnected clusters.
Background
Describes the work needed from the MCO team to take Pinned Image Sets to GA.
As described in https://github.com/openshift/enhancements/pull/1483, we would like the cluster to be able to upgrade properly without accessing an external registry in case all the images already exist and are pinned on the relevant nodes. The same goes for boot.
The required functionality is mostly to add blocking tests that ensure that is the case, and to address any issues that these tests might reveal in the OCP behavior.
Details:
These are the tests that will need to be added:
All these tests will have a preparation step that will use the PinnedImageSet (see MCO-838 for more details) support to ensure that all the images required are present in all the nodes.
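For reference, a sketch of the kind of PinnedImageSet manifest that preparation step would apply (API group and field names follow the enhancement and may differ in the final API; the digest is a placeholder):

```yaml
apiVersion: machineconfiguration.openshift.io/v1alpha1
kind: PinnedImageSet
metadata:
  name: release-images
spec:
  pinnedImages:
  - name: quay.io/openshift-release-dev/ocp-release@sha256:0000000000000000000000000000000000000000000000000000000000000000
```

The targeted MachineConfigPool then references this set (per the enhancement) so the images are pre-pulled and pinned on its nodes before registry access is cut off.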
Our proposal for this set of tests is to create a set of machines inside a virtual network. This network will initially have access to the registry server. That registry will be used to install the cluster and to pull the pinned images. After that, access to the registry server will be blocked using a firewall rule in the virtual network. Then the reboots or upgrades will be performed and verified.
So these are the things that need to be done:
Goal Summary
This feature aims to make sure that the HyperShift operator and the control-plane it deploys uses Managed Service Identities (MSI) and have access to scoped credentials (also via access to AKS's image gallery potentially). Additionally, for operators deployed in customers account (system components), they would be scoped with Azure workload identities.
Today, Azure installation requires a manually created service principal, which involves relations, permission granting, credential setting, credential storage, credential rotation, credential clean-up, and service principal deletion. This is not only mundane and time-consuming but also less secure, risking access to resources by adversaries due to the lack of credential rotation.
Employ Azure managed credentials, which drastically reduce the required steps to just managed identity creation, permission granting, and resource deletion.
Ideally, this should be a HyperShift-native functionality. I.e., HyperShift should use managed identities for the control plane, the kubelet, and any add-on that needs access to Azure resources.
Operators running on the management side that need to access the Azure customer account will use MSI.
Operands running in the guest cluster should rely on workload identity.
This ticket is to solve the latter.
We need to implement workload identity support in our components that run on the spoke cluster.
Address any TODOs in the code related to this ticket.
Networking Definition of Planned
Epic Template descriptions and documentation
Support Managed Service Identity (MSI) authentication in Azure.
Controllers that require cloud access and run on the control plane side in ARO hosted clusters will need to use MSI to acquire tokens to interact with the hosted cluster's cloud resources.
The cluster network operator runs the following pods that require cloud credentials:
The following components use the token-minter but do not require cloud access:
These pods will need to use MSI when running in hosted control plane mode.
Additional information on each of the above items can be found here: Networking Definition of Planned
Support Managed Service Identity (MSI) authentication in Azure.
Controllers that require cloud access and run on the control plane side in ARO hosted clusters will need to use MSI to acquire tokens to interact with the hosted cluster's cloud resources.
The cluster ingress controller will need to support MSI.
Additional information on each of the above items can be found here: Networking Definition of Planned
This feature focuses on the optimization of resource allocation and image management within NodePools. This will include enabling users to specify resource groups at NodePool creation, integrating external DNS support, ensuring Cluster API (CAPI) and other images are sourced from the payload, and utilizing Image Galleries for Azure VM creation.
As ARO/HCP provider, I want to be able to:
so that I can achieve
Description of criteria:
This requires a design proposal so OCM knows where to specify the resource group
This might require a feature gate in case we don't want it for self-managed.
Azure stipulates that any VM that's created must have a NIC in the same location as the virtual machine. The NIC must also belong to a subnet in the same location as well. Furthermore, the network security group attached to a subnet must also be in the same location as the subnet and therefore vnet.
So all these resources (the virtual network including its subnets, the network security group, and the resource group with all its associated resources, including VMs and NICs) have to be in the same location.
Description of criteria:
Detail about what is specifically not being delivered in the story
This does not require a design proposal.
This does not require a feature gate.
As a (user persona), I want to be able to:
so that I can achieve
Description of criteria:
Detail about what is specifically not being delivered in the story
This requires/does not require a design proposal.
This requires/does not require a feature gate.
Before GAing Azure let's make sure we do a final API review
We should move the MachineIdentityID to the NodePool API rather than the HostedCluster API. This field is specifically related to a NodePool and shouldn't be a HC wide field.
Detail about what is specifically not being delivered in the story
This requires/does not require a design proposal.
This requires/does not require a feature gate.
As a user of Azure HostedClusters on AKS, I want to be able to:
so that I can
Description of criteria:
Detail about what is specifically not being delivered in the story
This requires/does not require a design proposal.
This requires/does not require a feature gate.
Requirement | Notes | isMvp? |
---|---|---|
CI - MUST be running successfully with test automation | This is a requirement for ALL features. | YES |
Release Technical Enablement | Provide necessary release enablement details and documents. | YES |
This Section:
This Section: What does the person writing code, testing, documenting need to know? What context can be provided to frame this feature.
Questions to be addressed:
As a developer, I want to be able to generate YAMLs using generative AI. For that I need to open the LightSpeed chat interface easily to start a conversation.
Add support to open the LightSpeed chat interface directly from the YAML Editor Code Toolbar.
OpenShift Lightspeed.
As a user, I want to quickly open the Lightspeed Chat interface directly from the YAML Editor UI so that I can quickly get some help in getting some sample YAMLs.
Improve onboarding experience for using Shipwright Builds in OpenShift Console
Enable users to create and use Shipwright Builds in OpenShift Console while requiring minimal expertise about Shipwright
Requirements | Notes | IS MVP |
Enable creating Shipwright Builds using a form | Yes | |
Allow use of Shipwright Builds for image builds during import flows | Yes | |
Enable access to build strategies through navigation | Yes |
TBD
TBD
Shipwright Builds UX in Console should provide a simple onboarding path for users in order to transition them from BuildConfigs to Shipwright Builds.
TBD
TBD
TBD
TBD
TBD
Builds for OpenShift (Shipwright) offers Shipwright builds for building images on OpenShift but it's not available as a choice in the import flows. Furthermore, BuildConfigs could be turned off through the Capabilities API in OpenShift which would prevent users from importing their applications into the cluster through the import flows, even when they have Shipwright installed on their clusters.
As a developer, I want to use Shipwright Builds when importing applications through the import flows so that I can take advantage of Shipwright Build strategies such as buildpacks for building my application.
User should be provided the option to use Shipwright Builds when the OpenShift Builds operator is installed on the cluster.
To enable developers to take advantage of new build capabilities on OpenShift.
As a user, I need to be able to use the Shipwright Builds to build my application in the Git Import Flow.
As a developer, I need to ensure that all the current import workflows are working properly after this update and also fix and add new unit and e2e tests to ensure this feature does not get broken in the future.
As a user, I need to be able to use the Edit Application form to edit the resource I have created using the Git Import Form.
The build strategy list page is missing from admin console
Add a page for Shipwright BuildStrategy in admin console with 2 tabs in the page
As a user, I want to see the Shipwright lists pages under one navigation option in the administrative perspective
Telecommunications providers continue to deploy OpenShift at the Far Edge. The acceleration of this adoption and the nature of existing Telecommunication infrastructure and processes drive the need to improve OpenShift provisioning speed at the Far Edge site and the simplicity of preparation and deployment of Far Edge clusters, at scale.
A list of specific needs or objectives that a Feature must deliver to satisfy the Feature. Some requirements will be flagged as MVP. If an MVP gets shifted, the feature shifts. If a non MVP requirement slips, it does not shift the feature.
requirement | Notes | isMvp? |
Telecommunications Service Provider Technicians will be rolling out OCP w/ a vDU configuration to new Far Edge sites, at scale. They will be working from a service depot where they will pre-install/pre-image a set of Far Edge servers to be deployed at a later date. When ready for deployment, a technician will take one of these generic-OCP servers to a Far Edge site, enter the site specific information, wait for confirmation that the vDU is in-service/online, and then move on to deploy another server to a different Far Edge site.
Retail employees in brick-and-mortar stores will install SNO servers and it needs to be as simple as possible. The servers will likely be shipped to the retail store, cabled and powered by a retail employee and the site-specific information needs to be provided to the system in the simplest way possible, ideally without any action from the retail employee.
Q: how challenging will it be to support multi-node clusters with this feature?
< What does the person writing code, testing, documenting need to know? >
< Are there assumptions being made regarding prerequisites and dependencies?>
< Are there assumptions about hardware, software or people resources?>
< Are there specific customer environments that need to be considered (such as working with existing h/w and software)?>
< Are there Upgrade considerations that customers need to account for or that the feature should address on behalf of the customer?>
<Does the Feature introduce data that could be gathered and used for Insights purposes?>
< What educational or reference material (docs) is required to support this product feature? For users/admins? Other functions (security officers, etc)? >
< What does success look like?>
< Does this feature have doc impact? Possible values are: New Content, Updates to existing content, Release Note, or No Doc Impact>
< If unsure and no Technical Writer is available, please contact Content Strategy. If yes, complete the following.>
< Which other products and versions in our portfolio does this feature impact?>
< What interoperability test scenarios should be factored by the layered product(s)?>
Question | Outcome |
Allow users to use the openshift installer to generate the IBI artifacts.
Please describe what conditions must be met in order to mark this feature as "done".
Yes
We should add the Image-based Installer imagebased create image and imagebased create installation-config-template subcommands to the OpenShift Installer, conforming to the respective enhancement, for the generation of the IBI installation ISO image.
We should add the Image-based Installer imagebased create config and imagebased create config-template subcommands to the OpenShift Installer, conforming to the respective enhancement, for the generation of the IBI config ISO image.
We should add integration tests for all the image-based installer commands in the same way as the agent installer ones (i.e. using testscript).
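A sketch of the intended usage, assuming the subcommand names above (the final CLI spelling is whatever the enhancement settles on) and a work directory that already contains the image-based manifests:

```shell
# Generate the IBI installation ISO
openshift-install imagebased create image --dir ./ibi-iso-workdir

# Generate the site-specific IBI config ISO
openshift-install imagebased create config --dir ./ibi-config-workdir
```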
This section includes Jira cards that are linked to an Epic, but the Epic itself is not linked to any Feature. These epics were completed when this image was assembled
OCP/Telco Definition of Done
Epic Template descriptions and documentation.
<--- Cut-n-Paste the entire contents of this description into your new Epic --->
When a `PerformanceProfile` is mirrored into the HCP namespace, it is given a name composed of a constant (perfprof) + the node pool name.
In case there are multiple `PerformanceProfiles`, one per hosted cluster, the approach mentioned above makes it hard to understand which `PerformanceProfile` in the clusters namespace is associated with each hosted cluster.
It requires examining the node pool `spec.tuningConfig` to figure out the relation between the user profile and the mirrored one.
We should embed the name of the ConfigMap (that encapsulates the PerformanceProfile) in the mirrored profile, to make the relation clear and observable at first sight.
We already have a lot of test cases to check PAO behaviour on OCP.
As we want to cover the same basic behaviour on Hypershift deployments as in Stand Alone ones, adapting the test cases to run on both types of deployments seems a good way to go.
Target: to have the same test coverage in Hypershift as we already have on Stand Alone.
The PAO controller takes certain tuning decisions based on the container runtime being used by the cluster.
As of now hypershift does not support ContainerRuntimeConfiguration at all.
We should address this gap when ContainerRuntimeConfiguration gets supported on hypershift.
UPDATE [15.05.2024]
Hypershift does support ContainerRuntimeConfiguration but does not populate the object into the HCP namespace, so the PAO still cannot read it.
Implement on the hypershift side the handling of the performance profile status as described in the enhancement proposal.
Currently, running a specific suite or a spec for example:
ginkgo run --focus="Verify rcu_nocbs kernel argument on the node"
that is executing commands on the node (using the node inspector) will fail, as the user must run the 0_config suite beforehand (we are creating the node inspector in the 0_config suite).
This places a burden on the developer, so we should consider how to ensure the node inspector will be available for these scenarios.
A potential fix could be adding lazy initialization for the node inspector once we start executing commands on a node.
We need to update the NTO roles with configmap/finalizers permissions so it is able to set the controller reference properly for the dependent objects.
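A minimal sketch of the additional RBAC rule, assuming it is added to the existing NTO role/clusterrole manifest:

```yaml
# Needed so NTO can set ownerReferences with blockOwnerDeletion on the ConfigMaps
# it creates (the OwnerReferencesPermissionEnforcement admission plugin checks
# update permission on the finalizers subresource).
- apiGroups: [""]
  resources: ["configmaps/finalizers"]
  verbs: ["update"]
```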
PerformanceProfile objects are handled in a different way on Hypershift so modifications in the Performance Profile controller are needed to handle this.
Basically the Performance Profile controller has to reconcile ConfigMaps which have PerformanceProfile objects embedded into them, create the different manifests as usual, and then hand them to the hosted cluster using different methods.
More info in the enhancement proposal
Target: To have a feature equivalence in Hypershift and Standalone deployments
We need to make sure that the controller evaluates feature gates on Hypershift.
This is needed for features which are behind a feature gate, such as MixedCPUs.
Some insightful conversation regarding the subject: https://redhat-internal.slack.com/archives/C01C8502FMM/p1712846255671449?thread_ts=1712753642.635209&cid=C01C8502FMM
Implement on the NTO side the handling of the performance profile status as described in the enhancement proposal.
On hypershift we don't have the machine-config-daemon pod, so we cannot execute commands directly on the nodes during the e2e test run.
In order to preserve the ability to execute commands on hypershift nodes, we should create a DaemonSet which creates high-privileged pods that mount the host filesystem.
The DaemonSet should be spun up at the beginning of the tests and deleted at the end of them.
In addition, the API should remain as similar as possible.
Relevant section in the design doc:
https://docs.google.com/document/d/1_NFonPShbi1kcybaH1NXJO4ZojC6q7ChklCxKKz6PIs/edit#heading=h.3zhagw19tayv
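A rough sketch of such a DaemonSet (names, namespace, and image are placeholders; the real manifest lives with the test utilities):

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-inspector
  namespace: node-inspector
spec:
  selector:
    matchLabels:
      app: node-inspector
  template:
    metadata:
      labels:
        app: node-inspector
    spec:
      hostPID: true
      containers:
      - name: inspector
        image: registry.access.redhat.com/ubi9/ubi:latest   # placeholder image
        command: ["sleep", "infinity"]
        securityContext:
          privileged: true
        volumeMounts:
        - name: host
          mountPath: /rootfs
      volumes:
      - name: host
        hostPath:
          path: /
```

The tests could then run commands on a node with something like `oc exec <pod> -- chroot /rootfs <command>`, keeping the calling API close to the existing machine-config-daemon based helper.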
We should add a `--hypershift` flag to ppc to generate a performance profile that is adjusted for the hypershift platform.
This epic is to track any stories for hypershift kubevirt development that do not fit cleanly within a larger effort.
Here are some examples of tasks that this "catch all" epic can capture
The hypershift/kubevirt platform requires usage of RWX volumes for both the VM root volumes and any kubevirt-csi provisioned volumes in order to be live migratable. There are other situations where a VMI might not be live migratable as well.
We should report via a condition when a VMI is not live migratable (there's a condition on the VMI that indicates this) and warn that live migration is not possible on the NodePool (and likely HostedCluster as well). This warning shouldn't block the cluster creation/update
Issue CNV-30407 gives the user the ability to choose whether or not to utilize multiqueue for the VMs in a nodepool. This setting is known to improve performance when jumbo frames (mtu 9000 or larger) are in use, but has the issue that performance was degraded with smaller MTUs
The issue that impacted performance on smaller MTUs is being resolved here, KMAINT-145. Once KMAINT-145 lands, the default in both the CLI and CEL defaulting should be to enable multiqueue
In 4.18 we plan to remove react-router v5 shared modules. For that reason we need to deprecated these shared modules now, in 4.16.
We should announce to all the Dynamic Plugin owners that we are planning to update react-router from v5 to v6 and that we have already started the process. We also need to explain to them what our migration plan and timeline are:
We should give them some grace period (2 releases - end of 4.18). The change on the plugin's side needs to happen as one change.
AC:
Since console is aiming for adopting PF6 in 4.18, we need to start the deprecation process in 4.16, due to the N+1 deprecation policy.
This will give us time in 4.17 to prepare console for adopting the PF6.
This will give plugins time to move to PF5.
AC:
This epic tracks the rebase of openshift/etcd to 3.5.14
This update includes the following changes:
https://github.com/etcd-io/etcd/blob/main/CHANGELOG/CHANGELOG-3.5.md#v3514-2024-05-29
Most notably this includes the experimental flag to stop serving requests on an etcd member that is undergoing defragmentation which would help address https://issues.redhat.com/browse/OCPSTRAT-319
Rebase openshift/etcd to latest 3.5.14 upstream release.
tracking here all the work that needs to be done to configure the ironic containers (ironic-image and ironic-agent-image) to be ready for OCP 4.18
this includes also CI configuration, tools and documentation updates
all the configuration bits need to happen at least one sprint BEFORE 4.18 branching (current target August 9)
docs tasks can be completed after the configuration tasks
the CI tasks need to be completed RIGHT AFTER 4.18 branching happens
tag creation is now automated during OCP tags creation
builder creation still needs to be coordinated with RHOS delivery
before configuring the sources for 4.18 we need to make sure that the latest sources for 4.17 are up-to-date
This is not a user-visible change.
METAL-119 provides the upstream ironic functionality
tracking here all the work that needs to be done to configure the ironic container images for OCP 4.17
this includes also CI configuration, tools and documentation updates
This document will include two approaches to configure an environment with multipath:
The doc will be under assisted-service/docs/dev
Description of the problem:
Assisted Service currently skips formatting of FC and iSCSI drives, but doesn't skip multipath drives over FC/iSCSI. These drives aren't really part of the server and shouldn't be formatted like direct-attached storage.
Description of the problem:
The Multipath object in the inventory does not contain a "wwn" value, thus a user cannot use the "wwn" rootDiskHint in the config.
How reproducible:
Always
Steps to reproduce:
1.
2.
3.
Actual results:
Example:
{ "bootable": true, "by_id": "/dev/disk/by-id/wwn-0x60002ac00000000000000ef700026667", "drive_type": "Multipath", "has_uuid": true, "id": "/dev/disk/by-id/wwn-0x60002ac00000000000000ef700026667", "installation_eligibility": \{ "eligible": true, "not_eligible_reasons": null }, "name": "dm-0", "path": "/dev/dm-0", "size_bytes": 214748364800 },
Expected results:
{
  "bootable": true,
  "by_id": "/dev/disk/by-id/wwn-0x60002ac00000000000000ef700026667",
  "drive_type": "Multipath",
  "has_uuid": true,
  "id": "/dev/disk/by-id/wwn-0x60002ac00000000000000ef700026667",
  "installation_eligibility": {
    "eligible": true,
    "not_eligible_reasons": null
  },
  "name": "dm-0",
  "path": "/dev/dm-0",
  "size_bytes": 214748364800,
  "wwn": "0x60002ac00000000000000ef700026667"
},
The F5 Global Server Load Balancing offers a DNS-based load balancer (called a Wide IP) that can be used alone or in front of local (proxying) load balancers. Essentially it does round-robin DNS. However, crucially, if there are no healthy listeners available then the DNS name will not resolve to an IP. Even if there is a second layer of local load balancer, if these are part of the F5 then the listener status will propagate through to the DNS load balancer (unless the user commits the error of creating a TCP monitor for the Wide IP).
This means that in assisted installations with a Wide IP as the first layer of load balancing, the DNS validations will fail even when everything is configured correctly because there are not yet any listeners available (the API and Ingress are not yet up).
At least in the case of the customer who brought this to light, there was a CNAME record created that pointed to the global load balancer, it just didn't resolve further to an IP. This presents an opportunity to stop disallowing this configuration in assisted installations.
Since golang 1.20 (which is used in the agent since 4.15), the net.LookupCNAME() function is able to look up the CNAME record for a domain name. If we were to fall back to doing this lookup when no IP addresses are found for the host, then we could treat evidence of having set up the CNAME record as success. (We will have to take care to make this robust against likely future changes to LookupCNAME(), since it remains broken in a number of ways that we don't care about here.) In theory this is relevant only when UserManagedNetworking is enabled.
A small downside is that we would no longer catch configuration problems caused by a mis-spelled CNAME record. However, since we don't validate the load balancer configuration from end-to-end anyway, I think we can discount the importance of this. The most important thing is to catch the case where users have neglected to take any action at all to set up DNS records, which this would continue to do.
Fill cnames in the domain name resolution result if available
There were several issues found in customer sites concerning connectivity checks:
When the none platform is in use, if there is ambiguity in node-ip assignment, an incorrect assignment might lead to installation failure. This happens when etcd detects that the socket address from an etcd node does not match the expected address in the peer certificate. In this case etcd rejects such a connection.
Example: assume two networks, net1 and net2.
Master node 1 has one address that belongs to net1.
Master node 2 has two addresses: one that belongs to net1, and another that belongs to net2.
Master node 3 has one address that belongs to net1.
If the selected node-ip of master node 2 belongs to net2, then when it creates a connection with any other master node, the socket address will be the address that belongs to net1. Since etcd expects it to be the same as the node-ip, it will reject the connection.
This can be solved by a node-ip selection that will not cause such a conflict.
Node-ip assignment should be done through ignition.
To correctly set bootstrap ip, the machine-network for the cluster must be set to match the selected node-ip for that host.
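For example, with an install-config style machine network, constraining it to the subnet the selected node-ips belong to removes the ambiguity (the CIDR is a placeholder):

```yaml
networking:
  machineNetwork:
  - cidr: 192.168.10.0/24   # the network the selected node-ips (and bootstrap ip) must belong to
```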
A user that wants to install clusters in FIPS-mode must run the corresponding installer in a matching runtime environment to the one the installer was built against.
In 4.16 the installer started linking against the RHEL9 crypto library versions, which means that in a given container in a FIPS-enabled environment only >=4.16 installers or only <4.16 installers may be run.
To solve this we will limit users to only one of these sets of versions for a given assisted service deployment.
This epic should implement the "Publish multiple assisted-service images" alternative in https://github.com/openshift/assisted-service/pull/6290 which has much more detail about the justification for this approach.
The main enhancement will be implemented as a followup in https://issues.redhat.com/browse/MGMT-17314
Create a Dockerfile in the assisted-service repo that builds the current service with an el8 (CentOS Stream 8 in this case) base.
Manage the effort for adding jobs for release-ocm-2.11 on assisted installer
https://docs.google.com/document/d/1WXRr_-HZkVrwbXBFo4gGhHUDhSO4-VgOPHKod1RMKng
Merge order:
Update the BUNDLE_CHANNELS in the Makefile in assisted-service and run bundle generation.
Please describe what this feature is going to do.
Please describe what conditions must be met in order to mark this feature as "done".
If the answer is "yes", please make sure to check the corresponding option.
Please describe what this feature is going to do.
Currently the Infrastructure operator (AgentServiceConfig controller) will only install assisted service properly on an OpenShift cluster.
There is a requirement from the Sylva project to also be able to install on other Kubernetes distributions so the same operator should support this.
Please describe what conditions must be met in order to mark this feature as "done".
Infrastructure operator correctly installs assisted service and related components in a kubernetes cluster.
If the answer is "yes", please make sure to check the corresponding option.
No
This allows assisted installer to integrate properly with the Sylva upstream project which will be used by Orange.
While deployed in the Sylva upstream, assisted installer should still deploy supported OpenShift clusters, which will expand our install base to users who wouldn't previously have easy access to OpenShift.
Yes, our competitors are also contributors to the Sylva project and can install their container orchestration platforms through it.
N/A
Route is an openshift-specific type which is not available in non-OCP kubernetes clusters.
Ensure we create an Ingress instead of a route when we're running in non-OCP kubernetes clusters.
The webhook server used by assisted service will only run if tls certs and keys are provided.
This means that we first need to find a maintainable way to create and rotate certificates then use that solution to create a cert/key that can be used by the webhook server.
Currently the operator requires certain OCP-specific kinds to be present on startup.
It will fail without these.
Find a way to detect the type of cluster in which the operator is running and only watch the applicable types.
registry.ci.openshift.org/openshift/release:golang-* images are based on centos Linux 7, which is EOL. Therefore, dnf is not capable of installing packages anymore as its repositories in Centos Linux 7 are not supported. We want to replace these images.
All of our builds are based on Centos Stream 9, with the exception of builds for FIPS.
No
golang 1.20 image is based on centos 7.
Since centos 7 is EOL, installing rpms on it is impossible (there are no suitable registries).
This task is about fixing all the broken jobs (jobs that are based on the golang 1.20 image and try to install rpms on it) across all the AI-relevant repos.
More info can be found in [this|https://redhat-internal.slack.com/archives/C035X734RQB/p1719907286531789] discussion.
Currently, we use CI Golang base image (registry.ci.openshift.org/openshift/release:golang-<version>) in a lot of assisted installer components builds. These images are based on Centos Linux 7, which is EOL. We want to replace these images with maintained images e.g. `registry.access.redhat.com/ubi9/go-toolset:<version>`
Description of the problem:
When iSCSI multipath is enabled, the disks are detected as expected, plus an mpath device.
In the UI we see the mpatha device and under it two disks.
The issue here is that we allow the user to set a member disk as the installation disk, which triggers a format and breaks the mpath.
sda 8:0 0 120G 0 disk
sdb 8:16 0 30G 0 disk
└─mpatha 253:0 0 30G 0 mpath
sdc 8:32 0 30G 0 disk
└─mpatha 253:0 0 30G 0 mpath
In case the user picks sdb as the installation disk (which is wrong), cluster installation will fail.
This host failed its installation.
Failed - failed after 3 attempts, last error: failed executing /usr/bin/nsenter [--target 1 --cgroup --mount --ipc --pid -- coreos-installer install --insecure -i /opt/install-dir/worker-3bd2cf09-01c6-48ed-9a44-25b4dbb3633d.ign --append-karg ip=ens3:dhcp --append-karg rd.iscsi.firmware=1 /dev/sdf], Error exit status 1, LastOutput "Error: getting sector size of /dev/sdf Caused by: 0: opening "/dev/sdf" 1: No such device or address (os error 6)".
How to address this issue:
From what I saw, I would expect:
- The installation disk should be handled from the parent, meaning we should enable setting the installation disk on the mpath device.
(I tried handling it from fdisk and it looks like any change on the mpath device is reflected on both member disks that are bound to the mpath; both "local" disks are actually the same disk on the target.)
How reproducible:
always
Steps to reproduce:
1.
2.
3.
Actual results:
Expected results:
Moving forward with changing the way we test assisted-installer. We should change the way assisted-test-infra and subsystem tests on assisted-service are deploying assisted-service.
There are lots of issues when running minikube with kvm2 driver, most of them are because of the complex setup (downloading the large ISO image, setting up the libvirt VM, defining registry addon, etc.)
Currently we are in the process of moving our test deployments to kind instead of minikube. Assisted-Service is already compatible with kind, but not in debug mode. We want to enable debugging of assisted-service in subsystem-tests, e2e tests, and all other deployments on kind
DoD: developers can run a command like make subsystem that deploys the service on kind and runs all subsystem tests.
Right now, we have hub-cluster creation only documented on README and not actually being automated in our run. We should have an automated way to have everything related to running subsystem tests.
Also, we should make sure we're able to build service and update the image in the kind cluster, to allow for a continuous development experience.
CMO should inject the following environment variables into Alertmanager containers:
The values are retrieved from the cluster-wide Proxy resource.
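The exact variable list is not included in this export; for illustration only, the conventional proxy variables injected into the Alertmanager container could look like the following, with values copied from the Proxy resource's httpProxy/httpsProxy/noProxy fields (all values are placeholders):

```yaml
env:
- name: HTTP_PROXY
  value: http://proxy.example.com:3128
- name: HTTPS_PROXY
  value: http://proxy.example.com:3128
- name: NO_PROXY
  value: .cluster.local,.svc,localhost,127.0.0.1
```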
In terms of risks:
Remove prometheus-adapter related code from CMO code base
Proposed title of this feature request
Add scrape time jitter tolerations to UWM prometheus
What is the nature and description of the request?
Change the configuration of the UWM Prometheus instances to tolerate scrape time jitters.
Why does the customer need this? (List the business requirements)
Prometheus chunk compression relies on scrape times being accurately aligned to the scrape interval. Due to the nature of delta of delta encoding, a small delay from the configured scrape interval can cause tsdb data to occupy significantly more space.
We have observed a 50% difference in on disk tsdb storage for a replicated HA pair.
The downside is a reduction in sample accuracy and potential impact to derivatives of the time series. Allowing a jitter toleration will trade off improved chunk compression for reduced accuracy of derived data like the running average of a time series.
List any affected packages or components.
UWM Prometheus
Epic Goal
Why is this important?
Scenarios
1. …
Acceptance Criteria
Dependencies (internal and external)
1. …
Previous Work (Optional):
1. …
Open questions::
1. …
Done Checklist
Currently the transit gateway uses global routing by default.
Since global routing costs double the local routing, we need to use local routing when the PowerVS and VPC regions are the same.
Also introduce a new flag so the user can pass the global routing configuration.
Epic Goal
Description of problem:
UDP aggregation on s390x is disabled. This was done, because changing the interface feature did not work in the past (https://issues.redhat.com/browse/OCPBUGS-2532). We have implemented a change in openshift/ovn-kubernetes (https://issues.redhat.com/browse/OCPBUGS-18935) that makes the used Go library safchain/ethtool able to change the interface features on s390x, so enabling UDP aggregation should now be possible for s390x nodes. Our proposed change is: https://github.com/openshift/cluster-network-operator/pull/2331 - this fix still needs to be verified end-to-end with a payload that includes all fixes.
Version-Release number of selected component (if applicable):
How reproducible:
Always, it is disabled in code: https://github.com/openshift/cluster-network-operator/blob/a5b5de5098592867f39ac513dc62024b5b076044/pkg/network/ovn_kubernetes.go#L671-L674
Steps to Reproduce:
1. Install OpenShift cluster on s390x with OVN-Kubernetes
Actual results:
Interface feature rx-udp-gro-forwarding is disabled.
Expected results:
Interface feature rx-udp-gro-forwarding is enabled.
Additional info:
There is a config map for disabling the interface feature if needed (namespace openshift-network-operator, config map udp-aggregation-config, value disable-udp-aggregation = "true" disables UDP aggregation). This should be kept.
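For example, a minimal sketch of that opt-out ConfigMap, using the namespace, name, and key described above:

```shell
cat << EOF | oc create -f -
apiVersion: v1
kind: ConfigMap
metadata:
  name: udp-aggregation-config
  namespace: openshift-network-operator
data:
  disable-udp-aggregation: "true"
EOF
```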
Epic Goal
As an IBM Z OpenShift user, I would like to install OpenShift on LPAR (classic and DPM) for the s390x architecture. Because the Assisted Installer should ensure a good user experience, difficulties and configuration file handling should be limited to a minimum. With LPAR support, additional files are needed to boot an LPAR.
With this story a new API call will be introduced to create and download the INS file based on the current infra-env.
The current version of openshift/coredns vendors Kubernetes 1.28 packages. OpenShift 4.17 is based on Kubernetes 1.30. We also want to bump CoreDNS to v1.11.3 to pick up the most recent updates and fixes. We met on Mar 27, 2024, and reviewed the updates to CoreDNS 1.11.3; notes are here: https://docs.google.com/document/d/1xMMi5_lZRzclqe8pI2P9U6JlcggfSRvrEDj7cOfMDa0
Using old Kubernetes API and client packages brings risk of API compatibility issues. We also want to stay up-to-date with CoreDNS fixes and improvements.
Create the PR with the rebase 1.11.3 in openshift/coredns, confirm functionality, and merge PR.
Create separate SNO alert with templating engine to adjust alerting rules based on workload partitioning mechanism
After splitting the SNO and HA alerts, the HA alert does not need the SNO section in the description.
None
Create the RT kernel lane that runs the parallel conformance tests against an SNO cluster running the realtime kernel on AWS Metal
As a developer, I want to go through the e2e tests and remove PatternFly classnames being used as selectors so that future PatternFly upgrades do not break integration tests.
AC:
We are undertaking an effort to build OKD on top of CentOS Stream.
The current status of containers can be seen at https://docs.google.com/spreadsheets/d/1s3PWv9ukytTuAqeb2Y6eXg0nbmW46q5NCmo_73Cd_lU/edit#gid=595447113
Some of the work done to produce a build for arm64 and to produce custom builds in https://github.com/okd-project/okd-centos9-rebuild required Dockerfiles and similar assets from the cluster operators repositories to be forked.
This story is to track the eventual backport that should be achieved soon to get rid of most of the forks in the repo by merging the "upstream".
Provide the ability to export data in a CSV format from the various Observability pages in the OpenShift console.
Initially this will include exporting data from any tables that we use.
Product Requirements:
A user will have the ability to click a button which will download the data in the current table in a CSV format. The user will then be able to take this downloaded file and use it to import the data into their system.
Provide the ability for users to export CSV data from the dashboard and metrics line graphs.
This epic will own all of the usual update, rebase and release chores which must be done during the OpenShift 4.17 timeframe for Custom Metrics Autoscaler, Vertical Pod Autoscaler and Cluster Resource Override Operator
Shepherd and merge operator automation PR https://github.com/openshift/vertical-pod-autoscaler-operator/pull/163
Update operator like was done in https://github.com/openshift/vertical-pod-autoscaler-operator/pull/146
Shepherd and merge operand automation PR https://github.com/openshift/kubernetes-autoscaler/pull/304
Coordinate with cluster autoscaler team on upstream rebase as in https://github.com/openshift/kubernetes-autoscaler/pull/250
As an SRE monitoring ROSA HCP, I want to be able to:
so that I can achieve
Description of criteria:
Detail about what is specifically not being delivered in the story
This requires/does not require a design proposal.
This requires/does not require a feature gate.
Epic Goal*
OCP storage components (operators + CSI drivers) should not use environment variables for cloud credentials. This is discouraged by the OCP hardening guide and reported by the compliance operator. Our customers noticed it, https://issues.redhat.com/browse/OCPBUGS-7270
Why is this important? (mandatory)
We should honor our own recommendations.
Scenarios (mandatory)
Dependencies (internal and external) (mandatory)
none
Contributing Teams(and contacts) (mandatory)
Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.
Done - Checklist (mandatory)
The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.
Description of problem:
[AWS EBS CSI Driver] could not provision ebs volume succeed on cco manual mode private clusters
Version-Release number of selected component (if applicable):
4.17.0-0.nightly-2024-07-20-191204
How reproducible:
Always
Steps to Reproduce:
1. Install a private cluster with manual mode -> https://docs.openshift.com/container-platform/4.16/authentication/managing_cloud_provider_credentials/cco-short-term-creds.html#cco-short-term-creds-format-aws_cco-short-term-creds
2. Create one pvc and a pod that consumes the pvc.
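For step 2, a minimal PVC of the kind used in the reproduction (the claim name, namespace, and storage class match the log below; the size is arbitrary):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: new-pvc
  namespace: openshift-cluster-csi-drivers
spec:
  accessModes:
  - ReadWriteOnce
  storageClassName: gp3-csi
  resources:
    requests:
      storage: 1Gi
```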
Actual results:
In step 2 the pod,pvc stuck at Pending $ oc logs aws-ebs-csi-driver-controller-75cb7dd489-vvb5j -c csi-provisioner|grep new-pvc I0723 15:25:49.072662 1 controller.go:1366] provision "openshift-cluster-csi-drivers/new-pvc" class "gp3-csi": started I0723 15:25:49.073701 1 event.go:364] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"openshift-cluster-csi-drivers", Name:"new-pvc", UID:"f4f9bbaf-4149-44be-8716-8b7b973e16b8", APIVersion:"v1", ResourceVersion:"185085", FieldPath:""}): type: 'Normal' reason: 'Provisioning' External provisioner is provisioning volume for claim "openshift-cluster-csi-drivers/new-pvc" I0723 15:25:49.656889 1 event.go:364] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"openshift-cluster-csi-drivers", Name:"new-pvc", UID:"f4f9bbaf-4149-44be-8716-8b7b973e16b8", APIVersion:"v1", ResourceVersion:"185085", FieldPath:""}): type: 'Warning' reason: 'ProvisioningFailed' failed to provision volume with StorageClass "gp3-csi": rpc error: code = Internal desc = Could not create volume "pvc-f4f9bbaf-4149-44be-8716-8b7b973e16b8": could not create volume in EC2: NoCredentialProviders: no valid providers in chain I0723 15:25:50.657418 1 controller.go:1366] provision "openshift-cluster-csi-drivers/new-pvc" class "gp3-csi": started I0723 15:25:50.658112 1 event.go:364] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"openshift-cluster-csi-drivers", Name:"new-pvc", UID:"f4f9bbaf-4149-44be-8716-8b7b973e16b8", APIVersion:"v1", ResourceVersion:"185085", FieldPath:""}): type: 'Normal' reason: 'Provisioning' External provisioner is provisioning volume for claim "openshift-cluster-csi-drivers/new-pvc" I0723 15:25:51.182476 1 event.go:364] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"openshift-cluster-csi-drivers", Name:"new-pvc", UID:"f4f9bbaf-4149-44be-8716-8b7b973e16b8", APIVersion:"v1", ResourceVersion:"185085", FieldPath:""}): type: 'Warning' reason: 'ProvisioningFailed' failed to provision volume with StorageClass "gp3-csi": rpc error: code = Internal desc = Could not create volume "pvc-f4f9bbaf-4149-44be-8716-8b7b973e16b8": could not create volume in EC2: NoCredentialProviders: no valid providers in chain
Expected results:
In step 2 the pv should become Bond(volume provision succeed) and pod Running well.
Additional info:
Description of problem:
The operator passes credentials to the CSI driver using environment variables, which is discouraged. This has already been changed in 4.18; let's backport this to 4.17 too.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
The reason is to no longer store the vSphere configuration in a ConfigMap.
Epic Goal*
Our AWS EBS CSI driver operator misses some nice-to-have functionality. This Epic is meant to track it, so we finish it in an upcoming OCP release.
Why is this important? (mandatory)
In general, AWS EBS CSI driver controller should be a good citizen in HyperShift's hosted control plane. It should scale appropriately, report metrics and not use kubeadmin privileges in the guest cluster.
Scenarios (mandatory)
Dependencies (internal and external) (mandatory)
None
Contributing Teams(and contacts) (mandatory)
Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.
Done - Checklist (mandatory)
The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.
Our operators use an Unstructured client to read HostedControlPlane. HyperShift has published their API types that don't require many dependencies and we could import their types.go.
Traditionally we did these updates as bugfixes, because we did them after the feature freeze (FF).
Update the driver to the latest upstream release. Notify QE and docs with any new features and important bugfixes that need testing or documentation.
(Using separate cards for each driver because these updates can be more complicated)
Update the driver to the latest upstream release. Notify QE and docs with any new features and important bugfixes that need testing or documentation.
(Using separate cards for each driver because these updates can be more complicated)
Update OCP release number in OLM metadata manifests of:
OLM metadata of the operators is typically in the /config/manifest directory of each operator. Example of such a bump: https://github.com/openshift/aws-efs-csi-driver-operator/pull/56
We should do it early in the release, so QE can identify new operator builds easily and they are not mixed with the old release.
Update all CSI sidecars to the latest upstream release from https://github.com/orgs/kubernetes-csi/repositories
Corresponding downstream repos have `csi-` prefix, e.g. github.com/openshift/csi-external-attacher.
This includes update of VolumeSnapshot CRDs in cluster-csi-snapshot-controller- operator assets and client API in go.mod. I.e. copy all snapshot CRDs from upstream to the operator assets + go get -u github.com/kubernetes-csi/external-snapshotter/client/v6 in the operator repo.
Update the driver to the latest upstream release. Notify QE and docs with any new features and important bugfixes that need testing or documentation.
(Using separate cards for each driver because these updates can be more complicated)
Update the driver to the latest upstream release. Notify QE and docs with any new features and important bugfixes that need testing or documentation.
(Using separate cards for each driver because these updates can be more complicated)
Update the driver to the latest upstream release. Notify QE and docs with any new features and important bugfixes that need testing or documentation.
(Using separate cards for each driver because these updates can be more complicated)
Update the driver to the latest upstream release. Notify QE and docs with any new features and important bugfixes that need testing or documentation.
(Using separate cards for each driver because these updates can be more complicated)
Update all OCP and kubernetes libraries in storage operators to the appropriate version for OCP release.
This includes (but is not limited to):
Operators:
(please cross-check with *-operator + vsphere-problem-detector in our tracking sheet)
EOL, do not upgrade:
The following operators were migrated to csi-operator, do not update these obsolete repos:
tools/library-bump.py and tools/bump-all may be useful. For 4.16, this was enough:
mkdir 4.16-bump
cd 4.16-bump
../library-bump.py --debug --web <file with repo list> STOR-1574 --run "$PWD/../bump-all github.com/google/cel-go@v0.17.7" --commit-message "Bump all deps for 4.16"
4.17 perhaps needs an older prometheus:
../library-bump.py --debug --web <file with repo list> STOR-XXX --run "$PWD/../bump-all github.com/google/cel-go@v0.17.8 github.com/prometheus/common@v0.44.0 github.com/prometheus/client_golang@v1.16.0 github.com/prometheus/client_model@v0.4.0 github.com/prometheus/procfs@v0.10.1" --commit-message "Bump all deps for 4.17"
Two invariants for etcd have started failing in MicroShift:
In the same execution, another test (for feature gates) failed that we also need to fix, as it has been failing for a couple of runs now.
This section includes Jira cards that are linked to an Epic, but the Epic itself is not linked to any Feature. These epics were not completed when this image was assembled
ref:
https://github.com/openshift/must-gather/pull/345
MG (must-gather) uses the NTO image, which contains all the tools needed for collecting sysinfo data.
On the hosted cluster we do not have the NTO image, because it resides on the MNG cluster.
The scripts should detect the NTO image somehow as a fallback.
An epic we can duplicate for each release to ensure we have a place to catch things we ought to be doing regularly but can tend to fall by the wayside.
The Kubevirt team should assess when they no longer need the static implementation of their plugin after their migration to dynamic plugins is complete, and should remove the legacy static plugin code from the console repo.
AC:
Pre Lightspeed Install
Post Lightspeed Install
As a cluster admin I want to set a cluster-wide setting for hiding the Lightspeed button from the OCP Console, for all console users. This change will need to consume the API and console-operator changes done in CONSOLE-4161. The console needs to read the new `LightspeedButtonState` field and set it as a SERVER_FLAG so the frontend will be able to digest it.
AC:
Create Lightspeed hover button as part of core but extensible by Lightspeed Dynamic Plugin
AC:
ccing Andrew Pickering
As a cluster admin I want to set a cluster wide setting for hiding the Lightspeed button from OCP Console, for all the console users. For this change, an additional field will need to be introduced into console-operator's config API.
AC:
To save costs in OpenShift CI, we would like to be able to create test clusters using spot instances for both workers and masters.
AWS spot instances are:
They are thus deemed to be ideal for CI use cases because:
Spot instances can be requested by editing manifests (generated via openshift-install create manifests) and injecting spotMarketOptions into the appropriate AWS provider-specific path. Today this works for workers via both terraform and CAPI code paths; but for masters only via CAPI. The lack of support for spot masters via terraform is due to an omission in translating the relevant field from the master machine manifest.
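For illustration, this is roughly how a wrapper requests spot instances via manifests today (the master Machine manifest file name is an assumption and varies by release):
$ openshift-install create manifests --dir mycluster
$ vi mycluster/openshift/99_openshift-cluster-api_master-machines-0.yaml
# Under spec.providerSpec.value, add:
#   spotMarketOptions: {}        # request a Spot instance; optionally set maxPrice
$ openshift-install create cluster --dir mycluster
With the terraform code path this field is currently dropped for masters, which is the gap item 1 below addresses.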
This RFE proposes to:
1. Add that missing translation to openshift-install so that terraform-created master machines honor requests for spot instances.
2. Add some way to detect that this enablement exists in a given binary so that wrappers can perform validation (i.e. reject requests for spot masters if the support is absent – otherwise such requests are silently ignored).
3. Backport the above to all releases still in support to maximize the opportunity for cost savings.
We do not propose to officially support spot masters for customers, as their unreliability makes them unsuitable for general production use cases. As such, we will not add:
(We may even want to consider adding words to the openshift API field for spotMarketOptions indicating that, even though it can be made to work, we don't support spot masters.)
This does not require a design proposal.
This does not require a feature gate.
Using Spot instances might bring significant cost savings in CI. There is support in CAPA already and we should enable it for terraform too.
[sig-network-edge][Conformance][Area:Networking][Feature:Router] The HAProxy router should be able to connect to a service that is idled because a GET on the route will unidle it [Skipped:Disconnected] [Suite:openshift/conformance/parallel/minimal]
The reason for the failure is the incorrect configuration of the proxy.
failed log
Will run 1 of 1 specs ------------------------------ [sig-network-edge][Conformance][Area:Networking][Feature:Router] The HAProxy router should be able to connect to a service that is idled because a GET on the route will unidle it [Skipped:Disconnected] [Suite:openshift/conformance/parallel/minimal] github.com/openshift/origin/test/extended/router/idle.go:49 STEP: Creating a kubernetes client @ 06/14/24 10:24:21.443 Jun 14 10:24:21.752: INFO: configPath is now "/tmp/configfile3569155902" Jun 14 10:24:21.752: INFO: The user is now "e2e-test-router-idling-8pjjg-user" Jun 14 10:24:21.752: INFO: Creating project "e2e-test-router-idling-8pjjg" Jun 14 10:24:21.958: INFO: Waiting on permissions in project "e2e-test-router-idling-8pjjg" ... Jun 14 10:24:22.039: INFO: Waiting for ServiceAccount "default" to be provisioned... Jun 14 10:24:22.149: INFO: Waiting for ServiceAccount "deployer" to be provisioned... Jun 14 10:24:22.271: INFO: Waiting for ServiceAccount "builder" to be provisioned... Jun 14 10:24:22.400: INFO: Waiting for RoleBinding "system:image-pullers" to be provisioned... Jun 14 10:24:22.419: INFO: Waiting for RoleBinding "system:image-builders" to be provisioned... Jun 14 10:24:22.440: INFO: Waiting for RoleBinding "system:deployers" to be provisioned... Jun 14 10:24:22.740: INFO: Project "e2e-test-router-idling-8pjjg" has been fully provisioned. STEP: creating test fixtures @ 06/14/24 10:24:22.809 STEP: Waiting for pods to be running @ 06/14/24 10:24:23.146 Jun 14 10:24:24.212: INFO: Waiting for 1 pods in namespace e2e-test-router-idling-8pjjg Jun 14 10:24:26.231: INFO: All expected pods in namespace e2e-test-router-idling-8pjjg are running STEP: Getting a 200 status code when accessing the route @ 06/14/24 10:24:26.231 Jun 14 10:24:28.315: INFO: GET#1 "http://idle-test-e2e-test-router-idling-8pjjg.apps.2c2012a4ac53d705fb5e.ostest.test.metalkube.org" error=Get "http://idle-test-e2e-test-router-idling-8pjjg.apps.2c2012a4ac53d705fb5e.ostest.test.metalkube.org": dial tcp: lookup idle-test-e2e-test-router-idling-8pjjg.apps.2c2012a4ac53d705fb5e.ostest.test.metalkube.org on 172.30.0.10:53: no such host Jun 14 10:25:05.256: INFO: GET#38 "http://idle-test-e2e-test-router-idling-8pjjg.apps.2c2012a4ac53d705fb5e.ostest.test.metalkube.org" error=Get "http://idle-test-e2e-test-router-idling-8pjjg.apps.2c2012a4ac53d705fb5e.ostest.test.metalkube.org": dial tcp: lookup idle-test-e2e-test-router-idling-8pjjg.apps.2c2012a4ac53d705fb5e.ostest.test.metalkube.org on 172.30.0.10:53: no such host Jun 14 10:39:04.256: INFO: GET#877 "http://idle-test-e2e-test-router-idling-8pjjg.apps.2c2012a4ac53d705fb5e.ostest.test.metalkube.org" error=Get "http://idle-test-e2e-test-router-idling-8pjjg.apps.2c2012a4ac53d705fb5e.ostest.test.metalkube.org": dial tcp: lookup idle-test-e2e-test-router-idling-8pjjg.apps.2c2012a4ac53d705fb5e.ostest.test.metalkube.org on 172.30.0.10:53: no such host Jun 14 10:39:05.256: INFO: GET#878 "http://idle-test-e2e-test-router-idling-8pjjg.apps.2c2012a4ac53d705fb5e.ostest.test.metalkube.org" error=Get "http://idle-test-e2e-test-router-idling-8pjjg.apps.2c2012a4ac53d705fb5e.ostest.test.metalkube.org": dial tcp: lookup idle-test-e2e-test-router-idling-8pjjg.apps.2c2012a4ac53d705fb5e.ostest.test.metalkube.org on 172.30.0.10:53: no such host Jun 14 10:39:06.257: INFO: GET#879 "http://idle-test-e2e-test-router-idling-8pjjg.apps.2c2012a4ac53d705fb5e.ostest.test.metalkube.org" error=Get "http://idle-test-e2e-test-router-idling-8pjjg.apps.2c2012a4ac53d705fb5e.ostest.test.metalkube.org": dial tcp: 
lookup idle-test-e2e-test-router-idling-8pjjg.apps.2c2012a4ac53d705fb5e.ostest.test.metalkube.org on 172.30.0.10:53: no such host Jun 14 10:39:07.256: INFO: GET#880 "http://idle-test-e2e-test-router-idling-8pjjg.apps.2c2012a4ac53d705fb5e.ostest.test.metalkube.org" error=Get "http://idle-test-e2e-test-router-idling-8pjjg.apps.2c2012a4ac53d705fb5e.ostest.test.metalkube.org": dial tcp: lookup idle-test-e2e-test-router-idling-8pjjg.apps.2c2012a4ac53d705fb5e.ostest.test.metalkube.org on 172.30.0.10:53: no such host Jun 14 10:39:08.256: INFO: GET#881 "http://idle-test-e2e-test-router-idling-8pjjg.apps.2c2012a4ac53d705fb5e.ostest.test.metalkube.org" error=Get "http://idle-test-e2e-test-router-idling-8pjjg.apps.2c2012a4ac53d705fb5e.ostest.test.metalkube.org": dial tcp: lookup idle-test-e2e-test-router-idling-8pjjg.apps.2c2012a4ac53d705fb5e.ostest.test.metalkube.org on 172.30.0.10:53: no such host Jun 14 10:39:09.256: INFO: GET#882 "http://idle-test-e2e-test-router-idling-8pjjg.apps.2c2012a4ac53d705fb5e.ostest.test.metalkube.org" error=Get "http://idle-test-e2e-test-router-idling-8pjjg.apps.2c2012a4ac53d705fb5e.ostest.test.metalkube.org": dial tcp: lookup idle-test-e2e-test-router-idling-8pjjg.apps.2c2012a4ac53d705fb5e.ostest.test.metalkube.org on 172.30.0.10:53: no such host Jun 14 10:39:10.256: INFO: GET#883 "http://idle-test-e2e-test-router-idling-8pjjg.apps.2c2012a4ac53d705fb5e.ostest.test.metalkube.org" error=Get "http://idle-test-e2e-test-router-idling-8pjjg.apps.2c2012a4ac53d705fb5e.ostest.test.metalkube.org": dial tcp: lookup idle-test-e2e-test-router-idling-8pjjg.apps.2c2012a4ac53d705fb5e.ostest.test.metalkube.org on 172.30.0.10:53: no such host Jun 14 10:39:11.256: INFO: GET#884 "http://idle-test-e2e-test-router-idling-8pjjg.apps.2c2012a4ac53d705fb5e.ostest.test.metalkube.org" error=Get "http://idle-test-e2e-test-router-idling-8pjjg.apps.2c2012a4ac53d705fb5e.ostest.test.metalkube.org": dial tcp: lookup idle-test-e2e-test-router-idling-8pjjg.apps.2c2012a4ac53d705fb5e.ostest.test.metalkube.org on 172.30.0.10:53: no such host Jun 14 10:39:12.256: INFO: GET#885 "http://idle-test-e2e-test-router-idling-8pjjg.apps.2c2012a4ac53d705fb5e.ostest.test.metalkube.org" error=Get "http://idle-test-e2e-test-router-idling-8pjjg.apps.2c2012a4ac53d705fb5e.ostest.test.metalkube.org": dial tcp: lookup idle-test-e2e-test-router-idling-8pjjg.apps.2c2012a4ac53d705fb5e.ostest.test.metalkube.org on 172.30.0.10:53: no such host Jun 14 10:39:13.257: INFO: GET#886 "http://idle-test-e2e-test-router-idling-8pjjg.apps.2c2012a4ac53d705fb5e.ostest.test.metalkube.org" error=Get "http://idle-test-e2e-test-router-idling-8pjjg.apps.2c2012a4ac53d705fb5e.ostest.test.metalkube.org": dial tcp: lookup idle-test-e2e-test-router-idling-8pjjg.apps.2c2012a4ac53d705fb5e.ostest.test.metalkube.org on 172.30.0.10:53: no such host Jun 14 10:39:14.256: INFO: GET#887 "http://idle-test-e2e-test-router-idling-8pjjg.apps.2c2012a4ac53d705fb5e.ostest.test.metalkube.org" error=Get "http://idle-test-e2e-test-router-idling-8pjjg.apps.2c2012a4ac53d705fb5e.ostest.test.metalkube.org": dial tcp: lookup idle-test-e2e-test-router-idling-8pjjg.apps.2c2012a4ac53d705fb5e.ostest.test.metalkube.org on 172.30.0.10:53: no such host ... ... ... 
Jun 14 10:39:19.256: INFO: GET#892 "http://idle-test-e2e-test-router-idling-8pjjg.apps.2c2012a4ac53d705fb5e.ostest.test.metalkube.org" error=Get "http://idle-test-e2e-test-router-idling-8pjjg.apps.2c2012a4ac53d705fb5e.ostest.test.metalkube.org": dial tcp: lookup idle-test-e2e-test-router-idling-8pjjg.apps.2c2012a4ac53d705fb5e.ostest.test.metalkube.org on 172.30.0.10:53: no such host Jun 14 10:39:20.256: INFO: GET#893 "http://idle-test-e2e-test-router-idling-8pjjg.apps.2c2012a4ac53d705fb5e.ostest.test.metalkube.org" error=Get "http://idle-test-e2e-test-router-idling-8pjjg.apps.2c2012a4ac53d705fb5e.ostest.test.metalkube.org": dial tcp: lookup idle-test-e2e-test-router-idling-8pjjg.apps.2c2012a4ac53d705fb5e.ostest.test.metalkube.org on 172.30.0.10:53: no such host [INTERRUPTED] in [It] - github.com/openshift/origin/test/extended/router/idle.go:49 @ 06/14/24 10:39:20.461 ------------------------------ Interrupted by User First interrupt received; Ginkgo will run any cleanup and reporting nodes but will skip all remaining specs. Interrupt again to skip cleanup. Here's a current progress report: [sig-network-edge][Conformance][Area:Networking][Feature:Router] The HAProxy router should be able to connect to a service that is idled because a GET on the route will unidle it [Skipped:Disconnected] [Suite:openshift/conformance/parallel/minimal] (Spec Runtime: 14m59.024s) github.com/openshift/origin/test/extended/router/idle.go:49 In [It] (Node Runtime: 14m57.721s) github.com/openshift/origin/test/extended/router/idle.go:49 At [By Step] Getting a 200 status code when accessing the route (Step Runtime: 14m54.229s) github.com/openshift/origin/test/extended/router/idle.go:175 Spec Goroutine goroutine 307 [select] k8s.io/apimachinery/pkg/util/wait.waitForWithContext({0x95f5188, 0xda30720}, 0xc004cfbcf8, 0x30?) k8s.io/apimachinery@v0.29.0/pkg/util/wait/wait.go:205 k8s.io/apimachinery/pkg/util/wait.poll({0x95f5188, 0xda30720}, 0x1?, 0xc0045c2a80?, 0xc0045c2a87?) k8s.io/apimachinery@v0.29.0/pkg/util/wait/poll.go:260 k8s.io/apimachinery/pkg/util/wait.PollWithContext({0x95f5188?, 0xda30720?}, 0xc004cfbd90?, 0x88699b3?, 0x7?) k8s.io/apimachinery@v0.29.0/pkg/util/wait/poll.go:85 k8s.io/apimachinery/pkg/util/wait.Poll(0xc004cfbd00?, 0x88699b3?, 0x1?) k8s.io/apimachinery@v0.29.0/pkg/util/wait/poll.go:66 > github.com/openshift/origin/test/extended/router.waitHTTPGetStatus({0xc003d8fbc0, 0x5a}, 0xc8, 0x0?) github.com/openshift/origin/test/extended/router/idle.go:306 > github.com/openshift/origin/test/extended/router.glob..func7.2.1() github.com/openshift/origin/test/extended/router/idle.go:178 github.com/onsi/ginkgo/v2/internal.extractBodyFunction.func3({0x2e24138, 0xc0014f2d80}) github.com/onsi/ginkgo/v2@v2.13.0/internal/node.go:463 github.com/onsi/ginkgo/v2/internal.(*Suite).runNode.func3() github.com/onsi/ginkgo/v2@v2.13.0/internal/suite.go:896 github.com/onsi/ginkgo/v2/internal.(*Suite).runNode in goroutine 1 github.com/onsi/ginkgo/v2@v2.13.0/internal/suite.go:883 -----------------------------
This is a clone of issue OCPBUGS-38713. The following is the description of the original issue:
—
[sig-network-edge] DNS should answer A and AAAA queries for a dual-stack service [apigroup:config.openshift.io] [Suite:openshift/conformance/parallel]
failed log
[sig-network-edge] DNS should answer A and AAAA queries for a dual-stack service [apigroup:config.openshift.io] [Suite:openshift/conformance/parallel] github.com/openshift/origin/test/extended/dns/dns.go:499 STEP: Creating a kubernetes client @ 08/12/24 15:55:02.255 STEP: Building a namespace api object, basename dns @ 08/12/24 15:55:02.257 STEP: Waiting for a default service account to be provisioned in namespace @ 08/12/24 15:55:02.517 STEP: Waiting for kube-root-ca.crt to be provisioned in namespace @ 08/12/24 15:55:02.581 STEP: Creating a kubernetes client @ 08/12/24 15:55:02.646 Aug 12 15:55:03.941: INFO: configPath is now "/tmp/configfile2098808007" Aug 12 15:55:03.941: INFO: The user is now "e2e-test-dns-dualstack-9bgpm-user" Aug 12 15:55:03.941: INFO: Creating project "e2e-test-dns-dualstack-9bgpm" Aug 12 15:55:04.299: INFO: Waiting on permissions in project "e2e-test-dns-dualstack-9bgpm" ... Aug 12 15:55:04.632: INFO: Waiting for ServiceAccount "default" to be provisioned... Aug 12 15:55:04.788: INFO: Waiting for ServiceAccount "deployer" to be provisioned... Aug 12 15:55:04.972: INFO: Waiting for ServiceAccount "builder" to be provisioned... Aug 12 15:55:05.132: INFO: Waiting for RoleBinding "system:image-pullers" to be provisioned... Aug 12 15:55:05.213: INFO: Waiting for RoleBinding "system:image-builders" to be provisioned... Aug 12 15:55:05.281: INFO: Waiting for RoleBinding "system:deployers" to be provisioned... Aug 12 15:55:05.641: INFO: Project "e2e-test-dns-dualstack-9bgpm" has been fully provisioned. STEP: creating a dual-stack service on a dual-stack cluster @ 08/12/24 15:55:05.775 STEP: Running these commands:for i in `seq 1 10`; do [ "$$(dig +short +notcp +noall +answer +search v4v6.e2e-dns-2700.svc A | sort | xargs echo)" = "172.31.255.230" ] && echo "test_endpoints@v4v6.e2e-dns-2700.svc"; [ "$$(dig +short +notcp +noall +answer +search v4v6.e2e-dns-2700.svc AAAA | sort | xargs echo)" = "fd02::7321" ] && echo "test_endpoints_v6@v4v6.e2e-dns-2700.svc"; [ "$$(dig +short +notcp +noall +answer +search ipv4.v4v6.e2e-dns-2700.svc A | sort | xargs echo)" = "3.3.3.3 4.4.4.4" ] && echo "test_endpoints@ipv4.v4v6.e2e-dns-2700.svc"; [ "$$(dig +short +notcp +noall +answer +search ipv6.v4v6.e2e-dns-2700.svc AAAA | sort | xargs echo)" = "2001:4860:4860::3333 2001:4860:4860::4444" ] && echo "test_endpoints_v6@ipv6.v4v6.e2e-dns-2700.svc";sleep 1; done @ 08/12/24 15:55:05.935 STEP: creating a pod to probe DNS @ 08/12/24 15:55:05.935 STEP: submitting the pod to kubernetes @ 08/12/24 15:55:05.935 STEP: deleting the pod @ 08/12/24 16:00:06.034 [FAILED] in [It] - github.com/openshift/origin/test/extended/dns/dns.go:251 @ 08/12/24 16:00:06.074 STEP: Collecting events from namespace "e2e-test-dns-dualstack-9bgpm". @ 08/12/24 16:00:06.074 STEP: Found 0 events. @ 08/12/24 16:00:06.207 Aug 12 16:00:06.239: INFO: POD NODE PHASE GRACE CONDITIONS Aug 12 16:00:06.239: INFO: Aug 12 16:00:06.334: INFO: skipping dumping cluster info - cluster too large Aug 12 16:00:06.469: INFO: Deleted {user.openshift.io/v1, Resource=users e2e-test-dns-dualstack-9bgpm-user}, err: <nil> Aug 12 16:00:06.506: INFO: Deleted {oauth.openshift.io/v1, Resource=oauthclients e2e-client-e2e-test-dns-dualstack-9bgpm}, err: <nil> Aug 12 16:00:06.544: INFO: Deleted {oauth.openshift.io/v1, Resource=oauthaccesstokens sha256~4QgFXAn8lyosshoHOjJeddr3MJbIL2DnCsoIvJVOGb4}, err: <nil> STEP: Destroying namespace "e2e-test-dns-dualstack-9bgpm" for this suite. 
@ 08/12/24 16:00:06.544 STEP: dump namespace information after failure @ 08/12/24 16:00:06.58 STEP: Collecting events from namespace "e2e-dns-2700". @ 08/12/24 16:00:06.58 STEP: Found 2 events. @ 08/12/24 16:00:06.615 Aug 12 16:00:06.615: INFO: At 0001-01-01 00:00:00 +0000 UTC - event for dns-test-d93fff7e-90a3-408e-a197-fc4ff0738b30: { } FailedScheduling: 0/3 nodes are available: 3 node(s) didn't match Pod's node affinity/selector. preemption: 0/3 nodes are available: 3 Preemption is not helpful for scheduling. Aug 12 16:00:06.615: INFO: At 0001-01-01 00:00:00 +0000 UTC - event for dns-test-d93fff7e-90a3-408e-a197-fc4ff0738b30: { } FailedScheduling: skip schedule deleting pod: e2e-dns-2700/dns-test-d93fff7e-90a3-408e-a197-fc4ff0738b30 Aug 12 16:00:06.648: INFO: POD NODE PHASE GRACE CONDITIONS Aug 12 16:00:06.648: INFO: Aug 12 16:00:06.743: INFO: skipping dumping cluster info - cluster too large STEP: Destroying namespace "e2e-dns-2700" for this suite. @ 08/12/24 16:00:06.743 • [FAILED] [304.528 seconds] [sig-network-edge] DNS [It] should answer A and AAAA queries for a dual-stack service [apigroup:config.openshift.io] [Suite:openshift/conformance/parallel] github.com/openshift/origin/test/extended/dns/dns.go:499 [FAILED] Failed: timed out waiting for the condition In [It] at: github.com/openshift/origin/test/extended/dns/dns.go:251 @ 08/12/24 16:00:06.074 ------------------------------ Summarizing 1 Failure: [FAIL] [sig-network-edge] DNS [It] should answer A and AAAA queries for a dual-stack service [apigroup:config.openshift.io] [Suite:openshift/conformance/parallel] github.com/openshift/origin/test/extended/dns/dns.go:251 Ran 1 of 1 Specs in 304.528 seconds FAIL! -- 0 Passed | 1 Failed | 0 Pending | 0 Skipped fail [github.com/openshift/origin/test/extended/dns/dns.go:251]: Failed: timed out waiting for the condition Ginkgo exit error 1: exit with code 1
failure reason
TODO
Today, when we create an AKS cluster, we provide the catalog images like so:
--annotations hypershift.openshift.io/certified-operators-catalog-image=registry.redhat.io/redhat/certified-operator-index@sha256:fc68a3445d274af8d3e7d27667ad3c1e085c228b46b7537beaad3d470257be3e \ --annotations hypershift.openshift.io/community-operators-catalog-image=registry.redhat.io/redhat/community-operator-index@sha256:4a2e1962688618b5d442342f3c7a65a18a2cb014c9e66bb3484c687cfb941b90 \ --annotations hypershift.openshift.io/redhat-marketplace-catalog-image=registry.redhat.io/redhat/redhat-marketplace-index@sha256:ed22b093d930cfbc52419d679114f86bd588263f8c4b3e6dfad86f7b8baf9844 \ --annotations hypershift.openshift.io/redhat-operators-catalog-image=registry.redhat.io/redhat/redhat-operator-index@sha256:59b14156a8af87c0c969037713fc49be7294401b10668583839ff2e9b49c18d6 \
We need to fix this so that we don't need to override those images on create command when we are in AKS.
The current reason we annotate the catalog images when we create an AKS cluster is that the HCP controller will try to source the images from an ImageStream if there are no overrides here - https://github.com/openshift/hypershift/blob/64149512a7a1ea21cb72d4473f46210ac1d3efe0/control-plane-operator/controllers/hostedcontrolplane/hostedcontrolplane_controller.go#L3672. In AKS, ImageStreams are not available.
This will be home to stories that aren't required for GA.
We have decided that each platform will have its own feature gate as we work on adding per-platform support. This story involves ensuring that the boot image controller only runs when a valid combination of feature gate and cluster platform is found.
There are use cases for using the baremetal platform and having the baremetal capability while disabling MachineAPI (for example, to use CAPI).
Currently, there are a few validations preventing this in the installer and in openshift/api; these validations exist because CBO will crash if the Machine CRD doesn't exist on the cluster.
CBO is usable on many platforms and depends on MAPI on all of them, when it only really needs it for baremetal IPI.
Allow using the baremetal platform and having the baremetal capability while disabling MachineAPI.
yes, we should update the docs to say this configuration is valid.
Overall this should allow users to install OCP with baremetal platform using ABI or assisted installer while disabling the MachineAPI capability.
Here is an example patch for configuring this when installing with assisted installer:
curl --request PATCH --header "Content-Type: application/json" --data '"{\"capabilities\": {\"baselineCapabilitySet\": \"None\", \"additionalEnabledCapabilities\": [\"openshift-samples\", \"marketplace\", \"Console\", \"baremetal\", \"Insights\", \"Storage\", \"NodeTuning\", \"CSISnapshot\", \"OperatorLifecycleManager\", \"Ingress\"]}}"' "http://$ASSISTED_SERVICE_IP:$ASSISTED_SERVICE_PORT/api/assisted-install/v2/clusters/$CLUSTER_ID/install-config"
Also, skip machine generation since the CRD won't exist
The CBO provisioning_controller crashes if the Machine CRD doesn't exist.
Up until now, we only published the wheel files to PyPI, without the more basic sdist zip:
https://github.com/openshift/assisted-service/blob/ea66ea918a923d6f23bdddcc8daa1da0fa0471c8/Makefile#L240
Allow uploading the (already created) sdist file as well.
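A minimal sketch of what the publish step could look like once the sdist is included (the exact Makefile target and tooling are assumptions; twine may already be in use, per the Makefile linked above):
$ python3 setup.py sdist bdist_wheel   # build both the sdist tarball and the wheel
$ twine upload dist/*                  # upload everything under dist/, not just the wheel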
Currently the subsystem tests don't clean up all resources at the end, resulting in errors when running more than once. We want to clean up these resources so the tests can be run repeatedly.
Currently, we generate the python client for assisted-service inside its primary Dockerfile. In order to generate the python client, we need the repository git tags. This behavior collides with Konflux because Konflux builds the image while searching for the tags in the fork instead of the upstream repository. As a result, developers currently need to periodically push git tags to their fork for Konflux to build successfully. We want to explore the option of removing the client generation from the Dockerfile.
reference - https://redhat-internal.slack.com/archives/C035X734RQB/p1719496182644409
It has been decided to move the client generation to test-infra
ACM 2.10/MCE 2.5 rolled back Assisted Installer/Service images to use RHEL 8 (issue ACM-10022) due to incompatibility of various dependencies when running FIPS.
In order to support RHEL 9 in assisted installer/service
For this epic we will clearly need to make major changes to the code that runs the installer binary. Right now this is all mixed in with the discovery ignition code as well as some generic ignition parsing and merging functions.
Split this logic into separate files so the changes for this epic will be easier to reason about.
Current file for reference: https://github.com/openshift/assisted-service/blob/9a4ece99d181459927586ac5105ca606ddc058fe/internal/ignition/ignition.go
Ensure there's a clear line between what is coming from assisted service for a particular request to generate the manifests and what needs to be provided during deployment through env vars.
Ideally this interface would be described by the pkg/generator package.
Clean up this package to remove unused parameters and ensure it is self contained (env vars are not coming from main, for example).
During 4.15, the OCP team is working on allowing booting from iSCSI. Today that's disabled by the assisted installer. The goal is to enable it for OCP versions >= 4.15.
iSCSI boot is enabled for OCP versions >= 4.15, both in the UI and the backend.
When booting from iSCSI, we need to make sure to add the `rd.iscsi.firmware=1` karg during install to enable iSCSI booting.
yes
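For illustration, this is the kind of kernel-argument injection needed at install time; the sketch below uses coreos-installer directly, while the assisted installer would have to apply the equivalent automatically (the target device is illustrative):
$ coreos-installer install --append-karg rd.iscsi.firmware=1 /dev/sda   # write the image and persist the iSCSI firmware karg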
In order to successfully install OCP on an iSCSI boot volume, we need to make sure that the machine has 2 network interfaces:
This is required because on startup OVS/OVN will reconfigure the default interface (the network interface used for the default gateway). This behavior makes the usage of the default interface impracticable for the iSCSI traffic because we lose the root volume, and the node becomes unusable. See https://issues.redhat.com/browse/OCPBUGS-26071
In the scope of this issue we need to:
Currently, the monitoring stack is configured using a configmap. In OpenShift though the best practice is to configure operators using custom resources.
To start the effort we should create a feature gate behind which we can start implementing a CRD config approach. This allows us to iterate in smaller increments without having to support full feature parity with the config map from the start. We can start small and add features as they evolve.
One proposal for a minimal DoD was:
Feature parity should be planned in one or more separate epics.
Behind a feature flag we can start experimenting with a CRD and explore migration and upgrades.
(originated from https://issues.redhat.com/browse/TRT-55)
Currently the KubePersistentVolumeErrors alert is deployed by the cluster-monitoring operator and lives in the openshift-monitoring namespace. The metric involved in the alert (kube_persistentvolume_status_phase) comes from kube-state-metrics but it would be clearer if https://github.com/openshift/cluster-storage-operator/ owns the alert.
Also relevant https://coreos.slack.com/archives/C01CQA76KMX/p1637075395269400
The 4.16 dev cycle showed CMO e2e test timeouts more frequently. This hinders our development process and might indicate an issue in our code.
We should spend some time to analyze these failures and improve CMO e2e test reliability.
Most of the Kube API requests executed during CMO e2e tests wait a few seconds before actually issuing the request. We could save a fraction of time per action if they didn't wait.
This epic is to track stories that are not completed in MON-3378
There are a few places in CMO where we need to remove code after the release-4.16 branch is cut.
To find them, look for the "TODO" comments.
After we have replaced all oauth-proxy occurrences in the monitoring stack, we need to make sure that all references to oauth-proxy are removed from the cluster monitoring operator. Examples:
Do a test deploy and then add the following regions to the installer:
osa21
syd04
lon06
sao01
KCM and KAS previously relied on having the `--cloud-provider` and `--cloud-config` flags set. However, these are no longer required as there is no cloud provider code in either binary.
Both operators rely on the config observer in library go to set these flags.
In the future, if these values are set, even to the empty string, then startup will fail.
The config observer sets keys and values in a map; we need to make sure the keys for these two flags are deleted rather than set to a specific value.
None
ClusterTask has been deprecated and will be removed in Pipelines Operator 1.17. In the console UI, we have a ClusterTask list page, and ClusterTasks are also listed in the Tasks quick search in the Pipeline builder form.
Remove ClusterTask and references from the console UI and use Tasks from `openshift-pipelines` namespace.
Resolver in Tekton https://tekton.dev/docs/pipelines/resolution-getting-started/
Task resolution: https://tekton.dev/docs/pipelines/cluster-resolver/#task-resolution
ClusterTask has been deprecated and will be removed in Pipelines Operator 1.17
We have to use Tasks from the `openshift-pipelines` namespace. This change will happen in the console-plugin repo (dynamic plugin), so in the console repository we have to remove all ClusterTask dependencies when the Pipelines Operator is 1.17 or above.
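A minimal sketch of referencing a Task from the `openshift-pipelines` namespace via the cluster resolver, following the Tekton docs linked above (the Pipeline and Task names are illustrative):
$ cat << EOF | oc apply -f -
apiVersion: tekton.dev/v1
kind: Pipeline
metadata:
  name: example-pipeline
spec:
  tasks:
    - name: build
      taskRef:
        resolver: cluster
        params:
          - name: kind
            value: task
          - name: name
            value: buildah
          - name: namespace
            value: openshift-pipelines
EOF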
This is a clone of issue OCPBUGS-43752. The following is the description of the original issue:
—
Description of problem:
Add a disallowed flag to hide the pipelines-plugin pipeline builder route, the add action, and the catalog provider extension as they are migrated to the Pipelines console-plugin, so that there are no duplicate actions in the console.
Currently `pkg/operator/bootstrap.go` is quite a mess, as vSphere explicitly ignores the `getPlatformManifests` function and creates the manifest list manually.
As the logic used there is different than for any other platform, this creates some confusion as to when CoreDNS and keepalived are included. I.e. for every platform except vSphere we always deploy CoreDNS and sometimes skip keepalived, but for vSphere whenever keepalived is skipped, CoreDNS is skipped too.
As the code does not really document the reasons for this, it should be refactored.
List of flakes:
Spec | Location |
---|---|
creates v1 CRDs with a v1 schema successfully | test/e2e/crd_e2e_test.go |
should have removed the old configmap and put the new configmap in place | test/e2e/gc_e2e_test.go |
can satisfy an associated ClusterServiceVersion's ownership requirement | test/e2e/csv_e2e_test.go |
Is updated when the CAs expire | test/e2e/webhook_e2e_test.go |
upgrade CRD with deprecated version | test/e2e/installplan_e2e_test.go |
consistent generation | test/e2e/installplan_e2e_test.go |
should clear up the condition in the InstallPlan status that contains an error message when a valid OperatorGroup is created | test/e2e/installplan_e2e_test.go |
OperatorCondition Upgradeable type and overrides | test/e2e/operator_condition_e2e_test.go |
eventually reports a successful state when using skip ranges | test/e2e/fail_forward_e2e_test.go |
eventually reports a successful state when using replaces | test/e2e/fail_forward_e2e_test.go |
intersection | test/e2e/operator_groups_e2e_test.go |
OLM applies labels to Namespaces that are associated with an OperatorGroup | test/e2e/operator_groups_e2e_test.go |
updates multiple intermediates | test/e2e/subscription_e2e_test.go |
creation with dependencies | test/e2e/subscription_e2e_test.go |
choose the dependency from the right CatalogSource based on lexicographical name ordering of catalogs | test/e2e/subscription_e2e_test.go |
should report only package and channel deprecation conditions when bundle is no longer deprecated | test/e2e/subscription_e2e_test.go |
$ grep -irF "[FLAKE]" test/e2e
Description of problem:
The e2e test "upgrade CRD with deprecated version" in the test/e2e/installplan_e2e_test.go suite is flaking
Version-Release number of selected component (if applicable):
How reproducible:
Hard to reproduce, could be related to other tests running at the same time, or any number of things.
Steps to Reproduce:
It might be worthwhile trying to re-run the test multiple times against a ClusterBot or OpenShift Local cluster.
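A rough loop for that, run from the repo root (the exact test invocation and focus string are assumptions; the OLM repo may use ginkgo focus flags instead):
$ for i in $(seq 1 20); do go test ./test/e2e/... -count=1 -run 'InstallPlan' || break; done   # stop at the first failure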
Actual results:
Expected results:
Additional info:
Deploy ODF with only SSDs
Some customers (especially those using VMware) are deploying ODF with HDDs
Deployment - Add a warning and block deployment in case HDD disks are in use with LSO
Add capacity warning and block
No documentation requirements,
No, this request is coming from support
Per [1], ODF does not support HDD in internal mode. I would like to request we add a feature to the console during install that either stops, or warns the customer that they're installing an unsupported cluster if HDDs are detected and selected as the osd devices. I know that we can detect the rotational flag of all locally attached devices since we currently have the option to filter by ssd vs. hdd when picking the backing disks during install. This bz is a request to take it a step further and present the customer with the information explicitly during console install that hdds are unsupported. [1] https://access.redhat.com/articles/5001441
bmcmurra@redhat.com
Addition of an ODF use-case specific warning to the LSO's UI indicating users that - "HDD devices are not supported for ODF", in case they are planning to use them for ODF "StorageSystem" creation later.
In some cases, it is desirable to set the control plane size of a cluster regardless of the number of workers.
Introduce an annotation hypershift.openshift.io/cluster-size-override to set the value of the cluster size label regardless of number of workers on the hosted cluster.
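A minimal sketch of applying the override (the cluster name, namespace, and size value are illustrative):
$ oc annotate hostedcluster example -n clusters hypershift.openshift.io/cluster-size-override=large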
We have decided to remove IPI and UPI support for Alibaba Cloud, which until recently has been in Tech Preview due to the following reasons:
(1) Low customer interest of using Openshift on Alibaba Cloud
(2) Removal of Terraform usage
(3) MAPI to CAPI migration
(4) CAPI adoption for installation (Day 1) in OpenShift 4.16 and Day 2 (post-OpenShift 4.16)
Impacted areas based on CI:
alibaba-cloud-csi-driver/openshift-alibaba-cloud-csi-driver-release-4.16.yaml
alibaba-disk-csi-driver-operator/openshift-alibaba-disk-csi-driver-operator-release-4.16.yaml
cloud-provider-alibaba-cloud/openshift-cloud-provider-alibaba-cloud-release-4.16.yaml
cluster-api-provider-alibaba/openshift-cluster-api-provider-alibaba-release-4.16.yaml
cluster-cloud-controller-manager-operator/openshift-cluster-cloud-controller-manager-operator-release-4.16.yaml
machine-config-operator/openshift-machine-config-operator-release-4.16.yaml
Acceptance Criteria
Remove alibaba from machine api operator
Remove alibaba references from the machine-config-operator project
We have a consistent complication where developers miss or ignore job failures on presubmits, because they don't trust the jobs which sometimes have overall pass rates under 30%.
We have a systemic problem with flaky tests and jobs. Few pay attention anymore, and even fewer people know how to distinguish serious failures from the noise.
Just fixing the test and jobs is infeasible, piece by piece maybe but we do not have the time to invest in what would be a massive effort.
Sippy now has presubmit data throughout the history of a PR.
Could Sippy analyze the presubmits for every PR, check test failures against their current pass rates, filter out noise from ongoing incidents, and then comment on PRs letting developers know what's really going on?
As an example:
job foo - failure severity: LOW
job bar - failure severity: HIGH
job zoo - failure severity: UNKNOWN
David requests this be published in the job as a spyglass panel, which gives a historical artifact. We'd likely do both so we know developers see the comments.
This epic will cover TRTs project to enhance Sippy to categorize the likely severity of test failures in a bad job run, store this as a historical artifact on the job run, and communicate it directly to developers in their PRs via a comment.
We want to work on enhancing the analytics for Risk Analysis. Currently we comment when we see repeated failures that have high historical pass rates; however, when a regression comes in from another PR, we will flag that regression as a risk for each PR that sees the failure.
For this story we want to persist potential regressions detected by risk analysis in BigQuery. A potential place to insert logic is in pr_commenting_processor.
We want to make sure we can track the repo, PR, test name, potentially the test ID, and the associated risk.
Future work will include querying this data when building the risk summary to see if any tests flagged as risky within the current PR are also failing in other PRs indicating this PR is not the contributing factor and a regression has potentially entered the main branches / payloads.
The intervals charts displayed at the top of all prow job runs have become a critical tool for TRT and OpenShift engineering in general, allowing us to determine what happened when, and in relation to other events. The tooling, however, is falling short in a number of areas we'd like to improve.
Goals:
Stretch goals:
See linked Jira which must be completed before this can be done.
We want this to be dynamic, so origin code can effectively decide what the intervals should show. This is done via the new "display" field on Intervals. Grouping should likewise be automatic, and the JS / react UI should no longer have to decide which groups to show. Possible origin changes will be needed here.
This section includes Jira cards that are not linked to either an Epic or a Feature. These tickets were completed when this image was assembled
Please review the following PR: https://github.com/openshift/vsphere-problem-detector/pull/157
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/apiserver-network-proxy/pull/54
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/ibm-vpc-block-csi-driver-operator/pull/118
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
This is a clone of issue OCPBUGS-38557. The following is the description of the original issue:
—
Similar to the work done for AWS STS and Azure WIF support, the console UI (specifically OperatorHub) needs to:
CONSOLE-3776 was adding filtering for the GCP WIF case for the operator-hub tile view. Part of the change was also to check for the annotation which indicates that the operator supports GCP's WIF:
features.operators.openshift.io/token-auth-gcp: "true"
AC:
Added a new CLI to autogenerate the updates needed for the deployment.yaml, create a branch, push the changes to GitLab, and create the MR
Description of problem:
Starting with OCP 4.14, we have decided to start using OCP's own "bridge" CNI build instead of our "cnv-bridge" rebuild. To make sure that current users of "cnv-bridge" don't have to change their configuration, we kept "cnv-bridge" as a symlink to "bridge". While the old name still functions, we should make an effort to move users to "bridge". To do that, we can start by changing UI so it generates NADs of the type "bridge" instead of "cnv-bridge".
Version-Release number of selected component (if applicable):
4.14
How reproducible:
Always
Steps to Reproduce:
1. Use the NetworkAttachmentDefinition dialog to create a network of type bridge 2. Read the generated yaml
Actual results:
It has "type": "cnv-bridge"
Expected results:
It should have "type": "bridge"
Additional info:
The same should be done to any instance of "cnv-tuning" by changing it to "tuning".
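For reference, after this change a NAD generated by the dialog should look roughly like the sketch below (the names and bridge device are illustrative); the same replacement applies to "cnv-tuning" becoming "tuning":
$ cat << EOF | oc apply -f -
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: example-bridge
  namespace: default
spec:
  config: |
    {
      "cniVersion": "0.3.1",
      "name": "example-bridge",
      "type": "bridge",
      "bridge": "br1"
    }
EOF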
Currently the Azure capi-provider crashes with the following error:
E0529 10:44:25.385040 1 main.go:430] "unable to create controller" err="failed to create mapper for Cluster to AzureMachines: failed to get restmapping: no matches for kind \"AzureMachinePoolList\" in group \"infrastructure.cluster.x-k8s.io\"" logger="setup" controller="AzureMachinePool"
This is caused by the MachinePool feature gate now being enabled by default in cluster-api-provider-azure.
Description of problem:
Version-Release number of selected component (if applicable):
Build the cluster with PR openshift/ovn-kubernetes#2223,openshift/cluster-network-operator#2433, enable TechPreview feature gate
How reproducible:
Always
Steps to Reproduce:
1. Create namespace ns1,ns2,ns3
2. Create NAD under ns1,ns2,ns3 with
% oc get net-attach-def -n ns1 -o yaml apiVersion: v1 items: - apiVersion: k8s.cni.cncf.io/v1 kind: NetworkAttachmentDefinition metadata: creationTimestamp: "2024-07-11T08:35:13Z" generation: 1 name: l3-network-ns1 namespace: ns1 resourceVersion: "165141" uid: 8eca76bf-ee30-4a0e-a892-92a480086aa1 spec: config: | { "cniVersion": "0.3.1", "name": "l3-network-ns1", "type": "ovn-k8s-cni-overlay", "topology":"layer3", "subnets": "10.200.0.0/16/24", "mtu": 1300, "netAttachDefName": "ns1/l3-network-ns1", "role": "primary" } kind: List metadata: resourceVersion: "" % oc get net-attach-def -n ns2 -o yaml apiVersion: v1 items: - apiVersion: k8s.cni.cncf.io/v1 kind: NetworkAttachmentDefinition metadata: creationTimestamp: "2024-07-11T08:35:19Z" generation: 1 name: l3-network-ns2 namespace: ns2 resourceVersion: "165183" uid: 944b50b1-106f-4683-9cea-450521260170 spec: config: | { "cniVersion": "0.3.1", "name": "l3-network-ns2", "type": "ovn-k8s-cni-overlay", "topology":"layer3", "subnets": "10.200.0.0/16/24", "mtu": 1300, "netAttachDefName": "ns2/l3-network-ns2", "role": "primary" } kind: List metadata: resourceVersion: "" % oc get net-attach-def -n ns3 -o yaml apiVersion: v1 items: - apiVersion: k8s.cni.cncf.io/v1 kind: NetworkAttachmentDefinition metadata: creationTimestamp: "2024-07-11T08:35:26Z" generation: 1 name: l3-network-ns3 namespace: ns3 resourceVersion: "165257" uid: 93683aac-7f8a-4263-b0f6-ed9182c5c47c spec: config: | { "cniVersion": "0.3.1", "name": "l3-network-ns3", "type": "ovn-k8s-cni-overlay", "topology":"layer3", "subnets": "10.200.0.0/16/24", "mtu": 1300, "netAttachDefName": "ns3/l3-network-ns3", "role": "primary" } kind: List metadata:
3. Create test pods under ns1,ns2,ns3
Using below yaml to create pods under ns1
% cat data/udn/list-for-pod.json { "apiVersion": "v1", "kind": "List", "items": [ { "apiVersion": "v1", "kind": "ReplicationController", "metadata": { "labels": { "name": "test-rc" }, "name": "test-rc" }, "spec": { "replicas": 2, "template": { "metadata": { "labels": { "name": "test-pods" }, "annotations": { "k8s.v1.cni.cncf.io/networks": "l3-network-ns1"} }, "spec": { "containers": [ { "image": "quay.io/openshifttest/hello-sdn@sha256:c89445416459e7adea9a5a416b3365ed3d74f2491beb904d61dc8d1eb89a72a4", "name": "test-pod", "imagePullPolicy": "IfNotPresent" } ] } } } }, { "apiVersion": "v1", "kind": "Service", "metadata": { "labels": { "name": "test-service" }, "name": "test-service" }, "spec": { "ports": [ { "name": "http", "port": 27017, "protocol": "TCP", "targetPort": 8080 } ], "selector": { "name": "test-pods" } } } ] } oc get pods -n ns1 NAME READY STATUS RESTARTS AGE test-rc-5ns7z 1/1 Running 0 3h7m test-rc-bxf2h 1/1 Running 0 3h7m Using below yaml to create a pod in ns2 % cat data/udn/podns2.yaml kind: Pod apiVersion: v1 metadata: name: hello-pod-ns2 namespace: ns2 annotations: k8s.v1.cni.cncf.io/networks: l3-network-ns2 labels: name: hello-pod-ns2 spec: securityContext: runAsNonRoot: true seccompProfile: type: RuntimeDefault containers: - image: "quay.io/openshifttest/hello-sdn@sha256:c89445416459e7adea9a5a416b3365ed3d74f2491beb904d61dc8d1eb89a72a4" name: hello-pod-ns2 securityContext: allowPrivilegeEscalation: false capabilities: drop: ["ALL"] Using below yaml to create a pod in ns3 % cat data/udn/podns3.yaml kind: Pod apiVersion: v1 metadata: name: hello-pod-ns3 namespace: ns3 annotations: k8s.v1.cni.cncf.io/networks: l3-network-ns3 labels: name: hello-pod-ns3 spec: securityContext: runAsNonRoot: true seccompProfile: type: RuntimeDefault containers: - image: "quay.io/openshifttest/hello-sdn@sha256:c89445416459e7adea9a5a416b3365ed3d74f2491beb904d61dc8d1eb89a72a4" name: hello-pod-ns3 securityContext: allowPrivilegeEscalation: false capabilities: drop: ["ALL"]
4. Test the pods connection in primary network in ns1, it worked well
% oc rsh -n ns1 test-rc-5ns7z ~ $ ip a 1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000 link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00 inet 127.0.0.1/8 scope host lo valid_lft forever preferred_lft forever inet6 ::1/128 scope host valid_lft forever preferred_lft forever 2: eth0@if157: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1360 qdisc noqueue state UP group default link/ether 0a:58:0a:80:02:1e brd ff:ff:ff:ff:ff:ff link-netnsid 0 inet 10.128.2.30/23 brd 10.128.3.255 scope global eth0 valid_lft forever preferred_lft forever inet6 fe80::858:aff:fe80:21e/64 scope link valid_lft forever preferred_lft forever 3: net1@if158: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1300 qdisc noqueue state UP group default link/ether 0a:58:0a:c8:01:03 brd ff:ff:ff:ff:ff:ff link-netnsid 0 inet 10.200.1.3/24 brd 10.200.1.255 scope global net1 valid_lft forever preferred_lft forever inet6 fe80::858:aff:fec8:103/64 scope link valid_lft forever preferred_lft forever ~ $ exit % oc rsh -n ns1 test-rc-bxf2h ~ $ ip a 1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000 link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00 inet 127.0.0.1/8 scope host lo valid_lft forever preferred_lft forever inet6 ::1/128 scope host valid_lft forever preferred_lft forever 2: eth0@if123: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1360 qdisc noqueue state UP group default link/ether 0a:58:0a:83:00:0c brd ff:ff:ff:ff:ff:ff link-netnsid 0 inet 10.131.0.12/23 brd 10.131.1.255 scope global eth0 valid_lft forever preferred_lft forever inet6 fe80::858:aff:fe83:c/64 scope link valid_lft forever preferred_lft forever 3: net1@if124: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1300 qdisc noqueue state UP group default link/ether 0a:58:0a:c8:02:03 brd ff:ff:ff:ff:ff:ff link-netnsid 0 inet 10.200.2.3/24 brd 10.200.2.255 scope global net1 valid_lft forever preferred_lft forever inet6 fe80::858:aff:fec8:203/64 scope link valid_lft forever preferred_lft forever ~ $ ping 10.200.1.3 PING 10.200.1.3 (10.200.1.3) 56(84) bytes of data. 64 bytes from 10.200.1.3: icmp_seq=1 ttl=62 time=3.20 ms 64 bytes from 10.200.1.3: icmp_seq=2 ttl=62 time=1.06 ms ^C --- 10.200.1.3 ping statistics --- 2 packets transmitted, 2 received, 0% packet loss, time 1001ms rtt min/avg/max/mdev = 1.063/2.131/3.199/1.068 ms
5. Restart all ovn pods
% oc delete pods --all -n openshift-ovn-kubernetes
pod "ovnkube-control-plane-97f479fdc-qxh2g" deleted
pod "ovnkube-control-plane-97f479fdc-shkcm" deleted
pod "ovnkube-node-b4crf" deleted
pod "ovnkube-node-k2lzs" deleted
pod "ovnkube-node-nfnhn" deleted
pod "ovnkube-node-npltt" deleted
pod "ovnkube-node-pgz4z" deleted
pod "ovnkube-node-r9qbl" deleted
% oc get pods -n openshift-ovn-kubernetes
NAME READY STATUS RESTARTS AGE
ovnkube-control-plane-97f479fdc-4cxkc 2/2 Running 0 43s
ovnkube-control-plane-97f479fdc-prpcn 2/2 Running 0 43s
ovnkube-node-g2x5q 8/8 Running 0 41s
ovnkube-node-jdpzx 8/8 Running 0 40s
ovnkube-node-jljrd 8/8 Running 0 41s
ovnkube-node-skd9g 8/8 Running 0 40s
ovnkube-node-tlkgn 8/8 Running 0 40s
ovnkube-node-v9qs2 8/8 Running 0 39s
Check pods connection in primary network in ns1 again
Actual results:
The connection was broken in primary network
% oc rsh -n ns1 test-rc-bxf2h ~ $ ping 10.200.1.3 PING 10.200.1.3 (10.200.1.3) 56(84) bytes of data. From 10.200.2.3 icmp_seq=1 Destination Host Unreachable From 10.200.2.3 icmp_seq=2 Destination Host Unreachable From 10.200.2.3 icmp_seq=3 Destination Host Unreachable
Expected results:
The connection was not broken in primary network.
Additional info:
Description of problem:
https://github.com/openshift/api/pull/1829 needs to be backported to 4.15 and 4.14. The API team asked (https://redhat-internal.slack.com/archives/CE4L0F143/p1715024118699869) to have a test before they can review and approve a backport. This bug's goal is to implement an e2e test which would use the connect timeout tuning option.
Version-Release number of selected component (if applicable):
4.17
How reproducible:
Always
Steps to Reproduce:
N/A
Actual results:
Expected results:
Additional info:
The e2e test could have been a part of the initial implementation PR (https://github.com/openshift/cluster-ingress-operator/pull/1035).
Description of problem:
This bug is created for tracking Automation of OCPBUGS-35347 (https://issues.redhat.com/browse/OCPBUGS-35347)
Version-Release number of selected component (if applicable):
4.17.0
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
This is a clone of issue OCPBUGS-31367. The following is the description of the original issue:
—
Description of problem:
Alerts that have been silenced are still seen on the Console overview page.
Version-Release number of selected component (if applicable):
4.15
How reproducible:
Steps to Reproduce:
1. For a cluster installed on version 4.15 2. Silence an alert that is firing by going to Console --> Observe --> Alerting --> Alerts 3. Check that the alert is added to the silences list: Console --> Observe --> Alerting --> Silences 4. Go back to the Console Overview page; the silenced alert is still seen there
Actual results:
The silenced alert can still be seen on the OCP overview page
Expected results:
The silenced alert should not be seen on the overview page
Additional info:
The new-in-4.16 oc adm prune renderedmachineconfigs is dry-run by default. But as of 4.16.0-rc.9, the wording can be a bit alarming:
$ oc adm prune renderedmachineconfigs
Dry run enabled - no modifications will be made. Add --confirm to remove rendered machine configs.
Error deleting rendered MachineConfig rendered-master-3fff60688940de967f8aa44e5aa0e87e: deleting rendered MachineConfig rendered-master-3fff60688940de967f8aa44e5aa0e87e failed: machineconfigs.machineconfiguration.openshift.io "rendered-master-3fff60688940de967f8aa44e5aa0e87e" is forbidden: User "wking" cannot delete resource "machineconfigs" in API group "machineconfiguration.openshift.io" at the cluster scope
Error deleting rendered MachineConfig rendered-master-c4d5b90a040ed6026ccc5af8838f7031: deleting rendered MachineConfig rendered-master-c4d5b90a040ed6026ccc5af8838f7031 failed: machineconfigs.machineconfiguration.openshift.io "rendered-master-c4d5b90a040ed6026ccc5af8838f7031" is forbidden: User "wking" cannot delete resource "machineconfigs" in API group "machineconfiguration.openshift.io" at the cluster scope
...
Those are actually dry-run requests, as you can see with --v=8:
$ oc --v=8 adm prune renderedmachineconfigs
...
I0625 10:49:36.291173    7200 request.go:1212] Request Body: {"kind":"DeleteOptions","apiVersion":"machineconfiguration.openshift.io/v1","dryRun":["All"]}
I0625 10:49:36.291209    7200 round_trippers.go:463] DELETE https://api.build02.gcp.ci.openshift.org:6443/apis/machineconfiguration.openshift.io/v1/machineconfigs/rendered-master-3fff60688940de967f8aa44e5aa0e87e
...
But Error deleting ... failed isn't explicit about it being a dry-run deletion that failed. Even with appropriate privileges:
$ oc --as system:admin adm prune renderedmachineconfigs
Dry run enabled - no modifications will be made. Add --confirm to remove rendered machine configs.
DRY RUN: Deleted rendered MachineConfig rendered-master-3fff60688940de967f8aa44e5aa0e87e
...
the output could be clearer that it was performing a dry-run deletion.
4.16 and 4.17.
Every time.
1. Install a cluster.
2. Get some outdated rendered MachineConfig by bumping something, unless your cluster has some by default.
3. Run oc adm prune renderedmachineconfigs, both with and without permission to do the actual deletion.
Wording like Error deleting ... failed and Deleted rendered... can spook folks who don't understand that it was an attempted dry-run deletion.
Soothing wording that makes it very clear that the API request was a dry-run deletion.
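For reference, the two invocations involved (a sketch based on the behavior described above):
# Default: dry run only; nothing is deleted, so the output should make that explicit
oc --as system:admin adm prune renderedmachineconfigs
# Explicitly opt in to the real deletion
oc --as system:admin adm prune renderedmachineconfigs --confirm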
This is a clone of issue OCPBUGS-41532. The following is the description of the original issue:
—
The cns-migration tool should check for supported vCenter versions before starting the migration of CNS volumes.
Description of problem:
When virtualHostedStyle is enabled together with regionEndpoint in config.imageregistry/cluster, the image registry fails to run. The error thrown is:
time="2024-04-22T14:14:31.057192227Z" level=error msg="s3aws: RequestError: send request failed\ncaused by: Get \"https://s3-fips.us-west-1.amazonaws.com/ci-ln-67zbmzk-76ef8-4n6wb-image-registry-us-west-1-xjyfbabyboc?list-type=2&max-keys=1&prefix=\": dial tcp: lookup s3-fips.us-west-1.amazonaws.com on 172.30.0.10:53: no such host" go.version="go1.20.12 X:strictfipsruntime"
Version-Release number of selected component (if applicable):
4.14.18
How reproducible:
always
Steps to Reproduce:
1. $ oc get config.imageregistry/cluster -ojsonpath="{.status.storage}" | jq
{
  "managementState": "Managed",
  "s3": {
    "bucket": "ci-ln-67zbmzk-76ef8-4n6wb-image-registry-us-west-1-xjyfbabyboc",
    "encrypt": true,
    "region": "us-west-1",
    "regionEndpoint": "https://s3-fips.us-west-1.amazonaws.com",
    "trustedCA": {
      "name": ""
    },
    "virtualHostedStyle": true
  }
}
2. Check the registry cluster operator
$ oc get co image-registry
NAME             VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
image-registry   4.15.5    True        True          True       79m     Degraded: Registry deployment has timed out progressing: ReplicaSet "image-registry-b6c58998d" has timed out progressing
Actual results:
$ oc get pods image-registry-b6c58998d-m8pnb -oyaml | yq '.spec.containers[0].env'
- name: REGISTRY_STORAGE_S3_REGIONENDPOINT
  value: https://s3-fips.us-west-1.amazonaws.com
[...]
- name: REGISTRY_STORAGE_S3_VIRTUALHOSTEDSTYLE
  value: "true"
[...]
$ oc logs image-registry-b6c58998d-m8pnb
[...]
time="2024-04-22T14:14:31.057192227Z" level=error msg="s3aws: RequestError: send request failed\ncaused by: Get \"https://s3-fips.us-west-1.amazonaws.com/ci-ln-67zbmzk-76ef8-4n6wb-image-registry-us-west-1-xjyfbabyboc?list-type=2&max-keys=1&prefix=\": dial tcp: lookup s3-fips.us-west-1.amazonaws.com on 172.30.0.10:53: no such host" go.version="go1.20.12 X:strictfipsruntime"
Expected results:
virtual hosted-style should work
Additional info:
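For reference, a sketch of the storage configuration from this report expressed as a patch; the bucket and region values are the reproducer's and would differ per cluster:
oc patch config.imageregistry/cluster --type merge -p '{
  "spec": {
    "storage": {
      "s3": {
        "bucket": "ci-ln-67zbmzk-76ef8-4n6wb-image-registry-us-west-1-xjyfbabyboc",
        "region": "us-west-1",
        "regionEndpoint": "https://s3-fips.us-west-1.amazonaws.com",
        "virtualHostedStyle": true
      }
    }
  }
}'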
Description of problem:
Additional IBM Cloud Services require the ability to override their service endpoints within the Installer. The list of available services provided in openshift/api must be expanded to account for this.
Version-Release number of selected component (if applicable):
4.17
How reproducible:
100%
Steps to Reproduce:
1. Create an install-config for IBM Cloud
2. Define serviceEndpoints, including one for "resourceCatalog"
3. Attempt to run IPI
Actual results:
Expected results:
Successful IPI installation, using additional IBM Cloud Service endpoint overrides.
Additional info:
IBM Cloud is working on multiple patches to incorporate these additional services. The full list is still a work in progress, but currently includes:
- Resource (Global) Catalog endpoint
- COS Config endpoint
Changes are currently required in the following components. Separate Jiras may be opened (if required) to track their progress:
- openshift/api
- openshift-installer
- openshift/cluster-image-registry-operator
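For illustration only, the install-config stanza this work targets might look like the following; the service names come from the list above, while the URLs are placeholders rather than confirmed endpoints:
cat <<'EOF' > ibmcloud-service-endpoints.snippet.yaml
platform:
  ibmcloud:
    region: us-south
    serviceEndpoints:
    - name: resourceCatalog
      url: https://<private-global-catalog-endpoint>
    - name: cosConfig
      url: https://<private-cos-config-endpoint>
EOF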
Please review the following PR: https://github.com/openshift/csi-operator/pull/242
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
When running must-gather, a DaemonSet is created to collect performance related information for nodes in a cluster. If a node is tainted (for example with well defined OpenShift taints for Infra nodes, ODF nodes, master nodes etc), then the DaemonSet does not create Pods on these nodes and the information is not collected.
Version-Release number of selected component (if applicable):
4.14.z
How reproducible:
Reproducible
Steps to Reproduce:
1. Taint a node in the cluster with a custom taint, e.g. "oc adm taint node <node_name> node-role.kubernetes.io/infra=reserved:NoSchedule node-role.kubernetes.io/infra=reserved:NoExecute". Ensure at least one node is not tainted.
2. Run `oc adm must-gather` to generate a report on the local filesystem
Actual results:
The performance stats collected under directory <must_gather_dir>/nodes/ only contains results for nodes without taints.
Expected results:
The performance stats collected under directory <must_gather_dir>/nodes/ should contain entries for all nodes in the cluster.
Additional info:
This issue has been identified using the Performance Profile Creator. This tool requires the output of must-gather as its input (as described in the instructions here: https://docs.openshift.com/container-platform/4.14/scalability_and_performance/cnf-create-performance-profiles.html#running-the-performance-profile-profile-cluster-using-podman_cnf-create-performance-profiles). When following this guide, the missing performance information for tainted nodes results in the error "failed to load node's worker's GHW snapshot: can't obtain the path: <node_name>" when running the tool in discovery mode.
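A sketch of the kind of change that would address this: the collector DaemonSet's pod template needs a blanket toleration so its pods also land on tainted nodes. The DaemonSet name and namespace below are placeholders, since must-gather creates the DaemonSet dynamically:
oc -n <must-gather-namespace> patch daemonset <performance-collector> --type strategic -p '
spec:
  template:
    spec:
      tolerations:
      - operator: Exists   # tolerate every taint, including NoSchedule/NoExecute
'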
Please review the following PR: https://github.com/openshift/bond-cni/pull/64
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
The ACM perf/scale hub OCP has 3 baremetal nodes, each with 480GB for the installation disk. The metal3 pod uses too much disk space for logs, causing disk pressure on the node and pod evictions, which makes ACM stop provisioning clusters. Below is the log size of the metal3 pods:
# du -h -d 1 /sysroot/ostree/deploy/rhcos/var/log/pods/openshift-machine-api_metal3-9df7c7576-9t7dd_7c72c6d6-168d-4c8e-a3c3-3ce8c0518b83
4.0K    /sysroot/ostree/deploy/rhcos/var/log/pods/openshift-machine-api_metal3-9df7c7576-9t7dd_7c72c6d6-168d-4c8e-a3c3-3ce8c0518b83/machine-os-images
276M    /sysroot/ostree/deploy/rhcos/var/log/pods/openshift-machine-api_metal3-9df7c7576-9t7dd_7c72c6d6-168d-4c8e-a3c3-3ce8c0518b83/metal3-httpd
181M    /sysroot/ostree/deploy/rhcos/var/log/pods/openshift-machine-api_metal3-9df7c7576-9t7dd_7c72c6d6-168d-4c8e-a3c3-3ce8c0518b83/metal3-ironic
384G    /sysroot/ostree/deploy/rhcos/var/log/pods/openshift-machine-api_metal3-9df7c7576-9t7dd_7c72c6d6-168d-4c8e-a3c3-3ce8c0518b83/metal3-ramdisk-logs
77M     /sysroot/ostree/deploy/rhcos/var/log/pods/openshift-machine-api_metal3-9df7c7576-9t7dd_7c72c6d6-168d-4c8e-a3c3-3ce8c0518b83/metal3-ironic-inspector
385G    /sysroot/ostree/deploy/rhcos/var/log/pods/openshift-machine-api_metal3-9df7c7576-9t7dd_7c72c6d6-168d-4c8e-a3c3-3ce8c0518b83
# ls -l -h /sysroot/ostree/deploy/rhcos/var/log/pods/openshift-machine-api_metal3-9df7c7576-9t7dd_7c72c6d6-168d-4c8e-a3c3-3ce8c0518b83/metal3-ramdisk-logs
total 384G
-rw-------. 1 root root 203G Jun 10 12:44 0.log
-rw-r--r--. 1 root root 6.5G Jun 10 09:05 0.log.20240610-084807.gz
-rw-r--r--. 1 root root 8.1G Jun 10 09:27 0.log.20240610-090606.gz
-rw-------. 1 root root 167G Jun 10 09:27 0.log.20240610-092755
The logs are too large to attach. Please contact me if you need access to the cluster to check.
Version-Release number of selected component (if applicable):
The issue occurs on 4.16.0-rc4; 4.16.0-rc3 does not have the issue.
How reproducible:
Steps to Reproduce:
1. Install the latest ACM 2.11.0 build on OCP 4.16.0-rc4 and deploy 3500 SNOs on baremetal hosts
2.
3.
Actual results:
ACM stops deploying the remaining SNOs after 1913 SNOs are deployed because the ACM pods are being evicted.
Expected results:
3500 SNOs are deployed.
Additional info:
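A possible mitigation sketch, not the actual fix for the ramdisk log volume: cap per-container log size and rotation on the affected pool with a KubeletConfig. The values and pool selector below are illustrative assumptions.
cat <<'EOF' | oc apply -f -
apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: cap-container-log-size
spec:
  machineConfigPoolSelector:
    matchLabels:
      pools.operator.machineconfiguration.openshift.io/master: ""
  kubeletConfig:
    containerLogMaxSize: 50Mi
    containerLogMaxFiles: 3
EOF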
To reduce QE load, we've decided to block up the hole drilled in OCPBUGS-24535. We might not want a pure revert, if some of the changes are helpful (e.g. more helpful error messages).
We also want to drop the oc adm upgrade rollback subcommand which was the client-side tooling associated with the OCPBUGS-24535 hole.
Both 4.16 and 4.17 currently have the rollback subcommand and associated CVO-side hole.
Every time.
Try to perform the rollbacks that OCPBUGS-24535 allowed.
They work, as verified in OCPBUGS-24535.
They stop working, with reasonable ClusterVersion conditions explaining that even those rollback requests will not be accepted.
Description of problem:
Since 4.16.0, pods with memory limits tend to OOM very frequently when writing files larger than the memory limit to a PVC.
Version-Release number of selected component (if applicable):
4.16.0-rc.4
How reproducible:
100% on certain types of storage (AWS FSx, certain LVMS setups, see additional info)
Steps to Reproduce:
1. Create pod/pvc that writes a file larger than the container memory limit (attached example) 2. 3.
Actual results:
OOMKilled
Expected results:
Success
Additional info:
Reproducer in OpenShift terms: https://gist.github.com/akalenyu/949200f48ec89c42429ddb177a2a4dee
The following is relevant for eliminating the OpenShift layer from the issue. For simplicity, I will focus on a BM setup that produces this with LVM storage. This is also reproducible on AWS clusters with NFS-backed NetApp ONTAP FSx.
Further reduced to exclude the OpenShift layer, LVM on a separate (non-root) disk:
Prepare the disk:
lvcreate -T vg1/thin-pool-1 -V 10G -n oom-lv
mkfs.ext4 /dev/vg1/oom-lv
mkdir /mnt/oom-lv
mount /dev/vg1/oom-lv /mnt/oom-lv
Run the container:
podman run -m 600m --mount type=bind,source=/mnt/oom-lv,target=/disk --rm -it quay.io/centos/centos:stream9 bash
[root@2ebe895371d2 /]# curl https://cloud.centos.org/centos/9-stream/x86_64/images/CentOS-Stream-GenericCloud-x86_64-9-20240527.0.x86_64.qcow2 -o /disk/temp
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
 47 1157M   47  550M    0     0   111M      0  0:00:10  0:00:04  0:00:06  111M
Killed
(Notice the process gets killed; I don't think podman ever whacks the whole container over this, though.)
The same process on the same hardware on a 4.15 node (RHEL 9.2) does not produce an OOM (vs 4.16, which is RHEL 9.4).
For completeness, some details about the setup behind the LVM pool, though I believe it should not impact the decision about whether this is an issue:
sh-5.1# pvdisplay
  --- Physical volume ---
  PV Name               /dev/sdb
  VG Name               vg1
  PV Size               446.62 GiB / not usable 4.00 MiB
  Allocatable           yes
  PE Size               4.00 MiB
  Total PE              114335
  Free PE               11434
  Allocated PE          102901
  PV UUID               <UUID>
Hardware: SSD (INTEL SSDSC2KG480G8R) behind a RAID 0 of a PERC H330 Mini controller.
At the very least, this seems like a change in behavior, but honestly I am leaning towards an outright bug.
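Since the gist is referenced but not inlined above, here is a minimal in-cluster sketch of the same pattern; it assumes a bound PVC named oom-pvc on the affected storage, and the image, limit, and file size are illustrative:
cat <<'EOF' | oc apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: oom-writer
spec:
  restartPolicy: Never
  containers:
  - name: writer
    image: quay.io/centos/centos:stream9
    # Write a 2 GiB file through the page cache, well above the 600Mi limit
    command: ["bash", "-c", "dd if=/dev/zero of=/disk/bigfile bs=1M count=2048 && sync"]
    resources:
      limits:
        memory: 600Mi
    volumeMounts:
    - name: disk
      mountPath: /disk
  volumes:
  - name: disk
    persistentVolumeClaim:
      claimName: oom-pvc
EOF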
Please review the following PR: https://github.com/openshift/machine-api-provider-azure/pull/108
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Recent metal-ipi serial jobs are taking a lot longer than they previously had been,
e.g.
https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift-metal3_dev-scripts/1668/pull-ci-openshift-metal3-dev-scripts-master-e2e-metal-ipi-serial-ipv4/1808978586380537856
https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/openshift-metal3_dev-scripts/1668/pull-ci-openshift-metal3-dev-scripts-master-e2e-metal-ipi-serial-ipv4/1808978586380537856/
sometimes the tests are timing out after 3 hours
They seem to be spending a lot of the time in these 3 etcd test (40 minutes in total)
passed: (10m13s) 2024-07-05T00:01:50 "[sig-etcd][OCPFeatureGate:HardwareSpeed][Serial] etcd is able to set the hardware speed to \"\" [Timeout:30m][apigroup:machine.openshift.io] [Suite:openshift/conformance/serial]"
passed: (20m53s) 2024-07-05T01:12:19 "[sig-etcd][OCPFeatureGate:HardwareSpeed][Serial] etcd is able to set the hardware speed to Slower [Timeout:30m][apigroup:machine.openshift.io] [Suite:openshift/conformance/serial]"
passed: (10m13s) 2024-07-05T01:25:33 "[sig-etcd][OCPFeatureGate:HardwareSpeed][Serial] etcd is able to set the hardware speed to Standard [Timeout:30m][apigroup:machine.openshift.io] [Suite:openshift/conformance/serial]"
This is a clone of issue OCPBUGS-37945. The following is the description of the original issue:
—
Description of problem:
openshift-install create cluster leads to the following error when a vSphere standard port group is used:
ERROR failed to fetch Cluster: failed to generate asset "Cluster": failed to create cluster: failed during pre-provisioning: unable to initialize folders and templates: failed to import ova: failed to lease wait: Invalid configuration for device '0'.
Version-Release number of selected component (if applicable):
How reproducible:
always
Steps to Reproduce:
1. openshift-install create cluster
2. Choose vSphere
3. Fill in the blanks
4. Have a standard port group
Actual results:
error
Expected results:
cluster creation
Additional info:
Please review the following PR: https://github.com/openshift/cluster-baremetal-operator/pull/419
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
CPO unit tests fail
Version-Release number of selected component (if applicable):
4.17
How reproducible:
https://github.com/openshift/cloud-provider-openstack/pull/282
Description of problem:
documentationBaseURL still points to 4.16
Version-Release number of selected component (if applicable):
4.17.0-0.nightly-2024-06-20-005211
How reproducible:
Always
Steps to Reproduce:
1. Check documentationBaseURL on a 4.17 cluster
$ oc get cm console-config -n openshift-console -o yaml | grep documentation
      documentationBaseURL: https://access.redhat.com/documentation/en-us/openshift_container_platform/4.16/
2.
3.
Actual results:
documentationBaseURL still links to 4.16
Expected results:
documentationBaseURL should link to 4.17
Additional info:
I haven't traced out the trigger-pathway yet, but 4.13 and 4.16 machine-config controllers seem to react to Node status updates with makeMasterNodeUnSchedulable calls that result in Kubernetes API PATCH calls, even when the patch being requested is empty. This creates unnecessary API volume, loading the control plane and resulting in distracting Kube-API audit log activity.
Seen in 4.13.34, and reproduced in 4.16 CI builds, so likely all intervening versions. Possibly all versions.
Every time.
mco#4277 reproduces this by reverting OCPBUGS-29713 to get frequent Node status updates. And then in presubmit CI, e2e-aws-ovn > Artifacts > ... > gather-extra pod logs :
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/openshift_machine-config-operator/4277/pull-ci-openshift-machine-config-operator-master-e2e-aws-ovn/1770970949110206464/artifacts/e2e-aws-ovn/gather-extra/artifacts/pods/openshift-machine-config-operator_machine-config-controller-bdcdf554f-ct5hh_machine-config-controller.log | tail
and:
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/openshift_machine-config-operator/4277/pull-ci-openshift-machine-config-operator-master-e2e-aws-ovn/1770970949110206464/artifacts/e2e-aws-ovn/gather-extra/artifacts/pods/openshift-machine-config-operator_machine-config-controller-bdcdf554f-ct5hh_machine-config-controller.log | grep -o 'makeMasterNodeUnSchedulable\|UpdateNodeRetry' | sort | uniq -c
give...
I0322 02:10:13.027938       1 node_controller.go:310] makeMasterNodeUnSchedulable ip-10-0-34-167.us-east-2.compute.internal
I0322 02:10:13.029671       1 kubeutils.go:48] UpdateNodeRetry patch ip-10-0-34-167.us-east-2.compute.internal: {}
I0322 02:10:13.669568       1 node_controller.go:310] makeMasterNodeUnSchedulable ip-10-0-84-206.us-east-2.compute.internal
I0322 02:10:13.671023       1 kubeutils.go:48] UpdateNodeRetry patch ip-10-0-84-206.us-east-2.compute.internal: {}
I0322 02:10:21.095260       1 node_controller.go:310] makeMasterNodeUnSchedulable ip-10-0-114-0.us-east-2.compute.internal
I0322 02:10:21.098410       1 kubeutils.go:48] UpdateNodeRetry patch ip-10-0-114-0.us-east-2.compute.internal: {}
I0322 02:10:23.215168       1 node_controller.go:310] makeMasterNodeUnSchedulable ip-10-0-34-167.us-east-2.compute.internal
I0322 02:10:23.219672       1 kubeutils.go:48] UpdateNodeRetry patch ip-10-0-34-167.us-east-2.compute.internal: {}
I0322 02:10:24.049456       1 node_controller.go:310] makeMasterNodeUnSchedulable ip-10-0-84-206.us-east-2.compute.internal
I0322 02:10:24.050939       1 kubeutils.go:48] UpdateNodeRetry patch ip-10-0-84-206.us-east-2.compute.internal: {}
showing frequent, no-op patch attempts and:
   1408 makeMasterNodeUnSchedulable
   1414 UpdateNodeRetry
showing many attempts over the life of the MCC container.
No need to PATCH on makeMasterNodeUnSchedulable unless the generated patch content contained more than a no-op patch.
setDesiredMachineConfigAnnotation has a "DesiredMachineConfigAnnotationKey already matches what I want" no-op out here
setUpdateInProgressTaint seems to lack a similar guard here, and it should probably grow a check for "NodeUpdateInProgressTaint is already present". The same goes for removeUpdateInProgressTaint. But the hot loop in response to these Node status updates is makeMasterNodeUnSchedulable calling UpdateNodeRetry, and UpdateNodeRetry also lacks this kind of "no need to PATCH when no changes are requested" logic.
For this bug, we should:
I0516 19:40:24.080597 1 controller.go:156] mbooth-psi-ph2q7-worker-0-9z9nn: reconciling Machine I0516 19:40:24.113866 1 controller.go:200] mbooth-psi-ph2q7-worker-0-9z9nn: reconciling machine triggers delete I0516 19:40:32.487925 1 controller.go:115] "msg"="Observed a panic in reconciler: runtime error: invalid memory address or nil pointer dereference" "controller"="machine-controller" "name"="mbooth-psi-ph2q7-worker-0-9z9nn" "namespace"="openshift-machine-api" "object"={"name":"mbooth-psi-ph2q7-worker-0-9z9nn","namespace":"openshift-machine-api"} "reconcileID"="f477312c-dd62-49b2-ad08-28f48c506c9a" panic: runtime error: invalid memory address or nil pointer dereference [recovered] panic: runtime error: invalid memory address or nil pointer dereference [signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x242a275] goroutine 317 [running]: sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile.func1() /go/src/sigs.k8s.io/cluster-api-provider-openstack/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:116 +0x1e5 panic({0x29cfb00?, 0x40f1d50?}) /usr/lib/golang/src/runtime/panic.go:914 +0x21f sigs.k8s.io/cluster-api-provider-openstack/pkg/cloud/services/compute.(*Service).constructPorts(0x3056b80?, 0xc00074d3d0, 0xc0004fe100) /go/src/sigs.k8s.io/cluster-api-provider-openstack/vendor/sigs.k8s.io/cluster-api-provider-openstack/pkg/cloud/services/compute/instance.go:188 +0xb5 sigs.k8s.io/cluster-api-provider-openstack/pkg/cloud/services/compute.(*Service).DeleteInstance(0xc00074d388, 0xc000c61300?, {0x3038ae8, 0xc0008b7440}, 0xc00097e2a0, 0xc0004fe100) /go/src/sigs.k8s.io/cluster-api-provider-openstack/vendor/sigs.k8s.io/cluster-api-provider-openstack/pkg/cloud/services/compute/instance.go:678 +0x42d github.com/openshift/machine-api-provider-openstack/pkg/machine.(*OpenstackClient).Delete(0xc0001f2380, {0x304f708?, 0xc000c6df80?}, 0xc0008b7440) /go/src/sigs.k8s.io/cluster-api-provider-openstack/pkg/machine/actuator.go:341 +0x305 github.com/openshift/machine-api-operator/pkg/controller/machine.(*ReconcileMachine).Reconcile(0xc00045de50, {0x304f708, 0xc000c6df80}, {{{0xc00066c7f8?, 0x0?}, {0xc000dce980?, 0xc00074dd48?}}}) /go/src/sigs.k8s.io/cluster-api-provider-openstack/vendor/github.com/openshift/machine-api-operator/pkg/controller/machine/controller.go:216 +0x1cfe sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile(0x3052e08?, {0x304f708?, 0xc000c6df80?}, {{{0xc00066c7f8?, 0xb?}, {0xc000dce980?, 0x0?}}}) /go/src/sigs.k8s.io/cluster-api-provider-openstack/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:119 +0xb7 sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc0004eb900, {0x304f740, 0xc00045c500}, {0x2ac0340?, 0xc0001480c0?}) /go/src/sigs.k8s.io/cluster-api-provider-openstack/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:316 +0x3cc sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc0004eb900, {0x304f740, 0xc00045c500}) /go/src/sigs.k8s.io/cluster-api-provider-openstack/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:266 +0x1c9 sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2() /go/src/sigs.k8s.io/cluster-api-provider-openstack/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:227 +0x79 created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2 in 
goroutine 269 /go/src/sigs.k8s.io/cluster-api-provider-openstack/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:223 +0x565
> kc get clusterversion NAME VERSION AVAILABLE PROGRESSING SINCE STATUS version 4.16.0-ec.6 True False 7d3h Cluster version is 4.16.0-ec.6
> kc -n openshift-machine-api get machines.m mbooth-psi-ph2q7-worker-0-9z9nn -o yaml apiVersion: machine.openshift.io/v1beta1 kind: Machine metadata: annotations: machine.openshift.io/instance-state: ERROR openstack-resourceId: dc08c2a2-cbda-4892-a06b-320d02ec0c6c creationTimestamp: "2024-05-16T16:53:16Z" deletionGracePeriodSeconds: 0 deletionTimestamp: "2024-05-16T19:23:44Z" finalizers: - machine.machine.openshift.io generateName: mbooth-psi-ph2q7-worker-0- generation: 3 labels: machine.openshift.io/cluster-api-cluster: mbooth-psi-ph2q7 machine.openshift.io/cluster-api-machine-role: worker machine.openshift.io/cluster-api-machine-type: worker machine.openshift.io/cluster-api-machineset: mbooth-psi-ph2q7-worker-0 machine.openshift.io/instance-type: ci.m1.xlarge machine.openshift.io/region: regionOne machine.openshift.io/zone: "" name: mbooth-psi-ph2q7-worker-0-9z9nn namespace: openshift-machine-api ownerReferences: - apiVersion: machine.openshift.io/v1beta1 blockOwnerDeletion: true controller: true kind: MachineSet name: mbooth-psi-ph2q7-worker-0 uid: f715dba2-b0b2-4399-9ab6-19daf6407bd7 resourceVersion: "8391649" uid: 6d1ad181-5633-43eb-9b19-7c73c86045c3 spec: lifecycleHooks: {} metadata: {} providerID: openstack:///dc08c2a2-cbda-4892-a06b-320d02ec0c6c providerSpec: value: apiVersion: machine.openshift.io/v1alpha1 cloudName: openstack cloudsSecret: name: openstack-cloud-credentials namespace: openshift-machine-api flavor: ci.m1.xlarge image: "" kind: OpenstackProviderSpec metadata: creationTimestamp: null networks: - filter: {} subnets: - filter: tags: openshiftClusterID=mbooth-psi-ph2q7 rootVolume: diskSize: 50 sourceUUID: rhcos-4.16 volumeType: tripleo securityGroups: - filter: {} name: mbooth-psi-ph2q7-worker serverGroupName: mbooth-psi-ph2q7-worker serverMetadata: Name: mbooth-psi-ph2q7-worker openshiftClusterID: mbooth-psi-ph2q7 tags: - openshiftClusterID=mbooth-psi-ph2q7 trunk: true userDataSecret: name: worker-user-data status: addresses: - address: mbooth-psi-ph2q7-worker-0-9z9nn type: Hostname - address: mbooth-psi-ph2q7-worker-0-9z9nn type: InternalDNS conditions: - lastTransitionTime: "2024-05-16T16:56:05Z" status: "True" type: Drainable - lastTransitionTime: "2024-05-16T19:24:26Z" message: Node drain skipped status: "True" type: Drained - lastTransitionTime: "2024-05-16T17:14:59Z" status: "True" type: InstanceExists - lastTransitionTime: "2024-05-16T16:56:05Z" status: "True" type: Terminable lastUpdated: "2024-05-16T19:23:52Z" phase: Deleting
Description of problem: OCP doesn't resume from "hibernation" (shutdown/restart of cloud instances).
NB: This is not related to certs.
Version-Release number of selected component (if applicable): 4.16 nightlies, at least 4.16.0-0.nightly-2024-05-14-095225 through 4.16.0-0.nightly-2024-05-21-043355
How reproducible: 100%
Steps to Reproduce:
1. Install 4.16 nightly on AWS. (Other platforms may be affected, don't know.)
2. Shut down all instances. (I've done this via hive hibernation; Vadim Rutkovsky has done it via cloud console.)
3. Start instances. (Ditto.)
Actual results: OCP doesn't start. Per Vadim:
"kubelet says host IP unknown; known addresses: [] so etcd can't start."
Expected results: OCP starts normally.
Additional info: We originally thought this was related to OCPBUGS-30860, but reproduced with nightlies containing the updated AMIs.
Description of problem:
When building ODF Console Plugin, webpack issues tons of PatternFly dynamic module related warnings like:
<w> No dynamic module found for Button in @patternfly/react-core
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. git clone https://github.com/red-hat-storage/odf-console.git
2. cd odf-console
3. yarn install && yarn build-mco
Actual results: tons of warnings about missing dynamic modules.
Expected results: no warnings about missing dynamic modules.
The cluster-ingress-operator repository vendors controller-runtime v0.17.3, which uses Kubernetes 1.29 packages. The cluster-ingress-operator repository also vendors k8s.io/client-go v0.29.0. However, OpenShift 4.17 is based on Kubernetes 1.30.
4.17.
Always.
Check https://github.com/openshift/cluster-ingress-operator/blob/release-4.17/go.mod.
The sigs.k8s.io/controller-runtime package is at v0.17.3, and the k8s.io/client-go package is at v0.29.0.
The sigs.k8s.io/controller-runtime package is at v0.18.0 or newer, and k8s.io/client-go is at v0.30.0 or newer. The k8s.io/client-go package version should match other k8s.io packages, such as k8s.io/api.
https://github.com/openshift/cluster-ingress-operator/pull/1046 already bumped the k8s.io/* packages other than k8s.io/client-go to v0.29.0. Presumably k8s.io/client-go was missed by accident because of a replace rule in go.mod. In general, k8s.io/client-go should be at the same version as k8s.io/api and other k8s.io/* packages, and the controller-runtime package should be bumped to a version that uses the same minor version of the k8s.io/* packages.
The controller-runtime v0.18 release includes some breaking changes; see the release notes at https://github.com/kubernetes-sigs/controller-runtime/releases/tag/v0.18.0.
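A sketch of the bump itself; the patch versions below are illustrative, and the controller-runtime release should be the one built against k8s.io/* v0.30.x:
go get sigs.k8s.io/controller-runtime@v0.18.4
go get k8s.io/api@v0.30.3 k8s.io/apimachinery@v0.30.3 k8s.io/client-go@v0.30.3
go mod tidy
go mod vendor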
Our OCP Dockerfile currently uses the build target. However, this now also builds the React frontend and therefore requires npm. We already build the frontend during mirroring.
Switch our Dockerfile to use the common-build target. This should enable the bump to 0.27, tracked through this issue as well.
Description of problem:
When selecting a runtime icon while deploying an image (e.g., when uploading a JAR file or importing from a container registry), the default icon is not checked on the dropdown menu. However, when I select the same icon from the dropdown menu, it is now checked. But, it should have already been checked when I first opened the dropdown menu.
Version-Release number of selected component (if applicable):
4.17
How reproducible:
Always
Steps to Reproduce:
1. In the developer perspective, on the page sidebar, select "+Add", and then select either "Container images" or "Upload JAR file" 2. On the form body, open the runtime icon dropdown menu 3. Scroll down until you see the default icon in the dropdown menu ("openshift" and "java" respectively)
Actual results:
It is not checked
Expected results:
The icon is checked in the dropdown menu
Additional info:
Description of problem:
The issue was found while QE was testing the minimal firewall list required by an AWS installation (https://docs.openshift.com/container-platform/4.15/installing/install_config/configuring-firewall.html) for 4.16. The way we verify this is by setting all the URLs listed in the doc in the whitelist of a proxy server[1] and adding the proxy to install-config.yaml, so addresses outside of the doc are rejected by the proxy server during cluster installation.
[1] https://steps.ci.openshift.org/chain/proxy-whitelist-aws
We're seeing this error on the masters' console:
```
[  344.982244] ignition[782]: GET https://api-int.ci-op-b2hcg02h-ce587.qe.devcluster.openshift.com:22623/config/master: attempt #73
[  344.985074] ignition[782]: GET error: Get "https://api-int.ci-op-b2hcg02h-ce587.qe.devcluster.openshift.com:22623/config/master": Forbidden
```
And the deny log from the proxy server:
```
1717653185.468 0 10.0.85.91 TCP_DENIED/403 2252 CONNECT api-int.ci-op-b2hcg02h-ce587.qe.devcluster.openshift.com:22623 - HIER_NONE/- text/html
```
So it looks like the master is using the proxy to reach the MCS address, and the internal API domain, api-int.ci-op-b2hcg02h-ce587.qe.devcluster.openshift.com, is not in the whitelist of the proxy, so the request is denied by the proxy. But such an internal API address should already be in the noProxy list, so the master shouldn't use the proxy to send the internal request.
This is proxy info collected from another cluster; api-int.<cluter_domain> is added to the no-proxy list by default:
```
[root@ip-10-0-11-89 ~]# cat /etc/profile.d/proxy.sh
export HTTP_PROXY="http://ec2-3-16-83-95.us-east-2.compute.amazonaws.com:3128"
export HTTPS_PROXY="http://ec2-3-16-83-95.us-east-2.compute.amazonaws.com:3128"
export NO_PROXY=".cluster.local,.svc,.us-east-2.compute.internal,10.0.0.0/16,10.128.0.0/14,127.0.0.1,169.254.169.254,172.30.0.0/16,api-int.gpei-dis3.qe.devcluster.openshift.com,localhost,test.no-proxy.com"
```
Version-Release number of selected component (if applicable):
registry.ci.openshift.org/ocp/release:4.16.0-0.nightly-2024-06-02-202327
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
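Useful checks for this scenario; the first two are standard cluster-proxy queries, while the node-side file path is an assumption about where the rendered proxy settings land on the host:
# The effective proxy status, including the generated noProxy list, which should
# already contain api-int.<cluster_domain>
oc get proxy/cluster -o jsonpath='{.status.noProxy}{"\n"}'
oc get proxy/cluster -o jsonpath='{.status.httpsProxy}{"\n"}'
# On a master (e.g. via oc debug node/<name>), inspect what the node actually uses
cat /etc/mco/proxy.env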
Description of problem:
If infrastructure or machine provisioning is slow, the installer may wait several minutes before declaring provisioning successful due to the exponential backoff. For instance, if DNS resolution from load balancers is slow to propagate and we
Version-Release number of selected component (if applicable):
How reproducible:
Sometimes, it depends on provisioning being slow.
Steps to Reproduce:
1. Provision a cluster in an environment that has slow dns resolution (unclear how to set this up) 2. 3.
Actual results:
The installer will only check for infrastructure or machine readiness at intervals of several minutes after a certain threshold (say 10 minutes).
Expected results:
Installer should just check regularly, e.g. every 15 seconds.
Additional info:
It may not be possible to definitively test this. We may want to just check ci logs for an improvement in provisioning time and check for lack of regressions.
This is a clone of issue OCPBUGS-39081. The following is the description of the original issue:
—
If the network to the bootstrap VM is slow, the extract-machine-os.service can time out (after 180s). If this happens, it will be restarted but services that depend on it (like ironic) will never be started even once it succeeds. systemd added support for Restart:on-failure for Type:oneshot services, but they still don't behave the same way as other types of services.
This can be simulated in dev-scripts by doing:
sudo tc qdisc add dev ostestbm root netem rate 33Mbit
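A hedged workaround sketch for slow links, not the shipped fix: extend the one-shot service's start timeout with a systemd drop-in on the bootstrap VM; the 900-second value is arbitrary.
sudo mkdir -p /etc/systemd/system/extract-machine-os.service.d
cat <<'EOF' | sudo tee /etc/systemd/system/extract-machine-os.service.d/10-timeout.conf
[Service]
TimeoutStartSec=900
EOF
sudo systemctl daemon-reload
sudo systemctl restart extract-machine-os.service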
Description of problem:
VIPs are on a different network than the machine network on a 4.14 cluster
failing cluster: 4.14
Infrastructure
--------------
Platform: VSphere
Install Type: IPI
apiServerInternalIP: 10.8.0.83
apiServerInternalIPs: 10.8.0.83
ingressIP: 10.8.0.84
ingressIPs: 10.8.0.84
All internal IP addresses of all nodes match the Machine Network.
Machine Network: 10.8.42.0/23
Node name IP Address Matches CIDR
..............................................................................................................
sv1-prd-ocp-int-bn8ln-master-0 10.8.42.24 YES
sv1-prd-ocp-int-bn8ln-master-1 10.8.42.35 YES
sv1-prd-ocp-int-bn8ln-master-2 10.8.42.36 YES
sv1-prd-ocp-int-bn8ln-worker-0-5rbwr 10.8.42.32 YES
sv1-prd-ocp-int-bn8ln-worker-0-h7fq7 10.8.42.49 YES
logs from one of the haproxy pods
oc logs -n openshift-vsphere-infra haproxy-sv1-prd-ocp-int-bn8ln-master-0 haproxy-monitor
.....
2024-04-02T18:48:57.534824711Z time="2024-04-02T18:48:57Z" level=info msg="An error occurred while trying to read master nodes details from api-vip:kube-apiserver: failed find a interface for the ip 10.8.0.83"
2024-04-02T18:48:57.534849744Z time="2024-04-02T18:48:57Z" level=info msg="Trying to read master nodes details from localhost:kube-apiserver"
2024-04-02T18:48:57.544507441Z time="2024-04-02T18:48:57Z" level=error msg="Could not retrieve subnet for IP 10.8.0.83" err="failed find a interface for the ip 10.8.0.83"
2024-04-02T18:48:57.544507441Z time="2024-04-02T18:48:57Z" level=error msg="Failed to retrieve API members information" kubeconfigPath=/var/lib/kubelet/kubeconfig
2024-04-02T18:48:57.544507441Z time="2024-04-02T18:48:57Z" level=info msg="GetLBConfig failed, sleep half of interval and retry" kubeconfigPath=/var/lib/kubelet/kubeconfig
2024-04-02T18:49:00.572652095Z time="2024-04-02T18:49:00Z" level=error msg="Could not retrieve subnet for IP 10.8.0.83" err="failed find a interface for the ip 10.8.0.83"
There is a kcs that addresses this:
https://access.redhat.com/solutions/7037425
Howerver, this same configuration works in production on 4.12
working cluster:
Infrastructure
--------------
Platform: VSphere
Install Type: IPI
apiServerInternalIP: 10.8.0.73
apiServerInternalIPs: 10.8.0.73
ingressIP: 10.8.0.72
ingressIPs: 10.8.0.72
All internal IP addresses of all nodes match the Machine Network.
Machine Network: 10.8.38.0/23
Node name IP Address Matches CIDR
..............................................................................................................
sb1-prd-ocp-int-qls2m-cp4d-4875s 10.8.38.29 YES
sb1-prd-ocp-int-qls2m-cp4d-phczw 10.8.38.19 YES
sb1-prd-ocp-int-qls2m-cp4d-ql5sj 10.8.38.43 YES
sb1-prd-ocp-int-qls2m-cp4d-svzl7 10.8.38.27 YES
sb1-prd-ocp-int-qls2m-cp4d-x286s 10.8.38.18 YES
sb1-prd-ocp-int-qls2m-cp4d-xk48m 10.8.38.40 YES
sb1-prd-ocp-int-qls2m-master-0 10.8.38.25 YES
sb1-prd-ocp-int-qls2m-master-1 10.8.38.24 YES
sb1-prd-ocp-int-qls2m-master-2 10.8.38.30 YES
sb1-prd-ocp-int-qls2m-worker-njzdx 10.8.38.15 YES
sb1-prd-ocp-int-qls2m-worker-rhqn5 10.8.38.39 YES
logs from one of the haproxy pods
2023-08-18T21:12:19.730010034Z time="2023-08-18T21:12:19Z" level=info msg="API is not reachable through HAProxy"
2023-08-18T21:12:19.755357706Z time="2023-08-18T21:12:19Z" level=info msg="Config change detected" configChangeCtr=1 curConfig="{6443 9445 29445 [
{sb1-prd-ocp-int-qls2m-master-0 10.8.38.25 6443} {sb1-prd-ocp-int-qls2m-master-2 10.8.38.30 6443}] }"
{sb1-prd-ocp-int-qls2m-master-2 10.8.38.30 6443}
] }"
The data is being redirected
found this in the sos report: sos_commands/firewall_tables/
nft_-a_list_ruleset
table ip nat { # handle 2
  chain PREROUTING
  chain INPUT { # handle 2
    type nat hook input priority 100; policy accept;
  }
  chain POSTROUTING { # handle 3
    type nat hook postrouting priority srcnat; policy accept;
    counter packets 245475292 bytes 16221809463 jump OVN-KUBE-EGRESS-SVC # handle 25
    oifname "ovn-k8s-mp0" counter packets 58115015 bytes 4184247096 jump OVN-KUBE-SNAT-MGMTPORT # handle 16
    counter packets 187360548 bytes 12037581317 jump KUBE-POSTROUTING # handle 10
  }
  chain OUTPUT { # handle 4
    type nat hook output priority -100; policy accept;
    oifname "lo" meta l4proto tcp ip daddr 10.8.0.73 tcp dport 6443 counter packets 0 bytes 0 redirect to :9445 # handle 67
    counter packets 245122162 bytes 16200621351 jump OVN-KUBE-EXTERNALIP # handle 29
    counter packets 245122163 bytes 16200621411 jump OVN-KUBE-NODEPORT # handle 27
    counter packets 245122166 bytes 16200621591 jump OVN-KUBE-ITP # handle 24
  }
... many more lines ...
This code was not added by the customer
None of the redirect statements are in the same file for 4.14 (the failing cluster)
Version-Release number of selected component (if applicable): OCP 4.14
How reproducible: 100%
Steps to Reproduce:
This is the install script that our Ansible job uses to install 4.12. If you need it cleared up let me know; all the items in {{}} are just variables for file paths.
cp -r {{ item.0.cluster_name }}/install-config.yaml {{ openshift_base }}{{ item.0.cluster_name }}/
./openshift-install create manifests --dir {{ openshift_base }}{{ item.0.cluster_name }}/
cp -r machineconfigs/* {{ openshift_base }}{{ item.0.cluster_name }}/openshift/
cp -r {{ item.0.cluster_name }}/customizations/* {{ openshift_base }}{{ item.0.cluster_name }}/openshift/
./openshift-install create ignition-configs --dir {{ openshift_base }}{{ item.0.cluster_name }}/
./openshift-install create cluster --dir {{ openshift_base }}{{ item.0.cluster_name }} --log-level=debug
We are installing IPI on VMware. API and Ingress VIPs are configured on our external load balancer appliance (Citrix ADCs, if that matters).
Actual results:
The haproxy pods crashloop and do not work. In 4.14, following the same install workflow, neither the API nor the Ingress VIP binds to masters or workers, and we see haproxy crashlooping.
Expected results:
For 4.12: after a completed 4.12 install, if we look in VMware at our master and worker nodes, all of them have an IP address from the machine network assigned, and one node among the masters and one among the workers also has the corresponding VIP bound to it.
Additional info:
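For reference, the relevant install-config relationships, sketched with this cluster's values; on recent releases the VIPs are normally expected to sit inside the machineNetwork, and keeping them outside it implies an external/user-managed load balancer setup (exact field support varies by release, so treat this as illustrative):
cat <<'EOF' > vsphere-network.snippet.yaml
networking:
  machineNetwork:
  - cidr: 10.8.42.0/23
platform:
  vsphere:
    apiVIPs:
    - 10.8.0.83
    ingressVIPs:
    - 10.8.0.84
EOF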
Description of problem:
If one attempts to create more than one MachineOSConfig at the same time that requires a canonicalized secret, only one will build. The rest will not build.
Version-Release number of selected component (if applicable):
4.16
How reproducible:
Always
Steps to Reproduce:
1. Create multiple MachineConfigPools. Wait for each MachineConfigPool to get a rendered config.
2. Create multiple MachineOSConfigs at the same time for each of the newly-created MachineConfigPools that use a legacy Docker pull secret. A legacy Docker pull secret is one which does not have each of its secrets under a top-level auths key. One can use the builder-dockercfg secret in the MCO namespace for this purpose.
3. Wait for the machine-os-builder pod to start.
Actual results:
Only one of the MachineOSBuilds begins building. The remaining MachineOSBuilds do not build nor do they get a status assigned to them. The root cause is because if they both attempt to use the same legacy Docker pull secret, one will create the canonicalized version of it. Subsequent requests that occur concurrently will fail because the canonicalized secret already exists.
Expected results:
Each MachineOSBuild should occur whenever it is created. It should also have some kind of status assigned to it as well.
Additional info:
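For context, canonicalizing a legacy pull secret by hand looks roughly like this (a sketch; the secret name is a placeholder for the builder-dockercfg-* secret mentioned above):
# Wrap the legacy .dockercfg map under a top-level "auths" key
oc -n openshift-machine-config-operator get secret <builder-dockercfg-xxxxx> \
  -o go-template='{{index .data ".dockercfg" | base64decode}}' \
  | jq '{auths: .}' > canonical-config.json
# Re-create it as a kubernetes.io/dockerconfigjson secret
oc -n openshift-machine-config-operator create secret generic my-canonical-pull-secret \
  --type=kubernetes.io/dockerconfigjson \
  --from-file=.dockerconfigjson=canonical-config.json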
Multiple PRs are failing with this error:
Deploy git workload with devfile from topology page: A-04-TC01: Create the different workloads from Add page
Deploy git workload with devfile from topology page: A-04-TC01 (18s)
CypressError: `cy.focus()` can only be called on a single element. Your subject contained 14 elements. https://on.cypress.io/focus
This is a clone of issue OCPBUGS-44162. The following is the description of the original issue:
—
Description of problem:
We were told that adding connections to a Transit Gateway also costs an exorbitant amount of money. So the create option tgName now means that we will not clean up the connections during destroy cluster.
Description of problem:
Cluster-ingress-operator logs an update when one didn't happen.
% grep -e 'successfully updated Infra CR with Ingress Load Balancer IPs' -m 1 -- ingress-operator.log
2024-05-17T14:46:01.434Z INFO operator.ingress_controller ingress/controller.go:326 successfully updated Infra CR with Ingress Load Balancer IPs
% grep -e 'successfully updated Infra CR with Ingress Load Balancer IPs' -c -- ingress-operator.log
142
https://github.com/openshift/cluster-ingress-operator/pull/1016 has a logic error, which causes the operator to log this message even when it didn't do an update:
// If the lbService exists for the "default" IngressController, then update Infra CR's PlatformStatus with the Ingress LB IPs.
if haveLB && ci.Name == manifests.DefaultIngressControllerName {
    if updated, err := computeUpdatedInfraFromService(lbService, infraConfig); err != nil {
        errs = append(errs, fmt.Errorf("failed to update Infrastructure PlatformStatus: %w", err))
    } else if updated {
        if err := r.client.Status().Update(context.TODO(), infraConfig); err != nil {
            errs = append(errs, fmt.Errorf("failed to update Infrastructure CR after updating Ingress LB IPs: %w", err))
        }
    }
    log.Info("successfully updated Infra CR with Ingress Load Balancer IPs")
}
Version-Release number of selected component (if applicable):
4.17
How reproducible:
100%
Steps to Reproduce:
1. Create a LB service for the default Ingress Operator 2. Watch ingress operator logs for the search strings mentioned above
Actual results:
Lots of these log entries will be seen even though no further updates are made to the default ingress operator: 2024-05-17T14:46:01.434Z INFO operator.ingress_controller ingress/controller.go:326 successfully updated Infra CR with Ingress Load Balancer IPs
Expected results:
Only see this log entry when an update to Infra CR is made. Perhaps just one the first time you add a LB service to the default ingress operator.
Additional info:
https://github.com/openshift/cluster-ingress-operator/pull/1016 was backported to 4.15, so it would be nice to fix it and backport the fix to 4.15. It is rather noisy, and it's trivial to fix.
Description of problem: ovnkube-node and multus DaemonSets have hostPath volumes which prevent clean unmount of CSI Volumes because of missing "mountPropagation: HostToContainer" parameter in volumeMount
Version-Release number of selected component (if applicable): OpenShift 4.14
How reproducible: Always
Steps to Reproduce:
1. on a node mount a file system underneath /var/lib/kubelet/ simulating the mount of a CSI driver PersistentVolume
2. restart the ovnkube-node pod running on that node
3. unmount the filesystem from 1. The mount will then be removed from the host list of mounted devices however a copy of the mount is still active in the mount namespace of the ovnkube-node pod.
This is blocking some CSI drivers relying on multipath to properly delete a block device, since mounts are still registered on the block device.
Actual results:
CSI Volume Mount cleanly unmounted.
Expected results:
CSI Volume Mount uncleanly unmounted.
Additional info:
The mountPropagation parameter is already implememted in the volumeMount for the host rootFS:
- name: host-slash
readOnly: true
mountPath: /host
mountPropagation: HostToContainer
However the same parameter is missing for the volumeMount of /var/lib/kubelet
It is possible to workaround the issue with a kubectl patch command like this:
$ kubectl patch daemonset ovnkube-node --type='json' -p='[
{
"op": "replace",
"path": "/spec/template/spec/containers/7/volumeMounts/1",
"value": {
"name": "host-kubelet",
"mountPath": "/var/lib/kubelet",
"mountPropagation": "HostToContainer",
"readOnly": true
}
}
]'
Affected Platforms: Platform Agnostic UPI
Description of problem:
Using "accessTokenInactivityTimeoutSeconds: 900" for "OAuthClient" config. One inactive or idle tab causes session expiry for all other tabs. Following are the tests performed: Test 1 - a single window with a single tab no activity would time out after 15 minutes. Test 2 - a single window two tabs. No activity in the first tab, but was active in the second tab. Timeout occurred for both tabs after 15 minutes. Test 3 - a single window with a single tab and activity, does not time out after 15 minutes. Hence single idle tab causes the user logout from rest of the tabs.
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. Set OAuthClient.accessTokenInactivityTimeoutSeconds to 300 (or any value)
2. Log in to the OCP web console and open multiple tabs.
3. Keep one tab idle and work in the other open tabs.
4. After 5 minutes the session expires for all tabs.
Actual results:
One inactive or idle tab causes session expiry for all other tabs.
Expected results:
Session should not be expired if any tab is not idle.
Additional info:
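For reproduction, the setting referenced above can be applied directly to the console's OAuth client (a sketch; 900 seconds matches the 15-minute behaviour described):
oc patch oauthclient console --type merge -p '{"accessTokenInactivityTimeoutSeconds": 900}'
oc get oauthclient console -o jsonpath='{.accessTokenInactivityTimeoutSeconds}{"\n"}'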
Description of problem:
The TechPreviewNoUpgrade featureset can be disabled on a 4.16 cluster after enabling it. But according to the official doc, `Enabling this feature set cannot be undone and prevents minor version updates`, so it should not be possible to disable it.
# ./oc get featuregate cluster -ojson | jq .spec
{
  "featureSet": "TechPreviewNoUpgrade"
}
# ./oc patch featuregate cluster --type=json -p '[{"op":"remove", "path":"/spec/featureSet"}]'
featuregate.config.openshift.io/cluster patched
# ./oc get featuregate cluster -ojson | jq .spec
{}
Version-Release number of selected component (if applicable):
4.16.0-0.nightly-2024-06-03-060250
How reproducible:
always
Steps to Reproduce:
1. Enable the TechPreviewNoUpgrade featureset on a 4.16 cluster
2. Then remove it
3.
Actual results:
TechPreviewNoUpgrade featureset was disabled
Expected results:
Enabling this feature set cannot be undone
Additional info:
https://github.com/openshift/api/blob/master/config/v1/types_feature.go#L43-L44
This is a clone of issue OCPBUGS-42412. The following is the description of the original issue:
—
Description of problem:
When running the 4.17 installer QE full-function tests, the following amd64 instance types were detected and tested successfully, so append them to the installer doc[1]:
* standardBasv2Family
* StandardNGADSV620v1Family
* standardMDSHighMemoryv3Family
* standardMIDSHighMemoryv3Family
* standardMISHighMemoryv3Family
* standardMSHighMemoryv3Family
[1] https://github.com/openshift/installer/blob/master/docs/user/azure/tested_instance_types_x86_64.md
Version-Release number of selected component (if applicable):
4.17
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Please review the following PR: https://github.com/openshift/vmware-vsphere-csi-driver/pull/120
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
After successfully mirroring the ibm-ftm-operator via the latest oc-mirror command to an internal registry and applying the newly generated IBM CatalogSource YAML file, the created catalog pod in the openshift-marketplace namespace enters CrashLoopBackOff. The customer is trying to mirror operators; listing the catalog with the command works, but the catalog pod is crashing with the following error:
~~~
time="2024-07-10T13:43:07Z" level=info msg="starting pprof endpoint" address="localhost:6060"
time="2024-07-10T13:43:08Z" level=fatal msg="cache requires rebuild: cache reports digest as \"e891bfd5a4cb5702\", but computed digest is \"1922475dc0ee190c\""
~~~
Version-Release number of selected component (if applicable):
oc-mirror 4.16 OCP 4.14.z
How reproducible:
Steps to Reproduce:
1. Create the catalog image with the following ImageSetConfiguration:
~~~
kind: ImageSetConfiguration
apiVersion: mirror.openshift.io/v1alpha2
archiveSize: 4
storageConfig:
  registry:
    imageURL: <internal-registry>:Port/oc-mirror-metadata/12july24
    skipTLS: false
mirror:
  platform:
    architectures:
    - "amd64"
    channels:
    - name: stable-4.14
      minVersion: 4.14.11
      maxVersion: 4.14.30
      type: ocp
      shortestPath: true
    graph: true
  operators:
  - catalog: icr.io/cpopen/ibm-operator-catalog:v1.22
    packages:
    - name: ibm-ftm-operator
      channels:
      - name: v4.4
~~~
2. Run the following command:
~~~
/oc-mirror --config=./imageset-config.yaml docker://Internal-registry:Port --rebuild-catalogs
~~~
3. Create the CatalogSource under the openshift-marketplace namespace:
~~~
cat oc-mirror-workspace/results-1721222945/catalogSource-cs-ibm-operator-catalog.yaml
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: cs-ibm-operator-catalog
  namespace: openshift-marketplace
spec:
  image: Internal-registry:Port/cpopen/ibm-operator-catalog:v1.22
  sourceType: grpc
~~~
Actual results:
The catalog pod is crashing with the following error:
~~~
time="2024-07-10T13:43:07Z" level=info msg="starting pprof endpoint" address="localhost:6060"
time="2024-07-10T13:43:08Z" level=fatal msg="cache requires rebuild: cache reports digest as \"e891bfd5a4cb5702\", but computed digest is \"1922475dc0ee190c\""
~~~
Expected results:
The pod should run without any issue.
Additional info:
1. The issue is reproducible with OCP 4.14.14 and OCP 4.14.29.
2. The customer is already using oc-mirror 4.16:
~~~
./oc-mirror version
WARNING: This version information is deprecated and will be replaced with the output from --short. Use --output=yaml|json to get the full version.
Client Version: version.Info{Major:"", Minor:"", GitVersion:"4.16.0-202407030803.p0.g394b1f8.assembly.stream.el9-394b1f8", GitCommit:"394b1f814f794f4f01f473212c9a7695726020bf", GitTreeState:"clean", BuildDate:"2024-07-03T10:18:49Z", GoVersion:"go1.21.11 (Red Hat 1.21.11-1.module+el8.10.0+21986+2112108a) X:strictfipsruntime", Compiler:"gc", Platform:"linux/amd64"}
~~~
3. The customer tried the workaround described in the KB (https://access.redhat.com/solutions/7006771), but with no luck.
4. The customer also tried to set OPM_BINARY, but it didn't work. They downloaded opm for the respective arch (https://github.com/operator-framework/operator-registry/releases), renamed the downloaded binary to opm, and set the variable below before executing oc-mirror:
OPM_BINARY=/path/to/opm
Please review the following PR: https://github.com/openshift/csi-external-provisioner/pull/97
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
- Pods that reside in a namespace utilizing EgressIP are experiencing intermittent TCP IO timeouts when attempting to communicate with external services.
❯ oc exec gitlab-runner-aj-02-56998875b-n6xxb -- bash -c 'while true; do timeout 3 bash -c "</dev/tcp/10.135.108.56/443" && echo "Connection success" || echo "Connection timeout"; sleep 0.5; done'
Connection success
Connection timeout
Connection timeout
Connection timeout
Connection timeout
Connection timeout
Connection success
Connection timeout
Connection success
# Get pod node and podIP variable for the problematic pod
❯ oc get pod gitlab-runner-aj-02-56998875b-n6xxb -ojson 2>/dev/null | jq -r '"\(.metadata.name) \(.spec.nodeName) \(.status.podIP)"' | read -r pod node podip
# Find the ovn-kubernetes pod running on the same node as gitlab-runner-aj-02-56998875b-n6xxb
❯ oc get pods -n openshift-ovn-kubernetes -lapp=ovnkube-node -ojson | jq --arg node "$node" -r '.items[] | select(.spec.nodeName == $node)| .metadata.name' | read -r ovn_pod
# Collect each possible logical switch port address into variable LSP_ADDRESSES
❯ LSP_ADDRESSES=$(oc -n openshift-ovn-kubernetes exec ${ovn_pod} -it -c northd -- bash -c 'ovn-nbctl lsp-list transit_switch | while read guid name; do printf "%s " "${name}"; ovn-nbctl lsp-get-addresses "${guid}"; done')
# List the logical router policy for the problematic pod
❯ oc -n openshift-ovn-kubernetes exec ${ovn_pod} -c northd -- ovn-nbctl find logical_router_policy match="\"ip4.src == ${podip}\""
_uuid               : c55bec59-6f9a-4f01-a0b1-67157039edb8
action              : reroute
external_ids        : {name=gitlab-runner-caasandpaas-egress}
match               : "ip4.src == 172.40.114.40"
nexthop             : []
nexthops            : ["100.88.0.22", "100.88.0.57"]
options             : {}
priority            : 100
# Check whether each nexthop entry exists in the LSP addresses table
❯ echo $LSP_ADDRESSES | grep 100.88.0.22
(tstor-c1nmedi01-9x2g9-worker-cloud-paks-m9t6b) 0a:58:64:58:00:16 100.88.0.22/16
❯ echo $LSP_ADDRESSES | grep 100.88.0.57
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1.
2.
3.
Actual results:
Expected results:
Additional info:
Please fill in the following template while reporting a bug and provide as much relevant information as possible. Doing so will give us the best chance to find a prompt resolution.
Affected Platforms:
Is it an
If it is an internal RedHat testing failure:
If it is a CI failure:
If it is a customer / SD issue:
Description of problem:
When an openshift-sdn cluster is set up in multitenant mode, or when an openshift-sdn cluster has an egress router pod, live migration to OVN-Kubernetes should be blocked. However, the network operator still proceeds and migrates to OVN-Kubernetes in both cases.
Version-Release number of selected component (if applicable):
pre-merge https://github.com/openshift/cluster-network-operator/pull/2392
How reproducible:
always
Steps to Reproduce:
1. Setup an openshift-sdn (multitenant) cluster.
2. Do the migration with:
oc patch Network.config.openshift.io cluster --type='merge' --patch '{"metadata":{"annotations":{"network.openshift.io/network-type-migration":""}},"spec":{"networkType":"OVNKubernetes"}}'
3.
Actual results:
After the error is shown, the migration still keeps going.
Expected results:
network operator should block the migration
Additional info:
This is a clone of issue OCPBUGS-38274. The following is the description of the original issue:
—
Description of problem:
When the vSphere CSI driver is removed (using managementState: Removed), it leaves all existing conditions in the ClusterCSIDriver. IMO it should delete all of them and keep only something like "Disabled: true", as we do for the Manila CSI driver operator.
Version-Release number of selected component (if applicable):
4.17.0-0.nightly-2024-08-09-031511
How reproducible: always
Steps to Reproduce:
Actual results: All Deployment + DaemonSet conditions are present
Expected results: The conditions are pruned.
https://redhat-internal.slack.com/archives/C01CQA76KMX/p1717069106405899
Joseph Callen reported this test is failing a fair bit on vSphere, and it looks like it's usually the only thing failing. Thomas has some etcd comments in the thread; we need to decide what we should do here.
Also, new vSphere hardware is being phased in, which doesn't seem to show the problem.
Move to a flake on vsphere? Kill the test?
Please review the following PR: https://github.com/openshift/telemeter/pull/532
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
ConsoleYAMLSample CRD
  redirect to home
  ensure perspective switcher is set to Administrator
  1) creates, displays, tests and deletes a new ConsoleYAMLSample instance
0 passing (2m)
1 failing
1) ConsoleYAMLSample CRD creates, displays, tests and deletes a new ConsoleYAMLSample instance:
   AssertionError: Timed out retrying after 30000ms: Expected to find element: `[data-test-action="View instances"]:not([disabled])`, but never found it.
   at Context.eval (webpack:///./support/selectors.ts:47:5)
Tracker issue for bootimage bump in 4.17. This issue should block issues which need a bootimage bump to fix.
The previous bump was OCPBUGS-41283.
Tracker issue for bootimage bump in 4.17. This issue should block issues which need a bootimage bump to fix.
The previous bump was OCPBUGS-41622.
Description of problem:
Intermittent error during the installation process when enabling Cluster API (CAPI) in the install-config for OCP 4.16 tech preview IPI installation on top of OSP. The error occurs during the post-machine creation hook, specifically related to Floating IP association.
Version-Release number of selected component (if applicable):
OCP: 4.16.0-0.nightly-2024-05-16-092402 TP enabled on top of OSP: RHOS-17.1-RHEL-9-20240123.n.1
How reproducible:
The issue occurs intermittently: sometimes the installation succeeds, and other times it fails.
Steps to Reproduce:
1. Install OSP
2. Initiate OCP installation with TP and CAPI enabled
3. Observe the installation logs of the failed installation.
Actual results:
The installation fails intermittently with the following error message: ... 2024-05-17 23:37:51.590 | level=debug msg=E0517 23:37:29.833599 266622 controller.go:329] "Reconciler error" err="failed to create cluster accessor: error creating http client and mapper for remote cluster \"openshift-cluster-api-guests/ostest-4qrz2\": error creating client for remote cluster \"openshift-cluster-api-guests/ostest-4qrz2\": error getting rest mapping: failed to get API group resources: unable to retrieve the complete list of server APIs: v1: Get \"https://api.ostest.shiftstack.com:6443/api/v1?timeout=10s\": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)" controller="machine" controllerGroup="cluster.x-k8s.io" controllerKind="Machine" Machine="openshift-cluster-api-guests/ostest-4qrz2-master-0" namespace="openshift-cluster-api-guests" name="ostest-4qrz2-master-0" reconcileID="985ba50c-2a1d-41f6-b494-f5af7dca2e7b" 2024-05-17 23:37:51.597 | level=debug msg=E0517 23:37:39.838706 266622 controller.go:329] "Reconciler error" err="failed to create cluster accessor: error creating http client and mapper for remote cluster \"openshift-cluster-api-guests/ostest-4qrz2\": error creating client for remote cluster \"openshift-cluster-api-guests/ostest-4qrz2\": error getting rest mapping: failed to get API group resources: unable to retrieve the complete list of server APIs: v1: Get \"https://api.ostest.shiftstack.com:6443/api/v1?timeout=10s\": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)" controller="machine" controllerGroup="cluster.x-k8s.io" controllerKind="Machine" Machine="openshift-cluster-api-guests/ostest-4qrz2-master-0" namespace="openshift-cluster-api-guests" name="ostest-4qrz2-master-0" reconcileID="dfe5f138-ac8e-4790-948f-72d6c8631f21" 2024-05-17 23:37:51.603 | level=debug msg=Machine ostest-4qrz2-master-0 is ready. Phase: Provisioned 2024-05-17 23:37:51.610 | level=debug msg=Machine ostest-4qrz2-master-1 is ready. Phase: Provisioned 2024-05-17 23:37:51.615 | level=debug msg=Machine ostest-4qrz2-master-2 is ready. Phase: Provisioned 2024-05-17 23:37:51.619 | level=info msg=Control-plane machines are ready 2024-05-17 23:37:51.623 | level=error msg=failed to fetch Cluster: failed to generate asset "Cluster": failed to create cluster: failed during post-machine creation hook: Resource not found: [POST https://10.46.44.159:13696/v2.0/floatingips], error message: {"NeutronError": {"type": "ExternalGatewayForFloatingIPNotFound", "message": "External network 654792e9-dead-485a-beec-f3c428ef71da is not reachable from subnet d9829374-f0de-4a41-a1c0-a2acdd4841da. Therefore, cannot associate Port 01c518a9-5d5f-42d8-a090-6e3151e8af3f with a Floating IP.", "detail": ""}} 2024-05-17 23:37:51.629 | level=info msg=Shutting down local Cluster API control plane... 2024-05-17 23:37:51.637 | level=info msg=Stopped controller: Cluster API 2024-05-17 23:37:51.643 | level=warning msg=process cluster-api-provider-openstack exited with error: signal: killed 2024-05-17 23:37:51.653 | level=info msg=Stopped controller: openstack infrastructure provider 2024-05-17 23:37:51.659 | level=info msg=Local Cluster API system has completed operations
Expected results:
The installation should complete successfully
Additional info: CAPI is enabled by adding the following to the install-config:
featureSet: 'CustomNoUpgrade'
featureGates: ['ClusterAPIInstall=true']
Please review the following PR: https://github.com/openshift/machine-api-provider-ibmcloud/pull/37
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/ovn-kubernetes/pull/2186
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
This is a clone of issue OCPBUGS-34800. The following is the description of the original issue:
—
The APIRemovedInNextReleaseInUse and APIRemovedInNextEUSReleaseInUse need to be updated for kube 1.30 in OCP 4.17.
This is a clone of issue OCPBUGS-37491. The following is the description of the original issue:
—
Description of problem:
co/ingress is always good even though the operator pod logs this error:
2024-07-24T06:42:09.580Z ERROR operator.canary_controller wait/backoff.go:226 error performing canary route check {"error": "error sending canary HTTP Request: Timeout: Get \"https://canary-openshift-ingress-canary.apps.hongli-aws.qe.devcluster.openshift.com\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"}
Version-Release number of selected component (if applicable):
4.17.0-0.nightly-2024-07-20-191204
How reproducible:
100%
Steps to Reproduce:
1. Install an AWS cluster.
2. Update ingresscontroller/default and add "endpointPublishingStrategy.loadBalancer.allowedSourceRanges", e.g.:
spec:
  endpointPublishingStrategy:
    loadBalancer:
      allowedSourceRanges:
      - 1.1.1.2/32
3. The above setting drops most traffic to the LB, so some operators become degraded.
Actual results:
co/authentication and console are degraded but co/ingress is still good:
$ oc get co
NAME             VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication   4.17.0-0.nightly-2024-07-20-191204   False       False         True       22m     OAuthServerRouteEndpointAccessibleControllerAvailable: Get "https://oauth-openshift.apps.hongli-aws.qe.devcluster.openshift.com/healthz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
console          4.17.0-0.nightly-2024-07-20-191204   False       False         True       22m     RouteHealthAvailable: failed to GET route (https://console-openshift-console.apps.hongli-aws.qe.devcluster.openshift.com): Get "https://console-openshift-console.apps.hongli-aws.qe.devcluster.openshift.com": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
ingress          4.17.0-0.nightly-2024-07-20-191204   True        False         False      3h58m
Check the ingress operator log and see:
2024-07-24T06:59:09.588Z ERROR operator.canary_controller wait/backoff.go:226 error performing canary route check {"error": "error sending canary HTTP Request: Timeout: Get \"https://canary-openshift-ingress-canary.apps.hongli-aws.qe.devcluster.openshift.com\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"}
Expected results:
co/ingress status should reflect the real condition timely
Additional info:
Even though co/ingress status can be updated in some scenarios, it is always less sensitive than authentication and console. We always rely on authentication/console to know whether the route is healthy, so the purpose of the ingress canary route becomes meaningless.
Description of the problem:
it is allowed to create a patch file with the name:
".yaml.patch"
which actually means a patch file for a manifest file named ".yaml"
How reproducible:
Steps to reproduce:
1. create any cluster
2. try to add patch manifest file with the name .yaml.patch
3.
Actual results:
no exception when trying to add it
Expected results:
It should be blocked, since an empty base name does not make sense: it is not possible to create a manifest file named just ".yaml".
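For illustration, a minimal Go sketch of the kind of validation this expected behaviour implies; the function name and suffix handling are assumptions, not the assisted-service's actual code:
~~~
package main

import (
	"fmt"
	"strings"
)

// validatePatchName rejects patch file names whose base manifest name is empty,
// e.g. ".yaml.patch", which would refer to a manifest literally named ".yaml".
func validatePatchName(name string) error {
	base, ok := strings.CutSuffix(name, ".patch")
	if !ok {
		return fmt.Errorf("%q is not a patch file (missing .patch suffix)", name)
	}
	stem, isYAML := strings.CutSuffix(base, ".yaml")
	if !isYAML {
		stem, isYAML = strings.CutSuffix(base, ".yml")
	}
	if !isYAML {
		return fmt.Errorf("%q does not patch a .yaml/.yml manifest", name)
	}
	if stem == "" {
		return fmt.Errorf("%q patches an empty manifest name, which cannot exist", name)
	}
	return nil
}

func main() {
	for _, name := range []string{".yaml.patch", "config.yaml.patch"} {
		fmt.Printf("%s -> %v\n", name, validatePatchName(name))
	}
}
~~~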
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
After enabling a separate Alertmanager instance for user-defined alert routing, the alertmanager-user-workload pods are initialized but the alertmanager-trusted-ca-bundle configmap is not injected into the pods. [-] https://docs.openshift.com/container-platform/4.15/observability/monitoring/enabling-alert-routing-for-user-defined-projects.html#enabling-a-separate-alertmanager-instance-for-user-defined-alert-routing_enabling-alert-routing-for-user-defined-projects
Version-Release number of selected component (if applicable):
RHOCP 4.13, 4.14 and 4.15
How reproducible:
100%
Steps to Reproduce:
1. Enable user-workload monitoring using [a].
2. Enable a separate alertmanager instance for user-defined alert routing using [b].
3. Check if the alertmanager-trusted-ca-bundle configmap is injected in the alertmanager-user-workload pods which are running in the openshift-user-workload-monitoring project:
$ oc describe pod alertmanager-user-workload-0 -n openshift-user-workload-monitoring | grep alertmanager-trusted-ca-bundle
[a] https://docs.openshift.com/container-platform/4.15/observability/monitoring/enabling-monitoring-for-user-defined-projects.html#enabling-monitoring-for-user-defined-projects_enabling-monitoring-for-user-defined-projects
[b] https://docs.openshift.com/container-platform/4.15/observability/monitoring/enabling-alert-routing-for-user-defined-projects.html#enabling-a-separate-alertmanager-instance-for-user-defined-alert-routing_enabling-a-separate-alertmanager-instance-for-user-defined-alert-routing_enabling-alert-routing-for-user-defined-projects
Actual results:
alertmanager-user-workload pods are NOT injected with alertmanager-trusted-ca-bundle configmap.
Expected results:
alertmanager-user-workload pods should be injected with alertmanager-trusted-ca-bundle configmap.
Additional info:
Similar configmap is injected fine in alertmanager-main pods which are running in openshift-monitoring project.
For a while now we have a "nasty" carry titled "UPSTREAM: <carry>: don't fail integration due to too many goroutines" which only prints information about leaking goroutines but doesn't fail.
See https://github.com/openshift/kubernetes/commit/501f19354bb79f0566039907b179974444f477a2 as an example of that commit.
Upstream went through a major refactoring, reported under https://github.com/kubernetes/kubernetes/issues/108483, which was meant to prevent those leaks; unfortunately, in our case we are still subject to this problem.
Please review the following PR: https://github.com/openshift/cluster-samples-operator/pull/547
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
We would like to be able to use an annotation to set the verbosity of kube-apiserver for HyperShift. By default the verbosity is locked to level 2. Allowing this to be configurable would enable better debugging when desired. This is configurable in OpenShift, and we would like to extend it to HyperShift as well.
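For illustration, a minimal Go sketch of how an annotation-driven override could work; the annotation name hypershift.openshift.io/kube-apiserver-verbosity is a hypothetical placeholder, not an existing HyperShift API:
~~~
package main

import (
	"fmt"
	"strconv"
)

// Hypothetical annotation name, chosen only for illustration.
const verbosityAnnotation = "hypershift.openshift.io/kube-apiserver-verbosity"

// kasVerbosity returns the log level to use: the annotation value if it parses
// as a non-negative integer, otherwise the current hard-coded default of 2.
func kasVerbosity(annotations map[string]string) int {
	if v, ok := annotations[verbosityAnnotation]; ok {
		if n, err := strconv.Atoi(v); err == nil && n >= 0 {
			return n
		}
	}
	return 2
}

func main() {
	annotations := map[string]string{verbosityAnnotation: "4"}
	args := []string{"--anonymous-auth=false"}
	args = append(args, "--v="+strconv.Itoa(kasVerbosity(annotations)))
	fmt.Println(args) // [--anonymous-auth=false --v=4]
}
~~~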
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
n/a
Actual results:
Expected results:
Additional info:
This is a clone of issue OCPBUGS-41936. The following is the description of the original issue:
—
Description of problem:
IBM Cloud CCM was reconfigured to use loopback as the bind address in 4.16. However, the liveness probe was not configured to use loopback too, so the CCM constantly fails the liveness probe and restarts continuously.
Version-Release number of selected component (if applicable):
4.17
How reproducible:
100%
Steps to Reproduce:
1. Create an IPI cluster on IBM Cloud.
2. Watch the IBM Cloud CCM pod; its restarts increase every 5 mins (liveness probe timeout).
Actual results:
# oc --kubeconfig cluster-deploys/eu-de-4.17-rc2-3/auth/kubeconfig get po -n openshift-cloud-controller-manager
NAME                                            READY   STATUS             RESTARTS          AGE
ibm-cloud-controller-manager-58f7747d75-j82z8   0/1     CrashLoopBackOff   262 (39s ago)     23h
ibm-cloud-controller-manager-58f7747d75-l7mpk   0/1     CrashLoopBackOff   261 (2m30s ago)   23h
Normal Killing 34m (x2 over 40m) kubelet Container cloud-controller-manager failed liveness probe, will be restarted
Normal Pulled 34m (x2 over 40m) kubelet Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:5ac9fb24a0e051aba6b16a1f9b4b3f9d2dd98f33554844953dd4d1e504fb301e" already present on machine
Normal Created 34m (x3 over 45m) kubelet Created container cloud-controller-manager
Normal Started 34m (x3 over 45m) kubelet Started container cloud-controller-manager
Warning Unhealthy 29m (x8 over 40m) kubelet Liveness probe failed: Get "https://10.242.129.4:10258/healthz": dial tcp 10.242.129.4:10258: connect: connection refused
Warning ProbeError 3m4s (x22 over 40m) kubelet Liveness probe error: Get "https://10.242.129.4:10258/healthz": dial tcp 10.242.129.4:10258: connect: connection refused body:
Expected results:
CCM runs continuously, as it does on 4.15:
# oc --kubeconfig cluster-deploys/eu-de-4.15.10-1/auth/kubeconfig get po -n openshift-cloud-controller-manager
NAME                                            READY   STATUS    RESTARTS   AGE
ibm-cloud-controller-manager-66d4779cb8-gv8d4   1/1     Running   0          63m
ibm-cloud-controller-manager-66d4779cb8-pxdrs   1/1     Running   0          63m
Additional info:
IBM Cloud has a PR open to fix the liveness probe: https://github.com/openshift/cluster-cloud-controller-manager-operator/pull/360
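For illustration only, a minimal Go sketch of a liveness probe pointed at loopback instead of the pod IP, assuming the health endpoint stays on port 10258; this is a sketch, not the actual change in the linked PR:
~~~
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

func main() {
	// When the manager binds to loopback only, probing the pod IP (the kubelet
	// default) gets "connection refused"; point the probe at 127.0.0.1 instead.
	probe := &corev1.Probe{
		ProbeHandler: corev1.ProbeHandler{
			HTTPGet: &corev1.HTTPGetAction{
				Host:   "127.0.0.1", // match the loopback bind address
				Port:   intstr.FromInt(10258),
				Path:   "/healthz",
				Scheme: corev1.URISchemeHTTPS,
			},
		},
		InitialDelaySeconds: 30,
		PeriodSeconds:       10,
	}
	fmt.Printf("liveness probe: %+v\n", probe.ProbeHandler.HTTPGet)
}
~~~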
Please review the following PR: https://github.com/openshift/oc/pull/1779
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
This is a clone of issue OCPBUGS-42231. The following is the description of the original issue:
—
Description of problem:
OCP Conformance MonitorTests can fail depending on the order in which the CSI driver pods and ClusterRole are applied. The ServiceAccount, ClusterRole, and ClusterRoleBinding should likely be applied before the Deployment/DaemonSet pods.
Version-Release number of selected component (if applicable):
4.18.0
How reproducible:
60%
Steps to Reproduce:
1. Create IPI cluster on IBM Cloud
2. Run OCP Conformance w/ MonitorTests
Actual results:
: [sig-auth][Feature:SCC][Early] should not have pod creation failures during install [Suite:openshift/conformance/parallel] { fail [github.com/openshift/origin/test/extended/authorization/scc.go:76]: 1 pods failed before test on SCC errors Error creating: pods "ibm-vpc-block-csi-node-" is forbidden: unable to validate against any security context constraint: [provider "anyuid": Forbidden: not usable by user or serviceaccount, spec.volumes[0]: Invalid value: "hostPath": hostPath volumes are not allowed to be used, spec.volumes[1]: Invalid value: "hostPath": hostPath volumes are not allowed to be used, spec.volumes[2]: Invalid value: "hostPath": hostPath volumes are not allowed to be used, spec.volumes[3]: Invalid value: "hostPath": hostPath volumes are not allowed to be used, spec.volumes[4]: Invalid value: "hostPath": hostPath volumes are not allowed to be used, spec.volumes[5]: Invalid value: "hostPath": hostPath volumes are not allowed to be used, spec.volumes[6]: Invalid value: "hostPath": hostPath volumes are not allowed to be used, spec.volumes[7]: Invalid value: "hostPath": hostPath volumes are not allowed to be used, spec.volumes[9]: Invalid value: "hostPath": hostPath volumes are not allowed to be used, provider restricted-v2: .containers[0].runAsUser: Invalid value: 0: must be in the ranges: [1000180000, 1000189999], provider restricted-v2: .containers[1].runAsUser: Invalid value: 0: must be in the ranges: [1000180000, 1000189999], provider restricted-v2: .containers[1].privileged: Invalid value: true: Privileged containers are not allowed, provider restricted-v2: .containers[2].runAsUser: Invalid value: 0: must be in the ranges: [1000180000, 1000189999], provider "restricted": Forbidden: not usable by user or serviceaccount, provider "nonroot-v2": Forbidden: not usable by user or serviceaccount, provider "nonroot": Forbidden: not usable by user or serviceaccount, provider "hostmount-anyuid": Forbidden: not usable by user or serviceaccount, provider "machine-api-termination-handler": Forbidden: not usable by user or serviceaccount, provider "hostnetwork-v2": Forbidden: not usable by user or serviceaccount, provider "hostnetwork": Forbidden: not usable by user or serviceaccount, provider "hostaccess": Forbidden: not usable by user or serviceaccount, provider "privileged": Forbidden: not usable by user or serviceaccount] for DaemonSet.apps/v1/ibm-vpc-block-csi-node -n openshift-cluster-csi-drivers happened 7 times Ginkgo exit error 1: exit with code 1}
Expected results:
No pod creation failures using the wrong SCC, because the ClusterRole/ClusterRoleBinding, etc. had not been applied yet.
Additional info:
Sorry, I did not see an IBM Cloud Storage listed in the targeted Component for this bug, so selected the generic Storage component. Please forward as necessary/possible.
Items to consider:
ClusterRole: https://github.com/openshift/ibm-vpc-block-csi-driver-operator/blob/master/assets/rbac/privileged_role.yaml
ClusterRoleBinding: https://github.com/openshift/ibm-vpc-block-csi-driver-operator/blob/master/assets/rbac/node_privileged_binding.yaml
The ibm-vpc-block-csi-node-* pods eventually reach running using privileged SCC.
I do not know whether it is possible to stage the resources that get created first, within the CSI Driver Operator https://github.com/openshift/ibm-vpc-block-csi-driver-operator/blob/9288e5078f2fe3ce2e69a4be3d94622c164c3dbd/pkg/operator/starter.go#L98-L99
Prior to the CSI Driver daemonset (`node.yaml`), perhaps order matters within the list.
Example of failure in CI: https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_installer/8235/pull-ci-openshift-installer-master-e2e-ibmcloud-ovn/1836521032031145984
Description of problem:
Can't access the openshift namespace images without auth after granting public access to the openshift namespace
Version-Release number of selected component (if applicable):
4.16.0-0.nightly-2024-05-05-102537
How reproducible:
always
Steps to Reproduce:
1. $ oc patch configs.imageregistry.operator.openshift.io/cluster --patch '{"spec":{"defaultRoute":true}}' --type=merge
$ HOST=$(oc get route default-route -n openshift-image-registry --template='{{ .spec.host }}')
2. $ oc adm policy add-role-to-group system:image-puller system:unauthenticated --namespace openshift
Warning: Group 'system:unauthenticated' not found
clusterrole.rbac.authorization.k8s.io/system:image-puller added: "system:unauthenticated"
3. Try to fetch image metadata:
$ oc image info --insecure "${HOST}/openshift/cli:latest"
Actual results:
$ oc image info default-route-openshift-image-registry.apps.wxj-a41659.qe.azure.devcluster.openshift.com/openshift/cli:latest --insecure error: unable to read image default-route-openshift-image-registry.apps.wxj-a41659.qe.azure.devcluster.openshift.com/openshift/cli:latest: unauthorized: authentication required
Expected results:
Could get the public image info without auth
Additional info:
This is a regression in 4.16; this feature works on 4.15 and below.
This is a clone of issue OCPBUGS-42100. The following is the description of the original issue:
—
Description of problem:
HyperShift currently runs 3 replicas of active/passive HA deployments such as kube-controller-manager, kube-scheduler, etc. In order to reduce the overhead of running a HyperShift control plane, we should be able to run these deployments with 2 replicas. In a 3 zone environment with 2 replicas, we can still use a rolling update strategy, and set the maxSurge value to 1, as the new pod would schedule into the unoccupied zone.
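As a minimal sketch of the rolling-update settings described above, assuming a plain Kubernetes Deployment; the values are illustrative, not HyperShift's actual manifests:
~~~
package main

import (
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

func main() {
	replicas := int32(2)          // down from 3 for active/passive HA components
	maxSurge := intstr.FromInt(1) // the surge pod can schedule into the unoccupied third zone
	maxUnavailable := intstr.FromInt(0)

	spec := appsv1.DeploymentSpec{
		Replicas: &replicas,
		Strategy: appsv1.DeploymentStrategy{
			Type: appsv1.RollingUpdateDeploymentStrategyType,
			RollingUpdate: &appsv1.RollingUpdateDeployment{
				MaxSurge:       &maxSurge,
				MaxUnavailable: &maxUnavailable, // never drop below 2 running replicas
			},
		},
	}
	fmt.Printf("replicas=%d rollingUpdate=%+v\n", *spec.Replicas, *spec.Strategy.RollingUpdate)
}
~~~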
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
release-4.17 of openshift/cloud-provider-openstack is missing some commits that were backported in the upstream project into the release-1.30 branch. We should import them into our downstream fork.
Description of problem:
Installation of 4.16 fails with a AWS AccessDenied error trying to attach a bootstrap s3 bucket policy.
Version-Release number of selected component (if applicable):
4.16+
How reproducible:
Every time
Steps to Reproduce:
1. Create an installer policy with the permissions listed in the installer [here|https://github.com/openshift/installer/blob/master/pkg/asset/installconfig/aws/permissions.go]
2. Run an install in AWS IPI
Actual results:
Install fails attempting to attach a policy to the bootstrap s3 bucket:
{code:java}
time="2024-06-11T14:58:15Z" level=debug msg="I0611 14:58:15.485718 132 s3.go:256] \"Created bucket\" controller=\"awscluster\" controllerGroup=\"infrastructure.cluster.x-k8s.io\" controllerKind=\"AWSCluster\" AWSCluster=\"openshift-cluster-api-guests/jamesh-sts-8tl72\" namespace=\"openshift-cluster-api-guests\" name=\"jamesh-sts-8tl72\" reconcileID=\"c390f027-a2ee-4d37-9e5d-b6a11882c46b\" cluster=\"openshift-cluster-api-guests/jamesh-sts-8tl72\" bucket_name=\"openshift-bootstrap-data-jamesh-sts-8tl72\""
time="2024-06-11T14:58:15Z" level=debug msg="E0611 14:58:15.643613 132 controller.go:329] \"Reconciler error\" err=<"
time="2024-06-11T14:58:15Z" level=debug msg="\tfailed to reconcile S3 Bucket for AWSCluster openshift-cluster-api-guests/jamesh-sts-8tl72: ensuring bucket policy: creating S3 bucket policy: AccessDenied: Access Denied"
{code}
Expected results:
Install completes successfully
Additional info:
The installer did not attach an S3 bootstrap bucket policy in the past as far as I can tell [here|https://github.com/openshift/installer/blob/release-4.15/data/data/aws/cluster/main.tf#L133-L148]; this new permission is required because of new functionality. CAPA is placing a policy that denies non-SSL-encrypted traffic to the bucket, which shouldn't have an effect on installs; adding the IAM permission that allows the policy to be attached results in a successful install.
S3 bootstrap bucket policy:
{code:java}
"Statement": [
  {
    "Sid": "ForceSSLOnlyAccess",
    "Principal": {
      "AWS": [
        "*"
      ]
    },
    "Effect": "Deny",
    "Action": [
      "s3:*"
    ],
    "Resource": [
      "arn:aws:s3:::openshift-bootstrap-data-jamesh-sts-2r5f7/*"
    ],
    "Condition": {
      "Bool": {
        "aws:SecureTransport": false
      }
    }
  }
]
},
{code}
This is a clone of issue OCPBUGS-44068. The following is the description of the original issue:
—
Description of problem:
When the user provides an existing VPC, the IBM CAPI will not add ports 443, 5000, and 6443 to the VPC's security group. It is safe to always check for these ports since we only add them if they are missing.
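A minimal Go sketch of the "add only if missing" check described above, assuming the security group can be summarized as a set of already-open ports; the helper is illustrative, not the IBM CAPI code:
~~~
package main

import "fmt"

// requiredPorts are the API/ingress ports the cluster needs in the VPC security group.
var requiredPorts = []int{443, 5000, 6443}

// missingPorts returns only the required ports not already present, so the
// check is safe to run against a user-provided (BYO) VPC security group.
func missingPorts(existing map[int]bool) []int {
	var missing []int
	for _, p := range requiredPorts {
		if !existing[p] {
			missing = append(missing, p)
		}
	}
	return missing
}

func main() {
	existing := map[int]bool{443: true} // e.g. the user's VPC already allows 443
	fmt.Println("ports to add:", missingPorts(existing)) // -> ports to add: [5000 6443]
}
~~~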
Description of the problem:
Trying to create a cluster from the UI fails.
How reproducible:
Steps to reproduce:
1.
2.
3.
Actual results:
Expected results:
This is a clone of issue OCPBUGS-42660. The following is the description of the original issue:
—
There were remaining issues from the original issue. A new bug has been opened to address this. This is a clone of issue OCPBUGS-32947. The following is the description of the original issue:
—
Description of problem:
[vSphere] network.devices, template and workspace will be cleared when deleting the controlplanemachineset, updating these fields will not trigger an update
Version-Release number of selected component (if applicable):
4.16.0-0.nightly-2024-04-23-032717
How reproducible:
Always
Steps to Reproduce:
1.Install a vSphere 4.16 cluster, we use automated template: ipi-on-vsphere/versioned-installer liuhuali@Lius-MacBook-Pro huali-test % oc get clusterversion NAME VERSION AVAILABLE PROGRESSING SINCE STATUS version 4.16.0-0.nightly-2024-04-23-032717 True False 24m Cluster version is 4.16.0-0.nightly-2024-04-23-032717 2.Check the controlplanemachineset, you can see network.devices, template and workspace have value. liuhuali@Lius-MacBook-Pro huali-test % oc get controlplanemachineset NAME DESIRED CURRENT READY UPDATED UNAVAILABLE STATE AGE cluster 3 3 3 3 Active 51m liuhuali@Lius-MacBook-Pro huali-test % oc get controlplanemachineset cluster -oyaml apiVersion: machine.openshift.io/v1 kind: ControlPlaneMachineSet metadata: creationTimestamp: "2024-04-25T02:52:11Z" finalizers: - controlplanemachineset.machine.openshift.io generation: 1 labels: machine.openshift.io/cluster-api-cluster: huliu-vs425c-f5tfl name: cluster namespace: openshift-machine-api resourceVersion: "18273" uid: f340d9b4-cf57-4122-b4d4-0f45f20e4d79 spec: replicas: 3 selector: matchLabels: machine.openshift.io/cluster-api-cluster: huliu-vs425c-f5tfl machine.openshift.io/cluster-api-machine-role: master machine.openshift.io/cluster-api-machine-type: master state: Active strategy: type: RollingUpdate template: machineType: machines_v1beta1_machine_openshift_io machines_v1beta1_machine_openshift_io: failureDomains: platform: VSphere vsphere: - name: generated-failure-domain metadata: labels: machine.openshift.io/cluster-api-cluster: huliu-vs425c-f5tfl machine.openshift.io/cluster-api-machine-role: master machine.openshift.io/cluster-api-machine-type: master spec: lifecycleHooks: {} metadata: {} providerSpec: value: apiVersion: machine.openshift.io/v1beta1 credentialsSecret: name: vsphere-cloud-credentials diskGiB: 120 kind: VSphereMachineProviderSpec memoryMiB: 16384 metadata: creationTimestamp: null network: devices: - networkName: devqe-segment-221 numCPUs: 4 numCoresPerSocket: 4 snapshot: "" template: huliu-vs425c-f5tfl-rhcos-generated-region-generated-zone userDataSecret: name: master-user-data workspace: datacenter: DEVQEdatacenter datastore: /DEVQEdatacenter/datastore/vsanDatastore folder: /DEVQEdatacenter/vm/huliu-vs425c-f5tfl resourcePool: /DEVQEdatacenter/host/DEVQEcluster/Resources server: vcenter.devqe.ibmc.devcluster.openshift.com status: conditions: - lastTransitionTime: "2024-04-25T02:59:37Z" message: "" observedGeneration: 1 reason: AsExpected status: "False" type: Error - lastTransitionTime: "2024-04-25T03:03:45Z" message: "" observedGeneration: 1 reason: AllReplicasAvailable status: "True" type: Available - lastTransitionTime: "2024-04-25T03:03:45Z" message: "" observedGeneration: 1 reason: AsExpected status: "False" type: Degraded - lastTransitionTime: "2024-04-25T03:01:04Z" message: "" observedGeneration: 1 reason: AllReplicasUpdated status: "False" type: Progressing observedGeneration: 1 readyReplicas: 3 replicas: 3 updatedReplicas: 3 3.Delete the controlplanemachineset, it will recreate a new one, but those three fields that had values before are now cleared. 
liuhuali@Lius-MacBook-Pro huali-test % oc delete controlplanemachineset cluster controlplanemachineset.machine.openshift.io "cluster" deleted liuhuali@Lius-MacBook-Pro huali-test % oc get controlplanemachineset NAME DESIRED CURRENT READY UPDATED UNAVAILABLE STATE AGE cluster 3 3 3 3 Inactive 6s liuhuali@Lius-MacBook-Pro huali-test % oc get controlplanemachineset cluster -oyaml apiVersion: machine.openshift.io/v1 kind: ControlPlaneMachineSet metadata: creationTimestamp: "2024-04-25T03:45:51Z" finalizers: - controlplanemachineset.machine.openshift.io generation: 1 name: cluster namespace: openshift-machine-api resourceVersion: "46172" uid: 45d966c9-ec95-42e1-b8b0-c4945ea58566 spec: replicas: 3 selector: matchLabels: machine.openshift.io/cluster-api-cluster: huliu-vs425c-f5tfl machine.openshift.io/cluster-api-machine-role: master machine.openshift.io/cluster-api-machine-type: master state: Inactive strategy: type: RollingUpdate template: machineType: machines_v1beta1_machine_openshift_io machines_v1beta1_machine_openshift_io: failureDomains: platform: VSphere vsphere: - name: generated-failure-domain metadata: labels: machine.openshift.io/cluster-api-cluster: huliu-vs425c-f5tfl machine.openshift.io/cluster-api-machine-role: master machine.openshift.io/cluster-api-machine-type: master spec: lifecycleHooks: {} metadata: {} providerSpec: value: apiVersion: machine.openshift.io/v1beta1 credentialsSecret: name: vsphere-cloud-credentials diskGiB: 120 kind: VSphereMachineProviderSpec memoryMiB: 16384 metadata: creationTimestamp: null network: devices: null numCPUs: 4 numCoresPerSocket: 4 snapshot: "" template: "" userDataSecret: name: master-user-data workspace: {} status: conditions: - lastTransitionTime: "2024-04-25T03:45:51Z" message: "" observedGeneration: 1 reason: AsExpected status: "False" type: Error - lastTransitionTime: "2024-04-25T03:45:51Z" message: "" observedGeneration: 1 reason: AllReplicasAvailable status: "True" type: Available - lastTransitionTime: "2024-04-25T03:45:51Z" message: "" observedGeneration: 1 reason: AsExpected status: "False" type: Degraded - lastTransitionTime: "2024-04-25T03:45:51Z" message: "" observedGeneration: 1 reason: AllReplicasUpdated status: "False" type: Progressing observedGeneration: 1 readyReplicas: 3 replicas: 3 updatedReplicas: 3 4.I active the controlplanemachineset and it does not trigger an update, I continue to add these field values back and it does not trigger an update, I continue to edit these fields to add a second network device and it still does not trigger an update. network: devices: - networkName: devqe-segment-221 - networkName: devqe-segment-222 By the way, I can create worker machines with other network device or two network devices. huliu-vs425c-f5tfl-worker-0a-ldbkh Running 81m huliu-vs425c-f5tfl-worker-0aa-r8q4d Running 70m
Actual results:
network.devices, template and workspace will be cleared when deleting the controlplanemachineset, updating these fields will not trigger an update
Expected results:
The field values should not be changed when deleting the controlplanemachineset, and updating these fields should trigger an update. Alternatively, if these fields are not meant to be modified, then modifying them on the controlplanemachineset should not take effect; such an inconsistency seems confusing.
Additional info:
Must gather: https://drive.google.com/file/d/1mHR31m8gaNohVMSFqYovkkY__t8-E30s/view?usp=sharing
This is a clone of issue OCPBUGS-43048. The following is the description of the original issue:
—
Description of problem:
When deploying 4.16, the customer identified an inbound-rule security risk: the "node" security group allows access from 0.0.0.0/0 to the node port range 30000-32767. This issue did not exist in versions prior to 4.16, and we suspect this may be a regression. It seems to be related to the use of CAPI, which could have changed the behavior. Trying to understand why this was allowed.
Version-Release number of selected component (if applicable):
4.16
How reproducible:
Steps to Reproduce:
1. Install 4.16 cluster
*** On 4.12 installations, this is not the case ***
Actual results:
The installer configures an inbound rule for the node security group allowing access from 0.0.0.0/0 for port range 30000-32767.
Expected results:
The installer should *NOT* create an inbound security rule allowing access to node port range 30000-32767 from any CIDR range (0.0.0.0/0)
Additional info:
#forum-ocp-cloud slack discussion: https://redhat-internal.slack.com/archives/CBZHF4DHC/p1728484197441409
Relevant Code :
https://github.com/kubernetes-sigs/cluster-api-provider-aws/blob/v2.4.0/pkg/cloud/services/securitygroup/securitygroups.go#L551
This is a clone of issue OCPBUGS-42717. The following is the description of the original issue:
—
Description of problem:
When using an internal publishing strategy, the client is not properly initialized and will cause a code path to be hit which tries to access a field of a null pointer.
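For illustration, a minimal Go sketch of the kind of nil guard that would avoid the described crash; the client type and constructor are assumptions, not the installer's real code:
~~~
package main

import (
	"errors"
	"fmt"
)

// apiClient stands in for the client that is only set up for external
// publishing in this hypothetical example.
type apiClient struct {
	endpoint string
}

func newClient(publish string) *apiClient {
	if publish == "Internal" {
		return nil // internal publishing skips client initialization
	}
	return &apiClient{endpoint: "https://example.invalid"}
}

func endpointFor(c *apiClient) (string, error) {
	if c == nil {
		// Guard before touching c.endpoint instead of dereferencing a nil pointer.
		return "", errors.New("client not initialized for internal publishing strategy")
	}
	return c.endpoint, nil
}

func main() {
	c := newClient("Internal")
	if _, err := endpointFor(c); err != nil {
		fmt.Println("error:", err)
	}
}
~~~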
Version-Release number of selected component (if applicable):
4.18.0
How reproducible:
Easily
Steps to Reproduce:
1. Try to deploy a private cluster
2. segfault
3.
Actual results:
Expected results:
Additional info:
This is a clone of issue OCPBUGS-38177. The following is the description of the original issue:
—
Description of problem:
When adding nodes, the agent-register-cluster.service and start-cluster-installation.service statuses should not be checked; in their place, agent-import-cluster.service and agent-add-node.service should be checked.
Version-Release number of selected component (if applicable):
4.17
How reproducible:
always
Steps to Reproduce:
1. 2. 3.
Actual results:
The console message shows that the start-cluster-installation and agent-register-cluster services have not started.
Expected results:
The console message shows that the agent import cluster and add host services have started.
Additional info:
This is a clone of issue OCPBUGS-43417. The following is the description of the original issue:
—
Description of problem:
4.17: [VSphereCSIDriverOperator] [Upgrade] VMwareVSphereControllerDegraded: runtime error: invalid memory address or nil pointer dereference
UPI-installed vSphere cluster upgrade failed due to CSO degradation.
Upgrade path: 4.8 -> 4.17
Version-Release number of selected component (if applicable):
4.17.0-0.nightly-2024-10-12-174022
How reproducible:
Always
Steps to Reproduce:
1. Install the OCP cluster on vSphere by UPI with version 4.8.
2. Upgrade the cluster to 4.17 nightly.
Actual results:
In Step 2: The upgrade failed from path 4.16 to 4.17.
Expected results:
In Step 2: The upgrade should be successful.
Additional info:
$ omc get clusterversion NAME VERSION AVAILABLE PROGRESSING SINCE STATUS version 4.16.0-0.nightly-2024-10-12-102620 True True 1h8m Unable to apply 4.17.0-0.nightly-2024-10-12-174022: wait has exceeded 40 minutes for these operators: storage $ omc get co storage NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE storage 4.17.0-0.nightly-2024-10-12-174022 True True True 15h $ omc get co storage -oyaml ... status: conditions: - lastTransitionTime: "2024-10-13T17:22:06Z" message: |- VSphereCSIDriverOperatorCRDegraded: VMwareVSphereControllerDegraded: panic caught: VSphereCSIDriverOperatorCRDegraded: VMwareVSphereControllerDegraded: runtime error: invalid memory address or nil pointer dereference reason: VSphereCSIDriverOperatorCR_VMwareVSphereController_SyncError status: "True" type: Degraded ... $ omc logs vmware-vsphere-csi-driver-operator-5c7db457-nffp4|tail -n 50 2024-10-13T19:00:02.531545739Z github.com/openshift/library-go/pkg/controller/factory.(*baseController).runWorker(0xc000ab7e00?, {0x3900f30?, 0xc0000b9ae0?}) 2024-10-13T19:00:02.531545739Z github.com/openshift/library-go@v0.0.0-20240904190755-22d0c848b7a2/pkg/controller/factory/base_controller.go:183 +0x4d 2024-10-13T19:00:02.531545739Z github.com/openshift/library-go/pkg/controller/factory.(*baseController).Run.func2() 2024-10-13T19:00:02.531545739Z github.com/openshift/library-go@v0.0.0-20240904190755-22d0c848b7a2/pkg/controller/factory/base_controller.go:117 +0x65 2024-10-13T19:00:02.531545739Z created by github.com/openshift/library-go/pkg/controller/factory.(*baseController).Run in goroutine 500 2024-10-13T19:00:02.531545739Z github.com/openshift/library-go@v0.0.0-20240904190755-22d0c848b7a2/pkg/controller/factory/base_controller.go:112 +0x2c9 2024-10-13T19:00:02.534308382Z I1013 19:00:02.532858 1 event.go:377] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-cluster-csi-drivers", Name:"vmware-vsphere-csi-driver-operator", UID:"e44ce388-4878-4400-afae-744530b62281", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Warning' reason: 'Vmware-Vsphere-Csi-Driver-OperatorPanic' Panic observed: runtime error: invalid memory address or nil pointer dereference 2024-10-13T19:00:03.532125885Z E1013 19:00:03.532044 1 config_yaml.go:208] Unmarshal failed: yaml: unmarshal errors: 2024-10-13T19:00:03.532125885Z line 1: cannot unmarshal !!seq into config.CommonConfigYAML 2024-10-13T19:00:03.532498631Z I1013 19:00:03.532460 1 config.go:272] ReadConfig INI succeeded. INI-based cloud-config is deprecated and will be removed in 2.0. Please use YAML based cloud-config. 
2024-10-13T19:00:03.532708025Z I1013 19:00:03.532571 1 config.go:283] Config initialized 2024-10-13T19:00:03.533270439Z E1013 19:00:03.533160 1 runtime.go:79] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference) 2024-10-13T19:00:03.533270439Z goroutine 701 [running]: 2024-10-13T19:00:03.533270439Z k8s.io/apimachinery/pkg/util/runtime.logPanic({0x2cf3100, 0x54fd210}) 2024-10-13T19:00:03.533270439Z k8s.io/apimachinery@v0.30.3/pkg/util/runtime/runtime.go:75 +0x85 2024-10-13T19:00:03.533270439Z k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0xc0014c54e8, 0x1, 0xc000e7e1c0?}) 2024-10-13T19:00:03.533270439Z k8s.io/apimachinery@v0.30.3/pkg/util/runtime/runtime.go:49 +0x6b 2024-10-13T19:00:03.533270439Z panic({0x2cf3100?, 0x54fd210?}) 2024-10-13T19:00:03.533270439Z runtime/panic.go:770 +0x132 2024-10-13T19:00:03.533270439Z github.com/openshift/vmware-vsphere-csi-driver-operator/pkg/operator/vspherecontroller.(*VSphereController).createVCenterConnection(0xc0008b2788, {0xc0022cf600?, 0xc0014c57c0?}, 0xc0006a3448) 2024-10-13T19:00:03.533270439Z github.com/openshift/vmware-vsphere-csi-driver-operator/pkg/operator/vspherecontroller/vspherecontroller.go:491 +0x94 2024-10-13T19:00:03.533270439Z github.com/openshift/vmware-vsphere-csi-driver-operator/pkg/operator/vspherecontroller.(*VSphereController).loginToVCenter(0xc0008b2788, {0x3900f30, 0xc0000b9ae0}, 0x3377a7c?) 2024-10-13T19:00:03.533270439Z github.com/openshift/vmware-vsphere-csi-driver-operator/pkg/operator/vspherecontroller/vspherecontroller.go:446 +0x5e 2024-10-13T19:00:03.533270439Z github.com/openshift/vmware-vsphere-csi-driver-operator/pkg/operator/vspherecontroller.(*VSphereController).sync(0xc0008b2788, {0x3900f30, 0xc0000b9ae0}, {0x38ee700, 0xc0011d08d0}) 2024-10-13T19:00:03.533270439Z github.com/openshift/vmware-vsphere-csi-driver-operator/pkg/operator/vspherecontroller/vspherecontroller.go:240 +0x6fc 2024-10-13T19:00:03.533270439Z github.com/openshift/library-go/pkg/controller/factory.(*baseController).reconcile(0xc000b3ecf0, {0x3900f30, 0xc0000b9ae0}, {0x38ee700?, 0xc0011d08d0?}) 2024-10-13T19:00:03.533270439Z github.com/openshift/library-go@v0.0.0-20240904190755-22d0c848b7a2/pkg/controller/factory/base_controller.go:201 +0x43 2024-10-13T19:00:03.533270439Z github.com/openshift/library-go/pkg/controller/factory.(*baseController).processNextWorkItem(0xc000b3ecf0, {0x3900f30, 0xc0000b9ae0}) 2024-10-13T19:00:03.533270439Z github.com/openshift/library-go@v0.0.0-20240904190755-22d0c848b7a2/pkg/controller/factory/base_controller.go:260 +0x1ae 2024-10-13T19:00:03.533270439Z github.com/openshift/library-go/pkg/controller/factory.(*baseController).runWorker.func1({0x3900f30, 0xc0000b9ae0}) 2024-10-13T19:00:03.533270439Z github.com/openshift/library-go@v0.0.0-20240904190755-22d0c848b7a2/pkg/controller/factory/base_controller.go:192 +0x89 2024-10-13T19:00:03.533270439Z k8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext.func1() 2024-10-13T19:00:03.533270439Z k8s.io/apimachinery@v0.30.3/pkg/util/wait/backoff.go:259 +0x1f 2024-10-13T19:00:03.533270439Z k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0xc002bb1e80?) 
2024-10-13T19:00:03.533270439Z k8s.io/apimachinery@v0.30.3/pkg/util/wait/backoff.go:226 +0x33 2024-10-13T19:00:03.533270439Z k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc0014c5f10, {0x38cf7e0, 0xc00142b470}, 0x1, 0xc0013ae960) 2024-10-13T19:00:03.533270439Z k8s.io/apimachinery@v0.30.3/pkg/util/wait/backoff.go:227 +0xaf 2024-10-13T19:00:03.533270439Z k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc00115bf10, 0x3b9aca00, 0x0, 0x1, 0xc0013ae960) 2024-10-13T19:00:03.533270439Z k8s.io/apimachinery@v0.30.3/pkg/util/wait/backoff.go:204 +0x7f 2024-10-13T19:00:03.533270439Z k8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext({0x3900f30, 0xc0000b9ae0}, 0xc00115bf70, 0x3b9aca00, 0x0, 0x1) 2024-10-13T19:00:03.533270439Z k8s.io/apimachinery@v0.30.3/pkg/util/wait/backoff.go:259 +0x93 2024-10-13T19:00:03.533270439Z k8s.io/apimachinery/pkg/util/wait.UntilWithContext(...) 2024-10-13T19:00:03.533270439Z k8s.io/apimachinery@v0.30.3/pkg/util/wait/backoff.go:170 2024-10-13T19:00:03.533270439Z github.com/openshift/library-go/pkg/controller/factory.(*baseController).runWorker(0xc000ab7e00?, {0x3900f30?, 0xc0000b9ae0?}) 2024-10-13T19:00:03.533270439Z github.com/openshift/library-go@v0.0.0-20240904190755-22d0c848b7a2/pkg/controller/factory/base_controller.go:183 +0x4d 2024-10-13T19:00:03.533270439Z github.com/openshift/library-go/pkg/controller/factory.(*baseController).Run.func2() 2024-10-13T19:00:03.533270439Z github.com/openshift/library-go@v0.0.0-20240904190755-22d0c848b7a2/pkg/controller/factory/base_controller.go:117 +0x65 2024-10-13T19:00:03.533270439Z created by github.com/openshift/library-go/pkg/controller/factory.(*baseController).Run in goroutine 500 2024-10-13T19:00:03.533270439Z github.com/openshift/library-go@v0.0.0-20240904190755-22d0c848b7a2/pkg/controller/factory/base_controller.go:112 +0x2c9
Description of problem:
Long cluster names are trimmed by the installer. Warn the user before this happens because if the user intended to distinguish these based on some suffix at the end of a long name, the suffix will get chopped off. If some resources are created on the basis of cluster name alone (rare), there could even be conflicts.
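A minimal Go sketch of the requested warning, assuming the 27-character limit mentioned above; the helper and limit constant are illustrative, not the installer's actual trimming code:
~~~
package main

import (
	"fmt"
	"log"
)

const maxClusterNameLen = 27 // current effective limit per the description above

// trimClusterName warns when a name will be shortened, so a distinguishing
// suffix at the end of a long name is not silently dropped.
func trimClusterName(name string) string {
	if len(name) <= maxClusterNameLen {
		return name
	}
	trimmed := name[:maxClusterNameLen]
	log.Printf("warning: cluster name %q exceeds %d characters and will be trimmed to %q; "+
		"resources derived from the name may collide", name, maxClusterNameLen, trimmed)
	return trimmed
}

func main() {
	fmt.Println(trimClusterName("my-very-long-cluster-name-with-suffix-01"))
}
~~~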
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. Use a "long cluster name" (at the moment > 27 characters) 2. Deploy a cluster 3. Look at the names of resources, the name will have been trimmed.
Actual results:
Cluster resources with trimmed names are created.
Expected results:
The same as Actual results, but a warning should be shown.
Additional info:
Please review the following PR: https://github.com/openshift/ovn-kubernetes/pull/2176
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/cloud-network-config-controller/pull/148
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
Machine stuck in Provisioned when the cluster is upgraded from 4.1 to 4.15
Version-Release number of selected component (if applicable):
Upgrade from 4.1 to 4.15 4.1.41-x86_64, 4.2.36-x86_64, 4.3.40-x86_64, 4.4.33-x86_64, 4.5.41-x86_64, 4.6.62-x86_64, 4.7.60-x86_64, 4.8.57-x86_64, 4.9.59-x86_64, 4.10.67-x86_64, 4.11 nightly, 4.12 nightly, 4.13 nightly, 4.14 nightly, 4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest
How reproducible:
Seems to always reproduce; the issue was found in our Prow CI, and I also reproduced it.
Steps to Reproduce:
1.Create an aws IPI 4.1 cluster, then upgrade it one by one to 4.14 liuhuali@Lius-MacBook-Pro huali-test % oc get clusterversion NAME VERSION AVAILABLE PROGRESSING SINCE STATUS version 4.11.0-0.nightly-2024-01-19-110702 True True 26m Working towards 4.12.0-0.nightly-2024-02-04-062856: 654 of 830 done (78% complete), waiting on authentication, openshift-apiserver, openshift-controller-manager liuhuali@Lius-MacBook-Pro huali-test % oc get clusterversion NAME VERSION AVAILABLE PROGRESSING SINCE STATUS version 4.12.0-0.nightly-2024-02-04-062856 True False 5m12s Cluster version is 4.12.0-0.nightly-2024-02-04-062856 liuhuali@Lius-MacBook-Pro huali-test % oc get clusterversion NAME VERSION AVAILABLE PROGRESSING SINCE STATUS version 4.12.0-0.nightly-2024-02-04-062856 True True 61m Working towards 4.13.0-0.nightly-2024-02-04-042638: 713 of 841 done (84% complete), waiting up to 40 minutes on machine-config liuhuali@Lius-MacBook-Pro huali-test % oc get clusterversion NAME VERSION AVAILABLE PROGRESSING SINCE STATUS version 4.13.0-0.nightly-2024-02-04-042638 True False 10m Cluster version is 4.13.0-0.nightly-2024-02-04-042638 liuhuali@Lius-MacBook-Pro huali-test % oc get clusterversion NAME VERSION AVAILABLE PROGRESSING SINCE STATUS version 4.13.0-0.nightly-2024-02-04-042638 True True 17m Working towards 4.14.0-0.nightly-2024-02-02-173828: 233 of 860 done (27% complete), waiting on control-plane-machine-set, machine-api liuhuali@Lius-MacBook-Pro huali-test % oc get clusterversion NAME VERSION AVAILABLE PROGRESSING SINCE STATUS version 4.14.0-0.nightly-2024-02-02-173828 True False 18m Cluster version is 4.14.0-0.nightly-2024-02-02-173828 2.When it upgrade to 4.14, check the machine scale successfully liuhuali@Lius-MacBook-Pro huali-test % oc create -f ms1.yaml machineset.machine.openshift.io/ci-op-trzci0vq-8a8c4-dq95h-worker-us-east-1aa created liuhuali@Lius-MacBook-Pro huali-test % oc get machineset NAME DESIRED CURRENT READY AVAILABLE AGE ci-op-trzci0vq-8a8c4-dq95h-worker-us-east-1a 1 1 1 1 14h ci-op-trzci0vq-8a8c4-dq95h-worker-us-east-1aa 0 0 3s ci-op-trzci0vq-8a8c4-dq95h-worker-us-east-1f 2 2 2 2 14h liuhuali@Lius-MacBook-Pro huali-test % oc scale machineset ci-op-trzci0vq-8a8c4-dq95h-worker-us-east-1aa --replicas=1 machineset.machine.openshift.io/ci-op-trzci0vq-8a8c4-dq95h-worker-us-east-1aa scaled liuhuali@Lius-MacBook-Pro huali-test % oc get machine NAME PHASE TYPE REGION ZONE AGE ci-op-trzci0vq-8a8c4-dq95h-master-0 Running m6a.xlarge us-east-1 us-east-1f 15h ci-op-trzci0vq-8a8c4-dq95h-master-1 Running m6a.xlarge us-east-1 us-east-1a 15h ci-op-trzci0vq-8a8c4-dq95h-master-2 Running m6a.xlarge us-east-1 us-east-1f 15h ci-op-trzci0vq-8a8c4-dq95h-worker-us-east-1a-pqnqt Running m6a.xlarge us-east-1 us-east-1a 15h ci-op-trzci0vq-8a8c4-dq95h-worker-us-east-1aa-mt9kh Running m6a.xlarge us-east-1 us-east-1a 15m ci-op-trzci0vq-8a8c4-dq95h-worker-us-east-1f-h2f9k Running m6a.xlarge us-east-1 us-east-1f 15h ci-op-trzci0vq-8a8c4-dq95h-worker-us-east-1f-lgmjb Running m6a.xlarge us-east-1 us-east-1f 15h liuhuali@Lius-MacBook-Pro huali-test % oc get node NAME STATUS ROLES AGE VERSION ip-10-0-128-51.ec2.internal Ready master 15h v1.27.10+28ed2d7 ip-10-0-143-198.ec2.internal Ready worker 14h v1.27.10+28ed2d7 ip-10-0-143-64.ec2.internal Ready worker 14h v1.27.10+28ed2d7 ip-10-0-143-80.ec2.internal Ready master 15h v1.27.10+28ed2d7 ip-10-0-144-123.ec2.internal Ready master 15h v1.27.10+28ed2d7 ip-10-0-147-94.ec2.internal Ready worker 14h v1.27.10+28ed2d7 ip-10-0-158-61.ec2.internal Ready worker 3m40s 
v1.27.10+28ed2d7 liuhuali@Lius-MacBook-Pro huali-test % oc scale machineset ci-op-trzci0vq-8a8c4-dq95h-worker-us-east-1aa --replicas=0 machineset.machine.openshift.io/ci-op-trzci0vq-8a8c4-dq95h-worker-us-east-1aa scaled liuhuali@Lius-MacBook-Pro huali-test % oc get node NAME STATUS ROLES AGE VERSION ip-10-0-128-51.ec2.internal Ready master 15h v1.27.10+28ed2d7 ip-10-0-143-198.ec2.internal Ready worker 15h v1.27.10+28ed2d7 ip-10-0-143-64.ec2.internal Ready worker 15h v1.27.10+28ed2d7 ip-10-0-143-80.ec2.internal Ready master 15h v1.27.10+28ed2d7 ip-10-0-144-123.ec2.internal Ready master 15h v1.27.10+28ed2d7 ip-10-0-147-94.ec2.internal Ready worker 15h v1.27.10+28ed2d7 liuhuali@Lius-MacBook-Pro huali-test % oc get machine NAME PHASE TYPE REGION ZONE AGE ci-op-trzci0vq-8a8c4-dq95h-master-0 Running m6a.xlarge us-east-1 us-east-1f 15h ci-op-trzci0vq-8a8c4-dq95h-master-1 Running m6a.xlarge us-east-1 us-east-1a 15h ci-op-trzci0vq-8a8c4-dq95h-master-2 Running m6a.xlarge us-east-1 us-east-1f 15h ci-op-trzci0vq-8a8c4-dq95h-worker-us-east-1a-pqnqt Running m6a.xlarge us-east-1 us-east-1a 15h ci-op-trzci0vq-8a8c4-dq95h-worker-us-east-1f-h2f9k Running m6a.xlarge us-east-1 us-east-1f 15h ci-op-trzci0vq-8a8c4-dq95h-worker-us-east-1f-lgmjb Running m6a.xlarge us-east-1 us-east-1f 15h liuhuali@Lius-MacBook-Pro huali-test % oc delete machineset ci-op-trzci0vq-8a8c4-dq95h-worker-us-east-1aa machineset.machine.openshift.io "ci-op-trzci0vq-8a8c4-dq95h-worker-us-east-1aa" deleted liuhuali@Lius-MacBook-Pro huali-test % oc get clusterversion NAME VERSION AVAILABLE PROGRESSING SINCE STATUS version 4.14.0-0.nightly-2024-02-02-173828 True False 43m Cluster version is 4.14.0-0.nightly-2024-02-02-173828 3.Upgrade to 4.15 As upgrade to 4.15 nightly stuck on operator-lifecycle-manager-packageserver which is a bug https://issues.redhat.com/browse/OCPBUGS-28744 so I build image with the fix pr (job build openshift/operator-framework-olm#679 succeeded) and upgrade to the image, upgrade successfully liuhuali@Lius-MacBook-Pro huali-test % oc get clusterversion NAME VERSION AVAILABLE PROGRESSING SINCE STATUS version 4.14.0-0.nightly-2024-02-02-173828 True True 7s Working towards 4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest: 10 of 875 done (1% complete) liuhuali@Lius-MacBook-Pro huali-test % oc get clusterversion NAME VERSION AVAILABLE PROGRESSING SINCE STATUS version 4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest True False 23m Cluster version is 4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest liuhuali@Lius-MacBook-Pro huali-test % oc get co NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE authentication 4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest True False False 9h baremetal 4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest True False False 11h cloud-controller-manager 4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest True False False 8h cloud-credential 4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest True False False 16h cluster-autoscaler 4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest True False False 16h config-operator 4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest True False False 13h console 4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest True False False 3h19m control-plane-machine-set 4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest True False False 5h csi-snapshot-controller 4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest True False False 7h10m dns 4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest True 
False False 9h etcd 4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest True False False 14h image-registry 4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest True False False 33m ingress 4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest True False False 9h insights 4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest True False False 16h kube-apiserver 4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest True False False 14h kube-controller-manager 4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest True False False 14h kube-scheduler 4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest True False False 14h kube-storage-version-migrator 4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest True False False 34m machine-api 4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest True False False 16h machine-approver 4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest True False False 13h machine-config 4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest True False False 10h marketplace 4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest True False False 10h monitoring 4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest True False False 9h network 4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest True False False 16h node-tuning 4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest True False False 56m openshift-apiserver 4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest True False False 9h openshift-controller-manager 4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest True False False 4h56m openshift-samples 4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest True False False 58m operator-lifecycle-manager 4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest True False False 16h operator-lifecycle-manager-catalog 4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest True False False 16h operator-lifecycle-manager-packageserver 4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest True False False 57m service-ca 4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest True False False 16h storage 4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest True False False 9h liuhuali@Lius-MacBook-Pro huali-test % oc get machine NAME PHASE TYPE REGION ZONE AGE ci-op-trzci0vq-8a8c4-dq95h-master-0 Running m6a.xlarge us-east-1 us-east-1f 16h ci-op-trzci0vq-8a8c4-dq95h-master-1 Running m6a.xlarge us-east-1 us-east-1a 16h ci-op-trzci0vq-8a8c4-dq95h-master-2 Running m6a.xlarge us-east-1 us-east-1f 16h ci-op-trzci0vq-8a8c4-dq95h-worker-us-east-1a-pqnqt Running m6a.xlarge us-east-1 us-east-1a 16h ci-op-trzci0vq-8a8c4-dq95h-worker-us-east-1f-h2f9k Running m6a.xlarge us-east-1 us-east-1f 16h ci-op-trzci0vq-8a8c4-dq95h-worker-us-east-1f-lgmjb Running m6a.xlarge us-east-1 us-east-1f 16h 4.Check machine scale stuck in Provisioned, no csr pending liuhuali@Lius-MacBook-Pro huali-test % oc create -f ms1.yaml machineset.machine.openshift.io/ci-op-trzci0vq-8a8c4-dq95h-worker-us-east-1a1 created liuhuali@Lius-MacBook-Pro huali-test % oc get machineset NAME DESIRED CURRENT READY AVAILABLE AGE ci-op-trzci0vq-8a8c4-dq95h-worker-us-east-1a 1 1 1 1 16h ci-op-trzci0vq-8a8c4-dq95h-worker-us-east-1a1 0 0 6s ci-op-trzci0vq-8a8c4-dq95h-worker-us-east-1f 2 2 2 2 16h liuhuali@Lius-MacBook-Pro huali-test % oc scale machineset ci-op-trzci0vq-8a8c4-dq95h-worker-us-east-1a1 --replicas=1 machineset.machine.openshift.io/ci-op-trzci0vq-8a8c4-dq95h-worker-us-east-1a1 scaled liuhuali@Lius-MacBook-Pro huali-test % oc get machine NAME PHASE TYPE REGION ZONE AGE 
ci-op-trzci0vq-8a8c4-dq95h-master-0 Running m6a.xlarge us-east-1 us-east-1f 16h ci-op-trzci0vq-8a8c4-dq95h-master-1 Running m6a.xlarge us-east-1 us-east-1a 16h ci-op-trzci0vq-8a8c4-dq95h-master-2 Running m6a.xlarge us-east-1 us-east-1f 16h ci-op-trzci0vq-8a8c4-dq95h-worker-us-east-1a-pqnqt Running m6a.xlarge us-east-1 us-east-1a 16h ci-op-trzci0vq-8a8c4-dq95h-worker-us-east-1a1-5g877 Provisioning m6a.xlarge us-east-1 us-east-1a 4s ci-op-trzci0vq-8a8c4-dq95h-worker-us-east-1f-h2f9k Running m6a.xlarge us-east-1 us-east-1f 16h ci-op-trzci0vq-8a8c4-dq95h-worker-us-east-1f-lgmjb Running m6a.xlarge us-east-1 us-east-1f 16h liuhuali@Lius-MacBook-Pro huali-test % oc get machine NAME PHASE TYPE REGION ZONE AGE ci-op-trzci0vq-8a8c4-dq95h-master-0 Running m6a.xlarge us-east-1 us-east-1f 18h ci-op-trzci0vq-8a8c4-dq95h-master-1 Running m6a.xlarge us-east-1 us-east-1a 18h ci-op-trzci0vq-8a8c4-dq95h-master-2 Running m6a.xlarge us-east-1 us-east-1f 18h ci-op-trzci0vq-8a8c4-dq95h-worker-us-east-1a-pqnqt Running m6a.xlarge us-east-1 us-east-1a 18h ci-op-trzci0vq-8a8c4-dq95h-worker-us-east-1a1-5g877 Provisioned m6a.xlarge us-east-1 us-east-1a 97m ci-op-trzci0vq-8a8c4-dq95h-worker-us-east-1f-h2f9k Running m6a.xlarge us-east-1 us-east-1f 18h ci-op-trzci0vq-8a8c4-dq95h-worker-us-east-1f-lgmjb Running m6a.xlarge us-east-1 us-east-1f 18h ci-op-trzci0vq-8a8c4-dq95h-worker-us-east-1f1-4ln47 Provisioned m6a.xlarge us-east-1 us-east-1f 50m liuhuali@Lius-MacBook-Pro huali-test % oc get node NAME STATUS ROLES AGE VERSION ip-10-0-128-51.ec2.internal Ready master 18h v1.28.6+a373c1b ip-10-0-143-198.ec2.internal Ready worker 18h v1.28.6+a373c1b ip-10-0-143-64.ec2.internal Ready worker 18h v1.28.6+a373c1b ip-10-0-143-80.ec2.internal Ready master 18h v1.28.6+a373c1b ip-10-0-144-123.ec2.internal Ready master 18h v1.28.6+a373c1b ip-10-0-147-94.ec2.internal Ready worker 18h v1.28.6+a373c1b liuhuali@Lius-MacBook-Pro huali-test % oc get csr NAME AGE SIGNERNAME REQUESTOR REQUESTEDDURATION CONDITION csr-596n7 21m kubernetes.io/kube-apiserver-client-kubelet system:node:ip-10-0-147-94.ec2.internal <none> Approved,Issued csr-7nr9m 42m kubernetes.io/kubelet-serving system:node:ip-10-0-147-94.ec2.internal <none> Approved,Issued csr-bc9n7 16m kubernetes.io/kube-apiserver-client-kubelet system:node:ip-10-0-128-51.ec2.internal <none> Approved,Issued csr-dmk27 18m kubernetes.io/kubelet-serving system:node:ip-10-0-128-51.ec2.internal <none> Approved,Issued csr-ggkgd 64m kubernetes.io/kube-apiserver-client-kubelet system:node:ip-10-0-143-198.ec2.internal <none> Approved,Issued csr-rs9cz 70m kubernetes.io/kubelet-serving system:node:ip-10-0-143-80.ec2.internal <none> Approved,Issued liuhuali@Lius-MacBook-Pro huali-test %
Actual results:
Machine stuck in Provisioned
Expected results:
Machine should reach the Running phase
Additional info:
Must gather: https://drive.google.com/file/d/1TrZ_mb-cHKmrNMsuFl9qTdYo_eNPuF_l/view?usp=sharing I can see the provisioned machine on AWS console: https://drive.google.com/file/d/1-OcsmvfzU4JBeGh5cil8P2Hoe5DQsmqF/view?usp=sharing System log of ci-op-trzci0vq-8a8c4-dq95h-worker-us-east-1a1-5g877: https://drive.google.com/file/d/1spVT_o0S4eqeQxE5ivttbAazCCuSzj1e/view?usp=sharing Some log on the instance: https://drive.google.com/file/d/1zjxPxm61h4L6WVHYv-w7nRsSz5Fku26w/view?usp=sharing
Please review the following PR: https://github.com/openshift/cluster-config-operator/pull/419
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/ironic-image/pull/501
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
The installer should query the CAPI provider for the timeouts needed during provisioning (this is optional for providers to support). The current default of 15 minutes is sufficient for normal CAPI installations. However, given how the current PowerVS CAPI provider waits for some resources to be created before creating the load balancers, it is possible that the load balancers will not be created before the 15-minute timeout expires. An issue was created to track this [1]. [1] kubernetes-sigs/cluster-api-provider-ibmcloud#1837
This is a clone of issue OCPBUGS-44049. The following is the description of the original issue:
—
Description of problem:
When the MachineConfigs tab is opened on the console, the below error is displayed: Oh no! Something went wrong. TypeError. Description: Cannot read properties of undefined (reading 'toString')
Version-Release number of selected component (if applicable):
OCP version 4.17.3
How reproducible:
Every time at the customer's end.
Steps to Reproduce:
1. Go to the console. 2. Under the Compute section, go to the MachineConfigs tab.
Actual results:
Oh no! Something went wrong
Expected results:
Should be able to see all the available MachineConfigs.
Additional info:
Description of problem:
Possibly the same error as OCPBUGS-37232. Log in to the admin console as an admin user, go to "Observe - Alerting", and check an alert's details (for example the Watchdog alert): "S is not a function" appears in the graph, see picture: https://drive.google.com/file/d/1FxHz0yk1w_8Np3Whm-qAhBTSt3VXGG8j/view?usp=drive_link. The same error appears on "Observe - Metrics": querying any metric shows "S is not a function" in the graph. There is no such error in the dev console.
Version-Release number of selected component (if applicable):
4.17.0-0.nightly-2024-07-17-183402
How reproducible:
always
Steps to Reproduce:
1. See the description.
Actual results:
"S is not a function" in the admin console graph
Expected results:
no error
Additional info:
Description of problem:
oc-mirror crane export fails with latest docker registry/2 on s390x
Version-Release number of selected component (if applicable):
How reproducible:
Everytime
Steps to Reproduce:
1. git clone https://github.com/openshift/oc-mirror/
2. cd oc-mirror
3. mkdir -p bin
4. curl -o bin/oc-mirror.tar.gz https://mirror.openshift.com/pub/openshift-v4/s390x/clients/ocp/4.16.0-rc.2/oc-mirror.tar.gz
5. cd bin
6. tar xvf oc-mirror.tar.gz oc-mirror
7. chmod +x oc-mirror
8. cd ..
9. podman build -f Dockerfile -t local/go-toolset:latest
10. podman run -it -v $(pwd):/build:z --env ENV_CATALOGORG="powercloud" --env ENV_CATALOGNAMESPACE="powercloud/oc-mirror-dev-s390x" --env ENV_CATALOG_ID="17282f4c" --env ENV_OCI_REGISTRY_NAMESPACE="powercloud" --entrypoint /bin/bash local/go-toolset:latest ./test/e2e/e2e-simple.sh bin/oc-mirror 2>&1 | tee ../out.log
Actual results:
/build/test/e2e/operator-test.18664 /build % Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed 0 0 0 0 0 0 0 0 ---- ---- ---- 0 100 52.1M 100 52.1M 0 0 779k 0 0:01:08 0:01:08 ---- 301k go: downloading github.com/google/go-containerregistry v0.19.1 go: downloading github.com/docker/cli v24.0.0+incompatible go: downloading github.com/spf13/cobra v1.7.0 go: downloading github.com/opencontainers/image-spec v1.1.0-rc3 go: downloading github.com/mitchellh/go-homedir v1.1.0 go: downloading golang.org/x/sync v0.2.0 go: downloading github.com/opencontainers/go-digest v1.0.0 go: downloading github.com/docker/distribution v2.8.2+incompatible go: downloading github.com/containerd/stargz-snapshotter/estargz v0.14.3 go: downloading github.com/google/go-cmp v0.5.9 go: downloading github.com/klauspost/compress v1.16.5 go: downloading github.com/spf13/pflag v1.0.5 go: downloading github.com/vbatts/tar-split v0.11.3 go: downloading github.com/pkg/errors v0.9.1 go: downloading github.com/docker/docker v24.0.0+incompatible go: downloading golang.org/x/sys v0.15.0 go: downloading github.com/sirupsen/logrus v1.9.1 go: downloading github.com/docker/docker-credential-helpers v0.7.0 Error: pulling Image s390x/registry:2: no child with platform linux/amd64 in index s390x/registry:2 /build/test/e2e/lib/util.sh: line 17: PID_DISCONN: unbound variable
Expected results:
Should not give any error
Additional info:
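As a diagnostic sketch, a quick way to see which platforms the image index named in the error actually provides (assuming skopeo and jq are available on the build host):

skopeo inspect --raw docker://docker.io/s390x/registry:2 | jq '.manifests[].platform'

If the output only lists linux/s390x, any attempt to extract a linux/amd64 child from that index will fail with the error shown above.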
This is a clone of issue OCPBUGS-38620. The following is the description of the original issue:
—
Our e2e jobs fail with:
pods/aws-efs-csi-driver-controller-66f7d8bcf5-zf8vr initContainers[init-aws-credentials-file] must have terminationMessagePolicy="FallbackToLogsOnError" pods/aws-efs-csi-driver-node-7qj9p containers[csi-driver] must have terminationMessagePolicy="FallbackToLogsOnError" pods/aws-efs-csi-driver-operator-fcc56998b-2d5x6 containers[aws-efs-csi-driver-operator] must have terminationMessagePolicy="FallbackToLogsOnError"
The jobs should succeed.
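A quick way to list the offending containers, as a sketch; the namespace is an assumption based on where the AWS EFS CSI driver pods run, and jq is required:

oc -n openshift-cluster-csi-drivers get pods -o json \
  | jq -r '.items[] | .metadata.name as $pod
      | ((.spec.containers // []) + (.spec.initContainers // []))[]
      | select(.terminationMessagePolicy != "FallbackToLogsOnError")
      | "\($pod)/\(.name)"'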
This is a clone of issue OCPBUGS-43674. The following is the description of the original issue:
—
Description of problem:
The assisted service is throwing an error message stating that the Cloud Controller Manager (CCM) is not enabled, even though the CCM value is correctly set in the install-config file.
Version-Release number of selected component (if applicable):
4.18.0-0.nightly-2024-10-19-045205
How reproducible:
Always
Steps to Reproduce:
1. Prepare install-config and agent-config for the external OCI platform. Example of the install-config configuration:
.......
.......
platform:
  external:
    platformName: oci
    cloudControllerManager: External
.......
.......
2. Create the agent ISO for the external OCI platform
3. Boot up nodes using the created agent ISO
Actual results:
Oct 21 16:40:47 agent-sno.private.agenttest.oraclevcn.com service[2829]: time="2024-10-21T16:40:47Z" level=info msg="Register cluster: agenttest with id 2666753a-0485-420b-b968-e8732da6898c and params {\"api_vips\":[],\"base_dns_domain\":\"abitest.oci-rhelcert.edge-sro.rhecoeng.com\",\"cluster_networks\":[{\"cidr\":\"10.128.0.0/14\",\"host_prefix\":23}],\"cpu_architecture\":\"x86_64\",\"high_availability_mode\":\"None\",\"ingress_vips\":[],\"machine_networks\":[{\"cidr\":\"10.0.0.0/20\"}],\"name\":\"agenttest\",\"network_type\":\"OVNKubernetes\",\"olm_operators\":null,\"openshift_version\":\"4.18.0-0.nightly-2024-10-19-045205\",\"platform\":{\"external\":{\"cloud_controller_manager\":\"\",\"platform_name\":\"oci\"},\"type\":\"external\"},\"pull_secret\":\"***\",\"schedulable_masters\":false,\"service_networks\":[{\"cidr\":\"172.30.0.0/16\"}],\"ssh_public_key\":\"ssh-rsa XXXXXXXXXXXX\",\"user_managed_networking\":true,\"vip_dhcp_allocation\":false}" func="github.com/openshift/assisted-service/internal/bminventory.(*bareMetalInventory).RegisterClusterInternal" file="/src/internal/bminventory/inventory.go:515" cluster_id=2666753a-0485-420b-b968-e8732da6898c go-id=2110 pkg=Inventory request_id=82e83b31-1c1b-4dea-b435-f7316a1965e
Expected results:
The cluster installation should be successful.
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
The story is to track i18n upload/download routine tasks which are perform every sprint.
A.C.
- Upload strings to Memosource at the start of the sprint and reach out to localization team
- Download translated strings from Memsource when it is ready
- Review the translated strings and open a pull request
- Open a followup story for next sprint
Description of problem:
Setting a non-existent network ID in install-config as a control plane additionalNetworkID makes CAPO panic with a nil pointer dereference, and the installer does not give a more explicit error. The installer should run a pre-flight check on the network ID, and CAPO should not panic.
Version-Release number of selected component (if applicable):
How reproducible:
install-config:
apiVersion: v1
controlPlane:
  name: master
  platform:
    openstack:
      type: ${CONTROL_PLANE_FLAVOR}
      additionalNetworkIDs: [43e553c2-9d45-4fdc-b29e-233231faf46e]
Steps to Reproduce:
1. Add a non-existent network ID in controlPlane.platform.openstack.additionalNetworkIDs 2. openshift-install create cluster 3. Observe the panic
Actual results:
DEBUG I0613 15:32:14.683137 314433 machine_controller_noderef.go:60] "Waiting for infrastructure provider to report spec.providerID" controller="machine" controllerGroup="cluster.x-k8s.io" controllerKind="Machine" Machine="openshift-cluster-api-guests/ocp1-f4dwz-bootstrap" namespace="openshift-cluster-api-guests" name="ocp1-f4dwz-bootstrap" reconcileID="f89c5c84-4832-44ae-b522-bdfc8e1b0fdf" Cluster="openshift-cluster-api-guests/ocp1-f4dwz" Cluster="openshift-cluster-api-guests/ocp1-f4dwz" OpenStackMachine="openshift-cluster-api-guests/ocp1-f4dwz-bootstrap" DEBUG panic: runtime error: invalid memory address or nil pointer dereference [recovered] DEBUG panic: runtime error: invalid memory address or nil pointer dereference DEBUG [signal SIGSEGV: segmentation violation code=0x1 addr=0x20 pc=0x1b737b5] DEBUG DEBUG goroutine 326 [running]: DEBUG sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile.func1() DEBUG /var/home/pierre/code/src/github.com/openshift/installer.git/master/cluster-api/providers/openstack/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:116 +0x1e5 DEBUG panic({0x1db4540?, 0x367bd90?}) DEBUG /var/home/pierre/sdk/go1.22.3/src/runtime/panic.go:770 +0x132 DEBUG sigs.k8s.io/cluster-api-provider-openstack/pkg/cloud/services/networking.(*Service).CreatePort(0xc0003c71a0, {0x24172d0, 0xc000942008}, 0xc000a4e688) DEBUG /var/home/pierre/code/src/github.com/openshift/installer.git/master/cluster-api/providers/openstack/vendor/sigs.k8s.io/cluster-api-provider-openstack/pkg/cloud/services/networking/port.go:195 +0xd55 DEBUG sigs.k8s.io/cluster-api-provider-openstack/pkg/cloud/services/networking.(*Service).CreatePorts(0xc0003c71a0, {0x24172d0, 0xc000942008}, {0xc000a4e5a0, 0x2, 0x1b9b265?}, 0xc0008595f0) DEBUG /var/home/pierre/code/src/github.com/openshift/installer.git/master/cluster-api/providers/openstack/vendor/sigs.k8s.io/cluster-api-provider-openstack/pkg/cloud/services/networking/port.go:336 +0x66 DEBUG sigs.k8s.io/cluster-api-provider-openstack/controllers.getOrCreateMachinePorts(0xc000c53d10?, 0x242ebd8?) 
DEBUG /var/home/pierre/code/src/github.com/openshift/installer.git/master/cluster-api/providers/openstack/vendor/sigs.k8s.io/cluster-api-provider-openstack/controllers/openstackmachine_controller.go:759 +0x59 DEBUG sigs.k8s.io/cluster-api-provider-openstack/controllers.(*OpenStackMachineReconciler).reconcileNormal(0xc00052e480, {0x242ebd8, 0xc000af5c50}, 0xc000c53d10, {0xc000f27c50, 0x27}, 0xc000943908, 0xc000943188, 0xc000942008) DEBUG /var/home/pierre/code/src/github.com/openshift/installer.git/master/cluster-api/providers/openstack/vendor/sigs.k8s.io/cluster-api-provider-openstack/controllers/openstackmachine_controller.go:602 +0x307 DEBUG sigs.k8s.io/cluster-api-provider-openstack/controllers.(*OpenStackMachineReconciler).Reconcile(0xc00052e480, {0x242ebd8, 0xc000af5c50}, {{{0xc00064b280?, 0x0?}, {0xc000f3ecd8?, 0xc00076bd50?}}}) DEBUG /var/home/pierre/code/src/github.com/openshift/installer.git/master/cluster-api/providers/openstack/vendor/sigs.k8s.io/cluster-api-provider-openstack/controllers/openstackmachine_controller.go:162 +0xb6d DEBUG sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile(0x2434e10?, {0x242ebd8?, 0xc000af5c50?}, {{{0xc00064b280?, 0xb?}, {0xc000f3ecd8?, 0x0?}}}) DEBUG /var/home/pierre/code/src/github.com/openshift/installer.git/master/cluster-api/providers/openstack/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:119 +0xb7 DEBUG sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc0005961e0, {0x242ec10, 0xc000988b40}, {0x1e7eda0, 0xc0009805e0}) DEBUG /var/home/pierre/code/src/github.com/openshift/installer.git/master/cluster-api/providers/openstack/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:316 +0x3bc DEBUG sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc0005961e0, {0x242ec10, 0xc000988b40}) DEBUG /var/home/pierre/code/src/github.com/openshift/installer.git/master/cluster-api/providers/openstack/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:266 +0x1c9 DEBUG sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2() DEBUG /var/home/pierre/code/src/github.com/openshift/installer.git/master/cluster-api/providers/openstack/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:227 +0x79 DEBUG created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2 in goroutine 213 DEBUG /var/home/pierre/code/src/github.com/openshift/installer.git/master/cluster-api/providers/openstack/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:223 +0x50c DEBUG Checking that machine ocp1-f4dwz-bootstrap has provisioned... DEBUG Machine ocp1-f4dwz-bootstrap has not yet provisioned: Pending DEBUG Checking that machine ocp1-f4dwz-master-0 has provisioned... DEBUG Machine ocp1-f4dwz-master-0 has not yet provisioned: Pending DEBUG Checking that machine ocp1-f4dwz-master-1 has provisioned... DEBUG Machine ocp1-f4dwz-master-1 has not yet provisioned: Pending DEBUG Checking that machine ocp1-f4dwz-master-2 has provisioned... DEBUG Machine ocp1-f4dwz-master-2 has not yet provisioned: Pending DEBUG Checking that machine ocp1-f4dwz-bootstrap has provisioned... DEBUG Machine ocp1-f4dwz-bootstrap has not yet provisioned: Pending DEBUG Checking that machine ocp1-f4dwz-master-0 has provisioned... DEBUG Machine ocp1-f4dwz-master-0 has not yet provisioned: Pending [...]
Expected results:
ERROR "The additional network $ID was not found in OpenStack."
Additional info:
A separate report will be filed against CAPO.
This is a clone of issue OCPBUGS-41136. The following is the description of the original issue:
—
Description of problem:
Customer is unable to scale a DeploymentConfig in a RHOCP 4.14.21 cluster. Scaling a DeploymentConfig fails with the error: "New size: 4; reason: cpu resource utilization (percentage of request) above target; error: Internal error occurred: converting (apps.DeploymentConfig) to (v1beta1.Scale): unknown conversion"
Version-Release number of selected component (if applicable):
4.14.21
How reproducible:
N/A
Steps to Reproduce:
1. Deploy apps using a DeploymentConfig 2. Create an HPA 3. Observe that the pods cannot be scaled; manual scaling also fails (see the sketch below)
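A minimal reproducer sketch, assuming a DeploymentConfig named "test" already exists in the current project:

# create an HPA targeting the DeploymentConfig
oc autoscale dc/test --min=1 --max=4 --cpu-percent=80
# manual scaling hits the same Scale conversion error
oc scale dc/test --replicas=4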
Actual results:
Pods are not getting scaled
Expected results:
Pods should be scaled using HPA
Additional info:
This is a clone of issue OCPBUGS-38392. The following is the description of the original issue:
—
Description of problem:
For CFE-920: Update GCP userLabels and userTags configs description
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
The Azure Disk CSI driver operator runs a node DaemonSet that exposes CSI driver metrics on loopback, but there is no kube-rbac-proxy in front of it and there is no Service / ServiceMonitor for it. As a result, OCP does not collect these metrics.
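A quick check that illustrates the gap; the namespace and object names are assumptions based on where the driver DaemonSet usually runs:

oc -n openshift-cluster-csi-drivers get daemonset azure-disk-csi-driver-node
oc -n openshift-cluster-csi-drivers get servicemonitors,services 2>/dev/null | grep -i 'azure-disk.*node' \
  || echo "no Service/ServiceMonitor exposing the node DaemonSet metrics"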
Description of problem:
When creating IPI cluster, following unexpected traceback appears in terminal occasionally, it won't cause any failure and install succeed finally. # ./openshift-install create cluster --dir cluster --log-level debug ... INFO Importing OVA sgao-nest-ktqck-rhcos-generated-region-generated-zone into failure domain generated-failure-domain. [controller-runtime] log.SetLogger(...) was never called; logs will not be displayed. Detected at: > goroutine 131 [running]: > runtime/debug.Stack() > /usr/lib/golang/src/runtime/debug/stack.go:24 +0x5e > sigs.k8s.io/controller-runtime/pkg/log.eventuallyFulfillRoot() > /go/src/github.com/openshift/installer/vendor/sigs.k8s.io/controller-runtime/pkg/log/log.go:60 +0xcd > sigs.k8s.io/controller-runtime/pkg/log.(*delegatingLogSink).Error(0xc000e37200, {0x26fd23c0, 0xc0016b4270}, {0x77d22d3, 0x3d}, {0x0, 0x0, 0x0}) > /go/src/github.com/openshift/installer/vendor/sigs.k8s.io/controller-runtime/pkg/log/deleg.go:139 +0x5d > github.com/go-logr/logr.Logger.Error({{0x270398d8?, 0xc000e37200?}, 0x0?}, {0x26fd23c0, 0xc0016b4270}, {0x77d22d3, 0x3d}, {0x0, 0x0, 0x0}) > /go/src/github.com/openshift/installer/vendor/github.com/go-logr/logr/logr.go:301 +0xda > sigs.k8s.io/cluster-api-provider-vsphere/pkg/session.newClient.func1({0x26fd6c40?, 0xc0021f0160?}) > /go/src/github.com/openshift/installer/vendor/sigs.k8s.io/cluster-api-provider-vsphere/pkg/session/session.go:265 +0xda > sigs.k8s.io/cluster-api-provider-vsphere/pkg/session.newClient.KeepAliveHandler.func2() > /go/src/github.com/openshift/installer/vendor/github.com/vmware/govmomi/session/keep_alive.go:36 +0x22 > github.com/vmware/govmomi/session/keepalive.(*handler).Start.func1() > /go/src/github.com/openshift/installer/vendor/github.com/vmware/govmomi/session/keepalive/handler.go:124 +0x98 > created by github.com/vmware/govmomi/session/keepalive.(*handler).Start in goroutine 1 > /go/src/github.com/openshift/installer/vendor/github.com/vmware/govmomi/session/keepalive/handler.go:116 +0x116
Version-Release number of selected component (if applicable):
4.16.0-0.nightly-2024-06-13-213831
How reproducible:
sometimes
Steps to Reproduce:
1. Create IPI cluster on vSphere multiple times 2, Check output in terminal
Actual results:
unexpected log traceback appears in terminal
Expected results:
unexpected log traceback should not appear in terminal
Additional info:
This is a clone of issue OCPBUGS-38717. The following is the description of the original issue:
—
The Telemetry userPreference added to the General tab in https://github.com/openshift/console/pull/13587 results in empty nodes being output to the DOM. This results in extra spacing any time a new user preference is added to the bottom of the General tab.
Description of problem:
The openshift/router repository vendors k8s.io/* v0.29.1. OpenShift 4.17 is based on Kubernetes 1.30.
Version-Release number of selected component (if applicable):
4.17.
How reproducible:
Always.
Steps to Reproduce:
Check https://github.com/openshift/router/blob/release-4.17/go.mod.
Actual results:
The k8s.io/* packages are at v0.29.1.
Expected results:
The k8s.io/* packages are at v0.30.0 or newer.
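A minimal sketch of the kind of dependency bump implied, run from a checkout of openshift/router (module list abbreviated; the exact pinned versions are assumptions):

go get k8s.io/api@v0.30.0 k8s.io/apimachinery@v0.30.0 k8s.io/client-go@v0.30.0
go mod tidy
go mod vendor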
This is a clone of issue OCPBUGS-44305. The following is the description of the original issue:
—
Description of problem:
The finally tasks do not get removed and remain in the pipeline.
Version-Release number of selected component (if applicable):
In all supported OCP version
How reproducible:
Always
Steps to Reproduce:
1. Create a finally task in a pipeline in pipeline builder 2. Save pipeline 3. Edit pipeline and remove finally task in pipeline builder 4. Save pipeline 5. Observe that the finally task has not been removed
Actual results:
The finally tasks do not get removed and remain in the pipeline.
Expected results:
Finally task gets removed from pipeline when removing the finally tasks and saving the pipeline in the "pipeline builder" mode.
Additional info:
In all releases tested, in particular 4.16.0-0.okd-scos-2024-08-21-155613, the Samples operator uses incorrect templates, resulting in the following alert:
Samples operator is detecting problems with imagestream image imports. You can look at the "openshift-samples" ClusterOperator object for details. Most likely there are issues with the external image registry hosting the images that needs to be investigated. Or you can consider marking samples operator Removed if you do not care about having sample imagestreams available. The list of ImageStreams for which samples operator is retrying imports: fuse7-eap-openshift fuse7-eap-openshift-java11 fuse7-java-openshift fuse7-java11-openshift fuse7-karaf-openshift-jdk11 golang httpd java jboss-datagrid73-openshift jboss-eap-xp3-openjdk11-openshift jboss-eap-xp3-openjdk11-runtime-openshift jboss-eap-xp4-openjdk11-openshift jboss-eap-xp4-openjdk11-runtime-openshift jboss-eap74-openjdk11-openshift jboss-eap74-openjdk11-runtime-openshift jboss-eap74-openjdk8-openshift jboss-eap74-openjdk8-runtime-openshift jboss-webserver57-openjdk8-tomcat9-openshift-ubi8 jenkins jenkins-agent-base mariadb mysql nginx nodejs perl php postgresql13-for-sso75-openshift-rhel8 postgresql13-for-sso76-openshift-rhel8 python redis ruby sso75-openshift-rhel8 sso76-openshift-rhel8 fuse7-karaf-openshift jboss-webserver57-openjdk11-tomcat9-openshift-ubi8 postgresql
For example, the sample image for Mysql 8.0 is being pulled from registry.redhat.io/rhscl/mysql-80-rhel7:latest (and cannot be found using the dummy pull secret).
Works correctly on OKD FCOS builds.
Description of problem:
Customers are reporting TelemeterClientFailures warnings in multiple clusters. Multiple cases were opened in roughly the last ~36 hours.
Version-Release number of selected component (if applicable):
OCP 4.13.38, OCP 4.12.40
How reproducible:
As per the latest update from one of the customers: "After 5 May 18:00 (HKT), this alert resolved by itself on all clusters. The 'gateway error' also no longer appears after 5 May 18:00 (HKT)."
Steps to Reproduce:
1. 2. 3.
Actual results:
Telemeter-client container logs report the below errors:
2024-05-05T06:40:32.162068012Z level=error caller=forwarder.go:276 ts=2024-05-05T06:40:32.161990057Z component=forwarder/worker msg="unable to forward results" err="gateway server reported unexpected error code: 503: <html>\r\n <head>\r\n <meta name=\"viewport\" content=\"width=device-width, initial-scale=1\">\r\n\r\n <style type=\"text/css\">\r\n body {\r\n font-family: \"Helvetica Neue\", Helvetica, Arial, sans-serif;\r\n line-height: 1.66666667;\r\n font-size: 16px;\r\n color: #333;\r\n background-color: #fff;\r\n margin: 2em 1em;\r\n }\r\n h1 {\r\n font-size: 28px;\r\n font-weight: 400;\r\n }\r\n p {\r\n margin: 0 0 10px;\r\n }\r\n .alert.alert-info {\r\n background-color: #F0F0F0;\r\n margin-top: 30px;\r\n padding: 30px;\r\n }\r\n .alert p {\r\n padding-left: 35px;\r\n }\r\n ul {\r\n padding-left: 51px;\r\n position: relative;\r\n }\r\n li {\r\n font-size: 14px;\r\n margin-bottom: 1em;\r\n }\r\n p.info {\r\n position: relative;\r\n font-size: 20px;\r\n }\r\n p.info:before, p.info:after {\r\n content: \"\";\r\n left: 0;\r\n position: absolute;\r\n top: 0;\r\n }\r\n"
Expected results:
TelemeterClientFailures alerts should not be seen
Additional info:
What could be the reason behind the TelemeterClientFailures alerts firing all of a sudden and then disappearing after a while?
Please review the following PR: https://github.com/openshift/cluster-api/pull/208
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
During IPI CAPI cluster creation, the load balancer may be busy at the time we try to update it, so the AddIPToLoadBalancerPool call should be wrapped in a PollUntilContextCancel loop.
Description of problem:
The version info is useful; however, I couldn't get the cluster-olm-operator's version. See below:
jiazha-mac:~ jiazha$ oc rsh cluster-olm-operator-7cc6c89999-hql9m Defaulted container "cluster-olm-operator" out of: cluster-olm-operator, copy-catalogd-manifests (init), copy-operator-controller-manifests (init), copy-rukpak-manifests (init) sh-5.1$ ps -elf|cat F S UID PID PPID C PRI NI ADDR SZ WCHAN STIME TTY TIME CMD 4 S 1000790+ 1 0 0 80 0 - 490961 futex_ 04:39 ? 00:00:28 /cluster-olm-operator start -v=2 4 S 1000790+ 15 0 0 80 0 - 1113 do_wai 07:33 pts/0 00:00:00 /bin/sh 4 R 1000790+ 22 15 0 80 0 - 1787 - 07:33 pts/0 00:00:00 ps -elf 4 S 1000790+ 23 15 0 80 0 - 1267 pipe_r 07:33 pts/0 00:00:00 /usr/bin/coreutils --coreutils-prog-shebang=cat /usr/bin/cat sh-5.1$ ./cluster-olm-operator -h OpenShift Cluster OLM Operator Usage: cluster-olm-operator [command] Available Commands: completion Generate the autocompletion script for the specified shell help Help about any command start Start the Cluster OLM Operator Flags: -h, --help help for cluster-olm-operator --log-flush-frequency duration Maximum number of seconds between log flushes (default 5s) -v, --v Level number for the log level verbosity --vmodule moduleSpec comma-separated list of pattern=N settings for file-filtered logging (only works for the default text log format) Use "cluster-olm-operator [command] --help" for more information about a command. sh-5.1$ ./cluster-olm-operator start -h Start the Cluster OLM Operator Usage: cluster-olm-operator start [flags] Flags: --config string Location of the master configuration file to run from. -h, --help help for start --kubeconfig string Location of the master configuration file to run from. --listen string The ip:port to serve on. --namespace string Namespace where the controller is running. Auto-detected if run in cluster. --terminate-on-files stringArray A list of files. If one of them changes, the process will terminate. Global Flags: --log-flush-frequency duration Maximum number of seconds between log flushes (default 5s) -v, --v Level number for the log level verbosity --vmodule moduleSpec comma-separated list of pattern=N settings for file-filtered logging (only works for the default text log format)
Version-Release number of selected component (if applicable):
jiazha-mac:~ jiazha$ oc get clusterversion NAME VERSION AVAILABLE PROGRESSING SINCE STATUS version 4.17.0-0.nightly-2024-07-20-191204 True False 9h Cluster version is 4.17.0-0.nightly-2024-07-20-191204
How reproducible:
always
Steps to Reproduce:
1. build an OCP cluster and Enable TP. $ oc patch featuregate cluster -p '{"spec": {"featureSet": "TechPreviewNoUpgrade"}}' --type=merge 2. Check cluster-olm-operator version info.
Actual results:
Couldn't get it.
Expected results:
The cluster-olm-operator should have a global flag to output the version info.
Additional info:
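As an interim workaround sketch, the running image (and hence the build) can be identified from the deployment spec; the namespace name is an assumption:

oc -n openshift-cluster-olm-operator get deployment cluster-olm-operator \
  -o jsonpath='{.spec.template.spec.containers[0].image}{"\n"}'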
Please review the following PR: https://github.com/openshift/csi-node-driver-registrar/pull/73
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
The value box in the ConfigMap Form view is no longer resizable. It is resizable as expected in OCP version 4.14.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
OCP Console -> Administrator -> Workloads -> ConfigMaps -> Create ConfigMap -> Form view -> value
Actual results:
The value box is not resizable anymore in 4.15 OpenShift clusters.
Expected results:
The value box should be resizable, as it is in 4.14.
Additional info:
Description of problem:
Removing imageContentSources from HostedCluster does not update IDMS for the cluster.
Version-Release number of selected component (if applicable):
Tested with 4.15.14
How reproducible:
100%
Steps to Reproduce:
1. add imageContentSources to HostedCluster 2. verify it is applied to IDMS 3. remove imageContentSources from HostedCluster
Actual results:
IDMS is not updated to remove imageDigestMirrors contents
Expected results:
IDMS is updated to remove imageDigestMirrors contents
Additional info:
Workaround: set imageContentSources=[] (see the sketch below)
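A sketch of applying the workaround with oc; the HostedCluster name and namespace are placeholders:

oc -n clusters patch hostedcluster my-cluster --type=merge \
  -p '{"spec":{"imageContentSources":[]}}'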
Description of problem:
Multiple failures with this error
{Timed out retrying after 30000ms: Expected to find element: `#page-sidebar`, but never found it. AssertionError AssertionError: Timed out retrying after 30000ms: Expected to find element: `#page-sidebar`, but never found it. at Context.eval (webpack:////go/src/github.com/openshift/console/frontend/packages/integration-tests-cypress/support/index.ts:48:5)}
Test failures
Additional findings:
Initial investigation of the test failure artifacts traces the failure to the following element not being found:
get [data-test="catalogSourceDisplayName-red-hat"]
Based on the following screenshots from the failure video, the Red Hat catalog source is not available on the test cluster.
https://drive.google.com/file/d/18xV5wviekcS6KJ4ObBNQdtwsnfkSpFxl/view?usp=drive_link
https://drive.google.com/file/d/17yMDb42CM2Mc3z-DkLKiz1P4HEjqAr-k/view?usp=sharing
Refactor the name to Dockerfile.ocp as a better, version-independent alternative
Please review the following PR: https://github.com/openshift/kubernetes-autoscaler/pull/311
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
Snapshot support is being delivered for kubevirt-csi in 4.16, but the cli used to configure snapshot support did not expose the argument that makes using snapshots possible. The cli arg [--infra-volumesnapshot-class-mapping] was added to the developer cli [hypershift] but never made it to the productized cli [hcp] that end users will use.
Version-Release number of selected component (if applicable):
4.16
How reproducible:
100%
Steps to Reproduce:
1. hcp create cluster kubevirt -h | grep infra-volumesnapshot-class-mapping 2. 3.
Actual results:
no value is found
Expected results:
the infra-volumesnapshot-class-mapping cli arg should be found
Additional info:
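For reference, a sketch of how the flag is expected to be used once exposed in hcp; the cluster name, class names, and mapping format mirror the hypershift developer CLI and are assumptions, and other required flags (pull secret, release image) are omitted:

hcp create cluster kubevirt \
  --name my-guest \
  --node-pool-replicas 2 \
  --memory 8Gi --cores 2 \
  --infra-volumesnapshot-class-mapping=infra-snapclass/guest-snapclass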
Description of problem:
If a release does not contain the KubeVirt CoreOS container image and the kubeVirtContainer flag is set to true, oc-mirror fails to continue.
Version-Release number of selected component (if applicable):
[fedora@preserve-fedora-yinzhou test]$ ./oc-mirror version WARNING: This version information is deprecated and will be replaced with the output from --short. Use --output=yaml|json to get the full version. Client Version: version.Info{Major:"", Minor:"", GitVersion:"v0.2.0-alpha.1-280-g8a42369", GitCommit:"8a423691", GitTreeState:"clean", BuildDate:"2024-08-03T08:02:06Z", GoVersion:"go1.22.4", Compiler:"gc", Platform:"linux/amd64"}
How reproducible:
Always
Steps to Reproduce:
1. use imageSetConfig.yaml as shown below 2. Run command oc-mirror -c clid-179.yaml file://clid-179 --v2 3.
Actual results:
fedora@preserve-fedora-yinzhou test]$ ./oc-mirror -c /tmp/clid-99.yaml file://CLID-412 --v2 2024/08/03 09:24:38 [WARN] : ⚠️ --v2 flag identified, flow redirected to the oc-mirror v2 version. This is Tech Preview, it is still under development and it is not production ready. 2024/08/03 09:24:38 [INFO] : 👋 Hello, welcome to oc-mirror 2024/08/03 09:24:38 [INFO] : ⚙️ setting up the environment for you... 2024/08/03 09:24:38 [INFO] : 🔀 workflow mode: mirrorToDisk 2024/08/03 09:24:38 [INFO] : 🕵️ going to discover the necessary images... 2024/08/03 09:24:38 [INFO] : 🔍 collecting release images... 2024/08/03 09:24:44 [INFO] : kubeVirtContainer set to true [ including : ] 2024/08/03 09:24:44 [ERROR] : unknown image : reference name is empty 2024/08/03 09:24:44 [INFO] : 👋 Goodbye, thank you for using oc-mirror 2024/08/03 09:24:44 [ERROR] : unknown image : reference name is empty
Expected results:
If the KubeVirt CoreOS container does not exist in a release, oc-mirror should skip it and continue mirroring the other content, but should not fail.
Additional info:
[fedora@preserve-fedora-yinzhou test]$ cat /tmp/clid-99.yaml
apiVersion: mirror.openshift.io/v2alpha1
kind: ImageSetConfiguration
mirror:
  platform:
    channels:
    - name: stable-4.12
      minVersion: 4.12.61
      maxVersion: 4.12.61
    kubeVirtContainer: true
  operators:
  - catalog: oci:///test/ibm-catalog
  - catalog: registry.redhat.io/redhat/redhat-operator-index:v4.15
    packages:
    - name: devworkspace-operator
      minVersion: "0.26.0"
    - name: nfd
      maxVersion: "4.15.0-202402210006"
    - name: cluster-logging
      minVersion: 5.8.3
      maxVersion: 5.8.4
    - name: quay-bridge-operator
      channels:
      - name: stable-3.9
        minVersion: 3.9.5
    - name: quay-operator
      channels:
      - name: stable-3.9
        maxVersion: "3.9.1"
    - name: odf-operator
      channels:
      - name: stable-4.14
        minVersion: "4.14.5-rhodf"
        maxVersion: "4.14.5-rhodf"
  additionalImages:
  - name: registry.redhat.io/ubi8/ubi:latest
  - name: quay.io/openshifttest/hello-openshift@sha256:61b8f5e1a3b5dbd9e2c35fd448dc5106337d7a299873dd3a6f0cd8d4891ecc27
  - name: quay.io/openshifttest/scratch@sha256:b045c6ba28db13704c5cbf51aff3935dbed9a692d508603cc80591d89ab26308
Description of problem:
Based on the results in [Sippy|https://sippy.dptools.openshift.org/sippy-ng/component_readiness/test_details?Aggregation=none&Architecture=amd64&Architecture=amd64&FeatureSet=default&FeatureSet=default&Installer=ipi&Installer=ipi&Network=ovn&Network=ovn&NetworkAccess=default&Platform=gcp&Platform=gcp&Scheduler=default&SecurityMode=default&Suite=unknown&Suite=unknown&Topology=ha&Topology=ha&Upgrade=none&Upgrade=none&baseEndTime=2024-06-27%2023%3A59%3A59&baseRelease=4.16&baseStartTime=2024-05-31%2000%3A00%3A00&capability=operator-conditions&columnGroupBy=Platform%2CArchitecture%2CNetwork&component=Etcd&confidence=95&dbGroupBy=Platform%2CArchitecture%2CNetwork%2CTopology%2CFeatureSet%2CUpgrade%2CSuite%2CInstaller&environment=amd64%20default%20ipi%20ovn%20gcp%20unknown%20ha%20none&ignoreDisruption=true&ignoreMissing=false&includeVariant=Architecture%3Aamd64&includeVariant=FeatureSet%3Adefault&includeVariant=Installer%3Aipi&includeVariant=Installer%3Aupi&includeVariant=Owner%3Aeng&includeVariant=Platform%3Aaws&includeVariant=Platform%3Aazure&includeVariant=Platform%3Agcp&includeVariant=Platform%3Ametal&includeVariant=Platform%3Avsphere&includeVariant=Topology%3Aha&minFail=3&pity=5&sampleEndTime=2024-08-19%2023%3A59%3A59&sampleRelease=4.17&sampleStartTime=2024-08-13%2000%3A00%3A00&testId=Operator%20results%3A45d55df296fbbfa7144600dce70c1182&testName=operator%20conditions%20etcd], it appears that the periodic tests are not waiting for the etcd operator to complete before exiting. The test is supposed to wait for up to 20 mins after the final control plane machine is rolled, to allow operators to settle. But we are seeing the etcd operator triggering 2 further revisions after this happens. We need to understand if the etcd operator is correctly rolling out vs whether these changes should have rolled out prior to the final machine going away, and, understand if there's a way to add more stability to our checks to make sure that all of the operators stabilise, and, that they have been stable for at least some period (1 minute)
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
knative-ci.feature test is failing with:
Logging in as kubeadmin Installing operator: "Red Hat OpenShift Serverless" Operator Red Hat OpenShift Serverless was not yet installed. Performing Serverless post installation steps User has selected namespace knative-serving 1) "before all" hook for "Create knative workload using Container image with extrenal registry on Add page: KN-05-TC05 (example #1)" 0 passing (3m) 1 failing 1) Perform actions on knative service and revision "before all" hook for "Create knative workload using Container image with extrenal registry on Add page: KN-05-TC05 (example #1)": AssertionError: Timed out retrying after 40000ms: Expected to find element: `[title="knativeservings.operator.knative.dev"]`, but never found it. Because this error occurred during a `before all` hook we are skipping all of the remaining tests. Although you have test retries enabled, we do not retry tests when `before all` or `after all` hooks fail at createKnativeServing (webpack:////go/src/github.com/openshift/console/frontend/packages/dev-console/integration-tests/support/pages/functions/knativeSubscriptions.ts:15:5) at performPostInstallationSteps (webpack:////go/src/github.com/openshift/console/frontend/packages/dev-console/integration-tests/support/pages/functions/installOperatorOnCluster.ts:176:26) at verifyAndInstallOperator (webpack:////go/src/github.com/openshift/console/frontend/packages/dev-console/integration-tests/support/pages/functions/installOperatorOnCluster.ts:221:2) at verifyAndInstallKnativeOperator (webpack:////go/src/github.com/openshift/console/frontend/packages/dev-console/integration-tests/support/pages/functions/installOperatorOnCluster.ts:231:27) at Context.eval (webpack:///./support/commands/hooks.ts:7:33) [mochawesome] Report JSON saved to /go/src/github.com/openshift/console/frontend/gui_test_screenshots/cypress_report_knative.json (Results) ┌────────────────────────────────────────────────────────────────────────────────────────────────┐ │ Tests: 16 │ │ Passing: 0 │ │ Failing: 1 │ │ Pending: 0 │ │ Skipped: 15 │ │ Screenshots: 1 │ │ Video: true │ │ Duration: 3 minutes, 8 seconds │ │ Spec Ran: knative-ci.feature │ └────────────────────────────────────────────────────────────────────────────────────────────────┘ (Screenshots) - /go/src/github.com/openshift/console/frontend/gui_test_screenshots/cypress/scree (1280x720) nshots/knative-ci.feature/Create knative workload using Container image with ext renal registry on Add page KN-05-TC05 (example #1) -- before all hook (failed).p ng
Description of problem:
The `oc set env` command changes apiVersion for route and deploymentconfig
Version-Release number of selected component (if applicable):
4.12, 4.13, 4.14
How reproducible:
100%
Steps to Reproduce:
With oc client 4.10 $ oc410 set env -e FOO="BAR" -f process.json --local -o json { "kind": "Service", "apiVersion": "v1", "metadata": { "name": "test", "creationTimestamp": null, "labels": { "app_name": "test", "template": "immutable" } }, "spec": { "ports": [ { "name": "8080-tcp", "protocol": "TCP", "port": 8080, "targetPort": 8080 } ], "selector": { "app_name": "test", "deploymentconfig": "test" }, "type": "ClusterIP", "sessionAffinity": "None" }, "status": { "loadBalancer": {} } } { "kind": "Route", "apiVersion": "route.openshift.io/v1", "metadata": { "name": "test", "creationTimestamp": null, "labels": { "app_name": "test", "template": "immutable" } }, With oc client 4.12, 4.13 and 4.14 $ oc41245 set env -e FOO="BAR" -f process.json --local -o json { "kind": "Service", "apiVersion": "v1", "metadata": { "name": "test", "creationTimestamp": null, "labels": { "app_name": "test", "template": "immutable" } }, "spec": { "ports": [ { "name": "8080-tcp", "protocol": "TCP", "port": 8080, "targetPort": 8080 } ], "selector": { "app_name": "test", "deploymentconfig": "test" }, "type": "ClusterIP", "sessionAffinity": "None" }, "status": { "loadBalancer": {} } } { "kind": "Route", "apiVersion": "v1", "metadata": { "name": "test" ..... ..... "kind": "DeploymentConfig", "apiVersion": "v1",
Actual results:
The oc clients for 4.12, 4.13, and 4.14 change the apiVersion.
Expected results:
The apiVersion should not be changed for Route and DeploymentConfig.
Additional info:
Description of problem:
The audit-logs container for the kas, oapi, and oauth apiservers does not terminate within the `TerminationGracePeriodSeconds` timer, because the container does not exit when a `SIGTERM` is issued. When testing without the audit-logs container, the oapi and oauth apiservers terminate gracefully within a 90-110 second range. The kas still does not terminate even with that container gone, and I have a hunch that the konnectivity container also ignores `SIGTERM` (I've waited 10 minutes and it still did not terminate). So this issue is to change the audit-logs logic to terminate gracefully and to increase TerminationGracePeriodSeconds from the default of 30s to 120s.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. Create a HyperShift cluster with auditing enabled 2. Try deleting apiserver pods and watch the pods being force-deleted after 30 seconds (95 for kas) instead of gracefully terminating (see the sketch below) 3.
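A sketch for manually verifying the proposed grace period on one control-plane deployment; the namespace, the pod label, and the fact that the HyperShift operator may revert the manual patch are assumptions:

HCP_NS=clusters-my-hosted-cluster
oc -n "$HCP_NS" patch deployment kube-apiserver --type=json \
  -p '[{"op":"replace","path":"/spec/template/spec/terminationGracePeriodSeconds","value":120}]'
# then delete a pod and time how long termination actually takes
oc -n "$HCP_NS" delete pod -l app=kube-apiserver --wait=true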
Actual results:
Expected results:
Additional info:
Description of problem:
After enabling the feature gate, the EgressFirewall DNS name resolution feature does not work.
Version-Release number of selected component (if applicable):
4.16
How reproducible:
Always
Steps to Reproduce:
1. Setup 4.16 ovn cluster
2. Following doc to enable feature gate https://docs.openshift.com/container-platform/4.15/nodes/clusters/nodes-cluster-enabling-features.html#nodes-cluster-enabling-features-cli_nodes-cluster-enabling
3. Configure an EgressFirewall with a dnsName rule (see the sketch below)
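A sketch of the EgressFirewall used in step 3; the namespace and dnsName are placeholders (the resource must be named "default"):

cat << EOF | oc apply -f -
apiVersion: k8s.ovn.org/v1
kind: EgressFirewall
metadata:
  name: default
  namespace: test
spec:
  egress:
  - type: Allow
    to:
      dnsName: www.example.com
  - type: Deny
    to:
      cidrSelector: 0.0.0.0/0
EOF
# with the feature gate working, a dnsnameresolver object is expected here
oc get dnsnameresolver -n openshift-ovn-kubernetes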
Actual results:
no dnsnameresolver under openshift-ovn-kubernetes
Expected results:
The feature is enabled and a dnsnameresolver resource should exist under openshift-ovn-kubernetes
Additional info:
Description of problem:
HCP has audit log configuration for the Kube API server, OpenShift API server, and OAuth API server (like OCP), but does not have audit for oauth-openshift (the OAuth server). As discussed with Standa in https://redhat-internal.slack.com/archives/CS05TR7BK/p1714124297376299 , oauth-openshift needs audit too in HCP.
Version-Release number of selected component (if applicable):
4.11 ~ 4.16
How reproducible:
Always
Steps to Reproduce:
1. Launch HCP env. 2. Check audit log configuration: $ oc get deployment -n clusters-hypershift-ci-279389 kube-apiserver openshift-apiserver openshift-oauth-apiserver oauth-openshift -o yaml | grep -e '^ name:' -e 'audit\.log'
Actual results:
2. It outputs oauth-openshift (OAuth server) has no audit: name: kube-apiserver - /var/log/kube-apiserver/audit.log name: openshift-apiserver - /var/log/openshift-apiserver/audit.log name: openshift-oauth-apiserver - --audit-log-path=/var/log/openshift-oauth-apiserver/audit.log - /var/log/openshift-oauth-apiserver/audit.log name: oauth-openshift
Expected results:
2. oauth-openshift (OAuth server) needs to have audit too.
Additional info:
OCP has audit for OAuth server since 4.11 AUTH-6 https://docs.openshift.com/container-platform/4.11/security/audit-log-view.html saying "You can view the logs for the OpenShift API server, Kubernetes API server, OpenShift OAuth API server, and OpenShift OAuth server".
Description of problem:
Console UI alerting page shows as `Not found`
Version-Release number of selected component (if applicable):
4.17.0-0.nightly-2024-07-13-042606
How reproducible:
100%
Steps to Reproduce:
1. Open the console UI and navigate to Observe -> Alerting
Actual results:
Alerts, Silences, Alerting rules pages display as not found
Expected results:
Able to see the details of Alerts, Silences, and Alerting rules
Additional info:
Request URL:https://console-openshift-console.apps.tagao-417.qe.devcluster.openshift.com/api/kubernetes/apis/operators.coreos.com/v1alpha1/namespaces/ Request Method:GET Status Code:404 Not Found Remote Address:10.68.5.32:3128 Referrer Policy:strict-origin-when-cross-origin
This is a clone of issue OCPBUGS-37534. The following is the description of the original issue:
—
Description of problem:
Prow jobs upgrading from 4.9 to 4.16 are failing when they upgrade from 4.12 to 4.13. Nodes become NotReady when MCO tries to apply the new 4.13 configuration to the MCPs. The failing job is: periodic-ci-openshift-openshift-tests-private-release-4.16-amd64-nightly-4.16-upgrade-from-stable-4.9-azure-ipi-f28 We have reproduced the issue and we found an ordering cycle error in the journal log Wed 2024-07-24 21:12:17 UTC ci-op-g94jvswm-cc71e-998q8-master-2 systemd-journald.service[838]: Runtime Journal (/run/log/journal/960b04f10e4f44d98453ce5faae27e84) is 8.0M, max 641.9M, 633.9M free. Wed 2024-07-24 21:12:17 UTC ci-op-g94jvswm-cc71e-998q8-master-2 init.scope[1]: machine-config-daemon-pull.service: Found ordering cycle on network-online.target/start Wed 2024-07-24 21:12:17 UTC ci-op-g94jvswm-cc71e-998q8-master-2 init.scope[1]: machine-config-daemon-pull.service: Found dependency on node-valid-hostname.service/start Wed 2024-07-24 21:12:17 UTC ci-op-g94jvswm-cc71e-998q8-master-2 init.scope[1]: machine-config-daemon-pull.service: Found dependency on ovs-configuration.service/start Wed 2024-07-24 21:12:17 UTC ci-op-g94jvswm-cc71e-998q8-master-2 init.scope[1]: machine-config-daemon-pull.service: Found dependency on firstboot-osupdate.target/start Wed 2024-07-24 21:12:17 UTC ci-op-g94jvswm-cc71e-998q8-master-2 init.scope[1]: machine-config-daemon-pull.service: Found dependency on machine-config-daemon-firstboot.service/start Wed 2024-07-24 21:12:17 UTC ci-op-g94jvswm-cc71e-998q8-master-2 init.scope[1]: machine-config-daemon-pull.service: Found dependency on machine-config-daemon-pull.service/start Wed 2024-07-24 21:12:17 UTC ci-op-g94jvswm-cc71e-998q8-master-2 init.scope[1]: machine-config-daemon-pull.service: Job network-online.target/start deleted to break ordering cycle starting with machine-config-daemon-pull.service/start Wed 2024-07-24 21:12:17 UTC ci-op-g94jvswm-cc71e-998q8-master-2 init.scope[1]: Queued start job for default target Graphical Interface. Wed 2024-07-24 21:12:17 UTC ci-op-g94jvswm-cc71e-998q8-master-2 init.scope[1]: systemd-journald.service: unit configures an IP firewall, but the local system does not support BPF/cgroup firewalling. Wed 2024-07-24 21:12:17 UTC ci-op-g94jvswm-cc71e-998q8-master-2 init.scope[1]: (This warning is only shown for the first unit using IP firewalling.) Wed 2024-07-24 21:12:17 UTC ci-op-g94jvswm-cc71e-998q8-master-2 init.scope[1]: systemd-journald.service: Deactivated successfully.
Version-Release number of selected component (if applicable):
Using IPI on Azure, these are the version involved in the current issue upgrading from 4.9 to 4.13: version: 4.13.0-0.nightly-2024-07-23-154444 version: 4.12.0-0.nightly-2024-07-23-230744 version: 4.11.59 version: 4.10.67 version: 4.9.59
How reproducible:
Always
Steps to Reproduce:
1. Upgrade an IPI on Azure cluster from 4.9 to 4.13. Theoretically, upgrading from 4.12 to 4.13 should be enough, but we reproduced it following the whole path.
Actual results:
Nodes become not ready $ oc get nodes NAME STATUS ROLES AGE VERSION ci-op-g94jvswm-cc71e-998q8-master-0 Ready master 6h14m v1.25.16+306a47e ci-op-g94jvswm-cc71e-998q8-master-1 Ready master 6h13m v1.25.16+306a47e ci-op-g94jvswm-cc71e-998q8-master-2 NotReady,SchedulingDisabled master 6h13m v1.25.16+306a47e ci-op-g94jvswm-cc71e-998q8-worker-centralus1-c7ngb NotReady,SchedulingDisabled worker 6h2m v1.25.16+306a47e ci-op-g94jvswm-cc71e-998q8-worker-centralus2-2ppf6 Ready worker 6h4m v1.25.16+306a47e ci-op-g94jvswm-cc71e-998q8-worker-centralus3-nqshj Ready worker 6h6m v1.25.16+306a47e And in the NotReady nodes we can see the ordering cycle error mentioned in the description of this ticket.
Expected results:
No ordering cycle error should happen and the upgrade should be executed without problems.
Additional info:
Please review the following PR: https://github.com/openshift/prometheus/pull/203
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/console-operator/pull/906
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
Storage degraded by VSphereProblemDetectorStarterStaticControllerDegraded during upgrade to 4.16.0-0.nightly
Version-Release number of selected component (if applicable):
How reproducible:
Once
Steps to Reproduce:
1.Run prow ci job: https://qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gs/qe-private-deck/logs/periodic-ci-openshift-openshift-tests-private-release-4.16-amd64-nightly-4.16-upgrade-from-stable-4.15-vsphere-ipi-disk-encryption-tang-fips-f28/1790991142867701760 2.Storage degraded by VSphereProblemDetectorStarterStaticControllerDegraded during uprading to 4.16.0-0.nightly from 4.15.13: Last Transition Time: 2024-05-16T09:35:05Z Message: VSphereProblemDetectorStarterStaticControllerDegraded: "vsphere_problem_detector/04_clusterrole.yaml" (string): client rate limiter Wait returned an error: context canceled VSphereProblemDetectorStarterStaticControllerDegraded: "vsphere_problem_detector/05_clusterrolebinding.yaml" (string): client rate limiter Wait returned an error: context canceled VSphereProblemDetectorStarterStaticControllerDegraded: "vsphere_problem_detector/10_service.yaml" (string): client rate limiter Wait returned an error: context canceled VSphereProblemDetectorStarterStaticControllerDegraded: Reason: VSphereProblemDetectorStarterStaticController_SyncError Status: True Type: Degraded 3.must-gather is available: https://gcsweb-qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/qe-private-deck/logs/periodic-ci-openshift-openshift-tests-private-release-4.16-amd64-nightly-4.16-upgrade-from-stable-4.15-vsphere-ipi-disk-encryption-tang-fips-f28/1790991142867701760/artifacts/vsphere-ipi-disk-encryption-tang-fips-f28/gather-must-gather/
Actual results:
Storage degraded by VSphereProblemDetectorStarterStaticControllerDegraded during upgrade to 4.16.0-0.nightly from 4.15.13
Expected results:
Upgrade should be successful
Additional info:
Description of problem:
When specifying imageDigestSources (or the deprecated imageContentSources), SNAT should be disabled to prevent public internet traffic.
Version-Release number of selected component (if applicable):
How reproducible:
Easily
Steps to Reproduce:
1. Specify imageDigestSources or imageContentSources along with an Internal publish strategy (see the sketch below) 2. The DHCP service will not have SNAT disabled
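A sketch of the relevant install-config fragment for step 1; the mirror registry hostname is a placeholder:

cat >> install-config.yaml << 'EOF'
publish: Internal
imageDigestSources:
- mirrors:
  - mirror.registry.example.com/ocp/release
  source: quay.io/openshift-release-dev/ocp-release
EOF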
Actual results:
DHCP service will not have SNAT disabled
Expected results:
DHCP service will have SNAT disabled
Additional info:
Please review the following PR: https://github.com/openshift/machine-api-provider-openstack/pull/116
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
Customer is running Openshift on AHV and their Tenable Security Scan reported the following vulnerability on the Nutanix Cloud Controller Manager Deployment. https://www.tenable.com/plugins/nessus/42873 on port 10258 SSL Medium Strength Cipher Suites Supported (SWEET32) The Nutanix Cloud Controller Manager deployment runs two pods and exposes port 10258 to the outside world. sh-4.4# netstat -ltnp|grep -w '10258' tcp6 0 0 :::10258 :::* LISTEN 10176/nutanix-cloud sh-4.4# ps aux|grep 10176 root 10176 0.0 0.2 1297832 59764 ? Ssl Feb15 4:40 /bin/nutanix-cloud-controller-manager --v=3 --cloud-provider=nutanix --cloud-config=/etc/cloud/nutanix_config.json --controllers=* --configure-cloud-routes=false --cluster-name=trulabs-8qmx4 --use-service-account-credentials=true --leader-elect=true --leader-elect-lease-duration=137s --leader-elect-renew-deadline=107s --leader-elect-retry-period=26s --leader-elect-resource-namespace=openshift-cloud-controller-manager root 1403663 0.0 0.0 9216 1100 pts/0 S+ 14:17 0:00 grep 10176 [centos@provisioner-trulabs-0-230518-065321 ~]$ oc get pods -A -o wide | grep nutanix openshift-cloud-controller-manager nutanix-cloud-controller-manager-5c4cdbb9c-jnv7c 1/1 Running 0 4d18h 172.17.0.249 trulabs-8qmx4-master-1 <none> <none> openshift-cloud-controller-manager nutanix-cloud-controller-manager-5c4cdbb9c-vtrz5 1/1 Running 0 4d18h 172.17.0.121 trulabs-8qmx4-master-0 <none> <none> [centos@provisioner-trulabs-0-230518-065321 ~]$ oc describe pod -n openshift-cloud-controller-manager nutanix-cloud-controller-manager-5c4cdbb9c-jnv7c Name: nutanix-cloud-controller-manager-5c4cdbb9c-jnv7c Namespace: openshift-cloud-controller-manager Priority: 2000000000 Priority Class Name: system-cluster-critical Service Account: cloud-controller-manager Node: trulabs-8qmx4-master-1/172.17.0.249 Start Time: Thu, 15 Feb 2024 19:24:52 +0000 Labels: infrastructure.openshift.io/cloud-controller-manager=Nutanix k8s-app=nutanix-cloud-controller-manager pod-template-hash=5c4cdbb9c Annotations: operator.openshift.io/config-hash: b3e08acdcd983115fe7a2b94df296362b20c35db781c8eec572fbe24c3a7c6aa Status: Running IP: 172.17.0.249 IPs: IP: 172.17.0.249 Controlled By: ReplicaSet/nutanix-cloud-controller-manager-5c4cdbb9c Containers: cloud-controller-manager: Container ID: cri-o://f5c0f39e1907093c9359aa2ac364c5bcd591918b06103f7955b30d350c730a8a Image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:7f3e7b600d94d1ba0be1edb328ae2e32393acba819742ac3be5e6979a3dcbf4c Image ID: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:7f3e7b600d94d1ba0be1edb328ae2e32393acba819742ac3be5e6979a3dcbf4c Port: 10258/TCP Host Port: 10258/TCP Command: /bin/bash -c #!/bin/bash set -o allexport if [[ -f /etc/kubernetes/apiserver-url.env ]]; then source /etc/kubernetes/apiserver-url.env fi exec /bin/nutanix-cloud-controller-manager \ --v=3 \ --cloud-provider=nutanix \ --cloud-config=/etc/cloud/nutanix_config.json \ --controllers=* \ --configure-cloud-routes=false \ --cluster-name=$(OCP_INFRASTRUCTURE_NAME) \ --use-service-account-credentials=true \ --leader-elect=true \ --leader-elect-lease-duration=137s \ --leader-elect-renew-deadline=107s \ --leader-elect-retry-period=26s \ --leader-elect-resource-namespace=openshift-cloud-controller-manager State: Running Started: Thu, 15 Feb 2024 19:24:56 +0000 Ready: True Restart Count: 0 Requests: cpu: 200m memory: 128Mi Environment: OCP_INFRASTRUCTURE_NAME: trulabs-8qmx4 NUTANIX_SECRET_NAMESPACE: openshift-cloud-controller-manager NUTANIX_SECRET_NAME: nutanix-credentials POD_NAMESPACE: 
openshift-cloud-controller-manager (v1:metadata.namespace) Mounts: /etc/cloud from nutanix-config (ro) /etc/kubernetes from host-etc-kube (ro) /etc/pki/ca-trust/extracted/pem from trusted-ca (ro) /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-4ht28 (ro) Conditions: Type Status Initialized True Ready True ContainersReady True PodScheduled True Volumes: nutanix-config: Type: ConfigMap (a volume populated by a ConfigMap) Name: cloud-conf Optional: false trusted-ca: Type: ConfigMap (a volume populated by a ConfigMap) Name: ccm-trusted-ca Optional: false host-etc-kube: Type: HostPath (bare host directory volume) Path: /etc/kubernetes HostPathType: Directory kube-api-access-4ht28: Type: Projected (a volume that contains injected data from multiple sources) TokenExpirationSeconds: 3607 ConfigMapName: kube-root-ca.crt ConfigMapOptional: <nil> DownwardAPI: true ConfigMapName: openshift-service-ca.crt ConfigMapOptional: <nil> QoS Class: Burstable Node-Selectors: node-role.kubernetes.io/master= Tolerations: node-role.kubernetes.io/master:NoSchedule op=Exists node.cloudprovider.kubernetes.io/uninitialized:NoSchedule op=Exists node.kubernetes.io/memory-pressure:NoSchedule op=Exists node.kubernetes.io/not-ready:NoExecute op=Exists for 120s node.kubernetes.io/not-ready:NoSchedule op=Exists node.kubernetes.io/unreachable:NoExecute op=Exists for 120s Events: <none> Medium Strength Ciphers (> 64-bit and < 112-bit key, or 3DES) Name Code KEX Auth Encryption MAC ---------------------- ---------- --- ---- --------------------- --- ECDHE-RSA-DES-CBC3-SHA 0xC0, 0x12 ECDH RSA 3DES-CBC(168) SHA1 DES-CBC3-SHA 0x00, 0x0A RSA RSA 3DES-CBC(168) SHA1 The fields above are : {Tenable ciphername} {Cipher ID code} Kex={key exchange} Auth={authentication} Encrypt={symmetric encryption method} MAC={message authentication code} {export flag} [centos@provisioner-trulabs-0-230518-065321 ~]$ curl -v telnet://172.17.0.2:10258 * About to connect() to 172.17.0.2 port 10258 (#0) * Trying 172.17.0.2... * Connected to 172.17.0.2 (172.17.0.2) port 10258 (#0)
Version-Release number of selected component (if applicable):
How reproducible:
The Nutanix CCM pod running in the OCP cluster does not set the "--tls-cipher-suites" option.
Steps to Reproduce:
Create an OCP Nutanix cluster.
Actual results:
Running the CLI command below returns nothing. $ oc describe pod -n openshift-cloud-controller-manager nutanix-cloud-controller-manager-... | grep "\--tls-cipher-suites"
Expected results:
The Nutanix CCM deployment is expected to set the proper "--tls-cipher-suites" option.
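For illustration only, the kind of change requested would add a cipher restriction flag to the CCM container invocation; a rough sketch of the container args follows (the flag is the one named in this report, but the cipher list and exact deployment layout are assumptions, not the actual manifest):

command:
- /bin/nutanix-cloud-controller-manager
- --cloud-provider=nutanix
- --cloud-config=/etc/cloud/nutanix_config.json
- --tls-cipher-suites=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384

Restricting the suites this way would drop the 3DES (SWEET32-affected) ciphers flagged by the scan.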
Additional info:
Description of problem:
Navigate to the Node overview and check the Utilization of CPU and memory; it shows something like "6.53 GiB available of 300 MiB total limit", which looks very confusing.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. Navigate to Node overview 2. Check the Utilization of CPU and memory 3.
Actual results:
Expected results:
Additional info:
This is a clone of issue OCPBUGS-34647. The following is the description of the original issue:
—
Description of problem:
When we enable OCB functionality and we create a MC that configures an enforcing=0 kernel argument, the MCP is degraded, reporting this message { "lastTransitionTime": "2024-05-30T09:37:06Z", "message": "Node ip-10-0-29-166.us-east-2.compute.internal is reporting: \"unexpected on-disk state validating against quay.io/mcoqe/layering@sha256:654149c7e25a1ada80acb8eedc3ecf9966a8d29e9738b39fcbedad44ddd15ed5: missing expected kernel arguments: [enforcing=0]\"", "reason": "1 nodes are reporting degraded status on sync", "status": "True", "type": "NodeDegraded" },
Version-Release number of selected component (if applicable):
IPI on AWS $ oc get clusterversion NAME VERSION AVAILABLE PROGRESSING SINCE STATUS version 4.16.0-0.nightly-2024-05-30-021120 True False 97m Error while reconciling 4.16.0-0.nightly-2024-05-30-021120: the cluster operator olm is not available
How reproducible:
Always
Steps to Reproduce:
1. Enable techpreview $ oc patch featuregate cluster --type=merge -p '{"spec":{"featureSet": "TechPreviewNoUpgrade"}}' 2. Configure a MSOC resource to enable OCB functionality in the worker pool. When we hit this problem we were using the mcoqe quay repository. A copy of the pull-secret for baseImagePullSecret and renderedImagePushSecret and no currentImagePullSecret configured. apiVersion: machineconfiguration.openshift.io/v1alpha1 kind: MachineOSConfig metadata: name: worker spec: machineConfigPool: name: worker # buildOutputs: # currentImagePullSecret: # name: "" buildInputs: imageBuilder: imageBuilderType: PodImageBuilder baseImagePullSecret: name: pull-copy renderedImagePushSecret: name: pull-copy renderedImagePushspec: "quay.io/mcoqe/layering:latest" 3. Create a MC to use the enforcing=0 kernel argument { "kind": "List", "apiVersion": "v1", "metadata": {}, "items": [ { "apiVersion": "machineconfiguration.openshift.io/v1", "kind": "MachineConfig", "metadata": { "labels": { "machineconfiguration.openshift.io/role": "worker" }, "name": "change-worker-kernel-selinux-gvr393x2" }, "spec": { "config": { "ignition": { "version": "3.2.0" } }, "kernelArguments": [ "enforcing=0" ] } } ] }
Actual results:
The worker MCP is degraded reporting this message: oc get mcp worker -oyaml .... { "lastTransitionTime": "2024-05-30T09:37:06Z", "message": "Node ip-10-0-29-166.us-east-2.compute.internal is reporting: \"unexpected on-disk state validating against quay.io/mcoqe/layering@sha256:654149c7e25a1ada80acb8eedc3ecf9966a8d29e9738b39fcbedad44ddd15ed5: missing expected kernel arguments: [enforcing=0]\"", "reason": "1 nodes are reporting degraded status on sync", "status": "True", "type": "NodeDegraded" },
Expected results:
The MC should be applied without problems and selinux should be using enforcing=0
Additional info:
Description of problem:
When there is more than one password-based IDP (like htpasswd) and the IDP names contain whitespace, the oauth-server panics if Golang is v1.22 or higher.
Version-Release number of selected component (if applicable):
How reproducible:
always
Steps to Reproduce:
1. Create a cluster with OCP 4.17 2. Create at least two password-based IDPs (like htpasswd) with whitespace in their names. 3. oauth-server panics.
Actual results:
oauth-server panics (if Go is at version 1.22 or higher).
Expected results:
NO REGRESSION, it worked with Go 1.21 and lower.
Additional info:
Please review the following PR: https://github.com/openshift/vmware-vsphere-csi-driver-operator/pull/232
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
Now that capi/aws is the default in 4.16+, the old terraform aws configs won't be maintained since there is no way to use them. Users interested in the configs can still access them in the 4.15 branch where they are still maintained as the installer still uses terraform.
Version-Release number of selected component (if applicable):
4.16+
How reproducible:
always
Steps to Reproduce:
1. 2. 3.
Actual results:
terraform aws configs are left in the repo.
Expected results:
Configs are removed.
Additional info:
This is a clone of issue OCPBUGS-38733. The following is the description of the original issue:
—
Description of problem:
In OpenShift 4.13-4.15, when a "rendered" MachineConfig in use is deleted, it's automatically recreated. In OpenShift 4.16, it's not recreated, and nodes and the MCP become degraded due to the "rendered" not found error.
Version-Release number of selected component (if applicable):
4.16
How reproducible:
Always
Steps to Reproduce:
1. Create a MC to deploy any file in the worker MCP 2. Get the name of the new rendered MC, like for example "rendered-worker-bf829671270609af06e077311a39363e" 3. When the first node starts updating, delete the new rendered MC oc delete mc rendered-worker-bf829671270609af06e077311a39363e
Actual results:
Node degraded with "rendered" not found error
Expected results:
In OCP 4.13 to 4.15, the "rendered" MC is automatically re-created, and the node continues updating to the MC content without issues. It should be the same in 4.16.
Additional info:
4.12 and older behaved the same way as 4.16 does now. In 4.13-4.15, the "rendered" MC is re-created and no issues with the nodes/MCPs are shown.
Description of problem:
Configure a custom AMI for the cluster: platform.aws.defaultMachinePlatform.amiID or installconfig.controlPlane.platform.aws.amiID / installconfig.compute.platform.aws.amiID. Master machines still use the default AMI instead of the custom one. aws ec2 describe-instances --filters "Name=tag:kubernetes.io/cluster/yunjiang-cap6-qjc5t,Values=owned" "Name=tag:Name,Values=*worker*" --output json | jq '.Reservations[].Instances[].ImageId' | sort | uniq "ami-0f71147cab4dbfb61" aws ec2 describe-instances --filters "Name=tag:kubernetes.io/cluster/yunjiang-cap6-qjc5t,Values=owned" "Name=tag:Name,Values=*master*" --output json | jq '.Reservations[].Instances[].ImageId' | sort | uniq "ami-0ae9b509738034a2c" <- default ami
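For reference, a minimal install-config fragment of the kind described above (the AMI ID is a placeholder):

platform:
  aws:
    region: us-east-2
    defaultMachinePlatform:
      amiID: ami-0123456789abcdef0
controlPlane:
  name: master
  platform:
    aws:
      amiID: ami-0123456789abcdef0
compute:
- name: worker
  platform:
    aws:
      amiID: ami-0123456789abcdef0

With either the defaultMachinePlatform or the per-pool setting, both master and worker instances are expected to boot from the custom AMI.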
Version-Release number of selected component (if applicable):
4.16.0-0.nightly-2024-05-08-222442
How reproducible:
Steps to Reproduce:
1.See description 2. 3.
Actual results:
See description
Expected results:
master machines use custom AMI
Additional info:
This is a clone of issue OCPBUGS-43084. The following is the description of the original issue:
—
Description of problem:
While accessing the node terminal of the cluster from the web console, the below warning message is observed. ~~~ Admission Webhook Warning: Pod master-0.americancluster222.lab.psi.pnq2.redhat.com-debug violates policy 299 - "metadata.name: this is used in the Pod's hostname, which can result in surprising behavior; a DNS label is recommended: [must not contain dots]" ~~~ Note: This is not impacting the cluster. However, it is creating confusion among customers due to the warning message.
Version-Release number of selected component (if applicable):
4.16
How reproducible:
Every time.
Steps to Reproduce:
1. Install cluster of version 4.16.11 2. Upgrade the cluster from web-console to the next-minor version 4.16.13 3. Try to access the node terminal from UI
Actual results:
Showing warning while accessing the node terminal.
Expected results:
Does not show any warning.
Additional info:
Description of problem:
console-operator is fetching the organization ID from OCM on every sync call, which is too often. We need to reduce the fetch period.
Version-Release number of selected component (if applicable):
4.16
How reproducible:
Always
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
This is a clone of issue OCPBUGS-41776. The following is the description of the original issue:
—
Description of problem:
The section is: https://docs.openshift.com/container-platform/4.16/installing/installing_aws/ipi/installing-aws-vpc.html#installation-aws-arm-tested-machine-types_installing-aws-vpc All tested ARM instances for 4.14+: c6g.* c7g.* m6g.* m7g.* r8g.* We need to ensure that the "Tested instance types for AWS on 64-bit ARM infrastructures" section has been updated for 4.14+ in all relevant documentation sections.
Additional info:
Description of problem:
Found a panic at the end of the catalog-operator/catalog-operator/logs/previous.log 2024-07-23T23:37:48.446406276Z panic: runtime error: invalid memory address or nil pointer dereference
Version-Release number of selected component (if applicable):
Cluster profile: aws with ipi installation with localzone and fips on 4.17.0-0.nightly-2024-07-20-191204
How reproducible:
once
Steps to Reproduce:
Searched for the panic in the must-gather log files.
Actual results:
Panic occurred with catalog-operator, seems to have caused it to restart, $ tail -20 namespaces/openshift-operator-lifecycle-manager/pods/catalog-operator-77c8dd875-d4dpf/catalog-operator/catalog-operator/logs/previous.log 2024-07-23T23:37:48.425902169Z time="2024-07-23T23:37:48Z" level=info msg="of 1 pods matching label selector, 1 have the correct images and matching hash" correctHash=true correctImages=true current-pod.name=certified-operators-rrm5v current-pod.namespace=openshift-marketplace 2024-07-23T23:37:48.440899013Z time="2024-07-23T23:37:48Z" level=error msg="error updating InstallPlan status" id=a9RUB ip=install-spcrz namespace=e2e-test-storage-lso-h9nqf phase=Installing updateError="Operation cannot be fulfilled on installplans.operators.coreos.com \"install-spcrz\": the object has been modified; please apply your changes to the latest version and try again" 2024-07-23T23:37:48.446406276Z panic: runtime error: invalid memory address or nil pointer dereference 2024-07-23T23:37:48.446406276Z [signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x1ef8d9b] 2024-07-23T23:37:48.446406276Z 2024-07-23T23:37:48.446406276Z goroutine 273 [running]: 2024-07-23T23:37:48.446406276Z github.com/operator-framework/operator-lifecycle-manager/pkg/controller/operators/catalog.(*Operator).syncInstallPlans(0xc000212480, {0x25504a0?, 0xc000328000?}) 2024-07-23T23:37:48.446406276Z /build/vendor/github.com/operator-framework/operator-lifecycle-manager/pkg/controller/operators/catalog/operator.go:2012 +0xb9b 2024-07-23T23:37:48.446406276Z github.com/operator-framework/operator-lifecycle-manager/pkg/controller/operators/catalog.NewOperator.LegacySyncHandler.ToSyncer.LegacySyncHandler.ToSyncerWithDelete.func107({0x20?, 0x2383f40?}, {0x298e2d0, 0xc002cf0140}) 2024-07-23T23:37:48.446406276Z /build/vendor/github.com/operator-framework/operator-lifecycle-manager/pkg/lib/queueinformer/queueinformer.go:181 +0xbc 2024-07-23T23:37:48.446406276Z github.com/operator-framework/operator-lifecycle-manager/pkg/lib/kubestate.SyncFunc.Sync(0x2383f40?, {0x29a60b0?, 0xc000719720?}, {0x298e2d0?, 0xc002cf0140?}) 2024-07-23T23:37:48.446406276Z /build/vendor/github.com/operator-framework/operator-lifecycle-manager/pkg/lib/kubestate/kubestate.go:184 +0x37 2024-07-23T23:37:48.446406276Z github.com/operator-framework/operator-lifecycle-manager/pkg/lib/queueinformer.(*QueueInformer).Sync(...) 2024-07-23T23:37:48.446406276Z /build/vendor/github.com/operator-framework/operator-lifecycle-manager/pkg/lib/queueinformer/queueinformer.go:35 2024-07-23T23:37:48.446406276Z github.com/operator-framework/operator-lifecycle-manager/pkg/lib/queueinformer.(*operator).processNextWorkItem(0xc00072a0b0, {0x29a60b0, 0xc000719720}, 0xc0009829c0) 2024-07-23T23:37:48.446406276Z /build/vendor/github.com/operator-framework/operator-lifecycle-manager/pkg/lib/queueinformer/queueinformer_operator.go:316 +0x59f 2024-07-23T23:37:48.446406276Z github.com/operator-framework/operator-lifecycle-manager/pkg/lib/queueinformer.(*operator).worker(...) 2024-07-23T23:37:48.446406276Z /build/vendor/github.com/operator-framework/operator-lifecycle-manager/pkg/lib/queueinformer/queueinformer_operator.go:260 2024-07-23T23:37:48.446406276Z created by github.com/operator-framework/operator-lifecycle-manager/pkg/lib/queueinformer.(*operator).start in goroutine 142 2024-07-23T23:37:48.446406276Z /build/vendor/github.com/operator-framework/operator-lifecycle-manager/pkg/lib/queueinformer/queueinformer_operator.go:250 +0x4e5
Expected results:
The catalog-operator should not panic.
Additional info:
From the e2e [test log summary|https://gcsweb-qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/qe-private-deck/logs/periodic-ci-openshift-openshift-tests-private-release-4.17-amd64-nightly-aws-ipi-localzone-fips-f2/1815851708257931264/artifacts/aws-ipi-localzone-fips-f2/openshift-extended-test/artifacts/extended.log], we got one information that catalog-operator container exited with panic, Jul 23 23:37:49.361 E ns/openshift-operator-lifecycle-manager pod/catalog-operator-77c8dd875-d4dpf node/ip-10-0-24-201.ec2.internal container=catalog-operator container exited with code 2 (Error): d memory address or nil pointer dereference\n[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x1ef8d9b]\n\ngoroutine 273 [running]:\ngithub.com/operator-framework/operator-lifecycle-manager/pkg/controller/operators/catalog.(*Operator).syncInstallPlans(0xc000212480, {0x25504a0?, 0xc000328000?})\n /build/vendor/github.com/operator-framework/operator-lifecycle-manager/pkg/controller/operators/catalog/operator.go:2012 +0xb9b\ngithub.com/operator-framework/operator-lifecycle-manager/pkg/controller/operators/catalog.NewOperator.LegacySyncHandler.ToSyncer.LegacySyncHandler.ToSyncerWithDelete.func107({0x20?, 0x2383f40?}, {0x298e2d0, 0xc002cf0140})\n /build/vendor/github.com/operator-framework/operator-lifecycle-manager/pkg/lib/queueinformer/queueinformer.go:181 +0xbc\ngithub.com/operator-framework/operator-lifecycle-manager/pkg/lib/kubestate.SyncFunc.Sync(0x2383f40?, {0x29a60b0?, 0xc000719720?}, {0x298e2d0?, 0xc002cf0140?})\n /build/vendor/github.com/operator-framework/operator-lifecycle-manager/pkg/lib/kubestate/kubestate.go:184 +0x37\ngithub.com/operator-framework/operator-lifecycle-manager/pkg/lib/queueinformer.(*QueueInformer).Sync(...)\n /build/vendor/github.com/operator-framework/operator-lifecycle-manager/pkg/lib/queueinformer/queueinformer.go:35\ngithub.com/operator-framework/operator-lifecycle-manager/pkg/lib/queueinformer.(*operator).processNextWorkItem(0xc00072a0b0, {0x29a60b0, 0xc000719720}, 0xc0009829c0)\n /build/vendor/github.com/operator-framework/operator-lifecycle-manager/pkg/lib/queueinformer/queueinformer_operator.go:316 +0x59f\ngithub.com/operator-framework/operator-lifecycle-manager/pkg/lib/queueinformer.(*operator).worker(...)\n /build/vendor/github.com/operator-framework/operator-lifecycle-manager/pkg/lib/queueinformer/queueinformer_operator.go:260\ncreated by github.com/operator-framework/operator-lifecycle-manager/pkg/lib/queueinformer.(*operator).start in goroutine 142\n /build/vendor/github.com/operator-framework/operator-lifecycle-manager/pkg/lib/queueinformer/queueinformer_operator.go:250 +0x4e5\n
Description of problem:
When performing a UPI installation, the installer fails with: time="2024-05-29T14:38:59-04:00" level=fatal msg="failed to fetch Cluster API Machine Manifests: failed to generate asset \"Cluster API Machine Manifests\": unable to generate CAPI machines for vSphere unable to get network inventory path: unable to find network ci-vlan-896 in resource pool /cidatacenter/host/cicluster/Resources/ci-op-yrhjini6-9ef4a" If I pre-create the resource pool(s), the installation proceeds.
Version-Release number of selected component (if applicable):
4.16 nightly
How reproducible:
consistently
Steps to Reproduce:
1. Follow documentation to perform a UPI installation 2. Installation will fail during manifest creation 3.
Actual results:
Installation fails
Expected results:
Installation should proceed
Additional info:
https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_release/51894/rehearse-51894-periodic-ci-openshift-release-master-nightly-4.16-e2e-vsphere-upi-zones/1795883271666536448
Description of problem:
E2E flake for all of the test cases in TestMTLSWithCRLs. It's logging:
client_tls_test.go:1076: failed to find host name for default router in route:
Example Flakes:
Search.CI link. Impact is moderate; I've seen 5-6 failures in the last week.
Version-Release number of selected component (if applicable):
4.17
How reproducible:
10-25%
Steps to Reproduce:
1.Run TestMTLSWithCRLs
Actual results:
Fails with "failed to find host name"
Expected results:
Shouldn't fail.
Additional info:
It appears the logic for `getRouteHost` is incorrect. There is a poll loop that waits for the host to become non-empty, but `getRouteHost` returns a Fatal if it can't find it, so the poll is useless.
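A minimal Go sketch of the pattern being described (names and helper signature are illustrative, not the actual test code): polling for a non-empty host is pointless when the lookup itself aborts the test.

package example

import (
	"context"
	"testing"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// getRouteHost is a stand-in for the helper described above: it aborts the
// test instead of reporting "not found yet", so callers never see "".
func getRouteHost(t *testing.T, lookup func() string) string {
	host := lookup()
	if host == "" {
		t.Fatal("failed to find host name for default router in route")
	}
	return host
}

// waitForRouteHost shows why the surrounding poll loop is useless: the
// condition can never observe an empty host because t.Fatal above already
// ended the test. Returning an error (or "") from the helper and letting
// the poll retry would address the flake.
func waitForRouteHost(t *testing.T, lookup func() string) string {
	var host string
	_ = wait.PollUntilContextTimeout(context.Background(), time.Second, time.Minute, true,
		func(ctx context.Context) (bool, error) {
			host = getRouteHost(t, lookup)
			return host != "", nil
		})
	return host
}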
This is a clone of issue OCPBUGS-39438. The following is the description of the original issue:
—
Description of problem: If a customer applies ethtool configuration to the interface used in br-ex, that configuration will be dropped when br-ex is created. We need to read and apply the configuration from the interface to the phys0 connection profile, as described in https://issues.redhat.com/browse/RHEL-56741?focusedId=25465040&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-25465040
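For context, a minimal sketch of the kind of NMState snippet referred to in step 1 below (the interface name is a placeholder and the exact schema should be checked against the nmstate documentation):

interfaces:
- name: eno1
  type: ethernet
  state: up
  ethtool:
    feature:
      esp-tx-csum-hw-offload: false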
Version-Release number of selected component (if applicable): 4.16
How reproducible: Always
Steps to Reproduce:
1. Deploy a cluster with an NMState config that sets the ethtool.feature.esp-tx-csum-hw-offload field to "off"
2.
3.
Actual results: The ethtool setting is only applied to the interface profile which is disabled after configure-ovs runs
Expected results: The ethtool setting is present on the configure-ovs-created profile
Additional info:
Affected Platforms: VSphere. Probably baremetal too and possibly others.
Description of problem:
router pod is in CrashLoopBackOff after y-stream upgrade from 4.13->4.14
Version-Release number of selected component (if applicable):
How reproducible:
always
Steps to Reproduce:
1. create a cluster with 4.13 2. upgrade HC to 4.14 3.
Actual results:
router pod in CrashLoopBackoff
Expected results:
router pod is running after upgrade HC from 4.13->4.14
Additional info:
images: ====== HO image: 4.15 upgrade HC from 4.13.0-0.nightly-2023-12-19-114348 to 4.14.0-0.nightly-2023-12-19-120138 router pod log: ============== jiezhao-mac:hypershift jiezhao$ oc get pods router-9cfd8b89-plvtc -n clusters-jie-test NAME READY STATUS RESTARTS AGE router-9cfd8b89-plvtc 0/1 CrashLoopBackOff 11 (45s ago) 32m jiezhao-mac:hypershift jiezhao$ Events: Type Reason Age From Message ---- ------ ---- ---- ------- Normal Scheduled 27m default-scheduler Successfully assigned clusters-jie-test/router-9cfd8b89-plvtc to ip-10-0-42-36.us-east-2.compute.internal Normal AddedInterface 27m multus Add eth0 [10.129.2.82/23] from ovn-kubernetes Normal Pulling 27m kubelet Pulling image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:3d2acba15f69ea3648b3c789111db34ff06d9230a4371c5949ebe3c6218e6ea3" Normal Pulled 27m kubelet Successfully pulled image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:3d2acba15f69ea3648b3c789111db34ff06d9230a4371c5949ebe3c6218e6ea3" in 14.309s (14.309s including waiting) Normal Created 26m (x3 over 27m) kubelet Created container private-router Normal Started 26m (x3 over 27m) kubelet Started container private-router Warning BackOff 26m (x5 over 27m) kubelet Back-off restarting failed container private-router in pod router-9cfd8b89-plvtc_clusters-jie-test(e6cf40ad-32cd-438c-8298-62d565cf6c6a) Normal Pulled 26m (x3 over 27m) kubelet Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:3d2acba15f69ea3648b3c789111db34ff06d9230a4371c5949ebe3c6218e6ea3" already present on machine Warning FailedToRetrieveImagePullSecret 2m38s (x131 over 27m) kubelet Unable to retrieve some image pull secrets (router-dockercfg-q768b); attempting to pull the image may not succeed. jiezhao-mac:hypershift jiezhao$ jiezhao-mac:hypershift jiezhao$ oc logs router-9cfd8b89-plvtc -n clusters-jie-test [NOTICE] (1) : haproxy version is 2.6.13-234aa6d [NOTICE] (1) : path to executable is /usr/sbin/haproxy [ALERT] (1) : config : [/usr/local/etc/haproxy/haproxy.cfg:52] : 'server ovnkube_sbdb/ovnkube_sbdb' : could not resolve address 'None'. [ALERT] (1) : config : Failed to initialize server(s) addr. jiezhao-mac:hypershift jiezhao$ notes: ===== not sure if it has the same root cause as https://issues.redhat.com/browse/OCPBUGS-24627
Description of problem:
When Cypress runs in CI, videos showing the test runs are missing (e.g., https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/openshift_console/14106/pull-ci-openshift-console-master-e2e-gcp-console/1820750269743370240/artifacts/e2e-gcp-console/test/). I suspect changes in https://github.com/openshift/console/pull/13937 resulted in the videos not getting properly copied over.
I noticed this error in previous e2e-azure tests:
logger.go:146: 2024-06-05T15:38:14.058Z INFO Successfully created resource group {"name": "example-xwd7d-"}
which causes an issue when you go to create a subnet:
hypershift_framework.go:275: failed to create cluster, tearing down: failed to create infra: failed to create vnet: PUT https://management.azure.com/subscriptions/5f99720c-6823-4792-8a28-69efb0719eea/resourceGroups/example-xwd7d-/providers/Microsoft.Network/virtualNetworks/example-xwd7d-
--------------------------------------------------------------------------------
RESPONSE 400: 400 Bad Request
ERROR CODE: InvalidResourceName
--------------------------------------------------------------------------------
{
"error":
}
-------
Example - failure here.
Please review the following PR: https://github.com/openshift/csi-external-snapshotter/pull/151
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
This is a clone of issue OCPBUGS-42360. The following is the description of the original issue:
—
Description of problem:
Due to https://issues.redhat.com/browse/API-1644, no token is generated for a service account automatically; one step needs to be added to create the token manually.
Version-Release number of selected component (if applicable):
After creating a new service account, one step should be added to create a long-lived API token
How reproducible:
always
Steps to Reproduce:
secret yaml file example: xzha@xzha1-mac OCP-24771 % cat secret.yaml apiVersion: v1 kind: Secret metadata: name: scoped annotations: kubernetes.io/service-account.name: scoped type: kubernetes.io/service-account-token
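Assuming the example secret above, creating it and reading back the generated token might look like this (standard behaviour for kubernetes.io/service-account-token secrets; the namespace is a placeholder):

$ oc apply -f secret.yaml -n <namespace>
$ oc get secret scoped -n <namespace> -o jsonpath='{.data.token}' | base64 -d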
Actual results:
Expected results:
Additional info:
Description of problem:
After we applied the old tlsSecurityProfile to the Hypershift hosted cluster, the apiserver ran into a CrashLoopBackOff failure, which blocked our test.
Version-Release number of selected component (if applicable):
$ oc get clusterversion NAME VERSION AVAILABLE PROGRESSING SINCE STATUS version 4.16.0-0.nightly-2024-03-13-061822 True False 129m Cluster version is 4.16.0-0.nightly-2024-03-13-061822
How reproducible:
always
Steps to Reproduce:
1. Specify KUBECONFIG with kubeconfig of the Hypershift management cluster 2. hostedcluster=$( oc get -n clusters hostedclusters -o json | jq -r .items[].metadata.name) 3. oc patch hostedcluster $hostedcluster -n clusters --type=merge -p '{"spec": {"configuration": {"apiServer": {"tlsSecurityProfile":{"old":{},"type":"Old"}}}}}' hostedcluster.hypershift.openshift.io/hypershift-ci-270930 patched 4. Checked the tlsSecurityProfile, $ oc get HostedCluster $hostedcluster -n clusters -ojson | jq .spec.configuration.apiServer { "audit": { "profile": "Default" }, "tlsSecurityProfile": { "old": {}, "type": "Old" } }
Actual results:
One of the kube-apiserver of Hosted cluster ran into CrashLoopBackOff, stuck in this status, unable to complete the old tlsSecurityProfile configuration. $ oc get pods -l app=kube-apiserver -n clusters-${hostedcluster} NAME READY STATUS RESTARTS AGE kube-apiserver-5b6fc94b64-c575p 5/5 Running 0 70m kube-apiserver-5b6fc94b64-tvwtl 5/5 Running 0 70m kube-apiserver-84c7c8dd9d-pnvvk 4/5 CrashLoopBackOff 6 (20s ago) 7m38s
Expected results:
Applying the old tlsSecurityProfile should be successful.
Additional info:
This also can be reproduced on 4.14, 4.15. We have the last passed log of the test case as below: passed API_Server 2024-02-19 13:34:25(UTC) aws 4.14.0-0.nightly-2024-02-18-123855 hypershift passed API_Server 2024-02-08 02:24:15(UTC) aws 4.15.0-0.nightly-2024-02-07-062935 hypershift passed API_Server 2024-02-17 08:33:37(UTC) aws 4.16.0-0.nightly-2024-02-08-073857 hypershift From the history of the test, it seems that some code changes were introduced in February that caused the bug.
This is a clone of issue OCPBUGS-38289. The following is the description of the original issue:
—
Description of problem:
The cluster-wide proxy URL is injected automatically into the remote-write config of the Prometheus k8s CR in the openshift-monitoring project, which is expected, but the noProxy URLs are not. As a result, if the remote-write endpoint is in the noProxy region, metrics are not transferred.
Version-Release number of selected component (if applicable):
RHOCP 4.16.4
How reproducible:
100%
Steps to Reproduce:
1. Configure proxy custom resource in RHOCP 4.16.4 cluster 2. Create cluster-monitoring-config configmap in openshift-monitoring project 3. Inject remote-write config (without specifically configuring proxy for remote-write) 4. After saving the modification in cluster-monitoring-config configmap, check the remoteWrite config in Prometheus k8s CR. Now it contains the proxyUrl but NOT the noProxy URL(referenced from cluster proxy). Example snippet: ============== apiVersion: monitoring.coreos.com/v1 kind: Prometheus metadata: [...] name: k8s namespace: openshift-monitoring spec: [...] remoteWrite: - proxyUrl: http://proxy.abc.com:8080 <<<<<====== Injected Automatically but there is no noProxy URL. url: http://test-remotewrite.test.svc.cluster.local:9090
Actual results:
The proxy URL from proxy CR is getting injected in Prometheus k8s CR automatically when configuring remoteWrite but it doesn't have noProxy inherited from cluster proxy resource.
Expected results:
The noProxy URL should get injected in Prometheus k8s CR as well.
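For illustration, the expected shape would be roughly the following (url and proxyUrl are taken from the example above; the noProxy value is an illustrative placeholder, and the exact remote-write proxy field names should be verified against the Prometheus CR schema in use):

remoteWrite:
- url: http://test-remotewrite.test.svc.cluster.local:9090
  proxyUrl: http://proxy.abc.com:8080
  noProxy: .cluster.local,.svc,10.0.0.0/16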
Additional info:
We need to document properly:
During 4.16, we tested the machine series "A3" and "C3D", so add them to "Tested instance types for GCP".
Description of problem:
The current api version used by the registry operator does not include the recently added "ChunkSizeMiB" feature gate. We need to bump the openshift/api to latest so that this feature gate becomes available for use. Initialize the feature "ChunkSizeMiB" behind feature gate as TechPreviewNoUpgrade
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
https://issues.redhat.com//browse/IR-471
Description of problem:
When running the bootstrap e2e test, the featuregate does not yet have a value when the controllers' Run() is called.
In the actual (non-bootstrap) code path, the featuregate is ready before the controllers' Run() is called.
Version-Release number of selected component (if applicable):
How reproducible:
The bootstrap test log of commit 2092c9e has an error fetching featuregate inside controller Run().
I0221 18:34:00.360752 17716 container_runtime_config_controller.go:235] imageverification sigstore FeatureGates: false, error: featureGates not yet observed
Steps to Reproduce:
1. Add a function call inside the containerruntimeconfig controller Run() function: featureGates, err := ctrl.featureGateAccess.CurrentFeatureGates(). Print out the error message. 2. Run the e2e bootstrap test: ci/prow/bootstrap-unit
Actual results:
The function in step 1 returns error: featureGates not yet observed
Expected results:
featureGateAccess.CurrentFeatureGates() should not return the "not yet observed" error and should return the feature gates.
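One way to tolerate the race in the bootstrap path, shown as a rough Go sketch (the interfaces are illustrative stand-ins for the featureGateAccess value mentioned in this report, not the real library-go types), is to poll until the gates have been observed instead of failing the sync:

package controller

import (
	"context"
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// Illustrative stand-ins; the real accessor comes from openshift/library-go.
type FeatureGates interface {
	Enabled(name string) bool
}

type FeatureGateAccess interface {
	CurrentFeatureGates() (FeatureGates, error)
}

// waitForFeatureGates retries CurrentFeatureGates() until the gates have been
// observed instead of surfacing "featureGates not yet observed" to the caller.
func waitForFeatureGates(ctx context.Context, access FeatureGateAccess) (FeatureGates, error) {
	var gates FeatureGates
	err := wait.PollUntilContextTimeout(ctx, time.Second, time.Minute, true,
		func(ctx context.Context) (bool, error) {
			fg, err := access.CurrentFeatureGates()
			if err != nil {
				// Not observed yet (e.g. in the bootstrap test environment); retry.
				return false, nil
			}
			gates = fg
			return true, nil
		})
	if err != nil {
		return nil, fmt.Errorf("feature gates never became available: %w", err)
	}
	return gates, nil
}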
Additional info:
Description of problem:
A node has been cordoned manually. After several days, the machine-config-controller uncordoned the same node after rendering a new machine-config.
Version-Release number of selected component (if applicable):
4.13
Actual results:
The MCO rolled out the new config and the node was uncordoned by the MCO.
Expected results:
The MCO should treat an unschedulable node as not ready for performing the update. Also, it may halt the update on other nodes in the pool based on what maxUnavailable is set to for that pool.
Additional info:
This is a clone of issue OCPBUGS-41824. The following is the description of the original issue:
—
Description of problem:
The kubeconfigs for the DNS Operator and the Ingress Operator are managed by Hypershift and they should only be managed by the cloud service provider. This can lead to the kubeconfig/certificate being invalid in the cases where the cloud service provider further manages the kubeconfig (for example ca-rotation).
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Please review the following PR: https://github.com/openshift/kubevirt-csi-driver/pull/41
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
Specifying N2D machine types for compute and controlPlane machines, with "confidentialCompute: Enabled", "create cluster" got the error "Confidential Instance Config is only supported for compatible cpu platforms" [1], while the real cause is the missing setting "onHostMaintenance: Terminate". That being said, the 4.16 error is misleading; we suggest being consistent with the 4.15 [2] / 4.14 [3] error messages. FYI, Confidential VM is supported on N2D machine types (see https://cloud.google.com/confidential-computing/confidential-vm/docs/supported-configurations#machine-type-cpu-zone).
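For reference, the working combination described here would look roughly like this in install-config.yaml (the machine type is an example; field names follow the installer's GCP machine-pool schema as we understand it):

controlPlane:
  name: master
  platform:
    gcp:
      type: n2d-standard-4
      confidentialCompute: Enabled
      onHostMaintenance: Terminate
compute:
- name: worker
  platform:
    gcp:
      type: n2d-standard-4
      confidentialCompute: Enabled
      onHostMaintenance: Terminate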
Version-Release number of selected component (if applicable):
4.16.0-0.nightly-2024-05-21-221942
How reproducible:
Always
Steps to Reproduce:
1. Please refer to [1]
Actual results:
The error message is like "Confidential Instance Config is only supported for compatible cpu platforms", which is mis-leading.
Expected results:
4.15 [2] / 4.14 [3] error messages, which look better.
Additional info:
FYI it is about QE test case OCP-60212 scenario b.
Please review the following PR: https://github.com/openshift/azure-file-csi-driver/pull/68
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
This is a clone of issue OCPBUGS-39096. The following is the description of the original issue:
—
Description of problem:
CNO doesn't report, as a metric, when there is a network overlap at the time live migration is initiated.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
This is a clone of issue OCPBUGS-39231. The following is the description of the original issue:
—
Description of problem:
Feature: https://issues.redhat.com/browse/MGMT-18411 went into assisted-installer v2.34.0 but is apparently not included in any OpenShift version to be used in ABI installation.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. Went through a loop to verify the different commits to check if this is delivered in any OCP version. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Hiding the version is a good security practice
Additional info:
Description of the problem:
For LVMS we need an additional disk for each worker node (in case there are worker nodes, else for each master node).
It is currently possible to attach a bootable disk with data to a worker node, select "skip formatting", and the LVMS requirement is satisfied, so it is possible to start the installation.
How reproducible:
Steps to reproduce:
1. Create a cluster with 3 masters and 3 workers
2. Attach 1 additional disk to each worker node
3. On one of the worker nodes, make sure that the disk has a file system and contains data
4. For that disk, select skip formatting
Actual results:
The issue here is that the disk which will be used for LVMS will not be formatted and will still contain the existing file system and data.
Expected results:
In that scenario, the LVMS requirement should turn to failed, since the disk which AI is planning to use for LVMS has a file system and may cause installation issues.
Description of problem:
Enabling KMS for IBM Cloud will result in the kube-apiserver failing with the following configuration error: 17:45:45 E0711 17:43:00.264407 1 run.go:74] "command failed" err="error while parsing file: resources[0].providers[0]: Invalid value: config.ProviderConfiguration{AESGCM:(*config.AESConfiguration)(nil), AESCBC:(*config.AESConfiguration)(nil), Secretbox:(*config.SecretboxConfiguration)(nil), Identity:(*config.IdentityConfiguration)(0x89b4c60), KMS:(*config.KMSConfiguration)(0xc000ff1900)}: more than one provider specified in a single element, should split into different list elements"
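For context, the error is about the shape of the apiserver encryption configuration: each provider must be its own list element. A minimal sketch of the expected layout (KMS plugin name and socket path are placeholders):

apiVersion: apiserver.config.k8s.io/v1
kind: EncryptionConfiguration
resources:
- resources:
  - secrets
  providers:
  - kms:
      apiVersion: v2
      name: ibm-kms-plugin
      endpoint: unix:///var/run/kmsplugin/socket.sock
      timeout: 3s
  - identity: {}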
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
This is a clone of issue OCPBUGS-38436. The following is the description of the original issue:
—
Description of problem:
e980 is a valid system type for the Madrid (mad02) region, but it is not listed as such in the installer.
Version-Release number of selected component (if applicable):
How reproducible:
Easily
Steps to Reproduce:
1. Try to deploy to mad02 with SysType set to e980 2. Fail 3.
Actual results:
Installer exits
Expected results:
Installer should continue as it's a valid system type.
Additional info:
This fix contains the following changes coming from updated version of kubernetes up to v1.30.4:
Changelog:
v1.30.4: https://github.com/kubernetes/kubernetes/blob/release-1.30/CHANGELOG/CHANGELOG-1.30.md#changelog-since-v1303
Description of problem:
Security baselines such as CIS do not recommend using secrets as environment variables, but using files. 5.4.1 Prefer using secrets as files over secrets as environmen... | Tenable® https://www.tenable.com/audits/items/CIS_Kubernetes_v1.6.1_Level_2_Master.audit:98de3da69271994afb6211cf86ae4c6b Secrets in Kubernetes must not be stored as environment variables. https://www.stigviewer.com/stig/kubernetes/2021-04-14/finding/V-242415 However, metal3 and metal3-image-customization Pods are using environment variables. $ oc get pod -A -o jsonpath='{range .items[?(@..secretKeyRef)]} {.kind} {.metadata.name} {"\n"}{end}' | grep metal3 Pod metal3-66b59bbb76-8xzl7 Pod metal3-image-customization-965f5c8fc-h8zrk
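For context, the CIS recommendation is to mount secrets as files rather than expose them via env; a generic sketch of the two patterns (names are placeholders, not the actual metal3 manifests):

# discouraged: secret value exposed as an environment variable
env:
- name: IRONIC_PASSWORD
  valueFrom:
    secretKeyRef:
      name: ironic-credentials
      key: password

# preferred: secret mounted as a read-only file
volumeMounts:
- name: ironic-credentials
  mountPath: /etc/ironic-credentials
  readOnly: true
volumes:
- name: ironic-credentials
  secret:
    secretName: ironic-credentials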
Version-Release number of selected component (if applicable):
4.14, 4.13, 4.12
How reproducible:
100%
Steps to Reproduce:
1. Install a new cluster using baremetal IPI 2. Run a compliance scan using compliance operator[1], or just look at the manifest of metal3 or metal3-image-customization pod [1] https://docs.openshift.com/container-platform/4.14/security/compliance_operator/co-overview.html
Actual results:
Not compliant to CIS or other security baselines
Expected results:
Compliant to CIS or other security baselines
Additional info:
Description of problem:
When images have been skipped and no images have been mirrored i see idms and itms are generated. 2024/05/15 15:38:25 [WARN] : ⚠️ --v2 flag identified, flow redirected to the oc-mirror v2 version. This is Tech Preview, it is still under development and it is not production ready. 2024/05/15 15:38:25 [INFO] : 👋 Hello, welcome to oc-mirror 2024/05/15 15:38:25 [INFO] : ⚙️ setting up the environment for you... 2024/05/15 15:38:25 [INFO] : 🔀 workflow mode: mirrorToMirror 2024/05/15 15:38:25 [INFO] : 🕵️ going to discover the necessary images... 2024/05/15 15:38:25 [INFO] : 🔍 collecting release images... 2024/05/15 15:38:25 [INFO] : 🔍 collecting operator images... 2024/05/15 15:38:25 [INFO] : 🔍 collecting additional images... 2024/05/15 15:38:25 [WARN] : [AdditionalImagesCollector] mirroring skipped : source image quay.io/cilium/cilium-etcd-operator:v2.0.7@sha256:04b8327f7f992693c2cb483b999041ed8f92efc8e14f2a5f3ab95574a65ea2dc has both tag and digest 2024/05/15 15:38:25 [WARN] : [AdditionalImagesCollector] mirroring skipped : source image quay.io/coreos/etcd:v3.5.4@sha256:a67fb152d4c53223e96e818420c37f11d05c2d92cf62c05ca5604066c37295e9 has both tag and digest 2024/05/15 15:38:25 [INFO] : 🚀 Start copying the images... 2024/05/15 15:38:25 [INFO] : === Results === 2024/05/15 15:38:25 [INFO] : All release images mirrored successfully 0 / 0 ✅ 2024/05/15 15:38:25 [INFO] : All operator images mirrored successfully 0 / 0 ✅ 2024/05/15 15:38:25 [INFO] : All additional images mirrored successfully 0 / 0 ✅ 2024/05/15 15:38:25 [INFO] : 📄 Generating IDMS and ITMS files... 2024/05/15 15:38:25 [INFO] : /app1/knarra/customertest1/working-dir/cluster-resources/idms-oc-mirror.yaml file created 2024/05/15 15:38:25 [INFO] : 📄 Generating CatalogSource file... 2024/05/15 15:38:25 [INFO] : mirror time : 715.644µs 2024/05/15 15:38:25 [INFO] : 👋 Goodbye, thank you for using oc-mirror [fedora@preserve-fedora36 knarra]$ ls -l /app1/knarra/customertest1/working-dir/cluster-resources/idms-oc-mirror.yaml -rw-r--r--. 1 fedora fedora 0 May 15 15:38 /app1/knarra/customertest1/working-dir/cluster-resources/idms-oc-mirror.yaml [fedora@preserve-fedora36 knarra]$ cat /app1/knarra/customertest1/working-dir/cluster-resources/idms-oc-mirror.yaml
Version-Release number of selected component (if applicable):
4.16 oc-mirror
How reproducible:
Always
Steps to Reproduce:
1. Use the following imageSetConfig.yaml and run command `./oc-mirror --v2 -c /tmp/bug331961.yaml --workspace file:///app1/knarra/customertest1 docker://localhost:5000/bug331961 --dest-tls-verify=false` cat /tmp/imageSetConfig.yaml kind: ImageSetConfiguration apiVersion: mirror.openshift.io/v2alpha1 mirror: additionalImages: - name: quay.io/cilium/cilium-etcd-operator:v2.0.7@sha256:04b8327f7f992693c2cb483b999041ed8f92efc8e14f2a5f3ab95574a65ea2dc - name: quay.io/coreos/etcd:v3.5.4@sha256:a67fb152d4c53223e96e818420c37f11d05c2d92cf62c05ca5604066c37295e9
Actual results:
Nothing will be mirrored and the listed images will be skipped, as these images have both tag and digest, but I see empty idms and itms files being generated.
Expected results:
If nothing is mirrored, idms and itms files should not be generated.
Additional info:
https://issues.redhat.com/browse/OCPBUGS-33196
Description of problem:
There is a regression found with libreswan 4.9 and later versions which breaks the IPsec tunnel and makes pod-to-pod traffic fail intermittently. This issue is not seen with libreswan 4.5. So we must provide flexibility for users to install their own IPsec machine config and choose their own libreswan version, instead of sticking with the CNO-managed IPsec machine config, which installs the libreswan version that comes with the RHCOS distro.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
As a temp workaround for https://issues.redhat.com//browse/OCPBUGS-23516, https://github.com/openshift/cluster-monitoring-operator/pull/2186/files was merged.
https://issues.redhat.com/browse/OCPBUGS-26933 fixed the root issue from the console side.
To enable caching back and simplify nginx config, let's revert https://github.com/openshift/cluster-monitoring-operator/pull/2186/files and see what happens.
We need to document properly:
This is a clone of issue OCPBUGS-38051. The following is the description of the original issue:
—
Description of problem:
Information on the Lightspeed modal is not as clear as it could be for users to understand what to do next. Users should also have a very clear way to disable it, and those options are not obvious.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
This is a clone of issue OCPBUGS-41778. The following is the description of the original issue:
—
TRT has detected a consistent long term trend where the oauth-apiserver appears to have more disruption than it did in 4.16, for minor upgrades on azure.
The problem appears roughly above the 90th percentile; we picked it up at P95, where it shows a consistent 5-8s more disruption than we'd expect given the data at 4.16 GA.
The problem hits ONLY oauth, affecting both new and reused connections, as well as the cached variants meaning etcd should be out of the picture. You'll see a few very short blips where all four of these backends lose connectivity for ~1s throughout the run, several times over. It looks like it may be correlated to the oauth-operator reporting:
source/OperatorAvailable display/true condition/Available reason/APIServices_Error status/False APIServicesAvailable: apiservices.apiregistration.k8s.io/v1.user.openshift.io: not available: endpoints for service/api in "openshift-oauth-apiserver" have no addresses with port name "https" [2s]
Sample jobs:
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-azure-ovn-upgrade/1827669509775822848
Intervals: https://sippy.dptools.openshift.org/sippy-ng/job_runs/1827669509775822848/periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-azure-ovn-upgrade/intervals?filterText=&intervalFile=e2e-timelines_spyglass_20240825-122854.json&overrideDisplayFlag=1&selectedSources=OperatorAvailable&selectedSources=OperatorDegraded&selectedSources=KubeletLog&selectedSources=EtcdLog&selectedSources=EtcdLeadership&selectedSources=Alert&selectedSources=Disruption&selectedSources=E2EFailed&selectedSources=APIServerGracefulShutdown&selectedSources=KubeEvent&selectedSources=NodeState
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-azure-ovn-upgrade/1827669493837467648
Intervals: https://sippy.dptools.openshift.org/sippy-ng/job_runs/1827669493837467648/periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-azure-ovn-upgrade/intervals?filterText=&intervalFile=e2e-timelines_spyglass_20240825-122623.json&overrideDisplayFlag=1&selectedSources=OperatorAvailable&selectedSources=KubeletLog&selectedSources=EtcdLog&selectedSources=EtcdLeadership&selectedSources=Alert&selectedSources=Disruption&selectedSources=E2EFailed&selectedSources=APIServerGracefulShutdown&selectedSources=KubeEvent
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-azure-ovn-upgrade/1827669493837467648
Intervals: https://sippy.dptools.openshift.org/sippy-ng/job_runs/1827669493837467648/periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-azure-ovn-upgrade/intervals?filterText=&intervalFile=e2e-timelines_spyglass_20240825-122623.json&overrideDisplayFlag=1&selectedSources=OperatorAvailable&selectedSources=OperatorProgressing&selectedSources=OperatorDegraded&selectedSources=KubeletLog&selectedSources=EtcdLog&selectedSources=EtcdLeadership&selectedSources=Alert&selectedSources=Disruption&selectedSources=E2EFailed&selectedSources=APIServerGracefulShutdown&selectedSources=KubeEvent&selectedSources=NodeState
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-azure-ovn-upgrade/1827077182283845632
Intervals: https://sippy.dptools.openshift.org/sippy-ng/job_runs/1827077182283845632/periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-azure-ovn-upgrade/intervals?filterText=&intervalFile=e2e-timelines_spyglass_20240823-212127.json&overrideDisplayFlag=1&selectedSources=OperatorDegraded&selectedSources=EtcdLog&selectedSources=Disruption&selectedSources=E2EFailed
More can be found using the first link to the dashboard in this post and scrolling down to most recent job runs, and looking for high numbers.
The operator Degraded condition is probably the strongest symptom to pursue, as it appears in most of the above.
If you find any runs where other backends are disrupted, especially kube-api, I would suggest ignoring those as they are unlikely to be the same fingerprint as the error being described here.
Description of problem:
`make test` is failing in openshift/coredns repo due to TestImportOrdering failure. This is due to the recent addition of the github.com/openshift/coredns-ocp-dnsnameresolver external plugin and the fact that CoreDNS doesn't generate zplugin.go formatted correctly so TestImportOrdering fails after generation.
Version-Release number of selected component (if applicable):
4.16-4.17
How reproducible:
100%
Steps to Reproduce:
1. make test
Actual results:
TestImportOrdering failure
Expected results:
TestImportOrdering should not fail
Additional info:
I created an upstream issue and PR: https://github.com/coredns/coredns/pull/6692 which recently merged. We will just need to carry-patch this in 4.17 and 4.16. The CoreDNS 1.11.3 rebase https://github.com/openshift/coredns/pull/118 is blocked on this.
Description of problem:
The installer will not add some ports needed for private clusters.
Version-Release number of selected component (if applicable):
How reproducible:
Easily
Steps to Reproduce:
1. Create a VPC with a default security group 2. Deploy a private cluster 3. Fail
Actual results:
COs cannot use necessary ports
Expected results:
cluster can fully deploy without manually adding ports
Additional info:
Description of problem:
Currently we show the debug container action only for pods that are failing. We should also show the action for pods in the 'Succeeded' phase.
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. Log in to a cluster 2. Create an example Job resource 3. Check the job's pod and wait till it is in 'Succeeded' phase
Actual results:
Debug container action is not available, on the pod's Logs page
Expected results:
Debug container action is available, on the pod's Logs page
Additional info:
Since users are looking for this feature for pods in any phase, we are treating this issue as a bug. Related stories: RFE - https://issues.redhat.com/browse/RFE-1935 STORY - https://issues.redhat.com/browse/CONSOLE-4057 Code that needs to be removed - https://github.com/openshift/console/blob/ae115a9e8c72f930a67ee0c545d36f883cd6be34/frontend/public/components/utils/resource-log.tsx#L149-L151
Description of problem:
When publish: internal is set, bootstrap SSH rules are still open to the public internet (0.0.0.0/0) instead of being restricted to the machine CIDR.
Version-Release number of selected component (if applicable):
How reproducible:
all private clusters
Steps to Reproduce:
1. set publish: internal in installconfig 2. inspect ssh rule 3.
Actual results:
ssh is open to public internet
Expected results:
should be restricted to machine network
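For illustration, checking and tightening the rule with the AWS CLI might look like this (the security group ID and machine CIDR are placeholders):

# Inspect the SSH rule on the bootstrap security group
aws ec2 describe-security-groups --group-ids sg-0123456789abcdef0 \
  --query 'SecurityGroups[].IpPermissions[?FromPort==`22`]'

# Expected shape: SSH ingress limited to the machine network, not 0.0.0.0/0
aws ec2 authorize-security-group-ingress --group-id sg-0123456789abcdef0 \
  --protocol tcp --port 22 --cidr 10.0.0.0/16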
Additional info:
This is a clone of issue OCPBUGS-38918. The following is the description of the original issue:
—
Description of problem:
When installing OpenShift 4.16 on vSphere using IPI method with a template it fails with below error: 2024-08-07T09:55:51.4052628Z "level=debug msg= Fetching Image...", 2024-08-07T09:55:51.4054373Z "level=debug msg= Reusing previously-fetched Image", 2024-08-07T09:55:51.4056002Z "level=debug msg= Fetching Common Manifests...", 2024-08-07T09:55:51.4057737Z "level=debug msg= Reusing previously-fetched Common Manifests", 2024-08-07T09:55:51.4059368Z "level=debug msg=Generating Cluster...", 2024-08-07T09:55:51.4060988Z "level=info msg=Creating infrastructure resources...", 2024-08-07T09:55:51.4063254Z "level=debug msg=Obtaining RHCOS image file from 'https://rhcos.mirror.openshift.com/art/storage/prod/streams/4.16-9.4/builds/416.94.202406251923-0/x86_64/rhcos-416.94.202406251923-0-vmware.x86_64.ova?sha256=893a41653b66170c7d7e9b343ad6e188ccd5f33b377f0bd0f9693288ec6b1b73'", 2024-08-07T09:55:51.4065349Z "level=debug msg=image download content length: 12169", 2024-08-07T09:55:51.4066994Z "level=debug msg=image download content length: 12169", 2024-08-07T09:55:51.4068612Z "level=debug msg=image download content length: 12169", 2024-08-07T09:55:51.4070676Z "level=error msg=failed to fetch Cluster: failed to generate asset \"Cluster\": failed to create cluster: failed during pre-provisioning: failed to use cached vsphere image: bad status: 403"
Version-Release number of selected component (if applicable):
4.16
How reproducible:
All the time in user environment
Steps to Reproduce:
1.Try to install disconnected IPI install on vSphere using a template. 2. 3.
Actual results:
No cluster installation
Expected results:
Cluster installed with indicated template
Additional info:
- 4.14 works as expected in customer environment - 4.15 works as expected in customer environment
This is a clone of issue OCPBUGS-38006. The following is the description of the original issue:
—
Description of problem:
Sometimes the cluster-capi-operator pod is stuck in CrashLoopBackOff on OSP.
Version-Release number of selected component (if applicable):
4.17.0-0.nightly-2024-08-01-213905
How reproducible:
Sometimes
Steps to Reproduce:
1.Create an osp cluster with TechPreviewNoUpgrade 2.Check cluster-capi-operator pod 3.
Actual results:
cluster-capi-operator pod in CrashLoopBackOff status $ oc get po cluster-capi-operator-74dfcfcb9d-7gk98 0/1 CrashLoopBackOff 6 (2m54s ago) 41m $ oc get po cluster-capi-operator-74dfcfcb9d-7gk98 1/1 Running 7 (7m52s ago) 46m $ oc get po cluster-capi-operator-74dfcfcb9d-7gk98 0/1 CrashLoopBackOff 7 (2m24s ago) 50m E0806 03:44:00.584669 1 kind.go:66] "kind must be registered to the Scheme" err="no kind is registered for the type v1alpha7.OpenStackCluster in scheme \"github.com/openshift/cluster-capi-operator/cmd/cluster-capi-operator/main.go:86\"" logger="controller-runtime.source.EventHandler" E0806 03:44:00.685539 1 controller.go:203] "Could not wait for Cache to sync" err="failed to wait for clusteroperator caches to sync: timed out waiting for cache to be synced for Kind *v1alpha7.OpenStackCluster" controller="clusteroperator" controllerGroup="config.openshift.io" controllerKind="ClusterOperator" I0806 03:44:00.685610 1 internal.go:516] "Stopping and waiting for non leader election runnables" I0806 03:44:00.685620 1 internal.go:520] "Stopping and waiting for leader election runnables" I0806 03:44:00.685646 1 controller.go:240] "Shutdown signal received, waiting for all workers to finish" controller="secret" controllerGroup="" controllerKind="Secret" I0806 03:44:00.685706 1 controller.go:240] "Shutdown signal received, waiting for all workers to finish" controller="cluster" controllerGroup="cluster.x-k8s.io" controllerKind="Cluster" I0806 03:44:00.685712 1 controller.go:242] "All workers finished" controller="cluster" controllerGroup="cluster.x-k8s.io" controllerKind="Cluster" I0806 03:44:00.685717 1 controller.go:240] "Shutdown signal received, waiting for all workers to finish" controller="secret" controllerGroup="" controllerKind="Secret" I0806 03:44:00.685722 1 controller.go:242] "All workers finished" controller="secret" controllerGroup="" controllerKind="Secret" I0806 03:44:00.685718 1 controller.go:242] "All workers finished" controller="secret" controllerGroup="" controllerKind="Secret" I0806 03:44:00.685720 1 controller.go:240] "Shutdown signal received, waiting for all workers to finish" controller="clusteroperator" controllerGroup="config.openshift.io" controllerKind="ClusterOperator" I0806 03:44:00.685823 1 recorder_in_memory.go:80] &Event{ObjectMeta:{dummy.17e906d425f7b2e1 dummy 0 0001-01-01 00:00:00 +0000 UTC <nil> <nil> map[] map[] [] [] []},InvolvedObject:ObjectReference{Kind:Pod,Namespace:dummy,Name:dummy,UID:,APIVersion:v1,ResourceVersion:,FieldPath:,},Reason:CustomResourceDefinitionUpdateFailed,Message:Failed to update CustomResourceDefinition.apiextensions.k8s.io/openstackclusters.infrastructure.cluster.x-k8s.io: Put "https://172.30.0.1:443/apis/apiextensions.k8s.io/v1/customresourcedefinitions/openstackclusters.infrastructure.cluster.x-k8s.io": context canceled,Source:EventSource{Component:cluster-capi-operator-capi-installer-apply-client,Host:,},FirstTimestamp:2024-08-06 03:44:00.685748961 +0000 UTC m=+302.946052179,LastTimestamp:2024-08-06 03:44:00.685748961 +0000 UTC m=+302.946052179,Count:1,Type:Warning,EventTime:0001-01-01 00:00:00 +0000 UTC,Series:nil,Action:,Related:nil,ReportingController:,ReportingInstance:,} I0806 03:44:00.719743 1 capi_installer_controller.go:309] "CAPI Installer Controller is Degraded" logger="CapiInstallerController" controller="clusteroperator" controllerGroup="config.openshift.io" controllerKind="ClusterOperator" ClusterOperator="cluster-api" namespace="" name="cluster-api" reconcileID="6fa96361-4dc2-4865-b1b3-f92378c002cc" E0806 
03:44:00.719942 1 controller.go:329] "Reconciler error" err="error during reconcile: failed to set conditions for CAPI Installer controller: failed to sync status: failed to update cluster operator status: client rate limiter Wait returned an error: context canceled" controller="clusteroperator" controllerGroup="config.openshift.io" controllerKind="ClusterOperator" ClusterOperator="cluster-api" namespace="" name="cluster-api" reconcileID="6fa96361-4dc2-4865-b1b3-f92378c002cc"
Expected results:
cluster-capi-operator pod is always Running
Additional info:
Description of problem:
Pseudolocalization is not working in the console.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. Go to any console page and append the '?pseudolocalization=true' suffix to the URL 2. 3.
Actual results:
The page stays in the same language
Expected results:
The page should display the pseudolocalized language
Additional info:
Looks like this is the issue https://github.com/MattBoatman/i18next-pseudo/issues/4
Description of problem:
Follow up the step described in https://github.com/openshift/installer/pull/8350 to destroy bootstrap server manually, failed with error `FATAL error destroying bootstrap resources failed to delete bootstrap machine: machines.cluster.x-k8s.io "jimatest-5sjqx-bootstrap" not found` # ./openshift-install version ./openshift-install 4.16.0-0.nightly-2024-05-15-001800 built from commit 494b79cf906dc192b8d1a6d98e56ce1036ea932f release image registry.ci.openshift.org/ocp/release@sha256:d055d117027aa9afff8af91da4a265b7c595dc3ded73a2bca71c3161b28d9d5d release architecture amd64 On AWS: # ./openshift-install create cluster --dir ipi-aws INFO Credentials loaded from the "default" profile in file "/root/.aws/credentials" WARNING failed to find default instance type: no instance type found for the zone constraint WARNING failed to find default instance type for worker pool: no instance type found for the zone constraint INFO Consuming Install Config from target directory WARNING failed to find default instance type: no instance type found for the zone constraint WARNING FeatureSet "TechPreviewNoUpgrade" is enabled. This FeatureSet does not allow upgrades and may affect the supportability of the cluster. INFO Creating infrastructure resources... INFO Creating IAM roles for control-plane and compute nodes INFO Started local control plane with envtest INFO Stored kubeconfig for envtest in: /tmp/jima/ipi-aws/auth/envtest.kubeconfig INFO Running process: Cluster API with args [-v=2 --diagnostics-address=0 --health-addr=127.0.0.1:44379 --webhook-port=44331 --webhook-cert-dir=/tmp/envtest-serving-certs-1391600832] INFO Running process: aws infrastructure provider with args [-v=4 --diagnostics-address=0 --health-addr=127.0.0.1:42725 --webhook-port=45711 --webhook-cert-dir=/tmp/envtest-serving-certs-1758849099 --feature-gates=BootstrapFormatIgnition=true,ExternalResourceGC=true] INFO Created manifest *v1.Namespace, namespace= name=openshift-cluster-api-guests INFO Created manifest *v1beta2.AWSClusterControllerIdentity, namespace= name=default INFO Created manifest *v1beta1.Cluster, namespace=openshift-cluster-api-guests name=jima16a-2xszh INFO Created manifest *v1beta2.AWSCluster, namespace=openshift-cluster-api-guests name=jima16a-2xszh INFO Waiting up to 15m0s (until 11:01PM EDT) for network infrastructure to become ready... 
INFO Network infrastructure is ready INFO Creating private Hosted Zone INFO Creating Route53 records for control plane load balancer INFO Created manifest *v1beta2.AWSMachine, namespace=openshift-cluster-api-guests name=jima16a-2xszh-bootstrap INFO Created manifest *v1beta2.AWSMachine, namespace=openshift-cluster-api-guests name=jima16a-2xszh-master-0 INFO Created manifest *v1beta2.AWSMachine, namespace=openshift-cluster-api-guests name=jima16a-2xszh-master-1 INFO Created manifest *v1beta2.AWSMachine, namespace=openshift-cluster-api-guests name=jima16a-2xszh-master-2 INFO Created manifest *v1beta1.Machine, namespace=openshift-cluster-api-guests name=jima16a-2xszh-bootstrap INFO Created manifest *v1beta1.Machine, namespace=openshift-cluster-api-guests name=jima16a-2xszh-master-0 INFO Created manifest *v1beta1.Machine, namespace=openshift-cluster-api-guests name=jima16a-2xszh-master-1 INFO Created manifest *v1beta1.Machine, namespace=openshift-cluster-api-guests name=jima16a-2xszh-master-2 INFO Created manifest *v1.Secret, namespace=openshift-cluster-api-guests name=jima16a-2xszh-bootstrap INFO Created manifest *v1.Secret, namespace=openshift-cluster-api-guests name=jima16a-2xszh-master INFO Waiting up to 15m0s (until 11:07PM EDT) for machines to provision... INFO Control-plane machines are ready INFO Cluster API resources have been created. Waiting for cluster to become ready... INFO Waiting up to 20m0s (until 11:12PM EDT) for the Kubernetes API at https://api.jima16a.qe.devcluster.openshift.com:6443... INFO API v1.29.4+4a87b53 up INFO Waiting up to 30m0s (until 11:25PM EDT) for bootstrapping to complete... ^CWARNING Received interrupt signal INFO Shutting down local Cluster API control plane... INFO Stopped controller: Cluster API INFO Stopped controller: aws infrastructure provider INFO Local Cluster API system has completed operations # ./openshift-install destroy bootstrap --dir ipi-aws INFO Started local control plane with envtest INFO Stored kubeconfig for envtest in: /tmp/jima/ipi-aws/auth/envtest.kubeconfig INFO Running process: Cluster API with args [-v=2 --diagnostics-address=0 --health-addr=127.0.0.1:45869 --webhook-port=43141 --webhook-cert-dir=/tmp/envtest-serving-certs-3670728979] INFO Running process: aws infrastructure provider with args [-v=4 --diagnostics-address=0 --health-addr=127.0.0.1:46111 --webhook-port=35061 --webhook-cert-dir=/tmp/envtest-serving-certs-3674093147 --feature-gates=BootstrapFormatIgnition=true,ExternalResourceGC=true] FATAL error destroying bootstrap resources failed to delete bootstrap machine: machines.cluster.x-k8s.io "jima16a-2xszh-bootstrap" not found INFO Shutting down local Cluster API control plane... INFO Stopped controller: Cluster API INFO Stopped controller: aws infrastructure provider INFO Local Cluster API system has completed operations Same issue on vSphere: # ./openshift-install create cluster --dir ipi-vsphere/ INFO Consuming Install Config from target directory WARNING FeatureSet "CustomNoUpgrade" is enabled. This FeatureSet does not allow upgrades and may affect the supportability of the cluster. INFO Creating infrastructure resources... 
INFO Started local control plane with envtest INFO Stored kubeconfig for envtest in: /tmp/jima/ipi-vsphere/auth/envtest.kubeconfig INFO Running process: Cluster API with args [-v=2 --diagnostics-address=0 --health-addr=127.0.0.1:39945 --webhook-port=36529 --webhook-cert-dir=/tmp/envtest-serving-certs-3244100953] INFO Running process: vsphere infrastructure provider with args [-v=2 --diagnostics-address=0 --health-addr=127.0.0.1:45417 --webhook-port=37503 --webhook-cert-dir=/tmp/envtest-serving-certs-3224060135 --leader-elect=false] INFO Created manifest *v1.Namespace, namespace= name=openshift-cluster-api-guests INFO Created manifest *v1beta1.Cluster, namespace=openshift-cluster-api-guests name=jimatest-5sjqx INFO Created manifest *v1beta1.VSphereCluster, namespace=openshift-cluster-api-guests name=jimatest-5sjqx INFO Created manifest *v1.Secret, namespace=openshift-cluster-api-guests name=vsphere-creds INFO Waiting up to 15m0s (until 10:47PM EDT) for network infrastructure to become ready... INFO Network infrastructure is ready INFO Created manifest *v1beta1.VSphereMachine, namespace=openshift-cluster-api-guests name=jimatest-5sjqx-bootstrap INFO Created manifest *v1beta1.VSphereMachine, namespace=openshift-cluster-api-guests name=jimatest-5sjqx-master-0 INFO Created manifest *v1beta1.VSphereMachine, namespace=openshift-cluster-api-guests name=jimatest-5sjqx-master-1 INFO Created manifest *v1beta1.VSphereMachine, namespace=openshift-cluster-api-guests name=jimatest-5sjqx-master-2 INFO Created manifest *v1beta1.Machine, namespace=openshift-cluster-api-guests name=jimatest-5sjqx-bootstrap INFO Created manifest *v1beta1.Machine, namespace=openshift-cluster-api-guests name=jimatest-5sjqx-master-0 INFO Created manifest *v1beta1.Machine, namespace=openshift-cluster-api-guests name=jimatest-5sjqx-master-1 INFO Created manifest *v1beta1.Machine, namespace=openshift-cluster-api-guests name=jimatest-5sjqx-master-2 INFO Created manifest *v1.Secret, namespace=openshift-cluster-api-guests name=jimatest-5sjqx-bootstrap INFO Created manifest *v1.Secret, namespace=openshift-cluster-api-guests name=jimatest-5sjqx-master INFO Waiting up to 15m0s (until 10:47PM EDT) for machines to provision... INFO Control-plane machines are ready INFO Cluster API resources have been created. Waiting for cluster to become ready... INFO Waiting up to 20m0s (until 10:57PM EDT) for the Kubernetes API at https://api.jimatest.qe.devcluster.openshift.com:6443... INFO API v1.29.4+4a87b53 up INFO Waiting up to 1h0m0s (until 11:37PM EDT) for bootstrapping to complete... ^CWARNING Received interrupt signal INFO Shutting down local Cluster API control plane... 
INFO Stopped controller: Cluster API INFO Stopped controller: vsphere infrastructure provider INFO Local Cluster API system has completed operations # ./openshift-install destroy bootstrap --dir ipi-vsphere/ INFO Started local control plane with envtest INFO Stored kubeconfig for envtest in: /tmp/jima/ipi-vsphere/auth/envtest.kubeconfig INFO Running process: Cluster API with args [-v=2 --diagnostics-address=0 --health-addr=127.0.0.1:34957 --webhook-port=34511 --webhook-cert-dir=/tmp/envtest-serving-certs-94748118] INFO Running process: vsphere infrastructure provider with args [-v=2 --diagnostics-address=0 --health-addr=127.0.0.1:42073 --webhook-port=46721 --webhook-cert-dir=/tmp/envtest-serving-certs-4091171333 --leader-elect=false] FATAL error destroying bootstrap resources failed to delete bootstrap machine: machines.cluster.x-k8s.io "jimatest-5sjqx-bootstrap" not found INFO Shutting down local Cluster API control plane... INFO Stopped controller: Cluster API INFO Stopped controller: vsphere infrastructure provider INFO Local Cluster API system has completed operations
Version-Release number of selected component (if applicable):
4.16.0-0.nightly-2024-05-15-001800
How reproducible:
Always
Steps to Reproduce:
1. Create a cluster 2. Interrupt the installation while waiting for bootstrap to complete 3. Run "openshift-install destroy bootstrap --dir <dir>" to destroy the bootstrap host manually
Actual results:
Destroying the bootstrap host with 'openshift-install destroy bootstrap --dir <dir>' fails
Expected results:
Bootstrap host is destroyed successfully
Additional info:
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
The customer's Cloud Credential Operator generates millions of the messages below per day in their GCP cluster.
They want to reduce or stop these logs because they consume a large amount of disk space. Their Cloud Credential Operator runs in manual mode.
time="2024-06-21T08:37:42Z" level=warning msg="read-only creds not found, using root creds client" actuator=gcp cr=openshift-cloud-credential-operator/openshift-gcp-ccm secret=openshift-cloud-credential-operator/cloud-credential-operator-gcp-ro-creds time="2024-06-21T08:37:42Z" level=error msg="error creating GCP client" error="Secret \"gcp-credentials\" not found" time="2024-06-21T08:37:42Z" level=error msg="error determining whether a credentials update is needed" actuator=gcp cr=openshift-cloud-credential-operator/openshift-gcp-ccm error="unable to check whether credentialsRequest needs update" time="2024-06-21T08:37:42Z" level=error msg="error syncing credentials: error determining whether a credentials update is needed" controller=credreq cr=openshift-cloud-credential-operator/openshift-gcp-ccm secret=openshift-cloud-controller-manager/gcp-ccm-cloud-credentials time="2024-06-21T08:37:42Z" level=error msg="errored with condition: CredentialsProvisionFailure" controller=credreq cr=openshift-cloud-credential-operator/openshift-gcp-ccm secret=openshift-cloud-controller-manager/gcp-ccm-cloud-credentials time="2024-06-21T08:37:42Z" level=info msg="reconciling clusteroperator status" time="2024-06-21T08:37:42Z" level=info msg="operator detects timed access token enabled cluster (STS, Workload Identity, etc.)" controller=credreq cr=openshift-cloud-credential-operator/openshift-gcp-pd-csi-driver-operator time="2024-06-21T08:37:42Z" level=info msg="syncing credentials request" controller=credreq cr=openshift-cloud-credential-operator/openshift-gcp-pd-csi-driver-operator time="2024-06-21T08:37:42Z" level=warning msg="read-only creds not found, using root creds client" actuator=gcp cr=openshift-cloud-credential-operator/openshift-gcp-pd-csi-driver-operator secret=openshift-cloud-credential-operator/cloud-credential-operator-gcp-ro-creds
Description of problem:
When running ose-tests conformance suites against HyperShift clusters, the suites fail because the `openshift-oauth-apiserver` namespace does not exist.
Version-Release number of selected component (if applicable):
4.15.13
How reproducible:
Consistent
Steps to Reproduce:
1. Create a hypershift cluster 2. Attempt to run an ose-tests suite. For example, the CNI conformance suite documented here: https://access.redhat.com/documentation/en-us/red_hat_software_certification/2024/html/red_hat_software_certification_workflow_guide/con_cni-certification_openshift-sw-cert-workflow-working-with-cloud-native-network-function#running-the-cni-tests_openshift-sw-cert-workflow-working-with-container-network-interface 3. Note errors in logs
Actual results:
ERRO[0352] Finished CollectData for [Jira:"kube-apiserver"] monitor test apiserver-availability collection with not-supported error error="not supported: namespace openshift-oauth-apiserver not present" error running options: failed due to a MonitorTest failureerror: failed due to a MonitorTest failure
Expected results:
No errors
Additional info:
As happened for the ironic container, the ironic-agent container build script needs to be updated for FIPS before we can enable the IPA FIPS option.
This is a clone of issue OCPBUGS-39133. The following is the description of the original issue:
—
Description of problem:
Debugging https://issues.redhat.com/browse/OCPBUGS-36808 (the Metrics API failing some of the disruption checks) and taking https://prow.ci.openshift.org/view/gs/test-platform-results/logs/openshift-cluster-monitoring-operator-2439-ci-4.17-upgrade-from-stable-4.16-e2e-aws-ovn-upgrade/1824454734052855808 as a reproducer of the issue, I think the Kube-aggregator is behind the problem. According to the disruption checks which forward some relevant errors from the apiserver in the logs, looking at one of the new-connections check failures (from https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/openshift-cluster-monitoring-operator-2439-ci-4.17-upgrade-from-stable-4.16-e2e-aws-ovn-upgrade/1824454734052855808/artifacts/e2e-aws-ovn-upgrade-2/openshift-e2e-test/artifacts/junit/backend-disruption_20240816-155051.json) > "Aug 16 *16:43:17.672* - 2s E backend-disruption-name/metrics-api-new-connections connection/new disruption/openshift-tests reason/DisruptionBegan request-audit-id/c62b7d32-856f-49de-86f5-1daed55326b2 backend-disruption-name/metrics-api-new-connections connection/new disruption/openshift-tests stopped responding to GET requests over new connections: error running request: 503 Service Unavailable: error trying to reach service: dial tcp 10.128.2.31:10250: connect: connection refused" The "error trying to reach service" part comes from: https://github.com/kubernetes/kubernetes/blob/b3c725627b15bb69fca01b70848f3427aca4c3ef/staging/src/k8s.io/apimachinery/pkg/util/proxy/transport.go#L105, the apiserver failing to reach the metrics-server Pod, the problem is that the IP "10.128.2.31" corresponds to a Pod that was deleted some milliseconds before (as part of a node update/draining), as we can see in: > 2024-08-16T16:19:43.087Z|00195|binding|INFO|openshift-monitoring_metrics-server-7b9d8c5ddb-dtsmr: Claiming 0a:58:0a:80:02:1f 10.128.2.31 ... I0816 *16:43:17.650083* 2240 kubelet.go:2453] "SyncLoop DELETE" source="api" pods=["openshift-monitoring/metrics-server-7b9d8c5ddb-dtsmr"] ... The apiserver was using a stale IP to reach a Pod that no longer exists, even though a new Pod that had already replaced the other Pod (Metrics API backend runs on 2 Pods), some minutes before, was available. According to OVN, a fresher IP 10.131.0.12 of that Pod was already in the endpoints at that time: > I0816 16:40:24.711048 4651 lb_config.go:1018] Cluster endpoints for openshift-monitoring/metrics-server are: map[TCP/https:{10250 [10.128.2.31 10.131.0.12] []}] *I think, when "10.128.2.31" failed, the apiserver should have fallen back to "10.131.0.12", maybe it waits for some time/retries before doing so, or maybe it wasn't even aware of "10.131.0.12"* AFAIU, we have "--enable-aggregator-routing" set by default https://github.com/openshift/cluster-kube-apiserver-operator/blob/37df1b1f80d3be6036b9e31975ac42fcb21b6447/bindata/assets/config/defaultconfig.yaml#L101-L103 on the apiservers, so instead of forwarding to the metrics-server's service, apiserver directly reaches the Pods. For that it keeps track of the relevant services and endpoints https://github.com/kubernetes/kubernetes/blob/ad8a5f5994c0949b5da4240006d938e533834987/staging/src/k8s.io/kube-aggregator/pkg/apiserver/resolvers.go#L40 bad decisions may be made if the if the services and/or endpoints cache are stale. Looking at the metrics-server (the Metrics API backend) endpoints changes in the apiserver audit logs: > $ grep -hr Event . 
| grep "endpoints/metrics-server" | jq -c 'select( .verb | match("watch|update"))' | jq -r '[.requestReceivedTimestamp,.user.username,.verb] | @tsv' | sort 2024-08-16T15:39:57.575468Z system:serviceaccount:kube-system:endpoint-controller update 2024-08-16T15:40:02.005051Z system:serviceaccount:kube-system:endpoint-controller update 2024-08-16T15:40:35.085330Z system:serviceaccount:kube-system:endpoint-controller update 2024-08-16T15:40:35.128519Z system:serviceaccount:kube-system:endpoint-controller update 2024-08-16T16:19:41.148148Z system:serviceaccount:kube-system:endpoint-controller update 2024-08-16T16:19:47.797420Z system:serviceaccount:kube-system:endpoint-controller update 2024-08-16T16:20:23.051594Z system:serviceaccount:kube-system:endpoint-controller update 2024-08-16T16:20:23.100761Z system:serviceaccount:kube-system:endpoint-controller update 2024-08-16T16:20:23.938927Z system:serviceaccount:kube-system:endpoint-controller update 2024-08-16T16:21:01.699722Z system:serviceaccount:kube-system:endpoint-controller update 2024-08-16T16:39:00.328312Z system:serviceaccount:kube-system:endpoint-controller update ==> At around 16:39:XX the first Pod was rolled out 2024-08-16T16:39:07.260823Z system:serviceaccount:kube-system:endpoint-controller update 2024-08-16T16:39:41.124449Z system:serviceaccount:kube-system:endpoint-controller update 2024-08-16T16:43:23.701015Z system:serviceaccount:kube-system:endpoint-controller update ==> At around 16:43:23, the new Pod that replaced the second one was created 2024-08-16T16:43:28.639793Z system:serviceaccount:kube-system:endpoint-controller update 2024-08-16T16:43:47.108903Z system:serviceaccount:kube-system:endpoint-controller update We can see that just before the new-connections checks succeeded again at around "2024-08-16T16:43:23.", an UPDATE was received/treated which may have helped the apiserver sync its endpoints cache or/and chose a healthy Pod Also, no update was triggered when the second Pod was deleted at "16:43:17" which may explain the stale 10.128.2.31 endpoints entry on apiserver side. To summarize, I can see two problems here (maybe one is the consequence of the other): A Pod was deleted and an Endpoint pointing to it wasn't updated. Apparently the Endpoints controller had/has some sync issues https://github.com/kubernetes/kubernetes/issues/125638 The apiserver resolver had a endpoints cache with one stale and one fresh entry but it kept 4-5 times in a row trying to reach the stale entry OR The endpoints was updated "At around 16:39:XX the first Pod was rolled out, see above", but the apiserver resolver cache missed that and ended up with 2 stale entries in the cache, and had to wait until "At around 16:43:23, the new Pod that replaced the second one was created, see above" to sync and replace them with 2 fresh entries.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. See "Description of problem" 2. 3.
Actual results:
Expected results:
The kube-aggregator should detect stale APIService endpoints.
Additional info:
The kube-aggregator proxies requests to a stale Endpoints entry/Pod, which makes Metrics API requests falsely fail.
Description of problem:
While extracting the cluster's release image from JSON output with jq, the literal placeholder string "VERSION_FROM_PREVIOUS_COMMAND" was used instead of substituting the version variable. As a result, jq matched no entry in the version history and the extracted image was empty.
Version-Release number of selected component (if applicable):
4.16
How reproducible:
Always.
Steps to Reproduce:
1. Prepare an Azure OpenShift cluster. 2. Migrate to Azure AD Workload Identity using the procedure at https://github.com/openshift/cloud-credential-operator/blob/master/docs/azure_workload_identity.md#steps-to-in-place-migrate-an-openshift-cluster-to-azure-ad-workload-identity. 3. The procedure fails on step 8: Extract CredentialsRequests from the cluster's release image for the given version.
Actual results:
The expected image could not be extracted; the result is empty. $ CLUSTER_VERSION=`oc get clusterversion version -o json | jq -r '.status.desired.version'` $ RELEASE_IMAGE=`oc get clusterversion version -o json | jq -r '.status.history[] | select(.version == "VERSION_FROM_PREVIOUS_COMMAND") | .image'`
Expected results:
The version variable should be substituted into the jq filter so that jq matches the correct history entry and extracts the correct image. $ CLUSTER_VERSION=`oc get clusterversion version -o json | jq -r '.status.desired.version'` $ RELEASE_IMAGE=`oc get clusterversion version -o json | jq -r '.status.history[] | select(.version == "'$CLUSTER_VERSION'") | .image'`
Additional info:
# Obtain release image from the cluster version. Will not work with pre-release versions. $ CLUSTER_VERSION=`oc get clusterversion version -o json | jq -r '.status.desired.version'` (Error)$ oc get clusterversion version -o json | jq -r '.status.history[] | select(.version == "VERSION_FROM_PREVIOUS_COMMAND") | .image' $ oc get clusterversion version -o json | jq -r '.status.history[] | select(.version == "'$CLUSTER_VERSION'") | .image' registry.ci.openshift.org/ocp-arm64/release-arm64@sha256:c605269e51d60b18e6c7251c92355b783bac7f411e137da36b031a1c6f21579b
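As an aside, the two steps can also be collapsed into a single jq expression, which avoids the intermediate shell variable entirely (illustrative only):
$ oc get clusterversion version -o json | jq -r '.status.desired.version as $v | .status.history[] | select(.version == $v) | .image'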
This is a clone of issue OCPBUGS-39118. The following is the description of the original issue:
—
Description of problem:
For the light theme, the Lightspeed logo should use the multi-color version. For the dark theme, it should use the single-color version for both the button and the content.
Description of problem:
While testing the backport of Azure Reserved Capacity Group support, a customer observed that they lack the permissions required when operating with Workload Identity (token- and role-based auth). This is a curious one: the target capacity reservation group may not be in the same resource group as the cluster, so it would require input from the admin in some use cases.
Version-Release number of selected component (if applicable):
4.17.0
How reproducible:
100%
Steps to Reproduce:
1. Install 4.17 in Azure with Workload Identity configured 2. Create an Azure Reserved Capacity Group, for simplicity in the same resource group as the cluster 3. Update a machineset to use that reserved group
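For reference, step 3 typically amounts to pointing the MachineSet provider spec at the capacity reservation group; a hedged sketch (the machineset name and the IDs are placeholders, and the capacityReservationGroupID field name is assumed from the Azure provider spec):
$ oc -n openshift-machine-api patch machineset <machineset-name> --type merge -p '{"spec":{"template":{"spec":{"providerSpec":{"value":{"capacityReservationGroupID":"/subscriptions/<subscription-id>/resourceGroups/<reservation-rg>/providers/Microsoft.Compute/capacityReservationGroups/<group-name>"}}}}}}'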
Actual results:
Permissions errors, missing Microsoft.Compute/capacityReservationGroups/deploy/action
Expected results:
Machines created successfully
Additional info:
As mentioned in the description, the reserved capacity group may be in another resource group, so creating the role requires admin input, i.e., it cannot be computed 100% of the time. That specific use case may therefore be a documentation concern unless we can come up with a novel solution. It also raises the question of whether we should include this permission in the default CredentialsRequest. Customers may not use that default for the reasons mentioned above, or they may not use the feature at all. For now it may be worth adding the permission to the 4.17 CredentialsRequest and treating use of this feature in backported versions as a documentation concern, since we wouldn't want to expand permissions on all 4.16 or older clusters.
This bug is to track the initial triage of our low install success rates on vSphere. I couldn't find a duplicate, but I could have missed one. Feel free to close this as a duplicate if a deeper investigation is happening elsewhere.
Component Readiness has found a potential regression in the following test:
install should succeed: overall
Probability of significant regression: 100.00%
Sample (being evaluated) Release: 4.17
Start Time: 2024-07-11T00:00:00Z
End Time: 2024-07-17T23:59:59Z
Success Rate: 60.00%
Successes: 12
Failures: 8
Flakes: 0
Base (historical) Release: 4.16
Start Time: 2024-05-31T00:00:00Z
End Time: 2024-06-27T23:59:59Z
Success Rate: 100.00%
Successes: 60
Failures: 0
Flakes: 0
This is a clone of issue OCPBUGS-39396. The following is the description of the original issue:
—
Description of problem:
When using an amd64 release image and setting the multi-arch flag to false, HCP CLI cannot create a HostedCluster. The following error happens: /tmp/hcp create cluster aws --role-arn arn:aws:iam::460538899914:role/cc1c0f586e92c42a7d50 --sts-creds /tmp/secret/sts-creds.json --name cc1c0f586e92c42a7d50 --infra-id cc1c0f586e92c42a7d50 --node-pool-replicas 3 --base-domain origin-ci-int-aws.dev.rhcloud.com --region us-east-1 --pull-secret /etc/ci-pull-credentials/.dockerconfigjson --namespace local-cluster --release-image registry.build01.ci.openshift.org/ci-op-0bi6jr1l/release@sha256:11351a958a409b8e34321edfc459f389058d978e87063bebac764823e0ae3183 2024-08-29T06:23:25Z ERROR Failed to create cluster {"error": "release image is not a multi-arch image"} github.com/openshift/hypershift/product-cli/cmd/cluster/aws.NewCreateCommand.func1 /remote-source/app/product-cli/cmd/cluster/aws/create.go:35 github.com/spf13/cobra.(*Command).execute /remote-source/app/vendor/github.com/spf13/cobra/command.go:983 github.com/spf13/cobra.(*Command).ExecuteC /remote-source/app/vendor/github.com/spf13/cobra/command.go:1115 github.com/spf13/cobra.(*Command).Execute /remote-source/app/vendor/github.com/spf13/cobra/command.go:1039 github.com/spf13/cobra.(*Command).ExecuteContext /remote-source/app/vendor/github.com/spf13/cobra/command.go:1032 main.main /remote-source/app/product-cli/main.go:59 runtime.main /usr/lib/golang/src/runtime/proc.go:271 Error: release image is not a multi-arch image release image is not a multi-arch image
Version-Release number of selected component (if applicable):
How reproducible:
Every time
Steps to Reproduce:
1. Try to create a HC with an amd64 release image and multi-arch flag set to false
Actual results:
The HostedCluster is not created and this error is displayed: Error: release image is not a multi-arch image release image is not a multi-arch image
Expected results:
The HostedCluster should be created without errors
Additional info:
This bug seems to have occurred as a result of HOSTEDCP-1778 and this line: https://github.com/openshift/hypershift/blob/e2f75a7247ab803634a1cc7f7beaf99f8a97194c/cmd/cluster/aws/create.go#L520
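For what it's worth, whether a given release pullspec is a manifest list (multi-arch) can be checked with something like the following (assumes skopeo and jq are available and registry credentials are configured; the pullspec is a placeholder):
$ skopeo inspect --raw docker://<release-pullspec> | jq -r '.mediaType'
A multi-arch payload reports application/vnd.docker.distribution.manifest.list.v2+json or application/vnd.oci.image.index.v1+json; a single-arch image reports a plain manifest media type.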
Description of problem:
For troubleshooting OSUS cases, the default must-gather doesn't collect OSUS information, and an inspect of the openshift-update-service namespace is missing several OSUS-related resources such as UpdateService, ImageSetConfiguration, and possibly more.
Version-Release number of selected component (if applicable):
4.14, 4.15, 4.16, 4.17
Actual results:
No OSUS information in must-gather
Expected results:
OSUS data in must-gather
Additional info: OTA-1177
PR for 4.17 in [1]
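Until must-gather collects this, a manual gathering sketch (assumes the OSUS operator and its CRDs are installed; adjust resource names as needed):
$ oc adm inspect ns/openshift-update-service --dest-dir=osus-inspect
$ oc -n openshift-update-service get updateservice -o yaml > osus-inspect/updateservices.yaml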
This is a clone of issue OCPBUGS-43757. The following is the description of the original issue:
—
Description of problem:
If the node-joiner container encounters an error, the "oc adm node-image create" command does not show it. It currently returns an error but should also display the node-joiner container's logs so that we can see the underlying issue.
Version-Release number of selected component (if applicable):
4.17
How reproducible:
always
Steps to Reproduce:
1. 2. 3.
Actual results:
The node-image create command returns a container error without showing the container's logs.
Expected results:
The node-image create command returns the container error and displays the container's logs to aid in diagnosing the issue.
Additional info:
Description of problem:
Builds from a BuildConfig are failing on OCP 4.12.48. Developers are impacted since large files can't be cloned anymore within a BuildConfig.
Version-Release number of selected component (if applicable):
4.12.48
How reproducible:
Always
Steps to Reproduce:
The issue was fixed in version 4.12.45 as per https://issues.redhat.com/browse/OCPBUGS-23419, but it persists in 4.12.48.
Actual results:
The build is failing.
Expected results:
The build should work without any issues.
Additional info:
Build fails with error: ``` Adding cluster TLS certificate authority to trust store Cloning "https://<path>.git" ... error: Downloading <github-repo>/projects/<path>.mp4 (70 MB) Error downloading object: <github-repo>/projects/<path>.mp44 (a11ce74): Smudge error: Error downloading <github-repo>/projects/<path>.mp4 (a11ce745c147aa031dd96915716d792828ae6dd17c60115b675aba75342bb95a): batch request: missing protocol: "origin.git/info/lfs" Errors logged to /tmp/build/inputs/.git/lfs/logs/20240430T112712.167008327.log Use `git lfs logs last` to view the log. error: external filter 'git-lfs filter-process' failed fatal: <github-repo>/projects/<path>.mp4: smudge filter lfs failed warning: Clone succeeded, but checkout failed. You can inspect what was checked out with 'git status' and retry with 'git restore --source=HEAD :/' ```
Description of problem:
FeatureGate accepts an unknown value
Version-Release number of selected component (if applicable):
4.16 and 4.17
How reproducible:
Always
Steps to Reproduce:
oc patch featuregate cluster --type=json -p '[{"op": "replace", "path": "/spec/featureSet", "value": "unknownghfh"}]' featuregate.config.openshift.io/cluster patched oc get featuregate cluster -o yaml apiVersion: config.openshift.io/v1 kind: FeatureGate metadata: annotations: include.release.openshift.io/self-managed-high-availability: "true" creationTimestamp: "2024-06-21T07:20:25Z" generation: 2 name: cluster resourceVersion: "56172" uid: c900a975-78ea-4076-8e56-e5517e14b55e spec: featureSet: unknownghfh
Actual results:
featuregate.config.openshift.io/cluster patched
metadata: annotations: include.release.openshift.io/self-managed-high-availability: "true" creationTimestamp: "2024-06-21T07:20:25Z" generation: 2 name: cluster resourceVersion: "56172" uid: c900a975-78ea-4076-8e56-e5517e14b55e spec: featureSet: unknownghfh
Expected results:
The FeatureGate should not accept an invalid value and should return an error, for example:
oc patch featuregate cluster --type=json -p '[
]'
The FeatureGate "cluster" is invalid: spec.featureSet: Unsupported value: "unknownghfh": supported values: "", "CustomNoUpgrade", "LatencySensitive", "TechPreviewNoUpgrade"
Additional info:
https://github.com/openshift/kubernetes/commit/facd3b18622d268a4780de1ad94f7da763351425
Add a recording rule, acm_capacity_effective_cpu_cores, on the telemeter server side for ACM subscription usage, with two labels: _id and managed_cluster_id.
The rule is built based on the 3 metrics:
Here is the logic for the recording rule:
Note: If the metric cluster:capacity_effective_cpu_cores is not available for a self managed OpenShift cluster, the value of the metric acm_capacity_effective_cpu_cores will fall back to the metric acm_managed_cluster_worker_cores:sum.
See the DDR for more details: https://docs.google.com/document/d/1WbQyaY3C6MxfsebJrV_glX8YvqqzS5fwt3z8Z8SQ0VY/edit?usp=sharing
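A rough PromQL illustration of the fallback described in the note, using the 'or' operator (label handling and joins are omitted, so this is only a sketch; the actual telemeter-server rule may differ):
  cluster:capacity_effective_cpu_cores or acm_managed_cluster_worker_cores:sum
With 'or', series from acm_managed_cluster_worker_cores:sum are used only where no matching cluster:capacity_effective_cpu_cores series exists.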
Description of problem:
Trying to install AWS EFS Driver 4.15 in 4.16 OCP. And driver pods get stuck with the below error: $ oc get pods NAME READY STATUS RESTARTS AGE aws-ebs-csi-driver-controller-5f85b66c6-5gw8n 11/11 Running 0 80m aws-ebs-csi-driver-controller-5f85b66c6-r5lzm 11/11 Running 0 80m aws-ebs-csi-driver-node-4mcjp 3/3 Running 0 76m aws-ebs-csi-driver-node-82hmk 3/3 Running 0 76m aws-ebs-csi-driver-node-p7g8j 3/3 Running 0 80m aws-ebs-csi-driver-node-q9bnd 3/3 Running 0 75m aws-ebs-csi-driver-node-vddmg 3/3 Running 0 80m aws-ebs-csi-driver-node-x8cwl 3/3 Running 0 80m aws-ebs-csi-driver-operator-5c77fbb9fd-dc94m 1/1 Running 0 80m aws-efs-csi-driver-controller-6c4c6f8c8c-725f4 4/4 Running 0 11m aws-efs-csi-driver-controller-6c4c6f8c8c-nvtl7 4/4 Running 0 12m aws-efs-csi-driver-node-2frs7 0/3 Pending 0 6m29s aws-efs-csi-driver-node-5cpb8 0/3 Pending 0 6m26s aws-efs-csi-driver-node-bchg5 0/3 Pending 0 6m28s aws-efs-csi-driver-node-brndb 0/3 Pending 0 6m27s aws-efs-csi-driver-node-qcc4m 0/3 Pending 0 6m27s aws-efs-csi-driver-node-wpk5d 0/3 Pending 0 6m27s aws-efs-csi-driver-operator-6b54c78484-gvxrt 1/1 Running 0 13m Events: Type Reason Age From Message ---- ------ ---- ---- ------- Warning FailedScheduling 6m58s default-scheduler 0/6 nodes are available: 1 node(s) didn't have free ports for the requested pod ports. preemption: 0/6 nodes are available: 1 node(s) didn't have free ports for the requested pod ports, 5 node(s) didn't match Pod's node affinity/selector. Warning FailedScheduling 3m42s (x2 over 4m24s) default-scheduler 0/6 nodes are available: 1 node(s) didn't have free ports for the requested pod ports. preemption: 0/6 nodes are available: 1 node(s) didn't have free ports for the requested pod ports, 5 node(s) didn't match Pod's node affinity/selector.
Version-Release number of selected component (if applicable):
4.15
How reproducible:
all the time
Steps to Reproduce:
1. Install AWS EFS CSI driver 4.15 in 4.16 OCP 2. 3.
Actual results:
EFS CSI driver node pods are stuck in Pending state
Expected results:
All pods should be Running.
Additional info:
More info on the initial debug here: https://redhat-internal.slack.com/archives/CBQHQFU0N/p1715757611210639
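A couple of checks that may help narrow down the Pending state (the namespace openshift-cluster-csi-drivers is an assumption based on where the driver pods normally run):
$ oc -n openshift-cluster-csi-drivers get ds aws-efs-csi-driver-node -o jsonpath='{.spec.template.spec.nodeSelector}{"\n"}'
$ oc -n openshift-cluster-csi-drivers get ds aws-efs-csi-driver-node -o jsonpath='{.spec.template.spec.containers[*].ports[*].hostPort}{"\n"}'
The first shows whether the node selector matches the workers; the second shows any hostPorts that could collide with the EBS driver's node pods.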
Please review the following PR: https://github.com/openshift/oc/pull/1785
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/cluster-storage-operator/pull/474
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
This is a clone of issue OCPBUGS-39209. The following is the description of the original issue:
—
Description of problem:
Attempting to migrate from OpenShiftSDN to OVNKubernetes, but experiencing the below error once the Limited Live Migration is started.
+ exec /usr/bin/hybrid-overlay-node --node ip-10-241-1-192.us-east-2.compute.internal --config-file=/run/ovnkube-config/ovnkube.conf --bootstrap-kubeconfig=/var/lib/kubelet/kubeconfig --cert-dir=/etc/ovn/ovnkube-node-certs --cert-duration=24h I0829 14:06:20.313928 82345 config.go:2192] Parsed config file /run/ovnkube-config/ovnkube.conf I0829 14:06:20.314202 82345 config.go:2193] Parsed config: {Default:{MTU:8901 RoutableMTU:0 ConntrackZone:64000 HostMasqConntrackZone:0 OVNMasqConntrackZone:0 HostNodePortConntrackZone:0 ReassemblyConntrackZone:0 EncapType:geneve EncapIP: EncapPort:6081 InactivityProbe:100000 OpenFlowProbe:180 OfctrlWaitBeforeClear:0 MonitorAll:true OVSDBTxnTimeout:1m40s LFlowCacheEnable:true LFlowCacheLimit:0 LFlowCacheLimitKb:1048576 RawClusterSubnets:100.64.0.0/15/23 ClusterSubnets:[] EnableUDPAggregation:true Zone:global} Logging:{File: CNIFile: LibovsdbFile:/var/log/ovnkube/libovsdb.log Level:4 LogFileMaxSize:100 LogFileMaxBackups:5 LogFileMaxAge:0 ACLLoggingRateLimit:20} Monitoring:{RawNetFlowTargets: RawSFlowTargets: RawIPFIXTargets: NetFlowTargets:[] SFlowTargets:[] IPFIXTargets:[]} IPFIX:{Sampling:400 CacheActiveTimeout:60 CacheMaxFlows:0} CNI:{ConfDir:/etc/cni/net.d Plugin:ovn-k8s-cni-overlay} OVNKubernetesFeature:{EnableAdminNetworkPolicy:true EnableEgressIP:true EgressIPReachabiltyTotalTimeout:1 EnableEgressFirewall:true EnableEgressQoS:true EnableEgressService:true EgressIPNodeHealthCheckPort:9107 EnableMultiNetwork:true EnableMultiNetworkPolicy:false EnableStatelessNetPol:false EnableInterconnect:false EnableMultiExternalGateway:true EnablePersistentIPs:false EnableDNSNameResolver:false EnableServiceTemplateSupport:false} Kubernetes:{BootstrapKubeconfig: CertDir: CertDuration:10m0s Kubeconfig: CACert: CAData:[] APIServer:https://api-int.nonamenetwork.sandbox1730.opentlc.com:6443 Token: TokenFile: CompatServiceCIDR: RawServiceCIDRs:198.18.0.0/16 ServiceCIDRs:[] OVNConfigNamespace:openshift-ovn-kubernetes OVNEmptyLbEvents:false PodIP: RawNoHostSubnetNodes:migration.network.openshift.io/plugin= NoHostSubnetNodes:<nil> HostNetworkNamespace:openshift-host-network PlatformType:AWS HealthzBindAddress:0.0.0.0:10256 CompatMetricsBindAddress: CompatOVNMetricsBindAddress: CompatMetricsEnablePprof:false DNSServiceNamespace:openshift-dns DNSServiceName:dns-default} Metrics:{BindAddress: OVNMetricsBindAddress: ExportOVSMetrics:false EnablePprof:false NodeServerPrivKey: NodeServerCert: EnableConfigDuration:false EnableScaleMetrics:false} OvnNorth:{Address: PrivKey: Cert: CACert: CertCommonName: Scheme: ElectionTimer:0 northbound:false exec:<nil>} OvnSouth:{Address: PrivKey: Cert: CACert: CertCommonName: Scheme: ElectionTimer:0 northbound:false exec:<nil>} Gateway:{Mode:shared Interface: EgressGWInterface: NextHop: VLANID:0 NodeportEnable:true DisableSNATMultipleGWs:false V4JoinSubnet:100.64.0.0/16 V6JoinSubnet:fd98::/64 V4MasqueradeSubnet:169.254.169.0/29 V6MasqueradeSubnet:fd69::/125 MasqueradeIPs:{V4OVNMasqueradeIP:169.254.169.1 V6OVNMasqueradeIP:fd69::1 V4HostMasqueradeIP:169.254.169.2 V6HostMasqueradeIP:fd69::2 V4HostETPLocalMasqueradeIP:169.254.169.3 V6HostETPLocalMasqueradeIP:fd69::3 V4DummyNextHopMasqueradeIP:169.254.169.4 V6DummyNextHopMasqueradeIP:fd69::4 V4OVNServiceHairpinMasqueradeIP:169.254.169.5 V6OVNServiceHairpinMasqueradeIP:fd69::5} DisablePacketMTUCheck:false RouterSubnet: SingleNode:false DisableForwarding:false AllowNoUplink:false} MasterHA:{ElectionLeaseDuration:137 ElectionRenewDeadline:107 ElectionRetryPeriod:26} ClusterMgrHA:{ElectionLeaseDuration:137 
ElectionRenewDeadline:107 ElectionRetryPeriod:26} HybridOverlay:{Enabled:true RawClusterSubnets: ClusterSubnets:[] VXLANPort:4789} OvnKubeNode:{Mode:full DPResourceDeviceIdsMap:map[] MgmtPortNetdev: MgmtPortDPResourceName:} ClusterManager:{V4TransitSwitchSubnet:100.88.0.0/16 V6TransitSwitchSubnet:fd97::/64}} F0829 14:06:20.315468 82345 hybrid-overlay-node.go:54] illegal network configuration: built-in join subnet "100.64.0.0/16" overlaps cluster subnet "100.64.0.0/15"
The OpenShift Container Platform 4 cluster was installed with the configuration below and therefore has a conflict between the clusterNetwork and the OVNKubernetes join subnet.
$ oc get cm -n kube-system cluster-config-v1 -o yaml
apiVersion: v1
data:
install-config: |
additionalTrustBundlePolicy: Proxyonly
apiVersion: v1
baseDomain: sandbox1730.opentlc.com
compute:
- architecture: amd64
hyperthreading: Enabled
name: worker
platform: {}
replicas: 3
controlPlane:
architecture: amd64
hyperthreading: Enabled
name: master
platform: {}
replicas: 3
metadata:
creationTimestamp: null
name: nonamenetwork
networking:
clusterNetwork:
- cidr: 100.64.0.0/15
hostPrefix: 23
machineNetwork:
- cidr: 10.241.0.0/16
networkType: OpenShiftSDN
serviceNetwork:
- 198.18.0.0/16
platform:
aws:
region: us-east-2
publish: External
pullSecret: ""
Following the procedure, the steps below were executed, but the problem is still reported.
oc patch network.operator.openshift.io cluster --type='merge' -p='{"spec":{"defaultNetwork":{"ovnKubernetesConfig":{"ipv4":{"internalJoinSubnet": "100.68.0.0/16"}}}}}'
Checking whether the change was applied shows that it is configured.
$ oc get network.operator cluster -o yaml apiVersion: operator.openshift.io/v1 kind: Network metadata: creationTimestamp: "2024-08-29T10:05:36Z" generation: 376 name: cluster resourceVersion: "135345" uid: 37f08c71-98fa-430c-b30f-58f82142788c spec: clusterNetwork: - cidr: 100.64.0.0/15 hostPrefix: 23 defaultNetwork: openshiftSDNConfig: enableUnidling: true mode: NetworkPolicy mtu: 8951 vxlanPort: 4789 ovnKubernetesConfig: egressIPConfig: {} gatewayConfig: ipv4: {} ipv6: {} routingViaHost: false genevePort: 6081 ipsecConfig: mode: Disabled ipv4: internalJoinSubnet: 100.68.0.0/16 mtu: 8901 policyAuditConfig: destination: "null" maxFileSize: 50 maxLogFiles: 5 rateLimit: 20 syslogFacility: local0 type: OpenShiftSDN deployKubeProxy: false disableMultiNetwork: false disableNetworkDiagnostics: false kubeProxyConfig: bindAddress: 0.0.0.0 logLevel: Normal managementState: Managed migration: mode: Live networkType: OVNKubernetes observedConfig: null operatorLogLevel: Normal serviceNetwork: - 198.18.0.0/16 unsupportedConfigOverrides: null useMultiNetworkPolicy: false
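A targeted check of the same field (illustrative):
$ oc get network.operator cluster -o jsonpath='{.spec.defaultNetwork.ovnKubernetesConfig.ipv4.internalJoinSubnet}{"\n"}'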
Following the above, the Limited Live Migration is triggered, which then suddenly stops because of the error shown.
oc patch Network.config.openshift.io cluster --type='merge' --patch '{"metadata":{"annotations":{"network.openshift.io/network-type-migration":""}},"spec":{"networkType":"OVNKubernetes"}}'
Version-Release number of selected component (if applicable):
OpenShift Container Platform 4.16.9
How reproducible:
Always
Steps to Reproduce:
1. Install OpenShift Container Platform 4 with OpenShiftSDN, the configuration shown above and then update to OpenShift Container Platform 4.16
2. Change internalJoinSubnet to prevent a conflict with the join subnet of OVNKubernetes (oc patch network.operator.openshift.io cluster --type='merge' -p='{"spec":{"defaultNetwork":{"ovnKubernetesConfig":{"ipv4":{"internalJoinSubnet": "100.68.0.0/16"}}}}}')
3. Initiate the Limited Live Migration running oc patch Network.config.openshift.io cluster --type='merge' --patch '{"metadata":{"annotations":{"network.openshift.io/network-type-migration":""}},"spec":{"networkType":"OVNKubernetes"}}'
4. Check the logs of ovnkube-node using oc logs ovnkube-node-XXXXX -c ovnkube-controller
Actual results:
+ exec /usr/bin/hybrid-overlay-node --node ip-10-241-1-192.us-east-2.compute.internal --config-file=/run/ovnkube-config/ovnkube.conf --bootstrap-kubeconfig=/var/lib/kubelet/kubeconfig --cert-dir=/etc/ovn/ovnkube-node-certs --cert-duration=24h I0829 14:06:20.313928 82345 config.go:2192] Parsed config file /run/ovnkube-config/ovnkube.conf I0829 14:06:20.314202 82345 config.go:2193] Parsed config: {Default:{MTU:8901 RoutableMTU:0 ConntrackZone:64000 HostMasqConntrackZone:0 OVNMasqConntrackZone:0 HostNodePortConntrackZone:0 ReassemblyConntrackZone:0 EncapType:geneve EncapIP: EncapPort:6081 InactivityProbe:100000 OpenFlowProbe:180 OfctrlWaitBeforeClear:0 MonitorAll:true OVSDBTxnTimeout:1m40s LFlowCacheEnable:true LFlowCacheLimit:0 LFlowCacheLimitKb:1048576 RawClusterSubnets:100.64.0.0/15/23 ClusterSubnets:[] EnableUDPAggregation:true Zone:global} Logging:{File: CNIFile: LibovsdbFile:/var/log/ovnkube/libovsdb.log Level:4 LogFileMaxSize:100 LogFileMaxBackups:5 LogFileMaxAge:0 ACLLoggingRateLimit:20} Monitoring:{RawNetFlowTargets: RawSFlowTargets: RawIPFIXTargets: NetFlowTargets:[] SFlowTargets:[] IPFIXTargets:[]} IPFIX:{Sampling:400 CacheActiveTimeout:60 CacheMaxFlows:0} CNI:{ConfDir:/etc/cni/net.d Plugin:ovn-k8s-cni-overlay} OVNKubernetesFeature:{EnableAdminNetworkPolicy:true EnableEgressIP:true EgressIPReachabiltyTotalTimeout:1 EnableEgressFirewall:true EnableEgressQoS:true EnableEgressService:true EgressIPNodeHealthCheckPort:9107 EnableMultiNetwork:true EnableMultiNetworkPolicy:false EnableStatelessNetPol:false EnableInterconnect:false EnableMultiExternalGateway:true EnablePersistentIPs:false EnableDNSNameResolver:false EnableServiceTemplateSupport:false} Kubernetes:{BootstrapKubeconfig: CertDir: CertDuration:10m0s Kubeconfig: CACert: CAData:[] APIServer:https://api-int.nonamenetwork.sandbox1730.opentlc.com:6443 Token: TokenFile: CompatServiceCIDR: RawServiceCIDRs:198.18.0.0/16 ServiceCIDRs:[] OVNConfigNamespace:openshift-ovn-kubernetes OVNEmptyLbEvents:false PodIP: RawNoHostSubnetNodes:migration.network.openshift.io/plugin= NoHostSubnetNodes:<nil> HostNetworkNamespace:openshift-host-network PlatformType:AWS HealthzBindAddress:0.0.0.0:10256 CompatMetricsBindAddress: CompatOVNMetricsBindAddress: CompatMetricsEnablePprof:false DNSServiceNamespace:openshift-dns DNSServiceName:dns-default} Metrics:{BindAddress: OVNMetricsBindAddress: ExportOVSMetrics:false EnablePprof:false NodeServerPrivKey: NodeServerCert: EnableConfigDuration:false EnableScaleMetrics:false} OvnNorth:{Address: PrivKey: Cert: CACert: CertCommonName: Scheme: ElectionTimer:0 northbound:false exec:<nil>} OvnSouth:{Address: PrivKey: Cert: CACert: CertCommonName: Scheme: ElectionTimer:0 northbound:false exec:<nil>} Gateway:{Mode:shared Interface: EgressGWInterface: NextHop: VLANID:0 NodeportEnable:true DisableSNATMultipleGWs:false V4JoinSubnet:100.64.0.0/16 V6JoinSubnet:fd98::/64 V4MasqueradeSubnet:169.254.169.0/29 V6MasqueradeSubnet:fd69::/125 MasqueradeIPs:{V4OVNMasqueradeIP:169.254.169.1 V6OVNMasqueradeIP:fd69::1 V4HostMasqueradeIP:169.254.169.2 V6HostMasqueradeIP:fd69::2 V4HostETPLocalMasqueradeIP:169.254.169.3 V6HostETPLocalMasqueradeIP:fd69::3 V4DummyNextHopMasqueradeIP:169.254.169.4 V6DummyNextHopMasqueradeIP:fd69::4 V4OVNServiceHairpinMasqueradeIP:169.254.169.5 V6OVNServiceHairpinMasqueradeIP:fd69::5} DisablePacketMTUCheck:false RouterSubnet: SingleNode:false DisableForwarding:false AllowNoUplink:false} MasterHA:{ElectionLeaseDuration:137 ElectionRenewDeadline:107 ElectionRetryPeriod:26} ClusterMgrHA:{ElectionLeaseDuration:137 
ElectionRenewDeadline:107 ElectionRetryPeriod:26} HybridOverlay:{Enabled:true RawClusterSubnets: ClusterSubnets:[] VXLANPort:4789} OvnKubeNode:{Mode:full DPResourceDeviceIdsMap:map[] MgmtPortNetdev: MgmtPortDPResourceName:} ClusterManager:{V4TransitSwitchSubnet:100.88.0.0/16 V6TransitSwitchSubnet:fd97::/64}} F0829 14:06:20.315468 82345 hybrid-overlay-node.go:54] illegal network configuration: built-in join subnet "100.64.0.0/16" overlaps cluster subnet "100.64.0.0/15"
Expected results:
The OVNKubernetes Limited Live Migration should recognize the change applied to internalJoinSubnet and not report any CIDR/subnet overlap during the migration.
Additional info:
N/A
Affected Platforms:
OpenShift Container Platform 4.16 on AWS
Please review the following PR: https://github.com/openshift/azure-kubernetes-kms/pull/7
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
Sometimes the unit tests fail in CI; this appears to be related to how the tests are structured in combination with random test ordering.
Version-Release number of selected component (if applicable):
How reproducible:
Intermittent; when the tests are executed in random order, some of them can fail due to stale objects left over from earlier tests in the suite.
Steps to Reproduce:
1. run `make test` 2. 3.
Actual results:
Sometimes this appears in the build output:
=== RUN TestApplyConfigMap/skip_on_extra_label resourceapply_test.go:177: Expected success, but got an error: <*errors.StatusError | 0xc0002dc140>: configmaps "foo" already exists { ErrStatus: { TypeMeta: {Kind: "", APIVersion: ""}, ListMeta: { SelfLink: "", ResourceVersion: "", Continue: "", RemainingItemCount: nil, }, Status: "Failure", Message: "configmaps \"foo\" already exists", Reason: "AlreadyExists", Details: {Name: "foo", Group: "", Kind: "configmaps", UID: "", Causes: nil, RetryAfterSeconds: 0}, Code: 409, }, }
Expected results:
tests pass
Additional info:
It looks like we are also missing the proper test-env remote bucket flag on the make command; it should have something like "--remote-bucket openshift-kubebuilder-tools".
Description of problem:
The cluster-api-operator https://github.com/openshift/cluster-api-operator is missing the latest update release from upstream cluster-api-operator https://github.com/kubernetes-sigs/cluster-api-operator/tree/main
Version-Release number of selected component (if applicable):
4.17
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
In 4.17, users no longer need this job in their cluster. The migration the job performs has been done for a couple of releases (4.15, 4.16).
Acceptance Criteria
Description of the problem:
Right now, when a patch manifest is placed in the manifest folder, it is ignored.
The user should be able to upload the manifest only to the openshift folder.
How reproducible:
Steps to reproduce:
1. Create a cluster
2. Try to create a patch manifest in the manifest folder
3.
Actual results:
The manifest is created, the installation starts, and the patch is ignored.
Expected results:
The UI should block creating the patch manifest.
This is a clone of issue OCPBUGS-38482. The following is the description of the original issue:
—
Description of problem:
The PR for "AGENT-938: Enhance console logging to display node ISO expiry date during addNodes workflow" landed in the master branch after the release branch for 4.17 was cut due to delays in tide merge pool. Need to backport this commit to 4.17 https://github.com/openshift/installer/commit/8c381ff6edbbc9885aac7ce2d6dedc055e01c70d
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
When running oc-mirror against a YAML config that includes the community-operator-index, the process terminates prematurely.
Version-Release number of selected component (if applicable):
$ oc-mirror version WARNING: This version information is deprecated and will be replaced with the output from --short. Use --output=yaml|json to get the full version. Client Version: version.Info{Major:"", Minor:"", GitVersion:"4.16.0-202404221110.p0.g0e2235f.assembly.stream.el9-0e2235f", GitCommit:"0e2235f4a51ce0a2d51cfc87227b1c76bc7220ea", GitTreeState:"clean", BuildDate:"2024-04-22T16:05:56Z", GoVersion:"go1.21.9 (Red Hat 1.21.9-1.el9_4) X:strictfipsruntime", Compiler:"gc", Platform:"linux/amd64"}
How reproducible:
$ cat imageset-config.yaml apiVersion: mirror.openshift.io/v1alpha2 kind: ImageSetConfiguration archiveSize: 4 mirror: platform: channels: - name: stable-4.15 type: ocp graph: true operators: - catalog: registry.redhat.io/redhat/redhat-operator-index:v4.15 full: false - catalog: registry.redhat.io/redhat/certified-operator-index:v4.15 full: false - catalog: registry.redhat.io/redhat/community-operator-index:v4.15 full: false additionalImages: - name: registry.redhat.io/ubi8/ubi:latest helm: {} $ oc-mirror --v2 -c imageset-config.yaml --loglevel debug --workspace file:////data/oc-mirror/workdir/ docker://registry.local.momolab.io:8443 Last 10 lines: 2024/04/29 06:01:40 [DEBUG] : source docker://public.ecr.aws/aws-controllers-k8s/apigatewayv2-controller:1.0.7 2024/04/29 06:01:40 [DEBUG] : destination docker://registry.local.momolab.io:8443/aws-controllers-k8s/apigatewayv2-controller:1.0.7 2024/04/29 06:01:40 [DEBUG] : source docker://quay.io/openshift-community-operators/ack-apigatewayv2-controller@sha256:c6844909fa2fdf8aabf1c6762a2871d85fb3491e4c349990f46e4cd1e7ecc099 2024/04/29 06:01:40 [DEBUG] : destination docker://registry.local.momolab.io:8443/openshift-community-operators/ack-apigatewayv2-controller:c6844909fa2fdf8aabf1c6762a2871d85fb3491e4c349990f46e4cd1e7ecc099 2024/04/29 06:01:40 [DEBUG] : source docker://quay.io/openshift-community-operators/openshift-nfd-operator@sha256:880517267f12e0ca4dd9621aa196c901eb1f754e5ec990a1459d0869a8c17451 2024/04/29 06:01:40 [DEBUG] : destination docker://registry.local.momolab.io:8443/openshift-community-operators/openshift-nfd-operator:880517267f12e0ca4dd9621aa196c901eb1f754e5ec990a1459d0869a8c17451 2024/04/29 06:01:40 [DEBUG] : source docker://quay.io/openshift/origin-cluster-nfd-operator:4.10 2024/04/29 06:01:40 [DEBUG] : destination docker://registry.local.momolab.io:8443/openshift/origin-cluster-nfd-operator:4.10 2024/04/29 06:01:40 [ERROR] : [OperatorImageCollector] unable to parse image registry.redhat.io/openshift4/ose-kube-rbac-proxy correctly 2024/04/29 06:01:40 [INFO] : 👋 Goodbye, thank you for using oc-mirror error closing log file registry.log: close /data/oc-mirror/workdir/working-dir/logs/registry.log: file already closed 2024/04/29 06:01:40 [ERROR] : unable to parse image registry.redhat.io/openshift4/ose-kube-rbac-proxy correctly
Steps to Reproduce:
1. Run oc-mirror command as above with debug enabled 2. Wait a few minutes 3. oc-mirror fails
Actual results:
oc-mirror fails when openshift-community-operator is included
Expected results:
oc-mirror should complete successfully
Additional info:
I have the debug logs, which I can attach.
Please review the following PR: https://github.com/openshift/installer/pull/8456
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of the problem:
The MAC mapping validation added in MGMT-17618 caused a regression on ABI.
To avoid this regression, the validation should be relaxed to validate only non-predictable interface names.
We should still make sure at least one MAC address exists in the MAC map, so the relevant host can be detected.
See the Slack discussion.
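For context, the mapping in question is the per-host interface table in agent-config.yaml, along these lines (hostname, interface name, and MAC address below are placeholders):
hosts:
  - hostname: master-0
    interfaces:
      - name: eno1
        macAddress: 00:ef:44:21:e6:a5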
How reproducible:
100%
Steps to reproduce:
Actual results:
error 'mac-interface mapping for interface xxxx is missing'
Expected results:
Installation succeeds and the interfaces are correctly configured.
Please review the following PR: https://github.com/openshift/csi-operator/pull/241
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/configmap-reload/pull/61
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
The troubleshooting panel global trigger is not displayed in the application launcher. This blocks users from discovering the panel and troubleshooting problems correctly.
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. Install COO 2. Install the troubleshooting panel using the UIPlugin CR
Actual results:
The "Signal Correlation" item does not appear in the application launcher when the troubleshooting panel is installed
Expected results:
The "Signal Correlation" item appears in the application launcher when the troubleshooting panel is installed
Additional info:
https://github.com/openshift/console/pull/14097 to 4.17
Please review the following PR: https://github.com/openshift/vmware-vsphere-csi-driver/pull/121
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/cluster-version-operator/pull/1056
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
This is a clone of issue OCPBUGS-35048. The following is the description of the original issue:
—
Description of problem:
Same as the admin console bug OCPBUGS-31931, but on the developer console. On a 4.15.17 cluster, the kubeadmin user goes to the developer console UI, clicks "Observe", selects one project (for example openshift-monitoring), selects the Silences tab, and clicks "Create silence". The Creator field is not auto-filled with the user name. Add a label name/value and a Comment to create the silence.
You will see an error on the page:
An error occurred createdBy in body is required
see picture: https://drive.google.com/file/d/1PR64hvpYCC-WOHT1ID9A4jX91LdGG62Y/view?usp=sharing
This issue exists in 4.15/4.16/4.17/4.18; there is no issue with 4.14.
Version-Release number of selected component (if applicable):
4.15.17
How reproducible:
Always
Steps to Reproduce:
see the description
Actual results:
Creator field is not auto-filled with the user name
Expected results:
no error
Additional info:
Please review the following PR: https://github.com/openshift/multus-admission-controller/pull/85
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
We need to disable the migration feature, including the setting of migration-datastore-url, if multiple vCenters are enabled in the cluster, because CSI migration currently doesn't work in that environment.
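A minimal sketch of the intended guard (not the operator's actual code); it assumes the operator can see the list of configured vCenters from the cluster's infrastructure configuration and simply skips CSI migration settings such as migration-datastore-url when more than one is present.
```
package main

import "fmt"

// vSphereConfig is a stand-in for the operator's view of the platform spec;
// the real operator reads this from the cluster Infrastructure resource.
type vSphereConfig struct {
	VCenters []string // configured vCenter servers
}

// shouldEnableCSIMigration returns false when multiple vCenters are configured,
// since CSI migration (and migration-datastore-url) is not supported there yet.
func shouldEnableCSIMigration(cfg vSphereConfig) bool {
	return len(cfg.VCenters) <= 1
}

func main() {
	single := vSphereConfig{VCenters: []string{"vcenter.example.com"}}
	multi := vSphereConfig{VCenters: []string{"vc1.example.com", "vc2.example.com"}}
	fmt.Println(shouldEnableCSIMigration(single)) // true: set migration-datastore-url as usual
	fmt.Println(shouldEnableCSIMigration(multi))  // false: skip the migration feature entirely
}
```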
Description of problem:
Navigation: Pipelines -> Pipelines -> Click on kebab menu -> Add Trigger -> Select Git provider type. Issue: "Show variables" and "Hide variables" are in English.
Version-Release number of selected component (if applicable):
4.16.0-0.nightly-2024-06-01-063526
How reproducible:
Always
Steps to Reproduce:
1. Log into web console and set language as non en_US 2. Navigate to Pipelines -> Pipelines -> Click on kebab menu -> Add Trigger -> Select Git provider type 3. "Show variables" "Hide variables" are in English
Actual results:
Content is in English
Expected results:
Content should be in the selected language.
Additional info:
Reference screenshot attached.
This is a clone of issue OCPBUGS-42120. The following is the description of the original issue:
—
Description of problem:
After upgrading OCP and LSO to version 4.14, elasticsearch pods in the openshift-logging deployment are unable to schedule to their respective nodes and remain Pending, even though the LSO managed PVs are bound to the PVCs. A test pod using a newly created test PV managed by the LSO is able to schedule correctly however.
Version-Release number of selected component (if applicable):
4.14
How reproducible:
Consistently
Steps to Reproduce:
1. 2. 3.
Actual results:
Pods consuming previously existing LSO managed PVs are unable to schedule and remain in a Pending state after upgrading OCP and LSO to 4.14.
Expected results:
That pods would be able to consume LSO managed PVs and schedule correctly to nodes.
Additional info:
The upcoming OpenShift Pipelines release, which will be deployed shortly, has stricter validations on Pipeline and Task manifests. ClamAV would fail the new validations.
In OCPBUGS-30951, we modified a check used in the Cinder CSI Driver Operator to relax the requirements for enabling topology support. Unfortunately, in doing this we introduced a bug: we now attempt to access the volume AZ for each compute AZ, which isn't valid if there are more compute AZs than volume AZs. This needs to be addressed.
This affects 4.14 through to master (unreleased 4.17).
Always.
1. Deploy OCP-on-OSP on a cluster with fewer storage AZs than compute AZs
Operator fails due to out-of-range error.
Operator should not fail.
None.
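To illustrate the fix direction for the Cinder CSI Driver Operator issue above (not the operator's actual code), here is a minimal Go sketch of a bounds-safe compute-to-volume AZ mapping; all names are hypothetical.
```
package main

import "fmt"

// volumeAZForCompute pairs each compute AZ with a volume AZ for topology purposes.
// Indexing volumeAZs by the compute index panics when len(computeAZs) > len(volumeAZs),
// which is the out-of-range error described above; guarding the lookup avoids it.
func volumeAZForCompute(computeAZs, volumeAZs []string) map[string]string {
	result := make(map[string]string, len(computeAZs))
	for i, computeAZ := range computeAZs {
		if i < len(volumeAZs) {
			result[computeAZ] = volumeAZs[i]
			continue
		}
		// Fewer volume AZs than compute AZs: fall back instead of indexing out of range.
		result[computeAZ] = ""
	}
	return result
}

func main() {
	compute := []string{"az-1", "az-2", "az-3"}
	volume := []string{"nova"}
	fmt.Println(volumeAZForCompute(compute, volume)) // map[az-1:nova az-2: az-3:]
}
```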
Please review the following PR: https://github.com/openshift/cloud-provider-ibm/pull/70
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of the problem:
non-lowercase hostname in DHCP breaks assisted installation
How reproducible:
100%
Steps to reproduce:
Actual results:
bootkube fails
Expected results:
bootkube should succeed
Description of problem:
Alignment issue with the Breadcrumbs in the Task Selection QuickSearch
Version-Release number of selected component (if applicable):
4.16.0
How reproducible:
Always
Steps to Reproduce:
1. Install the Pipelines Operator 2. Use the Quick Search in the Pipeline Builder page 3. Type "git-clone"
Actual results:
Alignment issue with the Breadcrumbs in the Task Selection QuickSearch
Expected results:
Proper alignment
Additional info:
Screenshot: https://drive.google.com/file/d/1qGWLyfLBHAzfhv8Bnng3IyEJCx8hdMEo/view?usp=drive_link
This is a clone of issue OCPBUGS-38241. The following is the description of the original issue:
—
Component Readiness has found a potential regression in the following test:
operator conditions control-plane-machine-set
Probability of significant regression: 100.00%
Sample (being evaluated) Release: 4.17
Start Time: 2024-08-03T00:00:00Z
End Time: 2024-08-09T23:59:59Z
Success Rate: 92.05%
Successes: 81
Failures: 7
Flakes: 0
Base (historical) Release: 4.16
Start Time: 2024-05-31T00:00:00Z
End Time: 2024-06-27T23:59:59Z
Success Rate: 100.00%
Successes: 429
Failures: 0
Flakes: 0
Please review the following PR: https://github.com/openshift/ovirt-csi-driver-operator/pull/134
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/csi-external-snapshotter/pull/153
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
Proxy settings in buildDefaults preserved in image
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
I have a customer whose developers need proxy access during builds. For this they have configured buildDefaults on their cluster as described here: https://docs.openshift.com/container-platform/4.10/cicd/builds/build-configuration.html. The problem is that buildDefaults.defaultProxy sets the proxy environment variables in uppercase. Several Red Hat S2I images use tools that depend on curl, and curl only supports lowercase proxy environment variables, so the defaultProxy settings are not taken into account. To work around this "behavior defect", they have configured buildDefaults.env.http_proxy, buildDefaults.env.https_proxy, and buildDefaults.env.no_proxy. But the side effect is that the lowercase environment variables are preserved in the container image. So at runtime the proxy settings are still active, and they constantly have to support developers in unsetting them again (when using non-FQDNs, for example). This is causing frustration for them and their developers. 1. Why can't buildDefaults.defaultProxy be set as both lowercase and uppercase proxy settings? 2. Why are the buildDefaults.env values preserved in the container image while buildDefaults.defaultProxy is correctly unset/removed from the container image? As the name implies, for us "buildDefaults" should only apply during the build, and the settings should be removed before pushing the image to the registry. We also shared the following KCS with them: https://access.redhat.com/solutions/1575513. But the customer was not satisfied with that, and they responded with the following: The article does not provide a solution to the problem. It describes the same and gives a dirty workaround that developers will have to apply on each individual BuildConfig. This is not wanted. The fact that we set these envs using buildDefaults is the same workaround. But the core problem remains: the envs are preserved in the container image when using this workaround. This needs to be addressed by engineering so this is fixed properly.
Actual results:
Expected results:
Additional info:
Description of problem:
Azure HC fails to create AzureMachineTemplate if a MachineIdentityID is not provided. E0705 19:09:23.783858 1 controller.go:329] "Reconciler error" err="failed to parse ProviderID : invalid resource ID: id cannot be empty" controller="azuremachinetemplate" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AzureMachineTemplate" AzureMachineTemplate="clusters-hostedcp-1671-hc/hostedcp-1671-hc-f412695a" namespace="clusters-hostedcp-1671-hc" name="hostedcp-1671-hc-f412695a" reconcileID="74581db2-0ac0-4a30-abfc-38f07b8247cc" https://github.com/openshift/hypershift/blob/84f594bd2d44e03aaac2d962b0d548d75505fed7/hypershift-operator/controllers/nodepool/azure.go#L52 does not check first to see if a MachineIdentityID was provided before adding the UserAssignedIdentity field.
Version-Release number of selected component (if applicable):
How reproducible:
Every time
Steps to Reproduce:
1. Create an Azure HC without a MachineIdentityID
Actual results:
Azure HC fails to create AzureMachineTemplate properly, nodes aren't created, and HC is in a failed state.
Expected results:
Azure HC creates AzureMachineTemplate properly, nodes are created, and HC is in a completed state.
Additional info:
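As an illustration of the guard described above (not HyperShift's actual code), a minimal Go sketch that only populates a user-assigned identity when a MachineIdentityID is actually provided, so an empty resource ID is never parsed; the type and field names are stand-ins.
```
package main

import "fmt"

// userAssignedIdentity is a stand-in for the CAPZ UserAssignedIdentity entry.
type userAssignedIdentity struct {
	ProviderID string
}

// buildIdentities returns the identity list for the AzureMachineTemplate spec.
// The fix sketched here is to skip the field entirely when no MachineIdentityID
// was supplied, instead of emitting an entry with an empty resource ID.
func buildIdentities(machineIdentityID string) []userAssignedIdentity {
	if machineIdentityID == "" {
		return nil // nothing to add; avoids "invalid resource ID: id cannot be empty"
	}
	return []userAssignedIdentity{{ProviderID: machineIdentityID}}
}

func main() {
	fmt.Println(buildIdentities("")) // []
	// Placeholder resource ID purely for illustration.
	fmt.Println(buildIdentities("/subscriptions/.../userAssignedIdentities/np-identity"))
}
```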
This is a clone of issue OCPBUGS-38722. The following is the description of the original issue:
—
Description of problem:
We should add validation in the Installer when public-only subnets are enabled to make sure that: 1. A warning is printed if OPENSHIFT_INSTALL_AWS_PUBLIC_ONLY is set. 2. Since this flag is only applicable to public clusters, we could consider exiting earlier if publish: Internal is set. 3. Since this flag is only applicable to BYO-VPC configurations, we could consider exiting earlier if no subnets are provided in the install-config.
Version-Release number of selected component (if applicable):
all versions that support public-only subnets
How reproducible:
always
Steps to Reproduce:
1. Set OPENSHIFT_INSTALL_AWS_PUBLIC_ONLY 2. Do a cluster install without specifying a VPC. 3.
Actual results:
No warning about the invalid configuration.
Expected results:
Additional info:
This is an internal-only feature, so these validations shouldn't affect the normal path used by customers.
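A rough sketch of the checks proposed above (not the installer's actual validation code), using a reduced stand-in for the install-config; it warns when the environment variable is set and rejects the unsupported combinations early.
```
package main

import (
	"fmt"
	"log"
	"os"
)

// installConfig is a reduced stand-in for the relevant install-config fields.
type installConfig struct {
	Publish string   // "External" or "Internal"
	Subnets []string // BYO-VPC subnet IDs, empty when the installer creates the VPC
}

func validatePublicOnly(ic installConfig) error {
	if os.Getenv("OPENSHIFT_INSTALL_AWS_PUBLIC_ONLY") == "" {
		return nil // feature not requested, nothing to validate
	}
	log.Println("warning: OPENSHIFT_INSTALL_AWS_PUBLIC_ONLY is set; this is an internal-only feature")
	if ic.Publish == "Internal" {
		return fmt.Errorf("public-only subnets are only applicable to public clusters (publish: External)")
	}
	if len(ic.Subnets) == 0 {
		return fmt.Errorf("public-only subnets require a BYO-VPC configuration with subnets in the install-config")
	}
	return nil
}

func main() {
	fmt.Println(validatePublicOnly(installConfig{Publish: "External"}))
}
```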
Please review the following PR: https://github.com/openshift/machine-os-images/pull/38
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
Apps exposed via NodePort do not return responses to client requests if the client's ephemeral port is 22623 or 22624.
When testing with curl command specifying the local port as shown below, a response is returned if the ephemeral port is 22622 or 22626, but it times out if the ephemeral port is 22623 or 22624.
[root@bastion ~]# for i in {22622..22626}; do echo localport:${i}; curl -m 10 -I 10.0.0.20:32325 --local-port ${i}; done localport:22622 HTTP/1.1 200 OK Server: nginx/1.22.1 Date: Thu, 25 Jul 2024 07:44:22 GMT Content-Type: text/html Content-Length: 37451 Last-Modified: Wed, 24 Jul 2024 12:20:19 GMT Connection: keep-alive ETag: "66a0f183-924b" Accept-Ranges: bytes localport:22623 curl: (28) Connection timed out after 10001 milliseconds localport:22624 curl: (28) Connection timed out after 10000 milliseconds localport:22625 HTTP/1.1 200 OK Server: nginx/1.22.1 Date: Thu, 25 Jul 2024 07:44:42 GMT Content-Type: text/html Content-Length: 37451 Last-Modified: Wed, 24 Jul 2024 12:20:19 GMT Connection: keep-alive ETag: "66a0f183-924b" Accept-Ranges: bytes localport:22626 HTTP/1.1 200 OK Server: nginx/1.22.1 Date: Thu, 25 Jul 2024 07:44:42 GMT Content-Type: text/html Content-Length: 37451 Last-Modified: Wed, 24 Jul 2024 12:20:19 GMT Connection: keep-alive ETag: "66a0f183-924b" Accept-Ranges: bytes
This issue has been occurring since upgrading to version 4.16. Confirmed that it does not occur in versions 4.14 and 4.12.
Version-Release number of selected component (if applicable):
OCP 4.16
How reproducible:
100%
Steps to Reproduce:
1. Prepare a 4.16 cluster.
2. Launch any web app pod (nginx, httpd, etc.).
3. Expose the application externally using NodePort.
4. Access the URL using curl --local-port option to specify 22623 or 22624.
Actual results:
No response is returned from the exposed application when the ephemeral port is 22623 or 22624.
Expected results:
A response is returned regardless of the ephemeral port.
Additional info:
This issue started occurring from version 4.16, so it is possible that this is due to changes in RHEL 9.4, particularly those related to nftables.
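As an alternative to the curl loop above, here is a small Go reproduction sketch that pins the client's local (ephemeral) port before connecting to the NodePort; the node IP and NodePort are the placeholders from the report and should be adjusted.
```
package main

import (
	"context"
	"fmt"
	"net"
	"net/http"
	"time"
)

// requestFromLocalPort sends one HTTP HEAD request to the NodePort service while
// forcing the client's source port, mirroring curl --local-port.
func requestFromLocalPort(target string, localPort int) error {
	dialer := &net.Dialer{
		Timeout:   10 * time.Second,
		LocalAddr: &net.TCPAddr{Port: localPort},
	}
	client := &http.Client{
		Timeout: 10 * time.Second,
		Transport: &http.Transport{
			DialContext: func(ctx context.Context, network, addr string) (net.Conn, error) {
				return dialer.DialContext(ctx, network, addr)
			},
		},
	}
	resp, err := client.Head(target)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	fmt.Printf("local port %d -> %s\n", localPort, resp.Status)
	return nil
}

func main() {
	// 10.0.0.20:32325 is the node IP and NodePort from the report; adjust as needed.
	for port := 22622; port <= 22626; port++ {
		if err := requestFromLocalPort("http://10.0.0.20:32325", port); err != nil {
			fmt.Printf("local port %d -> error: %v\n", port, err)
		}
	}
}
```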
Description of problem:
Since we aim to remove PF4 and ReactRouter5 in 4.18, we need to deprecate these shared modules in 4.16 to give plugin creators time to update their plugins.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Please review the following PR: https://github.com/openshift/ironic-agent-image/pull/134
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
kube-apiserver was stuck updating versions when upgrading from 4.1 to 4.16 with an AWS IPI installation
Version-Release number of selected component (if applicable):
4.16.0-0.nightly-2024-05-01-111315
How reproducible:
always
Steps to Reproduce:
1. IPI Install an AWS 4.1 cluster, upgrade it to 4.16 2. Upgrade was stuck in 4.15 to 4.16, waiting on etcd, kube-apiserver updating
Actual results:
1. Upgrade was stuck in 4.15 to 4.16, waiting on etcd, kube-apiserver updating $ oc get clusterversion NAME VERSION AVAILABLE PROGRESSING SINCE STATUS version 4.15.0-0.nightly-2024-05-16-091947 True True 39m Working towards 4.16.0-0.nightly-2024-05-16-092402: 111 of 894 done (12% complete)
Expected results:
Upgrade should be successful.
Additional info:
Must-gather: https://gcsweb-qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/qe-private-deck/logs/periodic-ci-openshift-openshift-tests-private-release-4.16-amd64-nightly-4.16-upgrade-from-stable-4.1-aws-ipi-f30/1791391925467615232/artifacts/aws-ipi-f30/gather-must-gather/artifacts/must-gather.tar Checked the must-gather logs, $ omg get clusterversion -oyaml ... conditions: - lastTransitionTime: '2024-05-17T09:35:29Z' message: Done applying 4.15.0-0.nightly-2024-05-16-091947 status: 'True' type: Available - lastTransitionTime: '2024-05-18T06:31:41Z' message: 'Multiple errors are preventing progress: * Cluster operator kube-apiserver is updating versions * Could not update flowschema "openshift-etcd-operator" (82 of 894): the server does not recognize this resource, check extension API servers' reason: MultipleErrors status: 'True' type: Failing $ omg get co | grep -v '.*True.*False.*False' NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE kube-apiserver 4.15.0-0.nightly-2024-05-16-091947 True True False 10m $ omg get pod -n openshift-kube-apiserver NAME READY STATUS RESTARTS AGE installer-40-ip-10-0-136-146.ec2.internal 0/1 Succeeded 0 2h29m installer-41-ip-10-0-143-206.ec2.internal 0/1 Succeeded 0 2h25m installer-43-ip-10-0-154-116.ec2.internal 0/1 Succeeded 0 2h22m installer-44-ip-10-0-154-116.ec2.internal 0/1 Succeeded 0 1h35m kube-apiserver-guard-ip-10-0-136-146.ec2.internal 1/1 Running 0 2h24m kube-apiserver-guard-ip-10-0-143-206.ec2.internal 1/1 Running 0 2h24m kube-apiserver-guard-ip-10-0-154-116.ec2.internal 0/1 Running 0 2h24m kube-apiserver-ip-10-0-136-146.ec2.internal 5/5 Running 0 2h27m kube-apiserver-ip-10-0-143-206.ec2.internal 5/5 Running 0 2h24m kube-apiserver-ip-10-0-154-116.ec2.internal 4/5 Running 17 1h34m revision-pruner-39-ip-10-0-136-146.ec2.internal 0/1 Succeeded 0 2h44m revision-pruner-39-ip-10-0-143-206.ec2.internal 0/1 Succeeded 0 2h50m revision-pruner-39-ip-10-0-154-116.ec2.internal 0/1 Succeeded 0 2h52m revision-pruner-40-ip-10-0-136-146.ec2.internal 0/1 Succeeded 0 2h29m revision-pruner-40-ip-10-0-143-206.ec2.internal 0/1 Succeeded 0 2h29m revision-pruner-40-ip-10-0-154-116.ec2.internal 0/1 Succeeded 0 2h29m revision-pruner-41-ip-10-0-136-146.ec2.internal 0/1 Succeeded 0 2h26m revision-pruner-41-ip-10-0-143-206.ec2.internal 0/1 Succeeded 0 2h26m revision-pruner-41-ip-10-0-154-116.ec2.internal 0/1 Succeeded 0 2h26m revision-pruner-42-ip-10-0-136-146.ec2.internal 0/1 Succeeded 0 2h24m revision-pruner-42-ip-10-0-143-206.ec2.internal 0/1 Succeeded 0 2h23m revision-pruner-42-ip-10-0-154-116.ec2.internal 0/1 Succeeded 0 2h23m revision-pruner-43-ip-10-0-136-146.ec2.internal 0/1 Succeeded 0 2h23m revision-pruner-43-ip-10-0-143-206.ec2.internal 0/1 Succeeded 0 2h23m revision-pruner-43-ip-10-0-154-116.ec2.internal 0/1 Succeeded 0 2h23m revision-pruner-44-ip-10-0-136-146.ec2.internal 0/1 Succeeded 0 1h35m revision-pruner-44-ip-10-0-143-206.ec2.internal 0/1 Succeeded 0 1h35m revision-pruner-44-ip-10-0-154-116.ec2.internal 0/1 Succeeded 0 1h35m Checked the kube-apiserver kube-apiserver-ip-10-0-154-116.ec2.internal logs, seems something wring with informers, $ grep 'informers not started yet' current.log | wc -l 360 $ grep 'informers not started yet' current.log 2024-05-18T06:34:51.888804183Z [-]informer-sync failed: 4 informers not started yet: [*v1.PriorityLevelConfiguration *v1.Secret *v1.FlowSchema *v1.ConfigMap] 2024-05-18T06:34:51.889350484Z [-]informer-sync failed: 4 informers not started yet: [*v1.PriorityLevelConfiguration *v1.FlowSchema *v1.Secret 
*v1.ConfigMap] 2024-05-18T06:34:52.004808401Z [-]informer-sync failed: 2 informers not started yet: [*v1.FlowSchema *v1.PriorityLevelConfiguration] 2024-05-18T06:34:52.095516498Z [-]informer-sync failed: 2 informers not started yet: [*v1.PriorityLevelConfiguration *v1.FlowSchema] ...
Please review the following PR: https://github.com/openshift/installer/pull/8459
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
This is a clone of issue OCPBUGS-38622. The following is the description of the original issue:
—
Description of problem:
See https://github.com/prometheus/prometheus/issues/14503 for more details
Version-Release number of selected component (if applicable):
4.16
How reproducible:
Steps to Reproduce:
1. Make Prometheus scrape a target that exposes multiple samples of the same series with different explicit timestamps, for example:
# TYPE requests_per_second_requests gauge
# UNIT requests_per_second_requests requests
# HELP requests_per_second_requests test-description
requests_per_second_requests 16 1722466225604
requests_per_second_requests 14 1722466226604
requests_per_second_requests 40 1722466227604
requests_per_second_requests 15 1722466228604
# EOF
2. Not all the samples will be ingested
3. If Prometheus continues scraping that target for a moment, the PrometheusDuplicateTimestamps alert will fire.
Actual results:
Expected results: all the samples should be considered (of course, if the timestamps are too old or too far in the future, Prometheus may refuse them).
Additional info:
Regression introduced in Prometheus 2.52. Proposed upstream fixes: https://github.com/prometheus/prometheus/pull/14683 https://github.com/prometheus/prometheus/pull/14685
This is a clone of issue OCPBUGS-38636. The following is the description of the original issue:
—
Description of problem:
Version-Release number of selected component (if applicable):
When navigating from Lightspeed's "Don't show again" link, it can be hard to know which element is relevant. We should look at utilizing Spotlight to highlight the relevant user preference. Also, there is an undesirable gap before the Lightspeed user preference caused by an empty div from data-test="console.telemetryAnalytics".
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
This is a clone of issue OCPBUGS-42514. The following is the description of the original issue:
—
Description of problem:
When configuring the OpenShift image registry to use a custom Azure storage account in a different resource group, following the official documentation [1], the image-registry CO degrades and the upgrade from version 4.14.x to 4.15.x fails. The image registry operator reports misconfiguration errors related to Azure storage credentials, preventing the upgrade and causing instability in the control plane.
[1] Configuring registry storage in Azure user infrastructure
Version-Release number of selected component (if applicable):
4.14.33, 4.15.33
How reproducible:
Steps to Reproduce:
We got the error
NodeCADaemonProgressing: The daemon set node-ca is deployed Progressing: Unable to apply resources: unable to sync storage configuration: client misconfigured, missing 'TenantID', 'ClientID', 'ClientSecret', 'FederatedTokenFile', 'Creds', 'SubscriptionID' option(s)
The operator will also generate a new secret image-registry-private-configuration with the same content as image-registry-private-configuration-user.
$ oc get secret image-registry-private-configuration -o yaml apiVersion: v1 data: REGISTRY_STORAGE_AZURE_ACCOUNTKEY: xxxxxxxxxxxxxxxxx kind: Secret metadata: annotations: imageregistry.operator.openshift.io/checksum: sha256:524fab8dd71302f1a9ade9b152b3f9576edb2b670752e1bae1cb49b4de992eee creationTimestamp: "2024-09-26T19:52:17Z" name: image-registry-private-configuration namespace: openshift-image-registry resourceVersion: "126426" uid: e2064353-2511-4666-bd43-29dd020573fe type: Opaque
2. Then we delete the secret image-registry-private-configuration-user.
Now the secret image-registry-private-configuration still exists with the same content, but the image-registry CO reports a new error:
NodeCADaemonProgressing: The daemon set node-ca is deployed Progressing: Unable to apply resources: unable to sync storage configuration: failed to get keys for the storage account arojudesa: storage.AccountsClient#ListKeys: Failure responding to request: StatusCode=404 -- Original Error: autorest/azure: Service returned an error. Status=404 Code="ResourceNotFound" Message="The Resource 'Microsoft.Storage/storageAccounts/arojudesa' under resource group 'aro-ufjvmbl1' was not found. For more details please go to https://aka.ms/ARMResourceNotFoundFix"
3. Apply the workaround of manually changing the azure_resourcegroup key in the installer-cloud-credentials secret to the custom storage account's resource group.
$ oc get secret installer-cloud-credentials -o yaml apiVersion: v1 data: azure_client_id: xxxxxxxxxxxxxxxxx azure_client_secret: xxxxxxxxxxxxxxxxx azure_region: xxxxxxxxxxxxxxxxx azure_resource_prefix: xxxxxxxxxxxxxxxxx azure_resourcegroup: xxxxxxxxxxxxxxxxx <<<<<-----THIS azure_subscription_id: xxxxxxxxxxxxxxxxx azure_tenant_id: xxxxxxxxxxxxxxxxx kind: Secret metadata: annotations: cloudcredential.openshift.io/credentials-request: openshift-cloud-credential-operator/openshift-image-registry-azure creationTimestamp: "2024-09-26T16:49:57Z" labels: cloudcredential.openshift.io/credentials-request: "true" name: installer-cloud-credentials namespace: openshift-image-registry resourceVersion: "133921" uid: d1268e2c-1825-49f0-aa44-d0e1cbcda383 type: Opaque
The image-registry CO reports healthy, and this allows the upgrade to continue.
Actual results:
The image registry still seems to use the service principal method for Azure storage account authentication.
Expected results:
We expect REGISTRY_STORAGE_AZURE_ACCOUNTKEY to be the only thing the image registry operator needs for storage account authentication if the customer provides it.
Additional info:
Slack : https://redhat-internal.slack.com/archives/CCV9YF9PD/p1727379313014789
Please review the following PR: https://github.com/openshift/kubernetes-metrics-server/pull/28
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
This is a clone of issue OCPBUGS-38794. The following is the description of the original issue:
—
Description of problem:
HCP cluster is being updated but the nodepool is stuck updating: ~~~ NAME CLUSTER DESIRED NODES CURRENT NODES AUTOSCALING AUTOREPAIR VERSION UPDATINGVERSION UPDATINGCONFIG MESSAGE nodepool-dev-cluster dev 2 2 False False 4.15.22 True True ~~~
Version-Release number of selected component (if applicable):
Hosting OCP cluster 4.15 HCP 4.15.23
How reproducible:
N/A
Steps to Reproduce:
1. 2. 3.
Actual results:
Nodepool stuck in upgrade
Expected results:
Upgrade success
Additional info:
I have found this error repeating continually in the ignition-server pods: ~~~ {"level":"error","ts":"2024-08-20T09:02:19Z","msg":"Reconciler error","controller":"secret","controllerGroup":"","controllerKind":"Secret","Secret":{"name":"token-nodepool-dev-cluster-3146da34","namespace":"dev-dev"},"namespace":"dev-dev","name":"token-nodepool-dev-cluster-3146da34","reconcileID":"ec1f0a7f-1657-4245-99ef-c984977ff0f8","error":"error getting ignition payload: failed to download binaries: failed to extract image file: failed to extract image file: file not found","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/hypershift/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:329\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/hypershift/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/hypershift/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:227"} {"level":"info","ts":"2024-08-20T09:02:20Z","logger":"get-payload","msg":"discovered machine-config-operator image","image":"quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:f3b55cc8f88b9e6564fe6ad0bc431cd7270c0586a06d9b4a19ff2b518c461ede"} {"level":"info","ts":"2024-08-20T09:02:20Z","logger":"get-payload","msg":"created working directory","dir":"/payloads/get-payload4089452863"} {"level":"info","ts":"2024-08-20T09:02:28Z","logger":"get-payload","msg":"extracted image-references","time":"8s"} {"level":"info","ts":"2024-08-20T09:02:38Z","logger":"get-payload","msg":"extracted templates","time":"10s"} {"level":"info","ts":"2024-08-20T09:02:38Z","logger":"image-cache","msg":"retrieved cached file","imageRef":"quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:f3b55cc8f88b9e6564fe6ad0bc431cd7270c0586a06d9b4a19ff2b518c461ede","file":"usr/lib/os-release"} {"level":"info","ts":"2024-08-20T09:02:38Z","logger":"get-payload","msg":"read os-release","mcoRHELMajorVersion":"8","cpoRHELMajorVersion":"9"} {"level":"info","ts":"2024-08-20T09:02:38Z","logger":"get-payload","msg":"copying file","src":"usr/bin/machine-config-operator.rhel9","dest":"/payloads/get-payload4089452863/bin/machine-config-operator"} ~~~
This is a clone of issue OCPBUGS-42553. The following is the description of the original issue:
—
Context thread.
Description of problem:
Monitoring the 4.18 agent-based installer CI job for s390x (https://github.com/openshift/release/pull/50293) I discovered unexpected behavior once the installation triggers the reboot-into-disk step for the 2nd and 3rd control plane nodes. (The first control plane node is rebooted last because it is also the bootstrap node.) Instead of rebooting successfully as expected, it fails to find the OSTree and drops to dracut, stalling the installation.
Version-Release number of selected component (if applicable):
OpenShift 4.18 on s390x only; discovered using agent installer
How reproducible:
Try to install OpenShift 4.18 using agent-based installer on s390x
Steps to Reproduce:
1. Boot nodes with XML (see attached) 2. Wait for installation to get to reboot phase.
Actual results:
Control plane nodes fail to reboot.
Expected results:
Control plane nodes reboot and installation progresses.
Additional info:
See attached logs.
This is a clone of issue OCPBUGS-38368. The following is the description of the original issue:
—
After the multi-VC changes were merged, when we use this tool the following warnings get logged:
E0812 13:04:34.813216 13159 config_yaml.go:208] Unmarshal failed: yaml: unmarshal errors: line 1: cannot unmarshal !!seq into config.CommonConfigYAML I0812 13:04:34.813376 13159 config.go:272] ReadConfig INI succeeded. INI-based cloud-config is deprecated and will be removed in 2.0. Please use YAML based cloud-config.
Which looks a bit scarier than it should.
Description of the problem:
The error "Setting Machine network CIDR is forbidden when cluster is not in vip-dhcp-allocation mode" is only ever seen on cluster updates. This means that users may see this issue only after a cluster is fully installed which would prevent day2 node additions.
How reproducible:
100%
Steps to reproduce:
1. Create an AgentClusterInstall with both VIPs and machineNetwork CIDR set
2. Observe SpecSynced condition
Actual results:
No error is seen
Expected results:
An error is presented saying this is an invalid combination.
Additional information:
This was originally seen as a part of https://issues.redhat.com/browse/ACM-10853 and I was only able to see SpecSynced success for a few seconds before I saw the mentioned error. Somehow, though, this user was able to install a cluster with this configuration, so maybe we should block it with a webhook rather than a condition?
This is a clone of issue OCPBUGS-41538. The following is the description of the original issue:
—
Description of problem:
When the user selects a shared VPC install, the created control plane service account is left over. To verify: after the destruction of the cluster, check the principals in the host project for a remaining name XXX-m@some-service-account.com.
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
No principal remaining
Additional info:
This is a clone of issue OCPBUGS-42961. The following is the description of the original issue:
—
With the rapid recommendations feature (enhancement) one can request various messages from Pods matching various Pod name regular expressions.
The problem is when there is a Pod (e.g foo-1 from the below example) matching more than one requested Pod name regex:
{ 'namespace': 'test-namespace', 'pod_name_regex': 'foo-.*', 'messages': ['regex1', 'regex2'] }, { 'namespace': 'test-namespace', 'pod_name_regex': 'foo-1', 'messages': ['regex3', 'regex4'] }
Assume Pods with names foo-1 and foo-bar. Currently all the regexes (regex1, regex2, regex3, regex4) are filtered for both Pods.
The desired behavior is that foo-1 filters all the regexes, but foo-bar is filtered only with regex1 and regex2.
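A small Go sketch of the desired matching behavior (the request structure is simplified and the field names are assumptions): each Pod only gets the message regexes from the pod_name_regex entries its name actually matches.
```
package main

import (
	"fmt"
	"regexp"
)

// messageRequest mirrors one entry of the requested log filters.
type messageRequest struct {
	Namespace    string
	PodNameRegex string
	Messages     []string
}

// messagesForPod returns only the message regexes whose pod_name_regex matches
// this Pod: foo-1 gets regex1..regex4, while foo-bar gets only regex1 and regex2.
func messagesForPod(podName string, requests []messageRequest) []string {
	var out []string
	for _, req := range requests {
		re, err := regexp.Compile(req.PodNameRegex)
		if err != nil {
			continue // skip invalid patterns rather than over-matching
		}
		if re.MatchString(podName) {
			out = append(out, req.Messages...)
		}
	}
	return out
}

func main() {
	requests := []messageRequest{
		{Namespace: "test-namespace", PodNameRegex: "foo-.*", Messages: []string{"regex1", "regex2"}},
		{Namespace: "test-namespace", PodNameRegex: "foo-1", Messages: []string{"regex3", "regex4"}},
	}
	fmt.Println(messagesForPod("foo-1", requests))   // [regex1 regex2 regex3 regex4]
	fmt.Println(messagesForPod("foo-bar", requests)) // [regex1 regex2]
}
```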
Please review the following PR: https://github.com/openshift/must-gather/pull/423
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
I was seeing the following error running `build.sh` with go v1.19.5 until I upgraded to v1.22.4:
```
❯ ./build.sh
pkg/auth/sessions/server_session.go:7:2: cannot find package "." in:
/Users/rhamilto/Git/console/vendor/slices
```
Description of problem:
The MAPI for IBM Cloud currently only checks the first group of subnets (50) when searching for Subnet details by name. It should provide pagination support to search all subnets.
Version-Release number of selected component (if applicable):
4.17
How reproducible:
100%, dependent on order of subnets returned by IBM Cloud API's however
Steps to Reproduce:
1. Create 50+ IBM Cloud VPC Subnets 2. Create a new IPI cluster (with or without BYON) 3. MAPI will attempt to find Subnet details by name, likely failing as it only checks the first group (50)...depending on order returned by IBM Cloud API
Actual results:
MAPI fails to find Subnet ID, thus cannot create/manage cluster nodes.
Expected results:
Successful IPI deployment.
Additional info:
IBM Cloud is working on a patch to MAPI to handle the ListSubnets API call and pagination results.
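IBM Cloud's patch will presumably use the VPC SDK's paging directly; as a generic illustration only, here is a Go sketch that pages through subnets with a hypothetical listSubnets(start) callback until the name is found or pages run out, rather than stopping after the first group of 50.
```
package main

import (
	"errors"
	"fmt"
)

type subnet struct {
	ID   string
	Name string
}

// subnetPage is one page of results plus the token for the next page; the
// shape mimics paginated list APIs and is an assumption for this sketch.
type subnetPage struct {
	Subnets []subnet
	Next    string // empty when this is the last page
}

// findSubnetByName keeps requesting pages instead of only inspecting the first one.
func findSubnetByName(name string, listSubnets func(start string) (subnetPage, error)) (subnet, error) {
	start := ""
	for {
		page, err := listSubnets(start)
		if err != nil {
			return subnet{}, err
		}
		for _, s := range page.Subnets {
			if s.Name == name {
				return s, nil
			}
		}
		if page.Next == "" {
			return subnet{}, errors.New("subnet not found: " + name)
		}
		start = page.Next
	}
}

func main() {
	pages := map[string]subnetPage{
		"":   {Subnets: []subnet{{ID: "1", Name: "subnet-a"}}, Next: "p2"},
		"p2": {Subnets: []subnet{{ID: "2", Name: "subnet-b"}}},
	}
	s, err := findSubnetByName("subnet-b", func(start string) (subnetPage, error) { return pages[start], nil })
	fmt.Println(s, err) // {2 subnet-b} <nil>
}
```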
This is a clone of issue OCPBUGS-38225. The following is the description of the original issue:
—
Description of problem:
https://search.dptools.openshift.org/?search=Helm+Release&maxAge=168h&context=1&type=junit&name=pull-ci-openshift-console-master-e2e-gcp-console&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Fallout of https://issues.redhat.com/browse/OCPBUGS-35371
We simply do not have enough visibility into why these kubelet endpoints are going down, outside of a reboot, while kubelet itself stays up.
A big step would be charting them with the intervals. Add a new monitor test to query prometheus at the end of the run looking for when these targets were down.
Prom query:
max by (node, metrics_path) (up{job="kubelet"}) == 0
Then perhaps a test to flake if we see this happen outside of a node reboot. This seems to happen on every gcp-ovn (non-upgrade) job I look at. It does NOT seem to happen on AWS.
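A rough sketch of such a monitor test using the Prometheus Go client (the endpoint is a placeholder, and authentication is omitted); it runs the query above at the end of the run and reports any node/metrics_path pairs that were down. A real monitor test would use a range query and turn the results into intervals.
```
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"github.com/prometheus/client_golang/api"
	v1 "github.com/prometheus/client_golang/api/prometheus/v1"
	"github.com/prometheus/common/model"
)

func main() {
	// Placeholder endpoint; in CI this would be the in-cluster Prometheus/Thanos
	// route plus a bearer-token round-tripper.
	client, err := api.NewClient(api.Config{Address: "http://localhost:9090"})
	if err != nil {
		log.Fatal(err)
	}
	promAPI := v1.NewAPI(client)

	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	// Same query as above: kubelet scrape targets that were reported down.
	query := `max by (node, metrics_path) (up{job="kubelet"}) == 0`
	result, warnings, err := promAPI.Query(ctx, query, time.Now())
	if err != nil {
		log.Fatal(err)
	}
	if len(warnings) > 0 {
		log.Printf("warnings: %v", warnings)
	}

	vector, ok := result.(model.Vector)
	if !ok {
		log.Fatalf("unexpected result type %T", result)
	}
	for _, sample := range vector {
		// A full test would flake only when the downtime does not line up with a node reboot.
		fmt.Printf("kubelet target down: node=%s path=%s\n",
			sample.Metric["node"], sample.Metric["metrics_path"])
	}
}
```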
Description of problem:
Regression of OCPBUGS-12739
level=warning msg="Couldn't unmarshall OVN annotations: ''. Skipping." err="unexpected end of JSON input"
Upstream OVN changed the node annotation from "k8s.ovn.org/host-addresses" to "k8s.ovn.org/host-cidrs" in OpenShift 4.14
https://github.com/ovn-org/ovn-kubernetes/pull/3915
We might need to fix baremetal-runtimecfg
diff --git a/pkg/config/node.go b/pkg/config/node.go
index 491dd4f..078ad77 100644
--- a/pkg/config/node.go
+++ b/pkg/config/node.go
@@ -367,10 +367,10 @@ func getNodeIpForRequestedIpStack(node v1.Node, filterIps []string, machineNetwo
 	log.Debugf("For node %s can't find address using NodeInternalIP. Fallback to OVN annotation.", node.Name)
 	var ovnHostAddresses []string
-	if err := json.Unmarshal([]byte(node.Annotations["k8s.ovn.org/host-addresses"]), &ovnHostAddresses); err != nil {
+	if err := json.Unmarshal([]byte(node.Annotations["k8s.ovn.org/host-cidrs"]), &ovnHostAddresses); err != nil {
 		log.WithFields(logrus.Fields{
 			"err": err,
-		}).Warnf("Couldn't unmarshall OVN annotations: '%s'. Skipping.", node.Annotations["k8s.ovn.org/host-addresses"])
+		}).Warnf("Couldn't unmarshall OVN annotations: '%s'. Skipping.", node.Annotations["k8s.ovn.org/host-cidrs"])
 	}
Version-Release number of selected component (if applicable):
4.16.0-0.nightly-2024-05-30-130713
How reproducible:
Frequent
Steps to Reproduce:
1. Deploy vsphere IPv4 cluster 2. Convert to Dualstack IPv4/IPv6 3. Add machine network and IPv6 apiServerInternalIPs and ingressIPs 4. Check keepalived.conf for f in $(oc get pods -n openshift-vsphere-infra -l app=vsphere-infra-vrrp --no-headers -o custom-columns=N:.metadata.name ) ; do oc -n openshift-vsphere-infra exec -c keepalived $f -- cat /etc/keepalived/keepalived.conf | tee $f-keepalived.conf ; done
Actual results:
IPv6 VIP is not in keepalived.conf
Expected results:
Something like:
vrrp_instance rbrattai_INGRESS_1 { state BACKUP interface br-ex virtual_router_id 129 priority 20 advert_int 1 unicast_src_ip fd65:a1a8:60ad:271c::cc unicast_peer { fd65:a1a8:60ad:271c:9af:16a9:cb4f:d75c fd65:a1a8:60ad:271c:86ec:8104:1bc2:ab12 fd65:a1a8:60ad:271c:5f93:c9cf:95f:9a6d fd65:a1a8:60ad:271c:bb4:de9e:6d58:89e7 fd65:a1a8:60ad:271c:3072:2921:890:9263 } ... virtual_ipaddress { fd65:a1a8:60ad:271c::1117/128 } ... }
This is a clone of issue OCPBUGS-42579. The following is the description of the original issue:
—
Hello Team,
When we deploy the HyperShift cluster with OpenShift Virtualization by specifying the NodePort strategy for services, the requests to the ignition, oauth, connectivity (for oc rsh, oc logs, oc exec), and virt-launcher-hypershift-node-pool pods fail, as by default the following netpols get created automatically, restricting the traffic on all other ports.
$ oc get netpol NAME POD-SELECTOR AGE kas app=kube-apiserver 153m openshift-ingress <none> 153m openshift-monitoring <none> 153m same-namespace <none> 153m
I resolved this by creating the following NetworkPolicies manually:
$ cat ingress-netpol apiVersion: networking.k8s.io/v1 kind: NetworkPolicy metadata: name: ingress spec: ingress: - ports: - port: 31032 protocol: TCP podSelector: matchLabels: kubevirt.io: virt-launcher policyTypes: - Ingress $ cat oauth-netpol apiVersion: networking.k8s.io/v1 kind: NetworkPolicy metadata: name: oauth spec: ingress: - ports: - port: 6443 protocol: TCP podSelector: matchLabels: app: oauth-openshift hypershift.openshift.io/control-plane-component: oauth-openshift policyTypes: - Ingress $ cat ignition-netpol apiVersion: networking.k8s.io/v1 kind: NetworkPolicy metadata: name: nodeport-ignition-proxy spec: ingress: - ports: - port: 8443 protocol: TCP podSelector: matchLabels: app: ignition-server-proxy policyTypes: - Ingress $ cat konn-netpol apiVersion: networking.k8s.io/v1 kind: NetworkPolicy metadata: name: konn spec: ingress: - ports: - port: 8091 protocol: TCP podSelector: matchLabels: app: kube-apiserver hypershift.openshift.io/control-plane-component: kube-apiserver policyTypes: - Ingress
The bug for ignition netpol has already been reported.
--> https://issues.redhat.com/browse/OCPBUGS-39158
--> https://issues.redhat.com/browse/OCPBUGS-39317
It would be helpful if these policies were created automatically as well, or if we had an option in HyperShift to disable the automatic management of network policies so we can manually take care of them.
As an SRE, I want to aggregate the `cluster_proxy_ca_expiry_timestamp` metric to achieve feature parity with OSD/ROSA clusters as they are today.
Description of criteria:
The expiry in classic is calculated by looking at the user supplied CA bundle, running openssl command to extract the expiry, and calculating the number of days that the cert is valid for. We should use the same approach to calculate the expiry for HCP clusters.
Current implementation:
SRE Spike for this effort: https://issues.redhat.com/browse/OSD-15414
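The same calculation can be done without shelling out to openssl. As a minimal sketch only (not SRE's actual exporter), the following Go program parses a user-supplied CA bundle from a placeholder path and derives the earliest expiry and the remaining days.
```
package main

import (
	"crypto/x509"
	"encoding/pem"
	"fmt"
	"log"
	"os"
	"time"
)

// earliestExpiry parses every certificate in a PEM bundle and returns the
// soonest NotAfter, mirroring the openssl-based calculation used on classic.
func earliestExpiry(pemBytes []byte) (time.Time, error) {
	var earliest time.Time
	for block, rest := pem.Decode(pemBytes); block != nil; block, rest = pem.Decode(rest) {
		if block.Type != "CERTIFICATE" {
			continue
		}
		cert, err := x509.ParseCertificate(block.Bytes)
		if err != nil {
			return time.Time{}, err
		}
		if earliest.IsZero() || cert.NotAfter.Before(earliest) {
			earliest = cert.NotAfter
		}
	}
	if earliest.IsZero() {
		return time.Time{}, fmt.Errorf("no certificates found in bundle")
	}
	return earliest, nil
}

func main() {
	// Path is a placeholder for the user-supplied trusted CA bundle.
	data, err := os.ReadFile("/tmp/user-ca-bundle.pem")
	if err != nil {
		log.Fatal(err)
	}
	expiry, err := earliestExpiry(data)
	if err != nil {
		log.Fatal(err)
	}
	days := int(time.Until(expiry).Hours() / 24)
	fmt.Printf("cluster_proxy_ca_expiry_timestamp=%d (expires in %d days)\n", expiry.Unix(), days)
}
```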
Please review the following PR: https://github.com/openshift/cluster-api-provider-azure/pull/306
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
After installing the MCE operator and trying to create a MultiClusterEngine instance, it failed with the error: "error applying object Name: mce Kind: ConsolePlugin Error: conversion webhook for console.openshift.io/v1alpha1, Kind=ConsolePlugin failed: Post "https://webhook.openshift-console-operator.svc:9443/crdconvert?timeout=30s": service "webhook" not found". Checked in openshift-console-operator: there is no webhook service, and the deployment "console-conversion-webhook" is also missing.
Version-Release number of selected component (if applicable):
4.17.0-0.nightly-2024-06-25-103421
How reproducible:
Always
Steps to Reproduce:
1. Check resources in openshift-console-operator, such as deployment and service. 2. 3.
Actual results:
1. There is no webhook-related deployment, pod, or service.
Expected results:
1. Webhook-related resources should exist.
Additional info:
Description of problem:
When deploying a private cluster, if the VPC isn't a permitted network, the installer will not add it as one.
Version-Release number of selected component (if applicable):
How reproducible:
Easily
Steps to Reproduce:
1. Deploy a private cluster with CAPI in a VPC where the desired DNS zone is not permitted 2. Fail 3.
Actual results:
Cluster cannot reach endpoints and deployment fails
Expected results:
Network should be permitted and deployment should succeed
Additional info:
Description of problem:
Live migration gets stuck when the ConfigMap mtu is absent. The ConfigMap mtu should be created by the mtu-prober job at installation time since 4.11. But if the cluster was upgraded from a very early release, such as 4.4.4, the ConfigMap mtu may be absent.
Version-Release number of selected component (if applicable):
4.16.rc2
How reproducible:
Steps to Reproduce:
1. build a 4.16 cluster with OpenShiftSDN 2. remove the configmap mtu from the namespace cluster-network-operator. 3. start live migration.
Actual results:
Live migration gets stuck with error NetworkTypeMigrationFailed Failed to process SDN live migration (configmaps "mtu" not found)
Expected results:
Live migration finished successfully.
Additional info:
A workaround is to create the configmap mtu manually before starting live migration.
Converted to a bug so it could be backported.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
This is a clone of issue OCPBUGS-38994. The following is the description of the original issue:
—
Description of problem:
The library-sync.sh script may leave some files of the unsupported samples in the checkout. In particular, the files that have been renamed are not deleted even though they should have been.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. Run library-sync.sh
Actual results:
A couple of files under assets/operator/ocp-x86_64/fis are present.
Expected results:
The directory should not be present at all, because it is not supported.
Additional info:
The goal is to collect recording rules about issues within the pods of CNV containers. At the moment, the cnv_abnormal metric includes memory-exceeded values by container for the pod with the highest exceeded bytes.
The recording rules attached in the screenshot.
Labels
The cardinality of the metric is at most 8
The end results contain 4 (containers) x 2 (memory types) = 8 records, with 2 labels for each record. In addition, we use 2 additional kubevirt rules that calculate the end values by memory type: https://github.com/kubevirt/kubevirt/blob/main/pkg/monitoring/rules/recordingrules/operator.go. cnv_abnormal is reported in HCO: https://github.com/kubevirt/hyperconverged-cluster-operator/blob/main/pkg/monitoring/rules/recordingrules/operator.go#L28
Description of problem: Missing dependency warning error in console UI dev env
[yapei@yapei-mac frontend (master)]$ yarn lint --fix
yarn run v1.22.15
$ NODE_OPTIONS=--max-old-space-size=4096 yarn eslint . --fix
$ eslint --ext .js,.jsx,.ts,.tsx,.json,.gql,.graphql --color . --fix
/Users/yapei/go/src/github.com/openshift/console/frontend/packages/console-shared/src/components/close-button/CloseButton.tsx
  2:46  error  Unable to resolve path to module '@patternfly/react-component-groups'  import/no-unresolved
/Users/yapei/go/src/github.com/openshift/console/frontend/public/components/resource-dropdown.tsx
  109:6  warning  React Hook React.useEffect has missing dependencies: 'clearItems' and 'recentSelected'. Either include them or remove the dependency array  react-hooks/exhaustive-deps
✖ 2 problems (1 error, 1 warning)
error Command failed with exit code 1.
info Visit https://yarnpkg.com/en/docs/cli/run for documentation about this command.
error Command failed with exit code 1.
info Visit https://yarnpkg.com/en/docs/cli/run for documentation about this command.
[yapei@yapei-mac frontend (master)]$ git log -1
commit 9478885a967f633ddc327ade1c0d552094db418b (HEAD -> master, origin/release-4.17, origin/release-4.16, origin/master, origin/HEAD)
Merge: 3e708b7df9 c3d89c5798
Author: openshift-merge-bot[bot] <148852131+openshift-merge-bot[bot]@users.noreply.github.com>
Date: Mon Mar 18 16:33:11 2024 +0000 Merge pull request #13665 from cyril-ui-developer/add-locales-support-fr-es
Description of problem:
The infra machine goes to Failed status:
2024-05-18 07:26:49.815 | NAMESPACE NAME PHASE TYPE REGION ZONE AGE 2024-05-18 07:26:49.822 | openshift-machine-api ostest-wgdc2-infra-0-4sqdh Running master regionOne nova 31m 2024-05-18 07:26:49.826 | openshift-machine-api ostest-wgdc2-infra-0-ssx8j Failed 31m 2024-05-18 07:26:49.831 | openshift-machine-api ostest-wgdc2-infra-0-tfkf5 Running master regionOne nova 31m 2024-05-18 07:26:49.841 | openshift-machine-api ostest-wgdc2-master-0 Running master regionOne nova 38m 2024-05-18 07:26:49.847 | openshift-machine-api ostest-wgdc2-master-1 Running master regionOne nova 38m 2024-05-18 07:26:49.852 | openshift-machine-api ostest-wgdc2-master-2 Running master regionOne nova 38m 2024-05-18 07:26:49.858 | openshift-machine-api ostest-wgdc2-worker-0-d5cdp Running worker regionOne nova 31m 2024-05-18 07:26:49.868 | openshift-machine-api ostest-wgdc2-worker-0-jcxml Running worker regionOne nova 31m 2024-05-18 07:26:49.873 | openshift-machine-api ostest-wgdc2-worker-0-t29fz Running worker regionOne nova 31m
Logs from machine-controller shows below error:
2024-05-18T06:59:11.159013162Z I0518 06:59:11.158938 1 controller.go:156] ostest-wgdc2-infra-0-ssx8j: reconciling Machine 2024-05-18T06:59:11.159589148Z I0518 06:59:11.159529 1 recorder.go:104] events "msg"="Reconciled machine ostest-wgdc2-worker-0-jcxml" "object"={"kind":"Machine","namespace":"openshift-machine-api","name":"ostest-wgdc2-worker-0-jcxml","uid":"245bac8e-c110-4bef-ac11-3d3751a93353","apiVersion":"machine.openshift.io/v1beta1","resourceVersion":"18617"} "reason"="Reconciled" "type"="Normal" 2024-05-18T06:59:12.749966746Z I0518 06:59:12.749845 1 controller.go:349] ostest-wgdc2-infra-0-ssx8j: reconciling machine triggers idempotent create 2024-05-18T07:00:00.487702632Z E0518 07:00:00.486365 1 leaderelection.go:332] error retrieving resource lock openshift-machine-api/cluster-api-provider-openstack-leader: Get "https://172.30.0.1:443/apis/coordination.k8s.io/v1/namespaces/openshift-machine-api/leases/cluster-api-provider-openstack-leader": http2: client connection lost 2024-05-18T07:00:00.487702632Z W0518 07:00:00.486497 1 controller.go:351] ostest-wgdc2-infra-0-ssx8j: failed to create machine: error creating bootstrap for ostest-wgdc2-infra-0-ssx8j: Get "https://172.30.0.1:443/api/v1/namespaces/openshift-machine-api/secrets/worker-user-data": http2: client connection lost 2024-05-18T07:00:00.487702632Z I0518 07:00:00.486534 1 controller.go:391] Actuator returned invalid configuration error: error creating bootstrap for ostest-wgdc2-infra-0-ssx8j: Get "https://172.30.0.1:443/api/v1/namespaces/openshift-machine-api/secrets/worker-user-data": http2: client connection lost 2024-05-18T07:00:00.487702632Z I0518 07:00:00.486548 1 controller.go:404] ostest-wgdc2-infra-0-ssx8j: going into phase "Failed"
The openstack VM is not even created:
2024-05-18 07:26:50.911 | +--------------------------------------+-----------------------------+--------+---------------------------------------------------------------------------------------------------------------------+--------------------+--------+ 2024-05-18 07:26:50.917 | | ID | Name | Status | Networks | Image | Flavor | 2024-05-18 07:26:50.924 | +--------------------------------------+-----------------------------+--------+---------------------------------------------------------------------------------------------------------------------+--------------------+--------+ 2024-05-18 07:26:50.929 | | 3a1b9af6-d284-4da5-8ebe-434d3aa95131 | ostest-wgdc2-worker-0-jcxml | ACTIVE | StorageNFS=172.17.5.187; network-dualstack=192.168.192.185, fd2e:6f44:5dd8:c956:f816:3eff:fe3e:4e7c | ostest-wgdc2-rhcos | worker | 2024-05-18 07:26:50.935 | | 5c34b78a-d876-49fb-a307-874d3c197c44 | ostest-wgdc2-infra-0-tfkf5 | ACTIVE | network-dualstack=192.168.192.133, fd2e:6f44:5dd8:c956:f816:3eff:fee6:4410, fd2e:6f44:5dd8:c956:f816:3eff:fef2:930a | ostest-wgdc2-rhcos | master | 2024-05-18 07:26:50.941 | | d2025444-8e11-409d-8a87-3f1082814af1 | ostest-wgdc2-infra-0-4sqdh | ACTIVE | network-dualstack=192.168.192.156, fd2e:6f44:5dd8:c956:f816:3eff:fe82:ae56, fd2e:6f44:5dd8:c956:f816:3eff:fe86:b6d1 | ostest-wgdc2-rhcos | master | 2024-05-18 07:26:50.947 | | dcbde9ac-da5a-44c8-b64f-049f10b6b50c | ostest-wgdc2-worker-0-t29fz | ACTIVE | StorageNFS=172.17.5.233; network-dualstack=192.168.192.13, fd2e:6f44:5dd8:c956:f816:3eff:fe94:a2d2 | ostest-wgdc2-rhcos | worker | 2024-05-18 07:26:50.951 | | 8ad98adf-147c-4268-920f-9eb5c43ab611 | ostest-wgdc2-worker-0-d5cdp | ACTIVE | StorageNFS=172.17.5.217; network-dualstack=192.168.192.173, fd2e:6f44:5dd8:c956:f816:3eff:fe22:5cff | ostest-wgdc2-rhcos | worker | 2024-05-18 07:26:50.957 | | f01d6740-2954-485d-865f-402b88789354 | ostest-wgdc2-master-2 | ACTIVE | StorageNFS=172.17.5.177; network-dualstack=192.168.192.198, fd2e:6f44:5dd8:c956:f816:3eff:fe1f:3c64 | ostest-wgdc2-rhcos | master | 2024-05-18 07:26:50.963 | | d215a70f-760d-41fb-8e30-9f3106dbaabe | ostest-wgdc2-master-1 | ACTIVE | StorageNFS=172.17.5.163; network-dualstack=192.168.192.152, fd2e:6f44:5dd8:c956:f816:3eff:fe4e:67b6 | ostest-wgdc2-rhcos | master | 2024-05-18 07:26:50.968 | | 53fe495b-f617-412d-9608-47cd355bc2e5 | ostest-wgdc2-master-0 | ACTIVE | StorageNFS=172.17.5.170; network-dualstack=192.168.192.193, fd2e:6f44:5dd8:c956:f816:3eff:febd:a836 | ostest-wgdc2-rhcos | master | 2024-05-18 07:26:50.975 | +--------------------------------------+-----------------------------+--------+---------------------------------------------------------------------------------------------------------------------+--------------------+--------+
Version-Release number of selected component (if applicable):
RHOS-17.1-RHEL-9-20240123.n.1 4.15.0-0.nightly-2024-05-16-091947
Additional info:
Must-gather link provided on private comment.
Please review the following PR: https://github.com/openshift/cluster-kube-storage-version-migrator-operator/pull/109
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
The tech preview jobs can sometimes fail: https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.16-e2e-vsphere-ovn-techpreview-serial/1787262709813743616
It seems early on the pinnedimageset controller can panic: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.16-e2e-vsphere-ovn-techpreview-serial/1787262709813743616/artifacts/e2e-vsphere-ovn-techpreview-serial/gather-extra/artifacts/pods/openshift-machine-config-operator_machine-config-controller-66559c9856-58g4w_machine-config-controller_previous.log
Although it is fine on future syncs: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.16-e2e-vsphere-ovn-techpreview-serial/1787262709813743616/artifacts/e2e-vsphere-ovn-techpreview-serial/gather-extra/artifacts/pods/openshift-machine-config-operator_machine-config-controller-66559c9856-58g4w_machine-config-controller.log
Version-Release number of selected component (if applicable):
4.16.0 techpreview only
How reproducible:
Unsure
Steps to Reproduce:
See CI
Actual results:
Expected results:
Don't panic
Additional info:
This is a clone of issue OCPBUGS-43280. The following is the description of the original issue:
—
Description of problem:
NTO CI started failing with:
• [FAILED] [247.873 seconds] [rfe_id:27363][performance] CPU Management Verification of cpu_manager_state file when kubelet is restart [It] [test_id: 73501] defaultCpuset should not change [tier-0] /go/src/github.com/openshift/cluster-node-tuning-operator/test/e2e/performanceprofile/functests/1_performance/cpu_management.go:309
[FAILED] Expected <cpuset.CPUSet>: { elems: {0: {}, 2: {}}, } to equal <cpuset.CPUSet>: { elems: {0: {}, 1: {}, 2: {}, 3: {}}, }
In [It] at: /go/src/github.com/openshift/cluster-node-tuning-operator/test/e2e/performanceprofile/functests/1_performance/cpu_management.go:332 @ 10/04/24 16:56:51.436
The failure happened because the test pod couldn't get admitted after the Kubelet restart. The admission failure happens at this line: https://github.com/openshift/kubernetes/blob/cec2232a4be561df0ba32d98f43556f1cad1db01/pkg/kubelet/cm/cpumanager/policy_static.go#L352. Something has changed with how Kubelet accounts for `availablePhysicalCPUs`.
Version-Release number of selected component (if applicable):
4.18 (started happening after OCP was rebased on top of k8s 1.31)
How reproducible:
Always
Steps to Reproduce:
1. Set up a system with 4 CPUs and apply performance-profile with single-numa-policy 2. Run pao-functests
Actual results:
Tests failing with:
• [FAILED] [247.873 seconds] [rfe_id:27363][performance] CPU Management Verification of cpu_manager_state file when kubelet is restart [It] [test_id: 73501] defaultCpuset should not change [tier-0] /go/src/github.com/openshift/cluster-node-tuning-operator/test/e2e/performanceprofile/functests/1_performance/cpu_management.go:309
[FAILED] Expected <cpuset.CPUSet>: { elems: {0: {}, 2: {}}, } to equal <cpuset.CPUSet>: { elems: {0: {}, 1: {}, 2: {}, 3: {}}, }
In [It] at: /go/src/github.com/openshift/cluster-node-tuning-operator/test/e2e/performanceprofile/functests/1_performance/cpu_management.go:332 @ 10/04/24 16:56:51.436
Expected results:
Tests should pass
Additional info:
NOTE: The issue occurs only on systems with a small number of CPUs (4 in our case).
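For reference, the comparison the failing assertion makes can be reproduced with the k8s.io/utils/cpuset helpers; a minimal sketch, with the CPU IDs taken from the failure output above and everything else illustrative rather than the functest's own code:

```go
package main

import (
	"fmt"

	"k8s.io/utils/cpuset"
)

func main() {
	// CPUs present in the defaultCpuSet after the kubelet restart (from the failure above).
	got := cpuset.New(0, 2)
	// CPUs the test expects the default set to contain on a 4-CPU node.
	want := cpuset.New(0, 1, 2, 3)

	if !got.Equals(want) {
		// CPUs 1 and 3 stayed assigned because the test pod was never re-admitted.
		fmt.Printf("defaultCpuSet changed: missing %v\n", want.Difference(got))
	}
}
```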
Please review the following PR: https://github.com/openshift/azure-workload-identity/pull/21
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
observed panic in kube-scheduler:
2024-05-29T07:53:40.874397450Z E0529 07:53:40.873820 1 runtime.go:79] Observed a panic: "integer divide by zero" (runtime error: integer divide by zero)
2024-05-29T07:53:40.874397450Z goroutine 2363 [running]:
2024-05-29T07:53:40.874397450Z k8s.io/apimachinery/pkg/util/runtime.logPanic({0x215e8a0, 0x3c7c150})
2024-05-29T07:53:40.874397450Z k8s.io/apimachinery/pkg/util/runtime/runtime.go:75 +0x85
2024-05-29T07:53:40.874397450Z k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0x0?})
2024-05-29T07:53:40.874397450Z k8s.io/apimachinery/pkg/util/runtime/runtime.go:49 +0x6b
2024-05-29T07:53:40.874397450Z panic({0x215e8a0?, 0x3c7c150?})
2024-05-29T07:53:40.874397450Z runtime/panic.go:770 +0x132
2024-05-29T07:53:40.874397450Z k8s.io/kubernetes/pkg/scheduler.(*Scheduler).findNodesThatFitPod(0xc0005b6900, {0x28f4618, 0xc002a97360}, {0x291d688, 0xc00039f688}, 0xc002ac1a00, 0xc0022fc488)
2024-05-29T07:53:40.874397450Z k8s.io/kubernetes/pkg/scheduler/schedule_one.go:505 +0xaf0
2024-05-29T07:53:40.874397450Z k8s.io/kubernetes/pkg/scheduler.(*Scheduler).schedulePod(0xc0005b6900, {0x28f4618, 0xc002a97360}, {0x291d688, 0xc00039f688}, 0xc002ac1a00, 0xc0022fc488)
2024-05-29T07:53:40.874397450Z k8s.io/kubernetes/pkg/scheduler/schedule_one.go:402 +0x31f
2024-05-29T07:53:40.874397450Z k8s.io/kubernetes/pkg/scheduler.(*Scheduler).schedulingCycle(0xc0005b6900, {0x28f4618, 0xc002a97360}, 0xc002ac1a00, {0x291d688, 0xc00039f688}, 0xc002a96370, {0xc18dd5a13410037e, 0x72c11612c5e, 0x3d515e0}, ...)
2024-05-29T07:53:40.874397450Z k8s.io/kubernetes/pkg/scheduler/schedule_one.go:149 +0x115
2024-05-29T07:53:40.874397450Z k8s.io/kubernetes/pkg/scheduler.(*Scheduler).ScheduleOne(0xc0005b6900, {0x28f4618, 0xc000df7ea0})
2024-05-29T07:53:40.874397450Z k8s.io/kubernetes/pkg/scheduler/schedule_one.go:111 +0x698
2024-05-29T07:53:40.874397450Z k8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext.func1()
2024-05-29T07:53:40.874397450Z k8s.io/apimachinery/pkg/util/wait/backoff.go:259 +0x1f
2024-05-29T07:53:40.874397450Z k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0xc00214bee0?)
2024-05-29T07:53:40.874397450Z k8s.io/apimachinery/pkg/util/wait/backoff.go:226 +0x33
2024-05-29T07:53:40.874397450Z k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc00214bf70, {0x28cfa20, 0xc00169e6c0}, 0x1, 0xc000cfd9e0)
2024-05-29T07:53:40.874397450Z k8s.io/apimachinery/pkg/util/wait/backoff.go:227 +0xaf
2024-05-29T07:53:40.874397450Z k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc001256f70, 0x0, 0x0, 0x1, 0xc000cfd9e0)
2024-05-29T07:53:40.874397450Z k8s.io/apimachinery/pkg/util/wait/backoff.go:204 +0x7f
2024-05-29T07:53:40.874397450Z k8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext({0x28f4618, 0xc000df7ea0}, 0xc0009be200, 0x0, 0x0, 0x1)
2024-05-29T07:53:40.874397450Z k8s.io/apimachinery/pkg/util/wait/backoff.go:259 +0x93
2024-05-29T07:53:40.874397450Z k8s.io/apimachinery/pkg/util/wait.UntilWithContext(...)
2024-05-29T07:53:40.874397450Z k8s.io/apimachinery/pkg/util/wait/backoff.go:170
2024-05-29T07:53:40.874397450Z created by k8s.io/kubernetes/pkg/scheduler.(*Scheduler).Run in goroutine 2386
2024-05-29T07:53:40.874397450Z k8s.io/kubernetes/pkg/scheduler/scheduler.go:445 +0x119
2024-05-29T07:53:40.876894723Z panic: runtime error: integer divide by zero [recovered]
2024-05-29T07:53:40.876894723Z panic: runtime error: integer divide by zero
2024-05-29T07:53:40.876894723Z
2024-05-29T07:53:40.876894723Z goroutine 2363 [running]:
2024-05-29T07:53:40.876894723Z k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0x0?})
2024-05-29T07:53:40.876894723Z k8s.io/apimachinery/pkg/util/runtime/runtime.go:56 +0xcd
2024-05-29T07:53:40.876929875Z panic({0x215e8a0?, 0x3c7c150?})
2024-05-29T07:53:40.876929875Z runtime/panic.go:770 +0x132
2024-05-29T07:53:40.876929875Z k8s.io/kubernetes/pkg/scheduler.(*Scheduler).findNodesThatFitPod(0xc0005b6900, {0x28f4618, 0xc002a97360}, {0x291d688, 0xc00039f688}, 0xc002ac1a00, 0xc0022fc488)
2024-05-29T07:53:40.876943106Z k8s.io/kubernetes/pkg/scheduler/schedule_one.go:505 +0xaf0
2024-05-29T07:53:40.876953277Z k8s.io/kubernetes/pkg/scheduler.(*Scheduler).schedulePod(0xc0005b6900, {0x28f4618, 0xc002a97360}, {0x291d688, 0xc00039f688}, 0xc002ac1a00, 0xc0022fc488)
2024-05-29T07:53:40.876962958Z k8s.io/kubernetes/pkg/scheduler/schedule_one.go:402 +0x31f
2024-05-29T07:53:40.876973018Z k8s.io/kubernetes/pkg/scheduler.(*Scheduler).schedulingCycle(0xc0005b6900, {0x28f4618, 0xc002a97360}, 0xc002ac1a00, {0x291d688, 0xc00039f688}, 0xc002a96370, {0xc18dd5a13410037e, 0x72c11612c5e, 0x3d515e0}, ...)
2024-05-29T07:53:40.877000640Z k8s.io/kubernetes/pkg/scheduler/schedule_one.go:149 +0x115
2024-05-29T07:53:40.877000640Z k8s.io/kubernetes/pkg/scheduler.(*Scheduler).ScheduleOne(0xc0005b6900, {0x28f4618, 0xc000df7ea0})
2024-05-29T07:53:40.877011311Z k8s.io/kubernetes/pkg/scheduler/schedule_one.go:111 +0x698
2024-05-29T07:53:40.877028792Z k8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext.func1()
2024-05-29T07:53:40.877028792Z k8s.io/apimachinery/pkg/util/wait/backoff.go:259 +0x1f
2024-05-29T07:53:40.877028792Z k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0xc00214bee0?)
2024-05-29T07:53:40.877028792Z k8s.io/apimachinery/pkg/util/wait/backoff.go:226 +0x33
2024-05-29T07:53:40.877049294Z k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc00214bf70, {0x28cfa20, 0xc00169e6c0}, 0x1, 0xc000cfd9e0)
2024-05-29T07:53:40.877058805Z k8s.io/apimachinery/pkg/util/wait/backoff.go:227 +0xaf
2024-05-29T07:53:40.877068225Z k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc001256f70, 0x0, 0x0, 0x1, 0xc000cfd9e0)
2024-05-29T07:53:40.877088457Z k8s.io/apimachinery/pkg/util/wait/backoff.go:204 +0x7f
2024-05-29T07:53:40.877088457Z k8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext({0x28f4618, 0xc000df7ea0}, 0xc0009be200, 0x0, 0x0, 0x1)
2024-05-29T07:53:40.877099448Z k8s.io/apimachinery/pkg/util/wait/backoff.go:259 +0x93
2024-05-29T07:53:40.877099448Z k8s.io/apimachinery/pkg/util/wait.UntilWithContext(...)
2024-05-29T07:53:40.877109888Z k8s.io/apimachinery/pkg/util/wait/backoff.go:170
2024-05-29T07:53:40.877109888Z created by k8s.io/kubernetes/pkg/scheduler.(*Scheduler).Run in goroutine 2386
2024-05-29T07:53:40.877119479Z k8s.io/kubernetes/pkg/scheduler/scheduler.go:445 +0x119
Version-Release number of selected component (if applicable):
4.17
How reproducible:
there are a lot of instances; see https://search.dptools.openshift.org/?search=runtime+error%3A+integer+divide+by+zero&maxAge=24h&context=1&type=all&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job
$ podman run -it corbinu/alpine-w3m -dump -cols 200 "https://search.dptools.openshift.org/?search=runtime+error%3A+integer+divide+by+zero&maxAge=24h&context=1&type=all&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job" | grep 'failures match' | sort
openshift-origin-28839-ci-4.16-e2e-azure-ovn-techpreview-serial (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-multiarch-master-nightly-4.15-ocp-e2e-aws-ovn-arm64-techpreview-serial (all) - 3 runs, 33% failed, 100% of failures match = 33% impact
periodic-ci-openshift-release-master-ci-4.14-e2e-aws-sdn-techpreview-serial (all) - 3 runs, 33% failed, 300% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.14-e2e-azure-sdn-techpreview-serial (all) - 3 runs, 33% failed, 200% of failures match = 67% impact
periodic-ci-openshift-release-master-ci-4.14-e2e-gcp-sdn-techpreview-serial (all) - 3 runs, 100% failed, 67% of failures match = 67% impact
periodic-ci-openshift-release-master-ci-4.17-e2e-aws-ovn-techpreview-serial (all) - 2 runs, 100% failed, 50% of failures match = 50% impact
periodic-ci-openshift-release-master-ci-4.17-e2e-azure-ovn-techpreview-serial (all) - 2 runs, 100% failed, 50% of failures match = 50% impact
periodic-ci-openshift-release-master-ci-4.17-e2e-gcp-ovn-techpreview-serial (all) - 3 runs, 100% failed, 67% of failures match = 67% impact
periodic-ci-openshift-release-master-nightly-4.16-e2e-vsphere-ovn-techpreview-serial (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-nightly-4.17-e2e-vsphere-ovn-techpreview-serial (all) - 2 runs, 100% failed, 50% of failures match = 50% impact
periodic-ci-openshift-release-master-nightly-4.17-fips-payload-scan (all) - 3 runs, 100% failed, 33% of failures match = 33% impact
pull-ci-openshift-api-master-e2e-aws-serial-techpreview (all) - 8 runs, 100% failed, 50% of failures match = 50% impact
pull-ci-openshift-hypershift-main-e2e-kubevirt-azure-ovn (all) - 27 runs, 70% failed, 5% of failures match = 4% impact
pull-ci-openshift-installer-master-e2e-openstack-dualstack-upi (all) - 6 runs, 83% failed, 20% of failures match = 17% impact
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
See https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.17-e2e-aws-ovn-techpreview-serial/1795684524709908480. You need to pull the must-gather, and you will find the panic in the openshift-kube-scheduler pod.
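For context on the failure mode (not the upstream fix): the panic comes from integer division during node sampling in findNodesThatFitPod. A minimal Go sketch of that kind of percentage-based sampling with a guard for an empty node list; this is illustrative and assumes nothing about the real scheduler code beyond the division itself:

```go
package main

import "fmt"

// sampleNodes mirrors the kind of arithmetic a scheduling cycle does when it
// decides how many feasible nodes to look for; the guard on an empty node
// list is the point of the sketch (this is not the upstream scheduler code).
func sampleNodes(nodeNames []string, percentage int) ([]string, error) {
	numAllNodes := len(nodeNames)
	if numAllNodes == 0 {
		// Without an early return, expressions such as
		// "nextStartIndex % numAllNodes" panic with
		// "runtime error: integer divide by zero".
		return nil, fmt.Errorf("no nodes available to sample")
	}
	numToFind := numAllNodes * percentage / 100
	if numToFind < 1 {
		numToFind = 1
	}
	// Round-robin start offset; the modulo is safe only because of the guard above.
	start := 42 % numAllNodes
	out := make([]string, 0, numToFind)
	for i := 0; i < numToFind; i++ {
		out = append(out, nodeNames[(start+i)%numAllNodes])
	}
	return out, nil
}

func main() {
	fmt.Println(sampleNodes(nil, 50))                                    // error, no panic
	fmt.Println(sampleNodes([]string{"node-a", "node-b", "node-c"}, 50)) // one node, chosen round-robin
}
```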
Description of problem:
We have an egressfirewall set in our build farm build09. Once a node is deleted, all ovnkube-node-* pods crash immediately.
Version-Release number of selected component (if applicable):
4.16.0-rc3
How reproducible:
Steps to Reproduce:
1. create an egressfirewall object in any namespace
2. delete an node on the cluster
3.
Actual results:
All ovnkube-node-* pods crash
Expected results:
Nothing shall happen
Additional info:
https://redhat-internal.slack.com/archives/CDCP2LA9L/p1718210291108709
This is a clone of issue OCPBUGS-44099. The following is the description of the original issue:
—
Description of problem:
OCPBUGS-42772 is verified. But testing found oauth-server panic with OAuth2.0 idp names that contain whitespaces
Version-Release number of selected component (if applicable):
4.18.0-0.nightly-2024-10-31-190119
How reproducible:
Always
Steps to Reproduce:
1. Set up Google IDP with below:
$ oc create secret generic google-secret-1 --from-literal=clientSecret=xxxxxxxx -n openshift-config
$ oc edit oauth cluster
spec:
  identityProviders:
  - google:
      clientID: 9745..snipped..apps.googleusercontent.com
      clientSecret:
        name: google-secret-1
      hostedDomain: redhat.com
    mappingMethod: claim
    name: 'my Google idp'
    type: Google
...
Actual results:
oauth-server panic:
$ oc get po -n openshift-authentication
NAME READY STATUS RESTARTS
oauth-openshift-59545c6f5-dwr6s 0/1 CrashLoopBackOff 11 (4m10s ago)
...
$ oc logs -p -n openshift-authentication oauth-openshift-59545c6f5-dwr6s
Copying system trust bundle
I1101 03:40:09.883698 1 dynamic_serving_content.go:113] "Loaded a new cert/key pair" name="serving-cert::/var/config/system/secrets/v4-0-config-system-serving-cert/tls.crt::/var/config/system/secrets/v4-0-config-system-serving-cert/tls.key"
I1101 03:40:09.884046 1 dynamic_serving_content.go:113] "Loaded a new cert/key pair" name="sni-serving-cert::/var/config/system/secrets/v4-0-config-system-router-certs/apps.hongli-az.qe.azure.devcluster.openshift.com::/var/config/system/secrets/v4-0-config-system-router-certs/apps.hongli-az.qe.azure.devcluster.openshift.com"
I1101 03:40:10.335739 1 audit.go:340] Using audit backend: ignoreErrors<log>
I1101 03:40:10.347632 1 requestheader_controller.go:244] Loaded a new request header values for RequestHeaderAuthRequestController
panic: parsing "/oauth2callback/my Google idp": at offset 0: invalid method "/oauth2callback/my"
goroutine 1 [running]:
net/http.(*ServeMux).register(...)
net/http/server.go:2738
net/http.(*ServeMux).Handle(0x29844c0?, {0xc0008886a0?, 0x2984420?}, {0x2987fc0?, 0xc0006ff4a0?})
net/http/server.go:2701 +0x56
github.com/openshift/oauth-server/pkg/oauthserver.(*OAuthServerConfig).getAuthenticationHandler(0xc0006c28c0, {0x298f618, 0xc0008a4d00}, {0x2984540, 0xc000171450})
github.com/openshift/oauth-server/pkg/oauthserver/auth.go:407 +0x11ad
github.com/openshift/oauth-server/pkg/oauthserver.(*OAuthServerConfig).getAuthorizeAuthenticationHandlers(0xc0006c28c0, {0x298f618, 0xc0008a4d00}, {0x2984540, 0xc000171450})
github.com/openshift/oauth-server/pkg/oauthserver/auth.go:243 +0x65
github.com/openshift/oauth-server/pkg/oauthserver.(*OAuthServerConfig).WithOAuth(0xc0006c28c0, {0x2982500, 0xc0000aca80})
github.com/openshift/oauth-server/pkg/oauthserver/auth.go:108 +0x21d
github.com/openshift/oauth-server/pkg/oauthserver.(*OAuthServerConfig).buildHandlerChainForOAuth(0xc0006c28c0, {0x2982500?, 0xc0000aca80?}, 0xc000785888)
github.com/openshift/oauth-server/pkg/oauthserver/oauth_apiserver.go:342 +0x45
k8s.io/apiserver/pkg/server.completedConfig.New.func1({0x2982500?, 0xc0000aca80?})
k8s.io/apiserver@v0.29.2/pkg/server/config.go:825 +0x28
k8s.io/apiserver/pkg/server.NewAPIServerHandler({0x252ca0a, 0xf}, {0x2996020, 0xc000501a00}, 0xc0005d1740, {0x0, 0x0})
k8s.io/apiserver@v0.29.2/pkg/server/handler.go:96 +0x2ad
k8s.io/apiserver/pkg/server.completedConfig.New({0xc000785888?, {0x0?, 0x0?}}, {0x252ca0a, 0xf}, {0x29b41a0, 0xc000171370})
k8s.io/apiserver@v0.29.2/pkg/server/config.go:833 +0x2a5
github.com/openshift/oauth-server/pkg/oauthserver.completedOAuthConfig.New({{0xc0005add40?}, 0xc0006c28c8?}, {0x29b41a0?, 0xc000171370?})
github.com/openshift/oauth-server/pkg/oauthserver/oauth_apiserver.go:322 +0x6a
github.com/openshift/oauth-server/pkg/cmd/oauth-server.RunOsinServer(0xc000451cc0?, 0xc000810000?, 0xc00061a5a0)
github.com/openshift/oauth-server/pkg/cmd/oauth-server/server.go:45 +0x73
github.com/openshift/oauth-server/pkg/cmd/oauth-server.(*OsinServerOptions).RunOsinServer(0xc00030e168, 0xc00061a5a0)
github.com/openshift/oauth-server/pkg/cmd/oauth-server/cmd.go:108 +0x259
github.com/openshift/oauth-server/pkg/cmd/oauth-server.NewOsinServerCommand.func1(0xc00061c300?, {0x251a8c8?, 0x4?, 0x251a8cc?})
github.com/openshift/oauth-server/pkg/cmd/oauth-server/cmd.go:46 +0xed
github.com/spf13/cobra.(*Command).execute(0xc000780008, {0xc00058d6c0, 0x7, 0x7})
github.com/spf13/cobra@v1.7.0/command.go:944 +0x867
github.com/spf13/cobra.(*Command).ExecuteC(0xc0001a3b08)
github.com/spf13/cobra@v1.7.0/command.go:1068 +0x3a5
github.com/spf13/cobra.(*Command).Execute(...)
github.com/spf13/cobra@v1.7.0/command.go:992
k8s.io/component-base/cli.run(0xc0001a3b08)
k8s.io/component-base@v0.29.2/cli/run.go:146 +0x290
k8s.io/component-base/cli.Run(0xc00061a5a0?)
k8s.io/component-base@v0.29.2/cli/run.go:46 +0x17
main.main()
github.com/openshift/oauth-server/cmd/oauth-server/main.go:46 +0x2de
Expected results:
No panic
Additional info:
Tried in old env like 4.16.20 with same steps, no panic:
$ oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.16.20 True False 95m Cluster version is 4.16.20
$ oc get po -n openshift-authentication
NAME READY STATUS RESTARTS AGE
oauth-openshift-7dfcd8c8fd-77ltf 1/1 Running 0 116s
oauth-openshift-7dfcd8c8fd-sr97w 1/1 Running 0 89s
oauth-openshift-7dfcd8c8fd-tsrff 1/1 Running 0 62s
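The panic itself is a Go 1.22+ net/http behaviour: ServeMux patterns are parsed as "[METHOD ][HOST]/[PATH]", so a registration pattern containing a space is treated as an invalid method and Handle panics. A minimal sketch of the hazard and one way to avoid it (percent-escaping the IDP name before building the pattern); whether this matches the eventual oauth-server fix is not claimed:

```go
package main

import (
	"fmt"
	"net/http"
	"net/url"
)

func main() {
	mux := http.NewServeMux()
	idpName := "my Google idp" // example IDP name containing whitespace

	// Percent-encoding the name removes the literal space that Go 1.22+
	// ServeMux would otherwise try to parse as an HTTP method in the pattern.
	pattern := "/oauth2callback/" + url.PathEscape(idpName)
	mux.Handle(pattern, http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintln(w, "callback for", idpName)
	}))

	// Registering the raw name instead would panic on Go 1.22+:
	//   mux.Handle("/oauth2callback/"+idpName, handler)
	//   panic: parsing "/oauth2callback/my Google idp": at offset 0: invalid method ...
	fmt.Println("registered", pattern)
}
```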
This is a clone of issue OCPBUGS-38114. The following is the description of the original issue:
—
Description of problem:
Starting from version 4.16, the installer no longer supports creating a cluster in AWS with the OPENSHIFT_INSTALL_AWS_PUBLIC_ONLY=true flag enabled.
Version-Release number of selected component (if applicable):
How reproducible:
The installation procedure fails consistently when using a predefined VPC.
Steps to Reproduce:
1. Follow the procedure at https://docs.openshift.com/container-platform/4.16/installing/installing_aws/ipi/installing-aws-vpc.html#installation-aws-config-yaml_installing-aws-vpc to prepare an install-config.yaml in order to install a cluster with a custom VPC 2. Run `openshift-install create cluster ...` 3. The procedure fails: `failed to create load balancer`
Actual results:
The installation procedure fails.
Expected results:
An OCP cluster to be provisioned in AWS, with public subnets only.
Additional info:
ControlPlaneReleaseProvider is modifying the cached release image directly, which means the userReleaseProvider is still picking up and using the registry overrides for data-plane components.
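A minimal sketch of the aliasing problem and the obvious remedy (deep-copy before mutating); the types and function names here are hypothetical stand-ins, not the hypershift code:

```go
package main

import "fmt"

// ReleaseImage is a stand-in for the cached release image object; the real
// hypershift types differ, this only illustrates the shared-cache mutation.
type ReleaseImage struct {
	ComponentImages map[string]string
}

func (r *ReleaseImage) DeepCopy() *ReleaseImage {
	out := &ReleaseImage{ComponentImages: make(map[string]string, len(r.ComponentImages))}
	for k, v := range r.ComponentImages {
		out.ComponentImages[k] = v
	}
	return out
}

// applyRegistryOverrides works on a copy so the shared cache entry, which the
// user (data-plane) release provider also reads, is never modified in place.
func applyRegistryOverrides(cached *ReleaseImage, overrides map[string]string) *ReleaseImage {
	img := cached.DeepCopy()
	for name, pullspec := range overrides {
		img.ComponentImages[name] = pullspec
	}
	return img
}

func main() {
	cache := &ReleaseImage{ComponentImages: map[string]string{"etcd": "quay.io/openshift/etcd:orig"}}
	_ = applyRegistryOverrides(cache, map[string]string{"etcd": "registry.internal/etcd:override"})
	fmt.Println(cache.ComponentImages["etcd"]) // still "quay.io/openshift/etcd:orig"
}
```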
This is a clone of issue OCPBUGS-18007. The following is the description of the original issue:
—
Description of problem:
When the TelemeterClientFailures alert fires, there's no runbook link explaining the meaning of the alert and what to do about it.
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. Check the TelemeterClientFailures alerting rule's annotations 2. 3.
Actual results:
No runbook_url annotation.
Expected results:
runbook_url annotation is present.
Additional info:
This is a consequence of a telemeter server outage that triggered questions from customers about the alert: https://issues.redhat.com/browse/OHSS-25947 https://issues.redhat.com/browse/OCPBUGS-17966 Also in relation to https://issues.redhat.com/browse/OCPBUGS-17797
Description of the problem:
When running the infrastructure operator, the local cluster is not being imported in ACM as expected.
How reproducible:
Run the infrastructure operator in ACM
Steps to reproduce:
1. Install ACM
Actual results:
Local cluster entities are not created
Expected results:
Local cluster entities should be created
Description of problem:
A ServiceAccount is not deleted due to a race condition in the controller manager. When deleting the SA, this is logged in the controller manager:
2024-06-17T15:57:47.793991942Z I0617 15:57:47.793942 1 image_pull_secret_controller.go:233] "Internal registry pull secret auth data does not contain the correct number of entries" ns="test-qtreoisu" name="sink-eguqqiwm-dockercfg-vh8mw" expected=3 actual=0
2024-06-17T15:57:47.794120755Z I0617 15:57:47.794080 1 image_pull_secret_controller.go:163] "Refreshing image pull secret" ns="test-qtreoisu" name="sink-eguqqiwm-dockercfg-vh8mw" serviceaccount="sink-eguqqiwm"
As a result, the Secret is updated and the ServiceAccount owning the Secret is updated by the controller via server-side apply operation as can be seen in the managedFields:
{ "apiVersion":"v1", "imagePullSecrets":[ { "name":"default-dockercfg-vdck9" }, { "name":"kn-test-image-pull-secret" }, { "name":"sink-eguqqiwm-dockercfg-vh8mw" } ], "kind":"ServiceAccount", "metadata":{ "annotations":{ "openshift.io/internal-registry-pull-secret-ref":"sink-eguqqiwm-dockercfg-vh8mw" }, "creationTimestamp":"2024-06-17T15:57:47Z", "managedFields":[ { "apiVersion":"v1", "fieldsType":"FieldsV1", "fieldsV1":{ "f:imagePullSecrets":{ }, "f:metadata":{ "f:annotations":{ "f:openshift.io/internal-registry-pull-secret-ref":{ } } }, "f:secrets":{ "k:{\"name\":\"sink-eguqqiwm-dockercfg-vh8mw\"}":{ } } }, "manager":"openshift.io/image-registry-pull-secrets_service-account-controller", "operation":"Apply", "time":"2024-06-17T15:57:47Z" } ], "name":"sink-eguqqiwm", "namespace":"test-qtreoisu", "resourceVersion":"104739", "uid":"eaae8d0e-8714-4c2e-9d20-c0c1a221eecc" }, "secrets":[ { "name":"sink-eguqqiwm-dockercfg-vh8mw" } ] }"Events":{ "metadata":{ }, "items":null }
The ServiceAccount then hangs there and is NOT deleted.
We have seen this only on OCP 4.16 (not on older versions), but already several times, for example in this CI run, which also has must-gather logs that can be investigated.
Another run is here
The controller code is new in 4.16 and it seems to be a regression.
Version-Release number of selected component (if applicable):
4.16.0-0.nightly-2024-06-14-130320
How reproducible:
It happens sometimes in our CI runs where we want to delete a ServiceAccount but it hangs there. The test doesn't try to delete it again; it tries only once.
Steps to Reproduce:
The following reproducer works for me. Some service accounts keep hanging there after running the script:
#!/usr/bin/env bash
kubectl create namespace test
for i in `seq 100`; do
  (
    kubectl create sa "my-sa-${i}" -n test
    kubectl wait --for=jsonpath="{.metadata.annotations.openshift\\.io/internal-registry-pull-secret-ref}" sa/my-sa-${i}
    kubectl delete sa/my-sa-${i}
    kubectl wait --for=delete sa/my-sa-${i} --timeout=60s
  )&
done
wait
Actual results:
ServiceAccount not deleted
Expected results:
ServiceAccount deleted
Additional info:
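A minimal sketch of one way to narrow the race window, assuming the controller sees the ServiceAccount object before applying: skip the pull-secret refresh once deletion has started. This is illustrative only; the actual fix may differ:

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// shouldRefreshPullSecret is a hypothetical guard for a controller that
// server-side-applies pull-secret references onto ServiceAccounts: skip the
// apply once deletion has started so the SSA patch cannot race the delete.
func shouldRefreshPullSecret(sa *corev1.ServiceAccount) bool {
	return sa.DeletionTimestamp == nil
}

func main() {
	now := metav1.Now()
	doomed := &corev1.ServiceAccount{ObjectMeta: metav1.ObjectMeta{Name: "sink", DeletionTimestamp: &now}}
	fmt.Println(shouldRefreshPullSecret(doomed)) // false
}
```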
Monitor test for nodes should fail when nodes go ready=false unexpectedly.
Monitor test for nodes should fail when the unreachable taint is placed on them.
Getting this into release-4.17.
This is a clone of issue OCPBUGS-36479. The following is the description of the original issue:
—
Description of problem:
As part of https://issues.redhat.com/browse/CFE-811, we added a featuregate "RouteExternalCertificate" to release the feature as TP, and all the code implementations were behind this gate. However, it seems https://github.com/openshift/api/pull/1731 inadvertently duplicated "ExternalRouteCertificate" as "RouteExternalCertificate".
Version-Release number of selected component (if applicable):
4.16
How reproducible:
100%
Steps to Reproduce:
$ oc get featuregates.config.openshift.io cluster -oyaml
<......>
spec:
  featureSet: TechPreviewNoUpgrade
status:
  featureGates:
    enabled:
    - name: ExternalRouteCertificate
    - name: RouteExternalCertificate
<......>
Actual results:
Both RouteExternalCertificate and ExternalRouteCertificate were added in the API
Expected results:
We should have only one featuregate "RouteExternalCertificate" and the same should be displayed in https://docs.openshift.com/container-platform/4.16/nodes/clusters/nodes-cluster-enabling-features.html
Additional info:
Git commits:
https://github.com/openshift/api/commit/11f491c2c64c3f47cea6c12cc58611301bac10b3
https://github.com/openshift/api/commit/ff31f9c1a0e4553cb63c3e530e46a3e8d2e30930
Slack thread: https://redhat-internal.slack.com/archives/C06EK9ZH3Q8/p1719867937186219
This is a clone of issue OCPBUGS-38409. The following is the description of the original issue:
—
Update our CPO and HO dockerfiles to use appropriate base image versions.
Description of problem:
Gathering bootstrap log bundles has been failing in CI with:
level=error msg=Attempted to gather debug logs after installation failure: must provide bootstrap host address
Version-Release number of selected component (if applicable):
How reproducible:
Not reproducible; this is a race condition when serializing the machine manifests to disk.
Steps to Reproduce:
Can't reproduce; needs to be verified in CI.
Actual results:
can't pull bootstrap log bundle
Expected results:
grabs bootstrap log bundle
Additional info:
This is a clone of issue OCPBUGS-41617. The following is the description of the original issue:
—
TRT has detected a consistent long term trend where the oauth-apiserver appears to have more disruption than it did in 4.16, for minor upgrades on azure.
The problem appears roughly above the 90th percentile; we picked it up at P95, where it shows a consistent 5-8s more disruption than we'd expect given the data at 4.16 GA.
The problem hits ONLY oauth, affecting both new and reused connections, as well as the cached variants meaning etcd should be out of the picture. You'll see a few very short blips where all four of these backends lose connectivity for ~1s throughout the run, several times over. It looks like it may be correlated to the oauth-operator reporting:
source/OperatorAvailable display/true condition/Available reason/APIServices_Error status/False APIServicesAvailable: apiservices.apiregistration.k8s.io/v1.user.openshift.io: not available: endpoints for service/api in "openshift-oauth-apiserver" have no addresses with port name "https" [2s]
Sample jobs:
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-azure-ovn-upgrade/1827669509775822848
Intervals: https://sippy.dptools.openshift.org/sippy-ng/job_runs/1827669509775822848/periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-azure-ovn-upgrade/intervals?filterText=&intervalFile=e2e-timelines_spyglass_20240825-122854.json&overrideDisplayFlag=1&selectedSources=OperatorAvailable&selectedSources=OperatorDegraded&selectedSources=KubeletLog&selectedSources=EtcdLog&selectedSources=EtcdLeadership&selectedSources=Alert&selectedSources=Disruption&selectedSources=E2EFailed&selectedSources=APIServerGracefulShutdown&selectedSources=KubeEvent&selectedSources=NodeState
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-azure-ovn-upgrade/1827669493837467648
Intervals: https://sippy.dptools.openshift.org/sippy-ng/job_runs/1827669493837467648/periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-azure-ovn-upgrade/intervals?filterText=&intervalFile=e2e-timelines_spyglass_20240825-122623.json&overrideDisplayFlag=1&selectedSources=OperatorAvailable&selectedSources=KubeletLog&selectedSources=EtcdLog&selectedSources=EtcdLeadership&selectedSources=Alert&selectedSources=Disruption&selectedSources=E2EFailed&selectedSources=APIServerGracefulShutdown&selectedSources=KubeEvent
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-azure-ovn-upgrade/1827669493837467648
Intervals: https://sippy.dptools.openshift.org/sippy-ng/job_runs/1827669493837467648/periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-azure-ovn-upgrade/intervals?filterText=&intervalFile=e2e-timelines_spyglass_20240825-122623.json&overrideDisplayFlag=1&selectedSources=OperatorAvailable&selectedSources=OperatorProgressing&selectedSources=OperatorDegraded&selectedSources=KubeletLog&selectedSources=EtcdLog&selectedSources=EtcdLeadership&selectedSources=Alert&selectedSources=Disruption&selectedSources=E2EFailed&selectedSources=APIServerGracefulShutdown&selectedSources=KubeEvent&selectedSources=NodeState
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-azure-ovn-upgrade/1827077182283845632
Intervals: https://sippy.dptools.openshift.org/sippy-ng/job_runs/1827077182283845632/periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-azure-ovn-upgrade/intervals?filterText=&intervalFile=e2e-timelines_spyglass_20240823-212127.json&overrideDisplayFlag=1&selectedSources=OperatorDegraded&selectedSources=EtcdLog&selectedSources=Disruption&selectedSources=E2EFailed
More can be found using the first link to the dashboard in this post and scrolling down to most recent job runs, and looking for high numbers.
The operator going Degraded is probably the strongest symptom to pursue, as it appears in most of the above.
If you find any runs where other backends are disrupted, especially kube-api, I would suggest ignoring those as they are unlikely to be the same fingerprint as the error being described here.
Description of problem:
Checked in 4.17.0-0.nightly-2024-09-18-003538: the default thanos-ruler retention time is 24h, not the 15d mentioned in https://github.com/openshift/cluster-monitoring-operator/blob/release-4.17/Documentation/api.md#thanosrulerconfig. The issue exists in 4.12+.
$ for i in $(oc -n openshift-user-workload-monitoring get sts --no-headers | awk '{print $1}'); do echo $i; oc -n openshift-user-workload-monitoring get sts $i -oyaml | grep retention; echo -e "\n"; done
prometheus-user-workload
- --storage.tsdb.retention.time=24h
thanos-ruler-user-workload
- --tsdb.retention=24h
Version-Release number of selected component (if applicable):
4.17.0-0.nightly-2024-09-18-003538
How reproducible:
always
Steps to Reproduce:
1. see the description
Actual results:
The default thanos-ruler retention time is documented as 15d in api.md.
Expected results:
api.md should document the actual default, 24h.
Additional info:
This is a clone of issue OCPBUGS-38558. The following is the description of the original issue:
—
Description of problem:
Remove the extra '.' from the INFO message below, printed when running the add-nodes workflow:
INFO[2024-08-15T12:48:45-04:00] Generated ISO at ../day2-worker-4/node.x86_64.iso.. The ISO is valid up to 2024-08-15T16:48:00Z
The INFO message will be visible inside the container which runs the node joiner, if using the oc adm command.
Version-Release number of selected component (if applicable):
How reproducible:
100%
Steps to Reproduce:
1. Run oc adm node-image create command to create a node iso 2. See the INFO message at the end 3.
Actual results:
INFO[2024-08-15T12:48:45-04:00] Generated ISO at ../day2-worker-4/node.x86_64.iso.. The ISO is valid up to 2024-08-15T16:48:00Z
Expected results:
INFO[2024-08-15T12:48:45-04:00] Generated ISO at ../day2-worker-4/node.x86_64.iso. The ISO is valid up to 2024-08-15T16:48:00Z
Additional info:
Please review the following PR: https://github.com/openshift/cluster-ingress-operator/pull/1059
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
During a security audit, questions were raised about why a number of our containers run privileged. The short answer is that they are doing things that require more permissions than a regular container, but what is not clear is whether we could accomplish the same thing by adding individual capabilities. If it is not necessary to run them fully privileged then we should stop doing that. If it is necessary for some reason we'll need to document why the container must be privileged.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Rename Dockerfile.rhel7 to Dockerfile.ocp, a better-fitting name since the contents are actually RHEL 9.
This is a clone of issue OCPBUGS-39111. The following is the description of the original issue:
—
Gather the nodenetworkconfigurationpolicy.nmstate.io/v1 and nodenetworkstate.nmstate.io/v1beta1 cluster-scoped resources in the Insights data. These CRs are introduced by the NMState operator.
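A minimal sketch of listing these cluster-scoped CRs with a dynamic client; the GVRs are assumptions based on the API versions named above, and the Insights operator's real gatherer interface is not modelled here:

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := dynamic.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// Cluster-scoped NMState resources to collect; the plural resource names
	// are assumptions derived from the CR kinds mentioned above.
	gvrs := []schema.GroupVersionResource{
		{Group: "nmstate.io", Version: "v1", Resource: "nodenetworkconfigurationpolicies"},
		{Group: "nmstate.io", Version: "v1beta1", Resource: "nodenetworkstates"},
	}
	for _, gvr := range gvrs {
		list, err := client.Resource(gvr).List(context.TODO(), metav1.ListOptions{})
		if err != nil {
			fmt.Printf("skipping %s: %v\n", gvr, err) // e.g. CRD not installed
			continue
		}
		for _, item := range list.Items {
			fmt.Printf("would gather %s/%s\n", gvr.Resource, item.GetName())
		}
	}
}
```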
Please review the following PR: https://github.com/openshift/console/pull/13886
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
Azure HostedClusters are failing in OCP 4.17 due to issues with the cluster-storage-operator.
- lastTransitionTime: "2024-05-29T19:58:39Z" message: 'Unable to apply 4.17.0-0.nightly-multi-2024-05-29-121923: the cluster operator storage is not available' observedGeneration: 2 reason: ClusterOperatorNotAvailable status: "True" type: ClusterVersionProgressing
I0529 20:05:21.547544 1 status_controller.go:218] clusteroperator/storage diff {"status":{"conditions":[{"lastTransitionTime":"2024-05-29T20:02:00Z","message":"AzureDiskCSIDriverOperatorCRDegraded: AzureDiskDriverGuestStaticResourcesControllerDegraded: \"node_service.yaml\" (string): namespaces \"clusters-test-case4\" not found\nAzureDiskCSIDriverOperatorCRDegraded: AzureDiskDriverGuestStaticResourcesControllerDegraded: ","reason":"AzureDiskCSIDriverOperatorCR_AzureDiskDriverGuestStaticResourcesController_SyncError","status":"True","type":"Degraded"},{"lastTransitionTime":"2024-05-29T20:04:15Z","message":"AzureDiskCSIDriverOperatorCRProgressing: AzureDiskDriverNodeServiceControllerProgressing: Waiting for DaemonSet to deploy node pods","reason":"AzureDiskCSIDriverOperatorCR_AzureDiskDriverNodeServiceController_Deploying","status":"True","type":"Progressing"},{"lastTransitionTime":"2024-05-29T19:59:00Z","message":"AzureDiskCSIDriverOperatorCRAvailable: AzureDiskDriverNodeServiceControllerAvailable: Waiting for the DaemonSet to deploy the CSI Node Service","reason":"AzureDiskCSIDriverOperatorCR_AzureDiskDriverNodeServiceController_Deploying","status":"False","type":"Available"},{"lastTransitionTime":"2024-05-29T19:59:00Z","message":"All is well","reason":"AsExpected","status":"True","type":"Upgradeable"},{"lastTransitionTime":"2024-05-29T19:59:00Z","reason":"NoData","status":"Unknown","type":"EvaluationConditionsDetected"}]}} I0529 20:05:21.566215 1 event.go:364] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-cluster-storage-operator", Name:"azure-cloud-controller-manager", UID:"205a4307-67e4-481e-9fee-975b2c5c40fb", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'OperatorStatusChanged' Status for clusteroperator/storage changed: Progressing message changed from "AzureDiskCSIDriverOperatorCRProgressing: AzureDiskDriverNodeServiceControllerProgressing: Waiting for DaemonSet to deploy node pods\nAzureFileCSIDriverOperatorCRProgressing: AzureFileDriverNodeServiceControllerProgressing: Waiting for DaemonSet to deploy node pods" to "AzureDiskCSIDriverOperatorCRProgressing: AzureDiskDriverNodeServiceControllerProgressing: Waiting for DaemonSet to deploy node pods"
On the HostedCluster itself, these errors with the csi pods not coming up are:
% k describe pod/azure-disk-csi-driver-node-5hb24 -n openshift-cluster-csi-drivers | grep fail
Liveness: http-get http://:healthz/healthz delay=10s timeout=3s period=10s #success=1 #failure=5
Liveness: http-get http://:rhealthz/healthz delay=10s timeout=3s period=10s #success=1 #failure=5
Warning FailedMount 2m (x28 over 42m) kubelet MountVolume.SetUp failed for volume "metrics-serving-cert" : secret "azure-disk-csi-driver-node-metrics-serving-cert" not found
There was an error with the CO as well:
storage 4.17.0-0.nightly-multi-2024-05-29-121923 False True True 49m AzureDiskCSIDriverOperatorCRAvailable: AzureDiskDriverNodeServiceControllerAvailable: Waiting for the DaemonSet to deploy the CSI Node Service
Version-Release number of selected component (if applicable):
4.17
How reproducible:
Every time
Steps to Reproduce:
1. Create a HC with a 4.17 nightly
Actual results:
Azure HC does not complete; nodes do join NodePool though
Expected results:
Azure HC should complete
Additional info:
Description of problem:
Occasionally, the TestMCDGetsMachineOSConfigSecrets e2e test fails during the e2e-gcp-op-techpreview CI job run. The reason for this failure is that the MCD pod is restarted when the MachineOSConfig is created, because it must be aware of the new secrets that the MachineOSConfig expects. The final portion of the test uses the MCD as a bridge to determine whether the expected actions have occurred. Without the MCD pod containers in a running / ready state, this operation fails.
Version-Release number of selected component (if applicable):
How reproducible:
Variable
Steps to Reproduce:
1. Run the e2e-gcp-op-techpreview job on 4.17+
Actual results:
The TestMCDGetsMachineOSConfigSecrets test fails because it cannot get the expected config file from the targeted node.
Expected results:
The test should pass.
Additional info:
This was discovered and fixed in 4.16 during the backport of the PR that introduced this problem. Consequently, this bug only covers 4.17+.
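A minimal sketch of the kind of readiness guard the test needs before using the restarted MCD pod as a bridge; namespace, pod name, and helper are illustrative, not the test's own code:

```go
package main

import (
	"context"
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// waitForPodReady polls until every container in the pod reports ready, so a
// test does not exec into an MCD pod that is still restarting.
func waitForPodReady(ctx context.Context, cs kubernetes.Interface, ns, name string) error {
	return wait.PollUntilContextTimeout(ctx, 5*time.Second, 5*time.Minute, true,
		func(ctx context.Context) (bool, error) {
			pod, err := cs.CoreV1().Pods(ns).Get(ctx, name, metav1.GetOptions{})
			if err != nil {
				return false, nil // pod may be restarting; keep polling
			}
			if pod.Status.Phase != corev1.PodRunning {
				return false, nil
			}
			for _, c := range pod.Status.ContainerStatuses {
				if !c.Ready {
					return false, nil
				}
			}
			return true, nil
		})
}

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	cs := kubernetes.NewForConfigOrDie(cfg)
	fmt.Println(waitForPodReady(context.Background(), cs, "openshift-machine-config-operator", "machine-config-daemon-xxxxx"))
}
```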
This is a clone of issue OCPBUGS-43764. The following is the description of the original issue:
—
Description of problem:
IBM ROKS uses Calico as their CNI. In previous versions of OpenShift, OpenShiftSDN would create IPTable rules that would force local endpoint for DNS Service.
Starting in OCP 4.17 with the removal of SDN, IBM ROKS is not using OVN-K, and therefore the local endpoint for the DNS service is not working as expected.
IBM ROKS is asking that the code block be restored to restore the functionality previously seen in OCP 4.16
Without this functionality IBM ROKS is not able to GA OCP 4.17
Please review the following PR: https://github.com/openshift/cluster-capi-operator/pull/174
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/baremetal-operator/pull/356
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
This is a clone of issue OCPBUGS-35054. The following is the description of the original issue:
—
Description of problem:
Create VPC and subnets with following configs [refer to attached CF template]:
Subnets (subnets-pair-default) in CIDR 10.0.0.0/16
Subnets (subnets-pair-134) in CIDR 10.134.0.0/16
Subnets (subnets-pair-190) in CIDR 10.190.0.0/16
Create cluster into subnets-pair-134, the bootstrap process fails [see attached log-bundle logs]:
level=debug msg=I0605 09:52:49.548166 937 loadbalancer.go:1262] "adding attributes to load balancer" controller="awscluster" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AWSCluster" AWSCluster="openshift-cluster-api-guests/yunjiang29781a-86-rvqd9" namespace="openshift-cluster-api-guests" name="yunjiang29781a-86-rvqd9" reconcileID="a9310bd5-acc7-4b01-8a84-e47139fc0d1d" cluster="openshift-cluster-api-guests/yunjiang29781a-86-rvqd9" attrs=[{"Key":"load_balancing.cross_zone.enabled","Value":"true"}]
level=debug msg=I0605 09:52:49.909861 937 awscluster_controller.go:291] "Looking up IP address for DNS" controller="awscluster" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AWSCluster" AWSCluster="openshift-cluster-api-guests/yunjiang29781a-86-rvqd9" namespace="openshift-cluster-api-guests" name="yunjiang29781a-86-rvqd9" reconcileID="a9310bd5-acc7-4b01-8a84-e47139fc0d1d" cluster="openshift-cluster-api-guests/yunjiang29781a-86-rvqd9" dns="yunjiang29781a-86-rvqd9-int-19a9485653bf29a1.elb.us-east-2.amazonaws.com"
level=debug msg=I0605 09:52:53.483058 937 reflector.go:377] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers.go:105: forcing resync
level=debug msg=Fetching Bootstrap SSH Key Pair...
Checking security groups:
<infraid>-lb allows 10.0.0.0/16:6443 and 10.0.0.0/16:22623
<infraid>-apiserver-lb allows 10.0.0.0/16:6443 and 10.134.0.0/16:22623 (and 0.0.0.0/0:6443)
are these settings correct?
Version-Release number of selected component (if applicable):
4.16.0-0.nightly-2024-06-03-060250
How reproducible:
Always
Steps to Reproduce:
1. Create subnets using attached CG template 2. Create cluster into subnets which CIDR is 10.134.0.0/16 3.
Actual results:
Bootstrap process fails.
Expected results:
Bootstrap succeeds.
Additional info:
No issues if creating cluster into subnets-pair-default (10.0.0.0/16)
No issues if only one CIDR in VPC, e.g. set VpcCidr to 10.134.0.0/16 in https://github.com/openshift/installer/blob/master/upi/aws/cloudformation/01_vpc.yaml
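A minimal sketch of the expectation: the API (6443) and MCS (22623) ingress rules should be derived from the cluster's machine-network CIDRs rather than a hard-coded 10.0.0.0/16. This is illustrative logic under that assumption, not installer code:

```go
package main

import "fmt"

// securityGroupIngressCIDRs is a hypothetical helper: open the control-plane
// ports for every machine-network CIDR of the cluster instead of a fixed
// 10.0.0.0/16.
func securityGroupIngressCIDRs(machineNetworks []string) []string {
	rules := make([]string, 0, len(machineNetworks)*2)
	for _, cidr := range machineNetworks {
		rules = append(rules,
			fmt.Sprintf("allow %s -> tcp/6443", cidr),  // kube-apiserver
			fmt.Sprintf("allow %s -> tcp/22623", cidr), // machine-config server
		)
	}
	return rules
}

func main() {
	// With the VPC from the report, the machine network is 10.134.0.0/16,
	// so both ports must be reachable from that CIDR.
	for _, r := range securityGroupIngressCIDRs([]string{"10.134.0.0/16"}) {
		fmt.Println(r)
	}
}
```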
Description of problem:
The ingress operator's E2E tests are perma-failing with a prometheus service account issue:
=== CONT TestAll/parallel/TestRouteMetricsControllerRouteAndNamespaceSelector route_metrics_test.go:86: prometheus service account not found
=== CONT TestAll/parallel/TestRouteMetricsControllerOnlyNamespaceSelector route_metrics_test.go:86: prometheus service account not found
=== CONT TestAll/parallel/TestRouteMetricsControllerOnlyRouteSelector route_metrics_test.go:86: prometheus service account not found
We need to bump openshift/library-go to pick up https://github.com/openshift/library-go/pull/1697, which updates the NewPrometheusClient function to switch from using a legacy service account API to the TokenRequest API.
Version-Release number of selected component (if applicable):
4.16, 4.17
How reproducible:
100%
Steps to Reproduce:
1. Run e2e-[aws|gcp|azure]-operator E2E tests on cluster-ingress-operator
Actual results:
route_metrics_test.go:86: prometheus service account not found
Expected results:
No failure
Additional info:
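For reference, the TokenRequest API that the updated NewPrometheusClient relies on can be exercised directly with client-go; a minimal sketch (the namespace and service account names are the usual monitoring ones, assumed here):

```go
package main

import (
	"context"
	"fmt"

	authenticationv1 "k8s.io/api/authentication/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	cs := kubernetes.NewForConfigOrDie(cfg)

	// Request a short-lived token for the prometheus-k8s service account via
	// the TokenRequest API instead of reading a legacy token Secret, which no
	// longer exists by default on recent clusters.
	expiry := int64(3600)
	tr, err := cs.CoreV1().ServiceAccounts("openshift-monitoring").CreateToken(
		context.TODO(), "prometheus-k8s",
		&authenticationv1.TokenRequest{
			Spec: authenticationv1.TokenRequestSpec{ExpirationSeconds: &expiry},
		},
		metav1.CreateOptions{},
	)
	if err != nil {
		panic(err)
	}
	fmt.Println("token length:", len(tr.Status.Token))
}
```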
Please review the following PR: https://github.com/openshift/csi-external-snapshotter/pull/152
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
This is a clone of issue OCPBUGS-41184. The following is the description of the original issue:
—
Description of problem:
The disk and instance types for gcp machines should be validated further. The current implementation provides validation for each individually, but the disk types and instance types should be checked against each other for valid combinations. The attached spreadsheet displays the combinations of valid disk and instance types.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
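A minimal sketch of the kind of cross-field check being asked for; the family-to-disk-type matrix below is an illustrative assumption, not the authoritative data from the referenced spreadsheet:

```go
package main

import (
	"fmt"
	"strings"
)

// validDiskTypes maps a GCP machine-type family to the disk types it accepts.
// These entries are illustrative assumptions only.
var validDiskTypes = map[string]map[string]bool{
	"n2": {"pd-standard": true, "pd-balanced": true, "pd-ssd": true},
	"n4": {"hyperdisk-balanced": true},
	"c3": {"pd-balanced": true, "pd-ssd": true, "hyperdisk-balanced": true},
}

func validateCombination(instanceType, diskType string) error {
	family := strings.SplitN(instanceType, "-", 2)[0]
	allowed, ok := validDiskTypes[family]
	if !ok {
		return fmt.Errorf("unknown machine-type family %q", family)
	}
	if !allowed[diskType] {
		return fmt.Errorf("disk type %q is not supported by %s instances", diskType, instanceType)
	}
	return nil
}

func main() {
	fmt.Println(validateCombination("n4-standard-4", "pd-ssd")) // error: unsupported combination
	fmt.Println(validateCombination("n2-standard-4", "pd-ssd")) // <nil>
}
```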
Description of problem:
After destroying the cluster, there are still some files left over in <install-dir>/.clusterapi_output:
$ ls -ltra
total 1516
drwxr-xr-x. 1 fedora fedora 596 Jun 17 03:46 ..
drwxr-x---. 1 fedora fedora 88 Jun 17 06:09 .clusterapi_output
-rw-r--r--. 1 fedora fedora 1552382 Jun 17 06:09 .openshift_install.log
drwxr-xr-x. 1 fedora fedora 80 Jun 17 06:09 .
$ ls -ltr .clusterapi_output/
total 40
-rw-r--r--. 1 fedora fedora 2335 Jun 17 05:58 envtest.kubeconfig
-rw-r--r--. 1 fedora fedora 20542 Jun 17 06:03 kube-apiserver.log
-rw-r--r--. 1 fedora fedora 10656 Jun 17 06:03 etcd.log
Then continue installing a new cluster within the same install dir; the installer exits with the error below:
$ ./openshift-install create cluster --dir ipi-aws
INFO Credentials loaded from the "default" profile in file "/home/fedora/.aws/credentials"
INFO Consuming Install Config from target directory
FATAL failed to fetch Cluster: failed to load asset "Cluster": local infrastructure provisioning artifacts already exist. There may already be a running cluster
After removing .clusterapi_output/envtest.kubeconfig and creating the cluster again, installation continues.
Version-Release number of selected component (if applicable):
4.16 nightly build
How reproducible:
always
Steps to Reproduce:
1. Launch capi-based installation 2. Destroy cluster 3. Launch new cluster within same install dir
Actual results:
Fail to launch new cluster within the same install dir, because .clusterapi_output/envtest.kubeconfig is still there.
Expected results:
Succeed to create a new cluster within the same install dir
Additional info:
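A minimal sketch of the cleanup that would avoid the stale-artifact check; the directory name comes from the listing above, the function itself is hypothetical rather than installer code:

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// cleanupClusterAPIArtifacts sketches what "destroy cluster" could do in
// addition to its current cleanup: drop the local control-plane artifacts so
// a later "create cluster" in the same directory does not conclude that
// infrastructure provisioning is already in progress.
func cleanupClusterAPIArtifacts(installDir string) error {
	dir := filepath.Join(installDir, ".clusterapi_output")
	if _, err := os.Stat(dir); os.IsNotExist(err) {
		return nil
	}
	return os.RemoveAll(dir)
}

func main() {
	if err := cleanupClusterAPIArtifacts("ipi-aws"); err != nil {
		fmt.Println("cleanup failed:", err)
	}
}
```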
Component Readiness has found a potential regression in the following test:
[sig-network] pods should successfully create sandboxes by adding pod to network
Probability of significant regression: 99.93%
Sample (being evaluated) Release: 4.17
Start Time: 2024-07-12T00:00:00Z
End Time: 2024-07-18T23:59:59Z
Success Rate: 74.29%
Successes: 25
Failures: 9
Flakes: 1
Base (historical) Release: 4.16
Start Time: 2024-05-31T00:00:00Z
End Time: 2024-06-27T23:59:59Z
Success Rate: 98.18%
Successes: 54
Failures: 1
Flakes: 0
This test appears to be failing roughly 50% of the time on periodic-ci-openshift-release-master-nightly-4.17-upgrade-from-stable-4.16-e2e-metal-ipi-ovn-upgrade and the error looks workable:
[sig-network] pods should successfully create sandboxes by adding pod to network expand_less 0s { 1 failures to create the sandbox namespace/e2e-test-ns-global-srg5f node/worker-1 pod/test-ipv6-podtm8vn hmsg/da5d303f42 - never deleted - firstTimestamp/2024-07-18T11:26:41Z interesting/true lastTimestamp/2024-07-18T11:26:41Z reason/FailedCreatePodSandBox Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_test-ipv6-podtm8vn_e2e-test-ns-global-srg5f_65c4722e-d832-4ec8-8209-39587a81d95d_0(d11ec24638e2d578486e57851a419e52ddd4367d48b33e46825f7c42687c9f7f): error adding pod e2e-test-ns-global-srg5f_test-ipv6-podtm8vn to CNI network "multus-cni-network": plugin type="multus-shim" name="multus-cni-network" failed (add): CmdAdd (shim): CNI request failed with status 400: 'ContainerID:"d11ec24638e2d578486e57851a419e52ddd4367d48b33e46825f7c42687c9f7f" Netns:"/var/run/netns/7bb7a08a-9352-49d6-a211-02046349dba6" IfName:"eth0" Args:"IgnoreUnknown=1;K8S_POD_NAMESPACE=e2e-test-ns-global-srg5f;K8S_POD_NAME=test-ipv6-podtm8vn;K8S_POD_INFRA_CONTAINER_ID=d11ec24638e2d578486e57851a419e52ddd4367d48b33e46825f7c42687c9f7f;K8S_POD_UID=65c4722e-d832-4ec8-8209-39587a81d95d" Path:"" ERRORED: error configuring pod [e2e-test-ns-global-srg5f/test-ipv6-podtm8vn] networking: [e2e-test-ns-global-srg5f/test-ipv6-podtm8vn/65c4722e-d832-4ec8-8209-39587a81d95d:ovn-kubernetes]: error adding container to network "ovn-kubernetes": CNI request failed with status 400: '[e2e-test-ns-global-srg5f/test-ipv6-podtm8vn d11ec24638e2d578486e57851a419e52ddd4367d48b33e46825f7c42687c9f7f network default NAD default] [e2e-test-ns-global-srg5f/test-ipv6-podtm8vn d11ec24638e2d578486e57851a419e52ddd4367d48b33e46825f7c42687c9f7f network default NAD default] failed to configure pod interface: timed out waiting for OVS port binding (ovn-installed) for 0a:58:0a:83:00:e3 [10.131.0.227/23] ' ': StdinData: {"binDir":"/var/lib/cni/bin","clusterNetwork":"/host/run/multus/cni/net.d/10-ovn-kubernetes.conf","cniVersion":"0.3.1","daemonSocketDir":"/run/multus/socket","globalNamespaces":"default,openshift-multus,openshift-sriov-network-operator","logLevel":"verbose","logToStderr":true,"name":"multus-cni-network","namespaceIsolation":true,"type":"multus-shim"}}
Description of problem:
EC2 instances are failing to launch via MAPI because the instance profile set in the MachineSet config is invalid; it was not created by the installer.
~~~
errorMessage: "error launching instance: Value (ci-op-ikqrdc6x-cc206-bmcnx-edge-profile) for parameter iamInstanceProfile.name is invalid. Invalid IAM Instance Profile name"
errorReason: InvalidConfiguration
~~~
Version-Release number of selected component (if applicable):
4.17+
How reproducible:
Always
Steps to Reproduce:
1. set the edge compute pool on installer, without setting a custom instance profile 2. create a cluster 3.
Actual results:
Expected results:
instance created in edge zone
Additional info:
- IAM Profile feature: https://github.com/openshift/installer/pull/8689/files#diff-e46d61c55e5e276e3c264d18cba0346777fe3e662d0180a173001b8282af7c6eR51-R54
- CI failures: https://sippy.dptools.openshift.org/sippy-ng/jobs/Presubmits/analysis?filters=%7B%22items%22%3A%5B%7B%22columnField%22%3A%22name%22%2C%22operatorValue%22%3A%22equals%22%2C%22value%22%3A%22pull-ci-openshift-origin-master-e2e-aws-ovn-edge-zones%22%7D%5D%7D
- Sippy: https://sippy.dptools.openshift.org/sippy-ng/jobs/4.17/runs?filters=%7B%22items%22%3A%5B%7B%22columnField%22%3A%22name%22%2C%22operatorValue%22%3A%22equals%22%2C%22value%22%3A%22periodic-ci-openshift-release-master-nightly-4.17-e2e-aws-ovn-edge-zones-manifest-validation%22%7D%5D%7D&sortField=timestamp&sort=desc
- Slack thread: https://redhat-internal.slack.com/archives/CBZHF4DHC/p1722964067282219
Please review the following PR: https://github.com/openshift/kubernetes-kube-storage-version-migrator/pull/205
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/sdn/pull/623
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
The following jobs have been failing at the bootstrap stage. The error message seen is "level=error msg=Bootstrap failed to complete: timed out waiting for the condition".
https://prow.ci.openshift.org/job-history/origin-ci-test/logs/periodic-ci-shiftstack-shiftstack-ci-main-periodic-4.17-e2e-openstack-csi-manila
https://prow.ci.openshift.org/job-history/origin-ci-test/logs/periodic-ci-shiftstack-shiftstack-ci-main-periodic-4.17-e2e-openstack-csi-cinder
https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-shiftstack-shiftstack-ci-main-periodic-4.17-e2e-openstack-nfv-mellanox/1797334785262096384
https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-shiftstack-shiftstack-ci-main-periodic-4.17-e2e-openstack-proxy/1797330506849718272
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
The "0000_90_olm_00-service-monitor.yaml" manifest containing RBAC for Prometheus to scrape OLM namespace is omitted https://github.com/openshift/hypershift/blob/e9594f47570b557877009b607d26b9cb4a34f233/control-plane-operator/controllers/hostedcontrolplane/cvo/reconcile.go#L66 But "0000_50_olm_06-psm-operator.servicemonitor.yaml" containing a new OLM ServiceMonitor that was added in https://github.com/openshift/operator-framework-olm/pull/551/files is still deployed, which make Prometheus logs failures: see https://issues.redhat.com/browse/OCPBUGS-36299
Version-Release number of selected component (if applicable):
How reproducible:
Check Prometheus logs in any 4.17 hosted cluster
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Prometheus shouldn't be asked to discover/scrape targets without being given the appropriate RBAC. If that new ServiceMonitor is needed, the appropriate RBAC should be deployed; if not, the ServiceMonitor should be omitted. Maybe Hypershift should use an opt-in approach instead of opt-out for OLM resources, to avoid such issues in the future.
Additional info:
Description of problem:
Httpd icon does not show up in git import flow
Version-Release number of selected component (if applicable):
4.16
How reproducible:
always
Steps to Reproduce:
1. Navigate to import from git 2. Fill in any github url 3. Edit import strategy, notice that httpd icon is missing
Actual results:
Icon isn't there
Expected results:
It is there
Additional info:
Description of problem:
4.16 NodePool CEL validation breaking existing/older NodePools
Version-Release number of selected component (if applicable):
4.16.0
How reproducible:
100%
Steps to Reproduce:
1. Deploy 4.16 NodePool CRDs 2. Create NodePool resource without spec.replicas + spec.autoScaling 3.
Actual results:
The NodePool "22276350-mynodepool" is invalid: spec: Invalid value: "object": One of replicas or autoScaling should be set but not both
Expected results:
NodePool to apply successfully
Additional info:
Breaking change: https://github.com/openshift/hypershift/pull/3786
This is a clone of issue OCPBUGS-42584. The following is the description of the original issue:
—
Description of problem:
Redhat CamelK installation should be via CLI
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. Check for operator installation through the CLI 2. Check for any post-installation steps needed 3.
Actual results:
Expected results:
Additional info:
This is a clone of issue OCPBUGS-39320. The following is the description of the original issue:
—
The original logic of the test checks for a condition that can never happen. Moreover, the test compares the reserved cpus against the entire content of the irqbalance config file, which is not great: there were accidental matches between comments and cpu no. 1.
Remove the check of the reserved cpus in /etc/sysconfig/irqbalance, as in the current Performance Profile deployment reserved cpus are never added to the irqbalance config file.
Description of problem:
When building https://github.com/kubevirt-ui/kubevirt-plugin from its release-4.16 branch, following warnings are issued during the webpack build:
WARNING in shared module react No required version specified and unable to automatically determine one. Unable to find required version for "react" in description file (/home/vszocs/work/kubevirt-plugin/node_modules/react/package.json). It need to be in dependencies, devDependencies or peerDependencies.
These warnings should not appear during the plugin build.
Root cause seems to be webpack module federation code which attempts to auto-detect actual build version of shared modules, but this code seems to be unreliable and warnings such as the one above are anything but helpful.
How reproducible: always on kubevirt-plugin branch release-4.16
Steps to Reproduce:
1. git clone https://github.com/kubevirt-ui/kubevirt-plugin
2. cd kubevirt-plugin
3. yarn && yarn dev
Please review the following PR: https://github.com/openshift/prometheus-alertmanager/pull/92
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
After upgrading to OpenShift 4.14, the must-gather took much longer than before.
Version-Release number of selected component (if applicable):
4.14
How reproducible:
Always
Steps to Reproduce:
1. Run oc adm must-gather 2. Wait for it to complete 3.
Actual results:
For a cluster with around 50 nodes, the must-gather took about 30 minutes.
Expected results:
For a cluster with around 50 nodes, the must-gather can finish in about 10 minutes.
Additional info:
It seems the gather_ppc collection script is related here. https://github.com/openshift/must-gather/blob/release-4.14/collection-scripts/gather_ppc
This is a clone of issue OCPBUGS-38349. The following is the description of the original issue:
—
Description of problem:
When configuring an OpenID idp that can only be accessed via the data plane, if the hostname of the provider can only be resolved by the data plane, reconciliation of the idp fails.
Version-Release number of selected component (if applicable):
4.16
How reproducible:
always
Steps to Reproduce:
1. Configure an OpenID idp on a HostedCluster with a URL that points to a service in the dataplane (like https://keycloak.keycloak.svc)
Actual results:
The oauth server fails to be reconciled
Expected results:
The oauth server reconciles and functions properly
Additional info:
Follow up to OCPBUGS-37753
Description of problem:
ARO cluster fails to install with disconnected networking. We see master nodes bootup hang on the service machine-config-daemon-pull.service. Logs from the service indicate it cannot reach the public IP of the image registry. In ARO, image registries need to go via a proxy. Dnsmasq is used to inject proxy DNS answers, but machine-config-daemon-pull is starting before ARO's dnsmasq.service starts.
Version-Release number of selected component (if applicable):
4.14.16
How reproducible:
Always
Steps to Reproduce:
For a fresh install: 1. Create the required ARO vnet and subnets 2. Attach a route table to the subnets with a blackhole route 0.0.0.0/0 3. Create a 4.14 ARO cluster with --apiserver-visibility=Private --ingress-visibility=Private --outbound-type=UserDefinedRouting [OR] Post upgrade to 4.14: 1. Create an ARO 4.13 UDR. 2. Upgrade the cluster 4.13 -> 4.14; the upgrade was successful 3. Create a new node (scale up); we run into the same issue.
Actual results:
For Fresh Install of 4.14: ERROR: (InternalServerError) Deployment failed. [OR] Post Upgrade to 4.14: Node doesn't come into a Ready State and Machine is stuck in Provisioned status.
Expected results:
Succeeded
Additional info:
We see in the node logs that machine-config-daemon-pull.service is unable to reach the image registry. ARO's dnsmasq was not yet started.
Previously, systemd ordering was set for ovs-configuration.service to start after (ARO's) dnsmasq.service. Perhaps that should have gone on machine-config-daemon-pull.service.
See https://issues.redhat.com/browse/OCPBUGS-25406.
This is a clone of issue OCPBUGS-44163. The following is the description of the original issue:
—
Description of problem:
We identified a regression where we can no longer get oauth tokens for HyperShift v4.16 clusters via the OpenShift web console. v4.16.10 works fine, but once clusters are patched to v4.16.16 (or are created at that version) they fail to get the oauth token. This is due to this faulty PR: https://github.com/openshift/hypershift/pull/4496. The oauth openshift deployment was changed and affected the IBM Cloud code path. We need this endpoint to change back to using `socks5`.
Bug (diff of the deployment's proxy env values):
< value: socks5://127.0.0.1:8090
---
> value: http://127.0.0.1:8092
98c98
< value: socks5://127.0.0.1:8090
---
> value: http://127.0.0.1:8092
Fix: Change http://127.0.0.1:8092 back to socks5://127.0.0.1:8090
Version-Release number of selected component (if applicable):
4.16.16
How reproducible:
Every time.
Steps to Reproduce:
1. Create a ROKS v4.16.16 HyperShift-based cluster. 2. Navigate to the OpenShift web console. 3. Click the IAM#<username> menu in the top right. 4. Click 'Copy login command'. 5. Click 'Display token'.
Actual results:
Error getting token: Post "https://example.com:31335/oauth/token": http: server gave HTTP response to HTTPS client
Expected results:
The oauth token should be successfully displayed.
Additional info:
Description of problem:
Multicast packets got 100% dropped
Version-Release number of selected component (if applicable):
4.16.0-0.nightly-2024-06-02-202327
How reproducible:
Always
Steps to Reproduce:
1. Create a test namespace and enable multicast
oc describe ns test
Name: test
Labels: kubernetes.io/metadata.name=test
pod-security.kubernetes.io/audit=restricted
pod-security.kubernetes.io/audit-version=v1.24
pod-security.kubernetes.io/enforce=restricted
pod-security.kubernetes.io/enforce-version=v1.24
pod-security.kubernetes.io/warn=restricted
pod-security.kubernetes.io/warn-version=v1.24
Annotations: k8s.ovn.org/multicast-enabled: true
openshift.io/sa.scc.mcs: s0:c28,c27
openshift.io/sa.scc.supplemental-groups: 1000810000/10000
openshift.io/sa.scc.uid-range: 1000810000/10000
Status: Active
No resource quota.
No LimitRange resource.
2. Created multicast pods
% oc get pods -n test -o wide NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES mcast-rc-67897 1/1 Running 0 10s 10.129.2.42 ip-10-0-86-58.us-east-2.compute.internal <none> <none> mcast-rc-ftsq8 1/1 Running 0 10s 10.128.2.61 ip-10-0-33-247.us-east-2.compute.internal <none> <none> mcast-rc-q48db 1/1 Running 0 10s 10.131.0.27 ip-10-0-1-176.us-east-2.compute.internal <none> <none>
3. Test mulicast traffic with omping from two pods
% oc rsh -n test mcast-rc-67897 ~ $ ~ $ omping -c10 10.129.2.42 10.128.2.61 10.128.2.61 : waiting for response msg 10.128.2.61 : joined (S,G) = (*, 232.43.211.234), pinging 10.128.2.61 : unicast, seq=1, size=69 bytes, dist=2, time=0.506ms 10.128.2.61 : unicast, seq=2, size=69 bytes, dist=2, time=0.595ms 10.128.2.61 : unicast, seq=3, size=69 bytes, dist=2, time=0.555ms 10.128.2.61 : unicast, seq=4, size=69 bytes, dist=2, time=0.572ms 10.128.2.61 : unicast, seq=5, size=69 bytes, dist=2, time=0.614ms 10.128.2.61 : unicast, seq=6, size=69 bytes, dist=2, time=0.653ms 10.128.2.61 : unicast, seq=7, size=69 bytes, dist=2, time=0.611ms 10.128.2.61 : unicast, seq=8, size=69 bytes, dist=2, time=0.594ms 10.128.2.61 : unicast, seq=9, size=69 bytes, dist=2, time=0.603ms 10.128.2.61 : unicast, seq=10, size=69 bytes, dist=2, time=0.687ms 10.128.2.61 : given amount of query messages was sent 10.128.2.61 : unicast, xmt/rcv/%loss = 10/10/0%, min/avg/max/std-dev = 0.506/0.599/0.687/0.050 10.128.2.61 : multicast, xmt/rcv/%loss = 10/0/100%, min/avg/max/std-dev = 0.000/0.000/0.000/0.000 % oc rsh -n test mcast-rc-ftsq8 ~ $ omping -c10 10.128.2.61 10.129.2.42 10.129.2.42 : waiting for response msg 10.129.2.42 : waiting for response msg 10.129.2.42 : waiting for response msg 10.129.2.42 : waiting for response msg 10.129.2.42 : joined (S,G) = (*, 232.43.211.234), pinging 10.129.2.42 : unicast, seq=1, size=69 bytes, dist=2, time=0.463ms 10.129.2.42 : unicast, seq=2, size=69 bytes, dist=2, time=0.578ms 10.129.2.42 : unicast, seq=3, size=69 bytes, dist=2, time=0.632ms 10.129.2.42 : unicast, seq=4, size=69 bytes, dist=2, time=0.652ms 10.129.2.42 : unicast, seq=5, size=69 bytes, dist=2, time=0.635ms 10.129.2.42 : unicast, seq=6, size=69 bytes, dist=2, time=0.626ms 10.129.2.42 : unicast, seq=7, size=69 bytes, dist=2, time=0.597ms 10.129.2.42 : unicast, seq=8, size=69 bytes, dist=2, time=0.618ms 10.129.2.42 : unicast, seq=9, size=69 bytes, dist=2, time=0.964ms 10.129.2.42 : unicast, seq=10, size=69 bytes, dist=2, time=0.619ms 10.129.2.42 : given amount of query messages was sent 10.129.2.42 : unicast, xmt/rcv/%loss = 10/10/0%, min/avg/max/std-dev = 0.463/0.638/0.964/0.126 10.129.2.42 : multicast, xmt/rcv/%loss = 10/0/100%, min/avg/max/std-dev = 0.000/0.000/0.000/0.000
Actual results:
Multicast packet loss is 100%
10.129.2.42 : multicast, xmt/rcv/%loss = 10/0/100%, min/avg/max/std-dev = 0.000/0.000/0.000/0.000
Expected results:
There should not be 100% packet loss.
Additional info:
No such issue in 4.15; tested on the same profile ipi-on-aws/versioned-installer-ci with 4.15.0-0.nightly-2024-05-31-131420, performing the same operations as in the steps above.
The output for both multicast pods:
10.131.0.27 : unicast, xmt/rcv/%loss = 10/10/0%, min/avg/max/std-dev = 1.176/1.239/1.269/0.027 10.131.0.27 : multicast, xmt/rcv/%loss = 10/9/9% (seq>=2 0%), min/avg/max/std-dev = 1.227/1.304/1.755/0.170 and 10.129.2.16 : unicast, xmt/rcv/%loss = 10/10/0%, min/avg/max/std-dev = 1.101/1.264/1.321/0.065 10.129.2.16 : multicast, xmt/rcv/%loss = 10/10/0%, min/avg/max/std-dev = 1.230/1.351/1.890/0.191
This is a clone of issue OCPBUGS-42563. The following is the description of the original issue:
—
During install of multi-AZ OSD GCP clusters into customer-provided GCP projects, extra control plane nodes are created by the installer. This may be limited to a few regions, and has shown up in our testing in us-west2 and asia-east2.
When the cluster is installed, the installer provisions three control plane nodes via the cluster-api:
However, the Machine manifests for master-0 and master-2 are written with the wrong AZs (master-0 in AZ *c and master-2 in AZ *a).
When the Machine controller in-cluster starts up and parses the manifests, it cannot find a VM for master-0 in AZ *c, or master-2 in *a, so it proceeds to try to create new VMs for those cases. master-1 is identified correctly, and unaffected.
This results in the cluster coming up with three control plane nodes (master-0 and master-2 having no backing Machines), three control plane Machines (only master-1 having a Node link, the other two listed in Provisioned state with no Nodes), and 5 GCP VMs for these control plane nodes.
This happens consistently, across multiple GCP projects, so far in us-west2 and asia-east2 ONLY.
4.16.z clusters work as expected, as do clusters upgraded from 4.16.z to 4.17.z.
4.17.0-rc3 - 4.17.0-rc6 have all been identified as having this issue.
100%
I'm unsure how to replicate this in vanilla cluster install, but via OSD:
Example:
$ ocm create cluster --provider=gcp --multi-az --ccs --secure-boot-for-shielded-vms --region asia-east2 --service-account-file ~/.config/gcloud/chcollin1-dev-acct.json --channel-group candidate --version openshift-v4.17.0-rc.3-candidate chcollin-4170rc3-gcp
Requesting a GCP install via an install-config with controlPlane.platform.gcp.zones out of order seems to reliably reproduce.
Install will fail in OSD, but a cluster will be created with multiple extra control-plane nodes, and the API server will respond on the master-1 node.
A standard 3 control-plane-node cluster is created.
We're unsure what it is about the two reported Zones or the difference between the primary OSD GCP project and customer-supplied Projects that has an effect.
The only thing we've noticed is the install-config has the order backwards for compute nodes, but not for control plane nodes:
{ "controlPlane": [ "us-west2-a", "us-west2-b", "us-west2-c" ], "compute": [ "us-west2-c", <--- inverted order. Shouldn't matter when building control-plane Machines, but maybe cross-contaminated somehow? "us-west2-b", "us-west2-a" ], "platform": { "defaultMachinePlatform": { <--- nothing about zones in here, although again, the controlPlane block should override any zones configured here "osDisk": { "DiskSizeGB": 0, "diskType": "" }, "secureBoot": "Enabled", "type": "" }, "projectID": "anishpatel", "region": "us-west2" } }
Since we see the divergence at the asset/manifest level, we should be able to reproduce with just an openshift-install create manifests, followed by grep -r zones: or something, without having to wait for an actual install attempt to come up and fail.
Please review the following PR: https://github.com/openshift/cloud-provider-openstack/pull/280
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
The PowerVS CI uses the installer image to do some necessary setup. The openssl binary was recently removed from that image. So we need to switch to the upi-installer image.
Version-Release number of selected component (if applicable):
4.17
How reproducible:
Always
Steps to Reproduce:
1. Look at CI runs
This is a clone of issue OCPBUGS-38813. The following is the description of the original issue:
—
Description of problem:
OLM 4.17 references 4.16 catalogs
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. oc get pods -n openshift-marketplace -o yaml | grep "image: registry.redhat.io"
Actual results:
image: registry.redhat.io/redhat/certified-operator-index:v4.16
image: registry.redhat.io/redhat/certified-operator-index:v4.16
image: registry.redhat.io/redhat/community-operator-index:v4.16
image: registry.redhat.io/redhat/community-operator-index:v4.16
image: registry.redhat.io/redhat/redhat-marketplace-index:v4.16
image: registry.redhat.io/redhat/redhat-marketplace-index:v4.16
image: registry.redhat.io/redhat/redhat-operator-index:v4.16
image: registry.redhat.io/redhat/redhat-operator-index:v4.16
Expected results:
image: registry.redhat.io/redhat/certified-operator-index:v4.17
image: registry.redhat.io/redhat/certified-operator-index:v4.17
image: registry.redhat.io/redhat/community-operator-index:v4.17
image: registry.redhat.io/redhat/community-operator-index:v4.17
image: registry.redhat.io/redhat/redhat-marketplace-index:v4.17
image: registry.redhat.io/redhat/redhat-marketplace-index:v4.17
image: registry.redhat.io/redhat/redhat-operator-index:v4.17
image: registry.redhat.io/redhat/redhat-operator-index:v4.17
Additional info:
This is a clone of issue OCPBUGS-38802. The following is the description of the original issue:
—
Description of problem:
Infrastructure object with platform None is ignored by node-joiner tool
Version-Release number of selected component (if applicable):
4.17
How reproducible:
Always
Steps to Reproduce:
1. Run the node-joiner add-nodes command
Actual results:
Currently the node-joiner tool retrieves the platform type from the kube-system/cluster-config-v1 config map
Expected results:
Retrieve the platform type from the infrastructure cluster object
Additional info:
Description of problem:
Version-Release number of selected component (if applicable):
There is an intermittent issue with the UploadImage() implementation in github.com/nutanix-cloud-native/prism-go-client@v0.3.4, on which the OCP installer depends. When testing the OCP installer with ClusterAPIInstall=true, I frequently hit the error with UploadImage() when calling to upload the bootstrap image to PC from the local image file. The error logs: INFO creating the bootstrap image demo-ocp-cluster-g1-lrmwb-bootstrap-ign.iso (uuid: 75694edf-f9c4-4d9a-9a44-731a4d103cc8), taskUUID: c8eafd49-54e2-4fb9-a3df-c456863d71fd. INFO created the bootstrap image demo-ocp-cluster-g1-lrmwb-bootstrap-ign.iso (uuid: 75694edf-f9c4-4d9a-9a44-731a4d103cc8). INFO preparing to upload the bootstrap image demo-ocp-cluster-g1-lrmwb-bootstrap-ign.iso (uuid: 75694edf-f9c4-4d9a-9a44-731a4d103cc8) data from file /Users/yanhuali/Library/Caches/openshift-installer/image_cache/demo-ocp-cluster-g1-lrmwb-bootstrap-ign.iso ERROR failed to upload the bootstrap image data "demo-ocp-cluster-g1-lrmwb-bootstrap-ign.iso" from filepath /Users/yanhuali/Library/Caches/openshift-installer/image_cache/demo-ocp-cluster-g1-lrmwb-bootstrap-ign.iso: status: 400 Bad Request, error-response: { ERROR "api_version": "3.1", ERROR "code": 400, ERROR "message_list": [ ERROR { ERROR "message": "Given input is invalid. Image 75694edf-f9c4-4d9a-9a44-731a4d103cc8 is already complete", ERROR "reason": "INVALID_ARGUMENT" ERROR }ERROR ], ERROR "state": "ERROR" ERROR } ERROR failed to fetch Cluster: failed to generate asset "Cluster": failed to create cluster: failed preparing ignition data: failed to upload the bootstrap image data "demo-ocp-cluster-g1-lrmwb-bootstrap-ign.iso" from filepath /Users/yanhuali/Library/Caches/openshift-installer/image_cache/demo-ocp-cluster-g1-lrmwb-bootstrap-ign.iso: status: 400 Bad Request, error-response: { ERROR "api_version": "3.1", ERROR "code": 400, ERROR "message_list": [ ERROR { ERROR "message": "Given input is invalid. Image 75694edf-f9c4-4d9a-9a44-731a4d103cc8 is already complete", ERROR "reason": "INVALID_ARGUMENT" ERROR }ERROR ], ERROR "state": "ERROR" ERROR } The OCP installer code calling the prism-go-client function UploadImage() is here:https://github.com/openshift/installer/blob/master/pkg/infrastructure/nutanix/clusterapi/clusterapi.go#L172-L207
How reproducible:
Use OCP IPI 4.16 to provision a Nutanix OCP cluster with the install-config ClusterAPIInstall=true. This is an intermittent issue, so you need to repeat the test several times to reproduce.
Steps to Reproduce:
1. 2. 3.
Actual results:
The installer intermittently failed at uploading the bootstrap image data to PC from the local image data file.
Expected results:
The installer successfully creates the Nutanix OCP cluster with the install-config ClusterAPIInstall=true.
Additional info:
Description of problem:
As a user, when I manually type in a git repo it sends tens of unnecessary API calls to the git provider, which makes me hit the rate limit very quickly, and reduces my productivity
Version-Release number of selected component (if applicable):
4.17.0
How reproducible:
Always
Steps to Reproduce:
1. Developer perspective > +Add > Import from Git 2. Open devtools and switch to the networking tab 3. Start typing a GitHub link
Actual results:
There are many API calls to GitHub
Expected results:
There should not be that many
Additional info:
Description of problem:
Creating any type of RoleBinding triggers an Admission Webhook Warning: xxxx unknown field: subjectRef
Version-Release number of selected component (if applicable):
4.17.0-0.nightly-2024-06-20-005211
How reproducible:
Always
Steps to Reproduce:
1. Go to the RoleBinding creation page: User Management -> RoleBindings -> Create binding, or /k8s/cluster/rolebindings/~new 2. Create any type of RoleBinding
Actual results:
2. A warning message is shown on submit: Admission Webhook Warning: RoleBinding test-ns-1 violates policy 299 - "unknown field \"subjectRef\""
Expected results:
2. No warning message is shown
Additional info:
Description of problem:
The following parameters have been added to the list of safe sysctls since k8s v1.29 [1].
net.ipv4.tcp_keepalive_time
net.ipv4.tcp_fin_timeout
net.ipv4.tcp_keepalive_intvl
net.ipv4.tcp_keepalive_probes
However, the list of safe sysctls returned by SafeSysctlAllowlist() in OpenShift is not updated [2].
[1] https://kubernetes.io/docs/tasks/administer-cluster/sysctl-cluster/#safe-and-unsafe-sysctls
[2] https://github.com/openshift/apiserver-library-go/blob/e88385a79b1724850143487d507f606f8540f437/pkg/securitycontextconstraints/sysctl/mustmatchpatterns.go#L32
Due to this, pods with these safe sysctls configured are blocked by SCC for non-privileged users.
(Look at "Steps to Reproduce" for details.)
$ oc apply -f pod-sysctl.yaml
Error from server (Forbidden): error when creating "pod-sysctl.yaml": pods "pod-sysctl" is forbidden: unable to validate against any security context constraint: [provider "trident-controller": Forbidden: not usable by user or serviceaccount, provider "anyuid": Forbidden: not usable by user or serviceaccount, pod.spec.securityContext.sysctls[0]: Forbidden: unsafe sysctl "net.ipv4.tcp_fin_timeout" is not allowed, provider "nonroot-v2": Forbidden: not usable by user or serviceaccount, provider "nonroot": Forbidden: not usable by user or serviceaccount, provider "hostmount-anyuid": Forbidden: not usable by user or serviceaccount, provider "machine-api-termination-handler": Forbidden: not usable by user or serviceaccount, provider "hostnetwork-v2": Forbidden: not usable by user or serviceaccount, provider "hostnetwork": Forbidden: not usable by user or serviceaccount, provider "hostaccess": Forbidden: not usable by user or serviceaccount, provider "trident-node-linux": Forbidden: not usable by user or serviceaccount, provider "node-exporter": Forbidden: not usable by user or serviceaccount, provider "privileged": Forbidden: not usable by user or serviceaccount]
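For illustration, a minimal sketch of what extending the allowlist in [2] could look like; the exact variable name and structure of the real mustmatchpatterns.go are assumptions here, and only the sysctl names come from [1]:
```go
package sysctl

// Sketch only: the pre-1.29 entries mirror the upstream safe-sysctl defaults,
// but the actual OpenShift code in [2] may be structured differently.
var safeSysctls = []string{
	"kernel.shm_rmid_forced",
	"net.ipv4.ip_local_port_range",
	"net.ipv4.ip_unprivileged_port_start",
	"net.ipv4.tcp_syncookies",
	"net.ipv4.ping_group_range",
	// Added to the safe list in Kubernetes v1.29 [1]:
	"net.ipv4.tcp_keepalive_time",
	"net.ipv4.tcp_fin_timeout",
	"net.ipv4.tcp_keepalive_intvl",
	"net.ipv4.tcp_keepalive_probes",
}

// SafeSysctlAllowlist returns a copy of the allowlist so callers cannot
// mutate the package-level slice.
func SafeSysctlAllowlist() []string {
	return append([]string(nil), safeSysctls...)
}
```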
Version-Release number of selected component (if applicable):
OpenShift v4.16.4
How reproducible:
Always
Steps to Reproduce:
Step1. Login as a non-privileged user.
$ oc login -u user
Step2. Create the following yaml file and apply it.
$ cat pod-sysctl.yaml
apiVersion: v1
kind: Pod
metadata:
  name: pod-sysctl
spec:
  containers:
$ oc apply -f pod-sysctl.yaml
Error from server (Forbidden): error when creating "pod-sysctl.yaml": pods "pod-sysctl" is forbidden: unable to validate against any security context constraint: [provider "trident-controller": Forbidden: not usable by user or serviceaccount, provider "anyuid": Forbidden: not usable by user or serviceaccount, pod.spec.securityContext.sysctls[0]: Forbidden: unsafe sysctl "net.ipv4.tcp_fin_timeout" is not allowed, provider "nonroot-v2": Forbidden: not usable by user or serviceaccount, provider "nonroot": Forbidden: not usable by user or serviceaccount, provider "hostmount-anyuid": Forbidden: not usable by user or serviceaccount, provider "machine-api-termination-handler": Forbidden: not usable by user or serviceaccount, provider "hostnetwork-v2": Forbidden: not usable by user or serviceaccount, provider "hostnetwork": Forbidden: not usable by user or serviceaccount, provider "hostaccess": Forbidden: not usable by user or serviceaccount, provider "trident-node-linux": Forbidden: not usable by user or serviceaccount, provider "node-exporter": Forbidden: not usable by user or serviceaccount, provider "privileged": Forbidden: not usable by user or serviceaccount]
Expected results:
The yaml with safe sysctls can be applied by a non-privileged user.
The specified sysctls are enabled in the pod.
Description of problem:
imagesStreams on hosted-clusters pointing to image on private registries are failing due to tls verification although the registry is correctly trusted. example: $ oc create namespace e2e-test $ oc --namespace=e2e-test tag virthost.ostest.test.metalkube.org:5000/localimages/local-test-image:e2e-7-registry-k8s-io-e2e-test-images-busybox-1-29-4-4zE9mRvED4RQoUxQ busybox:latest $ oc --namespace=e2e-test set image-lookup busybox stirabos@t14s:~$ oc get imagestream -n e2e-test NAME IMAGE REPOSITORY TAGS UPDATED busybox image-registry.openshift-image-registry.svc:5000/e2e-test/busybox latest stirabos@t14s:~$ oc get imagestream -n e2e-test busybox -o yaml apiVersion: image.openshift.io/v1 kind: ImageStream metadata: annotations: openshift.io/image.dockerRepositoryCheck: "2024-03-27T12:43:56Z" creationTimestamp: "2024-03-27T12:43:56Z" generation: 3 name: busybox namespace: e2e-test resourceVersion: "49021" uid: 847281e7-e307-4057-ab57-ccb7bfc49327 spec: lookupPolicy: local: true tags: - annotations: null from: kind: DockerImage name: virthost.ostest.test.metalkube.org:5000/localimages/local-test-image:e2e-7-registry-k8s-io-e2e-test-images-busybox-1-29-4-4zE9mRvED4RQoUxQ generation: 2 importPolicy: importMode: Legacy name: latest referencePolicy: type: Source status: dockerImageRepository: image-registry.openshift-image-registry.svc:5000/e2e-test/busybox tags: - conditions: - generation: 2 lastTransitionTime: "2024-03-27T12:43:56Z" message: 'Internal error occurred: virthost.ostest.test.metalkube.org:5000/localimages/local-test-image:e2e-7-registry-k8s-io-e2e-test-images-busybox-1-29-4-4zE9mRvED4RQoUxQ: Get "https://virthost.ostest.test.metalkube.org:5000/v2/": tls: failed to verify certificate: x509: certificate signed by unknown authority' reason: InternalError status: "False" type: ImportSuccess items: null tag: latest While image virthost.ostest.test.metalkube.org:5000/localimages/local-test-image:e2e-7-registry-k8s-io-e2e-test-images-busybox-1-29-4-4zE9mRvED4RQoUxQ can be properly consumed if directly used for a container on a pod on the same cluster. 
user-ca-bundle config map is properly propagated from hypershift: $ oc get configmap -n openshift-config user-ca-bundle NAME DATA AGE user-ca-bundle 1 3h32m $ openssl x509 -text -noout -in <(oc get cm -n openshift-config user-ca-bundle -o json | jq -r '.data["ca-bundle.crt"]') Certificate: Data: Version: 3 (0x2) Serial Number: 11:3f:15:23:97:ac:c2:d5:f6:54:06:1a:9a:22:f2:b5:bf:0c:5a:00 Signature Algorithm: sha256WithRSAEncryption Issuer: C = US, ST = NC, L = Raleigh, O = Test Company, OU = Testing, CN = test.metalkube.org Validity Not Before: Mar 27 08:28:07 2024 GMT Not After : Mar 27 08:28:07 2025 GMT Subject: C = US, ST = NC, L = Raleigh, O = Test Company, OU = Testing, CN = test.metalkube.org Subject Public Key Info: Public Key Algorithm: rsaEncryption Public-Key: (2048 bit) Modulus: 00:c1:49:1f:18:d2:12:49:da:76:05:36:3e:6b:1a: 82:a7:22:0d:be:f5:66:dc:97:44:c7:ca:31:4d:f3: 7f:0a:d3:de:df:f2:b6:23:f9:09:b1:7a:3f:19:cc: 22:c9:70:90:30:a7:eb:49:28:b6:d1:e0:5a:14:42: 02:93:c4:ac:cc:da:b1:5a:8f:9c:af:60:19:1a:e3: b1:34:c2:b6:2f:78:ec:9f:fe:38:75:91:0f:a6:09: 78:28:36:9e:ab:1c:0d:22:74:d5:52:fe:0a:fc:db: 5a:7c:30:9d:84:7d:f7:6a:46:fe:c5:6f:50:86:98: cc:35:1f:6c:b0:e6:21:fc:a5:87:da:81:2c:7b:e4: 4e:20:bb:35:cc:6c:81:db:b3:95:51:cf:ff:9f:ed: 00:78:28:1d:cd:41:1d:03:45:26:45:d4:36:98:bd: bf:5c:78:0f:c7:23:5c:44:5d:a6:ae:85:2b:99:25: ae:c0:73:b1:d2:87:64:3e:15:31:8e:63:dc:be:5c: ed:e3:fe:97:29:10:fb:5c:43:2f:3a:c2:e4:1a:af: 80:18:55:bc:40:0f:12:26:6b:f9:41:da:e2:a4:6b: fd:66:ae:bc:9c:e8:2a:5a:3b:e7:2b:fc:a6:f6:e2: 73:9b:79:ee:0c:86:97:ab:2e:cc:47:e7:1b:e5:be: 0c:9f Exponent: 65537 (0x10001) X509v3 extensions: X509v3 Basic Constraints: CA:TRUE, pathlen:0 X509v3 Subject Alternative Name: DNS:virthost.ostest.test.metalkube.org Signature Algorithm: sha256WithRSAEncryption Signature Value: 58:d2:da:f9:2a:c0:2d:7a:d9:9f:1f:97:e1:fd:36:a7:32:d3: ab:3f:15:cd:68:8e:be:7c:11:ec:5e:45:50:c4:ec:d8:d3:c5: 22:3c:79:5a:01:63:9e:5a:bd:02:0c:87:69:c6:ff:a2:38:05: 21:e4:96:78:40:db:52:c8:08:44:9a:96:6a:70:1e:1e:ae:74: e2:2d:fa:76:86:4d:06:b1:cf:d5:5c:94:40:17:5d:9f:84:2c: 8b:65:ca:48:2b:2d:00:3b:42:b9:3c:08:1b:c5:5d:d2:9c:e9: bc:df:9a:7c:db:30:07:be:33:2a:bb:2d:69:72:b8:dc:f4:0e: 62:08:49:93:d5:0f:db:35:98:18:df:e6:87:11:ce:65:5b:dc: 6f:f7:f0:1c:b0:23:40:1e:e3:45:17:04:1a:bc:d1:57:d7:0d: c8:26:6d:99:fe:28:52:fe:ba:6a:a1:b8:d1:d1:50:a9:fa:03: bb:b7:ad:0e:82:d2:e8:34:91:fa:b4:f9:81:d1:9b:6d:0f:a3: 8c:9d:c4:4a:1e:08:26:71:b9:1a:e8:49:96:0f:db:5c:76:db: ae:c7:6b:2e:ea:89:5d:7f:a3:ba:ea:7e:12:97:12:bc:1e:7f: 49:09:d4:08:a6:4a:34:73:51:9e:a2:9a:ec:2a:f7:fc:b5:5c: f8:20:95:ad This is probably a side effect of https://issues.redhat.com/browse/RFE-3093 - imagestream to trust CA added during the installation, that is also affecting imagestreams that requires a CA cert injected by hypershift during hosted-cluster creation in the disconnected use case.
Version-Release number of selected component (if applicable):
v4.14, v4.15, v4.16
How reproducible:
100%
Steps to Reproduce:
once connected to a disconnected hosted cluster, create an image stream pointing to an image on the internal mirror registry: 1. $ oc --namespace=e2e-test tag virthost.ostest.test.metalkube.org:5000/localimages/local-test-image:e2e-7-registry-k8s-io-e2e-test-images-busybox-1-29-4-4zE9mRvED4RQoUxQ busybox:latest 2. $ oc --namespace=e2e-test set image-lookup busybox 3. then check the image stream
Actual results:
status: dockerImageRepository: image-registry.openshift-image-registry.svc:5000/e2e-test/busybox tags: - conditions: - generation: 2 lastTransitionTime: "2024-03-27T12:43:56Z" message: 'Internal error occurred: virthost.ostest.test.metalkube.org:5000/localimages/local-test-image:e2e-7-registry-k8s-io-e2e-test-images-busybox-1-29-4-4zE9mRvED4RQoUxQ: Get "https://virthost.ostest.test.metalkube.org:5000/v2/": tls: failed to verify certificate: x509: certificate signed by unknown authority' although the same image can be directly consumed by a pod on the same cluster
Expected results:
status: dockerImageRepository: image-registry.openshift-image-registry.svc:5000/e2e-test/busybox tags: - conditions: - generation: 8 lastTransitionTime: "2024-03-27T13:30:46Z" message: dockerimage.image.openshift.io "virthost.ostest.test.metalkube.org:5000/localimages/local-test-image:e2e-7-registry-k8s-io-e2e-test-images-busybox-1-29-4-4zE9mRvED4RQoUxQ" not found reason: NotFound status: "False" type: ImportSuccess
Additional info:
This is probably a side effect of https://issues.redhat.com/browse/RFE-3093. Marking the imagestream as:
importPolicy:
  importMode: Legacy
  insecure: true
is enough to work around this.
Please review the following PR: https://github.com/openshift/gcp-pd-csi-driver-operator/pull/121
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
The MAPI GCP code hasn't changed since 4.14, so if it were a MAPI issue I'd expect to see other things breaking. The Installer doesn't seem to have changed either; however, there's a discrepancy between what's created by terraform:
https://github.com/openshift/installer/blame/916b3a305691dcbf1e47f01137e0ceee89ed0f59/data/data/gcp/post-bootstrap/main.tf#L14
https://github.com/openshift/installer/blob/916b3a305691dcbf1e47f01137e0ceee89ed0f59/data/data/gcp/cluster/network/lb-private.tf#L10
and the UPI instructions:
https://github.com/openshift/installer/blame/916b3a305691dcbf1e47f01137e0ceee89ed0f59/docs/user/gcp/install_upi.md#L560
https://github.com/openshift/installer/blob/916b3a305691dcbf1e47f01137e0ceee89ed0f59/upi/gcp/02_lb_int.py#L19
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Upstream is adding includes for tests that should be run in https://github.com/openshift/kubernetes/blob/master/test/e2e/e2e_test.go. Since this is a test file, we keep our import list in https://github.com/openshift/kubernetes/blob/master/openshift-hack/e2e/include.go.
Currently the two files are out of sync; we should ensure they are kept in sync.
Description of problem:
The builds installed in the hosted clusters are having issues git-cloning repositories from external URLs whose CAs are configured in the ca-bundle.crt from the trustedCA section:
spec:
  configuration:
    apiServer:
    [...]
    proxy:
      trustedCA:
        name: user-ca-bundle <---
In traditional OCP implementations, the *-global-ca configmap is installed in the same namespace as the build and the ca-bundle.crt is injected into this configmap. In hosted clusters the configmap is being created empty:
$ oc get cm -n <app-namespace> <build-name>-global-ca -oyaml
apiVersion: v1
data:
  ca-bundle.crt: ""
As mentioned, the user-ca-bundle has the certificates configured:
$ oc get cm -n openshift-config user-ca-bundle -oyaml
apiVersion: v1
data:
  ca-bundle.crt: |
    -----BEGIN CERTIFICATE----- <---
Version-Release number of selected component (if applicable):
How reproducible:
Easily
Steps to Reproduce:
1. Install hosted cluster with trustedCA configmap 2. Run a build in the hosted cluster 3. Check the global-ca configmap
Actual results:
global-ca is empty
Expected results:
global-ca injects the ca-bundle.crt properly
Additional info:
This fix contains the following changes, coming from the updated version of Kubernetes up to v1.30.3:
Changelog:
v1.30.3: https://github.com/kubernetes/kubernetes/blob/release-1.30/CHANGELOG/CHANGELOG-1.30.md#changelog-since-v1302
Description of problem:
An unexpected validation failure occurs when creating the agent ISO image if the RendezvousIP is a substring of the next-hop-address set for a worker node.
For example this configuration snippet in agent-config.yaml:
apiVersion: v1alpha1
kind: AgentConfig
metadata:
  name: agent-config
rendezvousIP: 7.162.6.1
hosts:
  ...
  - hostname: worker-0
    role: worker
    networkConfig:
      interfaces:
        - name: eth0
          type: Ethernet
          state: up
          ipv4:
            enabled: true
            address:
              - ip: 7.162.6.4
                prefix-length: 25
            dhcp: false
      routes:
        config:
          - destination: 0.0.0.0/0
            next-hop-address: 7.162.6.126
            next-hop-interface: eth0
            table-id: 254
Will result in the validation failure when creating the image:
FATAL failed to fetch Agent Installer ISO: failed to fetch dependency of "Agent Installer ISO": failed to fetch dependency of "Agent Installer Artifacts": failed to fetch dependency of "Agent Installer Ignition": failed to fetch dependency of "Agent Manifests": failed to fetch dependency of "NMState Config": failed to generate asset "Agent Hosts": invalid Hosts configuration: [Hosts[3].Host: Forbidden: Host worker-0 has role 'worker' and has the rendezvousIP assigned to it. The rendezvousIP must be assigned to a control plane host.
The problem is this check here https://github.com/openshift/installer/pull/6716/files#diff-fa305fe33630f77b65bd21cc9473b620f67cfd9ce35f7ddf24d03b26ec2ccfffR293
It's checking for the IP in the raw nmConfig. The problem is that the routes stanza is also included in the nmConfig, and the route is
next-hop-address: 7.162.6.126
So when the rendezvousIP is 7.162.6.1, that strings.Contains() check returns true and the validation fails.
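A minimal sketch of the failure mode and a safer comparison; the helper below is illustrative and assumes the host's interface addresses have already been parsed out of the NMState config rather than searched as a raw string:
```go
package main

import (
	"fmt"
	"net"
	"strings"
)

func main() {
	rendezvousIP := "7.162.6.1"
	rawNMConfig := "next-hop-address: 7.162.6.126" // excerpt of the routes stanza

	// Current behaviour: a substring search over the raw config also matches
	// route fields, because "7.162.6.126" contains "7.162.6.1".
	fmt.Println(strings.Contains(rawNMConfig, rendezvousIP)) // true (false positive)

	// Safer: compare against the parsed interface addresses only.
	fmt.Println(hostHasIP([]string{"7.162.6.4"}, rendezvousIP)) // false
}

func hostHasIP(interfaceAddresses []string, rendezvousIP string) bool {
	target := net.ParseIP(rendezvousIP)
	for _, addr := range interfaceAddresses {
		if ip := net.ParseIP(addr); ip != nil && ip.Equal(target) {
			return true
		}
	}
	return false
}
```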
4.16.0-0.nightly-2024-05-16-165920 aws-sdn-upgrade failures in 1791152612112863232
Undiagnosed panic detected in pod
{ pods/openshift-controller-manager_controller-manager-8d46bf695-cvdc6_controller-manager.log.gz:E0516 17:36:26.515398 1 runtime.go:79] Observed a panic: &runtime.TypeAssertionError{_interface:(*abi.Type)(0x3ca66c0), concrete:(*abi.Type)(0x3e9f720), asserted:(*abi.Type)(0x41dd660), missingMethod:""} (interface conversion: interface {} is cache.DeletedFinalStateUnknown, not *v1.Secret)
Please review the following PR: https://github.com/openshift/origin/pull/28827
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
openshift-install is creating user-defined tags (platform.aws.userTags) on AWS subnets in a BYO VPC (unmanaged VPC) deployment when using CAPA. The documentation [1] for userTags states:
> A map of keys and values that the installation program adds as tags to all resources that it creates.
So when the network (VPC and subnets) is managed by the user (BYO VPC), the installer should not create additional tags when they are provided in install-config.yaml. Investigating the CAPA codebase, the feature gate TagUnmanagedNetworkResources is enabled, and the subnet is propagating the userTags in the reconciliation loop [2].
[1] https://docs.openshift.com/container-platform/4.15/installing/installing_aws/installation-config-parameters-aws.html
[2] https://github.com/kubernetes-sigs/cluster-api-provider-aws/blob/main/pkg/cloud/services/network/subnets.go#L618
Version-Release number of selected component (if applicable):
4.16.0-ec.6-x86_64
How reproducible:
always
Steps to Reproduce:
- 1. create VPC and subnets using CloudFormation. Example template: https://github.com/openshift/installer/blob/master/upi/aws/cloudformation/01_vpc.yaml
- 2. create install-config with user-tags and subnet IDs to install the cluster
- 3. create the cluster with the feature gate for CAPI
```
featureSet: CustomNoUpgrade
featureGates:
- ClusterAPIInstall=true
metadata:
  name: "${CLUSTER_NAME}"
platform:
  aws:
    region: us-east-1
    subnets:
    - subnet-0165c70573a45651c
    - subnet-08540527fffeae3e9
    userTags:
      x-red-hat-clustertype: installer
      x-red-hat-managed: "true"
```
Actual results:
installer/CAPA is setting the user-defined tags in unmanaged subnets
Expected results:
- installer/CAPA does not create userTags on unmanaged subnets
- userTags is applied for the regular/standard workflow (managed VPC) with CAPA
Additional info:
- Impacting on SD/ROSA: https://redhat-internal.slack.com/archives/CCPBZPX7U/p1717588837289489
Ecosystem QE is preparing to create a release-4.16 branch within our test repos. Many packages are currently using v0.29 modules, which are not compatible with v0.28. It would be ideal if we could update the k8s modules to v0.29 to prevent us from needing to re-implement the assisted APIs.
Description of problem:
Installation fails when using the OpenShift Assisted Installer with a pull-secret password containing the `:` colon character.
Version-Release number of selected component (if applicable):
OpenShift 4.15
How reproducible:
Everytime
Steps to Reproduce:
1. Attempt to install using the Agent-based installer with a pull-secret that includes a colon character in the user/password section. The following snippet of code appears to be hit: https://github.com/openshift/assisted-service/blob/d3dd2897d1f6fe108353c9241234a724b30262c2/internal/cluster/validations/validations.go#L132-L135
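For illustration, a hedged sketch of the general parsing pattern that avoids this class of failure (this is not the actual assisted-service code, and the helper name is an assumption): a dockerconfigjson auth value is the base64 encoding of "user:password", and since the password itself may contain ':', the decoded string should only be split on the first colon:
```go
package main

import (
	"encoding/base64"
	"fmt"
	"strings"
)

// splitAuth is a hypothetical helper: it decodes a pull-secret "auth" value
// and splits it into user and password on the first colon only, so passwords
// containing ':' survive intact.
func splitAuth(auth string) (user, password string, err error) {
	decoded, err := base64.StdEncoding.DecodeString(auth)
	if err != nil {
		return "", "", err
	}
	parts := strings.SplitN(string(decoded), ":", 2)
	if len(parts) != 2 {
		return "", "", fmt.Errorf("auth value is not in user:password format")
	}
	return parts[0], parts[1], nil
}

func main() {
	auth := base64.StdEncoding.EncodeToString([]byte("user:pass:word"))
	u, p, _ := splitAuth(auth)
	fmt.Println(u, p) // user pass:word
}
```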
Actual results:
Install fails
Expected results:
Install succeeds
Additional info:
This is a clone of issue OCPBUGS-41637. The following is the description of the original issue:
—
Description of problem:
Console and OLM engineering and BU have decided to remove the Extension Catalog navigation item until the feature has matured more.
Please review the following PR: https://github.com/openshift/prometheus-operator/pull/289
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/csi-operator/pull/243
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
This is a clone of issue OCPBUGS-42000. The following is the description of the original issue:
—
Description of problem:
1. We are making 2 API calls to get the logs for the PipelineRuns. Instead, we can make use of the `results.tekton.dev/record` annotation and replace the `records` segment in the annotation's value with `logs` to get the logs of the PipelineRuns. 2. Tekton Results will return only the v1 version of PipelineRun and TaskRun from Pipelines 1.16, so the data type has to be v1 for 1.16 and v1beta1 for lower versions.
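As a rough illustration of the rewrite described in point 1 (the record path shown is an assumed example shape, not a value taken from a real cluster):
```go
package main

import (
	"fmt"
	"strings"
)

// logsPathFromRecord rewrites a Tekton Results record path into the
// corresponding logs path by replacing the "records" segment with "logs",
// avoiding a second lookup API call.
func logsPathFromRecord(recordPath string) string {
	return strings.Replace(recordPath, "/records/", "/logs/", 1)
}

func main() {
	// Assumed example value of the results.tekton.dev/record annotation.
	record := "my-namespace/results/8b2e1f30-0000-0000-0000-000000000000/records/5d9c6a10-0000-0000-0000-000000000000"
	fmt.Println(logsPathFromRecord(record))
}
```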
Description of problem:
When running a conformance suite against a hypershift cluster (for example, CNI conformance) the MonitorTests step fails because of missing files from the disruption monitor.
Version-Release number of selected component (if applicable):
4.15.13
How reproducible:
Consistent
Steps to Reproduce:
1. Create a hypershift cluster 2. Attempt to run an ose-tests suite. For example, the CNI conformance suite documented here: https://access.redhat.com/documentation/en-us/red_hat_software_certification/2024/html/red_hat_software_certification_workflow_guide/con_cni-certification_openshift-sw-cert-workflow-working-with-cloud-native-network-function#running-the-cni-tests_openshift-sw-cert-workflow-working-with-container-network-interface 3. Note errors in logs
Actual results:
found errors fetching in-cluster data: [failed to list files in disruption event folder on node ip-10-0-130-177.us-west-2.compute.internal: the server could not find the requested resource failed to list files in disruption event folder on node ip-10-0-152-10.us-west-2.compute.internal: the server could not find the requested resource] Failed to write events from in-cluster monitors, err: open /tmp/artifacts/junit/AdditionalEvents__in_cluster_disruption.json: no such file or directory
Expected results:
No errors
Additional info:
The first error can be avoided by creating the directory it's looking for on all nodes:
for node in $(oc get nodes -oname); do oc debug -n default $node -- chroot /host mkdir -p /var/log/disruption-data/monitor-events; done
However, I'm not sure if this directory not being created is due to the disruption monitor working properly on hypershift, or if this should be skipped on hypershift entirely. The second error is related to the ARTIFACT_DIR env var not being set locally, and can be avoided by creating a directory, setting that directory as the ARTIFACT_DIR, and then creating an empty "junit" dir inside of it. It looks like ARTIFACT_DIR defaults to a temporary directory if it's not set in the env, but the "junit" directory doesn't exist inside of it, so file creation in that non-existent directory fails.
Description of problem:
In OCP 4.17, kube-apiserver no longer gets a valid cloud config. Therefore the PersistentVolumeLabel admission plugin rejects in-tree GCE PD PVs that do not have the correct topology with `persistentvolumes \"gce-\" is forbidden: error querying GCE PD volume e2e-4d8656c6-d1d4-4245-9527-33e5ed18dd31: disk is not found`
In 4.16, kube-apiserver will not get a valid cloud config after it updates library-go with this PR.
How reproducible:
always
Steps to Reproduce:
1. Run e2e test "Multi-AZ Cluster Volumes should schedule pods in the same zones as statically provisioned PVs"
Due to upstream changes (https://github.com/kubernetes/kubernetes/pull/121485), KMSv1 is deprecated starting with k8s 1.29. HyperShift is actively using KMSv1. Migrating a cluster from KMSv1 to KMSv2 is tricky, so we need to at least make sure that new ROSA clusters can only enable KMSv2 while old ones remain on KMSv1.
We need to verify that new installations of ROSA that enable KMS encryption are running the KMSv2 API and that old clusters upgrading to a version where KMSv2 is available remain on KMSv1.
We need to make some minor updates to our tekton files per https://github.com/konflux-ci/build-definitions/blob/main/task/buildah/0.2/MIGRATION.md. Specifically -
- Removes the BASE_IMAGES_DIGESTS result. Please remove all the references to this result from your pipeline.
- Base images and their digests can be found in the SBOM for the output image.
- No longer writes the base_images_from_dockerfile file into the source workspace.
- Removes the BUILDER_IMAGE and DOCKER_AUTH params. Neither one did anything in the later releases of version 0.1. Please stop passing these params to the buildah task if you used to do so with version 0.1.
Description of problem:
Snyk is failing on some deps
Version-Release number of selected component (if applicable):
At least master/4.17 and 4.16
How reproducible:
100%
Steps to Reproduce:
Open a PR against the master or release-4.16 branch; Snyk will fail. Recent history shows that the test is just being overridden. We should stop overriding the test and either fix the deps or justify excluding them from Snyk.
Actual results:
This is a clone of issue OCPBUGS-43508. The following is the description of the original issue:
—
Description of problem:
These two tests have been flaking more often lately. The TestLeaderElection flake is partially (but not solely) connected to OCPBUGS-41903. TestOperandProxyConfiguration seems to fail in the teardown while waiting for other cluster operators to become available. Although these flakes aren't customer facing, they considerably slow development cycles (due to retests) and also consume more resources than they should (every retest runs on a new cluster), so we want to backport the fixes.
Version-Release number of selected component (if applicable):
4.18, 4.17, 4.16, 4.15, 4.14
How reproducible:
Sometimes
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Please review the following PR: https://github.com/openshift/machine-api-operator/pull/1251
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Test flake on 409 conflict during check
{Failed === RUN TestCreateCluster/Main/EnsureHostedClusterImmutability util.go:911: Expected <string>: Operation cannot be fulfilled on hostedclusters.hypershift.openshift.io "example-c88md": the object has been modified; please apply your changes to the latest version and try again to contain substring <string>: Services is immutable --- FAIL: TestCreateCluster/Main/EnsureHostedClusterImmutability (0.05s) }
Description of problem:
unit test jobs fail due to removed image https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_oc/1882/pull-ci-openshift-oc-master-unit/1856738411440771072
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
unit test job passes
Additional info:
Please review the following PR: https://github.com/openshift/kubernetes-autoscaler/pull/304
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Refactor the name to Dockerfile.ocp as a better, version-independent alternative
Please see the last 3 failures of this test in the link provided in the boilerplate text below:
Component Readiness has found a potential regression in [sig-cluster-lifecycle] Cluster completes upgrade.
Probability of significant regression: 100.00%
Sample (being evaluated) Release: 4.17
Start Time: 2024-06-29T00:00:00Z
End Time: 2024-07-05T23:59:59Z
Success Rate: 89.96%
Successes: 242
Failures: 27
Flakes: 0
Base (historical) Release: 4.16
Start Time: 2024-05-31T00:00:00Z
End Time: 2024-06-27T23:59:59Z
Success Rate: 100.00%
Successes: 858
Failures: 0
Flakes: 0
Description of problem:
When using `shortestPath: true`, the number of mirrored images is far larger than required.
Version-Release number of selected component (if applicable):
./oc-mirror version --output=yaml clientVersion: buildDate: "2024-05-08T04:26:09Z" compiler: gc gitCommit: 9e77c1944f70fed0a85e5051c8f3efdfb09add70 gitTreeState: clean gitVersion: 4.16.0-202405080039.p0.g9e77c19.assembly.stream.el9-9e77c19 goVersion: go1.21.9 (Red Hat 1.21.9-1.el9_4) X:strictfipsruntime major: "" minor: "" platform: linux/amd64
How reproducible:
always
Steps to Reproduce:
1) Use the following ISC to do mirror2mirror for v2:
kind: ImageSetConfiguration
apiVersion: mirror.openshift.io/v2alpha1
archiveSize: 8
mirror:
  platform:
    channels:
    - name: stable-4.15
      type: ocp
      minVersion: '4.15.11'
      maxVersion: '4.15.11'
      shortestPath: true
    graph: true
`oc-mirror --config config.yaml --v2 docker://xxx.com:5000/m2m --workspace file:///app1/0416/clid20/`
`oc-mirror --config config.yaml docker://localhost:5000 --dest-use-http`
2) Without the shortest path:
kind: ImageSetConfiguration
apiVersion: mirror.openshift.io/v2alpha1
mirror:
  platform:
    channels:
    - name: stable-4.15
Actual results:
1) It counted 577 images to mirror oc-mirror --config config-11.yaml file://outsizecheck --v2 2024/05/11 03:57:21 [WARN] : ⚠️ --v2 flag identified, flow redirected to the oc-mirror v2 version. This is Tech Preview, it is still under development and it is not production ready. 2024/05/11 03:57:21 [INFO] : 👋 Hello, welcome to oc-mirror 2024/05/11 03:57:21 [INFO] : ⚙️ setting up the environment for you... 2024/05/11 03:57:21 [INFO] : 🔀 workflow mode: mirrorToDisk 2024/05/11 03:57:21 [INFO] : 🕵️ going to discover the necessary images... 2024/05/11 03:57:21 [INFO] : 🔍 collecting release images... 2024/05/11 03:57:28 [INFO] : 🔍 collecting operator images... 2024/05/11 03:57:28 [INFO] : 🔍 collecting additional images... 2024/05/11 03:57:28 [INFO] : 🚀 Start copying the images... 2024/05/11 03:57:28 [INFO] : === Overall Progress - copying image 1 / 577 === 2024/05/11 03:57:28 [INFO] : copying release image 1 / 577 2) without the shortest path , only counted 192 images to mirror. oc-mirror --config config-32547.yaml file://outsizecheck --v2 2024/05/11 03:55:12 [WARN] : ⚠️ --v2 flag identified, flow redirected to the oc-mirror v2 version. This is Tech Preview, it is still under development and it is not production ready. 2024/05/11 03:55:12 [INFO] : 👋 Hello, welcome to oc-mirror 2024/05/11 03:55:12 [INFO] : ⚙️ setting up the environment for you... 2024/05/11 03:55:12 [INFO] : 🔀 workflow mode: mirrorToDisk 2024/05/11 03:55:12 [INFO] : 🕵️ going to discover the necessary images... 2024/05/11 03:55:12 [INFO] : 🔍 collecting release images... 2024/05/11 03:55:12 [INFO] : detected minimum version as 4.15.11 2024/05/11 03:55:12 [INFO] : detected minimum version as 4.15.11 2024/05/11 03:55:18 [INFO] : 🔍 collecting operator images... 2024/05/11 03:56:09 [INFO] : 🔍 collecting additional images... 2024/05/11 03:56:09 [INFO] : 🚀 Start copying the images... 2024/05/11 03:56:09 [INFO] : === Overall Progress - copying image 1 / 266 === 2024/05/11 03:56:09 [INFO] : copying release image 1 / 192
Expected results:
1) If there is only one OCP payload, the number of images that need to be mirrored should be the same.
Additional information:
[sig-arch] events should not repeat pathologically for ns/openshift-etcd-operator
{ 1 events happened too frequently event happened 25 times, something is wrong: namespace/openshift-etcd-operator deployment/etcd-operator hmsg/e2df46f507 - reason/RequiredInstallerResourcesMissing configmaps: etcd-all-bundles-8 (02:05:52Z) result=reject }
Sample failures:
It's hitting both of these jobs linked above, but intermittently, 20-40% of the time on this first payload with the regression.
Looks to be this PR: https://github.com/openshift/cluster-etcd-operator/pull/1268
This is a clone of issue OCPBUGS-39285. The following is the description of the original issue:
—
Description of problem: https://github.com/openshift/installer/pull/7727 changed the order of some playbooks and we're expected to run the network.yaml playbook before the metadata.json file is created. This isn't a problem with newer versions of Ansible, which will happily ignore missing var_files; however, it is a problem with older Ansible versions, which fail with:
[cloud-user@installer-host ~]$ ansible-playbook -i "/home/cloud-user/ostest/inventory.yaml" "/home/cloud-user/ostest/network.yaml" PLAY [localhost] ***************************************************************************************************************************************************************************************************************************** ERROR! vars file metadata.json was not found Could not find file on the Ansible Controller. If you are using a module and expect the file to exist on the remote, see the remote_src option
Please review the following PR: https://github.com/openshift/cluster-openshift-apiserver-operator/pull/579
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
CCMs attempt direct connections when the management cluster on which the HCP runs is proxied and does not allow direct outbound connections.
Example from the AWS CCM
I0731 21:46:33.948466 1 event.go:389] "Event occurred" object="openshift-ingress/router-default" fieldPath="" kind="Service" apiVersion="v1" type="Warning" reason="SyncLoadBalancerFailed" message="Error syncing load balancer: failed to ensure load balancer: error listing AWS instances: \"WebIdentityErr: failed to retrieve credentials\\ncaused by: RequestError: send request failed\\ncaused by: Post \\\"https://sts.us-east-1.amazonaws.com/\\\": dial tcp 72.21.206.96:443: i/o timeout\""
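For illustration, this is the kind of proxy environment the CCM container would need so that calls like the STS request above go through the proxy. The deployment name, namespace, and proxy URL are placeholders, and manual edits like this are normally reconciled away, so this only sketches what an operator-level fix has to achieve:
  oc -n <hcp-namespace> set env deployment/aws-cloud-controller-manager \
    HTTPS_PROXY=http://proxy.mgmt.example:3128 \
    HTTP_PROXY=http://proxy.mgmt.example:3128 \
    NO_PROXY=.cluster.local,.svc,10.0.0.0/16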
Description of problem:
Sometimes deleting the bootstrap SSH rule during bootstrap destroy can time out after 5 minutes, failing the installation.
Version-Release number of selected component (if applicable):
4.16+ with capi/aws
How reproducible:
Intermittent
Steps to Reproduce:
1. 2. 3.
Actual results:
level=info msg=Waiting up to 5m0s (until 2:31AM UTC) for bootstrap SSH rule to be destroyed... level=fatal msg=error destroying bootstrap resources failed during the destroy bootstrap hook: failed to remove bootstrap SSH rule: bootstrap ssh rule was not removed within 5m0s: timed out waiting for the condition
Expected results:
The rule is deleted successfully and in a timely manner.
Additional info:
This is probably happening because we are changing the AWSCluster object, which causes capi/capa to trigger a large reconciliation of the resources. We should try to delete the rule via the AWS SDK instead.
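As a sketch of that suggestion, the equivalent AWS CLI calls for removing just the bootstrap SSH ingress rule look like this (security group ID, port, and CIDR are placeholders; the installer would do the same through the AWS SDK rather than mutating the AWSCluster object):
  # inspect the rules on the control-plane security group (placeholder ID)
  aws ec2 describe-security-group-rules --filters Name=group-id,Values=sg-0123456789abcdef0
  # revoke only the bootstrap SSH rule
  aws ec2 revoke-security-group-ingress \
    --group-id sg-0123456789abcdef0 \
    --protocol tcp --port 22 --cidr 0.0.0.0/0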
Description of problem:
IDMS is set on the HostedCluster and reflected in the respective CR in-cluster. Customers can create, update, and delete these in-cluster CRs today, but in-cluster IDMS has no impact.
Version-Release number of selected component (if applicable):
4.14+
How reproducible:
100%
Steps to Reproduce:
1. Create an HCP 2. Create an IDMS in the hosted cluster (see the sketch below) 3. Observe it does nothing
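For reference, the in-cluster object created in step 2 is an ImageDigestMirrorSet shaped like this sketch (registry names are placeholders):
  apiVersion: config.openshift.io/v1
  kind: ImageDigestMirrorSet
  metadata:
    name: example-idms
  spec:
    imageDigestMirrors:
    - source: registry.example.com/team/app      # placeholder source registry
      mirrors:
      - mirror.example.com/team/app              # placeholder mirror registry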
Actual results:
IDMS doesn't change anything if manipulated in the data plane.
Expected results:
Either IDMS updates in the data plane take effect, or such updates are blocked.
Additional info:
Please review the following PR: https://github.com/openshift/csi-external-attacher/pull/75
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
Given that we create a new pool, enable OCB in this pool, remove the pool and the MachineOSConfig resource, and then create another new pool to enable OCB again, the controller pod panics.
Version-Release number of selected component (if applicable):
pre-merge https://github.com/openshift/machine-config-operator/pull/4327
How reproducible:
Always
Steps to Reproduce:
1. Create a new infra MCP apiVersion: machineconfiguration.openshift.io/v1 kind: MachineConfigPool metadata: name: infra spec: machineConfigSelector: matchExpressions: - {key: machineconfiguration.openshift.io/role, operator: In, values: [worker,infra]} nodeSelector: matchLabels: node-role.kubernetes.io/infra: "" 2. Create a MachineOSConfig for infra pool oc create -f - << EOF apiVersion: machineconfiguration.openshift.io/v1alpha1 kind: MachineOSConfig metadata: name: infra spec: machineConfigPool: name: infra buildInputs: imageBuilder: imageBuilderType: PodImageBuilder baseImagePullSecret: name: $(oc get secret -n openshift-config pull-secret -o json | jq "del(.metadata.namespace, .metadata.creationTimestamp, .metadata.resourceVersion, .metadata.uid, .metadata.name)" | jq '.metadata.name="pull-copy"' | oc -n openshift-machine-config-operator create -f - &> /dev/null; echo -n "pull-copy") renderedImagePushSecret: name: $(oc get -n openshift-machine-config-operator sa builder -ojsonpath='{.secrets[0].name}') renderedImagePushspec: "image-registry.openshift-image-registry.svc:5000/openshift-machine-config-operator/ocb-image:latest" EOF 3. When the build is finished, remove the MachineOSConfig and the pool oc delete machineosconfig infra oc delete mcp infra 4. Create a new infra1 pool apiVersion: machineconfiguration.openshift.io/v1 kind: MachineConfigPool metadata: name: infra1 spec: machineConfigSelector: matchExpressions: - {key: machineconfiguration.openshift.io/role, operator: In, values: [worker,infra1]} nodeSelector: matchLabels: node-role.kubernetes.io/infra1: "" 5. Create a new machineosconfig for infra1 pool oc create -f - << EOF apiVersion: machineconfiguration.openshift.io/v1alpha1 kind: MachineOSConfig metadata: name: infra1 spec: machineConfigPool: name: infra1 buildInputs: imageBuilder: imageBuilderType: PodImageBuilder baseImagePullSecret: name: $(oc get secret -n openshift-config pull-secret -o json | jq "del(.metadata.namespace, .metadata.creationTimestamp, .metadata.resourceVersion, .metadata.uid, .metadata.name)" | jq '.metadata.name="pull-copy"' | oc -n openshift-machine-config-operator create -f - &> /dev/null; echo -n "pull-copy") renderedImagePushSecret: name: $(oc get -n openshift-machine-config-operator sa builder -ojsonpath='{.secrets[0].name}') renderedImagePushspec: "image-registry.openshift-image-registry.svc:5000/openshift-machine-config-operator/ocb-image:latest" containerFile: - containerfileArch: noarch content: |- RUN echo 'test image' > /etc/test-image.file EOF
Actual results:
The MCO controller pod panics (in updateMachineOSBuild): E0430 11:21:03.779078 1 runtime.go:79] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference) goroutine 265 [running]: k8s.io/apimachinery/pkg/util/runtime.logPanic({0x3547bc0?, 0x53ebb20}) /go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:75 +0x85 k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0xc00035e000?}) /go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:49 +0x6b panic({0x3547bc0?, 0x53ebb20?}) /usr/lib/golang/src/runtime/panic.go:914 +0x21f github.com/openshift/api/machineconfiguration/v1.(*MachineConfigPool).GetNamespace(0x53f6200?) <autogenerated>:1 +0x9 k8s.io/client-go/tools/cache.MetaObjectToName({0x3e2a8f8, 0x0}) /go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/client-go/tools/cache/store.go:131 +0x25 k8s.io/client-go/tools/cache.ObjectToName({0x3902740?, 0x0?}) /go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/client-go/tools/cache/store.go:126 +0x74 k8s.io/client-go/tools/cache.MetaNamespaceKeyFunc({0x3902740?, 0x0?}) /go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/client-go/tools/cache/store.go:112 +0x3e k8s.io/client-go/tools/cache.DeletionHandlingMetaNamespaceKeyFunc({0x3902740?, 0x0?}) /go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/client-go/tools/cache/controller.go:336 +0x3b github.com/openshift/machine-config-operator/pkg/controller/node.(*Controller).enqueueAfter(0xc0007097a0, 0x0, 0x0?) /go/src/github.com/openshift/machine-config-operator/pkg/controller/node/node_controller.go:761 +0x33 github.com/openshift/machine-config-operator/pkg/controller/node.(*Controller).enqueueDefault(...) /go/src/github.com/openshift/machine-config-operator/pkg/controller/node/node_controller.go:772 github.com/openshift/machine-config-operator/pkg/controller/node.(*Controller).updateMachineOSBuild(0xc0007097a0, {0xc001c37800?, 0xc000029678?}, {0x3904000?, 0xc0028361a0}) /go/src/github.com/openshift/machine-config-operator/pkg/controller/node/node_controller.go:395 +0xd1 k8s.io/client-go/tools/cache.ResourceEventHandlerFuncs.OnUpdate(...) /go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/client-go/tools/cache/controller.go:246 k8s.io/client-go/tools/cache.(*processorListener).run.func1() /go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/client-go/tools/cache/shared_informer.go:970 +0xea k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0x30?) /go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:226 +0x33 k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc0005e5738?, {0x3de6020, 0xc0008fe780}, 0x1, 0xc0000ac720) /go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:227 +0xaf k8s.io/apimachinery/pkg/util/wait.JitterUntil(0x6974616761706f72?, 0x3b9aca00, 0x0, 0x69?, 0xc0005e5788?) /go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:204 +0x7f k8s.io/apimachinery/pkg/util/wait.Until(...) 
/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:161 k8s.io/client-go/tools/cache.(*processorListener).run(0xc000b97c20) /go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/client-go/tools/cache/shared_informer.go:966 +0x69 k8s.io/apimachinery/pkg/util/wait.(*Group).Start.func1() /go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:72 +0x4f created by k8s.io/apimachinery/pkg/util/wait.(*Group).Start in goroutine 248 /go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:70 +0x73 panic: runtime error: invalid memory address or nil pointer dereference [recovered] panic: runtime error: invalid memory address or nil pointer dereference [signal SIGSEGV: segmentation violation code=0x1 addr=0x40 pc=0x210a6e9] When the controller pod is restarted, it panics again, but in a different function (addMachineOSBuild): E0430 11:26:54.753689 1 runtime.go:79] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference) goroutine 97 [running]: k8s.io/apimachinery/pkg/util/runtime.logPanic({0x3547bc0?, 0x53ebb20}) /go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:75 +0x85 k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0x15555555aa?}) /go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:49 +0x6b panic({0x3547bc0?, 0x53ebb20?}) /usr/lib/golang/src/runtime/panic.go:914 +0x21f github.com/openshift/api/machineconfiguration/v1.(*MachineConfigPool).GetNamespace(0x53f6200?) <autogenerated>:1 +0x9 k8s.io/client-go/tools/cache.MetaObjectToName({0x3e2a8f8, 0x0}) /go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/client-go/tools/cache/store.go:131 +0x25 k8s.io/client-go/tools/cache.ObjectToName({0x3902740?, 0x0?}) /go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/client-go/tools/cache/store.go:126 +0x74 k8s.io/client-go/tools/cache.MetaNamespaceKeyFunc({0x3902740?, 0x0?}) /go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/client-go/tools/cache/store.go:112 +0x3e k8s.io/client-go/tools/cache.DeletionHandlingMetaNamespaceKeyFunc({0x3902740?, 0x0?}) /go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/client-go/tools/cache/controller.go:336 +0x3b github.com/openshift/machine-config-operator/pkg/controller/node.(*Controller).enqueueAfter(0xc000899560, 0x0, 0x0?) /go/src/github.com/openshift/machine-config-operator/pkg/controller/node/node_controller.go:761 +0x33 github.com/openshift/machine-config-operator/pkg/controller/node.(*Controller).enqueueDefault(...) /go/src/github.com/openshift/machine-config-operator/pkg/controller/node/node_controller.go:772 github.com/openshift/machine-config-operator/pkg/controller/node.(*Controller).addMachineOSBuild(0xc000899560, {0x3904000?, 0xc0006a8b60}) /go/src/github.com/openshift/machine-config-operator/pkg/controller/node/node_controller.go:386 +0xc5 k8s.io/client-go/tools/cache.ResourceEventHandlerFuncs.OnAdd(...) /go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/client-go/tools/cache/controller.go:239 k8s.io/client-go/tools/cache.(*processorListener).run.func1() /go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/client-go/tools/cache/shared_informer.go:972 +0x13e k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0x30?) 
/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:226 +0x33 k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc00066bf38?, {0x3de6020, 0xc0008f8b40}, 0x1, 0xc000c2ea20) /go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:227 +0xaf k8s.io/apimachinery/pkg/util/wait.JitterUntil(0x0?, 0x3b9aca00, 0x0, 0x0?, 0xc00066bf88?) /go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:204 +0x7f k8s.io/apimachinery/pkg/util/wait.Until(...) /go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:161 k8s.io/client-go/tools/cache.(*processorListener).run(0xc000ba6240) /go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/client-go/tools/cache/shared_informer.go:966 +0x69 k8s.io/apimachinery/pkg/util/wait.(*Group).Start.func1() /go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:72 +0x4f created by k8s.io/apimachinery/pkg/util/wait.(*Group).Start in goroutine 43 /go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:70 +0x73 panic: runtime error: invalid memory address or nil pointer dereference [recovered] panic: runtime error: invalid memory address or nil pointer dereference [signal SIGSEGV: segmentation violation code=0x1 addr=0x40 pc=0x210a6e9]
Expected results:
No panic should happen. Errors should be handled gracefully.
Additional info:
In order to recover from this panic, we need to manually delete the MachineOSBuild resources that are related to the pool that does not exist anymore.
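A sketch of that manual cleanup with oc (the build name is a placeholder; pick the MachineOSBuild objects that reference the deleted pool):
  # list the MachineOSBuild objects left behind
  oc get machineosbuild
  # delete the ones tied to the pool that no longer exists
  oc delete machineosbuild <build-for-deleted-pool>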
Description of problem:
Found in QE CI case failure https://issues.redhat.com/browse/OCPQE-22045: 4.16 HCP oauth-openshift panics when curl'ed anonymously (this is not seen in OCP 4.16 or HCP 4.15).
Version-Release number of selected component (if applicable):
HCP 4.16 4.16.0-0.nightly-2024-05-14-165654
How reproducible:
Always
Steps to Reproduce:
1. $ export KUBECONFIG=HCP.kubeconfig $ oc get --raw=/.well-known/oauth-authorization-server | jq -r .issuer https://oauth-clusters-hypershift-ci-283235.apps.xxxx.com:443 2. Panics when anonymously curl'ed: $ curl -k "https://oauth-clusters-hypershift-ci-283235.apps.xxxx.com:443/oauth/authorize?response_type=token&client_id=openshift-challenging-client" This request caused apiserver to panic. Look in the logs for details. 3. Check logs. $ oc --kubeconfig=/home/xxia/my/env/hypershift-management/mjoseph-hyp-283235-416/kubeconfig -n clusters-hypershift-ci-283235 get pod | grep oauth-openshift oauth-openshift-55c6967667-9bxz9 2/2 Running 0 6h23m oauth-openshift-55c6967667-l55fh 2/2 Running 0 6h22m oauth-openshift-55c6967667-ntc6l 2/2 Running 0 6h23m $ for i in oauth-openshift-55c6967667-9bxz9 oauth-openshift-55c6967667-l55fh oauth-openshift-55c6967667-ntc6l; do oc logs --timestamps --kubeconfig=/home/xxia/my/env/hypershift-management/mjoseph-hyp-283235-416/kubeconfig -n clusters-hypershift-ci-283235 $i > logs/hypershift-management/mjoseph-hyp-283235-416/$i.log; done $ grep -il panic *.log oauth-openshift-55c6967667-ntc6l.log $ cat oauth-openshift-55c6967667-ntc6l.log 2024-05-15T03:43:59.769424528Z I0515 03:43:59.769303 1 secure_serving.go:57] Forcing use of http/1.1 only 2024-05-15T03:43:59.772754182Z I0515 03:43:59.772725 1 requestheader_controller.go:169] Starting RequestHeaderAuthRequestController 2024-05-15T03:43:59.772803132Z I0515 03:43:59.772782 1 configmap_cafile_content.go:202] "Starting controller" name="client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file" 2024-05-15T03:43:59.772841518Z I0515 03:43:59.772834 1 shared_informer.go:311] Waiting for caches to sync for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file 2024-05-15T03:43:59.772870498Z I0515 03:43:59.772787 1 shared_informer.go:311] Waiting for caches to sync for RequestHeaderAuthRequestController 2024-05-15T03:43:59.772982605Z I0515 03:43:59.772736 1 configmap_cafile_content.go:202] "Starting controller" name="client-ca::kube-system::extension-apiserver-authentication::client-ca-file" 2024-05-15T03:43:59.773009678Z I0515 03:43:59.773002 1 shared_informer.go:311] Waiting for caches to sync for client-ca::kube-system::extension-apiserver-authentication::client-ca-file 2024-05-15T03:43:59.773214896Z I0515 03:43:59.773194 1 dynamic_serving_content.go:132] "Starting controller" name="serving-cert::/etc/kubernetes/certs/serving-cert/tls.crt::/etc/kubernetes/certs/serving-cert/tls.key" 2024-05-15T03:43:59.773939655Z I0515 03:43:59.773923 1 secure_serving.go:213] Serving securely on [::]:6443 2024-05-15T03:43:59.773965659Z I0515 03:43:59.773952 1 tlsconfig.go:240] "Starting DynamicServingCertificateController" 2024-05-15T03:43:59.873008524Z I0515 03:43:59.872970 1 shared_informer.go:318] Caches are synced for RequestHeaderAuthRequestController 2024-05-15T03:43:59.873078108Z I0515 03:43:59.873021 1 shared_informer.go:318] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file 2024-05-15T03:43:59.873120163Z I0515 03:43:59.873032 1 shared_informer.go:318] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::client-ca-file 2024-05-15T09:25:25.782066400Z E0515 09:25:25.782026 1 runtime.go:77] Observed a panic: runtime error: invalid memory address or nil pointer dereference 2024-05-15T09:25:25.782066400Z goroutine 8662 [running]: 2024-05-15T09:25:25.782066400Z 
k8s.io/apiserver/pkg/server/filters.(*timeoutHandler).ServeHTTP.func1.1() 2024-05-15T09:25:25.782066400Z k8s.io/apiserver@v0.29.2/pkg/server/filters/timeout.go:110 +0x9c 2024-05-15T09:25:25.782066400Z panic({0x2115f60?, 0x3c45ec0?}) 2024-05-15T09:25:25.782066400Z runtime/panic.go:914 +0x21f 2024-05-15T09:25:25.782066400Z github.com/openshift/oauth-server/pkg/oauth/handlers.(*unionAuthenticationHandler).AuthenticationNeeded(0xc0008a90e0, {0x7f2a74268bd8?, 0xc000607760?}, {0x293c340?, 0xc0007d1ef0}, 0xc0007a3700) 2024-05-15T09:25:25.782066400Z github.com/openshift/oauth-server/pkg/oauth/handlers/default_auth_handler.go:122 +0xce1 2024-05-15T09:25:25.782066400Z github.com/openshift/oauth-server/pkg/oauth/handlers.(*authorizeAuthenticator).HandleAuthorize(0xc0008a9110, 0xc0007b06c0, 0x7?, {0x293c340, 0xc0007d1ef0}) 2024-05-15T09:25:25.782066400Z github.com/openshift/oauth-server/pkg/oauth/handlers/authenticator.go:54 +0x21d 2024-05-15T09:25:25.782066400Z github.com/openshift/oauth-server/pkg/osinserver.AuthorizeHandlers.HandleAuthorize({0xc0008a91a0?, 0x3, 0x772d66?}, 0x22ef8e0?, 0xc0007b2420?, {0x293c340, 0xc0007d1ef0}) 2024-05-15T09:25:25.782066400Z github.com/openshift/oauth-server/pkg/osinserver/interfaces.go:29 +0x95 2024-05-15T09:25:25.782066400Z github.com/openshift/oauth-server/pkg/osinserver.(*osinServer).handleAuthorize(0xc0004a54c0, {0x293c340, 0xc0007d1ef0}, 0xd?) 2024-05-15T09:25:25.782066400Z github.com/openshift/oauth-server/pkg/osinserver/osinserver.go:77 +0x25e 2024-05-15T09:25:25.782066400Z net/http.HandlerFunc.ServeHTTP(0x0?, {0x293c340?, 0xc0007d1ef0?}, 0x410acc?) 2024-05-15T09:25:25.782066400Z net/http/server.go:2136 +0x29 2024-05-15T09:25:25.782066400Z net/http.(*ServeMux).ServeHTTP(0x2390e60?, {0x293c340, 0xc0007d1ef0}, 0xc0007a3700) 2024-05-15T09:25:25.782066400Z net/http/server.go:2514 +0x142 2024-05-15T09:25:25.782066400Z github.com/openshift/oauth-server/pkg/oauthserver.(*OAuthServerConfig).buildHandlerChainForOAuth.WithRestoreOAuthHeaders.func1({0x293c340, 0xc0007d1ef0}, 0xc0007a3700) 2024-05-15T09:25:25.782066400Z github.com/openshift/oauth-server/pkg/server/headers/oauthbasic.go:57 +0x1ca 2024-05-15T09:25:25.782066400Z net/http.HandlerFunc.ServeHTTP(0x235fda0?, {0x293c340?, 0xc0007d1ef0?}, 0x291ef40?) 2024-05-15T09:25:25.782066400Z net/http/server.go:2136 +0x29 2024-05-15T09:25:25.782066400Z k8s.io/apiserver/pkg/server.DefaultBuildHandlerChain.TrackCompleted.trackCompleted.func21({0x293c340?, 0xc0007d1ef0}, 0xc0007a3700) 2024-05-15T09:25:25.782066400Z k8s.io/apiserver@v0.29.2/pkg/endpoints/filterlatency/filterlatency.go:110 +0x177 2024-05-15T09:25:25.782066400Z net/http.HandlerFunc.ServeHTTP(0x29490d0?, {0x293c340?, 0xc0007d1ef0?}, 0x4?) 2024-05-15T09:25:25.782066400Z net/http/server.go:2136 +0x29 2024-05-15T09:25:25.782066400Z k8s.io/apiserver/pkg/endpoints/filters.withAuthorization.func1({0x293c340, 0xc0007d1ef0}, 0xc0007a3700) 2024-05-15T09:25:25.782066400Z k8s.io/apiserver@v0.29.2/pkg/endpoints/filters/authorization.go:78 +0x639 2024-05-15T09:25:25.782066400Z net/http.HandlerFunc.ServeHTTP(0xc1893dc16e2d2585?, {0x293c340?, 0xc0007d1ef0?}, 0xc0007fabb8?) 
2024-05-15T09:25:25.782066400Z net/http/server.go:2136 +0x29 2024-05-15T09:25:25.782066400Z k8s.io/apiserver/pkg/endpoints/filterlatency.trackStarted.func1({0x293c340, 0xc0007d1ef0}, 0xc0007a3700) 2024-05-15T09:25:25.782066400Z k8s.io/apiserver@v0.29.2/pkg/endpoints/filterlatency/filterlatency.go:84 +0x192 2024-05-15T09:25:25.782066400Z net/http.HandlerFunc.ServeHTTP(0x3c5b920?, {0x293c340?, 0xc0007d1ef0?}, 0x3?) 2024-05-15T09:25:25.782066400Z net/http/server.go:2136 +0x29 2024-05-15T09:25:25.782066400Z k8s.io/apiserver/pkg/server/filters.WithMaxInFlightLimit.func1({0x293c340, 0xc0007d1ef0}, 0xc0007a3700) 2024-05-15T09:25:25.782066400Z k8s.io/apiserver@v0.29.2/pkg/server/filters/maxinflight.go:196 +0x262 2024-05-15T09:25:25.782066400Z net/http.HandlerFunc.ServeHTTP(0x235fda0?, {0x293c340?, 0xc0007d1ef0?}, 0x291ef40?) 2024-05-15T09:25:25.782066400Z net/http/server.go:2136 +0x29 2024-05-15T09:25:25.782066400Z k8s.io/apiserver/pkg/server.DefaultBuildHandlerChain.TrackCompleted.trackCompleted.func23({0x293c340?, 0xc0007d1ef0}, 0xc0007a3700) 2024-05-15T09:25:25.782066400Z k8s.io/apiserver@v0.29.2/pkg/endpoints/filterlatency/filterlatency.go:110 +0x177 2024-05-15T09:25:25.782066400Z net/http.HandlerFunc.ServeHTTP(0x7f2a74226390?, {0x293c340?, 0xc0007d1ef0?}, 0xc0007953c8?) 2024-05-15T09:25:25.782066400Z net/http/server.go:2136 +0x29 2024-05-15T09:25:25.782066400Z k8s.io/apiserver/pkg/server.DefaultBuildHandlerChain.WithImpersonation.func4({0x293c340, 0xc0007d1ef0}, 0xc0007a3700) 2024-05-15T09:25:25.782066400Z k8s.io/apiserver@v0.29.2/pkg/endpoints/filters/impersonation.go:50 +0x1c3 2024-05-15T09:25:25.782066400Z net/http.HandlerFunc.ServeHTTP(0xcd1160?, {0x293c340?, 0xc0007d1ef0?}, 0x0?) 2024-05-15T09:25:25.782066400Z net/http/server.go:2136 +0x29 2024-05-15T09:25:25.782066400Z k8s.io/apiserver/pkg/endpoints/filterlatency.trackStarted.func1({0x293c340, 0xc0007d1ef0}, 0xc0007a3700) 2024-05-15T09:25:25.782066400Z k8s.io/apiserver@v0.29.2/pkg/endpoints/filterlatency/filterlatency.go:84 +0x192 2024-05-15T09:25:25.782066400Z net/http.HandlerFunc.ServeHTTP(0x235fda0?, {0x293c340?, 0xc0007d1ef0?}, 0x291ef40?) 2024-05-15T09:25:25.782066400Z net/http/server.go:2136 +0x29 2024-05-15T09:25:25.782066400Z k8s.io/apiserver/pkg/server.DefaultBuildHandlerChain.TrackCompleted.trackCompleted.func24({0x293c340?, 0xc0007d1ef0}, 0xc0007a3700) 2024-05-15T09:25:25.782066400Z k8s.io/apiserver@v0.29.2/pkg/endpoints/filterlatency/filterlatency.go:110 +0x177 2024-05-15T09:25:25.782066400Z net/http.HandlerFunc.ServeHTTP(0xcd1160?, {0x293c340?, 0xc0007d1ef0?}, 0x0?) 2024-05-15T09:25:25.782066400Z net/http/server.go:2136 +0x29 2024-05-15T09:25:25.782066400Z k8s.io/apiserver/pkg/endpoints/filterlatency.trackStarted.func1({0x293c340, 0xc0007d1ef0}, 0xc0007a3700) 2024-05-15T09:25:25.782066400Z k8s.io/apiserver@v0.29.2/pkg/endpoints/filterlatency/filterlatency.go:84 +0x192 2024-05-15T09:25:25.782066400Z net/http.HandlerFunc.ServeHTTP(0x235fda0?, {0x293c340?, 0xc0007d1ef0?}, 0x291ef40?) 2024-05-15T09:25:25.782066400Z net/http/server.go:2136 +0x29 2024-05-15T09:25:25.782066400Z k8s.io/apiserver/pkg/server.DefaultBuildHandlerChain.TrackCompleted.trackCompleted.func26({0x293c340?, 0xc0007d1ef0}, 0xc0007a3700) 2024-05-15T09:25:25.782066400Z k8s.io/apiserver@v0.29.2/pkg/endpoints/filterlatency/filterlatency.go:110 +0x177 2024-05-15T09:25:25.782066400Z net/http.HandlerFunc.ServeHTTP(0x29490d0?, {0x293c340?, 0xc0007d1ef0?}, 0x291a100?) 
2024-05-15T09:25:25.782066400Z net/http/server.go:2136 +0x29 2024-05-15T09:25:25.782066400Z k8s.io/apiserver/pkg/endpoints/filters.withAuthentication.func1({0x293c340, 0xc0007d1ef0}, 0xc0007a3700) 2024-05-15T09:25:25.782066400Z k8s.io/apiserver@v0.29.2/pkg/endpoints/filters/authentication.go:120 +0x7e5 2024-05-15T09:25:25.782066400Z net/http.HandlerFunc.ServeHTTP(0x29490d0?, {0x293c340?, 0xc0007d1ef0?}, 0x291ef40?) 2024-05-15T09:25:25.782066400Z net/http/server.go:2136 +0x29 2024-05-15T09:25:25.782066400Z k8s.io/apiserver/pkg/endpoints/filterlatency.trackStarted.func1({0x293c340, 0xc0007d1ef0}, 0xc0007a3500) 2024-05-15T09:25:25.782066400Z k8s.io/apiserver@v0.29.2/pkg/endpoints/filterlatency/filterlatency.go:94 +0x37a 2024-05-15T09:25:25.782066400Z net/http.HandlerFunc.ServeHTTP(0xc0003e0900?, {0x293c340?, 0xc0007d1ef0?}, 0xc00061af20?) 2024-05-15T09:25:25.782066400Z net/http/server.go:2136 +0x29 2024-05-15T09:25:25.782066400Z k8s.io/apiserver/pkg/server/filters.(*timeoutHandler).ServeHTTP.func1() 2024-05-15T09:25:25.782066400Z k8s.io/apiserver@v0.29.2/pkg/server/filters/timeout.go:115 +0x62 2024-05-15T09:25:25.782066400Z created by k8s.io/apiserver/pkg/server/filters.(*timeoutHandler).ServeHTTP in goroutine 8660 2024-05-15T09:25:25.782066400Z k8s.io/apiserver@v0.29.2/pkg/server/filters/timeout.go:101 +0x1b2 2024-05-15T09:25:25.782066400Z 2024-05-15T09:25:25.782066400Z goroutine 8660 [running]: 2024-05-15T09:25:25.782066400Z k8s.io/apimachinery/pkg/util/runtime.logPanic({0x1fb1a00?, 0xc000810260}) 2024-05-15T09:25:25.782066400Z k8s.io/apimachinery@v0.29.2/pkg/util/runtime/runtime.go:75 +0x85 2024-05-15T09:25:25.782066400Z k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0xc0005aa840, 0x1, 0x1c08865?}) 2024-05-15T09:25:25.782066400Z k8s.io/apimachinery@v0.29.2/pkg/util/runtime/runtime.go:49 +0x6b 2024-05-15T09:25:25.782066400Z panic({0x1fb1a00?, 0xc000810260?}) 2024-05-15T09:25:25.782066400Z runtime/panic.go:914 +0x21f 2024-05-15T09:25:25.782066400Z k8s.io/apiserver/pkg/server/filters.(*timeoutHandler).ServeHTTP(0xc000528cc0, {0x2944dd0, 0xc000476460}, 0xdf8475800?) 2024-05-15T09:25:25.782066400Z k8s.io/apiserver@v0.29.2/pkg/server/filters/timeout.go:121 +0x35c 2024-05-15T09:25:25.782066400Z k8s.io/apiserver/pkg/server.DefaultBuildHandlerChain.WithRequestDeadline.withRequestDeadline.func27({0x2944dd0, 0xc000476460}, 0xc0007a3300) 2024-05-15T09:25:25.782066400Z k8s.io/apiserver@v0.29.2/pkg/endpoints/filters/request_deadline.go:100 +0x237 2024-05-15T09:25:25.782066400Z net/http.HandlerFunc.ServeHTTP(0x29490d0?, {0x2944dd0?, 0xc000476460?}, 0x2459ac0?) 2024-05-15T09:25:25.782066400Z net/http/server.go:2136 +0x29 2024-05-15T09:25:25.782066400Z k8s.io/apiserver/pkg/server.DefaultBuildHandlerChain.WithWaitGroup.withWaitGroup.func28({0x2944dd0, 0xc000476460}, 0xc0004764b0?) 2024-05-15T09:25:25.782066400Z k8s.io/apiserver@v0.29.2/pkg/server/filters/waitgroup.go:86 +0x18c 2024-05-15T09:25:25.782066400Z net/http.HandlerFunc.ServeHTTP(0xc0007a3200?, {0x2944dd0?, 0xc000476460?}, 0xc0004764b0?) 2024-05-15T09:25:25.782066400Z net/http/server.go:2136 +0x29 2024-05-15T09:25:25.782066400Z k8s.io/apiserver/pkg/server.DefaultBuildHandlerChain.WithWarningRecorder.func13({0x2944dd0?, 0xc000476460}, 0xc000476410?) 2024-05-15T09:25:25.782066400Z k8s.io/apiserver@v0.29.2/pkg/endpoints/filters/warning.go:35 +0xc6 2024-05-15T09:25:25.782066400Z net/http.HandlerFunc.ServeHTTP(0x2390e60?, {0x2944dd0?, 0xc000476460?}, 0xd?) 
2024-05-15T09:25:25.782066400Z net/http/server.go:2136 +0x29 2024-05-15T09:25:25.782066400Z k8s.io/apiserver/pkg/server.DefaultBuildHandlerChain.WithCacheControl.func14({0x2944dd0, 0xc000476460}, 0x0?) 2024-05-15T09:25:25.782066400Z k8s.io/apiserver@v0.29.2/pkg/endpoints/filters/cachecontrol.go:31 +0xa7 2024-05-15T09:25:25.782066400Z net/http.HandlerFunc.ServeHTTP(0xc0002a0fa0?, {0x2944dd0?, 0xc000476460?}, 0xc0005aad90?) 2024-05-15T09:25:25.782066400Z net/http/server.go:2136 +0x29 2024-05-15T09:25:25.782066400Z k8s.io/apiserver/pkg/server.DefaultBuildHandlerChain.WithHTTPLogging.WithLogging.withLogging.func34({0x2944dd0, 0xc000476460}, 0x1?) 2024-05-15T09:25:25.782066400Z k8s.io/apiserver@v0.29.2/pkg/server/httplog/httplog.go:111 +0x95 2024-05-15T09:25:25.782066400Z net/http.HandlerFunc.ServeHTTP(0xc0007b0360?, {0x2944dd0?, 0xc000476460?}, 0x0?) 2024-05-15T09:25:25.782066400Z net/http/server.go:2136 +0x29 2024-05-15T09:25:25.782066400Z k8s.io/apiserver/pkg/endpoints/filters.WithTracing.func1({0x2944dd0?, 0xc000476460?}, 0xc0007a3200?) 2024-05-15T09:25:25.782066400Z k8s.io/apiserver@v0.29.2/pkg/endpoints/filters/traces.go:42 +0x222 2024-05-15T09:25:25.782129547Z net/http.HandlerFunc.ServeHTTP(0x29490d0?, {0x2944dd0?, 0xc000476460?}, 0x291ef40?) 2024-05-15T09:25:25.782129547Z net/http/server.go:2136 +0x29 2024-05-15T09:25:25.782129547Z go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp.(*middleware).serveHTTP(0xc000289b80, {0x293c340?, 0xc0007d1bf0}, 0xc0007a3100, {0x2923a40, 0xc000528d68}) 2024-05-15T09:25:25.782129547Z go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp@v0.44.0/handler.go:217 +0x1202 2024-05-15T09:25:25.782129547Z go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp.NewMiddleware.func1.1({0x293c340?, 0xc0007d1bf0?}, 0xc0001fec40?) 2024-05-15T09:25:25.782129547Z go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp@v0.44.0/handler.go:81 +0x35 2024-05-15T09:25:25.782129547Z net/http.HandlerFunc.ServeHTTP(0x2948fb0?, {0x293c340?, 0xc0007d1bf0?}, 0x100?) 2024-05-15T09:25:25.782129547Z net/http/server.go:2136 +0x29 2024-05-15T09:25:25.782129547Z k8s.io/apiserver/pkg/server.DefaultBuildHandlerChain.WithLatencyTrackers.func16({0x29377e0?, 0xc0001fec40}, 0xc000289e40?) 2024-05-15T09:25:25.782129547Z k8s.io/apiserver@v0.29.2/pkg/endpoints/filters/webhook_duration.go:57 +0x14a 2024-05-15T09:25:25.782129547Z net/http.HandlerFunc.ServeHTTP(0xc0007a2f00?, {0x29377e0?, 0xc0001fec40?}, 0x7f2abb853108?) 2024-05-15T09:25:25.782129547Z net/http/server.go:2136 +0x29 2024-05-15T09:25:25.782129547Z k8s.io/apiserver/pkg/server.DefaultBuildHandlerChain.WithRequestInfo.func17({0x29377e0, 0xc0001fec40}, 0x3d02360?) 2024-05-15T09:25:25.782129547Z k8s.io/apiserver@v0.29.2/pkg/endpoints/filters/requestinfo.go:39 +0x118 2024-05-15T09:25:25.782129547Z net/http.HandlerFunc.ServeHTTP(0xc0007a2e00?, {0x29377e0?, 0xc0001fec40?}, 0x12a1dc02246f?) 2024-05-15T09:25:25.782129547Z net/http/server.go:2136 +0x29 2024-05-15T09:25:25.782129547Z k8s.io/apiserver/pkg/server.DefaultBuildHandlerChain.WithRequestReceivedTimestamp.withRequestReceivedTimestampWithClock.func31({0x29377e0, 0xc0001fec40}, 0xc000508b58?) 2024-05-15T09:25:25.782129547Z k8s.io/apiserver@v0.29.2/pkg/endpoints/filters/request_received_time.go:38 +0xaf 2024-05-15T09:25:25.782129547Z net/http.HandlerFunc.ServeHTTP(0x3?, {0x29377e0?, 0xc0001fec40?}, 0xc0005ab818?) 
2024-05-15T09:25:25.782129547Z net/http/server.go:2136 +0x29 2024-05-15T09:25:25.782129547Z k8s.io/apiserver/pkg/server.DefaultBuildHandlerChain.WithMuxAndDiscoveryComplete.func18({0x29377e0?, 0xc0001fec40?}, 0xc0007a2e00?) 2024-05-15T09:25:25.782129547Z k8s.io/apiserver@v0.29.2/pkg/endpoints/filters/mux_discovery_complete.go:52 +0xd5 2024-05-15T09:25:25.782129547Z net/http.HandlerFunc.ServeHTTP(0xc000081800?, {0x29377e0?, 0xc0001fec40?}, 0xc0005ab888?) 2024-05-15T09:25:25.782129547Z net/http/server.go:2136 +0x29 2024-05-15T09:25:25.782129547Z k8s.io/apiserver/pkg/server.DefaultBuildHandlerChain.WithPanicRecovery.withPanicRecovery.func32({0x29377e0?, 0xc0001fec40?}, 0xc0007d18f0?) 2024-05-15T09:25:25.782129547Z k8s.io/apiserver@v0.29.2/pkg/server/filters/wrap.go:74 +0xa6 2024-05-15T09:25:25.782129547Z net/http.HandlerFunc.ServeHTTP(0x29490d0?, {0x29377e0?, 0xc0001fec40?}, 0xc00065eea0?) 2024-05-15T09:25:25.782129547Z net/http/server.go:2136 +0x29 2024-05-15T09:25:25.782129547Z k8s.io/apiserver/pkg/server.DefaultBuildHandlerChain.WithAuditInit.withAuditInit.func33({0x29377e0, 0xc0001fec40}, 0xc00040c580?) 2024-05-15T09:25:25.782129547Z k8s.io/apiserver@v0.29.2/pkg/endpoints/filters/audit_init.go:63 +0x12c 2024-05-15T09:25:25.782129547Z net/http.HandlerFunc.ServeHTTP(0x2390e60?, {0x29377e0?, 0xc0001fec40?}, 0xd?) 2024-05-15T09:25:25.782129547Z net/http/server.go:2136 +0x29 2024-05-15T09:25:25.782129547Z github.com/openshift/oauth-server/pkg/oauthserver.(*OAuthServerConfig).buildHandlerChainForOAuth.WithPreserveOAuthHeaders.func2({0x29377e0, 0xc0001fec40}, 0xc0007a2d00) 2024-05-15T09:25:25.782129547Z github.com/openshift/oauth-server/pkg/server/headers/oauthbasic.go:42 +0x16e 2024-05-15T09:25:25.782129547Z net/http.HandlerFunc.ServeHTTP(0xc0005aba80?, {0x29377e0?, 0xc0001fec40?}, 0x24c95d5?) 2024-05-15T09:25:25.782129547Z net/http/server.go:2136 +0x29 2024-05-15T09:25:25.782129547Z github.com/openshift/oauth-server/pkg/oauthserver.(*OAuthServerConfig).buildHandlerChainForOAuth.WithStandardHeaders.func3({0x29377e0, 0xc0001fec40}, 0xc0005abb18?) 2024-05-15T09:25:25.782129547Z github.com/openshift/oauth-server/pkg/server/headers/headers.go:30 +0xde 2024-05-15T09:25:25.782129547Z net/http.HandlerFunc.ServeHTTP(0xc0005abb68?, {0x29377e0?, 0xc0001fec40?}, 0xc00040c580?) 2024-05-15T09:25:25.782129547Z net/http/server.go:2136 +0x29 2024-05-15T09:25:25.782129547Z k8s.io/apiserver/pkg/server.(*APIServerHandler).ServeHTTP(0x3d33480?, {0x29377e0?, 0xc0001fec40?}, 0xc0005abb50?) 2024-05-15T09:25:25.782129547Z k8s.io/apiserver@v0.29.2/pkg/server/handler.go:189 +0x25 2024-05-15T09:25:25.782129547Z net/http.serverHandler.ServeHTTP({0xc0007d1830?}, {0x29377e0?, 0xc0001fec40?}, 0x6?) 2024-05-15T09:25:25.782129547Z net/http/server.go:2938 +0x8e 2024-05-15T09:25:25.782129547Z net/http.(*conn).serve(0xc0007b02d0, {0x29490d0, 0xc000585e90}) 2024-05-15T09:25:25.782129547Z net/http/server.go:2009 +0x5f4 2024-05-15T09:25:25.782129547Z created by net/http.(*Server).Serve in goroutine 249 2024-05-15T09:25:25.782129547Z net/http/server.go:3086 +0x5cb 2024-05-15T09:25:25.782129547Z http: superfluous response.WriteHeader call from k8s.io/apiserver/pkg/server.DefaultBuildHandlerChain.WithPanicRecovery.func19 (wrap.go:57) 2024-05-15T09:25:25.782129547Z E0515 09:25:25.782066 1 wrap.go:58] "apiserver panic'd" method="GET" URI="/oauth/authorize?response_type=token&client_id=openshift-challenging-client" auditID="ac4795ff-5935-4ff5-bc9e-d84018f29469"
Actual results:
Panics when anonymously curl'ed
Expected results:
No panic
Please review the following PR: https://github.com/openshift/machine-api-provider-nutanix/pull/74
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
Trying to execute https://github.com/openshift-metal3/dev-scripts to deploy an OCP 4.16 or 4.17 cluster (with the same configuration OCP 4.14 and 4.15 are instead working) with: MIRROR_IMAGES=true INSTALLER_PROXY=true the bootstrap process fails with: level=debug msg= baremetalhost resource not yet available, will retry level=debug msg= baremetalhost resource not yet available, will retry level=info msg= baremetalhost: ostest-master-0: uninitialized level=info msg= baremetalhost: ostest-master-0: registering level=info msg= baremetalhost: ostest-master-1: uninitialized level=info msg= baremetalhost: ostest-master-1: registering level=info msg= baremetalhost: ostest-master-2: uninitialized level=info msg= baremetalhost: ostest-master-2: registering level=info msg= baremetalhost: ostest-master-1: inspecting level=info msg= baremetalhost: ostest-master-2: inspecting level=info msg= baremetalhost: ostest-master-0: inspecting E0514 12:16:51.985417 89709 reflector.go:147] k8s.io/client-go/tools/watch/informerwatcher.go:146: Failed to watch *unstructured.Unstructured: Get "https://api.ostest.test.metalkube.org:6443/apis/metal3.io/v1alpha1/namespaces/openshift-machine-api/baremetalhosts?allowWatchBookmarks=true&resourceVersion=5466&timeoutSeconds=547&watch=true": Service Unavailable W0514 12:16:52.979254 89709 reflector.go:539] k8s.io/client-go/tools/watch/informerwatcher.go:146: failed to list *unstructured.Unstructured: Get "https://api.ostest.test.metalkube.org:6443/apis/metal3.io/v1alpha1/namespaces/openshift-machine-api/baremetalhosts?resourceVersion=5466": Service Unavailable E0514 12:16:52.979293 89709 reflector.go:147] k8s.io/client-go/tools/watch/informerwatcher.go:146: Failed to watch *unstructured.Unstructured: failed to list *unstructured.Unstructured: Get "https://api.ostest.test.metalkube.org:6443/apis/metal3.io/v1alpha1/namespaces/openshift-machine-api/baremetalhosts?resourceVersion=5466": Service Unavailable E0514 12:37:01.927140 89709 reflector.go:147] k8s.io/client-go/tools/watch/informerwatcher.go:146: Failed to watch *unstructured.Unstructured: Get "https://api.ostest.test.metalkube.org:6443/apis/metal3.io/v1alpha1/namespaces/openshift-machine-api/baremetalhosts?allowWatchBookmarks=true&resourceVersion=7800&timeoutSeconds=383&watch=true": Service Unavailable W0514 12:37:03.173425 89709 reflector.go:539] k8s.io/client-go/tools/watch/informerwatcher.go:146: failed to list *unstructured.Unstructured: Get "https://api.ostest.test.metalkube.org:6443/apis/metal3.io/v1alpha1/namespaces/openshift-machine-api/baremetalhosts?resourceVersion=7800": Service Unavailable E0514 12:37:03.173473 89709 reflector.go:147] k8s.io/client-go/tools/watch/informerwatcher.go:146: Failed to watch *unstructured.Unstructured: failed to list *unstructured.Unstructured: Get "https://api.ostest.test.metalkube.org:6443/apis/metal3.io/v1alpha1/namespaces/openshift-machine-api/baremetalhosts?resourceVersion=7800": Service Unavailable level=debug msg=Fetching Bootstrap SSH Key Pair... level=debug msg=Loading Bootstrap SSH Key Pair... it looks like up to a certain point https://api.ostest.test.metalkube.org:6443 was reachable but then for some reason it started failing because its not using the proxy or is and it shouldn't be (???) 
The 3 master nodes are reported as: [root@ipi-ci-op-0qigcrln-b54ee-1790684582253694976 home]# oc get baremetalhosts -A NAMESPACE NAME STATE CONSUMER ONLINE ERROR AGE openshift-machine-api ostest-master-0 inspecting ostest-bbhxb-master-0 true inspection error 24m openshift-machine-api ostest-master-1 inspecting ostest-bbhxb-master-1 true inspection error 24m openshift-machine-api ostest-master-2 inspecting ostest-bbhxb-master-2 true inspection error 24m With something like: status: errorCount: 5 errorMessage: 'Failed to inspect hardware. Reason: unable to start inspection: Validation of image href http://0.0.0.0:8084/34427934-f1a6-48d6-9666-66872eec9ba2 failed, reason: Got HTTP code 503 instead of 200 in response to HEAD request.' errorType: inspection error on their status
Version-Release number of selected component (if applicable):
4.16, 4.17
How reproducible:
100%
Steps to Reproduce:
1. Try to create an OCP 4.16 cluster with dev-scripts with IP_STACK=v4, MIRROR_IMAGES=true and INSTALLER_PROXY=true 2. 3.
Actual results:
level=info msg= baremetalhost: ostest-master-0: inspecting E0514 12:16:51.985417 89709 reflector.go:147] k8s.io/client-go/tools/watch/informerwatcher.go:146: Failed to watch *unstructured.Unstructured: Get "https://api.ostest.test.metalkube.org:6443/apis/metal3.io/v1alpha1/namespaces/openshift-machine-api/baremetalhosts?allowWatchBookmarks=true&resourceVersion=5466&timeoutSeconds=547&watch=true": Service Unavailable
Expected results:
Successful deployment
Additional info:
I'm using IP_STACK=v4, MIRROR_IMAGES=true and INSTALLER_PROXY=true. With the same configuration (MIRROR_IMAGES=true and INSTALLER_PROXY=true), OCP 4.14 and OCP 4.15 work. When removing INSTALLER_PROXY=true, OCP 4.16 also works. I'm going to attach the bootstrap gather logs.
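For reference, the relevant part of the dev-scripts configuration used here looks like this sketch (other required variables omitted; config_$USER.sh is the usual dev-scripts config file):
  # config_$USER.sh
  export IP_STACK=v4
  export MIRROR_IMAGES=true
  export INSTALLER_PROXY=true   # removing this line makes the 4.16 deployment succeed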
Please review the following PR: https://github.com/openshift/csi-driver-manila-operator/pull/231
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/installer/pull/8455
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Bump prometheus-operator to 0.75.1 in CMO and the downstream fork.
Description of problem:
For IPI on vSphere: enable CAPI in the installer and install the cluster. After destroying the cluster, the destroy log reports that all folders were deleted, but the cluster folder still exists in the vSphere Client. Example:
05-08 20:24:38.765 level=debug msg=Delete Folder*05-08 20:24:40.649* level=debug msg=All folders deleted*05-08 20:24:40.649* level=debug msg=Delete StoragePolicy=openshift-storage-policy-wwei-0429g-fdwqc*05-08 20:24:41.576* level=info msg=Destroyed StoragePolicy=openshift-storage-policy-wwei-0429g-fdwqc*05-08 20:24:41.576* level=debug msg=Delete Tag=wwei-0429g-fdwqc*05-08 20:24:43.463* level=info msg=Deleted Tag=wwei-0429g-fdwqc*05-08 20:24:43.463* level=debug msg=Delete TagCategory=openshift-wwei-0429g-fdwqc*05-08 20:24:44.825* level=info msg=Deleted TagCategory=openshift-wwei-0429g-fdwqc
govc ls /DEVQEdatacenter/vm | grep wwei-0429g-fdwqc
/DEVQEdatacenter/vm/wwei-0429g-fdwqc
Version-Release number of selected component (if applicable):
4.16.0-0.nightly-2024-05-07-025557
How reproducible:
Destroy a cluster installed with CAPI.
Steps to Reproduce:
1. Install a cluster with CAPI 2. Destroy the cluster and check the cluster folder in the vSphere client
Actual results:
cluster folder still exists.
Expected results:
The cluster folder should not exist in the vSphere client after a successful destroy.
Additional info:
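One way to confirm and manually clean up the leftover folder is with govc; this is only a sketch (the datacenter path and folder name follow the example above, and object.destroy is destructive, so verify the target first):
  # confirm the folder survived the destroy
  govc ls /DEVQEdatacenter/vm | grep wwei-0429g-fdwqc
  # manually remove the empty cluster folder
  govc object.destroy /DEVQEdatacenter/vm/wwei-0429g-fdwqc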
This is a clone of issue OCPBUGS-38551. The following is the description of the original issue:
—
Description of problem:
If multiple NICs are configured in install-config, the installer will provision nodes properly but will fail during bootstrap due to API validation: 4.17 and later will support multiple NICs, while earlier releases will not and fail with: Aug 15 18:30:57 2.252.83.01.in-addr.arpa cluster-bootstrap[4889]: [#1672] failed to create some manifests: Aug 15 18:30:57 2.252.83.01.in-addr.arpa cluster-bootstrap[4889]: "cluster-infrastructure-02-config.yml": failed to create infrastructures.v1.config.openshift.io/cluster -n : Infrastructure.config.openshift.io "cluster" is invalid: [spec.platformSpec.vsphere.failureDomains[0].topology.networks: Too many: 2: must have at most 1 items, <nil>: Invalid value: "null": some validation rules were not checked because the object was invalid; correct the existing errors to complete validation]
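For illustration, an install-config.yaml fragment of the shape that triggers this validation error might look like the following (failure domain and network names are placeholders; only the networks list matters here):
  platform:
    vsphere:
      failureDomains:
      - name: fd-1                 # placeholder failure domain (other required fields omitted)
        topology:
          networks:
          - VM Network             # first NIC
          - segment-b              # second NIC; rejected on < 4.17 by the API validation quoted above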
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
When an image is referenced by tag and digest, oc-mirror skips the image
Version-Release number of selected component (if applicable):
How reproducible:
Do mirror-to-disk and disk-to-mirror using the registry.redhat.io/redhat/redhat-operator-index:v4.16 catalog and the multiarch-tuning-operator operator.
Steps to Reproduce:
1. Mirror to disk 2. Disk to mirror (using a config like the sketch below)
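A sketch of the ImageSetConfiguration behind these two steps (the apiVersion and exact schema should be checked against the oc-mirror version in use):
  kind: ImageSetConfiguration
  apiVersion: mirror.openshift.io/v2alpha1
  mirror:
    operators:
    - catalog: registry.redhat.io/redhat/redhat-operator-index:v4.16
      packages:
      - name: multiarch-tuning-operator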
Actual results:
docker://gcr.io/kubebuilder/kube-rbac-proxy:v0.13.1@sha256:d4883d7c622683b3319b5e6b3a7edfbf2594c18060131a8bf64504805f875522 (Operator bundles: [multiarch-tuning-operator.v0.9.0] - Operators: [multiarch-tuning-operator]) error: Invalid source name docker://localhost:55000/kubebuilder/kube-rbac-proxy:v0.13.1:d4883d7c622683b3319b5e6b3a7edfbf2594c18060131a8bf64504805f875522: invalid reference format
Expected results:
The image should be mirrored
Additional info:
Description of problem:
The update of the samples is required for the release of Samples Operator for OCP 4.17
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
Not a bug, but using OCPBUGS so that CI automation can be used in GitHub. The SO JIRA project is no longer updated with the required versions.
Actual results:
Expected results:
Additional info:
This is a clone of issue OCPBUGS-41787. The following is the description of the original issue:
—
Description of problem:
The test tries to schedule pods on all workers but fails to schedule on infra nodes:
Warning FailedScheduling 86s default-scheduler 0/9 nodes are available: 3 node(s) had untolerated taint {node-role.kubernetes.io/infra: }, 6 node(s) didn't match pod anti-affinity rules. preemption: 0/9 nodes are available: 3 Preemption is not helpful for scheduling, 6 No preemption victims found for incoming pod.
$ oc get nodes
NAME STATUS ROLES AGE VERSION
ostest-b6fns-infra-0-m4v7t Ready infra,worker 19h v1.30.4
ostest-b6fns-infra-0-pllsf Ready infra,worker 19h v1.30.4
ostest-b6fns-infra-0-vnbp8 Ready infra,worker 19h v1.30.4
ostest-b6fns-master-0 Ready control-plane,master 19h v1.30.4
ostest-b6fns-master-2 Ready control-plane,master 19h v1.30.4
ostest-b6fns-master-lmlxf-1 Ready control-plane,master 17h v1.30.4
ostest-b6fns-worker-0-h527q Ready worker 19h v1.30.4
ostest-b6fns-worker-0-kpvdx Ready worker 19h v1.30.4
ostest-b6fns-worker-0-xfcjf Ready worker 19h v1.30.4
Infra nodes should be removed from the worker nodes in the test.
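As a sketch, the test can restrict itself to plain workers with a standard label selector that excludes the infra role (the exact test helper is not shown here):
  # nodes with the worker role but without the infra role
  oc get nodes -l 'node-role.kubernetes.io/worker,!node-role.kubernetes.io/infra'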
Version-Release number of selected component (if applicable):
4.16.0-0.nightly-2024-09-09-173813
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
In integration with the latest Hypershift Operator (0.0.39) and a 4.15.x Hosted Cluster: 1. Apply a new hostedCluster.Spec.Configuration.Image (insecureRegistries) 2. The config is rolled out to all the node pools 3. Nodes with the previous config are stuck because machines can't be deleted, so the rollout never progresses.
CAPI shows this log
I0624 14:38:22.520708 1 logger.go:67] "Handling deleted AWSMachine" E0624 14:38:22.520786 1 logger.go:83] "unable to delete machine" err="failed to get raw userdata: failed to retrieve bootstrap data secret for AWSMachine ocm-int-2c3is2isdhgqcu5qat4a7qbo8j6vqm62-ad-int1/ad-int1-workers-16fe3af3-mdvv6: Secret \"user-data-ad-int1-workers-b14ee318\" not found" E0624 14:38:22.521364 1 controller.go:324] "Reconciler error" err="failed to get raw userdata: failed to retrieve bootstrap data secret for AWSMachine ocm-int-2c3is2isdhgqcu5qat4a7qbo8j6vqm62-ad-int1/ad-int1-workers-16fe3af3-mdvv6: Secret \"user-data-ad-int1-workers-b14ee318\" not found" controller="awsmachine" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AWSMachine" AWSMachine="ocm-int-2c3is2isdhgqcu5qat4a7qbo8j6vqm62-ad-int1/ad-int1-workers-16fe3af3-mdvv6" namespace="ocm-int-2c3is2isdhgqcu5qat4a7qbo8j6vqm62-ad-int1" name="ad-int1-workers-16fe3af3-mdvv6" reconcileID="8ca6fbef-1031-45df-b0cc-78d2f25607da"
The secret seems to be deleted by HO too early.
Found https://github.com/openshift/hypershift/pull/3969 which may be related
Version-Release number of selected component (if applicable):
How reproducible:
Always in the ROSA integration environment.
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Patch example:
image:
  additionalTrustedCA:
    name: ""
  registrySources:
    blockedRegistries:
    - badregistry.io
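A sketch of applying that configuration to the HostedCluster from the management cluster (name and namespace are placeholders):
  oc -n clusters patch hostedcluster <name> --type merge \
    -p '{"spec":{"configuration":{"image":{"additionalTrustedCA":{"name":""},"registrySources":{"blockedRegistries":["badregistry.io"]}}}}}'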
Slack thread https://redhat-external.slack.com/archives/C01C8502FMM/p1719221463858639
Update to the latest Ironic projects in the Ironic containers to get bug and security fixes.
Description of problem:
In an attempt to fix https://issues.redhat.com/browse/OCPBUGS-35300, we introduced an Azure-specific dependency on dnsmasq, which introduced a dependency loop. This bug aims to revert that chain.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
Installation failed on 4.16 nightly build when waiting for install-complete. API is unavailable. level=info msg=Waiting up to 20m0s (until 5:00AM UTC) for the Kubernetes API at https://api.ci-op-4sgxj8jx-8482f.qe.azure.devcluster.openshift.com:6443... level=info msg=API v1.29.2+a0beecc up level=info msg=Waiting up to 30m0s (until 5:11AM UTC) for bootstrapping to complete... api available waiting for bootstrap to complete level=info msg=Waiting up to 20m0s (until 5:01AM UTC) for the Kubernetes API at https://api.ci-op-4sgxj8jx-8482f.qe.azure.devcluster.openshift.com:6443... level=info msg=API v1.29.2+a0beecc up level=info msg=Waiting up to 30m0s (until 5:11AM UTC) for bootstrapping to complete... level=info msg=It is now safe to remove the bootstrap resources level=info msg=Time elapsed: 15m54s Copying kubeconfig to shared dir as kubeconfig-minimal level=info msg=Destroying the bootstrap resources... level=info msg=Waiting up to 40m0s (until 5:39AM UTC) for the cluster at https://api.ci-op-4sgxj8jx-8482f.qe.azure.devcluster.openshift.com:6443 to initialize... W0313 04:59:34.272442 229 reflector.go:539] k8s.io/client-go/tools/watch/informerwatcher.go:146: failed to list *v1.ClusterVersion: Get "https://api.ci-op-4sgxj8jx-8482f.qe.azure.devcluster.openshift.com:6443/apis/config.openshift.io/v1/clusterversions?fieldSelector=metadata.name%3Dversion&limit=500&resourceVersion=0": dial tcp 172.212.184.131:6443: i/o timeout I0313 04:59:34.272658 229 trace.go:236] Trace[533197684]: "Reflector ListAndWatch" name:k8s.io/client-go/tools/watch/informerwatcher.go:146 (13-Mar-2024 04:59:04.271) (total time: 30000ms): Trace[533197684]: ---"Objects listed" error:Get "https://api.ci-op-4sgxj8jx-8482f.qe.azure.devcluster.openshift.com:6443/apis/config.openshift.io/v1/clusterversions?fieldSelector=metadata.name%3Dversion&limit=500&resourceVersion=0": dial tcp 172.212.184.131:6443: i/o timeout 30000ms (04:59:34.272) ... E0313 05:38:18.669780 229 reflector.go:147] k8s.io/client-go/tools/watch/informerwatcher.go:146: Failed to watch *v1.ClusterVersion: failed to list *v1.ClusterVersion: Get "https://api.ci-op-4sgxj8jx-8482f.qe.azure.devcluster.openshift.com:6443/apis/config.openshift.io/v1/clusterversions?fieldSelector=metadata.name%3Dversion&limit=500&resourceVersion=0": dial tcp 172.212.184.131:6443: i/o timeout level=error msg=Attempted to gather ClusterOperator status after installation failure: listing ClusterOperator objects: Get "https://api.ci-op-4sgxj8jx-8482f.qe.azure.devcluster.openshift.com:6443/apis/config.openshift.io/v1/clusteroperators": dial tcp 172.212.184.131:6443: i/o timeout level=error msg=Cluster initialization failed because one or more operators are not functioning properly. 
level=error msg=The cluster should be accessible for troubleshooting as detailed in the documentation linked below, level=error msg=https://docs.openshift.com/container-platform/latest/support/troubleshooting/troubleshooting-installations.html level=error msg=The 'wait-for install-complete' subcommand can then be used to continue the installation level=error msg=failed to initialize the cluster: timed out waiting for the condition On master node, seems that kube-apiserver is not running, [root@ci-op-4sgxj8jx-8482f-hppxj-master-0 ~]# crictl ps | grep apiserver e4b6cc9622b01 ec5ccd782eb003136d9cc1df51a2b20f8a2a489d72ffb894b92f50e363c7cb90 7 minutes ago Running kube-apiserver-cert-syncer 22 3ff4af6614409 kube-apiserver-ci-op-4sgxj8jx-8482f-hppxj-master-0 1249824fe5788 ec5ccd782eb003136d9cc1df51a2b20f8a2a489d72ffb894b92f50e363c7cb90 4 hours ago Running kube-apiserver-insecure-readyz 0 3ff4af6614409 kube-apiserver-ci-op-4sgxj8jx-8482f-hppxj-master-0 ca774b07284f0 ec5ccd782eb003136d9cc1df51a2b20f8a2a489d72ffb894b92f50e363c7cb90 4 hours ago Running kube-apiserver-cert-regeneration-controller 0 3ff4af6614409 kube-apiserver-ci-op-4sgxj8jx-8482f-hppxj-master-0 2931b9a2bbabd ec5ccd782eb003136d9cc1df51a2b20f8a2a489d72ffb894b92f50e363c7cb90 4 hours ago Running openshift-apiserver-check-endpoints 0 4136bf2183de1 apiserver-7df5bb879-xx74p 0c9534aec3b6b 8c9042f97c89d8c8519d6e6235bef5a5346f08e6d7d9864ef0f228b318b4c3de 4 hours ago Running openshift-apiserver 0 4136bf2183de1 apiserver-7df5bb879-xx74p db21a2dd1df33 ec5ccd782eb003136d9cc1df51a2b20f8a2a489d72ffb894b92f50e363c7cb90 4 hours ago Running guard 0 199e1f4e665b9 kube-apiserver-guard-ci-op-4sgxj8jx-8482f-hppxj-master-0 429110f9ea5a3 6a03f3f082f3719e79087d569b3cd1e718fb670d1261fbec9504662f1005b1a5 4 hours ago Running apiserver-watcher 0 7664f480df29d apiserver-watcher-ci-op-4sgxj8jx-8482f-hppxj-master-0 [root@ci-op-4sgxj8jx-8482f-hppxj-master-1 ~]# crictl ps | grep apiserver c64187e7adcc6 ec5ccd782eb003136d9cc1df51a2b20f8a2a489d72ffb894b92f50e363c7cb90 4 hours ago Running openshift-apiserver-check-endpoints 0 1a4a5b247c28a apiserver-7df5bb879-f6v5x ff98c52402288 8c9042f97c89d8c8519d6e6235bef5a5346f08e6d7d9864ef0f228b318b4c3de 4 hours ago Running openshift-apiserver 0 1a4a5b247c28a apiserver-7df5bb879-f6v5x 2f8a97f959409 faa1b95089d101cdc907d7affe310bbff5a9aa8f92c725dc6466afc37e731927 4 hours ago Running oauth-apiserver 0 ffa2c316a0cca apiserver-97fbc599c-2ftl7 72897e30e0df0 6a03f3f082f3719e79087d569b3cd1e718fb670d1261fbec9504662f1005b1a5 4 hours ago Running apiserver-watcher 0 3b6c3849ce91f apiserver-watcher-ci-op-4sgxj8jx-8482f-hppxj-master-1 [root@ci-op-4sgxj8jx-8482f-hppxj-master-2 ~]# crictl ps | grep apiserver 04c426f07573d faa1b95089d101cdc907d7affe310bbff5a9aa8f92c725dc6466afc37e731927 4 hours ago Running oauth-apiserver 0 2172a64fb1a38 apiserver-654dcb4cc6-tq8fj 4dcca5c0e9b99 6a03f3f082f3719e79087d569b3cd1e718fb670d1261fbec9504662f1005b1a5 4 hours ago Running apiserver-watcher 0 1cd99ec327199 apiserver-watcher-ci-op-4sgxj8jx-8482f-hppxj-master-2 And found below error in kubelet log, Mar 13 06:10:15 ci-op-4sgxj8jx-8482f-hppxj-master-0 kubenswrapper[23961]: E0313 06:10:15.004656 23961 kuberuntime_manager.go:1262] container &Container{Name:kube-apiserver,Image:quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:789f242b8bc721b697e265c6f9d025f45e56e990bfd32e331c633fe0b9f076bc,Command:[/bin/bash -ec],Args:[LOCK=/var/log/kube-apiserver/.lock Mar 13 06:10:15 ci-op-4sgxj8jx-8482f-hppxj-master-0 kubenswrapper[23961]: # We should be able to acquire the 
lock immediatelly. If not, it means the init container has not released it yet and kubelet or CRI-O started container prematurely. Mar 13 06:10:15 ci-op-4sgxj8jx-8482f-hppxj-master-0 kubenswrapper[23961]: exec {LOCK_FD}>${LOCK} && flock --verbose -w 30 "${LOCK_FD}" || { Mar 13 06:10:15 ci-op-4sgxj8jx-8482f-hppxj-master-0 kubenswrapper[23961]: echo "Failed to acquire lock for kube-apiserver. Please check setup container for details. This is likely kubelet or CRI-O bug." Mar 13 06:10:15 ci-op-4sgxj8jx-8482f-hppxj-master-0 kubenswrapper[23961]: exit 1 Mar 13 06:10:15 ci-op-4sgxj8jx-8482f-hppxj-master-0 kubenswrapper[23961]: } Mar 13 06:10:15 ci-op-4sgxj8jx-8482f-hppxj-master-0 kubenswrapper[23961]: if [ -f /etc/kubernetes/static-pod-certs/configmaps/trusted-ca-bundle/ca-bundle.crt ]; then Mar 13 06:10:15 ci-op-4sgxj8jx-8482f-hppxj-master-0 kubenswrapper[23961]: echo "Copying system trust bundle ..." Mar 13 06:10:15 ci-op-4sgxj8jx-8482f-hppxj-master-0 kubenswrapper[23961]: cp -f /etc/kubernetes/static-pod-certs/configmaps/trusted-ca-bundle/ca-bundle.crt /etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem Mar 13 06:10:15 ci-op-4sgxj8jx-8482f-hppxj-master-0 kubenswrapper[23961]: fi Mar 13 06:10:15 ci-op-4sgxj8jx-8482f-hppxj-master-0 kubenswrapper[23961]: exec watch-termination --termination-touch-file=/var/log/kube-apiserver/.terminating --termination-log-file=/var/log/kube-apiserver/termination.log --graceful-termination-duration=135s --kubeconfig=/etc/kubernetes/static-pod-resources/configmaps/kube-apiserver-cert-syncer-kubeconfig/kubeconfig -- hyperkube kube-apiserver --openshift-config=/etc/kubernetes/static-pod-resources/configmaps/config/config.yaml --advertise-address=${HOST_IP} -v=2 --permit-address-sharing Mar 13 06:10:15 ci-op-4sgxj8jx-8482f-hppxj-master-0 kubenswrapper[23961]: ],WorkingDir:,Ports:[]ContainerPort{ContainerPort{Name:,HostPort:6443,ContainerPort:6443,Protocol:TCP,HostIP:,},},Env:[]EnvVar{EnvVar{Name:POD_NAME,Value:,ValueFrom:&EnvVarSource{FieldRef:&ObjectFieldSelector{APIVersion:v1,FieldPath:metadata.name,},ResourceFieldRef:nil,ConfigMapKeyRef:nil,SecretKeyRef:nil,},},EnvVar{Name:POD_NAMESPACE,Value:,ValueFrom:&EnvVarSource{FieldRef:&ObjectFieldSelector{APIVersion:v1,FieldPath:metadata.namespace,},ResourceFieldRef:nil,ConfigMapKeyRef:nil,SecretKeyRef:nil,},},EnvVar{Name:STATIC_POD_VERSION,Value:4,ValueFrom:nil,},EnvVar{Name:HOST_IP,Value:,ValueFrom:&EnvVarSource{FieldRef:&ObjectFieldSelector{APIVersion:v1,FieldPath:status.hostIP,},ResourceFieldRef:nil,ConfigMapKeyRef:nil,SecretKeyRef:nil,},},EnvVar{Name:GOGC,Value:100,ValueFrom:nil,},},Resources:ResourceRequirements{Limits:ResourceList{},Requests:ResourceList{cpu: {{265 -3} {<nil>} 265m DecimalSI},memory: {{1073741824 0} {<nil>} 1Gi BinarySI},},Claims:[]ResourceClaim{},},VolumeMounts:[]VolumeMount{VolumeMount{Name:resource-dir,ReadOnly:false,MountPath:/etc/kubernetes/static-pod-resources,SubPath:,MountPropagation:nil,SubPathExpr:,},VolumeMount{Name:cert-dir,ReadOnly:false,MountPath:/etc/kubernetes/static-pod-certs,SubPath:,MountPropagation:nil,SubPathExpr:,},VolumeMount{Name:audit-dir,ReadOnly:false,MountPath:/var/log/kube-apiserver,SubPath:,MountPropagation:nil,SubPathExpr:,},},LivenessProbe:&Probe{ProbeHandler:ProbeHandler{Exec:nil,HTTPGet:&HTTPGetAction{Path:livez,Port:{0 6443 
},Host:,Scheme:HTTPS,HTTPHeaders:[]HTTPHeader{},},TCPSocket:nil,GRPC:nil,},InitialDelaySeconds:0,TimeoutSeconds:10,PeriodSeconds:10,SuccessThreshold:1,FailureThreshold:3,TerminationGracePeriodSeconds:nil,},ReadinessProbe:&Probe{ProbeHandler:ProbeHandler{Exec:nil,HTTPGet:&HTTPGetAction{Path:readyz,Port:{0 6443 },Host:,Scheme:HTTPS,HTTPHeaders:[]HTTPHeader{},},TCPSocket:nil,GRPC:nil,},InitialDelaySeconds:0,TimeoutSeconds:10,PeriodSeconds:5,SuccessThreshold:1,FailureThreshold:1,TerminationGracePeriodSeconds:nil,},Lifecycle:nil,TerminationMessagePath:/dev/termination-log,ImagePullPolicy:IfNotPresent,SecurityContext:&SecurityContext{Capabilities:nil,Privileged:*true,SELinuxOptions:nil,RunAsUser:nil,RunAsNonRoot:nil,ReadOnlyRootFilesystem:nil,AllowPrivilegeEscalation:nil,RunAsGroup:nil,ProcMount:nil,WindowsOptions:nil,SeccompProfile:nil,},Stdin:false,StdinOnce:false,TTY:false,EnvFrom:[]EnvFromSource{},TerminationMessagePolicy:FallbackToLogsOnError,VolumeDevices:[]VolumeDevice{},StartupProbe:&Probe{ProbeHandler:ProbeHandler{Exec:nil,HTTPGet:&HTTPGetAction{Path:healthz,Port:{0 6443 },Host:,Scheme:HTTPS,HTTPHeaders:[]HTTPHeader{},},TCPSocket:nil,GRPC:nil,},InitialDelaySeconds:0,TimeoutSeconds:10,PeriodSeconds:5,SuccessThreshold:1,FailureThreshold:30,TerminationGracePeriodSeconds:nil,},ResizePolicy:[]ContainerResizePolicy{},RestartPolicy:nil,} start failed in pod kube-apiserver-ci-op-4sgxj8jx-8482f-hppxj-master-0_openshift-kube-apiserver(196e0956694ff43707b03f4585f3b6cd): CreateContainerConfigError: host IP unknown; known addresses: []
Version-Release number of selected component (if applicable):
4.16 latest nightly build
How reproducible:
frequently
Steps to Reproduce:
1. Install cluster on 4.16 nightly build 2. 3.
Actual results:
Installation failed.
Expected results:
Installation is successful.
Additional info:
Searched CI jobs, found many jobs failed with same error, most are on azure platform. https://search.dptools.openshift.org/?search=failed+to+initialize+the+cluster%3A+timed+out+waiting+for+the+condition&maxAge=48h&context=1&type=junit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job
Description of problem:
During the creation of a 4.16 cluster using the nightly build (--channel-group nightly --version 4.16.0-0.nightly-2024-05-19-235324) with the following command:
rosa create cluster --cluster-name $CLUSTER_NAME --sts --mode auto --machine-cidr 10.0.0.0/16 --compute-machine-type m6a.xlarge --region $REGION --oidc-config-id $OIDC_ID --channel-group nightly --version 4.16.0-0.nightly-2024-05-19-235324 --ec2-metadata-http-tokens optional --replicas 2 --service-cidr 172.30.0.0/16 --pod-cidr 10.128.0.0/14 --host-prefix 23 -y
How reproducible:
1. Run the command provided above to create a cluster. 2. Observe the error during the IAM role creation step.
Actual results:
time="2024-05-20T03:21:03Z" level=error msg="failed to fetch Cluster: failed to generate asset \"Cluster\": failed to create cluster: failed during pre-provisioning: failed to create IAM roles: failed to create inline policy for role master: AccessDenied: User: arn:aws:sts::890193308254:assumed-role/ManagedOpenShift-Installer-Role/1716175231092827911 is not authorized to perform: iam:PutRolePolicy on resource: role ManagedOpenShift-ControlPlane-Role because no identity-based policy allows the iam:PutRolePolicy action\n\tstatus code: 403, request id: 27f0f631-abdd-47e9-ba02-a2e71a7487dc" time="2024-05-20T03:21:04Z" level=error msg="error after waiting for command completion" error="exit status 4" installID=wx9l766h time="2024-05-20T03:21:04Z" level=error msg="error provisioning cluster" error="exit status 4" installID=wx9l766h time="2024-05-20T03:21:04Z" level=error msg="error running openshift-install, running deprovision to clean up" error="exit status 4" installID=wx9l766h time="2024-05-20T03:21:04Z" level=debug msg="OpenShift Installer v4.16.0
Expected results:
The cluster should be created successfully without IAM permission errors.
Additional info:
- The IAM role ManagedOpenShift-Installer-Role does not have the necessary permissions to perform iam:PutRolePolicy on the ManagedOpenShift-ControlPlane-Role. - This issue was observed with the nightly build 4.16.0-0.nightly-2024-05-19-235324.
More context: https://redhat-internal.slack.com/archives/C070BJ1NS1E/p1716182046041269
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. Check webhook-authentication-integrated-oauth secret annotations in openshift-config namespace 2. 3.
Actual results:
No component annotation set
Expected results:
Additional info:
This is a clone of issue OCPBUGS-39298. The following is the description of the original issue:
—
Description of problem:
The cluster-capi-operator's manifests-gen tool generates CAPI provider transport ConfigMaps with missing metadata details
Version-Release number of selected component (if applicable):
4.17, 4.18
How reproducible:
Not impacting payload, only a tooling bug
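For illustration, a minimal Go sketch of a transport ConfigMap built with complete metadata; the name, namespace, annotation, and data key shown are assumptions, not the actual manifests-gen output:
~~~
// Sketch of a transport ConfigMap carrying the expected metadata (TypeMeta,
// name, namespace, annotations). All specific values are hypothetical.
package manifests

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func transportConfigMap(provider, payload string) *corev1.ConfigMap {
	return &corev1.ConfigMap{
		TypeMeta: metav1.TypeMeta{APIVersion: "v1", Kind: "ConfigMap"},
		ObjectMeta: metav1.ObjectMeta{
			// Without these fields the generated manifest cannot be applied
			// or tracked properly.
			Name:      provider + "-transport",
			Namespace: "openshift-cluster-api",
			Annotations: map[string]string{
				"include.release.openshift.io/self-managed-high-availability": "true",
			},
		},
		Data: map[string]string{"components": payload},
	}
}
~~~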
Description of problem:
This is related to BUG [OCPBUGS-29459](https://issues.redhat.com/browse/OCPBUGS-29459). In addition to fixing that bug, we should fix the logging in machine-config-controller to emit detailed warnings/errors about the faulty malformed certificate, for example which particular certificate is malformed and what the actual issue with it is, such as `x509: malformed algorithm identifier` or `x509: invalid certificate policies`.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
machine-config-controller just shows `Malformed Cert, not syncing` as Info messages and fails to log the details of the malformed certificate. This makes triage/troubleshooting difficult; it's hard to guess which certificate has the issue.
Expected results:
machine-config-controller should emit detailed warnings identifying which certificate has the issue, which makes troubleshooting a lot easier. This should be logged as an error or warning, not as info.
Additional info:
This is related to BUG [OCPBUGS-29459](https://issues.redhat.com/browse/OCPBUGS-29459)
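For illustration only, a minimal Go sketch (hypothetical names, not the actual MCO code) of the kind of per-certificate detail the controller could log:
~~~
// Hypothetical helper illustrating the requested logging detail.
package certcheck

import (
	"crypto/x509"
	"encoding/pem"

	"k8s.io/klog/v2"
)

func logCertBundle(bundleName string, pemData []byte) {
	for i := 0; len(pemData) > 0; i++ {
		block, rest := pem.Decode(pemData)
		if block == nil {
			break
		}
		pemData = rest
		if block.Type != "CERTIFICATE" {
			continue
		}
		cert, err := x509.ParseCertificate(block.Bytes)
		if err != nil {
			// Surface which bundle and which entry failed, plus the parser
			// error (e.g. "x509: malformed algorithm identifier"), instead of
			// a generic "Malformed Cert, not syncing" info message.
			klog.Warningf("malformed certificate #%d in bundle %q: %v", i, bundleName, err)
			continue
		}
		klog.V(4).Infof("bundle %q: certificate #%d (subject=%s) parsed OK", bundleName, i, cert.Subject)
	}
}
~~~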
Please review the following PR: https://github.com/openshift/machine-api-provider-aws/pull/102
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/csi-livenessprobe/pull/65
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
OCP/RHCOS system daemons like ovs-vswitchd (revalidator process) use the same vCPUs (from the isolated vCPU pool) that are already reserved by CPU Manager for CNF workloads, causing intermittent CNF workload performance issues (and also vCPU-level overload). Note: NCP 23.11 uses CPU Manager with the static policy and Topology Manager set to "single-numa-node". Also, specific isolated and reserved vCPU pools have been defined.
Version-Release number of selected component (if applicable):
4.14.22
How reproducible:
Intermittent at customer environment.
Steps to Reproduce:
1. 2. 3.
Actual results:
ovs-vswitchd is using isolated CPUs
Expected results:
ovs-vswitchd to use only reserved CPUs
Additional info:
We want to understand if the customer is hitting this bug: https://issues.redhat.com/browse/OCPBUGS-32407 The bug was fixed in 4.14.25; the customer cluster is 4.14.22. The customer is also asking if it is possible to get a private fix since they cannot update at the moment. All case files have been yanked at both the US and EU instances of Supportshell. In case the case updates or attachments are not accessible, please let me know.
Please review the following PR: https://github.com/openshift/cluster-network-operator/pull/2381
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Tracker issue for bootimage bump in 4.17. This issue should block issues which need a bootimage bump to fix.
Description of problem:
The TestAllowedSourceRangesStatus test is flaking with the error: allowed_source_ranges_test.go:197: expected the annotation to be reflected in status.allowedSourceRanges: timed out waiting for the condition I also notice it sometimes coincides with a TestScopeChange error. It may be related to LoadBalancer-type update operations, for example, https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_cluster-ingress-operator/978/pull-ci-openshift-cluster-ingress-operator-master-e2e-azure-operator/1800249453098045440
Version-Release number of selected component (if applicable):
4.17
How reproducible:
~25-50%
Steps to Reproduce:
1. Run cluster-ingress-operator TestAllowedSourceRangesStatus E2E tests 2. 3.
Actual results:
Test is flaking
Expected results:
Test shouldn't flake
Additional info:
Description of problem:
Geneve port has not been created for a set of nodes.
~~~
[arghosh@supportshell-1 03826869]$ omg get nodes |grep -v NAME|wc -l
83
~~~
~~~
# crictl exec -ti `crictl ps --name nbdb -q` ovn-nbctl show transit_switch | grep tstor-prd-fc-shop09a | wc -l
73
# crictl exec -ti `crictl ps --name nbdb -q` ovn-sbctl list chassis | grep -c ^hostname
41
# ovs-appctl ofproto/list-tunnels | wc -l
40
~~~
Version-Release number of selected component (if applicable):
4.14.17
How reproducible:
Not Sure
Steps to Reproduce:
1. 2. 3.
Actual results:
POD to POD connectivity issue when PODs are hosted on different nodes
Expected results:
POD to POD connectivity should work fine
Additional info:
As per the customer, https://github.com/openshift/ovn-kubernetes/pull/2179 resolves the issue.
This story is to track the i18n upload/download routine tasks which are performed every sprint.
A.C.
- Upload strings to Memsource at the start of the sprint and reach out to the localization team
- Download translated strings from Memsource when they are ready
- Review the translated strings and open a pull request
- Open a followup story for next sprint
Please review the following PR: https://github.com/openshift/csi-operator/pull/227
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/cluster-monitoring-operator/pull/2372
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Tracker issue for bootimage bump in 4.17. This issue should block issues which need a bootimage bump to fix.
The previous bump was OCPBUGS-34692.
Refactor the name to Dockerfile.ocp as a better alternative to Dockerfile.rhel7, since the contents are actually RHEL 9.
Possibly reviving OCPBUGS-10771, the control-plane-machine-set ClusterOperator occasionally goes Available=False with reason=UnavailableReplicas. For example, this run includes:
: [bz-Cloud Compute] clusteroperator/control-plane-machine-set should not change condition/Available expand_less 1h34m30s { 3 unexpected clusteroperator state transitions during e2e test run. These did not match any known exceptions, so they cause this test-case to fail: Oct 03 22:03:29.822 - 106s E clusteroperator/control-plane-machine-set condition/Available reason/UnavailableReplicas status/False Missing 1 available replica(s) Oct 03 22:08:34.162 - 98s E clusteroperator/control-plane-machine-set condition/Available reason/UnavailableReplicas status/False Missing 1 available replica(s) Oct 03 22:13:01.645 - 118s E clusteroperator/control-plane-machine-set condition/Available reason/UnavailableReplicas status/False Missing 1 available replica(s)
But those are the nodes rebooting into newer RHCOS, and they do not warrant immediate admin intervention. Teaching the CPMS operator to stay Available=True for this kind of brief hiccup, while still going Available=False for issues where at least part of the component is non-functional and the condition requires immediate administrator intervention, would make it easier for admins and SREs operating clusters to identify when intervention was required.
4.15. Possibly all supported versions of the CPMS operator have this exposure.
Looks like many (all?) 4.15 update jobs have near-100% reproducibility for some kind of issue with CPMS going Available=False; see Actual results below. These are likely for reasons that do not require admin intervention, although figuring that out is tricky today. Feel free to push back if you feel that some of these do warrant immediate admin intervention.
w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=48h&type=junit&search=clusteroperator/control-plane-machine-set+should+not+change+condition/Available' | grep '^periodic-.*4[.]15.*failures match' | sort
periodic-ci-openshift-cluster-etcd-operator-release-4.15-periodics-e2e-aws-etcd-recovery (all) - 2 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-multiarch-master-nightly-4.15-ocp-e2e-aws-ovn-heterogeneous-upgrade (all) - 19 runs, 42% failed, 225% of failures match = 95% impact
periodic-ci-openshift-multiarch-master-nightly-4.15-ocp-e2e-upgrade-aws-ovn-arm64 (all) - 18 runs, 61% failed, 127% of failures match = 78% impact
periodic-ci-openshift-multiarch-master-nightly-4.15-upgrade-from-nightly-4.14-ocp-e2e-aws-sdn-arm64 (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-multiarch-master-nightly-4.15-upgrade-from-stable-4.14-ocp-e2e-aws-ovn-heterogeneous-upgrade (all) - 19 runs, 47% failed, 200% of failures match = 95% impact
periodic-ci-openshift-multiarch-master-nightly-4.15-upgrade-from-stable-4.14-ocp-e2e-aws-sdn-arm64 (all) - 9 runs, 78% failed, 114% of failures match = 89% impact
periodic-ci-openshift-release-master-ci-4.15-e2e-aws-ovn-upgrade (all) - 11 runs, 64% failed, 143% of failures match = 91% impact
periodic-ci-openshift-release-master-ci-4.15-e2e-azure-ovn-upgrade (all) - 70 runs, 41% failed, 207% of failures match = 86% impact
periodic-ci-openshift-release-master-ci-4.15-e2e-azure-sdn-upgrade (all) - 7 runs, 43% failed, 200% of failures match = 86% impact
periodic-ci-openshift-release-master-ci-4.15-e2e-gcp-ovn (all) - 6 runs, 50% failed, 33% of failures match = 17% impact
periodic-ci-openshift-release-master-ci-4.15-e2e-gcp-ovn-upgrade (all) - 71 runs, 24% failed, 382% of failures match = 92% impact
periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-aws-ovn-upgrade (all) - 70 runs, 30% failed, 281% of failures match = 84% impact
periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-aws-sdn-upgrade (all) - 8 runs, 50% failed, 175% of failures match = 88% impact
periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-azure-sdn-upgrade (all) - 71 runs, 38% failed, 233% of failures match = 89% impact
periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-gcp-ovn-rt-upgrade (all) - 69 runs, 49% failed, 171% of failures match = 84% impact
periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-gcp-ovn-upgrade (all) - 7 runs, 57% failed, 175% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-gcp-sdn-upgrade (all) - 6 runs, 33% failed, 250% of failures match = 83% impact
periodic-ci-openshift-release-master-nightly-4.15-e2e-aws-sdn-upgrade (all) - 63 runs, 37% failed, 222% of failures match = 81% impact
periodic-ci-openshift-release-master-nightly-4.15-e2e-gcp-sdn-upgrade (all) - 6 runs, 33% failed, 250% of failures match = 83% impact
periodic-ci-openshift-release-master-nightly-4.15-upgrade-from-stable-4.14-e2e-aws-sdn-upgrade (all) - 7 runs, 43% failed, 233% of failures match = 100% impact
periodic-ci-openshift-release-master-okd-4.15-e2e-aws-ovn-upgrade (all) - 13 runs, 54% failed, 100% of failures match = 54% impact
periodic-ci-openshift-release-master-okd-scos-4.15-e2e-aws-ovn-upgrade (all) - 16 runs, 63% failed, 90% of failures match = 56% impact
CPMS goes Available=False if and only if immediate admin intervention is appropriate.
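For illustration only, a rough Go sketch of tolerating brief replica unavailability before reporting Available=False; the grace period, reasons, and wiring are assumptions, not the actual control-plane-machine-set operator code:
~~~
// Only flip Available to False once an outage has lasted long enough to
// plausibly need admin intervention, rather than during a normal reboot.
package availability

import (
	"time"

	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

const unavailableGrace = 10 * time.Minute // assumed threshold

func setAvailableCondition(conditions *[]metav1.Condition, missingReplicas int, unavailableSince, now time.Time) {
	cond := metav1.Condition{
		Type:    "Available",
		Status:  metav1.ConditionTrue,
		Reason:  "AsExpected",
		Message: "all replicas available",
	}
	if missingReplicas > 0 && now.Sub(unavailableSince) > unavailableGrace {
		cond = metav1.Condition{
			Type:    "Available",
			Status:  metav1.ConditionFalse,
			Reason:  "UnavailableReplicas",
			Message: "replica unavailable beyond grace period; admin intervention may be required",
		}
	}
	meta.SetStatusCondition(conditions, cond)
}
~~~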
Description of problem:
After https://github.com/openshift/cluster-kube-controller-manager-operator/pull/804 was merged, the controller no longer updates the secret type, and as a result the owner label is no longer added. This PR would ensure the secret is created with this label.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
This is a clone of issue OCPBUGS-38951. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-43567. The following is the description of the original issue:
—
Description of problem:
With the newer azure-sdk-for-go replacing go-autorest, the switch to ClientCertificateCredential no longer includes the `SendCertificateChain` option that go-autorest set by default. The ARO team requires this to be set; otherwise the first-party (1P) integration for SNI will not work. Old version: https://github.com/Azure/go-autorest/blob/f7ea664c9cff3a5257b6dbc4402acadfd8be79f1/autorest/adal/token.go#L262-L264 New version: https://github.com/openshift/installer-aro/pull/37/files#diff-da950a4ddabbede621d9d3b1058bb34f8931c89179306ee88a0e4d76a4cf0b13R294
Version-Release number of selected component (if applicable):
This was introduced in the OpenShift installer PR: https://github.com/openshift/installer/pull/6003
How reproducible:
Every time we authenticate using SNI in Azure.
Steps to Reproduce:
1. Configure a service principal in the Microsoft tenant using SNI 2. Attempt to run the installer using client-certificate credentials, with the credentials mode set to Manual, to install a cluster
Actual results:
Installation fails as we're unable to authenticate using SNI.
Expected results:
We're able to authenticate using SNI.
Additional info:
This should not have any effect on existing non-SNI-based authentication methods using client certificate credentials. It was previously set by default in autorest for golang, but is not defaulted in the newer azure-sdk-for-go. Note that only first-party Microsoft services will be able to leverage SNI in Microsoft tenants. The test case for this on the installer side would be to ensure it doesn't break manual credential mode installs using a certificate pinned to a service principal.
All we would need changed is to pass the `SendCertificateChain: true` option only on client certificate credentials. Ideally we could also backport this to all OpenShift versions that received the migration from AAD to Microsoft Graph.
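For illustration, a minimal sketch of constructing the credential with the chain enabled, assuming the azidentity ClientCertificateCredential API; this is not the actual installer change:
~~~
// Sketch: build an Azure client-certificate credential that includes the
// full x5c chain in the client assertion, mirroring the old go-autorest
// default, so Microsoft first-party (SNI) authentication can validate it.
package sni

import (
	"crypto"
	"crypto/x509"

	"github.com/Azure/azure-sdk-for-go/sdk/azidentity"
)

func newSNICredential(tenantID, clientID string, certs []*x509.Certificate, key crypto.PrivateKey) (*azidentity.ClientCertificateCredential, error) {
	return azidentity.NewClientCertificateCredential(tenantID, clientID, certs, key,
		&azidentity.ClientCertificateCredentialOptions{
			// The option the bug asks to enable on client certificate credentials.
			SendCertificateChain: true,
		})
}
~~~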
Currently we download and install the RPM on each build of the upi-installer image. This has caused random timeouts. Determine if it is possible to:
and either copy the tar or follow the steps in the initial container.
Acceptance Criteria:
Description of problem:
Mirroring sometimes fails for various reasons, and when it does, the current code does not generate IDMS and ITMS files. Even if the user retries the mirror two or three times, the operators do not get mirrored and no resources are created to make use of the operators that have already been mirrored. This bug is to generate the IDMS and ITMS files even if mirroring fails.
Version-Release number of selected component (if applicable):
4.16
How reproducible:
Always
Steps to Reproduce:
1. Install latest oc-mirror 2. Use the ImageSetConfig.yaml below:
~~~
apiVersion: mirror.openshift.io/v1alpha2
kind: ImageSetConfiguration
archiveSize: 4
mirror:
  operators:
  - catalog: registry.redhat.io/redhat/certified-operator-index:v4.15
    full: false # only mirror the latest versions
  - catalog: registry.redhat.io/redhat/redhat-operator-index:v4.15
    full: false # only mirror the latest versions
~~~
3. Mirror using the command `oc-mirror -c config.yaml docker://localhost:5000/m2m --dest-skip-verify=false --workspace=file://test`
Actual results:
Mirroring fails and does not generate any idms or itms files
Expected results:
IDMS and ITMS files should be generated for the mirrored operators, even if mirroring fails
Additional info:
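For illustration, the requested behavior amounts to emitting the IDMS/ITMS resources on the error path as well. A minimal Go sketch with hypothetical helper names (not the oc-mirror code):
~~~
// Pattern sketch: write the cluster resources for whatever was mirrored
// successfully, even when the overall mirror run returns an error.
package mirror

import "log"

type mirrorResult struct {
	MirroredImages []string
}

func runMirror(doMirror func() (mirrorResult, error), generateIDMS, generateITMS func(mirrorResult) error) error {
	res, mirrorErr := doMirror()
	// Generate resources regardless of mirrorErr so partially mirrored
	// operators remain usable on the cluster.
	if err := generateIDMS(res); err != nil {
		log.Printf("failed to write IDMS: %v", err)
	}
	if err := generateITMS(res); err != nil {
		log.Printf("failed to write ITMS: %v", err)
	}
	return mirrorErr
}
~~~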
Please review the following PR: https://github.com/openshift/gcp-pd-csi-driver-operator/pull/123
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
~~~
ovnkube-master-b5dwz   5/6   CrashLoopBackOff   15 (4m49s ago)   75m
ovnkube-master-dm6g5   5/6   CrashLoopBackOff   15 (3m50s ago)   72m
ovnkube-master-lzltc   5/6   CrashLoopBackOff   16 (31s ago)     76m
~~~
Relevant logs:
1 ovnkube.go:369] failed to start network controller manager: failed to start default network controller: failed to sync address sets on controller init: failed to transact address set sync ops: error in transact with ops [{Op:insert Table:Address_Set Row:map[addresses:{GoSet:[172.21.4.58 172.30.113.119 172.30.113.93 172.30.140.204 172.30.184.23 172.30.20.1 172.30.244.26 172.30.250.254 172.30.29.56 172.30.39.131 172.30.54.87 172.30.54.93 172.30.70.9]} external_ids:{GoMap:map[direction:ingress gress-index:0 ip-family:v4 ...]} log:false match:ip4.src == {$a10011776377603330168, $a10015887742824209439, $a10026019104056290237, $a10029515256826812638, $a5952808452902781817, $a10084011578527782670, $a10086197949337628055, $a10093706521660045086, $a10096260576467608457, $a13012332091214445736, $a10111277808835218114, $a10114713358929465663, $a101155018460287381, $a16191032114896727480, $a14025182946114952022, $a10127722282178953052, $a4829957937622968220, $a10131833063630260035, $a3533891684095375041, $a7785003721317615588, $a10594480726457361847, $a10147006001458235329, $a12372228123457253136, $a10016996505620670018, $a10155660392008449200, $a10155926828030234078, $a15442683337083171453, $a9765064908646909484, $a7550609288882429832, $a11548830526886645428, $a10204075722023637394, $a10211228835433076965, $a5867828639604451547, $a10222049254704513272, $a13856077787103972722, $a11903549070727627659,.... (this is a very long list of ACL)
This is a clone of issue OCPBUGS-39339. The following is the description of the original issue:
—
Description of problem:
The issue comes from https://issues.redhat.com/browse/OCPBUGS-37540?focusedId=25386451&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-25386451. An error message is shown when gathering the bootstrap log bundle, even though the log bundle gzip file is generated: ERROR Invalid log bundle or the bootstrap machine could not be reached and bootstrap logs were not collected.
Version-Release number of selected component (if applicable):
4.17+
How reproducible:
Always
Steps to Reproduce:
1. Run `openshift-install gather bootstrap --dir <install-dir>` 2. 3.
Actual results:
Error message shown in output of command `openshift-install gather bootstrap --dir <install-dir>`
Expected results:
No error message shown there.
Additional info:
Analysis from Rafael, https://issues.redhat.com/browse/OCPBUGS-37540?focusedId=25387767&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-25387767
Rebase openshift/etcd to latest 3.5.16 upstream release.
This looks very much like a 'downstream a thing' process, but only making a modification to an existing one.
Currently, the operator-framework-olm monorepo generates a self-hosting catalog from operator-registry.Dockerfile. This image also contains cross-compiled opm binaries for windows and mac, and joins the payload as ose-operator-registry.
To separate concerns, this introduces a new operator-framework-cli image which will be based on scratch, not self-hosting in any way, and is just a container to convey repeatably produced operator-framework CLIs. Right now, this will focus on opm for OLM v0 only, but others can be added in the future.
Description of problem:
NodePool machine instances are failing to join a HostedCluster. The nodepool status reports InvalidConfig with the following condition: - lastTransitionTime: "2024-05-02T15:08:58Z" message: 'Failed to generate payload: error getting ignition payload: machine-config-server configmap is out of date, waiting for update 5c59871d != 48a6b276' observedGeneration: 1 reason: InvalidConfig status: "" type: ValidGeneratedPayload
Version-Release number of selected component (if applicable):
4.14.21 (HostedCluster), with HyperShift operator db9d81eb56e35b145bbbd878bbbcf742c9e75be2
How reproducible:
100%
Steps to Reproduce:
* Create a ROSA HCP cluster
* Wait for the nodepools to come up
* Add an IdP
* Delete ignition-server pods in the hostedcontrolplane namespace on the management cluster
* Confirm nodepools complain about machine-config-server/token-secret hash mismatch
* Scale down/up by deleting machine.cluster.x-k8s.io resources or otherwise
Actual results:
Nodes are not created
Expected results:
Node is created
Additional info:
The AWSMachine resources were created along with the corresponding ec2 instances. However, they were never ignited. Deleting AWSMachine resources, resulted in successful ignition of new nodes.
Note: Please find logs from both HCP namespaces in the comment https://issues.redhat.com/browse/OCPBUGS-33377?focusedId=24690046&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-24690046
Description of problem:
Subnets created by the installer are tagged with kubernetes.io/cluster/<infra_id> set to 'shared' instead of 'owned'.
Version-Release number of selected component (if applicable):
4.16.z
How reproducible:
Any time a 4.16 cluster is installed
Steps to Reproduce:
1. Install a fresh 4.16 cluster without providing an existing VPC.
Actual results:
Subnets are tagged with kubernetes.io/cluster/<infra_id>: shared
Expected results:
Subnets created by the installer are tagged with kubernetes.io/cluster/<infra_id>: owned
Additional info:
Slack discussion here - https://redhat-internal.slack.com/archives/C68TNFWA2/p1720728359424529
Description of problem:
Some events have time-related information set to null (firstTimestamp, lastTimestamp, eventTime)
Version-Release number of selected component (if applicable):
cluster-logging.v5.8.0
How reproducible:
100%
Steps to Reproduce:
1. Stop one of the masters 2. Start the master 3. Wait until the ENV stabilizes 4. oc get events -A | grep unknown
Actual results:
oc get events -A | grep unknow default <unknown> Normal TerminationStart namespace/kube-system Received signal to terminate, becoming unready, but keeping serving default <unknown> Normal TerminationPreShutdownHooksFinished namespace/kube-system All pre-shutdown hooks have been finished default <unknown> Normal TerminationMinimalShutdownDurationFinished namespace/kube-system The minimal shutdown duration of 0s finished ....
Expected results:
All time related information is set correctly
Additional info:
This causes issues with external monitoring systems. Events with no timestamp never show up, or push other events out of the view, depending on the sort order of the timestamp. The operator of the environment then has trouble seeing what is happening.
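As an illustration, a small client-go sketch (assumed to run with a pre-built clientset; not part of the product) that lists the events with no time information, i.e. the ones rendered as <unknown> by `oc get events`:
~~~
// List events whose firstTimestamp, lastTimestamp, and eventTime are all unset.
package eventcheck

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

func listTimelessEvents(ctx context.Context, client kubernetes.Interface) error {
	events, err := client.CoreV1().Events(metav1.NamespaceAll).List(ctx, metav1.ListOptions{})
	if err != nil {
		return err
	}
	for _, ev := range events.Items {
		if ev.FirstTimestamp.IsZero() && ev.LastTimestamp.IsZero() && ev.EventTime.IsZero() {
			fmt.Printf("%s/%s: %s %s\n", ev.Namespace, ev.Name, ev.Reason, ev.Message)
		}
	}
	return nil
}
~~~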
Description of problem:
During HyperShift operator updates/rollouts, previous ignition-server token and user-data secrets are not properly cleaned up, causing them to be abandoned on the control plane.
Version-Release number of selected component (if applicable):
4.15.6+
How reproducible:
100%
Steps to Reproduce:
1. Deploy hypershift-operator <4.15.6 2. Create HostedCluster and NodePool 3. Update hypershift-operator to 4.15.8+
Actual results:
Previous token and user-data secrets are now unmanaged and abandoned
Expected results:
HyperShift operator to properly clean them up
Additional info:
Introduced by https://github.com/openshift/hypershift/pull/3730
`openshift-tests` doesn't have an easy way to figure out what version it's running; not every subcommand prints it out.
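A minimal sketch of what such a subcommand could look like using cobra; the variable name and the build-time injection are assumptions, not the actual openshift-tests wiring:
~~~
// Sketch: add a "version" subcommand so every build can report what it is.
package main

import (
	"fmt"

	"github.com/spf13/cobra"
)

// version would normally be injected at build time via
// -ldflags "-X main.version=...". "unknown" is the fallback.
var version = "unknown"

func main() {
	root := &cobra.Command{Use: "openshift-tests"}
	root.AddCommand(&cobra.Command{
		Use:   "version",
		Short: "Print the version this binary was built from",
		Run: func(cmd *cobra.Command, args []string) {
			fmt.Println(version)
		},
	})
	if err := root.Execute(); err != nil {
		fmt.Println(err)
	}
}
~~~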
This is a clone of issue OCPBUGS-42546. The following is the description of the original issue:
—
Description of problem:
When a MachineConfig fails to generate, we set upgradeable=false and degrade the pools. The expectation is that the CO would also degrade after some time (normally 30 minutes) since the master pool is degraded, but that doesn't seem to be happening. Based on our initial investigation, the event/degrade is happening but it appears to be getting cleared.
Version-Release number of selected component (if applicable):
4.18
How reproducible:
Should be always
Steps to Reproduce:
1. Apply a wrong config, such as a bad image.config object:
~~~
spec:
  registrySources:
    allowedRegistries:
    - test.reg
    blockedRegistries:
    - blocked.reg
~~~
2. upgrade the cluster or roll out a new MCO pod 3. observe that pools are degraded but the CO isn't
Actual results:
Expected results:
Additional info:
Currently, several of our projects are using registry.ci.openshift.org/ocp/builder:rhel-9-golang-1.22-openshift-4.17 (or other versions of images from that family) as part of their build.
This image has more tooling in it and is more closely aligned with what is used for building shipping images:
registry.ci.openshift.org/openshift/release:rhel-9-release-golang-1.22-openshift-4.17
As an OpenShift developer, I would like to use the same builder images across our team's builds where possible to reduce confusion. Please change all non-UBI builds to use the openshift/release image instead of the ocp/builder image in these repos:
https://github.com/openshift/vertical-pod-autoscaler-operator
https://github.com/openshift/kubernetes-autoscaler (VPA images only)
https://github.com/openshift/cluster-resource-override-admission-operator
https://github.com/openshift/cluster-resource-override-admission
Also update the main branch to match images of any CI builds that are changed:
https://github.com/openshift/release
When switching from ipForwarding: Global to Restricted, sysctl settings are not adjusted
Switch from:
# oc edit network.operator/cluster apiVersion: operator.openshift.io/v1 kind: Network metadata: annotations: networkoperator.openshift.io/ovn-cluster-initiator: 10.19.1.66 creationTimestamp: "2023-11-22T12:14:46Z" generation: 207 name: cluster resourceVersion: "235152" uid: 225d404d-4e26-41bf-8e77-4fc44948f239 spec: clusterNetwork: - cidr: 10.128.0.0/14 hostPrefix: 23 defaultNetwork: ovnKubernetesConfig: egressIPConfig: {} gatewayConfig: ipForwarding: Global (...)
To:
# oc edit network.operator/cluster apiVersion: operator.openshift.io/v1 kind: Network metadata: annotations: networkoperator.openshift.io/ovn-cluster-initiator: 10.19.1.66 creationTimestamp: "2023-11-22T12:14:46Z" generation: 207 name: cluster resourceVersion: "235152" uid: 225d404d-4e26-41bf-8e77-4fc44948f239 spec: clusterNetwork: - cidr: 10.128.0.0/14 hostPrefix: 23 defaultNetwork: ovnKubernetesConfig: egressIPConfig: {} gatewayConfig: ipForwarding: Restricted
You'll see that the pods are updated:
# oc get pods -o yaml -n openshift-ovn-kubernetes ovnkube-node-fnl9z | grep sysctl -C10 fi admin_network_policy_enabled_flag= if [[ "false" == "true" ]]; then admin_network_policy_enabled_flag="--enable-admin-network-policy" fi # If IP Forwarding mode is global set it in the host here. ip_forwarding_flag= if [ "Restricted" == "Global" ]; then sysctl -w net.ipv4.ip_forward=1 sysctl -w net.ipv6.conf.all.forwarding=1 else ip_forwarding_flag="--disable-forwarding" fi NETWORK_NODE_IDENTITY_ENABLE= if [[ "true" == "true" ]]; then NETWORK_NODE_IDENTITY_ENABLE=" --bootstrap-kubeconfig=/var/lib/kubelet/kubeconfig --cert-dir=/etc/ovn/ovnkube-node-certs --cert-duration=24h
And that ovnkube correctly takes the settings:
# ps aux | grep disable-for root 74963 0.3 0.0 8085828 153464 ? Ssl Nov22 3:38 /usr/bin/ovnkube --init-ovnkube-controller master1.site1.r450.org --init-node master1.site1.r450.org --config-file=/run/ovnkube-config/ovnkube.conf --ovn-empty-lb-events --loglevel 4 --inactivity-probe=180000 --gateway-mode shared --gateway-interface br-ex --metrics-bind-address 127.0.0.1:29103 --ovn-metrics-bind-address 127.0.0.1:29105 --metrics-enable-pprof --metrics-enable-config-duration --export-ovs-metrics --disable-snat-multiple-gws --enable-multi-network --enable-multicast --zone master1.site1.r450.org --enable-interconnect --acl-logging-rate-limit 20 --enable-multi-external-gateway=true --disable-forwarding --bootstrap-kubeconfig=/var/lib/kubelet/kubeconfig --cert-dir=/etc/ovn/ovnkube-node-certs --cert-duration=24h root 2096007 0.0 0.0 3880 2144 pts/0 S+ 10:07 0:00 grep --color=auto disable-for
But sysctls are never restricted:
[root@master1 ~]# sysctl -a | grep forward net.ipv4.conf.0eca9d9e7fd3231.bc_forwarding = 0 net.ipv4.conf.0eca9d9e7fd3231.forwarding = 1 net.ipv4.conf.0eca9d9e7fd3231.mc_forwarding = 0 net.ipv4.conf.21a32cf76c3bcdf.bc_forwarding = 0 net.ipv4.conf.21a32cf76c3bcdf.forwarding = 1 net.ipv4.conf.21a32cf76c3bcdf.mc_forwarding = 0 net.ipv4.conf.22f9bca61beeaba.bc_forwarding = 0 net.ipv4.conf.22f9bca61beeaba.forwarding = 1 net.ipv4.conf.22f9bca61beeaba.mc_forwarding = 0 net.ipv4.conf.2ee438a7201c1f7.bc_forwarding = 0 net.ipv4.conf.2ee438a7201c1f7.forwarding = 1 net.ipv4.conf.2ee438a7201c1f7.mc_forwarding = 0 net.ipv4.conf.3560ce219f7b591.bc_forwarding = 0 net.ipv4.conf.3560ce219f7b591.forwarding = 1 net.ipv4.conf.3560ce219f7b591.mc_forwarding = 0 net.ipv4.conf.507c81eb9944c2e.bc_forwarding = 0 net.ipv4.conf.507c81eb9944c2e.forwarding = 1 net.ipv4.conf.507c81eb9944c2e.mc_forwarding = 0 net.ipv4.conf.6278633ca74482f.bc_forwarding = 0 net.ipv4.conf.6278633ca74482f.forwarding = 1 net.ipv4.conf.6278633ca74482f.mc_forwarding = 0 net.ipv4.conf.68b572ce18f3b82.bc_forwarding = 0 net.ipv4.conf.68b572ce18f3b82.forwarding = 1 net.ipv4.conf.68b572ce18f3b82.mc_forwarding = 0 net.ipv4.conf.7291c80dd47a6f3.bc_forwarding = 0 net.ipv4.conf.7291c80dd47a6f3.forwarding = 1 net.ipv4.conf.7291c80dd47a6f3.mc_forwarding = 0 net.ipv4.conf.76abdac44c6aee7.bc_forwarding = 0 net.ipv4.conf.76abdac44c6aee7.forwarding = 1 net.ipv4.conf.76abdac44c6aee7.mc_forwarding = 0 net.ipv4.conf.7f9abb486611f68.bc_forwarding = 0 net.ipv4.conf.7f9abb486611f68.forwarding = 1 net.ipv4.conf.7f9abb486611f68.mc_forwarding = 0 net.ipv4.conf.8cd86bfb8ea635f.bc_forwarding = 0 net.ipv4.conf.8cd86bfb8ea635f.forwarding = 1 net.ipv4.conf.8cd86bfb8ea635f.mc_forwarding = 0 net.ipv4.conf.8e87bd3f6ddc9f8.bc_forwarding = 0 net.ipv4.conf.8e87bd3f6ddc9f8.forwarding = 1 net.ipv4.conf.8e87bd3f6ddc9f8.mc_forwarding = 0 net.ipv4.conf.91079c8f5c1630f.bc_forwarding = 0 net.ipv4.conf.91079c8f5c1630f.forwarding = 1 net.ipv4.conf.91079c8f5c1630f.mc_forwarding = 0 net.ipv4.conf.92e754a12836f63.bc_forwarding = 0 net.ipv4.conf.92e754a12836f63.forwarding = 1 net.ipv4.conf.92e754a12836f63.mc_forwarding = 0 net.ipv4.conf.a5c01549a6070ab.bc_forwarding = 0 net.ipv4.conf.a5c01549a6070ab.forwarding = 1 net.ipv4.conf.a5c01549a6070ab.mc_forwarding = 0 net.ipv4.conf.a621d1234f0f25a.bc_forwarding = 0 net.ipv4.conf.a621d1234f0f25a.forwarding = 1 net.ipv4.conf.a621d1234f0f25a.mc_forwarding = 0 net.ipv4.conf.all.bc_forwarding = 0 net.ipv4.conf.all.forwarding = 1 net.ipv4.conf.all.mc_forwarding = 0 net.ipv4.conf.br-ex.bc_forwarding = 0 net.ipv4.conf.br-ex.forwarding = 1 net.ipv4.conf.br-ex.mc_forwarding = 0 net.ipv4.conf.br-int.bc_forwarding = 0 net.ipv4.conf.br-int.forwarding = 1 net.ipv4.conf.br-int.mc_forwarding = 0 net.ipv4.conf.c3f3da187245cf6.bc_forwarding = 0 net.ipv4.conf.c3f3da187245cf6.forwarding = 1 net.ipv4.conf.c3f3da187245cf6.mc_forwarding = 0 net.ipv4.conf.c7e518fff8ff973.bc_forwarding = 0 net.ipv4.conf.c7e518fff8ff973.forwarding = 1 net.ipv4.conf.c7e518fff8ff973.mc_forwarding = 0 net.ipv4.conf.d17c6fb6d3dd021.bc_forwarding = 0 net.ipv4.conf.d17c6fb6d3dd021.forwarding = 1 net.ipv4.conf.d17c6fb6d3dd021.mc_forwarding = 0 net.ipv4.conf.default.bc_forwarding = 0 net.ipv4.conf.default.forwarding = 1 net.ipv4.conf.default.mc_forwarding = 0 net.ipv4.conf.eno8303.bc_forwarding = 0 net.ipv4.conf.eno8303.forwarding = 1 net.ipv4.conf.eno8303.mc_forwarding = 0 net.ipv4.conf.eno8403.bc_forwarding = 0 net.ipv4.conf.eno8403.forwarding = 1 net.ipv4.conf.eno8403.mc_forwarding = 0 
net.ipv4.conf.ens1f0.bc_forwarding = 0 net.ipv4.conf.ens1f0.forwarding = 1 net.ipv4.conf.ens1f0.mc_forwarding = 0 net.ipv4.conf.ens1f0/3516.bc_forwarding = 0 net.ipv4.conf.ens1f0/3516.forwarding = 1 net.ipv4.conf.ens1f0/3516.mc_forwarding = 0 net.ipv4.conf.ens1f0/3517.bc_forwarding = 0 net.ipv4.conf.ens1f0/3517.forwarding = 1 net.ipv4.conf.ens1f0/3517.mc_forwarding = 0 net.ipv4.conf.ens1f0/3518.bc_forwarding = 0 net.ipv4.conf.ens1f0/3518.forwarding = 1 net.ipv4.conf.ens1f0/3518.mc_forwarding = 0 net.ipv4.conf.ens1f1.bc_forwarding = 0 net.ipv4.conf.ens1f1.forwarding = 1 net.ipv4.conf.ens1f1.mc_forwarding = 0 net.ipv4.conf.ens3f0.bc_forwarding = 0 net.ipv4.conf.ens3f0.forwarding = 1 net.ipv4.conf.ens3f0.mc_forwarding = 0 net.ipv4.conf.ens3f1.bc_forwarding = 0 net.ipv4.conf.ens3f1.forwarding = 1 net.ipv4.conf.ens3f1.mc_forwarding = 0 net.ipv4.conf.fcb6e9468a65d70.bc_forwarding = 0 net.ipv4.conf.fcb6e9468a65d70.forwarding = 1 net.ipv4.conf.fcb6e9468a65d70.mc_forwarding = 0 net.ipv4.conf.fcd96084b7f5a9a.bc_forwarding = 0 net.ipv4.conf.fcd96084b7f5a9a.forwarding = 1 net.ipv4.conf.fcd96084b7f5a9a.mc_forwarding = 0 net.ipv4.conf.genev_sys_6081.bc_forwarding = 0 net.ipv4.conf.genev_sys_6081.forwarding = 1 net.ipv4.conf.genev_sys_6081.mc_forwarding = 0 net.ipv4.conf.lo.bc_forwarding = 0 net.ipv4.conf.lo.forwarding = 1 net.ipv4.conf.lo.mc_forwarding = 0 net.ipv4.conf.ovn-k8s-mp0.bc_forwarding = 0 net.ipv4.conf.ovn-k8s-mp0.forwarding = 1 net.ipv4.conf.ovn-k8s-mp0.mc_forwarding = 0 net.ipv4.conf.ovs-system.bc_forwarding = 0 net.ipv4.conf.ovs-system.forwarding = 1 net.ipv4.conf.ovs-system.mc_forwarding = 0 net.ipv4.ip_forward = 1 net.ipv4.ip_forward_update_priority = 1 net.ipv4.ip_forward_use_pmtu = 0 net.ipv6.conf.0eca9d9e7fd3231.forwarding = 1 net.ipv6.conf.0eca9d9e7fd3231.mc_forwarding = 0 net.ipv6.conf.21a32cf76c3bcdf.forwarding = 1 net.ipv6.conf.21a32cf76c3bcdf.mc_forwarding = 0 net.ipv6.conf.22f9bca61beeaba.forwarding = 1 net.ipv6.conf.22f9bca61beeaba.mc_forwarding = 0 net.ipv6.conf.2ee438a7201c1f7.forwarding = 1 net.ipv6.conf.2ee438a7201c1f7.mc_forwarding = 0 net.ipv6.conf.3560ce219f7b591.forwarding = 1 net.ipv6.conf.3560ce219f7b591.mc_forwarding = 0 net.ipv6.conf.507c81eb9944c2e.forwarding = 1 net.ipv6.conf.507c81eb9944c2e.mc_forwarding = 0 net.ipv6.conf.6278633ca74482f.forwarding = 1 net.ipv6.conf.6278633ca74482f.mc_forwarding = 0 net.ipv6.conf.68b572ce18f3b82.forwarding = 1 net.ipv6.conf.68b572ce18f3b82.mc_forwarding = 0 net.ipv6.conf.7291c80dd47a6f3.forwarding = 1 net.ipv6.conf.7291c80dd47a6f3.mc_forwarding = 0 net.ipv6.conf.76abdac44c6aee7.forwarding = 1 net.ipv6.conf.76abdac44c6aee7.mc_forwarding = 0 net.ipv6.conf.7f9abb486611f68.forwarding = 1 net.ipv6.conf.7f9abb486611f68.mc_forwarding = 0 net.ipv6.conf.8cd86bfb8ea635f.forwarding = 1 net.ipv6.conf.8cd86bfb8ea635f.mc_forwarding = 0 net.ipv6.conf.8e87bd3f6ddc9f8.forwarding = 1 net.ipv6.conf.8e87bd3f6ddc9f8.mc_forwarding = 0 net.ipv6.conf.91079c8f5c1630f.forwarding = 1 net.ipv6.conf.91079c8f5c1630f.mc_forwarding = 0 net.ipv6.conf.92e754a12836f63.forwarding = 1 net.ipv6.conf.92e754a12836f63.mc_forwarding = 0 net.ipv6.conf.a5c01549a6070ab.forwarding = 1 net.ipv6.conf.a5c01549a6070ab.mc_forwarding = 0 net.ipv6.conf.a621d1234f0f25a.forwarding = 1 net.ipv6.conf.a621d1234f0f25a.mc_forwarding = 0 net.ipv6.conf.all.forwarding = 1 net.ipv6.conf.all.mc_forwarding = 0 net.ipv6.conf.br-ex.forwarding = 1 net.ipv6.conf.br-ex.mc_forwarding = 0 net.ipv6.conf.br-int.forwarding = 1 net.ipv6.conf.br-int.mc_forwarding = 0 
net.ipv6.conf.c3f3da187245cf6.forwarding = 1 net.ipv6.conf.c3f3da187245cf6.mc_forwarding = 0 net.ipv6.conf.c7e518fff8ff973.forwarding = 1 net.ipv6.conf.c7e518fff8ff973.mc_forwarding = 0 net.ipv6.conf.d17c6fb6d3dd021.forwarding = 1 net.ipv6.conf.d17c6fb6d3dd021.mc_forwarding = 0 net.ipv6.conf.default.forwarding = 1 net.ipv6.conf.default.mc_forwarding = 0 net.ipv6.conf.eno8303.forwarding = 1 net.ipv6.conf.eno8303.mc_forwarding = 0 net.ipv6.conf.eno8403.forwarding = 1 net.ipv6.conf.eno8403.mc_forwarding = 0 net.ipv6.conf.ens1f0.forwarding = 1 net.ipv6.conf.ens1f0.mc_forwarding = 0 net.ipv6.conf.ens1f0/3516.forwarding = 0 net.ipv6.conf.ens1f0/3516.mc_forwarding = 0 net.ipv6.conf.ens1f0/3517.forwarding = 0 net.ipv6.conf.ens1f0/3517.mc_forwarding = 0 net.ipv6.conf.ens1f0/3518.forwarding = 0 net.ipv6.conf.ens1f0/3518.mc_forwarding = 0 net.ipv6.conf.ens1f1.forwarding = 1 net.ipv6.conf.ens1f1.mc_forwarding = 0 net.ipv6.conf.ens3f0.forwarding = 1 net.ipv6.conf.ens3f0.mc_forwarding = 0 net.ipv6.conf.ens3f1.forwarding = 1 net.ipv6.conf.ens3f1.mc_forwarding = 0 net.ipv6.conf.fcb6e9468a65d70.forwarding = 1 net.ipv6.conf.fcb6e9468a65d70.mc_forwarding = 0 net.ipv6.conf.fcd96084b7f5a9a.forwarding = 1 net.ipv6.conf.fcd96084b7f5a9a.mc_forwarding = 0 net.ipv6.conf.genev_sys_6081.forwarding = 1 net.ipv6.conf.genev_sys_6081.mc_forwarding = 0 net.ipv6.conf.lo.forwarding = 1 net.ipv6.conf.lo.mc_forwarding = 0 net.ipv6.conf.ovn-k8s-mp0.forwarding = 1 net.ipv6.conf.ovn-k8s-mp0.mc_forwarding = 0 net.ipv6.conf.ovs-system.forwarding = 1 net.ipv6.conf.ovs-system.mc_forwarding = 0
It's logical that this is happening, because nowhere in the code is there a mechanism to tune the global sysctl back to 0 when the mode is switched from `Global` to `Restricted`. There's also no mechanism to sequentially reboot the nodes so that they'd reboot back to their defaults (= sysctl ip forward off).
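For illustration, a hypothetical sketch of the missing revert path, assuming the host sysctls are written directly under /proc/sys; this is not the actual ovn-kubernetes code, and it ignores the per-interface settings shown above:
~~~
// When ipForwarding switches from Global to Restricted, write the global
// forwarding sysctls back to 0 instead of leaving the old values in place.
package forwarding

import (
	"fmt"
	"os"
)

func setGlobalForwarding(enabled bool) error {
	val := "0"
	if enabled {
		val = "1"
	}
	for _, path := range []string{
		"/proc/sys/net/ipv4/ip_forward",
		"/proc/sys/net/ipv6/conf/all/forwarding",
	} {
		if err := os.WriteFile(path, []byte(val), 0o644); err != nil {
			return fmt.Errorf("writing %q to %s: %w", val, path, err)
		}
	}
	return nil
}
~~~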
Description of problem:
The KubeVirt passt network binding needs a global namespace to work; using the default namespace does not look like the best option. We should be able to deploy it in openshift-cnv and allow users to read the NADs there so they can use passt.
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. Create a nad at openshift-cnv namespace 2. Try to use that nad from non openshift-cnv pods 3.
Actual results:
Pods fail to start
Expected results:
Pods can start and use the NAD
Additional info:
Description of problem:
The DeploymentConfigs deprecation info alert is shown on the Edit Deployment form. It should be shown only on DeploymentConfig pages.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. Create a deployment 2. Open Edit deployment form from the actions menu 3.
Actual results:
The DeploymentConfigs deprecation info alert is present on the Edit Deployment form
Expected results:
DeploymentConfigs deprecation info alert should not be shown for the Deployment
Additional info:
This is a clone of issue OCPBUGS-29497. The following is the description of the original issue:
—
While updating an HC with controllerAvailabilityPolicy of SingleReplica, the HCP doesn't fully roll out, with 3 pods stuck in Pending:
~~~
multus-admission-controller-5b5c95684b-v5qgd   0/2   Pending   0   4m36s
network-node-identity-7b54d84df4-dxx27         0/3   Pending   0   4m12s
ovnkube-control-plane-647ffb5f4d-hk6fg         0/3   Pending   0   4m21s
~~~
This is because these deployments all have requiredDuringSchedulingIgnoredDuringExecution zone anti-affinity and maxUnavailable: 25% (i.e. 1).
Thus the old pod blocks scheduling of the new pod.
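For illustration only, one way to break this interaction is to let the old pod terminate before the replacement is created. The sketch below shows such a rolling-update strategy using the apps/v1 types; it illustrates the scheduling interaction, not the actual HyperShift fix:
~~~
// With required anti-affinity and a single replica, a surge pod can never
// schedule next to the old pod. Allowing one unavailable replica and no surge
// removes the old pod first, so the replacement is no longer blocked.
package rollout

import (
	appsv1 "k8s.io/api/apps/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

func removeBeforeCreateStrategy() appsv1.DeploymentStrategy {
	maxSurge := intstr.FromInt(0)
	maxUnavailable := intstr.FromInt(1)
	return appsv1.DeploymentStrategy{
		Type: appsv1.RollingUpdateDeploymentStrategyType,
		RollingUpdate: &appsv1.RollingUpdateDeployment{
			MaxSurge:       &maxSurge,
			MaxUnavailable: &maxUnavailable,
		},
	}
}
~~~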
This is a clone of issue OCPBUGS-43518. The following is the description of the original issue:
—
Description of problem:
Necessary security group rules are not created when using an installer-created VPC.
Version-Release number of selected component (if applicable):
4.17.2
How reproducible:
Easily
Steps to Reproduce:
1. Try to deploy a Power VS cluster and have the installer create the VPC, or remove required rules from a VPC you're bringing. 2. Control plane nodes fail to bootstrap. 3. Fail
Actual results:
Install fails
Expected results:
Install succeeds
Additional info:
Fix identified
Description of problem:
Documentation for User Workload Monitoring implies that default retention time is 15d, when it is actually 24h in practice
Version-Release number of selected component (if applicable):
4.12/4.13/4.14/4.15
How reproducible:
100%
Steps to Reproduce:
1. Install a cluster 2. enable user workload monitoring 3. check pod manifest and check for retention time
Actual results:
Retention time is 24h
Expected results:
Retention time is 15d instead of 24h
Additional info:
In the agent installer, assisted-service must always use the openshift-baremetal-installer binary (which is dynamically linked) to ensure that if the target cluster is in FIPS mode the installer will be able to run. (This was implemented in MGMT-15150.)
A recent change for OCPBUGS-33227 has switched to using the statically-linked openshift-installer for 4.16 and later. This breaks FIPS on the agent-based installer.
It appears that CI tests for the agent installer (the compact-ipv4 job runs with FIPS enabled) did not detect this, because we are unable to correctly determine the "version" of OpenShift being installed when it is in fact a CI payload.
Please review the following PR: https://github.com/openshift/cluster-kube-controller-manager-operator/pull/809
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
Failed to deploy baremetal cluster as cluster nodes are not introspected
Version-Release number of selected component (if applicable):
4.15.15
How reproducible:
periodically
Steps to Reproduce:
1. Deploy baremetal dualstack cluster with disabled provisioning network 2. 3.
Actual results:
Cluster fails to deploy as ironic.service fails to start on the bootstrap node:
~~~
[root@api ~]# systemctl status ironic.service
○ ironic.service - Ironic baremetal deployment service
   Loaded: loaded (/etc/containers/systemd/ironic.container; generated)
   Active: inactive (dead)
May 27 08:01:05 api.kni-qe-4.lab.eng.rdu2.redhat.com systemd[1]: Dependency failed for Ironic baremetal deployment service.
May 27 08:01:05 api.kni-qe-4.lab.eng.rdu2.redhat.com systemd[1]: ironic.service: Job ironic.service/start failed with result 'dependency'.
~~~
Expected results:
ironic.service is started, nodes are introspected and cluster is deployed
Additional info:
Description of problem:
`preserveBootstrapIgnition` was named after the implementation details in terraform for how to make deleting S3 objects optional. The motivation behind the change was that some customers run installs in subscriptions where policies do not allow deleting S3 objects. They didn't want the install to fail because of that. With the move from terraform to CAPI/CAPA, this is now implemented differently: CAPA always tries to delete the S3 objects but will ignore any permission errors if `preserveBootstrapIgnition` is set. We should rename this option so it's clear that the objects will be deleted if there are enough permissions. My suggestion is to name it something similar to what's used in CAPA: `allowBestEffortDeleteIgnition`. Ideally we should deprecate `preserveBootstrapIgnition` in 4.16 and remove it in 4.17.
Version-Release number of selected component (if applicable):
4.14+ but I don't think we want to change this for terraform-based installs
How reproducible:
always
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
https://github.com/openshift/installer/pull/7288
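For illustration, a sketch of what "best effort" deletion could look like under the suggested option name, using the aws-sdk-go S3 client; the option, bucket, and key names are assumptions rather than installer code:
~~~
// Always attempt the delete, but tolerate missing permissions when the user
// opted into best-effort cleanup of the bootstrap ignition object.
package cleanup

import (
	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/awserr"
	"github.com/aws/aws-sdk-go/service/s3"
	"github.com/aws/aws-sdk-go/service/s3/s3iface"
)

func deleteIgnitionObject(client s3iface.S3API, bucket, key string, allowBestEffortDeleteIgnition bool) error {
	_, err := client.DeleteObject(&s3.DeleteObjectInput{
		Bucket: aws.String(bucket),
		Key:    aws.String(key),
	})
	if err == nil {
		return nil
	}
	if aerr, ok := err.(awserr.Error); ok && aerr.Code() == "AccessDenied" && allowBestEffortDeleteIgnition {
		// Best effort: the object is left behind, but the install continues.
		return nil
	}
	return err
}
~~~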
Description of problem:
If I use custom CVO capabilities via the install config, I can create a capability set that disables the Ingress capability. However, once the cluster boots up, the Ingress capability will always be enabled. This creates a dissonance between the desired install config and what happens. It would be better to fail the install at install-config validation to prevent that dissonance.
Version-Release number of selected component (if applicable):
4.16
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
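For illustration, a minimal sketch of the suggested install-config validation; the function and the way capabilities are represented here are assumptions, not the actual installer validation code:
~~~
// Reject capability sets that would leave the Ingress capability disabled,
// so the install fails at validation time instead of silently re-enabling it.
package validation

import "fmt"

func validateCapabilities(enabled []string) error {
	for _, c := range enabled {
		if c == "Ingress" {
			return nil
		}
	}
	return fmt.Errorf("the Ingress capability cannot be disabled; use a baseline capability set or additionalEnabledCapabilities that includes it")
}
~~~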
This is a clone of issue OCPBUGS-38450. The following is the description of the original issue:
—
Description of problem:
Day-2 add node with the oc binary is not working for ARM64 on baremetal CI runs
Version-Release number of selected component (if applicable):
How reproducible:
always
Steps to Reproduce:
1. Run a compact agent installation on the arm64 platform 2. After the cluster is ready, run the day-2 install 3. The day-2 install fails with an error: worker-a-00 is not reachable
Actual results:
Day-2 install exits with an error.
Expected results:
Day-2 install should work.
Additional info:
Job link: https://qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gs/qe-private-deck/pr-logs/pull/openshift_release/54181/rehearse-54181-periodic-ci-openshift-openshift-tests-private-release-4.17-arm64-nightly-baremetal-compact-abi-ipv4-static-day2-f7/1823641309190033408 Error message from console when running day2 install: rsync: [sender] link_stat "/assets/node.x86_64.iso" failed: No such file or directory (2) command terminated with exit code 23 rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1823) [Receiver=3.2.3] rsync: [Receiver] write error: Broken pipe (32) error: exit status 23 {"component":"entrypoint","error":"wrapped process failed: exit status 1","file":"sigs.k8s.io/prow/pkg/entrypoint/run.go:84","func":"sigs.k8s.io/prow/pkg/entrypoint.Options.internalRun","level":"error","msg":"Error executing test process","severity":"error","time":"2024-08-13T14:32:20Z"} error: failed to execute wrapped command: exit status 1
Description of problem:
Compared to other COs, the MCO seems to be doing a lot more direct API calls to CO objects: https://gist.github.com/deads2k/227479c81e9a57af6c018711548e4600 Most of these are GETs but we are also doing a lot of UPDATE calls, neither of which should be all that necessary for us. The MCO pod seems to be doing a lot of direct GETs and no-op UPDATEs in older code, which we should clean up and bring down the count. Some more context in the slack thread: https://redhat-internal.slack.com/archives/C01CQA76KMX/p1712953161264199
Version-Release number of selected component (if applicable):
All
How reproducible:
Very
Steps to Reproduce:
1. look at e2e test bundles under /artifacts/junit/just-users-audit-log-summary__xxxxxx.json 2. 3.
Actual results:
Expected results:
Additional info:
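For illustration, a common pattern for trimming no-op UPDATE calls is to compare the desired and existing objects before writing. A hedged sketch using the openshift/client-go config clientset (names and wiring are assumptions, not the MCO code):
~~~
// Skip the UPDATE entirely when nothing changed, instead of writing back an
// identical object on every sync loop.
package noop

import (
	"context"

	configv1 "github.com/openshift/api/config/v1"
	configclient "github.com/openshift/client-go/config/clientset/versioned"
	"k8s.io/apimachinery/pkg/api/equality"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func syncClusterOperatorStatus(ctx context.Context, client configclient.Interface, desired *configv1.ClusterOperator) error {
	existing, err := client.ConfigV1().ClusterOperators().Get(ctx, desired.Name, metav1.GetOptions{})
	if err != nil {
		return err
	}
	if equality.Semantic.DeepEqual(existing.Status, desired.Status) {
		// No-op: nothing to write, no extra API call.
		return nil
	}
	existing.Status = desired.Status
	_, err = client.ConfigV1().ClusterOperators().UpdateStatus(ctx, existing, metav1.UpdateOptions{})
	return err
}
~~~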
Description of problem:
Now that PowerVS uses the upi-installer image, it is encountering the following error: mkdir: cannot create directory '/output/.ssh': Permission denied cp: cannot create regular file '/output/.ssh/': Not a directory
Version-Release number of selected component (if applicable):
4.17.0
How reproducible:
Always
Steps to Reproduce:
1. Look at CI run
This is a clone of issue OCPBUGS-44022. The following is the description of the original issue:
—
Description of problem:
We should decrease the verbosity level for the IBM CAPI module. This will affect the output of the file .openshift_install.log
As a dev, I want to be able to:
so that I can achieve
Description of criteria:
We initially mirror some APIs in the third party folder.
MCO API was moved to openshift/api, so we can just consume it via vendoring with no need for ad hoc hacks.
HardwareDetails is a pointer and we fail to check whether it is nil. The installer panics when attempting to collect gathered logs from the masters.
We need to update the CRI-O test for workload partitioning to give more useful information; currently it's hard to tell which container or pod has a CPU affinity mismatch.
More info on change: https://github.com/openshift/origin/pull/28852
Description of problem:
The "Auth Token GCP" filter in OperatorHub is displayed all the time, but it should instead be rendered only for GCP clusters that have Manual credential mode. When a GCP WIF-capable operator is installed and the cluster is in GCP WIF mode, the Console should require the user to enter the necessary information about the GCP project, account, service account, etc., which is in turn injected into the operator's deployment via subscription.config (exactly how Azure and AWS STS were implemented in the Console).
Version-Release number of selected component (if applicable):
4.15
How reproducible:
Steps to Reproduce:
1. On a non-GCP cluster, navigate to OperatorHub 2. check available filters 3.
Actual results:
"Auth Token GCP" filter is available in OperatorHub
Expected results:
"Auth Token GCP" filter should not be available in OperatorHub for a non-GCP cluster. When selecting an operator that supports "Auth Token GCP", as indicated by the annotation features.operators.openshift.io/token-auth-gcp: "true", the console needs to, aligned with how it works for AWS/Azure auth-capable operators, force the user to input the required information to authenticate against GCP via WIF in the form of env vars that are set up using subscription.config on the operator. The exact names need to come out of https://issues.redhat.com/browse/CCO-574
Additional info:
Azure PR - https://github.com/openshift/console/pull/13082 AWS PR - https://github.com/openshift/console/pull/12778
UI Screen Design can be taken from the existing implementation of the Console support short-lived token setup flow for AWS and Azure described here: https://docs.google.com/document/d/1iFNpyycby_rOY1wUew-yl3uPWlE00krTgr9XHDZOTNo/edit
Please review the following PR: https://github.com/openshift/ibm-powervs-block-csi-driver-operator/pull/71
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
With the changes in 4.17 to add authentication to the assisted-service API, it requires an additional step for users to retrieve data via this API. This will make it more difficult to request the data in customer cases. It would be useful to set the assisted-service log level to debug in order to capture additional logging in agent-gathers, and remove the need for requesting data from the API.
Description of problem:
HostedCluster fails to update from 4.14.9 to 4.14.24. This was attempted using the HCP KubeVirt platform, but could impact other platforms as well.
Version-Release number of selected component (if applicable):
4.14.9
How reproducible:
100%
Steps to Reproduce:
1. Create an HCP KubeVirt cluster with 4.14.9 and wait for it to reach Completed 2. Update the HostedCluster's release image to 4.14.24
Actual results:
HostedCluster is stuck in a partial update state indefinitely with this condition - lastTransitionTime: "2024-05-14T17:37:16Z" message: 'Working towards 4.14.24: 478 of 599 done (79% complete), waiting on csi-snapshot-controller, image-registry, storage' observedGeneration: 4 reason: ClusterOperatorsUpdating status: "True" type: ClusterVersionProgressing
Expected results:
HostedCluster updates successfully.
Additional info:
Updating from 4.14.24 to 4.14.25 worked in this environment. We noted that 4.14.9 -> 4.14.24 did not work; this was reproduced in multiple environments. This was also observed using both MCE 2.4 and MCE 2.5 across 4.14 and 4.15 infra clusters.
As a HyperShift user, I want to be able to:
so that I can achieve
Description of criteria:
Detail about what is specifically not being delivered in the story
This requires/does not require a design proposal.
This requires/does not require a feature gate.
Please review the following PR: https://github.com/openshift/csi-driver-shared-resource/pull/189
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
This is a clone of issue OCPBUGS-43625. The following is the description of the original issue:
—
Component Readiness has found a potential regression in the following test:
install should succeed: infrastructure
installer fails with:
time="2024-10-20T04:34:57Z" level=error msg="failed to fetch Cluster: failed to generate asset \"Cluster\": failed to create cluster: control-plane machines were not provisioned within 15m0s: client rate limiter Wait returned an error: context deadline exceeded"
Significant regression detected.
Fishers Exact probability of a regression: 99.96%.
Test pass rate dropped from 98.94% to 89.29%.
Sample (being evaluated) Release: 4.18
Start Time: 2024-10-14T00:00:00Z
End Time: 2024-10-21T23:59:59Z
Success Rate: 89.29%
Successes: 25
Failures: 3
Flakes: 0
Base (historical) Release: 4.17
Start Time: 2024-09-01T00:00:00Z
End Time: 2024-10-01T23:59:59Z
Success Rate: 98.94%
Successes: 93
Failures: 1
Flakes: 0
Description of problem:
The configured HTTP proxy in a HostedCluster is not used when generating the user data for worker instances.
Version-Release number of selected component (if applicable):
4.17
How reproducible:
always
Steps to Reproduce:
1. Create a public hosted cluster that has access to the outside only via proxy 2. Wait for machines to ignite
Actual results:
1. Machines do not ignite/join as nodes
Expected results:
Machines join as nodes
Additional info:
The proxy resource that is used to generate the user data snippet is empty.
Description of problem:
The option "Auto deploy when new image is available" becomes unchecked when editing a deployment from web console
Version-Release number of selected component (if applicable):
4.15.17
How reproducible:
100%
Steps to Reproduce:
1. Go to Workloads --> Deployments --> Edit Deployment --> Under the Images section, tick the option "Auto deploy when new Image is available" and save the deployment. 2. Now edit the deployment again and observe that the option "Auto deploy when new Image is available" is unchecked. 3. The same test works fine on a 4.14 cluster.
Actual results:
Option "Auto deploy when new Image is available" is in unchecked state.
Expected results:
Option "Auto deploy when new Image is available" remains in checked state.
Additional info:
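For illustration only (an assumption, not from the original report): on Deployments this console checkbox maps to the image.openshift.io/triggers annotation, so one way to check whether the setting was actually persisted after saving is to inspect that annotation (deployment and namespace names are placeholders):
oc get deployment my-app -n my-project -o yaml | grep -A1 'image.openshift.io/triggers'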
Since in CI we use the --no-index option to emulate a pip disconnected environment to try and reproduce the downstream build conditions, it's not possible to test normal dependencies from source as pip won't be able to retrieve them from any remote source.
To work around that, we first download all the packages with the --no-index option removed but with the --no-deps option set, preventing pip from downloading transitive dependencies.
This forces pip to download only the packages specified in the requirements file, ensuring total control over the main libraries and dependencies and allowing us to be as granular as needed, easily switching between RPMs and source packages for testing or even for downstream builds.
When we install them afterwards, if any dependency is missing the installation will fail in CI, allowing us to correct the dependency list directly in the change PR.
Downloading the libraries first and then installing them with the same options allows more flexibility and an almost 1-to-1 copy of the downstream build environment that Cachito uses.
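A minimal sketch of the flow described above; the requirements file and download directory names are placeholders:
# Download only the packages listed in the requirements file (index allowed, no transitive deps)
pip download --no-deps -r requirements.txt -d ./downloads
# Install offline from the local directory; any missing dependency fails loudly in CI
pip install --no-index --no-deps --find-links ./downloads -r requirements.txt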
Description of the problem:
Assisted-Service logs pointer addresses instead of the actual values during cluster registration.
How reproducible:
100%
Steps to reproduce:
1. Register a cluster and look at the logs
Actual results:
Apr 17 10:48:09 master service[2732]: time="2024-04-17T10:48:09Z" level=info msg="Register cluster: agent-sno with id 026efda3-fd2c-40d3-a65f-8a22acd6267a and params &{AdditionalNtpSource:<nil> APIVips:[] BaseDNSDomain:abi-ci.com ClusterNetworkCidr:<nil> ClusterNetworkHostPrefix:0 ClusterNetworks:[0xc00111cc00] CPUArchitecture:s390x DiskEncryption:<nil> HighAvailabilityMode:0xc0011ce5a0 HTTPProxy:<nil> HTTPSProxy:<nil> Hyperthreading:<nil> IgnitionEndpoint:<nil> IngressVips:[] MachineNetworks:[0xc001380340] Name:0xc0011ce5b0 NetworkType:0xc0011ce5c0 NoProxy:<nil> OcpReleaseImage: OlmOperators:[] OpenshiftVersion:0xc0011ce5d0 Platform:0xc0009659e0 PullSecret:0xc0011ce5f0 SchedulableMasters:0xc0010cc710 ServiceNetworkCidr:<nil> ServiceNetworks:[0xc001380380]
...
Expected results:
All values should be shown (without the secrets).
Description of problem:
Setting capabilities as below in install-config:
capabilities:
  baselineCapabilitySet: v4.14
  additionalEnabledCapabilities:
  - CloudCredential
Continue to create manifests; the installer should exit with an error message that "the marketplace capability requires the OperatorLifecycleManager capability", as is done in https://github.com/openshift/installer/pull/7495/. In that PR, it seems the check is only performed when baselineCapabilitySet is set to None. When baselineCapabilitySet is set to v4.x, it also includes the "marketplace" capability, so the same pre-check is needed there.
Version-Release number of selected component (if applicable):
4.15/4.16
How reproducible:
Always
Steps to Reproduce:
1. Prepare install-config and set baselineCapabilitySet to v4.x (x<15) 2. Create manifests 3.
Actual results:
Manifests are created successfully.
Expected results:
Installer exited with error message that something like "the marketplace capability requires the OperatorLifecycleManager capability"
Additional info:
The goal is to collect metrics about the AdminNetworkPolicy and BaselineAdminNetworkPolicy CRDs, because it is essential to understand how users are using this feature, and in fact whether they are using it or not. This is required for the 4.16 feature https://issues.redhat.com/browse/SDN-4157, and we are hoping to get approval and the PRs merged before the 4.16 code-freeze time frame (April 26th 2024).
admin_network_policy_total represents the total number of admin network policies in the cluster
Labels: None
See https://github.com/ovn-org/ovn-kubernetes/pull/4239 for more information
Cardinality of the metric is at most 1.
baseline_admin_network_policy_total represents the total number of baseline admin network policies in the cluster (0 or 1)
Labels: None
See https://github.com/ovn-org/ovn-kubernetes/pull/4239 for more information
Cardinality of the metric is at most 1.
We don't need the above two anymore because we have https://redhat-internal.slack.com/archives/C0VMT03S5/p1712567951869459?thread_ts=1712346681.157809&cid=C0VMT03S5
Instead of that we are adding two other metrics for rule count: (https://issues.redhat.com/browse/MON-3828 )
admin_network_policy_db_objects_total represents the total number of OVN NBDB objects (table_name) owned by AdminNetworkPolicy controller in the cluster
Labels: table_name
See https://github.com/ovn-org/ovn-kubernetes/pull/4254 for more information
Cardinality of the metric is at most 3.
baseline_admin_network_policy_db_objects_total represents the total number of OVN NBDB objects (table_name) owned by BaselineAdminNetworkPolicy controller in the cluster
Labels: table_name
See https://github.com/ovn-org/ovn-kubernetes/pull/4254 for more information
Cardinality of the metric is at most 3.
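For illustration (not part of the original text): assuming these metrics are scraped by the platform Prometheus and that the label is table_name as the descriptions above suggest, they could be queried through the thanos-querier route, e.g.:
TOKEN=$(oc whoami -t)
HOST=$(oc -n openshift-monitoring get route thanos-querier -o jsonpath='{.spec.host}')
curl -sk -H "Authorization: Bearer ${TOKEN}" \
  --data-urlencode 'query=sum(admin_network_policy_db_objects_total) by (table_name)' \
  "https://${HOST}/api/v1/query"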
In a cluster with an external OIDC environment, we need to replace the global refresh sync lock in the OIDC provider with a per-refresh-token one. The work should replace the sync lock that applies to all HTTP-serving goroutines with a lock that is specific to each refresh token.
Description of problem:
Version-Release number of selected component (if applicable):
Steps to Reproduce:
Actual results:
Expected results:
That reduces token refresh request handling time by about 30%.
Additional info:
This is a clone of issue OCPBUGS-41631. The following is the description of the original issue:
—
Description of problem:
Panic seen in the below CI job when running the below command:
$ w3m -dump -cols 200 'https://search.dptools.openshift.org/?name=^periodic&type=junit&search=machine-config-controller.*Observed+a+panic' | grep 'failures match' periodic-ci-openshift-insights-operator-stage-insights-operator-e2e-tests-periodic (all) - 1 runs, 100% failed, 100% of failures match = 100% impact periodic-ci-openshift-insights-operator-release-4.17-insights-operator-e2e-tests-periodic (all) - 2 runs, 100% failed, 50% of failures match = 50% impact
Panic observed:
E0910 09:00:04.283647 1 runtime.go:79] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference) goroutine 268 [running]: k8s.io/apimachinery/pkg/util/runtime.logPanic({0x36c8b40, 0x5660c90}) /go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:75 +0x85 k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0xc000ce8540?}) /go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:49 +0x6b panic({0x36c8b40?, 0x5660c90?}) /usr/lib/golang/src/runtime/panic.go:770 +0x132 github.com/openshift/machine-config-operator/pkg/controller/node.(*Controller).updateNode(0xc000d6e360, {0x3abd580?, 0xc00224a608}, {0x3abd580?, 0xc001bd2308}) /go/src/github.com/openshift/machine-config-operator/pkg/controller/node/node_controller.go:585 +0x1f3 k8s.io/client-go/tools/cache.ResourceEventHandlerFuncs.OnUpdate(...) /go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/client-go/tools/cache/controller.go:246 k8s.io/client-go/tools/cache.(*processorListener).run.func1() /go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/client-go/tools/cache/shared_informer.go:976 +0xea k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0x30?) /go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:226 +0x33 k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc001933f70, {0x3faaba0, 0xc000759710}, 0x1, 0xc00097bda0) /go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:227 +0xaf k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc000750f70, 0x3b9aca00, 0x0, 0x1, 0xc00097bda0) /go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:204 +0x7f k8s.io/apimachinery/pkg/util/wait.Until(...) /go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:161 k8s.io/client-go/tools/cache.(*processorListener).run(0xc000dc2630) /go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/client-go/tools/cache/shared_informer.go:972 +0x69 k8s.io/apimachinery/pkg/util/wait.(*Group).Start.func1() /go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:72 +0x52 created by k8s.io/apimachinery/pkg/util/wait.(*Group).Start in goroutine 261 /go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:70 +0x73 panic: runtime error: invalid memory address or nil pointer dereference [recovered] panic: runtime error: invalid memory address or nil pointer dereference [signal SIGSEGV: segmentation violation code=0x1 addr=0x28 pc=0x33204b3]
Version-Release number of selected component (if applicable):
How reproducible:
Seen in this CI run -https://prow.ci.openshift.org/job-history/test-platform-results/logs/periodic-ci-openshift-insights-operator-stage-insights-operator-e2e-tests-periodic
Steps to Reproduce:
$ w3m -dump -cols 200 'https://search.dptools.openshift.org/?name=^periodic&type=junit&search=machine-config-controller.*Observed+a+panic' | grep 'failures match'
Actual results:
Expected results:
No panic to observe
Additional info:
Description of the problem:
Soft timeout scenario: installation with the network slowed down (external link limited to 40 Mbps), so the installation takes many hours. The relevant event log message was not sent until 10 hours in.
In the example here we see :
5/29/2024, 11:05:05 AM warning Cluster 97709caf-5081-43e7-b5cc-80873ab1442d: finalizing stage Waiting for cluster operators has been active more than the expected completion time (600 minutes) 5/29/2024, 11:03:02 AM The following operators are experiencing issues: insights, kube-controller-manager 5/29/2024, 1:10:02 AM Operator console status: available message: All is well 5/29/2024, 1:09:03 AM Operator console status: progressing message: SyncLoopRefreshProgressing: Working toward version 4.15.14, 1 replicas available 5/29/2024, 1:04:03 AM Operator console status: failed message: RouteHealthDegraded: route not yet available, https://console-openshift-console.apps.test-infra-cluster-356d2e39.redhat.com returns '503 Service Unavailable'
Based on the event logs, the message below was triggered only after 10 hours, and two minutes later we got the timeout.
5/29/2024, 11:03:02 AM | The following operators are experiencing issues: insights, kube-controller-manager |
It would be better to allow this messaging earlier; perhaps we can tune it so that the "The following operators are experiencing issues..." message appears after 1 or 2 hours.
Keeping quiet for 10 hours with only informational events won't help the customer, who may stop the installation, and it may hide real bugs.
test-infra-cluster-356d2e39_97709caf-5081-43e7-b5cc-80873ab1442d.tar
Description of problem:
This is a port of https://issues.redhat.com/browse/OCPBUGS-38470 to 4.17.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Please review the following PR: https://github.com/openshift/csi-driver-shared-resource/pull/187
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Seen in a 4.16.1 CI run:
: [bz-Etcd] clusteroperator/etcd should not change condition/Available expand_less 1h28m39s { 2 unexpected clusteroperator state transitions during e2e test run. These did not match any known exceptions, so they cause this test-case to fail: Jun 27 14:17:18.966 E clusteroperator/etcd condition/Available reason/EtcdMembers_NoQuorum status/False EtcdMembersAvailable: 1 of 3 members are available, ip-10-0-71-113.us-west-1.compute.internal is unhealthy, ip-10-0-58-93.us-west-1.compute.internal is unhealthy Jun 27 14:17:18.966 - 75s E clusteroperator/etcd condition/Available reason/EtcdMembers_NoQuorum status/False EtcdMembersAvailable: 1 of 3 members are available, ip-10-0-71-113.us-west-1.compute.internal is unhealthy, ip-10-0-58-93.us-west-1.compute.internal is unhealthy
But further digging turned up no sign that quorum had had any difficulties. It seems like the difficulty was the GetMemberHealth structure, which currently allows timelines like:
That can leave 30+s gaps of nominal Healthy:false for MemberC when in fact MemberC was completely fine.
I suspect that the "was really short" Took:27.199µs got a "took too long" context deadline exceeded because GetMemberHealth has a 30s timeout per member, while many (all?) of its callers have a 30s DefaultClientTimeout. Which means by the time we get to MemberC, we've already spend our Context and we're starved of time to actually check MemberC. It may be more reliable to refactor and probe all known members in parallel, and to keep probing in the event of failures while you wait for the slowest member-probe to get back to you, because I suspect a re-probe of MemberC (or even a single probe that was granted reasonable time to complete) while we waited on MemberB would have succeeded and told us MemberC was actually fine.
Exposure is manageable, because this is self-healing, and quorum is actually ok. But still worth fixing because it spooks admins (and the origin CI test suite) if you tell them you're Available=False, and we want to save that for situations where the component is actually having trouble like quorum loss, and not burn signal-to-noise by claiming EtcdMembers_NoQuorum when it's really BriefIssuesScrapingMemberAHealthAndWeWillllTryAgainSoon.
Seen in 4.16.1, but the code is old, so likely a longstanding issue.
Luckily for customers, but unluckily for QE, network or whatever hiccups when connecting to members seem rare, so we don't trip the condition that exposes this issue often.
Steps to Reproduce:
1. Figure out which order etcd is probing members in.
2. Stop the first or second member, in a way that makes its health probes time out ~30s.
3. Monitor the etcd ClusterOperator Available condition.
Actual results:
Available goes False claiming EtcdMembers_NoQuorum, as the operator starves itself of the time it needs to actually probe the third member.
Expected results:
Available stays True, as the etcd operator takes the full 30s to check on all members, and sees that two of them are completely happy.
Description of problem:
Checking the vsphere-problem-detector-operator log in 4.17.0-0.nightly-2024-07-28-191830, it threw the error message as below: W0729 01:36:04.693729 1 reflector.go:547] k8s.io/client-go@v0.30.2/tools/cache/reflector.go:232: failed to list *v1.ClusterCSIDriver: clustercsidrivers.operator.openshift.io is forbidden: User "system:serviceaccount:openshift-cluster-storage-operator:vsphere-problem-detector-operator" cannot list resource "clustercsidrivers" in API group "operator.openshift.io" at the cluster scope E0729 01:36:04.693816 1 reflector.go:150] k8s.io/client-go@v0.30.2/tools/cache/reflector.go:232: Failed to watch *v1.ClusterCSIDriver: failed to list *v1.ClusterCSIDriver: clustercsidrivers.operator.openshift.io is forbidden: User "system:serviceaccount:openshift-cluster-storage-operator:vsphere-problem-detector-operator" cannot list resource "clustercsidrivers" in API group "operator.openshift.io" at the cluster scope
And vsphere-problem-detector-operator continues restarting: vsphere-problem-detector-operator-76d6885898-vsww4 1/1 Running 34 (11m ago) 7h18m
It might be caused by https://github.com/openshift/vsphere-problem-detector/pull/166 and we have not added the clusterrole in openshift/cluster-storage-operator repo yet.
Version-Release number of selected component (if applicable):
4.17.0-0.nightly-2024-07-28-191830
How reproducible:
Always
Steps to Reproduce:
See Description
Actual results:
vsphere-problem-detector-operator report permission lack and restart
Expected results:
vsphere-problem-detector-operator should not report permission lack and restart
Additional info:
This is a clone of issue OCPBUGS-38573. The following is the description of the original issue:
—
Description of problem:
While working on the readiness probes we have discovered that the single member health check always allocates a new client. Since this is an expensive operation, we can make use of the pooled client (that already has a connection open) and change the endpoints for a brief period of time to the single member we want to check. This should reduce CEO's and etcd CPU consumption.
Version-Release number of selected component (if applicable):
any supported version
How reproducible:
always, but technical detail
Steps to Reproduce:
na
Actual results:
CEO creates a new etcd client when it is checking a single member health
Expected results:
CEO should use the existing pooled client to check for single member health
Additional info:
This is a clone of issue OCPBUGS-43925. The following is the description of the original issue:
—
Description of problem:
BuildConfig form breaks on manually enter the Git URL after selecting the source type as Git
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. Navigate to the Create BuildConfig form page 2. Select source type as Git 3. Enter the git URL by typing it manually; do not paste it or select it from the suggestion
Actual results:
Console breaks
Expected results:
Console should not break and the user should be able to create a BuildConfig
Additional info:
This is a clone of issue OCPBUGS-42880. The following is the description of the original issue:
—
When the cluster version operator has already accepted an update to 4.(y+1).z, it should accept retargets to 4.(y+1).z' even if ClusterVersion has Upgradeable=False (unless there are overrides, those are explicitly supposed to block patch updates). It currently blocks these retargets, which can make it hard for a cluster admin to say "hey, this update is stuck on a bug in 4.(y+1).z, and I want to retarget to 4.(y+1).z' to pick up the fix for that bug so the update can complete".
Spun out from Evgeni Vakhonin 's testing of OTA-861.
Reproduced in a 4.15.35 CVO.
Reproduced in my first try, but I have not made additional attempts.
1. Install 4.y, e.g. with Cluster Bot launch 4.14.38 aws.
2. Request an update to a 4.(y+1).z:
a. oc adm upgrade channel candidate-4.15
b. oc adm upgrade --to 4.15.35
3. Wait until the update has been accepted...
$ oc adm upgrade | head -n1 info: An upgrade is in progress. Working towards 4.15.35: 10 of 873 done (1% complete
4. Inject an Upgradeable=False situation for testing:
$ oc -n openshift-config-managed patch configmap admin-gates --type json -p '[ {"op": "add", "path": "/data/ack-4.14-kube-1.29-api-removals-in-4.16", value: "testing"}]'
And after a minute or two, the CVO has noticed and set Upgradeable=False:
$ oc adm upgrade info: An upgrade is in progress. Working towards 4.15.35: 109 of 873 done (12% complete), waiting on etcd, kube-apiserver Upgradeable=False Reason: AdminAckRequired Message: testing Upstream: https://api.integration.openshift.com/api/upgrades_info/graph Channel: candidate-4.15 (available channels: candidate-4.15, candidate-4.16, fast-4.15, fast-4.16) Recommended updates: VERSION IMAGE 4.15.36 quay.io/openshift-release-dev/ocp-release@sha256:a8579cdecf1d45d33b5e88d6e1922df3037d05b09bcff7f08556b75898ab2f46
5. Request a patch-bumping retarget to 4.(y+1).z':
$ oc adm upgrade --allow-upgrade-with-warnings --to 4.15.36 warning: --allow-upgrade-with-warnings is bypassing: the cluster is already upgrading: Reason: ClusterOperatorsUpdating Message: Working towards 4.15.35: 109 of 873 done (12% complete), waiting on etcd, kube-apiserver Requested update to 4.15.36
6. Check the status of the retarget request: oc adm upgrade
The retarget was rejected:
$ oc adm upgrade info: An upgrade is in progress. Working towards 4.15.35: 109 of 873 done (12% complete), waiting on etcd, kube-apiserver Upgradeable=False Reason: AdminAckRequired Message: testing ReleaseAccepted=False Reason: PreconditionChecks Message: Preconditions failed for payload loaded version="4.15.36" image="quay.io/openshift-release-dev/ocp-release@sha256:a8579cdecf1d45d33b5e88d6e1922df3037d05b09bcff7f08556b75898ab2f46": Precondition "ClusterVersionUpgradeable" failed because of "AdminAckRequired": testing Upstream: https://api.integration.openshift.com/api/upgrades_info/graph Channel: candidate-4.15 (available channels: candidate-4.15, candidate-4.16, fast-4.15, fast-4.16) Recommended updates: VERSION IMAGE 4.15.36 quay.io/openshift-release-dev/ocp-release@sha256:a8579cdecf1d45d33b5e88d6e1922df3037d05b09bcff7f08556b75898ab2f46
The retarget should have been accepted, because 4.15.35 was already accepted, and 4.15.35 to 4.15.36 is a patch bump, which Upgradeable=False is not supposed to block.
This GetCurrentVersion is looking in history for the most recent Completed entry. But for the Upgradeable precondition, we want to be looking in status.desired for the currently accepted entry, regardless of whether we've completed reconciling it or not.
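For reference, and as an illustration only: the currently accepted target described above lives in status.desired and can be read with:
# print the version the CVO has currently accepted, whether or not reconciliation has completed
oc get clusterversion version -o jsonpath='{.status.desired.version}{"\n"}'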
This solves 2 problems which were introduced when we moved our API to a separate sub module:
1. Duplicate API files: currently all our API is vendored in the main module.
2. API imports in our code are using the vendored version, which led to poor UX because local changes to the API are not reflected immediately in the code until the vendor dir is updated. Also, using the IDE "go to" feature sends you to the vendor folder, where you can't make any updates/changes (see the sketch below).
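As an illustration of one common way to address problem 2 (not necessarily what was done here), a replace directive can point the main module at the local API submodule so local changes are picked up immediately; the module path and the ./api layout below are assumptions:
# hypothetical module path; substitute the repository's real API module
go mod edit -replace github.com/openshift/example-component/api=./api
go mod tidy
go mod vendor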
Please review the following PR: https://github.com/openshift/prom-label-proxy/pull/370
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
The documentation files in the Github repository of Installer do not mention apiVIPs and ingressVIPs, mentioning instead the deprecated fields apiVIP and ingressVIP.
Note that the information contained in the customer-facing documentation is correct: https://docs.openshift.com/container-platform/4.16/installing/installing_openstack/installing-openstack-installer-custom.html
Hello Team,
After the hard reboot of all nodes due to a power outage, a failure to pull the NTO image prevents "ocp-tuned-one-shot.service" from starting, which results in a dependency failure for the kubelet and crio services.
------------
journalctl_--no-pager
Aug 26 17:07:46 ocp05 systemd[1]: Reached target The firstboot OS update has completed.
Aug 26 17:07:46 ocp05 resolv-prepender.sh[3577]: NM resolv-prepender: Starting download of baremetal runtime cfg image
Aug 26 17:07:46 ocp05 systemd[1]: Starting Writes IP address configuration so that kubelet and crio services select a valid node IP...
Aug 26 17:07:46 ocp05 systemd[1]: Starting TuneD service from NTO image...
Aug 26 17:07:46 ocp05 nm-dispatcher[3687]: NM resolv-prepender triggered by lo up.
Aug 26 17:07:46 ocp05 resolv-prepender.sh[3644]: Trying to pull quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:cf4faeb258c222ba4e04806fd3a7373d3bc1f43a66e141d4b7ece0307f597c72...
Aug 26 17:07:46 ocp05 nm-dispatcher[3720]: + [[ OVNKubernetes == \O\V\N\K\u\b\e\r\n\e\t\e\s ]]
Aug 26 17:07:46 ocp05 nm-dispatcher[3720]: + [[ lo == \W\i\r\e\d\ \C\o\n\n\e\c\t\i\o\n ]]
Aug 26 17:07:46 ocp05 nm-dispatcher[3720]: + '[' -z ']'
Aug 26 17:07:46 ocp05 nm-dispatcher[3720]: + echo 'Not a DHCP4 address. Ignoring.'
Aug 26 17:07:46 ocp05 nm-dispatcher[3720]: Not a DHCP4 address. Ignoring.
Aug 26 17:07:46 ocp05 nm-dispatcher[3720]: + exit 0
Aug 26 17:07:46 ocp05 nm-dispatcher[3722]: + '[' -z '' ']'
Aug 26 17:07:46 ocp05 nm-dispatcher[3722]: + echo 'Not a DHCP6 address. Ignoring.'
Aug 26 17:07:46 ocp05 nm-dispatcher[3722]: Not a DHCP6 address. Ignoring.
Aug 26 17:07:46 ocp05 nm-dispatcher[3722]: + exit 0
Aug 26 17:07:46 ocp05 bash[3655]: Trying to pull quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:cf4faeb258c222ba4e04806fd3a7373d3bc1f43a66e141d4b7ece0307f597c72...
Aug 26 17:07:46 ocp05 podman[3661]: Trying to pull quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:4b6ace44ba73bc0cef451bcf755c7fcddabe66b79df649058dc4b263e052ae26...
Aug 26 17:07:46 ocp05 podman[3661]: Error: initializing source docker://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:4b6ace44ba73bc0cef451bcf755c7fcddabe66b79df649058dc4b263e052ae26: pinging container registry quay.io: Get "https://quay.io/v2/": dial tcp: lookup quay.io on 10.112.227.10:53: server misbehaving
Aug 26 17:07:46 ocp05 systemd[1]: ocp-tuned-one-shot.service: Main process exited, code=exited, status=125/n/a
Aug 26 17:07:46 ocp05 nm-dispatcher[3793]: NM resolv-prepender triggered by brtrunk up.
Aug 26 17:07:46 ocp05 systemd[1]: ocp-tuned-one-shot.service: Failed with result 'exit-code'.
Aug 26 17:07:46 ocp05 nm-dispatcher[3803]: + [[ OVNKubernetes == \O\V\N\K\u\b\e\r\n\e\t\e\s ]]
Aug 26 17:07:46 ocp05 nm-dispatcher[3803]: + [[ brtrunk == \W\i\r\e\d\ \C\o\n\n\e\c\t\i\o\n ]]
Aug 26 17:07:46 ocp05 nm-dispatcher[3803]: + '[' -z ']'
Aug 26 17:07:46 ocp05 nm-dispatcher[3803]: + echo 'Not a DHCP4 address. Ignoring.'
Aug 26 17:07:46 ocp05 nm-dispatcher[3803]: Not a DHCP4 address. Ignoring.
Aug 26 17:07:46 ocp05 nm-dispatcher[3803]: + exit 0
Aug 26 17:07:46 ocp05 systemd[1]: Failed to start TuneD service from NTO image.
Aug 26 17:07:46 ocp05 systemd[1]: Dependency failed for Dependencies necessary to run kubelet.
Aug 26 17:07:46 ocp05 systemd[1]: Dependency failed for Kubernetes Kubelet.
Aug 26 17:07:46 ocp05 systemd[1]: kubelet.service: Job kubelet.service/start failed with result 'dependency'.
Aug 26 17:07:46 ocp05 systemd[1]: Dependency failed for Container Runtime Interface for OCI (CRI-O).
Aug 26 17:07:46 ocp05 systemd[1]: crio.service: Job crio.service/start failed with result 'dependency'.
Aug 26 17:07:46 ocp05 systemd[1]: kubelet-dependencies.target: Job kubelet-dependencies.target/start failed with result 'dependency'.
Aug 26 17:07:46 ocp05 nm-dispatcher[3804]: + '[' -z '' ']'
Aug 26 17:07:46 ocp05 nm-dispatcher[3804]: + echo 'Not a DHCP6 address. Ignoring.'
Aug 26 17:07:46 ocp05 nm-dispatcher[3804]: Not a DHCP6 address. Ignoring.
Aug 26 17:07:46 ocp05 nm-dispatcher[3804]: + exit 0
-----------
-----------
$ oc get proxy config cluster -oyaml
status:
httpProxy: http://proxy_ip:8080
httpsProxy: http://proxy_ip:8080
$ cat /etc/mco/proxy.env
HTTP_PROXY=http://proxy_ip:8080
HTTPS_PROXY=http://proxy_ip:8080
-----------
-----------
× ocp-tuned-one-shot.service - TuneD service from NTO image
Loaded: loaded (/etc/systemd/system/ocp-tuned-one-shot.service; enabled; preset: disabled)
Active: failed (Result: exit-code) since Mon 2024-08-26 17:07:46 UTC; 2h 30min ago
Main PID: 3661 (code=exited, status=125)
Aug 26 17:07:46 ocp05 podman[3661]: Error: initializing source docker://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:4b6ace44ba73bc0cef451bcf755c7fcddabe66b79df649058dc4b263e052ae26: pinging container registry quay.io: Get "https://quay.io/v2/": dial tcp: lookup quay.io on 10.112.227.10:53: server misbehaving
-----------
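A possible diagnostic step (not from the original report), assuming the proxy settings are expected to come from /etc/mco/proxy.env as shown above, is to check whether the unit actually references any proxy environment:
oc debug node/ocp05 -- chroot /host \
  sh -c 'systemctl cat ocp-tuned-one-shot.service | grep -iE "EnvironmentFile|proxy"'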
Please review the following PR: https://github.com/openshift/kubevirt-csi-driver/pull/43
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
This comes from this bug https://issues.redhat.com/browse/OCPBUGS-29940
After applying the workaround suggested [1][2] with "oc adm must-gather --node-name", we found another issue: must-gather creates the debug pod on all master nodes and gets stuck for a while because of the loop in the gather_network_logs_basics script. Filtering out the NotReady nodes would allow us to apply the workaround.
The script gather_network_logs_basics gets the master nodes by label (node-role.kubernetes.io/master) and saves them in the CLUSTER_NODES variable. It then passes this as a parameter to the function gather_multus_logs $CLUSTER_NODES, where it loops through the list of master nodes and performs debugging for each node.
collection-scripts/gather_network_logs_basics
...
CLUSTER_NODES="${@:-$(oc get node -l node-role.kubernetes.io/master -oname)}"
/usr/bin/gather_multus_logs $CLUSTER_NODES
...
collection-scripts/gather_multus_logs
...
function gather_multus_logs {
  for NODE in "$@"; do
    nodefilename=$(echo "$NODE" | sed -e 's|node/||')
    out=$(oc debug "${NODE}" -- \
      /bin/bash -c "cat $INPUT_LOG_PATH" 2>/dev/null) && echo "$out" 1> "${OUTPUT_LOG_PATH}/multus-log-$nodefilename.log"
  done
}
This could be resolved with something similar to this:
CLUSTER_NODES="${@:-$(oc get node -l node-role.kubernetes.io/master -o json | jq -r '.items[] | select(.status.conditions[] | select(.type=="Ready" and .status=="True")).metadata.name')}"
/usr/bin/gather_multus_logs $CLUSTER_NODES
[1] - https://access.redhat.com/solutions/6962230
[2] - https://issues.redhat.com/browse/OCPBUGS-29940
Please review the following PR: https://github.com/openshift/cloud-provider-openstack/pull/283
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
Reviewing https://sippy.dptools.openshift.org/sippy-ng/component_readiness/test_details?arch=amd64&arch=amd64&baseEndTime=2024-02-28%2023%3A59%3A59&baseRelease=4.15&baseStartTime=2024-02-01%2000%3A00%3A00&capability=operator-conditions&component=Cloud%20Compute%20%2F%20Other%20Provider&confidence=95&environment=ovn%20no-upgrade%20amd64%20azure%20standard&excludeArches=arm64%2Cheterogeneous%2Cppc64le%2Cs390x&excludeClouds=openstack%2Cibmcloud%2Clibvirt%2Covirt%2Cunknown&excludeVariants=hypershift%2Cosd%2Cmicroshift%2Ctechpreview%2Csingle-node%2Cassisted%2Ccompact&groupBy=cloud%2Carch%2Cnetwork&ignoreDisruption=true&ignoreMissing=false&minFail=3&network=ovn&network=ovn&pity=5&platform=azure&platform=azure&sampleEndTime=2024-06-05%2023%3A59%3A59&sampleRelease=4.15&sampleStartTime=2024-05-30%2000%3A00%3A00&testId=Operator%20results%3A6d9ee55972f66121016367d07d52f0a9&testName=operator%20conditions%20control-plane-machine-set&upgrade=no-upgrade&upgrade=no-upgrade&variant=standard&variant=standard, it appears that the Azure tests are failing frequently with "Told to stop trying". Check failed before until passed. Reviewing this, it appears that the rollout happened as expected, but the until function got a non-retryable error and exited, while the check saw that the Deletion timestamp was set and the Machine went into Running, which caused it to fail. We should investigate why the until failed in this case as it should have seen the same machines and therefore should have seen a Running machine and passed.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Note: also notify the Hive team we're doing these bumps.
Description of problem:
Pod stuck in creating state when running performance benchmark The exact error when describing the pod - Events: Type Reason Age From Message ---- ------ ---- ---- ------- Warning FailedCreatePodSandBox 45s (x114 over 3h47m) kubelet (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_client-1-5c978b7665-n4tds_cluster-density-v2-35_f57d8281-5a79-4c91-9b83-bb3e4b553597_0(5a8d6897ca792d91f1c52054f5f8c596530fbf72d3abb07b19a20fd9c95cc564): error adding pod cluster-density-v2-35_client-1-5c978b7665-n4tds to CNI network "multus-cni-network": plugin type="multus-shim" name="multus-cni-network" failed (add): CmdAdd (shim): CNI request failed with status 400: '&\{ContainerID:5a8d6897ca792d91f1c52054f5f8c596530fbf72d3abb07b19a20fd9c95cc564 Netns:/var/run/netns/e06c9af7-c13d-426f-9a00-73c54441a20b IfName:eth0 Args:IgnoreUnknown=1;K8S_POD_NAMESPACE=cluster-density-v2-35;K8S_POD_NAME=client-1-5c978b7665-n4tds;K8S_POD_INFRA_CONTAINER_ID=5a8d6897ca792d91f1c52054f5f8c596530fbf72d3abb07b19a20fd9c95cc564;K8S_POD_UID=f57d8281-5a79-4c91-9b83-bb3e4b553597 Path: StdinData:[123 34 98 105 110 68 105 114 34 58 34 47 118 97 114 47 108 105 98 47 99 110 105 47 98 105 110 34 44 34 99 104 114 111 111 116 68 105 114 34 58 34 47 104 111 115 116 114 111 111 116 34 44 34 99 108 117 115 116 101 114 78 101 116 119 111 114 107 34 58 34 47 104 111 115 116 47 114 117 110 47 109 117 108 116 117 115 47 99 110 105 47 110 101 116 46 100 47 49 48 45 111 118 110 45 107 117 98 101 114 110 101 116 101 115 46 99 111 110 102 34 44 34 99 110 105 67 111 110 102 105 103 68 105 114 34 58 34 47 104 111 115 116 47 101 116 99 47 99 110 105 47 110 101 116 46 100 34 44 34 99 110 105 86 101 114 115 105 111 110 34 58 34 48 46 51 46 49 34 44 34 100 97 101 109 111 110 83 111 99 107 101 116 68 105 114 34 58 34 47 114 117 110 47 109 117 108 116 117 115 47 115 111 99 107 101 116 34 44 34 103 108 111 98 97 108 78 97 109 101 115 112 97 99 101 115 34 58 34 100 101 102 97 117 108 116 44 111 112 101 110 115 104 105 102 116 45 109 117 108 116 117 115 44 111 112 101 110 115 104 105 102 116 45 115 114 105 111 118 45 110 101 116 119 111 114 107 45 111 112 101 114 97 116 111 114 34 44 34 108 111 103 76 101 118 101 108 34 58 34 118 101 114 98 111 115 101 34 44 34 108 111 103 84 111 83 116 100 101 114 114 34 58 116 114 117 101 44 34 109 117 108 116 117 115 65 117 116 111 99 111 110 102 105 103 68 105 114 34 58 34 47 104 111 115 116 47 114 117 110 47 109 117 108 116 117 115 47 99 110 105 47 110 101 116 46 100 34 44 34 109 117 108 116 117 115 67 111 110 102 105 103 70 105 108 101 34 58 34 97 117 116 111 34 44 34 110 97 109 101 34 58 34 109 117 108 116 117 115 45 99 110 105 45 110 101 116 119 111 114 107 34 44 34 110 97 109 101 115 112 97 99 101 73 115 111 108 97 116 105 111 110 34 58 116 114 117 101 44 34 112 101 114 78 111 100 101 67 101 114 116 105 102 105 99 97 116 101 34 58 123 34 98 111 111 116 115 116 114 97 112 75 117 98 101 99 111 110 102 105 103 34 58 34 47 118 97 114 47 108 105 98 47 107 117 98 101 108 101 116 47 107 117 98 101 99 111 110 102 105 103 34 44 34 99 101 114 116 68 105 114 34 58 34 47 101 116 99 47 99 110 105 47 109 117 108 116 117 115 47 99 101 114 116 115 34 44 34 99 101 114 116 68 117 114 97 116 105 111 110 34 58 34 50 52 104 34 44 34 101 110 97 98 108 101 100 34 58 116 114 117 101 125 44 34 115 111 99 107 101 116 68 105 114 34 58 34 47 104 111 115 116 47 114 117 110 47 109 117 108 116 117 115 47 115 111 99 107 101 116 34 44 34 116 121 112 101 34 58 34 109 117 
108 116 117 115 45 115 104 105 109 34 125]} ContainerID:"5a8d6897ca792d91f1c52054f5f8c596530fbf72d3abb07b19a20fd9c95cc564" Netns:"/var/run/netns/e06c9af7-c13d-426f-9a00-73c54441a20b" IfName:"eth0" Args:"IgnoreUnknown=1;K8S_POD_NAMESPACE=cluster-density-v2-35;K8S_POD_NAME=client-1-5c978b7665-n4tds;K8S_POD_INFRA_CONTAINER_ID=5a8d6897ca792d91f1c52054f5f8c596530fbf72d3abb07b19a20fd9c95cc564;K8S_POD_UID=f57d8281-5a79-4c91-9b83-bb3e4b553597" Path:"" ERRORED: error configuring pod [cluster-density-v2-35/client-1-5c978b7665-n4tds] networking: [cluster-density-v2-35/client-1-5c978b7665-n4tds/f57d8281-5a79-4c91-9b83-bb3e4b553597:ovn-kubernetes]: error adding container to network "ovn-kubernetes": CNI request failed with status 400: '[cluster-density-v2-35/client-1-5c978b7665-n4tds 5a8d6897ca792d91f1c52054f5f8c596530fbf72d3abb07b19a20fd9c95cc564 network default NAD default] [cluster-density-v2-35/client-1-5c978b7665-n4tds 5a8d6897ca792d91f1c52054f5f8c596530fbf72d3abb07b19a20fd9c95cc564 network default NAD default] failed to configure pod interface: timed out waiting for OVS port binding (ovn-installed) for 0a:58:0a:83:03:f6 [10.131.3.246/23] ' ': StdinData: \{"binDir":"/var/lib/cni/bin","clusterNetwork":"/host/run/multus/cni/net.d/10-ovn-kubernetes.conf","cniVersion":"0.3.1","daemonSocketDir":"/run/multus/socket","globalNamespaces":"default,openshift-multus,openshift-sriov-network-operator","logLevel":"verbose","logToStderr":true,"name":"multus-cni-network","namespaceIsolation":true,"type":"multus-shim"}
Version-Release number of selected component (if applicable):
4.16.0-ec.5
How reproducible:
50-60%. It seems to be related to the number of times I have run our test on a single cluster. Many of our performance tests are on ephemeral clusters - so we build the cluster, run the test, tear down. Currently I have a long-lived cluster (1 week old), and I have been running many performance tests against this cluster -- serially. After each test, the previous resources are cleaned up.
Steps to Reproduce:
1. Use the following cmdline as an example. 2. ./bin/amd64/kube-burner-ocp cluster-density-v2 --iterations 90 3. Repeat until the issue arises (usually after 3-4 attempts).
Actual results:
client-1-5c978b7665-n4tds 0/1 ContainerCreating 0 4h14m
Expected results:
For the benchmark not to get stuck waiting for this pod.
Additional info:
Looking at the ovnkube-controller pod logs, grepping for the pod which was stuck:
oc logs -n openshift-ovn-kubernetes ovnkube-node-qpkws -c ovnkube-controller | grep client-1-5c978b7665-n4tds
W0425 13:12:09.302395 6996 base_network_controller_policy.go:545] Failed to get get LSP for pod cluster-density-v2-35/client-1-5c978b7665-n4tds NAD default for networkPolicy allow-from-openshift-ingress, err: logical port cluster-density-v2-35/client-1-5c978b7665-n4tds for pod cluster-density-v2-35_client-1-5c978b7665-n4tds not found in cache
I0425 13:12:09.302412 6996 obj_retry.go:370] Retry add failed for *factory.localPodSelector cluster-density-v2-35/client-1-5c978b7665-n4tds, will try again later: unable to get port info for pod cluster-density-v2-35/client-1-5c978b7665-n4tds NAD default
W0425 13:12:09.908446 6996 helper_linux.go:481] [cluster-density-v2-35/client-1-5c978b7665-n4tds 7f80514901cbc57517d263f1a5aa143d2c82f470132c01f8ba813c18f3160ee4] pod uid f57d8281-5a79-4c91-9b83-bb3e4b553597: timed out waiting for OVS port binding (ovn-installed) for 0a:58:0a:83:03:f6 [10.131.3.246/23]
I0425 13:12:09.963651 6996 cni.go:279] [cluster-density-v2-35/client-1-5c978b7665-n4tds 7f80514901cbc57517d263f1a5aa143d2c82f470132c01f8ba813c18f3160ee4 network default NAD default] ADD finished CNI request [cluster-density-v2-35/client-1-5c978b7665-n4tds 7f80514901cbc57517d263f1a5aa143d2c82f470132c01f8ba813c18f3160ee4 network default NAD default], result "", err failed to configure pod interface: timed out waiting for OVS port binding (ovn-installed) for 0a:58:0a:83:03:f6 [10.131.3.246/23]
I0425 13:12:09.988397 6996 cni.go:258] [cluster-density-v2-35/client-1-5c978b7665-n4tds 7f80514901cbc57517d263f1a5aa143d2c82f470132c01f8ba813c18f3160ee4 network default NAD default] DEL starting CNI request [cluster-density-v2-35/client-1-5c978b7665-n4tds 7f80514901cbc57517d263f1a5aa143d2c82f470132c01f8ba813c18f3160ee4 network default NAD default]
W0425 13:12:09.996899 6996 helper_linux.go:697] Failed to delete pod "cluster-density-v2-35/client-1-5c978b7665-n4tds" interface 7f80514901cbc57: failed to lookup link 7f80514901cbc57: Link not found
I0425 13:12:10.009234 6996 cni.go:279] [cluster-density-v2-35/client-1-5c978b7665-n4tds 7f80514901cbc57517d263f1a5aa143d2c82f470132c01f8ba813c18f3160ee4 network default NAD default] DEL finished CNI request [cluster-density-v2-35/client-1-5c978b7665-n4tds 7f80514901cbc57517d263f1a5aa143d2c82f470132c01f8ba813c18f3160ee4 network default NAD default], result "{"dns":{}}", err <nil>
I0425 13:12:10.059917 6996 cni.go:258] [cluster-density-v2-35/client-1-5c978b7665-n4tds 7f80514901cbc57517d263f1a5aa143d2c82f470132c01f8ba813c18f3160ee4 network default NAD default] DEL starting CNI request [cluster-density-v2-35/client-1-5c978b7665-n4tds 7f80514901cbc57517d263f1a5aa143d2c82f470132c01f8ba813c18f3160ee4 network default NAD default]
Description of problem:
Dynamic plugins using PatternFly 4 could be referring to PF4 variables that do not exist in OpenShift 4.15+. Currently this is causing contrast issues for ACM in dark mode for donut charts.
Version-Release number of selected component (if applicable):
4.15
How reproducible:
Always
Steps to Reproduce:
1. Install ACM on OpenShift 4.15 2. Switch to dark mode 3. Observe Home > Overview page
Actual results:
Some categories in the donut charts cannot be seen due to low contrast
Expected results:
Colors should match those seen in OpenShift 4.14 and earlier
Additional info:
Also posted about this on Slack: https://redhat-internal.slack.com/archives/C011BL0FEKZ/p1720467671332249 Variables like --pf-chart-color-gold-300 are no longer provided, although the PF5 equivalent, --pf-v5-chart-color-gold-300, is available. The stylesheet @patternfly/patternfly/patternfly-charts.scss is present, but not the V4 version. Hopefully it is possible to also include these styles since the names now include a version.
Description of problem:
The 'Getting started resources' card on the Cluster overview includes a link to 'View all steps in documentation', but this link is not valid for ROSA and OSD so it should be hidden.
Placeholder to update the Makefile for the merge-bot.
Description of problem:
If a cluster is running with user-workload-monitoring enabled, running an ose-tests suite against the cluster will fail the data collection step. This is because there is a query in the test framework that assumes that the number of prometheus instances that the thanos pods will connect to will match exactly the number of platform prometheus instances. However, it doesn't account for thanos also connecting to the user-workload-monitoring instances. As such, the test suite will always fail against a cluster that is healthy and running user-workload-monitoring in addition to the normal openshift-monitoring stack.
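For illustration (not part of the original report), the count mismatch can be seen by listing Prometheus pods in both namespaces; the label selector is an assumption based on the standard Prometheus Operator labels:
oc -n openshift-monitoring get pods -l app.kubernetes.io/name=prometheus --no-headers | wc -l
oc -n openshift-user-workload-monitoring get pods -l app.kubernetes.io/name=prometheus --no-headers | wc -l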
Version-Release number of selected component (if applicable):
4.15.13
How reproducible:
Consistent
Steps to Reproduce:
1. Create an OpenShift cluster 2. Enable workload monitoring 3. Attempt to run an ose-tests suite. For example, the CNI conformance suite documented here: https://access.redhat.com/documentation/en-us/red_hat_software_certification/2024/html/red_hat_software_certification_workflow_guide/con_cni-certification_openshift-sw-cert-workflow-working-with-cloud-native-network-function#running-the-cni-tests_openshift-sw-cert-workflow-working-with-container-network-interface
Actual results:
The error message `#### at least one Prometheus sidecar isn't ready` will be displayed, and the metrics collection will fail
Expected results:
Metrics collection succeeds with no errors
Additional info:
This is a clone of issue OCPBUGS-35262. The following is the description of the original issue:
—
Description of problem:
installing into Shared VPC stuck in waiting for network infrastructure ready
Version-Release number of selected component (if applicable):
4.17.0-0.nightly-2024-06-10-225505
How reproducible:
Always
Steps to Reproduce:
1. "create install-config" and then insert Shared VPC settings (see [1]) 2. activate the service account which has the minimum permissions in the host project (see [2]) 3. "create cluster" FYI The GCP project "openshift-qe" is the service project, and the GCP project "openshift-qe-shared-vpc" is the host project.
Actual results:
1. Getting stuck in waiting for network infrastructure to become ready, until Ctrl+C is pressed. 2. 2 firewall-rules are created in the service project unexpectedly (see [3]).
Expected results:
The installation should succeed, and there should be no any firewall-rule getting created either in the service project or in the host project.
Additional info:
Description of problem:
ROSA Cluster creation goes into error status sometimes with version 4.16.0-0.nightly-2024-06-14-072943
Version-Release number of selected component (if applicable):
How reproducible:
60%
Steps to Reproduce:
1. Prepare VPC 2. Create a ROSA STS cluster with subnets 3. Wait for the cluster to be ready
Actual results:
Cluster goes into error status
Expected results:
Cluster get ready
Additional info:
The failure happens with CI job triggering. Here are the jobs:
This fix contains the following changes coming from updated version of kubernetes up to v1.30.6: Changelog: v1.30.6: https://github.com/kubernetes/kubernetes/blob/release-1.30/CHANGELOG/CHANGELOG-1.30.md#changelog-since-v1305
Please review the following PR: https://github.com/openshift/router/pull/604
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
When creating a hosted cluster with --role-arn and --sts-creds, the creation fails
Version-Release number of selected component (if applicable):
4.16 4.17
How reproducible:
100%
Steps to Reproduce:
1. hypershift-no-cgo create iam cli-role 2. aws sts get-session-token --output json 3. hcp create cluster aws --role-arn xxx --sts-creds xxx
Actual results:
2024-06-06T04:34:39Z ERROR Failed to create cluster {"error": "failed to create iam: AccessDenied: User: arn:aws:sts::301721915996:assumed-role/6cd90f28a6449141869b/cli-create-iam is not authorized to perform: iam:TagOpenIDConnectProvider on resource: arn:aws:iam::301721915996:oidc-provider/hypershift-ci-oidc.s3.us-east-1.amazonaws.com/6cd90f28a6449141869b because no identity-based policy allows the iam:TagOpenIDConnectProvider action\n\tstatus code: 403, request id: 20e16ec4-b9a1-4fa4-aa34-1344145d41fd"} github.com/openshift/hypershift/product-cli/cmd/cluster/aws.NewCreateCommand.func1 /remote-source/app/product-cli/cmd/cluster/aws/create.go:60 github.com/spf13/cobra.(*Command).execute /remote-source/app/vendor/github.com/spf13/cobra/command.go:983 github.com/spf13/cobra.(*Command).ExecuteC /remote-source/app/vendor/github.com/spf13/cobra/command.go:1115 github.com/spf13/cobra.(*Command).Execute /remote-source/app/vendor/github.com/spf13/cobra/command.go:1039 github.com/spf13/cobra.(*Command).ExecuteContext /remote-source/app/vendor/github.com/spf13/cobra/command.go:1032 main.main /remote-source/app/product-cli/main.go:60 runtime.main /usr/lib/golang/src/runtime/proc.go:271 Error: failed to create iam: AccessDenied: User: arn:aws:sts::301721915996:assumed-role/6cd90f28a6449141869b/cli-create-iam is not authorized to perform: iam:TagOpenIDConnectProvider on resource: arn:aws:iam::301721915996:oidc-provider/hypershift-ci-oidc.s3.us-east-1.amazonaws.com/6cd90f28a6449141869b because no identity-based policy allows the iam:TagOpenIDConnectProvider action status code: 403, request id: 20e16ec4-b9a1-4fa4-aa34-1344145d41fd failed to create iam: AccessDenied: User: arn:aws:sts::301721915996:assumed-role/6cd90f28a6449141869b/cli-create-iam is not authorized to perform: iam:TagOpenIDConnectProvider on resource: arn:aws:iam::301721915996:oidc-provider/hypershift-ci-oidc.s3.us-east-1.amazonaws.com/6cd90f28a6449141869b because no identity-based policy allows the iam:TagOpenIDConnectProvider action status code: 403, request id: 20e16ec4-b9a1-4fa4-aa34-1344145d41fd {"component":"entrypoint","error":"wrapped process failed: exit status 1","file":"sigs.k8s.io/prow/pkg/entrypoint/run.go:84","func":"sigs.k8s.io/prow/pkg/entrypoint.Options.internalRun","level":"error","msg":"Error executing test process","severity":"error","time":"2024-06-06T04:34:39Z"} error: failed to execute wrapped command: exit status 1
Expected results:
The hosted cluster is created successfully
Additional info:
Full Logs: https://docs.google.com/document/d/1AnvAHXPfPYtP6KRcAKOebAx1wXjhWMOn3TW604XK09o/edit
The same command succeeds when run a second time.
Description of problem:
Customer reports that in the OpenShift Container Platform for a single namespace they are seeing a "TypeError: Cannot read properties of null (reading 'metadata')" error when navigating to the Topology view (Developer Console):
TypeError: Cannot read properties of null (reading 'metadata') at s (https://console.apps.example.com/static/main-chunk-876b3080b765b87baa51.min.js:1:1220454) at s (https://console.apps.example.com/static/dev-console/code-refs/topology-chunk-e4ae65442e61628a832f.min.js:1:424007) at t.a (https://console.apps.example.com/static/dev-console/code-refs/topology-chunk-e4ae65442e61628a832f.min.js:1:330465) at na (https://console.apps.example.com/static/vendors~main-chunk-4b6445a3b3fc17bf0831.min.js:263:58879) at Hs (https://console.apps.example.com/static/vendors~main-chunk-4b6445a3b3fc17bf0831.min.js:263:111315) at xl (https://console.apps.example.com/static/vendors~main-chunk-4b6445a3b3fc17bf0831.min.js:263:98327) at Cl (https://console.apps.example.com/static/vendors~main-chunk-4b6445a3b3fc17bf0831.min.js:263:98255) at _l (https://console.apps.example.com/static/vendors~main-chunk-4b6445a3b3fc17bf0831.min.js:263:98118) at pl (https://console.apps.example.com/static/vendors~main-chunk-4b6445a3b3fc17bf0831.min.js:263:95105) at https://console.apps.example.com/static/vendors~main-chunk-4b6445a3b3fc17bf0831.min.js:263:44774
Screenshot is available in the linked Support Case. The following Stack Trace is shown:
at t.a (https://console.apps.example.com/static/dev-console/code-refs/topology-chunk-e4ae65442e61628a832f.min.js:1:330387) at g at f (https://console.apps.example.com/static/vendors~app/code-refs/actions~delete-revision~dev-console/code-refs/actions~dev-console/code-refs/ad~01887c45-chunk-0fc9a9eb8a528a7c580c.min.js:26:22249) at f (https://console.apps.example.com/static/vendors~app/code-refs/actions~delete-revision~dev-console/code-refs/actions~dev-console/code-refs/ad~01887c45-chunk-0fc9a9eb8a528a7c580c.min.js:26:22249) at f (https://console.apps.example.com/static/vendors~app/code-refs/actions~delete-revision~dev-console/code-refs/actions~dev-console/code-refs/ad~01887c45-chunk-0fc9a9eb8a528a7c580c.min.js:26:22249) at f (https://console.apps.example.com/static/vendors~app/code-refs/actions~delete-revision~dev-console/code-refs/actions~dev-console/code-refs/ad~01887c45-chunk-0fc9a9eb8a528a7c580c.min.js:26:22249) at f (https://console.apps.example.com/static/vendors~app/code-refs/actions~delete-revision~dev-console/code-refs/actions~dev-console/code-refs/ad~01887c45-chunk-0fc9a9eb8a528a7c580c.min.js:26:22249) at f (https://console.apps.example.com/static/vendors~app/code-refs/actions~delete-revision~dev-console/code-refs/actions~dev-console/code-refs/ad~01887c45-chunk-0fc9a9eb8a528a7c580c.min.js:26:22249) at f (https://console.apps.example.com/static/vendors~app/code-refs/actions~delete-revision~dev-console/code-refs/actions~dev-console/code-refs/ad~01887c45-chunk-0fc9a9eb8a528a7c580c.min.js:26:22249) at f (https://console.apps.example.com/static/vendors~app/code-refs/actions~delete-revision~dev-console/code-refs/actions~dev-console/code-refs/ad~01887c45-chunk-0fc9a9eb8a528a7c580c.min.js:26:22249) at g at f (https://console.apps.example.com/static/vendors~app/code-refs/actions~delete-revision~dev-console/code-refs/actions~dev-console/code-refs/ad~01887c45-chunk-0fc9a9eb8a528a7c580c.min.js:26:22249) at f (https://console.apps.example.com/static/vendors~app/code-refs/actions~delete-revision~dev-console/code-refs/actions~dev-console/code-refs/ad~01887c45-chunk-0fc9a9eb8a528a7c580c.min.js:26:22249) at g at a (https://console.apps.example.com/static/vendors~app/code-refs/actions~delete-revision~dev-console/code-refs/actions~dev-console/code-refs/ad~01887c45-chunk-0fc9a9eb8a528a7c580c.min.js:26:245070) at f (https://console.apps.example.com/static/vendors~app/code-refs/actions~delete-revision~dev-console/code-refs/actions~dev-console/code-refs/ad~01887c45-chunk-0fc9a9eb8a528a7c580c.min.js:26:22249) at g at f (https://console.apps.example.com/static/vendors~app/code-refs/actions~delete-revision~dev-console/code-refs/actions~dev-console/code-refs/ad~01887c45-chunk-0fc9a9eb8a528a7c580c.min.js:26:22249) at g at t.a (https://console.apps.example.com/static/dev-console/code-refs/topology-chunk-e4ae65442e61628a832f.min.js:1:426770) at f (https://console.apps.example.com/static/vendors~app/code-refs/actions~delete-revision~dev-console/code-refs/actions~dev-console/code-refs/ad~01887c45-chunk-0fc9a9eb8a528a7c580c.min.js:26:22249) at f (https://console.apps.example.com/static/vendors~app/code-refs/actions~delete-revision~dev-console/code-refs/actions~dev-console/code-refs/ad~01887c45-chunk-0fc9a9eb8a528a7c580c.min.js:26:22249) at f (https://console.apps.example.com/static/vendors~app/code-refs/actions~delete-revision~dev-console/code-refs/actions~dev-console/code-refs/ad~01887c45-chunk-0fc9a9eb8a528a7c580c.min.js:26:22249) at f 
(https://console.apps.example.com/static/vendors~app/code-refs/actions~delete-revision~dev-console/code-refs/actions~dev-console/code-refs/ad~01887c45-chunk-0fc9a9eb8a528a7c580c.min.js:26:22249) at f (https://console.apps.example.com/static/vendors~app/code-refs/actions~delete-revision~dev-console/code-refs/actions~dev-console/code-refs/ad~01887c45-chunk-0fc9a9eb8a528a7c580c.min.js:26:22249) at g at f (https://console.apps.example.com/static/vendors~app/code-refs/actions~delete-revision~dev-console/code-refs/actions~dev-console/code-refs/ad~01887c45-chunk-0fc9a9eb8a528a7c580c.min.js:26:22249) at a (https://console.apps.example.com/static/vendors~app/code-refs/actions~delete-revision~dev-console/code-refs/actions~dev-console/code-refs/ad~01887c45-chunk-0fc9a9eb8a528a7c580c.min.js:26:242507) at svg at div at https://console.apps.example.com/static/vendors~app/code-refs/actions~delete-revision~dev-console/code-refs/actions~dev-console/code-refs/ad~01887c45-chunk-0fc9a9eb8a528a7c580c.min.js:26:603940 at u (https://console.apps.example.com/static/vendors~app/code-refs/actions~delete-revision~dev-console/code-refs/actions~dev-console/code-refs/ad~01887c45-chunk-0fc9a9eb8a528a7c580c.min.js:26:602181) at f (https://console.apps.example.com/static/vendors~app/code-refs/actions~delete-revision~dev-console/code-refs/actions~dev-console/code-refs/ad~01887c45-chunk-0fc9a9eb8a528a7c580c.min.js:26:22249) at e.a (https://console.apps.example.com/static/vendors~app/code-refs/actions~delete-revision~dev-console/code-refs/actions~dev-console/code-refs/ad~01887c45-chunk-0fc9a9eb8a528a7c580c.min.js:26:398426) at div at https://console.apps.example.com/static/dev-console/code-refs/topology-chunk-e4ae65442e61628a832f.min.js:1:353461 at https://console.apps.example.com/static/dev-console/code-refs/topology-chunk-e4ae65442e61628a832f.min.js:1:354168 at s (https://console.apps.example.com/static/main-chunk-876b3080b765b87baa51.min.js:1:1405970) at S (https://console.apps.example.com/static/vendors~main-chunk-4b6445a3b3fc17bf0831.min.js:98:86864) at i (https://console.apps.example.com/static/main-chunk-876b3080b765b87baa51.min.js:1:452052) at withFallback(Connect(withUserSettingsCompatibility(undefined))) at div at div at c (https://console.apps.example.com/static/vendor-patternfly-core-chunk-cdcfdc55890623d5fc26.min.js:1:62178) at div at div at c (https://console.apps.example.com/static/vendor-patternfly-core-chunk-cdcfdc55890623d5fc26.min.js:1:545565) at d (https://console.apps.example.com/static/vendor-patternfly-core-chunk-cdcfdc55890623d5fc26.min.js:1:775077) at div at d (https://console.apps.example.com/static/vendor-patternfly-core-chunk-cdcfdc55890623d5fc26.min.js:1:458280) at div at div at c (https://console.apps.example.com/static/vendor-patternfly-core-chunk-cdcfdc55890623d5fc26.min.js:1:719437) at div at c (https://console.apps.example.com/static/vendor-patternfly-core-chunk-cdcfdc55890623d5fc26.min.js:1:9899) at div at https://console.apps.example.com/static/dev-console/code-refs/topology-chunk-e4ae65442e61628a832f.min.js:1:512628 at S (https://console.apps.example.com/static/vendors~main-chunk-4b6445a3b3fc17bf0831.min.js:98:86864) at t (https://console.apps.example.com/static/vendors~main-chunk-4b6445a3b3fc17bf0831.min.js:123:75018) at https://console.apps.example.com/static/dev-console/code-refs/topology-chunk-e4ae65442e61628a832f.min.js:1:511867 at https://console.apps.example.com/static/vendors~main-chunk-4b6445a3b3fc17bf0831.min.js:150:220157 at 
https://console.apps.example.com/static/main-chunk-876b3080b765b87baa51.min.js:1:375316 at div at R (https://console.apps.example.com/static/main-chunk-876b3080b765b87baa51.min.js:1:183146) at N (https://console.apps.example.com/static/main-chunk-876b3080b765b87baa51.min.js:1:183594) at f (https://console.apps.example.com/static/vendors~app/code-refs/actions~delete-revision~dev-console/code-refs/actions~dev-console/code-refs/ad~01887c45-chunk-0fc9a9eb8a528a7c580c.min.js:26:22249) at https://console.apps.example.com/static/dev-console/code-refs/topology-chunk-e4ae65442e61628a832f.min.js:1:509351 at https://console.apps.example.com/static/main-chunk-876b3080b765b87baa51.min.js:1:548866 at S (https://console.apps.example.com/static/vendors~main-chunk-4b6445a3b3fc17bf0831.min.js:98:86864) at div at div at t.b (https://console.apps.example.com/static/dev-console/code-refs/common-chunk-5e4f38c02bde64a97ae5.min.js:1:113711) at t.a (https://console.apps.example.com/static/dev-console/code-refs/common-chunk-5e4f38c02bde64a97ae5.min.js:1:116541) at u (https://console.apps.example.com/static/dev-console/code-refs/topology-chunk-e4ae65442e61628a832f.min.js:1:305613) at https://console.apps.example.com/static/dev-console/code-refs/topology-chunk-e4ae65442e61628a832f.min.js:1:509656 at i (https://console.apps.example.com/static/main-chunk-876b3080b765b87baa51.min.js:1:452052) at withFallback() at t.a (https://console.apps.example.com/static/dev-console/code-refs/topology-chunk-e4ae65442e61628a832f.min.js:1:553554) at t (https://console.apps.example.com/static/vendors~main-chunk-4b6445a3b3fc17bf0831.min.js:21:67625) at I (https://console.apps.example.com/static/main-chunk-876b3080b765b87baa51.min.js:1:1533554) at t (https://console.apps.example.com/static/vendors~main-chunk-4b6445a3b3fc17bf0831.min.js:21:69670) at Suspense at i (https://console.apps.example.com/static/main-chunk-876b3080b765b87baa51.min.js:1:452052) at section at m (https://console.apps.example.com/static/vendor-patternfly-core-chunk-cdcfdc55890623d5fc26.min.js:1:720427) at div at div at t.a (https://console.apps.example.com/static/main-chunk-876b3080b765b87baa51.min.js:1:1533801) at div at div at c (https://console.apps.example.com/static/vendor-patternfly-core-chunk-cdcfdc55890623d5fc26.min.js:1:545565) at d (https://console.apps.example.com/static/vendor-patternfly-core-chunk-cdcfdc55890623d5fc26.min.js:1:775077) at div at d (https://console.apps.example.com/static/vendor-patternfly-core-chunk-cdcfdc55890623d5fc26.min.js:1:458280) at l (https://console.apps.example.com/static/main-chunk-876b3080b765b87baa51.min.js:1:1175827) at https://console.apps.example.com/static/main-chunk-876b3080b765b87baa51.min.js:1:458912 at S (https://console.apps.example.com/static/vendors~main-chunk-4b6445a3b3fc17bf0831.min.js:98:86864) at main at div at v (https://console.apps.example.com/static/vendor-patternfly-core-chunk-cdcfdc55890623d5fc26.min.js:1:264220) at div at div at c (https://console.apps.example.com/static/vendor-patternfly-core-chunk-cdcfdc55890623d5fc26.min.js:1:62178) at div at div at c (https://console.apps.example.com/static/vendor-patternfly-core-chunk-cdcfdc55890623d5fc26.min.js:1:545565) at d (https://console.apps.example.com/static/vendor-patternfly-core-chunk-cdcfdc55890623d5fc26.min.js:1:775077) at div at d (https://console.apps.example.com/static/vendor-patternfly-core-chunk-cdcfdc55890623d5fc26.min.js:1:458280) at Un (https://console.apps.example.com/static/vendors~main-chunk-4b6445a3b3fc17bf0831.min.js:36:183620) at t.default 
(https://console.apps.example.com/static/main-chunk-876b3080b765b87baa51.min.js:1:880042) at e.default (https://console.apps.example.com/static/quick-start-chunk-794085a235e14913bdf3.min.js:1:3540) at s (https://console.apps.example.com/static/main-chunk-876b3080b765b87baa51.min.js:1:239711) at t.a (https://console.apps.example.com/static/main-chunk-876b3080b765b87baa51.min.js:1:1610459) at ee (https://console.apps.example.com/static/main-chunk-876b3080b765b87baa51.min.js:1:1628636) at _t (https://console.apps.example.com/static/vendors~main-chunk-4b6445a3b3fc17bf0831.min.js:36:142374) at ee (https://console.apps.example.com/static/main-chunk-876b3080b765b87baa51.min.js:1:1628636) at ee (https://console.apps.example.com/static/main-chunk-876b3080b765b87baa51.min.js:1:1628636) at ee (https://console.apps.example.com/static/main-chunk-876b3080b765b87baa51.min.js:1:1628636) at i (https://console.apps.example.com/static/main-chunk-876b3080b765b87baa51.min.js:1:830807) at t.a (https://console.apps.example.com/static/main-chunk-876b3080b765b87baa51.min.js:1:1604651) at t.a (https://console.apps.example.com/static/main-chunk-876b3080b765b87baa51.min.js:1:1604840) at t.a (https://console.apps.example.com/static/main-chunk-876b3080b765b87baa51.min.js:1:1602256) at te (https://console.apps.example.com/static/main-chunk-876b3080b765b87baa51.min.js:1:1628767) at https://console.apps.example.com/static/main-chunk-876b3080b765b87baa51.min.js:1:1631899 at r (https://console.apps.example.com/static/vendors~main-chunk-4b6445a3b3fc17bf0831.min.js:36:121910) at t (https://console.apps.example.com/static/vendors~main-chunk-4b6445a3b3fc17bf0831.min.js:21:67625) at t (https://console.apps.example.com/static/vendors~main-chunk-4b6445a3b3fc17bf0831.min.js:21:69670) at t (https://console.apps.example.com/static/vendors~main-chunk-4b6445a3b3fc17bf0831.min.js:21:64230) at re (https://console.apps.example.com/static/main-chunk-876b3080b765b87baa51.min.js:1:1632210) at t.a (https://console.apps.example.com/static/main-chunk-876b3080b765b87baa51.min.js:1:804787) at t.a (https://console.apps.example.com/static/main-chunk-876b3080b765b87baa51.min.js:1:1079398) at s (https://console.apps.example.com/static/main-chunk-876b3080b765b87baa51.min.js:1:654118) at t.a (https://console.apps.example.com/static/vendors~main-chunk-4b6445a3b3fc17bf0831.min.js:150:195887) at Suspense
Version-Release number of selected component (if applicable):
OpenShift Container Platform 4.13.38 Developer Console
How reproducible:
Only on customer side, in a single namespace on a single cluster
Steps to Reproduce:
1. On a particular cluster, enter the Developer Console 2. Navigate to "Topology"
Actual results:
Loading the page fails with the error "TypeError: Cannot read properties of null (reading 'metadata')"
Expected results:
No error is shown. The Topology view is shown
Additional info:
- Screenshot available in the linked Support Case
- HAR file available in the linked Support Case
This is a clone of issue OCPBUGS-37588. The following is the description of the original issue:
—
Description of problem:
Creating and destroying transit gateways (TGs) during CI testing is costing an abnormal amount of money. Since the monetary cost of creating a TG is high, add support for a user-created TG when creating an OpenShift cluster.
Version-Release number of selected component (if applicable):
all
How reproducible:
always
Description of problem:
After installing the Pipelines Operator on a local cluster (OpenShift Local), the Pipelines features were shown in the Console.
But when selecting the Build option "Pipelines" a warning was shown:
The pipeline template for Dockerfiles is not available at this time.
It was still possible to press the Create button and create a Deployment, but because no build process was created, it couldn't start successfully.
About 20 minutes after the Pipelines Operator reported that it was successfully installed, the Pipeline templates appeared in the openshift-pipelines namespace, and I could create a valid Deployment.
Version-Release number of selected component (if applicable):
How reproducible:
Sometimes, maybe depending on the internet connection speed.
Steps to Reproduce:
Actual results:
Expected results:
Additional info:
TRT has detected a consistent long-term trend where the oauth-apiserver appears to have more disruption than it did in 4.16 for minor upgrades on Azure.
The problem appears roughly above the 90th percentile; we picked it up at P95, where it shows a consistent 5-8s more disruption than we'd expect given the data at 4.16 GA.
The problem hits ONLY oauth, affecting both new and reused connections, as well as the cached variants meaning etcd should be out of the picture. You'll see a few very short blips where all four of these backends lose connectivity for ~1s throughout the run, several times over. It looks like it may be correlated to the oauth-operator reporting:
source/OperatorAvailable display/true condition/Available reason/APIServices_Error status/False APIServicesAvailable: apiservices.apiregistration.k8s.io/v1.user.openshift.io: not available: endpoints for service/api in "openshift-oauth-apiserver" have no addresses with port name "https" [2s]
Sample jobs:
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-azure-ovn-upgrade/1827669509775822848
Intervals: https://sippy.dptools.openshift.org/sippy-ng/job_runs/1827669509775822848/periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-azure-ovn-upgrade/intervals?filterText=&intervalFile=e2e-timelines_spyglass_20240825-122854.json&overrideDisplayFlag=1&selectedSources=OperatorAvailable&selectedSources=OperatorDegraded&selectedSources=KubeletLog&selectedSources=EtcdLog&selectedSources=EtcdLeadership&selectedSources=Alert&selectedSources=Disruption&selectedSources=E2EFailed&selectedSources=APIServerGracefulShutdown&selectedSources=KubeEvent&selectedSources=NodeState
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-azure-ovn-upgrade/1827669493837467648
Intervals: https://sippy.dptools.openshift.org/sippy-ng/job_runs/1827669493837467648/periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-azure-ovn-upgrade/intervals?filterText=&intervalFile=e2e-timelines_spyglass_20240825-122623.json&overrideDisplayFlag=1&selectedSources=OperatorAvailable&selectedSources=KubeletLog&selectedSources=EtcdLog&selectedSources=EtcdLeadership&selectedSources=Alert&selectedSources=Disruption&selectedSources=E2EFailed&selectedSources=APIServerGracefulShutdown&selectedSources=KubeEvent
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-azure-ovn-upgrade/1827669493837467648
Intervals: https://sippy.dptools.openshift.org/sippy-ng/job_runs/1827669493837467648/periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-azure-ovn-upgrade/intervals?filterText=&intervalFile=e2e-timelines_spyglass_20240825-122623.json&overrideDisplayFlag=1&selectedSources=OperatorAvailable&selectedSources=OperatorProgressing&selectedSources=OperatorDegraded&selectedSources=KubeletLog&selectedSources=EtcdLog&selectedSources=EtcdLeadership&selectedSources=Alert&selectedSources=Disruption&selectedSources=E2EFailed&selectedSources=APIServerGracefulShutdown&selectedSources=KubeEvent&selectedSources=NodeState
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-azure-ovn-upgrade/1827077182283845632
Intervals: https://sippy.dptools.openshift.org/sippy-ng/job_runs/1827077182283845632/periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-azure-ovn-upgrade/intervals?filterText=&intervalFile=e2e-timelines_spyglass_20240823-212127.json&overrideDisplayFlag=1&selectedSources=OperatorDegraded&selectedSources=EtcdLog&selectedSources=Disruption&selectedSources=E2EFailed
More can be found using the first link to the dashboard in this post, scrolling down to the most recent job runs and looking for high numbers.
The operator going Degraded is probably the strongest symptom to pursue, as it appears in most of the runs above.
If you find any runs where other backends are disrupted, especially kube-api, I would suggest ignoring those as they are unlikely to be the same fingerprint as the error being described here.
The multus-admission-controller does not retain its container resource requests/limits if manually set. The cluster-network-operator overwrites any modifications on the next reconciliation. This resource preservation support has already been added to all other components in https://github.com/openshift/hypershift/pull/1082 and https://github.com/openshift/hypershift/pull/3120. Similar changes should be made for the multus-admission-controller so all hosted control plane components demonstrate the same resource preservation behavior.
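A minimal sketch of the kind of preservation logic this asks for, assuming hypothetical function and variable names rather than the actual cluster-network-operator or HyperShift code: before applying the desired deployment, carry over any container resources that were already set on the live object.
~~~
package sketch

import (
	appsv1 "k8s.io/api/apps/v1"
)

// preserveContainerResources copies resource requests/limits that were
// manually set on the live deployment's containers into the desired
// deployment, so a reconcile does not wipe them out. Containers are
// matched by name; containers without existing resources keep whatever
// defaults the desired spec carries.
func preserveContainerResources(live, desired *appsv1.Deployment) {
	byName := map[string]int{}
	for i, c := range live.Spec.Template.Spec.Containers {
		byName[c.Name] = i
	}
	for i, c := range desired.Spec.Template.Spec.Containers {
		j, ok := byName[c.Name]
		if !ok {
			continue
		}
		res := live.Spec.Template.Spec.Containers[j].Resources
		if len(res.Requests) > 0 || len(res.Limits) > 0 {
			desired.Spec.Template.Spec.Containers[i].Resources = res
		}
	}
}
~~~
The linked HyperShift PRs implement this kind of carry-over for the other hosted control plane components; the ask is to apply the same behavior to the multus-admission-controller.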
Please review the following PR: https://github.com/openshift/ironic-static-ip-manager/pull/43
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
Inspection is failing on hosts which special characters found in serial number of block devices: Jul 03 09:16:11 master3.xxxxxx.yyy ironic-agent[2272]: 2024-07-03 09:16:11.325 1 DEBUG ironic_python_agent.inspector [-] collected data: {'inventory'....'error': "The following errors were encountered:\n* collector logs failed: 'utf-8' codec can't decode byte 0xff in position 12: invalid start byte"} call_inspector /usr/lib/python3.9/site-packages/ironic_python_agent/inspector.py:128 Serial found: "serial": "2HC015KJ0000\udcff\udcff\udcff\udcff\udcff\udcff\udcff\udcff" Interesting stacktrace error: Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]: UnicodeEncodeError: 'utf-8' codec can't encode characters in position 1260-1267: surrogates not allowed Full stack trace: ~~~ Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]: 2024-07-03 09:16:11.628 1 DEBUG oslo_concurrency.processutils [-] CMD "lsblk -bia --json -oKNAME,MODEL,SIZE,ROTA,TYPE,UUID,PARTUUID,SERIAL" returned: 0 in 0.006s e xecute /usr/lib/python3.9/site-packages/oslo_concurrency/processutils.py:422 Jul 03 09:16:11 master3.xxxxxx.yyy ironic-agent[2272]: --- Logging error --- Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]: --- Logging error --- Jul 03 09:16:11 master3.xxxxxx.yyy ironic-agent[2272]: Traceback (most recent call last): Jul 03 09:16:11 master3.xxxxxx.yyy ironic-agent[2272]: File "/usr/lib64/python3.9/logging/__init__.py", line 1086, in emit Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]: Traceback (most recent call last): Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]: File "/usr/lib64/python3.9/logging/__init__.py", line 1086, in emit Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]: stream.write(msg + self.terminator) Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]: UnicodeEncodeError: 'utf-8' codec can't encode characters in position 1260-1267: surrogates not allowed Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]: Call stack: Jul 03 09:16:11 master3.xxxxxx.yyy ironic-agent[2272]: stream.write(msg + self.terminator) Jul 03 09:16:11 master3.xxxxxx.yyy ironic-agent[2272]: UnicodeEncodeError: 'utf-8' codec can't encode characters in position 1260-1267: surrogates not allowed Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]: File "/usr/bin/ironic-python-agent", line 10, in <module> Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]: sys.exit(run()) Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]: File "/usr/lib/python3.9/site-packages/ironic_python_agent/cmd/agent.py", line 50, in run Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]: agent.IronicPythonAgent(CONF.api_url, Jul 03 09:16:11 master3.xxxxxx.yyy ironic-agent[2272]: Call stack: Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]: File "/usr/lib/python3.9/site-packages/ironic_python_agent/agent.py", line 485, in run Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]: self.process_lookup_data(content) Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]: File "/usr/lib/python3.9/site-packages/ironic_python_agent/agent.py", line 400, in process_lookup_data Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]: hardware.cache_node(self.node) Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]: File "/usr/lib/python3.9/site-packages/ironic_python_agent/hardware.py", line 3179, in cache_node Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]: dispatch_to_managers('wait_for_disks') Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]: File "/usr/lib/python3.9/site-packages/ironic_python_agent/hardware.py", line 3124, in dispatch_to_managers Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]: return 
getattr(manager, method)(*args, **kwargs) Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]: File "/usr/lib/python3.9/site-packages/ironic_python_agent/hardware.py", line 997, in wait_for_disks Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]: self.get_os_install_device() Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]: File "/usr/lib/python3.9/site-packages/ironic_python_agent/hardware.py", line 1518, in get_os_install_device Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]: block_devices = self.list_block_devices_check_skip_list( Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]: File "/usr/lib/python3.9/site-packages/ironic_python_agent/hardware.py", line 1495, in list_block_devices_check_skip_list Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]: block_devices = self.list_block_devices( Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]: File "/usr/lib/python3.9/site-packages/ironic_python_agent/hardware.py", line 1460, in list_block_devices Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]: block_devices = list_all_block_devices() Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]: File "/usr/lib/python3.9/site-packages/ironic_python_agent/hardware.py", line 526, in list_all_block_devices Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]: report = il_utils.execute('lsblk', '-bia', '--json', Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]: File "/usr/lib/python3.9/site-packages/ironic_lib/utils.py", line 111, in execute Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]: _log(result[0], result[1]) Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]: File "/usr/lib/python3.9/site-packages/ironic_lib/utils.py", line 99, in _log Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]: LOG.debug('Command stdout is: "%s"', stdout) Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]: Message: 'Command stdout is: "%s"' Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]: Arguments: ('{\n "blockdevices": [\n {\n "kname": "loop0",\n "model": null,\n "size": 67467313152,\n "rota": false,\n "type": "loop",\n "uuid": "28f5ff52-7f5b-4e5a-bcf2-59813e5aef5a",\n "partuuid": null,\n "serial": null\n },{\n "kname": "loop1",\n "model": null,\n "size": 1027846144,\n "rota": false,\n "type": "loop",\n "uuid": null,\n "partuuid": null,\n "serial": null\n },{\n "kname": "sda",\n "model": "LITEON IT ECE-12",\n "size": 120034123776,\n "rota": false,\n "type": "disk",\n "uuid": null,\n "partuuid": null,\n "serial": "XXXXXXXXXXXXXXXXXX"\n },{\n "kname": "sdb",\n "model": "LITEON IT ECE-12",\n "size": 120034123776,\n "rota": false,\n "type": "disk",\n "uuid": null,\n "partuuid": null,\n "serial": "XXXXXXXXXXXXXXXXXXXX"\n },{\n "kname": "sdc",\n "model": "External",\n "size": 0,\n "rota": true,\n "type": "disk",\n "uuid": null,\n "partuuid": null,\n "serial": "2HC015KJ0000\udcff\udcff\udcff\udcff\udcff\udcff\udcff\udcff"\n }\n ]\n}\n',) ~~~
Version-Release number of selected component (if applicable):
OCP 4.14.28
How reproducible:
Always
Steps to Reproduce:
1. Add a BMH with a bad utf-8 characters in serial 2. 3.
Actual results:
Inspection fail
Expected results:
Inspection works
Additional info:
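The agent code in the trace is Python (the failure is a UnicodeEncodeError on surrogate-escaped bytes from lsblk), but the underlying idea of a fix is simply to sanitize non-UTF-8 bytes in device serials before logging or serializing them. A small illustrative sketch in Go, not the ironic-python-agent change itself:
~~~
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// sanitizeSerial replaces invalid UTF-8 sequences in a raw device serial
// with "?" so the value can be safely logged or embedded in JSON.
func sanitizeSerial(raw []byte) string {
	if utf8.Valid(raw) {
		return string(raw)
	}
	return strings.ToValidUTF8(string(raw), "?")
}

func main() {
	// A serial ending in raw 0xff bytes, like the "2HC015KJ0000..." case above.
	raw := append([]byte("2HC015KJ0000"), 0xff, 0xff, 0xff, 0xff)
	// Prints "2HC015KJ0000?": the run of invalid bytes collapses to one marker.
	fmt.Println(sanitizeSerial(raw))
}
~~~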
This is a clone of issue OCPBUGS-37780. The following is the description of the original issue:
—
As of now, it is possible to set different architectures for the compute machine pools when both the 'worker' and 'edge' machine pools are defined in the install-config.
Example:
compute: - name: worker architecture: arm64 ... - name: edge architecture: amd64 platform: aws: zones: ${edge_zones_str}
See https://github.com/openshift/installer/blob/master/pkg/types/validation/installconfig.go#L631
Description of problem:
The pod of a CatalogSource without registryPoll isn't recreated during a node failure
jiazha-mac:~ jiazha$ oc get pods NAME READY STATUS RESTARTS AGE certified-operators-rcs64 1/1 Running 0 123m community-operators-8mxh6 1/1 Running 0 123m marketplace-operator-769fbb9898-czsfn 1/1 Running 4 (117m ago) 136m qe-app-registry-5jxlx 1/1 Running 0 106m redhat-marketplace-4bgv9 1/1 Running 0 123m redhat-operators-ww5tb 1/1 Running 0 123m test-2xvt8 1/1 Terminating 0 12m jiazha-mac:~ jiazha$ oc get pods test-2xvt8 -o wide NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES test-2xvt8 1/1 Running 0 7m6s 10.129.2.26 qe-daily-417-0708-cv2p6-worker-westus-gcrrc <none> <none> jiazha-mac:~ jiazha$ oc get node qe-daily-417-0708-cv2p6-worker-westus-gcrrc NAME STATUS ROLES AGE VERSION qe-daily-417-0708-cv2p6-worker-westus-gcrrc NotReady worker 116m v1.30.2+421e90e
Version-Release number of selected component (if applicable):
Cluster version is 4.17.0-0.nightly-2024-07-07-131215
How reproducible:
always
Steps to Reproduce:
1. create a catalogsource without the registryPoll configure. jiazha-mac:~ jiazha$ cat cs-32183.yaml apiVersion: operators.coreos.com/v1alpha1 kind: CatalogSource metadata: name: test namespace: openshift-marketplace spec: displayName: Test Operators image: registry.redhat.io/redhat/redhat-operator-index:v4.16 publisher: OpenShift QE sourceType: grpc jiazha-mac:~ jiazha$ oc create -f cs-32183.yaml catalogsource.operators.coreos.com/test created jiazha-mac:~ jiazha$ oc get pods test-2xvt8 -o wide NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES test-2xvt8 1/1 Running 0 3m18s 10.129.2.26 qe-daily-417-0708-cv2p6-worker-westus-gcrrc <none> <none> 2. Stop the node jiazha-mac:~ jiazha$ oc debug node/qe-daily-417-0708-cv2p6-worker-westus-gcrrc Temporary namespace openshift-debug-q4d5k is created for debugging node... Starting pod/qe-daily-417-0708-cv2p6-worker-westus-gcrrc-debug-v665f ... To use host binaries, run `chroot /host` Pod IP: 10.0.128.5 If you don't see a command prompt, try pressing enter. sh-5.1# chroot /host sh-5.1# systemctl stop kubelet; sleep 600; systemctl start kubelet Removing debug pod ... Temporary namespace openshift-debug-q4d5k was removed. jiazha-mac:~ jiazha$ oc get node qe-daily-417-0708-cv2p6-worker-westus-gcrrc NAME STATUS ROLES AGE VERSION qe-daily-417-0708-cv2p6-worker-westus-gcrrc NotReady worker 115m v1.30.2+421e90e 3. check it this catalogsource's pod recreated.
Actual results:
No new pod was generated.
jiazha-mac:~ jiazha$ oc get pods NAME READY STATUS RESTARTS AGE certified-operators-rcs64 1/1 Running 0 123m community-operators-8mxh6 1/1 Running 0 123m marketplace-operator-769fbb9898-czsfn 1/1 Running 4 (117m ago) 136m qe-app-registry-5jxlx 1/1 Running 0 106m redhat-marketplace-4bgv9 1/1 Running 0 123m redhat-operators-ww5tb 1/1 Running 0 123m test-2xvt8 1/1 Terminating 0 12m
Once the node recovered, a new pod was generated.
jiazha-mac:~ jiazha$ oc get node qe-daily-417-0708-cv2p6-worker-westus-gcrrc
NAME STATUS ROLES AGE VERSION
qe-daily-417-0708-cv2p6-worker-westus-gcrrc Ready worker 127m v1.30.2+421e90e
jiazha-mac:~ jiazha$ oc get pods
NAME READY STATUS RESTARTS AGE
certified-operators-rcs64 1/1 Running 0 127m
community-operators-8mxh6 1/1 Running 0 127m
marketplace-operator-769fbb9898-czsfn 1/1 Running 4 (121m ago) 140m
qe-app-registry-5jxlx 1/1 Running 0 109m
redhat-marketplace-4bgv9 1/1 Running 0 127m
redhat-operators-ww5tb 1/1 Running 0 127m
test-wqxvg 1/1 Running 0 27s
Expected results:
During the node failure, a new catalog source pod should be generated.
Additional info:
Hi Team,
After some more investigation of the operator-lifecycle-manager source code, we figured out the reason.
apiVersion: operators.coreos.com/v1alpha1 kind: CatalogSource metadata: name: redhat-operator-index namespace: openshift-marketplace spec: image: quay-server.bastion.tokyo.com:5000/redhat/redhat-operator-index-logging:logging-vstable-5.8-v5.8.5 sourceType: grpc
And we verified that the catalog pod can be recreated on another node if we add the registryPoll configuration to the CatalogSource as follows (see the lines marked with <==).
apiVersion: operators.coreos.com/v1alpha1 kind: CatalogSource metadata: name: redhat-operator-index namespace: openshift-marketplace spec: image: quay-server.bastion.tokyo.com:5000/redhat/redhat-operator-index-logging:logging-vstable-5.8-v5.8.5 sourceType: grpc updateStrategy: <== registryPoll: <== interval: 10m <==
registryPoll is NOT a required field for a CatalogSource.
So the commit [1] that tries to fix the issue in EnsureRegistryServer() does not fix it properly.
[1] https://github.com/operator-framework/operator-lifecycle-manager/pull/3201/files
[2] https://github.com/joelanford/operator-lifecycle-manager/blob/82f499723e52e85f28653af0610b6e7feff096cf/pkg/controller/registry/reconciler/grpc.go#L290
[3] https://github.com/operator-framework/operator-lifecycle-manager/blob/master/pkg/controller/operators/catalog/operator.go#L1009
[4] https://docs.openshift.com/container-platform/4.16/operators/admin/olm-managing-custom-catalogs.html
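A minimal sketch of the kind of liveness check the catalog pod needs regardless of registryPoll, purely as an illustration and not OLM's actual reconciler code: if the catalog pod is stuck terminating on a node that is no longer Ready, remove it so a replacement can be scheduled. The names and client wiring here are assumptions.
~~~
package sketch

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// recreateIfStuck deletes a catalog pod that is stuck terminating on a node
// that is no longer Ready, so the reconciler can schedule a replacement.
func recreateIfStuck(ctx context.Context, c kubernetes.Interface, pod *corev1.Pod) error {
	if pod.DeletionTimestamp == nil {
		return nil // pod is not terminating, nothing to do
	}
	node, err := c.CoreV1().Nodes().Get(ctx, pod.Spec.NodeName, metav1.GetOptions{})
	if err != nil {
		return err
	}
	for _, cond := range node.Status.Conditions {
		if cond.Type == corev1.NodeReady && cond.Status != corev1.ConditionTrue {
			// Force removal of the stuck pod; a new one can then be created.
			grace := int64(0)
			return c.CoreV1().Pods(pod.Namespace).Delete(ctx, pod.Name,
				metav1.DeleteOptions{GracePeriodSeconds: &grace})
		}
	}
	return nil
}
~~~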
Description of problem:
Once you register an IDMS/ICSP that only contains the root URL for the source registry, the registry-overrides flag is not properly populated and the mirrors are ignored. Sample: apiVersion: config.openshift.io/v1 kind: ImageDigestMirrorSet metadata: name: image-policy spec: imageDigestMirrors: - mirrors: - registry.vshiray.net/redhat.io source: registry.redhat.io - mirrors: - registry.vshiray.net/connect.redhat.com source: registry.connect.redhat.com - mirrors: - registry.vshiray.net/gcr.io source: gcr.io - mirrors: - registry.vshiray.net/docker.io source: docker.io
Version-Release number of selected component (if applicable):
How reproducible:
100%
Steps to Reproduce:
1. Deploy a IDMS with root registries 2. Try to deploy a Disconnected HostedCluster using the internal registry
Actual results:
The registry-overrides flag is empty, so the disconnected deployment is stuck
Expected results:
The registry-overrides flag is properly populated and the deployment can continue
Additional info:
To work around this, you must create a new IDMS that points to the full OCP release repositories: apiVersion: config.openshift.io/v1 kind: ImageDigestMirrorSet metadata: name: ocp-release spec: imageDigestMirrors: - mirrors: - registry.vshiray.net/quay.io/openshift-release-dev/ocp-v4.0-art-dev source: quay.io/openshift-release-dev/ocp-v4.0-art-dev - mirrors: - registry.vshiray.net/quay.io/openshift-release-dev/ocp-release source: quay.io/openshift-release-dev/ocp-release
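For reference, the underlying requirement is that override resolution must accept a bare registry host as a source, not only a full repository path. A minimal sketch of prefix-based mirror rewriting along those lines, illustrative only and not the actual control-plane-operator code:
~~~
package sketch

import "strings"

// applyMirror rewrites an image reference using source -> mirror overrides.
// A source entry may be a bare registry host ("registry.redhat.io") or a
// deeper repository prefix; the longest matching prefix wins.
func applyMirror(image string, overrides map[string]string) string {
	best := ""
	for source := range overrides {
		if (image == source || strings.HasPrefix(image, source+"/")) && len(source) > len(best) {
			best = source
		}
	}
	if best == "" {
		return image
	}
	return overrides[best] + strings.TrimPrefix(image, best)
}
~~~
With the first IDMS above, applyMirror("registry.redhat.io/redhat/redhat-operator-index:v4.15", overrides) would yield registry.vshiray.net/redhat.io/redhat/redhat-operator-index:v4.15 even though the source only names the registry root.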
Description of problem:
azure-disk-csi-driver doesn't use registryOverrides
Version-Release number of selected component (if applicable):
4.17
How reproducible:
100%
Steps to Reproduce:
1. Set a registry override on the CPO 2. Watch that azure-disk-csi-driver continues to use the default registry 3.
Actual results:
azure-disk-csi-driver uses default registry
Expected results:
azure-disk-csi-driver uses the mirrored registry
Additional info:
This is a clone of issue OCPBUGS-38936. The following is the description of the original issue:
—
Description of problem:
NodePool Controller doesn't respect LatestSupportedVersion https://github.com/openshift/hypershift/blob/main/support/supportedversion/version.go#L19
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. Create HostedCluster / NodePool 2. Upgrade both HostedCluster and NodePool at the same time to a version higher than the LatestSupportedVersion
Actual results:
NodePool tries to upgrade to the new version while the HostedCluster ValidReleaseImage condition fails with: 'the latest version supported is: "x.y.z". Attempting to use: "x.y.z"'
Expected results:
NodePool ValidReleaseImage condition also fails
Additional info:
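A minimal sketch of the gate the NodePool controller could apply, mirroring the HostedCluster behavior described above; the function name is an assumption and the error message simply echoes the condition text from the report:
~~~
package sketch

import (
	"fmt"

	"github.com/blang/semver/v4"
)

// checkReleaseSupported returns an error suitable for a ValidReleaseImage
// condition message when the requested version is newer than the latest
// version the operator supports.
func checkReleaseSupported(requested, latestSupported string) error {
	req, err := semver.Parse(requested)
	if err != nil {
		return fmt.Errorf("invalid requested version %q: %w", requested, err)
	}
	limit, err := semver.Parse(latestSupported)
	if err != nil {
		return fmt.Errorf("invalid supported version %q: %w", latestSupported, err)
	}
	if req.GT(limit) {
		return fmt.Errorf("the latest version supported is: %q. Attempting to use: %q",
			latestSupported, requested)
	}
	return nil
}
~~~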
Description of problem:
release-4.17 of openshift/cluster-api-provider-openstack is missing some commits that were backported in the upstream project into the release-0.10 branch. We should import them into our downstream fork.
Please review the following PR: https://github.com/openshift/images/pull/181
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
All of the OCP release process relies on the version we expose as a label (io.openshift.build.versions) in the hyperkube image, see https://github.com/openshift/kubernetes/blob/master/openshift-hack/images/hyperkube/Dockerfile.rhel.
Unfortunately, our CI is not picking up that version, but rather tries to guess a version based on the available tags. We should ensure that all the build processes read that label, rather than requiring a manual tag push when doing a k8s bump.
Description of problem:
It is currently possible to watch a singular namespaced resource without providing a namespace. This is inconsistent with one-off requests for these resources and could also return unexpected results, since namespaced resource names do not need to be unique at the cluster scope.
Version-Release number of selected component (if applicable):
4.15
How reproducible:
Always
Steps to Reproduce:
1. Visit the details page of a namespaced resource 2. Replace the 'ns/<namespace>' segment of the URL with 'cluster'
Actual results:
Details for the resource are rendered momentarily, then a 404 after a few seconds.
Expected results:
We should show a 404 error when the page loads.
Additional info:
There is also probably a case where we could visit a resource details page of a namespaced resource that has an identically named resource in another namespace, then change the URL to a cluster-scoped path, and we'll see the details for the other resource.
See watchK8sObject for the root cause. We should probably only start the websocket if we have a successful initial poll request. We also should probably terminate the websocket if the poll request fails at any point.
Description of problem:
When mirroring content with oc-mirror v2, some required images for OpenShift installation are missing from the registry
Version-Release number of selected component (if applicable):
OpenShift installer version: v4.15.17 [admin@registry ~]$ oc-mirror version WARNING: This version information is deprecated and will be replaced with the output from --short. Use --output=yaml|json to get the full version. Client Version: version.Info{Major:"", Minor:"", GitVersion:"4.16.0-202406131906.p0.g7c0889f.assembly.stream.el9-7c0889f", GitCommit:"7c0889f4bd343ccaaba5f33b7b861db29b1e5e49", GitTreeState:"clean", BuildDate:"2024-06-13T22:07:44Z", GoVersion:"go1.21.9 (Red Hat 1.21.9-1.el9_4) X:strictfipsruntime", Compiler:"gc", Platform:"linux/amd64"}
How reproducible:
Use oc-mirror v2 to mirror content. $ cat imageset-config-ocmirrorv2-v4.15.yaml kind: ImageSetConfiguration apiVersion: mirror.openshift.io/v2alpha1 mirror: platform: channels: - name: stable-4.15 minVersion: 4.15.17 type: ocp operators: - catalog: registry.redhat.io/redhat/redhat-operator-index:v4.15 full: false packages: - name: ansible-automation-platform-operator - name: cluster-logging - name: datagrid - name: devworkspace-operator - name: multicluster-engine - name: multicluster-global-hub-operator-rh - name: odf-operator - name: quay-operator - name: rhbk-operator - name: skupper-operator - name: servicemeshoperator - name: submariner - name: lvms-operator - name: odf-lvm-operator - catalog: registry.redhat.io/redhat/certified-operator-index:v4.15 full: false packages: - name: crunchy-postgres-operator - name: nginx-ingress-operator - catalog: registry.redhat.io/redhat/community-operator-index:v4.15 full: false packages: - name: argocd-operator - name: cockroachdb - name: infinispan - name: keycloak-operator - name: mariadb-operator - name: nfs-provisioner-operator - name: postgresql - name: skupper-operator additionalImages: - name: registry.redhat.io/ubi8/ubi:latest - name: registry.access.redhat.com/ubi8/nodejs-18 - name: registry.redhat.io/openshift4/ose-prometheus:v4.14.0 - name: registry.redhat.io/service-interconnect/skupper-router-rhel9:2.4.3 - name: registry.redhat.io/service-interconnect/skupper-config-sync-rhel9:1.4.4 - name: registry.redhat.io/service-interconnect/skupper-service-controller-rhel9:1.4.4 - name: registry.redhat.io/service-interconnect/skupper-flow-collector-rhel9:1.4.4 helm: {} Run oc-mirror using the command: oc-mirror --v2 \ -c imageset-config-ocmirrorv2-v4.15.yaml \ --workspace file:////data/oc-mirror/workdir/ \ docker://registry.local.momolab.io:8443/mirror
Steps to Reproduce:
1. Install Red Hat Quay mirror registry 2. Mirror using oc-mirror v2 command and steps above 3. Install OpenShift
Actual results:
Installation fails
Expected results:
Installation succeeds
Additional info:
## Check logs on coreos: [core@sno1 ~]$ journalctl -b -f -u release-image.service -u bootkube.service Jul 02 03:46:22 sno1.local.momolab.io bootkube.sh[13486]: Error: initializing source docker://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:f36e139f75b179ffe40f5a234a0cef3f0a051cc38cbde4b262fb2d96606acc06: (Mirrors also failed: [registry.local.momolab.io:8443/mirror/openshift/release@sha256:f36e139f75b179ffe40f5a234a0cef3f0a051cc38cbde4b262fb2d96606acc06: reading manifest sha256:f36e139f75b179ffe40f5a234a0cef3f0a051cc38cbde4b262fb2d96606acc06 in registry.local.momolab.io:8443/mirror/openshift/release: name unknown: repository not found]): quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:f36e139f75b179ffe40f5a234a0cef3f0a051cc38cbde4b262fb2d96606acc06: reading manifest sha256:f36e139f75b179ffe40f5a234a0cef3f0a051cc38cbde4b262fb2d96606acc06 in quay.io/openshift-release-dev/ocp-v4.0-art-dev: unauthorized: access to the requested resource is not authorized ## Check if that image was pulled: [admin@registry ~]$ cat /data/oc-mirror/workdir/working-dir/dry-run/mapping.txt | grep -i f36e139f75b179ffe40f5a234a0cef3f0a051cc38cbde4b262fb2d96606acc06 docker://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:f36e139f75b179ffe40f5a234a0cef3f0a051cc38cbde4b262fb2d96606acc06=docker://registry.local.momolab.io:8443/mirror/openshift-release-dev/ocp-v4.0-art-dev@sha256:f36e139f75b179ffe40f5a234a0cef3f0a051cc38cbde4b262fb2d96606acc06 ## Problem is, it doesn't exist on the registry (also via UI): [admin@registry ~]$ podman pull registry.local.momolab.io:8443/mirror/openshift-release-dev/ocp-v4.0-art-dev@sha256:f36e139f75b179ffe40f5a234a0cef3f0a051cc38cbde4b262fb2d96606acc06 Trying to pull registry.local.momolab.io:8443/mirror/openshift-release-dev/ocp-v4.0-art-dev@sha256:f36e139f75b179ffe40f5a234a0cef3f0a051cc38cbde4b262fb2d96606acc06... Error: initializing source docker://registry.local.momolab.io:8443/mirror/openshift-release-dev/ocp-v4.0-art-dev@sha256:f36e139f75b179ffe40f5a234a0cef3f0a051cc38cbde4b262fb2d96606acc06: reading manifest sha256:f36e139f75b179ffe40f5a234a0cef3f0a051cc38cbde4b262fb2d96606acc06 in registry.local.momolab.io:8443/mirror/openshift-release-dev/ocp-v4.0-art-dev: manifest unknown
Description of problem:
AWS CAPI installs, particularly when running under heavy load in CI, can sometimes fail with:
level=info msg=Creating private Hosted Zone level=error msg=failed to fetch Cluster: failed to generate asset "Cluster": failed to create cluster: failed provisioning resources after infrastructure ready: failed to create private hosted zone: error creating private hosted zone: HostedZoneAlreadyExists: A hosted zone has already been created with the specified caller reference. level=error msg= status code: 409, request id: f173760d-ab43-41b8-a8a0-568cf387bf5e
Version-Release number of selected component (if applicable):
How reproducible:
not reproducible - needs to be discovered in ci
Steps to Reproduce:
1. 2. 3.
Actual results:
install fails due to existing hosted zone
Expected results:
HostedZoneAlreadyExists error should not cause install to fail
Additional info:
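One way to make the private hosted zone creation resilient is to treat HostedZoneAlreadyExists as "created on a previous attempt" and look the zone up instead of failing. The sketch below is illustrative only; the installer's actual CAPI/CAPA flow and error handling may differ, and the VPC association needed for a private zone is omitted for brevity.
~~~
package sketch

import (
	"strings"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/service/route53"
	"github.com/aws/aws-sdk-go/service/route53/route53iface"
)

// ensurePrivateZone creates the hosted zone, but if a zone was already
// created with the same caller reference it reuses the existing zone
// instead of failing the install.
func ensurePrivateZone(client route53iface.Route53API, name, callerRef string) (string, error) {
	out, err := client.CreateHostedZone(&route53.CreateHostedZoneInput{
		Name:            aws.String(name),
		CallerReference: aws.String(callerRef),
		// VPC association for the private zone omitted for brevity.
	})
	if err == nil {
		return aws.StringValue(out.HostedZone.Id), nil
	}
	if !strings.Contains(err.Error(), "HostedZoneAlreadyExists") {
		return "", err
	}
	// The zone exists from a previous attempt: find it by name and reuse it.
	list, lerr := client.ListHostedZonesByName(&route53.ListHostedZonesByNameInput{
		DNSName: aws.String(name),
	})
	if lerr != nil {
		return "", lerr
	}
	for _, z := range list.HostedZones {
		if aws.StringValue(z.Name) == name || aws.StringValue(z.Name) == name+"." {
			return aws.StringValue(z.Id), nil
		}
	}
	return "", err
}
~~~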
Description of problem:
This is a follow-up of https://issues.redhat.com/browse/OCPBUGS-34996, in which comments led us to better understand the issue customers are facing. LDAP IDP traffic from the oauth pod seems to be going through the configured HTTP(S) proxy, while it should not, since LDAP is a different protocol. This results in customers adding the LDAP endpoint to their no-proxy config to circumvent the issue.
Version-Release number of selected component (if applicable):
4.15.11
How reproducible:
Steps to Reproduce:
(From the customer) 1. Configure LDAP IDP 2. Configure Proxy 3. LDAP IDP communication from the control plane oauth pod goes through proxy instead of going to the ldap endpoint directly
Actual results:
LDAP IDP communication from the control plane oauth pod goes through proxy
Expected results:
LDAP IDP communication from the control plane oauth pod should go to the ldap endpoint directly using the ldap protocol, it should not go through the proxy settings
Additional info:
For more information, see linked tickets.
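For reference, the reason an HTTP(S) proxy should not be involved at all: Go's proxy environment settings (HTTP_PROXY/HTTPS_PROXY) are consulted by http.Transport, while an LDAP connection is a plain TCP/TLS dial that should go directly to the endpoint. A minimal sketch of such a direct dial that never consults the proxy settings; this is illustrative, not the oauth-server code:
~~~
package sketch

import (
	"crypto/tls"
	"net"
	"time"
)

// dialLDAP opens a direct TCP (or TLS for ldaps) connection to the LDAP host.
// net.Dialer never consults HTTP_PROXY/HTTPS_PROXY, so traffic goes straight
// to the endpoint regardless of the cluster-wide proxy configuration.
func dialLDAP(host string, useTLS bool, cfg *tls.Config) (net.Conn, error) {
	d := &net.Dialer{Timeout: 30 * time.Second}
	if useTLS {
		return tls.DialWithDialer(d, "tcp", net.JoinHostPort(host, "636"), cfg)
	}
	return d.Dial("tcp", net.JoinHostPort(host, "389"))
}
~~~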
Description of problem:
HyperShift Operator pods run with a higher PriorityClass, but external-dns is set to the default class with a lower preemption priority; this caused the pod to be preempted during migration. Observed while performance testing dynamic serving spec migration on an MC. # oc get pods -n hypershift NAME READY STATUS RESTARTS AGE external-dns-7f95b5cdc-9hnjs 0/1 Pending 0 23m operator-956bdb486-djjvb 1/1 Running 0 116m operator-956bdb486-ppgzt 1/1 Running 0 115m external-dns pod.spec preemptionPolicy: PreemptLowerPriority priority: 0 priorityClassName: default operator pods.spec preemptionPolicy: PreemptLowerPriority priority: 100003000 priorityClassName: hypershift-operator
Version-Release number of selected component (if applicable):
On Management Cluster 4.14.7
How reproducible:
Always
Steps to Reproduce:
1. Setup a MC with request serving and autoscaling machinesets 2. Load up the MC to its max capacity 3. Watch external-dns pod gets preempted when resources needed by other pods
Actual results:
External-dns pod goes to pending state until new node comes up
Expected results:
Since this is also a critical pod like the hypershift operator, and it affects HC DNS configuration, it needs to be a higher-priority pod as well.
Additional info:
stage: perf3 sector
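A minimal sketch of the requested change: give the external-dns Deployment the same priority class the operator pods already use, so it is not the first pod to be preempted. The class name comes from the operator pod spec shown above; the function wiring is an assumption.
~~~
package sketch

import appsv1 "k8s.io/api/apps/v1"

// withOperatorPriority sets the hypershift-operator priority class on the
// external-dns pod template so the scheduler treats it as equally critical.
func withOperatorPriority(d *appsv1.Deployment) {
	d.Spec.Template.Spec.PriorityClassName = "hypershift-operator"
}
~~~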
Description of the problem:
Using ACM, when adding a node to a spoke cluster, it shows as stuck in Installing but appears to have installed successfully. Confirmed that workloads can be scheduled on the new nodes.
How reproducible:
Unsure atm.
Steps to reproduce:
1. Install a spoke cluster
2. Add a new node on day 2, with the operation timing out. (In this case there was an x509 cert issue).
Actual results:
Node eventually gets added to cluster and accepts workloads, but the GUI does not reflect this.
Expected results:
If the node actually succeeds in joining the cluster, update the GUI to say so.
Description of problem:
nmstate-configuration.service failed due to a reference to the wrong variable name, $hostname_file (the script sets $host_file): https://github.com/openshift/machine-config-operator/blob/5a6e8b81f13de2dbf606a497140ac6e9c2a00e6f/templates/common/baremetal/files/nmstate-configuration.yaml#L26
Version-Release number of selected component (if applicable):
4.16.0
How reproducible:
always
Steps to Reproduce:
1. install cluster via dev-script, with node-specific network configuration
Actual results:
nmstate-configuration failed: sh-5.1# journalctl -u nmstate-configuration May 07 02:19:54 worker-0 systemd[1]: Starting Applies per-node NMState network configuration... May 07 02:19:54 worker-0 nmstate-configuration.sh[1553]: + systemctl -q is-enabled mtu-migration May 07 02:19:54 worker-0 nmstate-configuration.sh[1553]: + echo 'Cleaning up left over mtu migration configuration' May 07 02:19:54 worker-0 nmstate-configuration.sh[1553]: Cleaning up left over mtu migration configuration May 07 02:19:54 worker-0 nmstate-configuration.sh[1553]: + rm -rf /etc/cno/mtu-migration May 07 02:19:54 worker-0 nmstate-configuration.sh[1553]: + '[' -e /etc/nmstate/openshift/applied ']' May 07 02:19:54 worker-0 nmstate-configuration.sh[1553]: + src_path=/etc/nmstate/openshift May 07 02:19:54 worker-0 nmstate-configuration.sh[1553]: + dst_path=/etc/nmstate May 07 02:19:54 worker-0 systemd[1]: nmstate-configuration.service: Main process exited, code=exited, status=1/FAILURE May 07 02:19:54 worker-0 nmstate-configuration.sh[1565]: ++ hostname -s May 07 02:19:54 worker-0 systemd[1]: nmstate-configuration.service: Failed with result 'exit-code'. May 07 02:19:54 worker-0 nmstate-configuration.sh[1553]: + hostname=worker-0 May 07 02:19:54 worker-0 nmstate-configuration.sh[1553]: + host_file=worker-0.yml May 07 02:19:54 worker-0 nmstate-configuration.sh[1553]: + cluster_file=cluster.yml May 07 02:19:54 worker-0 nmstate-configuration.sh[1553]: + config_file= May 07 02:19:54 worker-0 nmstate-configuration.sh[1553]: + '[' -s /etc/nmstate/openshift/worker-0.yml ']' May 07 02:19:54 worker-0 nmstate-configuration.sh[1553]: /usr/local/bin/nmstate-configuration.sh: line 22: hostname_file: unbound variable May 07 02:19:54 worker-0 systemd[1]: Failed to start Applies per-node NMState network configuration.
Expected results:
The cluster can be set up successfully with node-specific network configuration via the new mechanism
Additional info:
Starting with payload 4.17.0-0.nightly-2024-06-27-123139 we are seeing hypershift-release-4.17-periodics-e2e-aws-ovn-conformance failures due to
: [sig-instrumentation] Prometheus [apigroup:image.openshift.io] when installed on the cluster shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured [Early][apigroup:config.openshift.io] [Skipped:Disconnected] [Suite:openshift/conformance/parallel] { "metric": { "__name__": "ALERTS", "alertname": "PrometheusKubernetesListWatchFailures", "alertstate": "firing", "container": "kube-rbac-proxy", "endpoint": "metrics", "instance": "10.132.0.19:9092", "job": "prometheus-k8s", "namespace": "openshift-monitoring", "pod": "prometheus-k8s-0", "prometheus": "openshift-monitoring/k8s", "service": "prometheus-k8s", "severity": "warning" },
It looks like this was introduced with cluster-monitoring-operator/pull/2392
This is a clone of issue OCPBUGS-38037. The following is the description of the original issue:
—
Description of problem:
When running oc-mirror in mirror-to-disk mode in an air-gapped environment with `graph: true`, and with the UPDATE_URL_OVERRIDE environment variable defined, oc-mirror still reaches out to api.openshift.com to get the graph.tar.gz. This causes the mirroring to fail, as this URL is not reachable from an air-gapped environment.
Version-Release number of selected component (if applicable):
WARNING: This version information is deprecated and will be replaced with the output from --short. Use --output=yaml|json to get the full version. Client Version: version.Info{Major:"", Minor:"", GitVersion:"4.16.0-202407260908.p0.gdfed9f1.assembly.stream.el9-dfed9f1", GitCommit:"dfed9f10cd9aabfe3fe8dae0e6a8afe237c901ba", GitTreeState:"clean", BuildDate:"2024-07-26T09:52:14Z", GoVersion:"go1.21.11 (Red Hat 1.21.11-1.el9_4) X:strictfipsruntime", Compiler:"gc", Platform:"linux/amd64"}
How reproducible:
Always
Steps to Reproduce:
1. Setup OSUS in a reacheable network 2. Cut all internet connection except for the mirror registry and OSUS service 3. Run oc-mirror in mirror to disk mode with graph:true in the imagesetconfig
Actual results:
Expected results:
Should not fail
Additional info:
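A minimal sketch of the expected behavior: derive the graph-data URL from UPDATE_URL_OVERRIDE when it is set and only fall back to api.openshift.com otherwise. The path layout of the local OSUS service assumed here is illustrative; oc-mirror's actual endpoint handling may differ.
~~~
package sketch

import (
	"net/url"
	"os"
)

const defaultGraphURL = "https://api.openshift.com/api/upgrades_info/graph-data"

// graphDataURL returns the URL the graph archive should be downloaded from.
// When UPDATE_URL_OVERRIDE points at a local OSUS instance, the archive must
// be fetched from that host instead of api.openshift.com.
func graphDataURL() (string, error) {
	override := os.Getenv("UPDATE_URL_OVERRIDE")
	if override == "" {
		return defaultGraphURL, nil
	}
	u, err := url.Parse(override)
	if err != nil {
		return "", err
	}
	// Assumption: the local OSUS serves the graph archive on the same host
	// as the update graph endpoint.
	u.Path = "/api/upgrades_info/graph-data"
	return u.String(), nil
}
~~~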
Description of problem:
ci/prow/security is failing: k8s.io/client-go/transport
Version-Release number of selected component (if applicable):
4.16
How reproducible:
always
Steps to Reproduce:
1. trigger ci/prow/security on a pull request 2. 3.
Actual results:
Expected results:
Additional info:
Please review the following PR: https://github.com/openshift/cluster-csi-snapshot-controller-operator/pull/209
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
This is a clone of issue OCPBUGS-38174. The following is the description of the original issue:
—
Description of problem:
The prometheus operator fails to reconcile when proxy settings like no_proxy are set in the Alertmanager configuration secret.
Version-Release number of selected component (if applicable):
4.15.z and later
How reproducible:
Always when AlertmanagerConfig is enabled
Steps to Reproduce:
1. Enable UWM with AlertmanagerConfig enableUserWorkload: true alertmanagerMain: enableUserAlertmanagerConfig: true 2. Edit the "alertmanager.yaml" key in the alertmanager-main secret (see attached configuration file) 3. Wait for a couple of minutes.
Actual results:
Monitoring ClusterOperator goes Degraded=True.
Expected results:
No error
Additional info:
The Prometheus operator logs show that it doesn't understand the proxy_from_environment field.
This is a clone of issue OCPBUGS-42237. The following is the description of the original issue:
—
Description of problem:
The samples operator sync for OCP 4.18 includes an update to the ruby imagestream. This removes EOLed versions of Ruby and upgrades the images to be ubi9-based.
Version-Release number of selected component (if applicable):
4.18
How reproducible:
Always
Steps to Reproduce:
1. Run build suite tests 2. 3.
Actual results:
Tests fail trying to pull image. Example: Error pulling image "image-registry.openshift-image-registry.svc:5000/openshift/ruby:3.0-ubi8": initializing source docker://image-registry.openshift-image-registry.svc:5000/openshift/ruby:3.0-ubi8: reading manifest 3.0-ubi8 in image-registry.openshift-image-registry.svc:5000/openshift/ruby: manifest unknown
Expected results:
Builds can pull image, and the tests succeed.
Additional info:
As part of the continued deprecation of the Samples Operator, these tests should create their own Ruby imagestream that is kept current.
Please review the following PR: https://github.com/openshift/kube-state-metrics/pull/112
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
When investigating https://issues.redhat.com/browse/OCPBUGS-34819 we encountered an issue with the LB creation, but also noticed that masters are using an S3 stub ignition even though they don't have to. Although that can be harmless, it adds an extra hop that we don't need.
Version-Release number of selected component (if applicable):
4.16+
How reproducible:
always
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Change the AWSMachineTemplate ignition.storageType to UnencryptedUserData
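A minimal sketch of that change applied through the unstructured API, so it does not pin a specific CAPA types version; the field path (spec.template.spec.ignition.storageType) and value follow the note above.
~~~
package sketch

import (
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
)

// useUnencryptedUserData switches an AWSMachineTemplate from the S3 stub
// ignition flow to embedding the ignition config directly in user data.
func useUnencryptedUserData(tmpl *unstructured.Unstructured) error {
	return unstructured.SetNestedField(tmpl.Object,
		"UnencryptedUserData",
		"spec", "template", "spec", "ignition", "storageType")
}
~~~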
Include English translation text for supported languages in the User Preference dropdown
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Refactor the name to Dockerfile.ocp as a better, version-independent alternative
This is a clone of issue OCPBUGS-38326. The following is the description of the original issue:
—
Description of problem:
Failed to create NetworkAttachmentDefinition for namespace scoped CRD in layer3
Version-Release number of selected component (if applicable):
4.17
How reproducible:
always
Steps to Reproduce:
1. apply CRD yaml file 2. check the NetworkAttachmentDefinition status
Actual results:
status with error
Expected results:
NetworkAttachmentDefinition has been created
We removed this in 4.18, but we also should remove this in 4.17 since the saas template was not used then either.
Not removing this in 4.17 also causes issues backporting ARO HCP API changes; we need to backport changes related to that work to 4.17.
Example - https://github.com/openshift/hypershift/pull/4640#issuecomment-2320233415
This payload run detected a panic in CVO code. The following payloads did not see the same panic. The bug should be prioritized by the CVO team accordingly.
Relevant Job run: https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.16-e2e-gcp-ovn-upgrade/1782008003688402944
Panic trace as showed in this log:
I0421 13:06:29.113325 1 availableupdates.go:61] First attempt to retrieve available updates I0421 13:06:29.119731 1 cvo.go:721] Finished syncing available updates "openshift-cluster-version/version" (6.46969ms) I0421 13:06:29.120687 1 sync_worker.go:229] Notify the sync worker: Cluster operator etcd changed Degraded from "False" to "True" I0421 13:06:29.120697 1 sync_worker.go:579] Cluster operator etcd changed Degraded from "False" to "True" E0421 13:06:29.121014 1 runtime.go:79] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference) goroutine 185 [running]: k8s.io/apimachinery/pkg/util/runtime.logPanic({0x1bbc580?, 0x30cdc90}) /go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:75 +0x85 k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0x1e3efe0?}) /go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:49 +0x6b panic({0x1bbc580?, 0x30cdc90?}) /usr/lib/golang/src/runtime/panic.go:914 +0x21f github.com/openshift/cluster-version-operator/pkg/cvo.(*SyncWork).calculateNextFrom(0xc002944000, 0x0) /go/src/github.com/openshift/cluster-version-operator/pkg/cvo/sync_worker.go:725 +0x58 github.com/openshift/cluster-version-operator/pkg/cvo.(*SyncWorker).Start.func1() /go/src/github.com/openshift/cluster-version-operator/pkg/cvo/sync_worker.go:584 +0x2f2 k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0x30?) /go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:226 +0x33 k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc000101800?, {0x2194c80, 0xc0026245d0}, 0x1, 0xc000118120) /go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:227 +0xaf k8s.io/apimachinery/pkg/util/wait.JitterUntil(0x0?, 0x989680, 0x0, 0x0?, 0x0?) /go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:204 +0x7f k8s.io/apimachinery/pkg/util/wait.Until(...) 
/go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:161 github.com/openshift/cluster-version-operator/pkg/cvo.(*SyncWorker).Start(0xc002398c80, {0x21b41b8, 0xc0004be230}, 0x10) /go/src/github.com/openshift/cluster-version-operator/pkg/cvo/sync_worker.go:564 +0x135 github.com/openshift/cluster-version-operator/pkg/cvo.(*Operator).Run.func2() /go/src/github.com/openshift/cluster-version-operator/pkg/cvo/cvo.go:431 +0x5d created by github.com/openshift/cluster-version-operator/pkg/cvo.(*Operator).Run in goroutine 118 /go/src/github.com/openshift/cluster-version-operator/pkg/cvo/cvo.go:429 +0x49d E0421 13:06:29.121188 1 runtime.go:79] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference) goroutine 185 [running]: k8s.io/apimachinery/pkg/util/runtime.logPanic({0x1bbc580?, 0x30cdc90}) /go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:75 +0x85 k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0xc000000002?}) /go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:49 +0x6b panic({0x1bbc580?, 0x30cdc90?}) /usr/lib/golang/src/runtime/panic.go:914 +0x21f k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0x1e3efe0?}) /go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:56 +0xcd panic({0x1bbc580?, 0x30cdc90?}) /usr/lib/golang/src/runtime/panic.go:914 +0x21f github.com/openshift/cluster-version-operator/pkg/cvo.(*SyncWork).calculateNextFrom(0xc002944000, 0x0) /go/src/github.com/openshift/cluster-version-operator/pkg/cvo/sync_worker.go:725 +0x58 github.com/openshift/cluster-version-operator/pkg/cvo.(*SyncWorker).Start.func1() /go/src/github.com/openshift/cluster-version-operator/pkg/cvo/sync_worker.go:584 +0x2f2 k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0x30?) /go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:226 +0x33 k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc000101800?, {0x2194c80, 0xc0026245d0}, 0x1, 0xc000118120) /go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:227 +0xaf k8s.io/apimachinery/pkg/util/wait.JitterUntil(0x0?, 0x989680, 0x0, 0x0?, 0x0?) /go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:204 +0x7f k8s.io/apimachinery/pkg/util/wait.Until(...) 
/go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:161 github.com/openshift/cluster-version-operator/pkg/cvo.(*SyncWorker).Start(0xc002398c80, {0x21b41b8, 0xc0004be230}, 0x10) /go/src/github.com/openshift/cluster-version-operator/pkg/cvo/sync_worker.go:564 +0x135 github.com/openshift/cluster-version-operator/pkg/cvo.(*Operator).Run.func2() /go/src/github.com/openshift/cluster-version-operator/pkg/cvo/cvo.go:431 +0x5d created by github.com/openshift/cluster-version-operator/pkg/cvo.(*Operator).Run in goroutine 118 /go/src/github.com/openshift/cluster-version-operator/pkg/cvo/cvo.go:429 +0x49d I0421 13:06:29.120720 1 cvo.go:738] Started syncing upgradeable "openshift-cluster-version/version" I0421 13:06:29.123165 1 upgradeable.go:69] Upgradeability last checked 5.274200045s ago, will not re-check until 2024-04-21T13:08:23Z I0421 13:06:29.123195 1 cvo.go:740] Finished syncing upgradeable "openshift-cluster-version/version" (2.469943ms) panic: runtime error: invalid memory address or nil pointer dereference [recovered] panic: runtime error: invalid memory address or nil pointer dereference [recovered] panic: runtime error: invalid memory address or nil pointer dereference [signal SIGSEGV: segmentation violation code=0x1 addr=0x58 pc=0x195c018] goroutine 185 [running]: k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0xc000000002?}) /go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:56 +0xcd panic({0x1bbc580?, 0x30cdc90?}) /usr/lib/golang/src/runtime/panic.go:914 +0x21f k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0x1e3efe0?}) /go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:56 +0xcd panic({0x1bbc580?, 0x30cdc90?}) /usr/lib/golang/src/runtime/panic.go:914 +0x21f github.com/openshift/cluster-version-operator/pkg/cvo.(*SyncWork).calculateNextFrom(0xc002944000, 0x0) /go/src/github.com/openshift/cluster-version-operator/pkg/cvo/sync_worker.go:725 +0x58 github.com/openshift/cluster-version-operator/pkg/cvo.(*SyncWorker).Start.func1() /go/src/github.com/openshift/cluster-version-operator/pkg/cvo/sync_worker.go:584 +0x2f2 k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0x30?) /go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:226 +0x33 k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc000101800?, {0x2194c80, 0xc0026245d0}, 0x1, 0xc000118120) /go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:227 +0xaf k8s.io/apimachinery/pkg/util/wait.JitterUntil(0x0?, 0x989680, 0x0, 0x0?, 0x0?) /go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:204 +0x7f k8s.io/apimachinery/pkg/util/wait.Until(...) /go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:161 github.com/openshift/cluster-version-operator/pkg/cvo.(*SyncWorker).Start(0xc002398c80, {0x21b41b8, 0xc0004be230}, 0x10) /go/src/github.com/openshift/cluster-version-operator/pkg/cvo/sync_worker.go:564 +0x135 github.com/openshift/cluster-version-operator/pkg/cvo.(*Operator).Run.func2() /go/src/github.com/openshift/cluster-version-operator/pkg/cvo/cvo.go:431 +0x5d created by github.com/openshift/cluster-version-operator/pkg/cvo.(*Operator).Run in goroutine 118 /go/src/github.com/openshift/cluster-version-operator/pkg/cvo/cvo.go:429 +0x49d
CI is occasionally bumping into failures like:
: [sig-arch][Feature:ClusterUpgrade] Cluster should remain functional during upgrade [Disruptive] [Serial] expand_less 53m22s { fail [github.com/openshift/origin/test/e2e/upgrade/upgrade.go:186]: during upgrade to registry.build05.ci.openshift.org/ci-op-kj8vc4dt/release@sha256:74bc38fc3a1d5b5ac8e84566d54d827c8aa88019dbdbf3b02bef77715b93c210: the "master" pool should be updated before the CVO reports available at the new version Ginkgo exit error 1: exit with code 1}
where the machine-config operator is rolling the control-plane MachineConfigPool after the ClusterVersion update completes:
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.16-e2e-azure-ovn-upgrade/1791442450049404928/artifacts/e2e-azure-ovn-upgrade/gather-extra/artifacts/machineconfigpools.json | jq -r '.items[] | select(.metadata.name == "master").status | [.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + .reason + ": " + .message] | sort[]' 2024-05-17T12:57:04Z RenderDegraded=False : 2024-05-17T12:58:35Z Degraded=False : 2024-05-17T12:58:35Z NodeDegraded=False : 2024-05-17T15:13:22Z Updated=True : All nodes are updated with MachineConfig rendered-master-4fcadad80c9941813b00ca7e3eef8e69 2024-05-17T15:13:22Z Updating=False : $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.16-e2e-azure-ovn-upgrade/1791442450049404928/artifacts/e2e-azure-ovn-upgrade/gather-extra/artifacts/clusterversion.json | jq -r '.items[].status.history[0].completionTime' 2024-05-17T14:15:22Z
Because of changes to registry pull secrets:
$ dump() { > curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.16-e2e-azure-ovn-upgrade/1791442450049404928/artifacts/e2e-azure-ovn-upgrade /gather-extra/artifacts/machineconfigs.json | jq -r ".items[] | select(.metadata.name == \"$1\").spec.config.storage.files[] | select(.path == \"/etc/mco/internal-registry-pull-secret.json\").contents.source" | python3 -c 'import urllib.parse, sys; print(urllib.parse.unquote(sys.stdin.read()).split(",", 1)[-1])' | jq -c '.auths | to_entries[]' > } $ diff -u0 <(dump rendered-master-d6a8cd53ae132250832cc8267e070af6) <(dump rendered-master-4fcadad80c9941813b00ca7e3eef8e69) | sed 's/"value":.*/.../' --- /dev/fd/63 2024-05-17 12:28:37.882351026 -0700 +++ /dev/fd/62 2024-05-17 12:28:37.883351026 -0700 @@ -1 +1 @@ -{"key":"172.30.124.169:5000",... +{"key":"172.30.124.169:5000",... @@ -3,3 +3,3 @@ -{"key":"default-route-openshift-image-registry.apps.ci-op-kj8vc4dt-6c39f.ci2.azure.devcluster.openshift.com",... -{"key":"image-registry.openshift-image-registry.svc.cluster.local:5000",... -{"key":"image-registry.openshift-image-registry.svc:5000",... +{"key":"default-route-openshift-image-registry.apps.ci-op-kj8vc4dt-6c39f.ci2.azure.devcluster.openshift.com",... +{"key":"image-registry.openshift-image-registry.svc.cluster.local:5000",... +{"key":"image-registry.openshift-image-registry.svc:5000",...
Seen in 4.16-to-4.16 Azure update CI. Unclear what the wider scope is.
Sippy reports a success rate of 94.27% after the regression, so this is a rare race.
But using CI search to pick jobs with 10 or more runs over the past 2 days:
$ w3m -dump -cols 200 'https://search.dptools.openshift.org/?maxAge=48h&type=junit&search=master.*pool+should+be+updated+before+the+CVO+reports+available' | grep '[0-9][0-9] runs.*failures match' | sort
periodic-ci-openshift-release-master-ci-4.16-e2e-azure-ovn-upgrade (all) - 52 runs, 50% failed, 12% of failures match = 6% impact
periodic-ci-openshift-release-master-ci-4.16-e2e-gcp-ovn-upgrade (all) - 80 runs, 20% failed, 25% of failures match = 5% impact
periodic-ci-openshift-release-master-ci-4.17-e2e-gcp-ovn-upgrade (all) - 82 runs, 21% failed, 59% of failures match = 12% impact
periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-azure-ovn-upgrade (all) - 80 runs, 53% failed, 14% of failures match = 8% impact
periodic-ci-openshift-release-master-nightly-4.16-e2e-aws-sdn-upgrade (all) - 50 runs, 12% failed, 50% of failures match = 6% impact
pull-ci-openshift-cluster-monitoring-operator-master-e2e-aws-ovn-upgrade (all) - 14 runs, 21% failed, 33% of failures match = 7% impact
pull-ci-openshift-cluster-network-operator-master-e2e-azure-ovn-upgrade (all) - 11 runs, 36% failed, 75% of failures match = 27% impact
pull-ci-openshift-cluster-version-operator-master-e2e-agnostic-ovn-upgrade-out-of-change (all) - 11 runs, 18% failed, 100% of failures match = 18% impact
pull-ci-openshift-machine-config-operator-master-e2e-aws-ovn-upgrade (all) - 19 runs, 21% failed, 25% of failures match = 5% impact
pull-ci-openshift-machine-config-operator-master-e2e-azure-ovn-upgrade-out-of-change (all) - 21 runs, 48% failed, 50% of failures match = 24% impact
pull-ci-openshift-origin-master-e2e-aws-ovn-single-node-upgrade (all) - 16 runs, 81% failed, 15% of failures match = 13% impact
pull-ci-openshift-origin-master-e2e-aws-ovn-upgrade (all) - 16 runs, 25% failed, 75% of failures match = 19% impact
pull-ci-openshift-origin-master-e2e-gcp-ovn-upgrade (all) - 26 runs, 35% failed, 67% of failures match = 23% impact
shows some flavors, like pull-ci-openshift-cluster-network-operator-master-e2e-azure-ovn-upgrade, with hit rates as high as 27%.
Unclear.
Pull secret changes after the ClusterVersion update cause an unexpected master MachineConfigPool roll.
No MachineConfigPool roll after the ClusterVersion update completes.
Description of problem:
Version-Release number of selected component (if applicable):
4.16.0-0.nightly-2024-06-13-084629
How reproducible:
100%
Steps to Reproduce:
1. Apply this ConfigMap:
*****
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    prometheusK8s:
      remoteWrite:
      - url: "http://invalid-remote-storage.example.com:9090/api/v1/write"
        queue_config:
          max_retries: 1
*****
2. Check the logs:
% oc logs -c prometheus prometheus-k8s-0 -n openshift-monitoring
...
ts=2024-06-14T01:28:01.804Z caller=dedupe.go:112 component=remote level=warn remote_name=5ca657 url=http://invalid-remote-storage.example.com:9090/api/v1/write msg="Failed to send batch, retrying" err="Post \"http://invalid-remote-storage.example.com:9090/api/v1/write\": dial tcp: lookup invalid-remote-storage.example.com on 172.30.0.10:53: no such host"
3. Query after 15 minutes:
% oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://thanos-querier.openshift-monitoring.svc:9091/api/v1/query?' --data-urlencode 'query=ALERTS{alertname="PrometheusRemoteStorageFailures"}' | jq
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current  Dload  Upload   Total   Spent    Left  Speed
100   145  100    78  100    67    928    797 --:--:-- --:--:-- --:--:--  1726
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [],
    "analysis": {}
  }
}
% oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://thanos-querier.openshift-monitoring.svc:9091/api/v1/query?' --data-urlencode 'query=prometheus_remote_storage_failures_total' | jq
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current  Dload  Upload   Total   Spent    Left  Speed
100   124  100    78  100    46   1040    613 --:--:-- --:--:-- --:--:--  1653
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [],
    "analysis": {}
  }
}
Actual results:
The alert did not trigger.
Expected results:
The alert is triggered, and the alert and metrics are visible.
Additional info:
The metrics below show as `No datapoints found.`:
prometheus_remote_storage_failures_total
prometheus_remote_storage_samples_dropped_total
prometheus_remote_storage_retries_total
`prometheus_remote_storage_samples_failed_total` value is 0
Description of the problem:
BE version ~2.32 (master): block an authenticated proxy URL that is not percent-encoded, e.g. 'http://ocp-edge:red@hat@10.6.48.65:3132', which contains two '@' characters.
Currently the BE accepts such a URL, although it is not supported.
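For illustration, a minimal shell sketch of the kind of check being requested; the variable name and the heuristic of counting '@' characters in the authority section are assumptions, not the BE's actual validation logic:
url='http://ocp-edge:red@hat@10.6.48.65:3132'
# Keep only the authority part: strip the scheme and anything after the first '/'.
authority="${url#*://}"; authority="${authority%%/*}"
# More than one '@' in the authority means the credentials contain an unencoded '@'.
if [ "$(printf '%s' "$authority" | tr -cd '@' | wc -c)" -gt 1 ]; then
  echo "proxy URL credentials must be percent-encoded (e.g. red%40hat)" >&2
fi
A percent-encoded form such as 'http://ocp-edge:red%40hat@10.6.48.65:3132' would pass this check.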
How reproducible:
100%
Steps to reproduce:
1.
2.
3.
Actual results:
Expected results:
This is a clone of issue OCPBUGS-38990. The following is the description of the original issue:
—
Description of problem:
The node-joiner pod does not honour the cluster-wide proxy settings.
Version-Release number of selected component (if applicable):
OCP 4.16.6
How reproducible:
Always
Steps to Reproduce:
1. Configure an OpenShift cluster-wide proxy according to https://docs.openshift.com/container-platform/4.16/networking/enable-cluster-wide-proxy.html and add Red Hat URLs (quay.io and others) to the proxy allow list.
2. Add a node to a cluster using a node-joiner pod, following https://github.com/openshift/installer/blob/master/docs/user/agent/add-node/add-nodes.md
Actual results:
Error retrieving the images on quay.io:
time=2024-08-22T08:39:02Z level=error msg=Release Image arch could not be found: command '[oc adm release info quay.io/openshift-release-dev/ocp-release@sha256:24ea553ce2e79fab0ff9cf2917d26433cffb3da954583921926034b9d5d309bd -o=go-template={{if and .metadata.metadata (index . "metadata" "metadata" "release.openshift.io/architecture")}}{{index . "metadata" "metadata" "release.openshift.io/architecture"}}{{else}}{{.config.architecture}}{{end}} --insecure=true --registry-config=/tmp/registry-config1164077466]' exited with non-zero exit code 1:
time=2024-08-22T08:39:02Z level=error msg=error: unable to read image quay.io/openshift-release-dev/ocp-release@sha256:24ea553ce2e79fab0ff9cf2917d26433cffb3da954583921926034b9d5d309bd: Get "http://quay.io/v2/": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
Expected results:
The node-joiner is able to download the images using the proxy.
Additional info:
By allowing full direct internet access, without a proxy, the node joiner pod is able to download image from quay.io.
So there is a strong suspicion that the HTTP timeout error above comes from the pod not being able to use the proxy.
Restricted environments where external internet access is only allowed through a proxy allow list are quite common in corporate environments.
Please consider honouring the OpenShift proxy configuration.
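For reference, the cluster-wide proxy configuration mentioned above is the cluster-scoped Proxy object; a minimal example (the proxy host and the trusted CA config map name are placeholders) looks roughly like:
apiVersion: config.openshift.io/v1
kind: Proxy
metadata:
  name: cluster
spec:
  httpProxy: http://proxy.example.com:3128
  httpsProxy: http://proxy.example.com:3128
  noProxy: .cluster.local,.svc
  trustedCA:
    name: user-ca-bundle
The request here is that the node-joiner pod pick up these values (and the trusted CA) instead of attempting direct connections to quay.io.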
Description of problem:
- One node [rendezvous] failed to join the cluster and there are some pending CSRs:
- omc get csr
NAME        AGE   SIGNERNAME                                    REQUESTOR                                                                    REQUESTEDDURATION   CONDITION
csr-44qjs   21m   kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   <none>              Pending
csr-9n9hc   5m    kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   <none>              Pending
csr-9xw24   1h    kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   <none>              Pending
csr-brm6f   1h    kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   <none>              Pending
csr-dz75g   36m   kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   <none>              Pending
csr-l8c7v   1h    kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   <none>              Pending
csr-mv7w5   52m   kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   <none>              Pending
csr-v6pgd   1h    kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   <none>              Pending
In order to complete the installation, the customer needs to approve those CSRs manually, for example with the commands shown below.
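One common way to approve all currently pending CSRs in one go, using standard oc commands (shown as a general sketch, not a fix for the underlying bug):
oc get csr -o go-template='{{range .items}}{{if not .status}}{{.metadata.name}}{{"\n"}}{{end}}{{end}}' | xargs --no-run-if-empty oc adm certificate approve
Individual CSRs can also be approved with: oc adm certificate approve <csr-name>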
Steps to Reproduce:
agent-based installation.
Actual results:
CSR's are in pending state.
Expected results:
CSRs should be approved automatically.
Additional info:
Logs : https://drive.google.com/drive/folders/1UCgC6oMx28k-_WXy8w1iN_t9h9rtmnfo?usp=sharing
Description of problem:
failed job: https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_cluster-ingress-operator/1023/pull-ci-openshift-cluster-ingress-operator-master-e2e-aws-gatewayapi/1796261717831847936 seeing below error: level=error msg=failed to fetch Cluster: failed to generate asset "Cluster": failed to create cluster: error unpacking terraform: could not unpack the directory for the aws provider: open mirror/openshift/local/aws: file does not exist
Version-Release number of selected component (if applicable):
4.16/4.17
How reproducible:
100%
Steps to Reproduce:
1. Create an AWS cluster with the "CustomNoUpgrade" featureSet configured in install-config.yaml:
----------------------
featureSet: CustomNoUpgrade
featureGates: [GatewayAPIEnabled=true]
2.
Actual results:
level=error msg=failed to fetch Cluster: failed to generate asset "Cluster": failed to create cluster: error unpacking terraform: could not unpack the directory for the aws provider: open mirror/openshift/local/aws: file does not exist
Expected results:
install should be successful
Additional info:
The workaround is to add ClusterAPIInstallAWS=true to featureGates as well, e.g. featureSet: CustomNoUpgrade with featureGates: [GatewayAPIEnabled=true,ClusterAPIInstallAWS=true]; see the fragment below.
discussion thread: https://redhat-internal.slack.com/archives/C68TNFWA2/p1716887301410459
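Spelled out, the workaround corresponds to an install-config.yaml fragment along these lines:
featureSet: CustomNoUpgrade
featureGates:
- GatewayAPIEnabled=true
- ClusterAPIInstallAWS=true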
Description of problem:
Examples in docs/user/gcp/customization can't directly be used to install a cluster.
Description of problem:
The installation of compact and HA clusters is failing in the vSphere environment. During the cluster setup, two master nodes were observed to be in a "Not Ready" state, and the rendezvous host failed to join the cluster.
Version-Release number of selected component (if applicable):
4.17.0-0.nightly-2024-09-25-131159
How reproducible:
100%
Actual results:
level=info msg=Cluster operator cloud-controller-manager TrustedCABundleControllerControllerAvailable is True with AsExpected: Trusted CA Bundle Controller works as expected level=info msg=Cluster operator cloud-controller-manager TrustedCABundleControllerControllerDegraded is False with AsExpected: Trusted CA Bundle Controller works as expected level=info msg=Cluster operator cloud-controller-manager CloudConfigControllerAvailable is True with AsExpected: Cloud Config Controller works as expected level=info msg=Cluster operator cloud-controller-manager CloudConfigControllerDegraded is False with AsExpected: Cloud Config Controller works as expected level=info msg=Use the following commands to gather logs from the cluster level=info msg=openshift-install gather bootstrap --help level=error msg=Bootstrap failed to complete: : bootstrap process timed out: context deadline exceeded ERROR: Bootstrap failed. Aborting execution.
Expected results:
Installation should be successful.
Additional info:
Description of problem:
Build tests in OCP 4.14 reference Ruby images that are now EOL. The related code in our sample ruby build was deleted.
Version-Release number of selected component (if applicable):
4.14
How reproducible:
Always
Steps to Reproduce:
1. Run the build suite for OCP 4.14 against a 4.14 cluster
Actual results:
Test [sig-builds][Feature:Builds][Slow] builds with a context directory s2i context directory build should s2i build an application using a context directory [apigroup:build.openshift.io] fails 2024-05-08T11:11:57.558298778Z I0508 11:11:57.558273 1 builder.go:400] Powered by buildah v1.31.0 2024-05-08T11:11:57.581578795Z I0508 11:11:57.581509 1 builder.go:473] effective capabilities: [audit_control=true audit_read=true audit_write=true block_suspend=true bpf=true checkpoint_restore=true chown=true dac_override=true dac_read_search=true fowner=true fsetid=true ipc_lock=true ipc_owner=true kill=true lease=true linux_immutable=true mac_admin=true mac_override=true mknod=true net_admin=true net_bind_service=true net_broadcast=true net_raw=true perfmon=true setfcap=true setgid=true setpcap=true setuid=true sys_admin=true sys_boot=true sys_chroot=true sys_module=true sys_nice=true sys_pacct=true sys_ptrace=true sys_rawio=true sys_resource=true sys_time=true sys_tty_config=true syslog=true wake_alarm=true] 2024-05-08T11:11:57.583755245Z I0508 11:11:57.583715 1 builder.go:401] redacted build: {"kind":"Build","apiVersion":"build.openshift.io/v1","metadata":{"name":"s2icontext-1","namespace":"e2e-test-contextdir-wpphk","uid":"c2db2893-06e5-4274-96ae-d8cd635a1f8d","resourceVersion":"51882","generation":1,"creationTimestamp":"2024-05-08T11:11:55Z","labels":{"buildconfig":"s2icontext","openshift.io/build-config.name":"s2icontext","openshift.io/build.start-policy":"Serial"},"annotations":{"openshift.io/build-config.name":"s2icontext","openshift.io/build.number":"1"},"ownerReferences":[{"apiVersion":"build.openshift.io/v1","kind":"BuildConfig","name":"s2icontext","uid":"b7dbb52b-ae66-4465-babc-728ae3ceed9a","controller":true}],"managedFields":[{"manager":"openshift-apiserver","operation":"Update","apiVersion":"build.openshift.io/v1","time":"2024-05-08T11:11:55Z","fieldsType":"FieldsV1","fieldsV1":{"f:metadata":{"f:annotations":{".":{},"f:openshift.io/build-config.name":{},"f:openshift.io/build.number":{}},"f:labels":{".":{},"f:buildconfig":{},"f:openshift.io/build-config.name":{},"f:openshift.io/build.start-policy":{}},"f:ownerReferences":{".":{},"k:{\"uid\":\"b7dbb52b-ae66-4465-babc-728ae3ceed9a\"}":{}}},"f:spec":{"f:output":{"f:to":{}},"f:serviceAccount":{},"f:source":{"f:contextDir":{},"f:git":{".":{},"f:uri":{}},"f:type":{}},"f:strategy":{"f:sourceStrategy":{".":{},"f:env":{},"f:from":{},"f:pullSecret":{}},"f:type":{}},"f:triggeredBy":{}},"f:status":{"f:conditions":{".":{},"k:{\"type\":\"New\"}":{".":{},"f:lastTransitionTime":{},"f:lastUpdateTime":{},"f:status":{},"f:type":{}}},"f:config":{},"f:phase":{}}}}]},"spec":{"serviceAccount":"builder","source":{"type":"Git","git":{"uri":"https://github.com/sclorg/s2i-ruby-container"},"contextDir":"2.7/test/puma-test-app"},"strategy":{"type":"Source","sourceStrategy":{"from":{"kind":"DockerImage","name":"image-registry.openshift-image-registry.svc:5000/openshift/ruby:2.7-ubi8"},"pullSecret":{"name":"builder-dockercfg-v9xk2"},"env":[{"name":"BUILD_LOGLEVEL","value":"5"}]}},"output":{"to":{"kind":"DockerImage","name":"image-registry.openshift-image-registry.svc:5000/e2e-test-contextdir-wpphk/test:latest"},"pushSecret":{"name":"builder-dockercfg-v9xk2"}},"resources":{},"postCommit":{},"nodeSelector":null,"triggeredBy":[{"message":"Manually 
triggered"}]},"status":{"phase":"New","outputDockerImageReference":"image-registry.openshift-image-registry.svc:5000/e2e-test-contextdir-wpphk/test:latest","config":{"kind":"BuildConfig","namespace":"e2e-test-contextdir-wpphk","name":"s2icontext"},"output":{},"conditions":[{"type":"New","status":"True","lastUpdateTime":"2024-05-08T11:11:55Z","lastTransitionTime":"2024-05-08T11:11:55Z"}]}} 2024-05-08T11:11:57.584949442Z Cloning "https://github.com/sclorg/s2i-ruby-container" ... 2024-05-08T11:11:57.585044449Z I0508 11:11:57.585030 1 source.go:237] git ls-remote --heads https://github.com/sclorg/s2i-ruby-container 2024-05-08T11:11:57.585081852Z I0508 11:11:57.585072 1 repository.go:450] Executing git ls-remote --heads https://github.com/sclorg/s2i-ruby-container 2024-05-08T11:11:57.840621917Z I0508 11:11:57.840572 1 source.go:237] 663daf43b2abb5662504638d017c7175a6cff59d refs/heads/3.2-experimental 2024-05-08T11:11:57.840621917Z 88b4e684576b3fe0e06c82bd43265e41a8129c5d refs/heads/add_test_latest_imagestreams 2024-05-08T11:11:57.840621917Z 12a863ab4b050a1365d6d59970dddc6743e8bc8c refs/heads/master 2024-05-08T11:11:57.840730405Z I0508 11:11:57.840714 1 source.go:69] Cloning source from https://github.com/sclorg/s2i-ruby-container 2024-05-08T11:11:57.840793509Z I0508 11:11:57.840781 1 repository.go:450] Executing git clone --recursive --depth=1 https://github.com/sclorg/s2i-ruby-container /tmp/build/inputs 2024-05-08T11:11:59.073229755Z I0508 11:11:59.073183 1 repository.go:450] Executing git rev-parse --abbrev-ref HEAD 2024-05-08T11:11:59.080132731Z I0508 11:11:59.080079 1 repository.go:450] Executing git rev-parse --verify HEAD 2024-05-08T11:11:59.083626287Z I0508 11:11:59.083586 1 repository.go:450] Executing git --no-pager show -s --format=%an HEAD 2024-05-08T11:11:59.115407368Z I0508 11:11:59.115361 1 repository.go:450] Executing git --no-pager show -s --format=%ae HEAD 2024-05-08T11:11:59.195276873Z I0508 11:11:59.195231 1 repository.go:450] Executing git --no-pager show -s --format=%cn HEAD 2024-05-08T11:11:59.198916080Z I0508 11:11:59.198879 1 repository.go:450] Executing git --no-pager show -s --format=%ce HEAD 2024-05-08T11:11:59.204712375Z I0508 11:11:59.204663 1 repository.go:450] Executing git --no-pager show -s --format=%ad HEAD 2024-05-08T11:11:59.211098793Z I0508 11:11:59.211051 1 repository.go:450] Executing git --no-pager show -s --format=%<(80,trunc)%s HEAD 2024-05-08T11:11:59.216192627Z I0508 11:11:59.216149 1 repository.go:450] Executing git config --get remote.origin.url 2024-05-08T11:11:59.218615714Z Commit: 12a863ab4b050a1365d6d59970dddc6743e8bc8c (Bump common from `1f774c8` to `a957816` (#537)) 2024-05-08T11:11:59.218661988Z Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> 2024-05-08T11:11:59.218683019Z Date: Tue Apr 9 15:24:11 2024 +0200 2024-05-08T11:11:59.218722882Z I0508 11:11:59.218711 1 repository.go:450] Executing git rev-parse --abbrev-ref HEAD 2024-05-08T11:11:59.234411732Z I0508 11:11:59.234366 1 repository.go:450] Executing git rev-parse --verify HEAD 2024-05-08T11:11:59.237729596Z I0508 11:11:59.237698 1 repository.go:450] Executing git --no-pager show -s --format=%an HEAD 2024-05-08T11:11:59.255304604Z I0508 11:11:59.255269 1 repository.go:450] Executing git --no-pager show -s --format=%ae HEAD 2024-05-08T11:11:59.261113560Z I0508 11:11:59.261074 1 repository.go:450] Executing git --no-pager show -s --format=%cn HEAD 2024-05-08T11:11:59.270006232Z I0508 11:11:59.269961 1 repository.go:450] Executing git --no-pager show -s 
--format=%ce HEAD 2024-05-08T11:11:59.278485984Z I0508 11:11:59.278443 1 repository.go:450] Executing git --no-pager show -s --format=%ad HEAD 2024-05-08T11:11:59.281940527Z I0508 11:11:59.281906 1 repository.go:450] Executing git --no-pager show -s --format=%<(80,trunc)%s HEAD 2024-05-08T11:11:59.299465312Z I0508 11:11:59.299423 1 repository.go:450] Executing git config --get remote.origin.url 2024-05-08T11:11:59.374652834Z error: provided context directory does not exist: 2.7/test/puma-test-app
Expected results:
Tests succeed
Additional info:
Ruby 2.7 is EOL and not searchable in the Red Hat container catalog. Failing test: https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_cluster-openshift-controller-manager-operator/344/pull-ci-openshift-cluster-openshift-controller-manager-operator-release-4.14-openshift-e2e-aws-builds-techpreview/1788152058105303040
https://github.com/openshift/console/pull/13420 updated the console to use the new OpenShift branding for the favicon, but this change was not applied to oauth-templates.
Description of problem:
The Compute nodes table does not display correct filesystem data.
Version-Release number of selected component (if applicable):
4.16.0-0.ci-2024-04-29-054754
How reproducible:
Always
Steps to Reproduce:
1. In an OpenShift cluster 4.16.0-0.ci-2024-04-29-054754
2. Go to the Compute / Nodes menu
3. Check the Filesystem column
Actual results:
There is no storage data displayed
Expected results:
The query is executed correctly and the storage data is displayed correctly
Additional info:
The query has an error, as it is not concatenating things correctly: https://github.com/openshift/console/blob/master/frontend/packages/console-app/src/components/nodes/NodesPage.tsx#L413
Description of problem:
oc-mirror should not panic when a wrong loglevel is specified.
Version-Release number of selected component (if applicable):
oc-mirror version WARNING: This version information is deprecated and will be replaced with the output from --short. Use --output=yaml|json to get the full version.Client Version: version.Info{Major:"", Minor:"", GitVersion:"4.17.0-202407291514.p0.gdbf115f.assembly.stream.el9-dbf115f", GitCommit:"dbf115f547a19f12ab72e7b326be219a47d460a0", GitTreeState:"clean", BuildDate:"2024-07-29T15:52:52Z", GoVersion:"go1.22.4 (Red Hat 1.22.4-2.el9) X:strictfipsruntime", Compiler:"gc", Platform:"linux/amd64"}
How reproducible:
100%
Steps to Reproduce:
1. Run command: `oc-mirror -c config-36410.yaml --from file://out36410 docker://quay.io/zhouying7780/36410test --v2 --loglevel -h`
Actual results:
The command panic with error: oc-mirror -c config-36410.yaml --from file://out36410 docker://quay.io/zhouying7780/36410test --v2 --loglevel -h 2024/07/31 05:22:41 [WARN] : ⚠️ --v2 flag identified, flow redirected to the oc-mirror v2 version. This is Tech Preview, it is still under development and it is not production ready. 2024/07/31 05:22:41 [INFO] : 👋 Hello, welcome to oc-mirror 2024/07/31 05:22:41 [INFO] : ⚙️ setting up the environment for you... 2024/07/31 05:22:41 [INFO] : 🔀 workflow mode: diskToMirror 2024/07/31 05:22:41 [ERROR] : parsing config error parsing local storage configuration : invalid loglevel -h Must be one of [error, warn, info, debug] panic: StorageDriver not registered: goroutine 1 [running]:github.com/distribution/distribution/v3/registry/handlers.NewApp({0x5634e98, 0x76ea4a0}, 0xc000a7c388) /go/src/github.com/openshift/oc-mirror/vendor/github.com/distribution/distribution/v3/registry/handlers/app.go:126 +0x2374github.com/distribution/distribution/v3/registry.NewRegistry({0x5634e98?, 0x76ea4a0?}, 0xc000a7c388) /go/src/github.com/openshift/oc-mirror/vendor/github.com/distribution/distribution/v3/registry/registry.go:141 +0x56github.com/openshift/oc-mirror/v2/internal/pkg/cli.(*ExecutorSchema).setupLocalStorage(0xc000a78488) /go/src/github.com/openshift/oc-mirror/vendor/github.com/openshift/oc-mirror/v2/internal/pkg/cli/executor.go:571 +0x3c6github.com/openshift/oc-mirror/v2/internal/pkg/cli.NewMirrorCmd.func1(0xc00090f208, {0xc0007ae300, 0x1, 0x8}) /go/src/github.com/openshift/oc-mirror/vendor/github.com/openshift/oc-mirror/v2/internal/pkg/cli/executor.go:201 +0x27fgithub.com/spf13/cobra.(*Command).execute(0xc00090f208, {0xc0000520a0, 0x8, 0x8}) /go/src/github.com/openshift/oc-mirror/vendor/github.com/spf13/cobra/command.go:987 +0xab1github.com/spf13/cobra.(*Command).ExecuteC(0xc00090f208) /go/src/github.com/openshift/oc-mirror/vendor/github.com/spf13/cobra/command.go:1115 +0x3ffgithub.com/spf13/cobra.(*Command).Execute(0x74bc8d8?) /go/src/github.com/openshift/oc-mirror/vendor/github.com/spf13/cobra/command.go:1039 +0x13main.main() /go/src/github.com/openshift/oc-mirror/cmd/oc-mirror/main.go:10 +0x18
Expected results:
oc-mirror should exit with an error; it should not panic.
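For comparison, the expected usage is to pass one of the accepted values listed in the error message (error, warn, info, debug), e.g.:
oc-mirror -c config-36410.yaml --from file://out36410 docker://quay.io/zhouying7780/36410test --v2 --loglevel debug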
Download and merge the French and Spanish language translations in the OCP Console.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
Version-Release number of selected component (if applicable):
Cypress test cannot be run locally. This appears to be the result of `window.SERVER_FLAGS.authDisabled` always having a value of `false` when auth is in fact disabled.
How reproducible:
Always
Steps to Reproduce:
1. Run `yarn test-cypress-console` with [auth disabled|https://github.com/openshift/console?tab=readme-ov-file#openshift-no-authentication]
2. Run any of the tests (e.g., masthead.cy.ts)
3. Note the test fails because it tries to log in even though auth is disabled. This appears to be because https://github.com/openshift/console/blob/d26868305edc663e8b251e5d73a7c62f7a01cd8c/frontend/packages/integration-tests-cypress/support/login.ts#L28 fails, since `window.SERVER_FLAGS.authDisabled` incorrectly has a value of `false`.
Description of problem:
After successfully creating a NAD of type "OVN Kubernetes secondary localnet network", when viewing the object in the GUI it will say that it is of type "OVN Kubernetes L2 overlay network". When examining the object's YAML, it is still correctly configured as a NAD of type localnet.
Version-Release number of selected component: OCP Virtualization 4.15.1
How reproducible: 100%
Steps to Reproduce:
1. Create the appropriate NNCP and apply it, for example:
apiVersion: nmstate.io/v1
kind: NodeNetworkConfigurationPolicy
metadata:
  name: nncp-br-ex-vlan-101
spec:
  nodeSelector:
    node-role.kubernetes.io/worker: ''
  desiredState:
    ovn:
      bridge-mappings:
      - localnet: vlan-101
        bridge: br-ex
        state: present
2. Create a localnet-type NAD (from the GUI or YAML), for example:
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: vlan-101
  namespace: default
spec:
  config: |2
    {
      "name":"br-ex",
      "type":"ovn-k8s-cni-overlay",
      "cniVersion":"0.4.0",
      "topology":"localnet",
      "vlanID":101,
      "netAttachDefName":"default/vlan-101"
    }
3. View it through the GUI by clicking Networking -> NetworkAttachmentDefinitions -> the NAD you just created.
4. Look under Type: it will incorrectly display as "OVN Kubernetes L2 overlay network".
Actual results: Type is displayed as "OVN Kubernetes L2 overlay network". If you examine the YAML for the NAD you will see that it is indeed still of type localnet. Please see the attached screenshots for the display of the NAD type and the actual YAML of the NAD. At this point in time it looks as though this is just a display error.
Expected results: Type should be displayed as "OVN Kubernetes secondary localnet network".
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
This is a clone of issue OCPBUGS-33308. The following is the description of the original issue:
—
Description of problem:
When creating an OCP cluster on AWS and selecting "publish: Internal," the ingress operator may create external LB mappings to external subnets. This can occur if public subnets were specified in the install-config during installation. https://docs.openshift.com/container-platform/4.15/installing/installing_aws/installing-aws-private.html#private-clusters-about-aws_installing-aws-private A configuration validation should be added to the installer.
Version-Release number of selected component (if applicable):
4.14+ probably older versions as well.
How reproducible:
always
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Slack thread: https://redhat-internal.slack.com/archives/C68TNFWA2/p1714986876688959
Description of problem:
The TestControllerConfigStuff e2e test was mistakenly merged into the main branch of the MCO repository. This test was supposed to be ephemeral and not actually merged into the repo. It was discovered during the cherrypick process for 4.16 and was removed there. However, it is still part of the main branch and should be removed.
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
Run the test-e2e-techpreview CI job
Actual results:
The test is present and executes as part of the job.
Expected results:
The test should not be present, nor should it execute.
Additional info:
openshift-install image-based create config-template --dir configuration-dir
INFO Config-Template created in: configuration-dir
openshift-install image-based create config-template --dir configuration-dir
FATAL failed to fetch Image-based Config ISO configuration: failed to load asset "Image-based Config ISO configuration": invalid Image-based Config configuration: networkConfig: Invalid value: interfaces:
FATAL - ipv4:
FATAL address:
FATAL - ip: 192.168.122.2
FATAL prefix-length: 23
FATAL dhcp: false
FATAL enabled: true
FATAL mac-address: "00:00:00:00:00:00"
FATAL name: eth0
FATAL state: up
FATAL type: ethernet
FATAL : install nmstate package, exec: "nmstatectl": executable file not found in $PATH
We shouldn't see the above error.
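As a workaround on the host running the installer, installing the nmstate package (which provides nmstatectl on RHEL/Fedora; the package name is an assumption based on the error message) lets the command proceed:
sudo dnf install -y nmstate
openshift-install image-based create config-template --dir configuration-dir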
Please review the following PR: https://github.com/openshift/thanos/pull/146
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
Debug into one of the worker nodes on the hosted cluster:
oc debug node/ip-10-1-0-97.ca-central-1.compute.internal
nslookup kubernetes.default.svc.cluster.local
Server: 10.1.0.2
Address: 10.1.0.2#53
** server can't find kubernetes.default.svc.cluster.local: NXDOMAIN
curl -k https://172.30.0.1:443/readyz
curl: (7) Failed to connect to 172.30.0.1 port 443: Connection refused
sh-5.1# curl -k https://172.20.0.1:443/readyz
ok
Version-Release number of selected component (if applicable):
4.15.20
Steps to Reproduce:
Unknown
Actual results:
Pods on a hosted cluster's workers unable to connect to their internal kube apiserver via the service IP.
Expected results:
Pods on a hosted cluster's workers have connectivity to their kube apiserver via the service IP.
Additional info:
Checked the "Konnectivity server" logs on Dynatrace and found the error below occurs repeatedly
E0724 01:02:00.223151 1 server.go:895] "DIAL_RSP contains failure" err="dial tcp 172.30.176.80:8443: i/o timeout" dialID=8375732890105363305 agentID="1eab211f-6ea1-46ea-bc78-14d75d6ba325" E0724 01:02:00.223482 1 tunnel.go:150] "Received failure on connection" err="read tcp 10.128.17.15:8090->10.128.82.107:52462: use of closed network connection"
Relevant OHSS Ticket: https://issues.redhat.com/browse/OHSS-36053
Description of problem:
When running an agent-based installation with the arm64 or multi payload, after booting the ISO file assisted-service raises the error below and the installation fails to start:
Openshift version 4.16.0-0.nightly-arm64-2024-04-02-182838 for CPU architecture arm64 is not supported: no release image found for openshiftVersion: '4.16.0-0.nightly-arm64-2024-04-02-182838' and CPU architecture 'arm64'" go-id=419 pkg=Inventory request_id=5817b856-ca79-43c0-84f1-b38f733c192f
The same error appears when running the installation with the multi-arch build, in assisted-service.log:
Openshift version 4.16.0-0.nightly-multi-2024-04-01-135550 for CPU architecture multi is not supported: no release image found for openshiftVersion: '4.16.0-0.nightly-multi-2024-04-01-135550' and CPU architecture 'multi'" go-id=306 pkg=Inventory request_id=21a47a40-1de9-4ee3-9906-a2dd90b14ec8
The amd64 build works fine for now.
Version-Release number of selected component (if applicable):
How reproducible:
always
Steps to Reproduce:
1. Create the agent ISO file with the openshift-install binary (openshift-install agent create image) with the arm64/multi payload
2. Boot the ISO file
3. Track the "openshift-install agent wait-for bootstrap-complete" output and the assisted-service log
Actual results:
The installation can't start with error
Expected results:
The installation is working fine
Additional info:
assisted-service log: https://docs.google.com/spreadsheets/d/1Jm-eZDrVz5so4BxsWpUOlr3l_90VmJ8FVEvqUwG8ltg/edit#gid=0
Job fail url:
multi payload: https://qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gs/qe-private-deck/logs/periodic-ci-openshift-openshift-tests-private-release-4.16-multi-nightly-baremetal-compact-agent-ipv4-dhcp-day2-amd-mixarch-f14/1774134780246364160
arm64 payload: https://qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gs/qe-private-deck/logs/periodic-ci-openshift-openshift-tests-private-release-4.16-arm64-nightly-baremetal-pxe-ha-agent-ipv4-static-connected-f14/1773354788239446016
Please review the following PR: https://github.com/openshift/monitoring-plugin/pull/119
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
Bootstrap destroy failed in CI with: level=fatal msg=error destroying bootstrap resources failed during the destroy bootstrap hook: failed to remove bootstrap SSH rule: failed to update AWSCluster during bootstrap destroy: Operation cannot be fulfilled on awsclusters.infrastructure.cluster.x-k8s.io "ci-op-nk1s6685-77004-4gb4d": the object has been modified; please apply your changes to the latest version and try again
Version-Release number of selected component (if applicable):
How reproducible:
Unclear. CI search returns no results. Observed it as a single failure (aws-ovn job, linked below) in the testing of https://amd64.ocp.releases.ci.openshift.org/releasestream/4.17.0-0.nightly/release/4.17.0-0.nightly-2024-06-15-004118
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Two possible solutions:
Description of problem:
The current api version used by the registry operator does not include the recently added "ChunkSizeMiB" feature gate. We need to bump the openshift/api to latest so that this feature gate becomes available for use.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Exercise the v2 Release Candidate in the CI.
Description of problem:
Private HC provision failed on AWS.
How reproducible:
Always.
Steps to Reproduce:
Create a private HC on AWS following the steps in https://hypershift-docs.netlify.app/how-to/aws/deploy-aws-private-clusters/:
RELEASE_IMAGE=registry.ci.openshift.org/ocp/release:4.17.0-0.nightly-2024-06-20-005211
HO_IMAGE=quay.io/hypershift/hypershift-operator:latest
BUCKET_NAME=fxie-hcp-bucket
REGION=us-east-2
AWS_CREDS="$HOME/.aws/credentials"
CLUSTER_NAME=fxie-hcp-1
BASE_DOMAIN=qe.devcluster.openshift.com
EXT_DNS_DOMAIN=hypershift-ext.qe.devcluster.openshift.com
PULL_SECRET="/Users/fxie/Projects/hypershift/.dockerconfigjson"
hypershift install --oidc-storage-provider-s3-bucket-name $BUCKET_NAME --oidc-storage-provider-s3-credentials $AWS_CREDS --oidc-storage-provider-s3-region $REGION --private-platform AWS --aws-private-creds $AWS_CREDS --aws-private-region=$REGION --wait-until-available --hypershift-image $HO_IMAGE
hypershift create cluster aws --pull-secret=$PULL_SECRET --aws-creds=$AWS_CREDS --name=$CLUSTER_NAME --base-domain=$BASE_DOMAIN --node-pool-replicas=2 --region=$REGION --endpoint-access=Private --release-image=$RELEASE_IMAGE --generate-ssh
Additional info:
From the MC:
$ for k in $(oc get secret -n clusters-fxie-hcp-1 | grep -i kubeconfig | awk '{print $1}'); do echo $k; oc extract secret/$k -n clusters-fxie-hcp-1 --to - 2>/dev/null | grep -i 'server:'; done
admin-kubeconfig
server: https://a621f63c3c65f4e459f2044b9521b5e9-082a734ef867f25a.elb.us-east-2.amazonaws.com:6443
aws-pod-identity-webhook-kubeconfig
server: https://kube-apiserver:6443
bootstrap-kubeconfig
server: https://api.fxie-hcp-1.hypershift.local:443
cloud-credential-operator-kubeconfig
server: https://kube-apiserver:6443
dns-operator-kubeconfig
server: https://kube-apiserver:6443
fxie-hcp-1-2bsct-kubeconfig
server: https://kube-apiserver:6443
ingress-operator-kubeconfig
server: https://kube-apiserver:6443
kube-controller-manager-kubeconfig
server: https://kube-apiserver:6443
kube-scheduler-kubeconfig
server: https://kube-apiserver:6443
localhost-kubeconfig
server: https://localhost:6443
service-network-admin-kubeconfig
server: https://kube-apiserver:6443
The bootstrap-kubeconfig uses an incorrect KAS port (should be 6443 since the KAS is exposed through LB), causing kubelet on each HC node to use the same incorrect port. As a result AWS VMs are provisioned but cannot join the HC as nodes.
From a bastion:
[ec2-user@ip-10-0-5-182 ~]$ nc -zv api.fxie-hcp-1.hypershift.local 443
Ncat: Version 7.50 ( https://nmap.org/ncat )
Ncat: Connection timed out.
[ec2-user@ip-10-0-5-182 ~]$ nc -zv api.fxie-hcp-1.hypershift.local 6443
Ncat: Version 7.50 ( https://nmap.org/ncat )
Ncat: Connected to 10.0.143.91:6443.
Ncat: 0 bytes sent, 0 bytes received in 0.01 seconds.
Besides, the CNO also passes the wrong KAS port to Network components on the HC.
Same for HA proxy configuration on the VMs:
frontend local_apiserver
bind 172.20.0.1:6443
log global
mode tcp
option tcplog
default_backend remote_apiserver
backend remote_apiserver
mode tcp
log global
option httpchk GET /version
option log-health-checks
default-server inter 10s fall 3 rise 3
server controlplane api.fxie-hcp-1.hypershift.local:443
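Given that the KAS is exposed through the load balancer on port 6443, the backend entry would presumably need to point there instead; a sketch based on the analysis above, not a verified fix:
server controlplane api.fxie-hcp-1.hypershift.local:6443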
Description of problem:
We should not require the s3:DeleteObject permission for installs when the `preserveBootstrapIgnition` option is set in the install-config.
Version-Release number of selected component (if applicable):
4.14+
How reproducible:
always
Steps to Reproduce:
1. Use an account without the permission
2. Set `preserveBootstrapIgnition: true` in the install-config.yaml (see the sketch below)
3. Try to deploy a cluster
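A sketch of the install-config.yaml stanza referenced in step 2, assuming the field sits under the AWS platform section as in current installer releases (the region value is a placeholder):
platform:
  aws:
    region: us-east-2
    preserveBootstrapIgnition: true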
Actual results:
INFO Credentials loaded from the "denys3" profile in file "/home/cloud-user/.aws/credentials"
INFO Consuming Install Config from target directory
WARNING Action not allowed with tested creds action=s3:DeleteBucket
WARNING Action not allowed with tested creds action=s3:DeleteObject
WARNING Action not allowed with tested creds action=s3:DeleteObject
WARNING Tested creds not able to perform all requested actions
FATAL failed to fetch Cluster: failed to fetch dependency of "Cluster": failed to generate asset "Platform Permissions Check": validate AWS credentials: current credentials insufficient for performing cluster installation
Expected results:
No permission errors.
Additional info:
Please review the following PR: https://github.com/openshift/cluster-version-operator/pull/1047
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
This is a clone of issue OCPBUGS-44657. The following is the description of the original issue:
—
Corresponding Jira ticket for PR https://github.com/openshift/console/pull/14076
CAPA v2.6.1 was just released upstream and it is needed downstream by several projects.
When one of our partners was trying to deploy a 4.16 spoke cluster with the ZTP/GitOps approach, they got the following error message in their assisted-service pod:
error msg="failed to get corresponding infraEnv" func="github.com/openshift/assisted-service/internal/controller/controllers.(*PreprovisioningImageReconciler).AddIronicAgentToInfraEnv" file="/remote-source/assisted-service/app/internal/controller/controllers/preprovisioningimage_controller.go:409" error="record not found" go-id=497 preprovisioning_image=storage-1.fi-911.tre.nsn-rdnet.net preprovisioning_image_namespace=fi-911 request_id=cc62d8f6-d31f-4f74-af50-3237df186dc2
After some discussion in the Assisted-Installer forum (https://redhat-internal.slack.com/archives/CUPJTHQ5P/p1723196754444999), Nick Carboni and Alona Paz suggested that "identifier: mac-address" is not supported. The partner currently has ACM 2.11.0 and MCE 2.6.0. However, their older cluster had ACM 2.10 and MCE 2.4.5 and this parameter was working. Nick and Alona suggested removing "identifier: mac-address" from the siteconfig, and the installation then started to progress. Based on the suggestion from Nick, I opened this bug ticket to understand why this stopped working. The partner asked for official documentation on why this parameter no longer works, or on whether it is no longer supported.
Changing apiserverConfig.Spec.TLSSecurityProfile now makes MCO rollout nodes (see https://github.com/openshift/machine-config-operator/pull/4435) which is disruptive for other tests.
"the simplest solution would be to skip that test for now, and think about how to rewrite/mock it later (even though we try to make tests take that into account, nodes rollout is still a disruptive operation)."
For more context, check https://redhat-internal.slack.com/archives/C026QJY8WJJ/p1722348618775279
Description of problem:
We have a runbook for OVNKubernetesNorthdInactive: https://github.com/openshift/runbooks/blob/master/alerts/cluster-network-operator/OVNKubernetesNorthdInactive.md
However, the runbook URL is not added to the OVNKubernetesNorthdInactive alert:
4.12: https://github.com/openshift/cluster-network-operator/blob/c1a891129c310d01b8d6940f1eefd26058c0f5b6/bindata/network/ovn-kubernetes/managed/alert-rules-control-plane.yaml#L350
4.13: https://github.com/openshift/cluster-network-operator/blob/257435702312e418be694f4b98b8fe89557030c6/bindata/network/ovn-kubernetes/managed/alert-rules-control-plane.yaml#L350
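The fix is presumably to add a runbook_url annotation to that alert in the rule files linked above; a rough sketch of the relevant fragment (not the exact shipped rule):
- alert: OVNKubernetesNorthdInactive
  annotations:
    runbook_url: https://github.com/openshift/runbooks/blob/master/alerts/cluster-network-operator/OVNKubernetesNorthdInactive.md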
Version-Release number of selected component (if applicable):
4.12.z, 4.13.z
How reproducible:
always
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Please review the following PR: https://github.com/openshift/aws-encryption-provider/pull/19
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Previously, in OCPBUGS-32105, we fixed a bug where a race between the assisted-installer and the assisted-installer-controller to mark a Node as Joined would result in 30+ minutes of (unlogged) retries by the former if the latter won. This was indistinguishable from the installation process hanging and it would eventually timed out.
This bug has been fixed, but we were unable to reproduce the circumstances that caused it.
However, a reproduction by the customer reveals another problem: we now correctly retry checking the control plane nodes for readiness if we encounter a conflict with another write from assisted-installer-controller. However, we never reload fresh data from assisted-service - data that would show the host has already been updated and thus prevent us from trying to update it again. Therefore, we continue to get a conflict on every retry. (This is at least now logged, so we can see what is happening.)
This also suggests a potential way to reproduce the problem: whenever one control plane node has booted to the point that the assisted-installer-controller is running before the second control plane node has booted to the point that the Node is marked as ready in the k8s API, there is a possibility of a race. There is in fact no need for the write from assisted-installer-controller to come in the narrow window between when assisted-installer reads vs. writes to the assisted-service API, because assisted-installer is always using a stale read.
The new test: [sig-node] kubelet metrics endpoints should always be reachable
is picking up some upgrade job runs where the metrics endpoint goes down for about 30 seconds during the generic node update phase and recovers before we reboot the node. As initially written, the test treats this as a flake rather than a failure because the outage does not overlap a reboot.
Example: https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.17-e2e-gcp-ovn-upgrade/1806142925785010176
Interval chart showing the problem: https://sippy.dptools.openshift.org/sippy-ng/job_runs/1806142925785010176/periodic-ci-openshift-release-master-ci-4.17-e2e-gcp-ovn-upgrade/intervals?filterText=master-1&intervalFile=e2e-timelines_spyglass_20240627-024633.json&overrideDisplayFlag=0&selectedSources=E2EFailed&selectedSources=MetricsEndpointDown&selectedSources=NodeState
The master outage at 3:30:59 is causing a flake when I'd rather it didn't, because it doesn't extend into the reboot.
I'd like to tighten this up to include any overlap with update.
Will be backported to 4.16 to tighten the signal there as well.
Description of problem:
The developer perspective dashboards page will deduplicate data before showing it on the table.
How reproducible:
Steps to Reproduce:
1. Apply this dashboard YAML (https://drive.google.com/file/d/1PcErgAKqu95yFi5YDAM5LxaTEutVtbrs/view?usp=sharing)
2. Open the dashboard on the Admin console; it should list all the rows
3. Open the dashboard on the Developer console, selecting the openshift-kube-scheduler project
4. See that when varying Plugin values are available under the Execution Time table, they are combined per Pod in the developer perspective
Actual results:
The Developer Perspective Dashboards Table doesn't display all rows returned from a query.
Expected results:
The Developer Perspective Dashboards Table displays all rows returned from a query.
Additional info:
Admin Console: https://drive.google.com/file/d/1EIMYHBql0ql1zYiKlqOJh7hyqG-JFjla/view?usp=sharing
Developer Console: https://drive.google.com/file/d/1jk-Fxq9I6LDYzBGLFTUDDsGqERzwWJrl/view?usp=sharing
It works as expected on OCP <= 4.14
Description of problem:
Trying to delete an application deployed using Serverless, with a user with limited permissions, causes the "Delete application" form to complain:
pipelines.tekton.dev is forbidden: User "uitesting" cannot list resource "pipelines" in API group "tekton.dev" in the namespace "test-cluster-local"
This prevents the deletion. Worth adding that the cluster doesn't have Pipelines installed.
See the sceenshot: https://drive.google.com/file/d/1bsQ_NFO_grj_fE-UInUJXum39bPsHJh1
Version-Release number of selected component (if applicable):
4.15.0
How reproducible:
Always
Steps to Reproduce:
1. Create a user with limited permissions (one possible way is shown below)
2. Deploy some application, not necessarily a Serverless one
3. Try to delete the "application" using the Dev Console
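For step 1, one way to scope a user to limited, namespace-only permissions (the user name and namespace follow the report; the specific role binding is an assumption about how the reporter set it up):
oc adm policy add-role-to-user edit uitesting -n test-cluster-local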
Actual results:
An irrelevant error is shown, preventing the deletion: pipelines.tekton.dev is forbidden: User "uitesting" cannot list resource "pipelines" in API group "tekton.dev" in the namespace "test-cluster-local"
Expected results:
The app should be removed, with everything that's labelled with it.
Please review the following PR: https://github.com/openshift/image-customization-controller/pull/126
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
When setting an overlapping CIDR for v4InternalSubnet, live migration was not blocked.
#oc patch Network.operator.openshift.io cluster --type='merge' --patch '{"spec":{"defaultNetwork":{"ovnKubernetesConfig": {"v4InternalSubnet": "10.128.0.0/16"}}}}'
#oc patch Network.config.openshift.io cluster --type='merge' --patch '{"metadata":{"annotations":{"network.openshift.io/network-type-migration":""}},"spec":{"networkType":"OVNKubernetes"}}'
Version-Release number of selected component (if applicable):
https://github.com/openshift/cluster-network-operator/pull/2392/commits/50201625861ba30570313d8f28c14e59e83f112a
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
I can see:
network 4.16.0-0.ci.test-2024-06-05-023541-ci-ln-hhzztr2-latest True False True 163m The cluster configuration is invalid (network clusterNetwork(10.128.0.0/14) overlaps with network v4InternalSubnet(10.128.0.0/16)). Use 'oc edit network.config.openshift.io cluster' to fix.
However, the migration still proceeds afterwards.
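For reference, 10.128.0.0/14 spans 10.128.0.0 through 10.131.255.255, so it fully contains 10.128.0.0/16; the two ranges overlap, which is why the migration is expected to be rejected.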
Expected results:
Migration should be blocked.
Additional info:
Description of problem:
Bursting hosted cluster creation on a management cluster with size tagging enabled results in some hosted clusters taking a very long time to be scheduled.
Version-Release number of selected component (if applicable):
4.16.0
How reproducible:
always
Steps to Reproduce:
1. Set up a management cluster with the request serving architecture and size tagging enabled.
2. Configure the clustersizing configuration for a concurrency of 5 clusters at a time over a 10-minute window.
3. Create many clusters at the same time.
Actual results:
Some of the created clusters take a very long time to be scheduled.
Expected results:
New clusters should take at most the time required to bring up request serving nodes before being scheduled.
Additional info:
The concurrency settings of the clustersizing configuration are getting applied to both existing clusters and new clusters coming in. They should not be applied to net new hostedclusters.
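A minimal way to generate the burst from step 3, sketched with the hypershift CLI; the cluster count, names, and the credential/secret environment variables are illustrative placeholders, not values from this report:
$ for i in $(seq 1 20); do hypershift create cluster aws --name "burst-$i" --pull-secret "$PULL_SECRET" --aws-creds "$AWS_CREDS" --base-domain "$BASE_DOMAIN" --node-pool-replicas 1 & done; wait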
Description of problem:
When using IPI for IBM Cloud to create a Private BYON cluster, the installer attempts to fetch the VPC resource to verify whether it is already a PermittedNetwork for the DNS Services zone. However, there is currently a new VPC region listed in IBM Cloud, eu-es, which has not yet GA'd. This means that although eu-es is listed among the available VPC regions to search for resources, requests to eu-es fail. Any attempt to use a VPC region that sorts alphabetically after eu-es (the regions appear to be returned in that order) fails because of the requests made to eu-es. This includes eu-gb, us-east, and us-south, and causes a golang panic.
Version-Release number of selected component (if applicable):
4.12
How reproducible:
100%
Steps to Reproduce:
1. Create IBM Cloud BYON resources in us-east or us-south 2. Attempt to create a Private BYON based cluster in us-east or us-south
Actual results:
DEBUG Fetching Common Manifests... DEBUG Reusing previously-fetched Common Manifests DEBUG Generating Terraform Variables... panic: runtime error: invalid memory address or nil pointer dereference [signal SIGSEGV: segmentation violation code=0x1 addr=0x10 pc=0x2bdb706] goroutine 1 [running]: github.com/openshift/installer/pkg/asset/installconfig/ibmcloud.(*Metadata).IsVPCPermittedNetwork(0xc000e89b80, {0x1a8b9918, 0xc00007c088}, {0xc0009d8678, 0x8}) /go/src/github.com/openshift/installer/pkg/asset/installconfig/ibmcloud/metadata.go:175 +0x186 github.com/openshift/installer/pkg/asset/cluster.(*TerraformVariables).Generate(0x1dc55040, 0x5?) /go/src/github.com/openshift/installer/pkg/asset/cluster/tfvars.go:606 +0x3a5a github.com/openshift/installer/pkg/asset/store.(*storeImpl).fetch(0xc000ca0d80, {0x1a8ab280, 0x1dc55040}, {0x0, 0x0}) /go/src/github.com/openshift/installer/pkg/asset/store/store.go:227 +0x5fa github.com/openshift/installer/pkg/asset/store.(*storeImpl).Fetch(0x7ffd948754cc?, {0x1a8ab280, 0x1dc55040}, {0x1dc32840, 0x8, 0x8}) /go/src/github.com/openshift/installer/pkg/asset/store/store.go:77 +0x48 main.runTargetCmd.func1({0x7ffd948754cc, 0xb}) /go/src/github.com/openshift/installer/cmd/openshift-install/create.go:261 +0x125 main.runTargetCmd.func2(0x1dc38800?, {0xc000ca0a80?, 0x3?, 0x3?}) /go/src/github.com/openshift/installer/cmd/openshift-install/create.go:291 +0xe7 github.com/spf13/cobra.(*Command).execute(0x1dc38800, {0xc000ca0a20, 0x3, 0x3}) /go/src/github.com/openshift/installer/vendor/github.com/spf13/cobra/command.go:876 +0x67b github.com/spf13/cobra.(*Command).ExecuteC(0xc000bc8000) /go/src/github.com/openshift/installer/vendor/github.com/spf13/cobra/command.go:990 +0x3bd github.com/spf13/cobra.(*Command).Execute(...) /go/src/github.com/openshift/installer/vendor/github.com/spf13/cobra/command.go:918 main.installerMain() /go/src/github.com/openshift/installer/cmd/openshift-install/main.go:61 +0x2b0 main.main() /go/src/github.com/openshift/installer/cmd/openshift-install/main.go:38 +0xff
Expected results:
Successful Private cluster creation using BYON on IBM Cloud
Additional info:
IBM Cloud development has identified the issue and is working on a fix to all affected supported releases (4.12, 4.13, 4.14+)
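As a pre-flight check (a hedged sketch; it assumes the IBM Cloud CLI with the vpc-infrastructure plugin is installed and logged in), the target region can be queried directly before running the installer, which also reproduces the eu-es failure in isolation:
$ ibmcloud target -r us-east
$ ibmcloud is vpcs
# repeating the same two commands with "-r eu-es" shows the failing request that the installer's region iteration trips over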
This is a clone of issue OCPBUGS-38085. The following is the description of the original issue:
—
Description of problem:
Multipart upload issues with Cloudflare R2 using the S3 API. Some S3-compatible object storage systems like R2 require that all multipart chunks are the same size. This was mostly true before, except that the final chunk could be larger than the requested chunk size, which causes uploads to fail.
Version-Release number of selected component (if applicable):
How reproducible:
Problem shows itself on OpenShift CI clusters intermittently.
Steps to Reproduce:
This behavior has been causing 504 Gateway Timeout issues in the image registry instances in OpenShift CI clusters. It is connected to uploading big images (e.g. 35 GB), but we do not currently have the exact steps that reproduce it. 1. 2. 3.
Actual results:
Expected results:
Additional info:
https://github.com/distribution/distribution/issues/3873 https://github.com/distribution/distribution/issues/3873#issuecomment-2258926705 https://developers.cloudflare.com/r2/api/workers/workers-api-reference/#r2multipartupload-definition (look for "uniform in size")
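For illustration of the failure mode (the 100 MiB chunk size here is an assumed example, not a measured value): with uniform chunking a 35 GiB upload splits into full-size chunks plus a smaller remainder, but the previous behaviour folded the remainder into the final chunk, making that chunk larger than the requested size, which R2 rejects:
$ echo "full chunks: $(( 35 * 1024 / 100 )), remainder: $(( 35 * 1024 % 100 )) MiB"
full chunks: 358, remainder: 40 MiB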
Please review the following PR: https://github.com/openshift/prometheus-operator/pull/292
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
For a cluster having one worker machine of the A3 instance type, "destroy cluster" keeps reporting the failure below until the instance is stopped manually via "gcloud". WARNING failed to stop instance jiwei-0530b-q9t8w-worker-c-ck6s8 in zone us-central1-c: googleapi: Error 400: VM has a Local SSD attached but an undefined value for `discard-local-ssd`. If using gcloud, please add `--discard-local-ssd=false` or `--discard-local-ssd=true` to your command., badRequest
Version-Release number of selected component (if applicable):
4.16.0-0.nightly-multi-2024-05-29-143245
How reproducible:
Always
Steps to Reproduce:
1. "create install-config" and then "create manifests" 2. edit a worker machineset YAML, to specify "machineType: a3-highgpu-8g" along with "onHostMaintenance: Terminate" 3. "create cluster", and make sure it succeeds 4. "destroy cluster"
Actual results:
Uninstalling the cluster keeps reporting the "failed to stop instance" error.
Expected results:
"destroy cluster" should proceed without any warning/error, and delete everything finally.
Additional info:
FYI the .openshift-install.log is available at https://drive.google.com/file/d/15xIwzi0swDk84wqg32tC_4KfUahCalrL/view?usp=drive_link FYI to stop the A3 instance via "gcloud" by specifying "--discard-local-ssd=false" does succeed. $ gcloud compute instances list --format="table(creationTimestamp.date('%Y-%m-%d %H:%M:%S'):sort=1,zone,status,name,machineType,tags.items)" --filter="name~jiwei" 2>/dev/null CREATION_TIMESTAMP ZONE STATUS NAME MACHINE_TYPE ITEMS 2024-05-29 20:55:52 us-central1-a TERMINATED jiwei-0530b-q9t8w-master-0 n2-standard-4 ['jiwei-0530b-q9t8w-master'] 2024-05-29 20:55:52 us-central1-b TERMINATED jiwei-0530b-q9t8w-master-1 n2-standard-4 ['jiwei-0530b-q9t8w-master'] 2024-05-29 20:55:52 us-central1-c TERMINATED jiwei-0530b-q9t8w-master-2 n2-standard-4 ['jiwei-0530b-q9t8w-master'] 2024-05-29 21:10:08 us-central1-a TERMINATED jiwei-0530b-q9t8w-worker-a-rkxkk n2-standard-4 ['jiwei-0530b-q9t8w-worker'] 2024-05-29 21:10:19 us-central1-b TERMINATED jiwei-0530b-q9t8w-worker-b-qg6jv n2-standard-4 ['jiwei-0530b-q9t8w-worker'] 2024-05-29 21:10:31 us-central1-c RUNNING jiwei-0530b-q9t8w-worker-c-ck6s8 a3-highgpu-8g ['jiwei-0530b-q9t8w-worker'] $ gcloud compute instances stop jiwei-0530b-q9t8w-worker-c-ck6s8 --zone us-central1-c ERROR: (gcloud.compute.instances.stop) HTTPError 400: VM has a Local SSD attached but an undefined value for `discard-local-ssd`. If using gcloud, please add `--discard-local-ssd=false` or `--discard-local-ssd=true` to your command. $ gcloud compute instances stop jiwei-0530b-q9t8w-worker-c-ck6s8 --zone us-central1-c --discard-local-ssd=false Stopping instance(s) jiwei-0530b-q9t8w-worker-c-ck6s8...done. Updated [https://compute.googleapis.com/compute/v1/projects/openshift-qe/zones/us-central1-c/instances/jiwei-0530b-q9t8w-worker-c-ck6s8]. $ gcloud compute instances list --format="table(creationTimestamp.date('%Y-%m-%d %H:%M:%S'):sort=1,zone,status,name,machineType,tags.items)" --filter="name~jiwei" 2>/dev/null CREATION_TIMESTAMP ZONE STATUS NAME MACHINE_TYPE ITEMS 2024-05-29 20:55:52 us-central1-a TERMINATED jiwei-0530b-q9t8w-master-0 n2-standard-4 ['jiwei-0530b-q9t8w-master'] 2024-05-29 20:55:52 us-central1-b TERMINATED jiwei-0530b-q9t8w-master-1 n2-standard-4 ['jiwei-0530b-q9t8w-master'] 2024-05-29 20:55:52 us-central1-c TERMINATED jiwei-0530b-q9t8w-master-2 n2-standard-4 ['jiwei-0530b-q9t8w-master'] 2024-05-29 21:10:08 us-central1-a TERMINATED jiwei-0530b-q9t8w-worker-a-rkxkk n2-standard-4 ['jiwei-0530b-q9t8w-worker'] 2024-05-29 21:10:19 us-central1-b TERMINATED jiwei-0530b-q9t8w-worker-b-qg6jv n2-standard-4 ['jiwei-0530b-q9t8w-worker'] 2024-05-29 21:10:31 us-central1-c TERMINATED jiwei-0530b-q9t8w-worker-c-ck6s8 a3-highgpu-8g ['jiwei-0530b-q9t8w-worker'] $ gcloud compute instances delete -q jiwei-0530b-q9t8w-worker-c-ck6s8 --zone us-central1-c Deleted [https://www.googleapis.com/compute/v1/projects/openshift-qe/zones/us-central1-c/instances/jiwei-0530b-q9t8w-worker-c-ck6s8]. $
Description of problem:
When installing a 4.16 cluster while the API public DNS record already exists, the installer reports Terraform Variables initialization errors, which is not expected since Terraform support should have been removed from the installer. 05-19 17:36:32.935 level=fatal msg=failed to fetch Terraform Variables: failed to fetch dependency of "Terraform Variables": failed to generate asset "Platform Provisioning Check": baseDomain: Invalid value: "qe.devcluster.openshift.com": the zone already has record sets for the domain of the cluster: [api.gpei-0519a.qe.devcluster.openshift.com. (A)]
Version-Release number of selected component (if applicable):
4.16.0-0.nightly-2024-05-18-212906, which has the CAPI install as default
How reproducible:
Steps to Reproduce:
1. Create a 4.16 cluster with the cluster name: gpei-0519a 2. After the cluster installation finished, try to create the 2nd one with the same cluster name
Actual results:
05-19 17:36:26.390 level=debug msg=OpenShift Installer 4.16.0-0.nightly-2024-05-18-212906 05-19 17:36:26.390 level=debug msg=Built from commit 3eed76e1400cac88af6638bb097ada1607137f3f 05-19 17:36:26.390 level=debug msg=Fetching Metadata... 05-19 17:36:26.390 level=debug msg=Loading Metadata... 05-19 17:36:26.390 level=debug msg= Loading Cluster ID... 05-19 17:36:26.390 level=debug msg= Loading Install Config... 05-19 17:36:26.390 level=debug msg= Loading SSH Key... 05-19 17:36:26.390 level=debug msg= Loading Base Domain... 05-19 17:36:26.390 level=debug msg= Loading Platform... 05-19 17:36:26.390 level=debug msg= Loading Cluster Name... 05-19 17:36:26.390 level=debug msg= Loading Base Domain... 05-19 17:36:26.390 level=debug msg= Loading Platform... 05-19 17:36:26.390 level=debug msg= Loading Pull Secret... 05-19 17:36:26.390 level=debug msg= Loading Platform... 05-19 17:36:26.390 level=debug msg= Using Install Config loaded from state file 05-19 17:36:26.391 level=debug msg= Using Cluster ID loaded from state file 05-19 17:36:26.391 level=debug msg= Loading Install Config... 05-19 17:36:26.391 level=debug msg= Loading Bootstrap Ignition Config... 05-19 17:36:26.391 level=debug msg= Loading Ironic bootstrap credentials... 05-19 17:36:26.391 level=debug msg= Using Ironic bootstrap credentials loaded from state file 05-19 17:36:26.391 level=debug msg= Loading CVO Ignore... 05-19 17:36:26.391 level=debug msg= Loading Common Manifests... 05-19 17:36:26.391 level=debug msg= Loading Cluster ID... 05-19 17:36:26.391 level=debug msg= Loading Install Config... 05-19 17:36:26.391 level=debug msg= Loading Ingress Config... 05-19 17:36:26.391 level=debug msg= Loading Install Config... 05-19 17:36:26.391 level=debug msg= Using Ingress Config loaded from state file 05-19 17:36:26.391 level=debug msg= Loading DNS Config... 05-19 17:36:26.391 level=debug msg= Loading Install Config... 05-19 17:36:26.392 level=debug msg= Loading Cluster ID... 05-19 17:36:26.392 level=debug msg= Loading Platform Credentials Check... 05-19 17:36:26.392 level=debug msg= Loading Install Config... 05-19 17:36:26.392 level=debug msg= Using Platform Credentials Check loaded from state file 05-19 17:36:26.392 level=debug msg= Using DNS Config loaded from state file 05-19 17:36:26.392 level=debug msg= Loading Infrastructure Config... 05-19 17:36:26.392 level=debug msg= Loading Cluster ID... 05-19 17:36:26.392 level=debug msg= Loading Install Config... 05-19 17:36:26.392 level=debug msg= Loading Cloud Provider Config... 05-19 17:36:26.392 level=debug msg= Loading Install Config... 05-19 17:36:26.392 level=debug msg= Loading Cluster ID... 05-19 17:36:26.392 level=debug msg= Loading Platform Credentials Check... 05-19 17:36:26.392 level=debug msg= Using Cloud Provider Config loaded from state file 05-19 17:36:26.393 level=debug msg= Loading Additional Trust Bundle Config... 05-19 17:36:26.393 level=debug msg= Loading Install Config... 05-19 17:36:26.393 level=debug msg= Using Additional Trust Bundle Config loaded from state file 05-19 17:36:26.393 level=debug msg= Using Infrastructure Config loaded from state file 05-19 17:36:26.393 level=debug msg= Loading Network Config... 05-19 17:36:26.393 level=debug msg= Loading Install Config... 05-19 17:36:26.393 level=debug msg= Using Network Config loaded from state file 05-19 17:36:26.393 level=debug msg= Loading Proxy Config... 05-19 17:36:26.393 level=debug msg= Loading Install Config... 05-19 17:36:26.393 level=debug msg= Loading Network Config... 
05-19 17:36:26.393 level=debug msg= Using Proxy Config loaded from state file 05-19 17:36:26.393 level=debug msg= Loading Scheduler Config... 05-19 17:36:26.394 level=debug msg= Loading Install Config... 05-19 17:36:26.394 level=debug msg= Using Scheduler Config loaded from state file 05-19 17:36:26.394 level=debug msg= Loading Image Content Source Policy... 05-19 17:36:26.394 level=debug msg= Loading Install Config... 05-19 17:36:26.394 level=debug msg= Using Image Content Source Policy loaded from state file 05-19 17:36:26.394 level=debug msg= Loading Cluster CSI Driver Config... 05-19 17:36:26.394 level=debug msg= Loading Install Config... 05-19 17:36:26.394 level=debug msg= Loading Cluster ID... 05-19 17:36:26.394 level=debug msg= Using Cluster CSI Driver Config loaded from state file 05-19 17:36:26.394 level=debug msg= Loading Image Digest Mirror Set... 05-19 17:36:26.394 level=debug msg= Loading Install Config... 05-19 17:36:26.394 level=debug msg= Using Image Digest Mirror Set loaded from state file 05-19 17:36:26.394 level=debug msg= Loading Machine Config Server Root CA... 05-19 17:36:26.395 level=debug msg= Using Machine Config Server Root CA loaded from state file 05-19 17:36:26.395 level=debug msg= Loading Certificate (mcs)... 05-19 17:36:26.395 level=debug msg= Loading Machine Config Server Root CA... 05-19 17:36:26.395 level=debug msg= Loading Install Config... 05-19 17:36:26.395 level=debug msg= Using Certificate (mcs) loaded from state file 05-19 17:36:26.395 level=debug msg= Loading CVOOverrides... 05-19 17:36:26.395 level=debug msg= Using CVOOverrides loaded from state file 05-19 17:36:26.395 level=debug msg= Loading KubeCloudConfig... 05-19 17:36:26.395 level=debug msg= Using KubeCloudConfig loaded from state file 05-19 17:36:26.395 level=debug msg= Loading KubeSystemConfigmapRootCA... 05-19 17:36:26.395 level=debug msg= Using KubeSystemConfigmapRootCA loaded from state file 05-19 17:36:26.395 level=debug msg= Loading MachineConfigServerTLSSecret... 05-19 17:36:26.396 level=debug msg= Using MachineConfigServerTLSSecret loaded from state file 05-19 17:36:26.396 level=debug msg= Loading OpenshiftConfigSecretPullSecret... 05-19 17:36:26.396 level=debug msg= Using OpenshiftConfigSecretPullSecret loaded from state file 05-19 17:36:26.396 level=debug msg= Using Common Manifests loaded from state file 05-19 17:36:26.396 level=debug msg= Loading Openshift Manifests... 05-19 17:36:26.396 level=debug msg= Loading Install Config... 05-19 17:36:26.396 level=debug msg= Loading Cluster ID... 05-19 17:36:26.396 level=debug msg= Loading Kubeadmin Password... 05-19 17:36:26.396 level=debug msg= Using Kubeadmin Password loaded from state file 05-19 17:36:26.396 level=debug msg= Loading OpenShift Install (Manifests)... 05-19 17:36:26.396 level=debug msg= Using OpenShift Install (Manifests) loaded from state file 05-19 17:36:26.397 level=debug msg= Loading Feature Gate Config... 05-19 17:36:26.397 level=debug msg= Loading Install Config... 05-19 17:36:26.397 level=debug msg= Using Feature Gate Config loaded from state file 05-19 17:36:26.397 level=debug msg= Loading CloudCredsSecret... 05-19 17:36:26.397 level=debug msg= Using CloudCredsSecret loaded from state file 05-19 17:36:26.397 level=debug msg= Loading KubeadminPasswordSecret... 05-19 17:36:26.397 level=debug msg= Using KubeadminPasswordSecret loaded from state file 05-19 17:36:26.397 level=debug msg= Loading RoleCloudCredsSecretReader... 
05-19 17:36:26.397 level=debug msg= Using RoleCloudCredsSecretReader loaded from state file 05-19 17:36:26.397 level=debug msg= Loading Baremetal Config CR... 05-19 17:36:26.397 level=debug msg= Using Baremetal Config CR loaded from state file 05-19 17:36:26.397 level=debug msg= Loading Image... 05-19 17:36:26.397 level=debug msg= Loading Install Config... 05-19 17:36:26.398 level=debug msg= Using Image loaded from state file 05-19 17:36:26.398 level=debug msg= Loading AzureCloudProviderSecret... 05-19 17:36:26.398 level=debug msg= Using AzureCloudProviderSecret loaded from state file 05-19 17:36:26.398 level=debug msg= Using Openshift Manifests loaded from state file 05-19 17:36:26.398 level=debug msg= Using CVO Ignore loaded from state file 05-19 17:36:26.398 level=debug msg= Loading Install Config... 05-19 17:36:26.398 level=debug msg= Loading Kubeconfig Admin Internal Client... 05-19 17:36:26.398 level=debug msg= Loading Certificate (admin-kubeconfig-client)... 05-19 17:36:26.398 level=debug msg= Loading Certificate (admin-kubeconfig-signer)... 05-19 17:36:26.398 level=debug msg= Using Certificate (admin-kubeconfig-signer) loaded from state file 05-19 17:36:26.398 level=debug msg= Using Certificate (admin-kubeconfig-client) loaded from state file 05-19 17:36:26.399 level=debug msg= Loading Certificate (kube-apiserver-complete-server-ca-bundle)... 05-19 17:36:26.399 level=debug msg= Loading Certificate (kube-apiserver-localhost-ca-bundle)... 05-19 17:36:26.399 level=debug msg= Loading Certificate (kube-apiserver-localhost-signer)... 05-19 17:36:26.399 level=debug msg= Using Certificate (kube-apiserver-localhost-signer) loaded from state file 05-19 17:36:26.399 level=debug msg= Using Certificate (kube-apiserver-localhost-ca-bundle) loaded from state file 05-19 17:36:26.399 level=debug msg= Loading Certificate (kube-apiserver-service-network-ca-bundle)... 05-19 17:36:26.399 level=debug msg= Loading Certificate (kube-apiserver-service-network-signer)... 05-19 17:36:26.399 level=debug msg= Using Certificate (kube-apiserver-service-network-signer) loaded from state file 05-19 17:36:26.399 level=debug msg= Using Certificate (kube-apiserver-service-network-ca-bundle) loaded from state file 05-19 17:36:26.400 level=debug msg= Loading Certificate (kube-apiserver-lb-ca-bundle)... 05-19 17:36:26.400 level=debug msg= Loading Certificate (kube-apiserver-lb-signer)... 05-19 17:36:26.400 level=debug msg= Using Certificate (kube-apiserver-lb-signer) loaded from state file 05-19 17:36:26.400 level=debug msg= Using Certificate (kube-apiserver-lb-ca-bundle) loaded from state file 05-19 17:36:26.400 level=debug msg= Using Certificate (kube-apiserver-complete-server-ca-bundle) loaded from state file 05-19 17:36:26.400 level=debug msg= Loading Install Config... 05-19 17:36:26.400 level=debug msg= Using Kubeconfig Admin Internal Client loaded from state file 05-19 17:36:26.400 level=debug msg= Loading Kubeconfig Kubelet... 05-19 17:36:26.400 level=debug msg= Loading Certificate (kube-apiserver-complete-server-ca-bundle)... 05-19 17:36:26.400 level=debug msg= Loading Certificate (kubelet-client)... 05-19 17:36:26.401 level=debug msg= Loading Certificate (kubelet-bootstrap-kubeconfig-signer)... 05-19 17:36:26.401 level=debug msg= Using Certificate (kubelet-bootstrap-kubeconfig-signer) loaded from state file 05-19 17:36:26.401 level=debug msg= Using Certificate (kubelet-client) loaded from state file 05-19 17:36:26.401 level=debug msg= Loading Install Config... 
05-19 17:36:26.401 level=debug msg= Using Kubeconfig Kubelet loaded from state file 05-19 17:36:26.401 level=debug msg= Loading Kubeconfig Admin Client (Loopback)... 05-19 17:36:26.401 level=debug msg= Loading Certificate (admin-kubeconfig-client)... 05-19 17:36:26.401 level=debug msg= Loading Certificate (kube-apiserver-localhost-ca-bundle)... 05-19 17:36:26.401 level=debug msg= Loading Install Config... 05-19 17:36:26.401 level=debug msg= Using Kubeconfig Admin Client (Loopback) loaded from state file 05-19 17:36:26.401 level=debug msg= Loading Master Ignition Customization Check... 05-19 17:36:26.402 level=debug msg= Loading Install Config... 05-19 17:36:26.402 level=debug msg= Loading Machine Config Server Root CA... 05-19 17:36:26.402 level=debug msg= Loading Master Ignition Config... 05-19 17:36:26.402 level=debug msg= Loading Install Config... 05-19 17:36:26.402 level=debug msg= Loading Machine Config Server Root CA... 05-19 17:36:26.402 level=debug msg= Loading Master Ignition Config from both state file and target directory 05-19 17:36:26.402 level=debug msg= On-disk Master Ignition Config matches asset in state file 05-19 17:36:26.402 level=debug msg= Using Master Ignition Config loaded from state file 05-19 17:36:26.402 level=debug msg= Using Master Ignition Customization Check loaded from state file 05-19 17:36:26.402 level=debug msg= Loading Worker Ignition Customization Check... 05-19 17:36:26.402 level=debug msg= Loading Install Config... 05-19 17:36:26.402 level=debug msg= Loading Machine Config Server Root CA... 05-19 17:36:26.403 level=debug msg= Loading Worker Ignition Config... 05-19 17:36:26.403 level=debug msg= Loading Install Config... 05-19 17:36:26.403 level=debug msg= Loading Machine Config Server Root CA... 05-19 17:36:26.403 level=debug msg= Loading Worker Ignition Config from both state file and target directory 05-19 17:36:26.403 level=debug msg= On-disk Worker Ignition Config matches asset in state file 05-19 17:36:26.403 level=debug msg= Using Worker Ignition Config loaded from state file 05-19 17:36:26.403 level=debug msg= Using Worker Ignition Customization Check loaded from state file 05-19 17:36:26.403 level=debug msg= Loading Master Machines... 05-19 17:36:26.403 level=debug msg= Loading Cluster ID... 05-19 17:36:26.403 level=debug msg= Loading Platform Credentials Check... 05-19 17:36:26.403 level=debug msg= Loading Install Config... 05-19 17:36:26.403 level=debug msg= Loading Image... 05-19 17:36:26.404 level=debug msg= Loading Master Ignition Config... 05-19 17:36:26.404 level=debug msg= Using Master Machines loaded from state file 05-19 17:36:26.404 level=debug msg= Loading Worker Machines... 05-19 17:36:26.404 level=debug msg= Loading Cluster ID... 05-19 17:36:26.404 level=debug msg= Loading Platform Credentials Check... 05-19 17:36:26.404 level=debug msg= Loading Install Config... 05-19 17:36:26.404 level=debug msg= Loading Image... 05-19 17:36:26.404 level=debug msg= Loading Release... 05-19 17:36:26.404 level=debug msg= Loading Install Config... 05-19 17:36:26.404 level=debug msg= Using Release loaded from state file 05-19 17:36:26.404 level=debug msg= Loading Worker Ignition Config... 05-19 17:36:26.404 level=debug msg= Using Worker Machines loaded from state file 05-19 17:36:26.404 level=debug msg= Loading Common Manifests... 05-19 17:36:26.404 level=debug msg= Loading Openshift Manifests... 05-19 17:36:26.404 level=debug msg= Loading Proxy Config... 05-19 17:36:26.405 level=debug msg= Loading Certificate (admin-kubeconfig-ca-bundle)... 
05-19 17:36:26.405 level=debug msg= Loading Certificate (admin-kubeconfig-signer)... 05-19 17:36:26.405 level=debug msg= Using Certificate (admin-kubeconfig-ca-bundle) loaded from state file 05-19 17:36:26.405 level=debug msg= Loading Certificate (aggregator)... 05-19 17:36:26.405 level=debug msg= Using Certificate (aggregator) loaded from state file 05-19 17:36:26.405 level=debug msg= Loading Certificate (aggregator-ca-bundle)... 05-19 17:36:26.405 level=debug msg= Loading Certificate (aggregator-signer)... 05-19 17:36:26.405 level=debug msg= Using Certificate (aggregator-signer) loaded from state file 05-19 17:36:26.405 level=debug msg= Using Certificate (aggregator-ca-bundle) loaded from state file 05-19 17:36:26.405 level=debug msg= Loading Certificate (system:kube-apiserver-proxy)... 05-19 17:36:26.405 level=debug msg= Loading Certificate (aggregator-signer)... 05-19 17:36:26.406 level=debug msg= Using Certificate (system:kube-apiserver-proxy) loaded from state file 05-19 17:36:26.406 level=debug msg= Loading Certificate (aggregator-signer)... 05-19 17:36:26.406 level=debug msg= Loading Certificate (system:kube-apiserver-proxy)... 05-19 17:36:26.406 level=debug msg= Loading Certificate (aggregator)... 05-19 17:36:26.406 level=debug msg= Using Certificate (system:kube-apiserver-proxy) loaded from state file 05-19 17:36:26.406 level=debug msg= Loading Bootstrap SSH Key Pair... 05-19 17:36:26.406 level=debug msg= Using Bootstrap SSH Key Pair loaded from state file 05-19 17:36:26.406 level=debug msg= Loading User-provided Service Account Signing key... 05-19 17:36:26.406 level=debug msg= Using User-provided Service Account Signing key loaded from state file 05-19 17:36:26.406 level=debug msg= Loading Cloud Provider CA Bundle... 05-19 17:36:26.406 level=debug msg= Loading Install Config... 05-19 17:36:26.407 level=debug msg= Using Cloud Provider CA Bundle loaded from state file 05-19 17:36:26.407 level=debug msg= Loading Certificate (journal-gatewayd)... 05-19 17:36:26.407 level=debug msg= Loading Machine Config Server Root CA... 05-19 17:36:26.407 level=debug msg= Using Certificate (journal-gatewayd) loaded from state file 05-19 17:36:26.407 level=debug msg= Loading Certificate (kube-apiserver-lb-ca-bundle)... 05-19 17:36:26.407 level=debug msg= Loading Certificate (kube-apiserver-external-lb-server)... 05-19 17:36:26.407 level=debug msg= Loading Certificate (kube-apiserver-lb-signer)... 05-19 17:36:26.407 level=debug msg= Loading Install Config... 05-19 17:36:26.407 level=debug msg= Using Certificate (kube-apiserver-external-lb-server) loaded from state file 05-19 17:36:26.407 level=debug msg= Loading Certificate (kube-apiserver-internal-lb-server)... 05-19 17:36:26.407 level=debug msg= Loading Certificate (kube-apiserver-lb-signer)... 05-19 17:36:26.408 level=debug msg= Loading Install Config... 05-19 17:36:26.408 level=debug msg= Using Certificate (kube-apiserver-internal-lb-server) loaded from state file 05-19 17:36:26.408 level=debug msg= Loading Certificate (kube-apiserver-lb-signer)... 05-19 17:36:26.408 level=debug msg= Loading Certificate (kube-apiserver-localhost-ca-bundle)... 05-19 17:36:26.408 level=debug msg= Loading Certificate (kube-apiserver-localhost-server)... 05-19 17:36:26.408 level=debug msg= Loading Certificate (kube-apiserver-localhost-signer)... 05-19 17:36:26.408 level=debug msg= Using Certificate (kube-apiserver-localhost-server) loaded from state file 05-19 17:36:26.408 level=debug msg= Loading Certificate (kube-apiserver-localhost-signer)... 
05-19 17:36:26.408 level=debug msg= Loading Certificate (kube-apiserver-service-network-ca-bundle)... 05-19 17:36:26.408 level=debug msg= Loading Certificate (kube-apiserver-service-network-server)... 05-19 17:36:26.409 level=debug msg= Loading Certificate (kube-apiserver-service-network-signer)... 05-19 17:36:26.409 level=debug msg= Loading Install Config... 05-19 17:36:26.409 level=debug msg= Using Certificate (kube-apiserver-service-network-server) loaded from state file 05-19 17:36:26.409 level=debug msg= Loading Certificate (kube-apiserver-service-network-signer)... 05-19 17:36:26.409 level=debug msg= Loading Certificate (kube-apiserver-complete-server-ca-bundle)... 05-19 17:36:26.409 level=debug msg= Loading Certificate (kube-apiserver-complete-client-ca-bundle)... 05-19 17:36:26.409 level=debug msg= Loading Certificate (admin-kubeconfig-ca-bundle)... 05-19 17:36:26.409 level=debug msg= Loading Certificate (kubelet-client-ca-bundle)... 05-19 17:36:26.409 level=debug msg= Loading Certificate (kubelet-signer)... 05-19 17:36:26.409 level=debug msg= Using Certificate (kubelet-signer) loaded from state file 05-19 17:36:26.410 level=debug msg= Using Certificate (kubelet-client-ca-bundle) loaded from state file 05-19 17:36:26.410 level=debug msg= Loading Certificate (kube-control-plane-ca-bundle)... 05-19 17:36:26.410 level=debug msg= Loading Certificate (kube-control-plane-signer)... 05-19 17:36:26.410 level=debug msg= Using Certificate (kube-control-plane-signer) loaded from state file 05-19 17:36:26.410 level=debug msg= Loading Certificate (kube-apiserver-lb-signer)... 05-19 17:36:26.410 level=debug msg= Loading Certificate (kube-apiserver-localhost-signer)... 05-19 17:36:26.410 level=debug msg= Loading Certificate (kube-apiserver-service-network-signer)... 05-19 17:36:26.410 level=debug msg= Using Certificate (kube-control-plane-ca-bundle) loaded from state file 05-19 17:36:26.410 level=debug msg= Loading Certificate (kube-apiserver-to-kubelet-ca-bundle)... 05-19 17:36:26.411 level=debug msg= Loading Certificate (kube-apiserver-to-kubelet-signer)... 05-19 17:36:26.411 level=debug msg= Using Certificate (kube-apiserver-to-kubelet-signer) loaded from state file 05-19 17:36:26.411 level=debug msg= Using Certificate (kube-apiserver-to-kubelet-ca-bundle) loaded from state file 05-19 17:36:26.411 level=debug msg= Loading Certificate (kubelet-bootstrap-kubeconfig-ca-bundle)... 05-19 17:36:26.411 level=debug msg= Loading Certificate (kubelet-bootstrap-kubeconfig-signer)... 05-19 17:36:26.411 level=debug msg= Using Certificate (kubelet-bootstrap-kubeconfig-ca-bundle) loaded from state file 05-19 17:36:26.411 level=debug msg= Using Certificate (kube-apiserver-complete-client-ca-bundle) loaded from state file 05-19 17:36:26.411 level=debug msg= Loading Certificate (kube-apiserver-to-kubelet-ca-bundle)... 05-19 17:36:26.411 level=debug msg= Loading Certificate (kube-apiserver-to-kubelet-client)... 05-19 17:36:26.412 level=debug msg= Loading Certificate (kube-apiserver-to-kubelet-signer)... 05-19 17:36:26.412 level=debug msg= Using Certificate (kube-apiserver-to-kubelet-client) loaded from state file 05-19 17:36:26.412 level=debug msg= Loading Certificate (kube-apiserver-to-kubelet-signer)... 05-19 17:36:26.412 level=debug msg= Loading Certificate (kube-control-plane-ca-bundle)... 05-19 17:36:26.412 level=debug msg= Loading Certificate (kube-control-plane-kube-controller-manager-client)... 05-19 17:36:26.412 level=debug msg= Loading Certificate (kube-control-plane-signer)... 
05-19 17:36:26.412 level=debug msg= Using Certificate (kube-control-plane-kube-controller-manager-client) loaded from state file 05-19 17:36:26.412 level=debug msg= Loading Certificate (kube-control-plane-kube-scheduler-client)... 05-19 17:36:26.412 level=debug msg= Loading Certificate (kube-control-plane-signer)... 05-19 17:36:26.412 level=debug msg= Using Certificate (kube-control-plane-kube-scheduler-client) loaded from state file 05-19 17:36:26.413 level=debug msg= Loading Certificate (kube-control-plane-signer)... 05-19 17:36:26.413 level=debug msg= Loading Certificate (kubelet-bootstrap-kubeconfig-ca-bundle)... 05-19 17:36:26.413 level=debug msg= Loading Certificate (kubelet-client-ca-bundle)... 05-19 17:36:26.413 level=debug msg= Loading Certificate (kubelet-client)... 05-19 17:36:26.413 level=debug msg= Loading Certificate (kubelet-signer)... 05-19 17:36:26.413 level=debug msg= Loading Certificate (kubelet-serving-ca-bundle)... 05-19 17:36:26.413 level=debug msg= Loading Certificate (kubelet-signer)... 05-19 17:36:26.413 level=debug msg= Using Certificate (kubelet-serving-ca-bundle) loaded from state file 05-19 17:36:26.413 level=debug msg= Loading Certificate (mcs)... 05-19 17:36:26.413 level=debug msg= Loading Machine Config Server Root CA... 05-19 17:36:26.413 level=debug msg= Loading Key Pair (service-account.pub)... 05-19 17:36:26.414 level=debug msg= Using Key Pair (service-account.pub) loaded from state file 05-19 17:36:26.414 level=debug msg= Loading Release Image Pull Spec... 05-19 17:36:26.414 level=debug msg= Using Release Image Pull Spec loaded from state file 05-19 17:36:26.414 level=debug msg= Loading Image... 05-19 17:36:26.414 level=debug msg= Loading Bootstrap Ignition Config from both state file and target directory 05-19 17:36:26.414 level=debug msg= On-disk Bootstrap Ignition Config matches asset in state file 05-19 17:36:26.414 level=debug msg= Using Bootstrap Ignition Config loaded from state file 05-19 17:36:26.414 level=debug msg=Using Metadata loaded from state file 05-19 17:36:26.414 level=debug msg=Reusing previously-fetched Metadata 05-19 17:36:26.415 level=info msg=Consuming Worker Ignition Config from target directory 05-19 17:36:26.415 level=debug msg=Purging asset "Worker Ignition Config" from disk 05-19 17:36:26.415 level=info msg=Consuming Master Ignition Config from target directory 05-19 17:36:26.415 level=debug msg=Purging asset "Master Ignition Config" from disk 05-19 17:36:26.415 level=info msg=Consuming Bootstrap Ignition Config from target directory 05-19 17:36:26.415 level=debug msg=Purging asset "Bootstrap Ignition Config" from disk 05-19 17:36:26.415 level=debug msg=Fetching Master Ignition Customization Check... 05-19 17:36:26.415 level=debug msg=Reusing previously-fetched Master Ignition Customization Check 05-19 17:36:26.415 level=debug msg=Fetching Worker Ignition Customization Check... 05-19 17:36:26.415 level=debug msg=Reusing previously-fetched Worker Ignition Customization Check 05-19 17:36:26.415 level=debug msg=Fetching Terraform Variables... 05-19 17:36:26.415 level=debug msg=Loading Terraform Variables... 05-19 17:36:26.416 level=debug msg= Loading Cluster ID... 05-19 17:36:26.416 level=debug msg= Loading Install Config... 05-19 17:36:26.416 level=debug msg= Loading Image... 05-19 17:36:26.416 level=debug msg= Loading Release... 05-19 17:36:26.416 level=debug msg= Loading BootstrapImage... 05-19 17:36:26.416 level=debug msg= Loading Install Config... 05-19 17:36:26.416 level=debug msg= Loading Image... 
05-19 17:36:26.416 level=debug msg= Loading Bootstrap Ignition Config... 05-19 17:36:26.416 level=debug msg= Loading Master Ignition Config... 05-19 17:36:26.416 level=debug msg= Loading Master Machines... 05-19 17:36:26.416 level=debug msg= Loading Worker Machines... 05-19 17:36:26.416 level=debug msg= Loading Ironic bootstrap credentials... 05-19 17:36:26.416 level=debug msg= Loading Platform Provisioning Check... 05-19 17:36:26.416 level=debug msg= Loading Install Config... 05-19 17:36:26.416 level=debug msg= Loading Common Manifests... 05-19 17:36:26.417 level=debug msg= Fetching Cluster ID... 05-19 17:36:26.417 level=debug msg= Reusing previously-fetched Cluster ID 05-19 17:36:26.417 level=debug msg= Fetching Install Config... 05-19 17:36:26.417 level=debug msg= Reusing previously-fetched Install Config 05-19 17:36:26.417 level=debug msg= Fetching Image... 05-19 17:36:26.417 level=debug msg= Reusing previously-fetched Image 05-19 17:36:26.417 level=debug msg= Fetching Release... 05-19 17:36:26.417 level=debug msg= Reusing previously-fetched Release 05-19 17:36:26.417 level=debug msg= Fetching BootstrapImage... 05-19 17:36:26.417 level=debug msg= Fetching Install Config... 05-19 17:36:26.417 level=debug msg= Reusing previously-fetched Install Config 05-19 17:36:26.417 level=debug msg= Fetching Image... 05-19 17:36:26.417 level=debug msg= Reusing previously-fetched Image 05-19 17:36:26.417 level=debug msg= Generating BootstrapImage... 05-19 17:36:26.417 level=debug msg= Fetching Bootstrap Ignition Config... 05-19 17:36:26.418 level=debug msg= Reusing previously-fetched Bootstrap Ignition Config 05-19 17:36:26.418 level=debug msg= Fetching Master Ignition Config... 05-19 17:36:26.418 level=debug msg= Reusing previously-fetched Master Ignition Config 05-19 17:36:26.418 level=debug msg= Fetching Master Machines... 05-19 17:36:26.418 level=debug msg= Reusing previously-fetched Master Machines 05-19 17:36:26.418 level=debug msg= Fetching Worker Machines... 05-19 17:36:26.418 level=debug msg= Reusing previously-fetched Worker Machines 05-19 17:36:26.418 level=debug msg= Fetching Ironic bootstrap credentials... 05-19 17:36:26.418 level=debug msg= Reusing previously-fetched Ironic bootstrap credentials 05-19 17:36:26.418 level=debug msg= Fetching Platform Provisioning Check... 05-19 17:36:26.418 level=debug msg= Fetching Install Config... 05-19 17:36:26.418 level=debug msg= Reusing previously-fetched Install Config 05-19 17:36:26.418 level=debug msg= Generating Platform Provisioning Check... 05-19 17:36:26.419 level=info msg=Credentials loaded from the "flexy-installer" profile in file "/home/installer1/workspace/ocp-common/Flexy-install@2/flexy/workdir/awscreds20240519-580673-bzyw8l" 05-19 17:36:32.935 level=fatal msg=failed to fetch Terraform Variables: failed to fetch dependency of "Terraform Variables": failed to generate asset "Platform Provisioning Check": baseDomain: Invalid value: "qe.devcluster.openshift.com": the zone already has record sets for the domain of the cluster: [api.gpei-0519a.qe.devcluster.openshift.com. (A)]
Expected results:
Remove all TF checks on AWS/vSphere/Nutanix platforms
Additional info:
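Independently of the Terraform code path, the conflicting record can be confirmed (and cleaned up) with the AWS CLI; a minimal sketch where <hosted-zone-id> is a placeholder for the zone serving qe.devcluster.openshift.com:
$ aws route53 list-resource-record-sets --hosted-zone-id <hosted-zone-id> --query "ResourceRecordSets[?Name=='api.gpei-0519a.qe.devcluster.openshift.com.']"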
This is a clone of issue OCPBUGS-30811. The following is the description of the original issue:
—
Description of problem:
On CI, all the software for the OpenStack and Ansible related pieces is taken from pip and ansible-galaxy instead of the OS repositories.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
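A rough sketch of the intended direction (the RPM package names are assumptions, since the report does not list them): install the tooling from the OS repositories instead of from pip/ansible-galaxy, e.g.
$ sudo dnf install -y python3-openstackclient python3-openstacksdk ansible-core
# instead of: pip install python-openstackclient openstacksdk && ansible-galaxy collection install openstack.cloud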
Please review the following PR: https://github.com/openshift/csi-operator/pull/231
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
This is a clone of issue OCPBUGS-37584. The following is the description of the original issue:
—
Description of problem:
Topology screen crashes and reports "Oh no! something went wrong" when a pod in completed state is selected.
Version-Release number of selected component (if applicable):
RHOCP 4.15.18
How reproducible:
100%
Steps to Reproduce:
1. Switch to developer mode 2. Select Topology 3. Select a project that has completed cron jobs like openshift-image-registry 4. Click the green CronJob Object 5. Observe Crash
Actual results:
The Topology screen crashes with error "Oh no! Something went wrong."
Expected results:
After clicking the completed pod / workload, the screen should display the information related to it.
Additional info:
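If the selected project does not yet have a completed run to click on, one can usually be produced on demand; a hedged sketch assuming the default image-pruner CronJob exists in openshift-image-registry:
$ oc create job --from=cronjob/image-pruner image-pruner-manual -n openshift-image-registry
$ oc get pods -n openshift-image-registry   # wait for the job pod to reach Completed, then select it in Topology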
After some investigation, the issues we have seen with util-linux missing from some images are due to the CentOS Stream base image not installing subscription-manager, which would otherwise pull in util-linux (and a number of other packages) as dependencies:
[root@92063ff10998 /]# yum install subscription-manager
CentOS Stream 9 - BaseOS 3.3 MB/s | 8.9 MB 00:02
CentOS Stream 9 - AppStream 2.1 MB/s | 17 MB 00:08
CentOS Stream 9 - Extras packages 14 kB/s | 17 kB 00:01
Dependencies resolved.
==============================================================================================================================================================================================================================================
Package Architecture Version Repository Size
==============================================================================================================================================================================================================================================
Installing:
subscription-manager aarch64 1.29.40-1.el9 baseos 911 k
Installing dependencies:
acl aarch64 2.3.1-4.el9 baseos 71 k
checkpolicy aarch64 3.6-1.el9 baseos 348 k
cracklib aarch64 2.9.6-27.el9 baseos 95 k
cracklib-dicts aarch64 2.9.6-27.el9 baseos 3.6 M
dbus aarch64 1:1.12.20-8.el9 baseos 3.7 k
dbus-broker aarch64 28-7.el9 baseos 166 k
dbus-common noarch 1:1.12.20-8.el9 baseos 15 k
dbus-libs aarch64 1:1.12.20-8.el9 baseos 150 k
diffutils aarch64 3.7-12.el9 baseos 392 k
dmidecode aarch64 1:3.3-7.el9 baseos 70 k
gobject-introspection aarch64 1.68.0-11.el9 baseos 248 k
iproute aarch64 6.2.0-5.el9 baseos 818 k
kmod-libs aarch64 28-9.el9 baseos 62 k
libbpf aarch64 2:1.3.0-2.el9 baseos 172 k
libdb aarch64 5.3.28-53.el9 baseos 712 k
libdnf-plugin-subscription-manager aarch64 1.29.40-1.el9 baseos 63 k
libeconf aarch64 0.4.1-4.el9 baseos 26 k
libfdisk aarch64 2.37.4-18.el9 baseos 150 k
libmnl aarch64 1.0.4-16.el9 baseos 28 k
libpwquality aarch64 1.4.4-8.el9 baseos 119 k
libseccomp aarch64 2.5.2-2.el9 baseos 72 k
libselinux-utils aarch64 3.6-1.el9 baseos 190 k
libuser aarch64 0.63-13.el9 baseos 405 k
libutempter aarch64 1.2.1-6.el9 baseos 27 k
openssl aarch64 1:3.2.1-1.el9 baseos 1.3 M
pam aarch64 1.5.1-19.el9 baseos 627 k
passwd aarch64 0.80-12.el9 baseos 121 k
policycoreutils aarch64 3.6-2.1.el9 baseos 242 k
policycoreutils-python-utils noarch 3.6-2.1.el9 baseos 77 k
psmisc aarch64 23.4-3.el9 baseos 243 k
python3-audit aarch64 3.1.2-2.el9 baseos 83 k
python3-chardet noarch 4.0.0-5.el9 baseos 239 k
python3-cloud-what aarch64 1.29.40-1.el9 baseos 77 k
python3-dateutil noarch 1:2.8.1-7.el9 baseos 288 k
python3-dbus aarch64 1.2.18-2.el9 baseos 144 k
python3-decorator noarch 4.4.2-6.el9 baseos 28 k
python3-distro noarch 1.5.0-7.el9 baseos 37 k
python3-dnf-plugins-core noarch 4.3.0-15.el9 baseos 264 k
python3-gobject-base aarch64 3.40.1-6.el9 baseos 184 k
python3-gobject-base-noarch noarch 3.40.1-6.el9 baseos 161 k
python3-idna noarch 2.10-7.el9.1 baseos 102 k
python3-iniparse noarch 0.4-45.el9 baseos 47 k
python3-inotify noarch 0.9.6-25.el9 baseos 53 k
python3-librepo aarch64 1.14.5-2.el9 baseos 48 k
python3-libselinux aarch64 3.6-1.el9 baseos 183 k
python3-libsemanage aarch64 3.6-1.el9 baseos 79 k
python3-policycoreutils noarch 3.6-2.1.el9 baseos 2.1 M
python3-pysocks noarch 1.7.1-12.el9 baseos 35 k
python3-requests noarch 2.25.1-8.el9 baseos 125 k
python3-setools aarch64 4.4.4-1.el9 baseos 595 k
python3-setuptools noarch 53.0.0-12.el9 baseos 944 k
python3-six noarch 1.15.0-9.el9 baseos 37 k
python3-subscription-manager-rhsm aarch64 1.29.40-1.el9 baseos 162 k
python3-systemd aarch64 234-18.el9 baseos 89 k
python3-urllib3 noarch 1.26.5-5.el9 baseos 215 k
subscription-manager-rhsm-certificates noarch 20220623-1.el9 baseos 21 k
systemd aarch64 252-33.el9 baseos 4.0 M
systemd-libs aarch64 252-33.el9 baseos 641 k
systemd-pam aarch64 252-33.el9 baseos 271 k
systemd-rpm-macros noarch 252-33.el9 baseos 69 k
usermode aarch64 1.114-4.el9 baseos 189 k
util-linux aarch64 2.37.4-18.el9 baseos 2.3 M
util-linux-core aarch64 2.37.4-18.el9 baseos 463 k
virt-what aarch64 1.25-5.el9 baseos 33 k
which aarch64 2.21-29.el9 baseos 41 k
Transaction Summary
==============================================================================================================================================================================================================================================
Install 66 Packages
Total download size: 26 M
Installed size: 92 M
Is this ok [y/N]:
subscription-manager does bring in quite a few things. We can probably get away with installing:
systemd util-linux iproute dbus
We may still hit some edge cases where something works in OCP but doesn't in OKD due to a missing package; we have hit at least 6 or 7 containers using tools from util-linux so far.
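A minimal sketch of the proposed base-image change (the exact build layer this lands in is an assumption; the package list is the one suggested above):
RUN dnf install -y systemd util-linux iproute dbus && dnf clean all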
Description of the problem:
While installing many SNOs via ACM/ZTP using the infrastructure operator, every SNO ends up with a leftover pod in the `assisted-installer` namespace that is stuck in the ImagePullBackOff state. There is no sha digest referenced in the pod spec, so the container image cannot be pulled in a disconnected environment. In large-scale tests this results in 3500+ pods across 3500 SNO clusters stuck in ImagePullBackOff. Despite this, the clusters succeed in installing, so this does not block tests, but it should be addressed given the additional resources that a pod that cannot run places on an SNO.
Versions
Hub and Deployed SNO - 4.15.2
ACM - 2.10.0-DOWNSTREAM-2024-03-14-14-53-38
How reproducible:
Always in disconnected
Steps to reproduce:
1.
2.
3.
Actual results:
# oc --kubeconfig /root/hv-vm/kc/vm00006/kubeconfig get po -n assisted-installer NAME READY STATUS RESTARTS AGE assisted-installer-controller-z569s 0/1 Completed 0 25h vm00006-debug-n477b 0/1 ImagePullBackOff 0 25h
Yaml of pod in question:
# oc --kubeconfig /root/hv-vm/kc/vm00006/kubeconfig get po -n assisted-installer vm00006-debug-n477b -o yaml apiVersion: v1 kind: Pod metadata: annotations: debug.openshift.io/source-container: container-00 debug.openshift.io/source-resource: /v1, Resource=nodes/vm00006 openshift.io/scc: privileged creationTimestamp: "2024-03-25T16:31:48Z" name: vm00006-debug-n477b namespace: assisted-installer resourceVersion: "501965" uid: 4c46ed25-5d81-4e27-8f2a-5a5eb89cc474 spec: containers: - command: - chroot - /host - last - reboot env: - name: TMOUT value: "900" image: registry.redhat.io/rhel8/support-tools imagePullPolicy: Always name: container-00 resources: {} securityContext: privileged: true runAsUser: 0 terminationMessagePath: /dev/termination-log terminationMessagePolicy: File volumeMounts: - mountPath: /host name: host - mountPath: /var/run/secrets/kubernetes.io/serviceaccount name: kube-api-access-zfn5z readOnly: true dnsPolicy: ClusterFirst enableServiceLinks: true hostIPC: true hostNetwork: true hostPID: true nodeName: vm00006 preemptionPolicy: PreemptLowerPriority priority: 1000000000 priorityClassName: openshift-user-critical restartPolicy: Never schedulerName: default-scheduler securityContext: {} serviceAccount: default serviceAccountName: default terminationGracePeriodSeconds: 30 tolerations: - effect: NoExecute key: node.kubernetes.io/not-ready operator: Exists tolerationSeconds: 300 - effect: NoExecute key: node.kubernetes.io/unreachable operator: Exists tolerationSeconds: 300 volumes: - hostPath: path: / type: Directory name: host - name: kube-api-access-zfn5z projected: defaultMode: 420 sources: - serviceAccountToken: expirationSeconds: 3607 path: token - configMap: items: - key: ca.crt path: ca.crt name: kube-root-ca.crt - downwardAPI: items: - fieldRef: apiVersion: v1 fieldPath: metadata.namespace path: namespace - configMap: items: - key: service-ca.crt path: service-ca.crt name: openshift-service-ca.crt status: conditions: - lastProbeTime: null lastTransitionTime: "2024-03-25T16:31:48Z" status: "True" type: Initialized - lastProbeTime: null lastTransitionTime: "2024-03-25T16:31:48Z" message: 'containers with unready status: [container-00]' reason: ContainersNotReady status: "False" type: Ready - lastProbeTime: null lastTransitionTime: "2024-03-25T16:31:48Z" message: 'containers with unready status: [container-00]' reason: ContainersNotReady status: "False" type: ContainersReady - lastProbeTime: null lastTransitionTime: "2024-03-25T16:31:48Z" status: "True" type: PodScheduled containerStatuses: - image: registry.redhat.io/rhel8/support-tools imageID: "" lastState: {} name: container-00 ready: false restartCount: 0 started: false state: waiting: message: Back-off pulling image "registry.redhat.io/rhel8/support-tools" reason: ImagePullBackOff hostIP: fc00:1005::3ed phase: Pending podIP: fc00:1005::3ed podIPs: - ip: fc00:1005::3ed qosClass: BestEffort startTime: "2024-03-25T16:31:48Z"
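For context, the annotations in the pod spec above (debug.openshift.io/source-resource, debug.openshift.io/source-container) and the chroot command are consistent with a node debug session; the equivalent manual invocation would look roughly like the following (a reconstruction, not a confirmed code path):
$ oc debug node/vm00006 --image=registry.redhat.io/rhel8/support-tools -- chroot /host last reboot
Because the image is referenced by tag rather than by digest, it cannot be resolved in the disconnected environment, leaving the pod in ImagePullBackOff.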
ACI Object for this cluster:
# oc get aci -n vm00005 vm00005 -o yaml apiVersion: extensions.hive.openshift.io/v1beta1 kind: AgentClusterInstall metadata: annotations: agent-install.openshift.io/install-config-overrides: '{"networking":{"networkType":"OVNKubernetes"},"capabilities":{"baselineCapabilitySet": "None", "additionalEnabledCapabilities": [ "OperatorLifecycleManager", "NodeTuning" ] }}' argocd.argoproj.io/sync-wave: "1" kubectl.kubernetes.io/last-applied-configuration: | {"apiVersion":"extensions.hive.openshift.io/v1beta1","kind":"AgentClusterInstall","metadata":{"annotations":{"agent-install.openshift.io/install-config-overrides":"{\"networking\":{\"networkType\":\"OVNKubernetes\"},\"capabilities\":{\"baselineCapabilitySet\": \"None\", \"additionalEnabledCapabilities\": [ \"OperatorLifecycleManager\", \"NodeTuning\" ] }}","argocd.argoproj.io/sync-wave":"1","ran.openshift.io/ztp-gitops-generated":"{}"},"labels":{"app.kubernetes.io/instance":"ztp-clusters-01"},"name":"vm00005","namespace":"vm00005"},"spec":{"clusterDeploymentRef":{"name":"vm00005"},"imageSetRef":{"name":"openshift-4.15.2"},"manifestsConfigMapRef":{"name":"vm00005"},"networking":{"clusterNetwork":[{"cidr":"fd01::/48","hostPrefix":64}],"machineNetwork":[{"cidr":"fc00:1005::/64"}],"serviceNetwork":["fd02::/112"]},"provisionRequirements":{"controlPlaneAgents":1},"sshPublicKey":"ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABgQC6351YHGIvE6DZcAt0RXqodzKbglUeCNqxRyd7OOfcN+p88RaKZahg9/dL81IbdlGPbhPPG9ga7BdkLN8VbyYNCjE7kBIKM47JS1pbjeoJhI8bBrFjjp6LIJW/tWh1Pyl7Mk3DbPiZOkJ9HXvOgE/HRh44jJPtMLnzrZU5VNgNRgaEBWOG+j06pxdK9giMji1mFkJXSr43YUZYYgM3egNfNzxeTG0SshbZarRDEKeAlnDJkZ70rbP2krL2MgZJDv8vIK1PcMMFhjsJ/4Pp7F0Tl2Rm/qlZhTn4ptWagZmM0Z3N2WkNdX6Z9i2lZ5K+5jNHEFfjw/CPOFqpaFMMckpfFMsAJchbqnh+F5NvKJSFNB6L77iRCp5hbhGBbZncwc3UDO3FZ9ZuYZ8Ws+2ZyS5uVxd5ZUsvZFO+mWwySytFbsc0nUUcgkXlBiGKF/eFm9SQTURkyNzJkJfPm7awRwYoidaf8MTSp/kUCCyloAjpFIOJAa0SoVerhLp8uhQzfeU= root@e38-h01-000-r650.rdu2.scalelab.redhat.com"}} ran.openshift.io/ztp-gitops-generated: '{}' creationTimestamp: "2024-03-25T15:39:40Z" finalizers: - agentclusterinstall.agent-install.openshift.io/ai-deprovision generation: 3 labels: app.kubernetes.io/instance: ztp-clusters-01 name: vm00005 namespace: vm00005 ownerReferences: - apiVersion: hive.openshift.io/v1 kind: ClusterDeployment name: vm00005 uid: 4d647db3-88fb-4b64-8a47-d50c9e2dfe7b resourceVersion: "267225" uid: f6472a4f-d483-4563-8a36-388e7d3874c9 spec: clusterDeploymentRef: name: vm00005 clusterMetadata: adminKubeconfigSecretRef: name: vm00005-admin-kubeconfig adminPasswordSecretRef: name: vm00005-admin-password clusterID: b758099e-3556-4fec-9190-ba709d4fbcaf infraID: 1de248f6-6767-47dc-8617-c02e8d0d457e imageSetRef: name: openshift-4.15.2 manifestsConfigMapRef: name: vm00005 networking: clusterNetwork: - cidr: fd01::/48 hostPrefix: 64 machineNetwork: - cidr: fc00:1005::/64 serviceNetwork: - fd02::/112 userManagedNetworking: true provisionRequirements: controlPlaneAgents: 1 sshPublicKey: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABgQC6351YHGIvE6DZcAt0RXqodzKbglUeCNqxRyd7OOfcN+p88RaKZahg9/dL81IbdlGPbhPPG9ga7BdkLN8VbyYNCjE7kBIKM47JS1pbjeoJhI8bBrFjjp6LIJW/tWh1Pyl7Mk3DbPiZOkJ9HXvOgE/HRh44jJPtMLnzrZU5VNgNRgaEBWOG+j06pxdK9giMji1mFkJXSr43YUZYYgM3egNfNzxeTG0SshbZarRDEKeAlnDJkZ70rbP2krL2MgZJDv8vIK1PcMMFhjsJ/4Pp7F0Tl2Rm/qlZhTn4ptWagZmM0Z3N2WkNdX6Z9i2lZ5K+5jNHEFfjw/CPOFqpaFMMckpfFMsAJchbqnh+F5NvKJSFNB6L77iRCp5hbhGBbZncwc3UDO3FZ9ZuYZ8Ws+2ZyS5uVxd5ZUsvZFO+mWwySytFbsc0nUUcgkXlBiGKF/eFm9SQTURkyNzJkJfPm7awRwYoidaf8MTSp/kUCCyloAjpFIOJAa0SoVerhLp8uhQzfeU= 
root@e38-h01-000-r650.rdu2.scalelab.redhat.com status: apiVIP: fc00:1005::3ec apiVIPs: - fc00:1005::3ec conditions: - lastProbeTime: "2024-03-25T15:39:40Z" lastTransitionTime: "2024-03-25T15:39:40Z" message: SyncOK reason: SyncOK status: "True" type: SpecSynced - lastProbeTime: "2024-03-25T15:55:46Z" lastTransitionTime: "2024-03-25T15:55:46Z" message: The cluster's validations are passing reason: ValidationsPassing status: "True" type: Validated - lastProbeTime: "2024-03-25T16:48:06Z" lastTransitionTime: "2024-03-25T16:48:06Z" message: The cluster requirements are met reason: ClusterAlreadyInstalling status: "True" type: RequirementsMet - lastProbeTime: "2024-03-25T16:48:06Z" lastTransitionTime: "2024-03-25T16:48:06Z" message: 'The installation has completed: Cluster is installed' reason: InstallationCompleted status: "True" type: Completed - lastProbeTime: "2024-03-25T15:39:40Z" lastTransitionTime: "2024-03-25T15:39:40Z" message: The installation has not failed reason: InstallationNotFailed status: "False" type: Failed - lastProbeTime: "2024-03-25T16:48:06Z" lastTransitionTime: "2024-03-25T16:48:06Z" message: The installation has stopped because it completed successfully reason: InstallationCompleted status: "True" type: Stopped - lastProbeTime: "2024-03-25T15:57:07Z" lastTransitionTime: "2024-03-25T15:57:07Z" reason: There is no failing prior preparation attempt status: "False" type: LastInstallationPreparationFailed connectivityMajorityGroups: '{"IPv4":[],"IPv6":[],"fc00:1005::/64":[]}' debugInfo: eventsURL: https://assisted-service-multicluster-engine.apps.acm-lta.rdu2.scalelab.redhat.com/api/assisted-install/v2/events?api_key=eyJhbGciOiJFUzI1NiIsInR5cCI6IkpXVCJ9.eyJjbHVzdGVyX2lkIjoiMWRlMjQ4ZjYtNjc2Ny00N2RjLTg2MTctYzAyZThkMGQ0NTdlIn0.rC8eJsuw3mVOtBNWFUu6Nq5pRjmiBLuC6b_xV_FetO5D_8Tc4qz7_hit29C92Xrl_pjysD3tTey2c9NJoI9kTA&cluster_id=1de248f6-6767-47dc-8617-c02e8d0d457e logsURL: https://assisted-service-multicluster-engine.apps.acm-lta.rdu2.scalelab.redhat.com/api/assisted-install/v2/clusters/1de248f6-6767-47dc-8617-c02e8d0d457e/logs?api_key=eyJhbGciOiJFUzI1NiIsInR5cCI6IkpXVCJ9.eyJjbHVzdGVyX2lkIjoiMWRlMjQ4ZjYtNjc2Ny00N2RjLTg2MTctYzAyZThkMGQ0NTdlIn0._CjP1uE4SpWQ93CIAVjHqN1uyse4d0AU5FThBSAyfKoRF8IvvY9Zpsr2kx3jNd_CyzHp0wPT5EWWpIMvQx79Kw state: adding-hosts stateInfo: Cluster is installed ingressVIP: fc00:1005::3ec ingressVIPs: - fc00:1005::3ec machineNetwork: - cidr: fc00:1005::/64 platformType: None progress: totalPercentage: 100 userManagedNetworking: true validationsInfo: configuration: - id: platform-requirements-satisfied message: Platform requirements satisfied status: success - id: pull-secret-set message: The pull secret is set. status: success hosts-data: - id: all-hosts-are-ready-to-install message: All hosts in the cluster are ready to install. status: success - id: sufficient-masters-count message: The cluster has the exact amount of dedicated control plane nodes. status: success network: - id: api-vips-defined message: 'API virtual IPs are not required: User Managed Networking' status: success - id: api-vips-valid message: 'API virtual IPs are not required: User Managed Networking' status: success - id: cluster-cidr-defined message: The Cluster Network CIDR is defined. status: success - id: dns-domain-defined message: The base domain is defined. 
status: success - id: ingress-vips-defined message: 'Ingress virtual IPs are not required: User Managed Networking' status: success - id: ingress-vips-valid message: 'Ingress virtual IPs are not required: User Managed Networking' status: success - id: machine-cidr-defined message: The Machine Network CIDR is defined. status: success - id: machine-cidr-equals-to-calculated-cidr message: 'The Cluster Machine CIDR is not required: User Managed Networking' status: success - id: network-prefix-valid message: The Cluster Network prefix is valid. status: success - id: network-type-valid message: The cluster has a valid network type status: success - id: networks-same-address-families message: Same address families for all networks. status: success - id: no-cidrs-overlapping message: No CIDRS are overlapping. status: success - id: ntp-server-configured message: No ntp problems found status: success - id: service-cidr-defined message: The Service Network CIDR is defined. status: success operators: - id: cnv-requirements-satisfied message: cnv is disabled status: success - id: lso-requirements-satisfied message: lso is disabled status: success - id: lvm-requirements-satisfied message: lvm is disabled status: success - id: mce-requirements-satisfied message: mce is disabled status: success - id: odf-requirements-satisfied message: odf is disabled status: success
Expected results:
Description of problem:
The creation of an Azure HC with secret encryption failed with # azure-kms-provider-active container log (within the KAS pod) I0516 09:38:22.860917 1 exporter.go:17] "metrics backend" exporter="prometheus" I0516 09:38:22.861178 1 prometheus_exporter.go:56] "Prometheus metrics server running" address="8095" I0516 09:38:22.861199 1 main.go:90] "Starting KeyManagementServiceServer service" version="" buildDate="" E0516 09:38:22.861439 1 main.go:59] "unrecoverable error encountered" err="failed to create key vault client: key vault name, key name and key version are required"
How reproducible:
Always
Steps to Reproduce:
1. export RESOURCEGROUP="fxie-1234-rg" LOCATION="eastus" KEYVAULT_NAME="fxie-1234-keyvault" KEYVAULT_KEY_NAME="fxie-1234-key" KEYVAULT_KEY2_NAME="fxie-1234-key-2"
2. az group create --name $RESOURCEGROUP --location $LOCATION
3. az keyvault create -n $KEYVAULT_NAME -g $RESOURCEGROUP -l $LOCATION --enable-purge-protection true
4. az keyvault set-policy -n $KEYVAULT_NAME --key-permissions decrypt encrypt --spn fa5abf8d-ed43-4637-93a7-688e2a0efd82
5. az keyvault key create --vault-name $KEYVAULT_NAME -n $KEYVAULT_KEY_NAME --protection software
6. KEYVAULT_KEY_URL="$(az keyvault key show --vault-name $KEYVAULT_NAME --name $KEYVAULT_KEY_NAME --query 'key.kid' -o tsv)"
7. hypershift create cluster azure --pull-secret $PULL_SECRET --name $CLUSTER_NAME --azure-creds $HOME/.azure/osServicePrincipal.json --node-pool-replicas=1 --location eastus --base-domain $BASE_DOMAIN --release-image registry.ci.openshift.org/ocp/release:4.16.0-0.nightly-2024-05-15-001800 --encryption-key-id $KEYVAULT_KEY_URL
Root cause:
The ENTRYPOINT statement in azure-kubernetes-kms's Dockerfile is in shell form, which prevents any command-line arguments from being passed to the binary.
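For illustration, a minimal sketch of the two ENTRYPOINT forms; the binary path here is a placeholder, not the actual azure-kubernetes-kms Dockerfile content:
# Shell form: the image runs "/bin/sh -c <command>", so arguments supplied by
# the pod spec (e.g. the key vault name, key name and key version flags)
# never reach the binary.
ENTRYPOINT /kms-plugin
# Exec form: arguments from the pod spec are appended to the binary's argv,
# so the flags are actually parsed.
ENTRYPOINT ["/kms-plugin"]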
Description of problem:
When trying to onboard a xFusion baremetal node using redfish-virtual media (no provisioning network), it fails after the node registration with this error: Normal InspectionError 60s metal3-baremetal-controller Failed to inspect hardware. Reason: unable to start inspection: The attribute Links/ManagedBy is missing from the resource /redfish/v1/Systems/1
Version-Release number of selected component (if applicable):
4.14.18
How reproducible:
Just add an xFusion baremetal node, specifying in the manifest:
Spec:
  Automated Cleaning Mode: metadata
  Bmc:
    Address: redfish-virtualmedia://w.z.x.y/redfish/v1/Systems/1
    Credentials Name: hu28-tovb-bmc-secret
    Disable Certificate Verification: true
  Boot MAC Address: MAC
  Boot Mode: UEFI
  Online: false
  Preprovisioning Network Data Name: openstack-hu28-tovb-network-config-secret
Steps to Reproduce:
1. 2. 3.
Actual results:
Inspection fails with the aforementioned error; no preprovisioning image is mounted on the host's virtual media.
Expected results:
The virtual media gets mounted and inspection starts.
Additional info:
Description of problem: When the bootstrap times out, the installer tries to download the logs from the bootstrap VM and gives an analysis of what happened. On OpenStack platform, we're currently failing to download the bootstrap logs (tracked in OCPBUGS-34950), which causes the analysis to always return an erroneous message:
time="2024-06-05T08:34:45-04:00" level=error msg="Bootstrap failed to complete: timed out waiting for the condition" time="2024-06-05T08:34:45-04:00" level=error msg="Failed to wait for bootstrapping to complete. This error usually happens when there is a problem with control plane hosts that prevents the control plane operators from creating the control plane." time="2024-06-05T08:34:45-04:00" level=error msg="The bootstrap machine did not execute the release-image.service systemd unit"
The claim that the bootstrap machine did not execute the release-image.service systemd unit is wrong, as I can confirm by SSH'ing into the bootstrap node:
systemctl status release-image.service ● release-image.service - Download the OpenShift Release Image Loaded: loaded (/etc/systemd/system/release-image.service; static) Active: active (exited) since Wed 2024-06-05 11:57:33 UTC; 1h 16min ago Process: 2159 ExecStart=/usr/local/bin/release-image-download.sh (code=exited, status=0/SUCCESS) Main PID: 2159 (code=exited, status=0/SUCCESS) CPU: 47.364s Jun 05 11:57:05 mandre-tnvc8bootstrap systemd[1]: Starting Download the OpenShift Release Image... Jun 05 11:57:06 mandre-tnvc8bootstrap podman[2184]: 2024-06-05 11:57:06.895418265 +0000 UTC m=+0.811028632 system refresh Jun 05 11:57:06 mandre-tnvc8bootstrap release-image-download.sh[2159]: Pulling quay.io/openshift-release-dev/ocp-release@sha256:31cdf34b1957996d5c79c48466abab2fcfb9d9843> Jun 05 11:57:32 mandre-tnvc8bootstrap release-image-download.sh[2269]: 079f5c86b015ddaf9c41349ba292d7a5487be91dd48e48852d10e64dd0ec125d Jun 05 11:57:32 mandre-tnvc8bootstrap podman[2269]: 2024-06-05 11:57:32.82473216 +0000 UTC m=+25.848290388 image pull 079f5c86b015ddaf9c41349ba292d7a5487be91dd48e48852d1> Jun 05 11:57:33 mandre-tnvc8bootstrap systemd[1]: Finished Download the OpenShift Release Image.
The installer was just unable to retrieve the bootstrap logs. Earlier, buried in the installer logs, we can see:
time="2024-06-05T08:34:42-04:00" level=info msg="Failed to gather bootstrap logs: failed to connect to the bootstrap machine: dial tcp 10.196.2.10:22: connect: connection
timed out"
This is what should be reported by the analyzer.
Currently the manifests directory has:
0000_30_cluster-api_00_credentials-request.yaml 0000_30_cluster-api_00_namespace.yaml ...
CredentialsRequests go into the openshift-cloud-credential-operator namespace, so they can come before or after the openshift-cluster-api namespace. But because they ask for Secrets in the openshift-cluster-api namespace, there would be less race and drama if the CredentialsRequest manifests were given a name that sorted them after the namespace. Like 0000_30_cluster-api_01_credentials-request.yaml.
I haven't gone digging in history, it may have been like this since forever.
Every time.
With a release image pullspec like registry.ci.openshift.org/ocp/release:4.17.0-0.nightly-2024-06-27-184535:
$ oc adm release extract --to manifests registry.ci.openshift.org/ocp/release:4.17.0-0.nightly-2024-06-27-184535 $ ls manifests/0000_30_cluster-api_* | grep 'namespace\|credentials-request'
$ ls manifests/0000_30_cluster-api_* | grep 'namespace\|credentials-request' manifests/0000_30_cluster-api_00_credentials-request.yaml manifests/0000_30_cluster-api_00_namespace.yaml
$ ls manifests/0000_30_cluster-api_* | grep 'namespace\|credentials-request' manifests/0000_30_cluster-api_00_namespace.yaml manifests/0000_30_cluster-api_01_credentials-request.yaml
Description of problem:
Given the sessions can now modify their expiration during their lifetime (by going through the token refresh process), the current pruning mechanism might randomly remove active sessions. We need to fix that behavior by ordering the byAge index of the session storage by expiration and only removing the sessions that are either really expired or close to expiration.
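A minimal sketch of the intended pruning behavior, using hypothetical type and field names rather than the console's actual session-store API: keep the index ordered by expiration, and only drop sessions that are already expired or about to expire within a small grace window.
package session

import (
	"sort"
	"time"
)

type session struct {
	token   string
	expires time.Time
}

// pruneExpired orders sessions by expiration and removes only those that are
// already expired or will expire within the grace window, so sessions that
// extended their lifetime via token refresh are never dropped at random.
func pruneExpired(sessions []session, now time.Time, grace time.Duration) []session {
	sort.Slice(sessions, func(i, j int) bool {
		return sessions[i].expires.Before(sessions[j].expires)
	})
	cutoff := now.Add(grace)
	i := 0
	for i < len(sessions) && !sessions[i].expires.After(cutoff) {
		i++
	}
	return sessions[i:]
}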
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
In 4.16, OCP started placing an annotation on service accounts when it creates a dockercfg secret. Some operators/reconciliation loops (incorrectly) then try to set the annotations on the SA back to exactly what they wanted. OCP annotates again and creates a new secret. The operator sets it back without the annotation. Rinse, repeat. Eventually etcd gets completely overloaded with secrets, starts to OOM, and the entire cluster comes down.
It is believed that at least the otel, tempo, ACM, ODF/OCS, Strimzi, and Elasticsearch operators (and possibly others) reconcile the annotations on the SA by setting them back exactly how they want them.
These seem to be related (but not a complete list):
https://issues.redhat.com/browse/LOG-5776
https://issues.redhat.com/browse/ENTMQST-6129
This is a clone of issue OCPBUGS-29240. The following is the description of the original issue:
—
Manila drivers and node-registrar should be configured to use healthchecks.
Description of problem:
Not specifying "kmsKeyServiceAccount" for controlPlane leads to a failure creating the bootstrap and control-plane machines.
Version-Release number of selected component (if applicable):
4.16.0-0.nightly-multi-2024-06-12-211551
How reproducible:
Always
Steps to Reproduce:
1. "create install-config" and then insert disk encryption settings, but not set "kmsKeyServiceAccount" for controlPlane (see [2]) 2. "create cluster" (see [3])
Actual results:
"create cluster" failed with below error: ERROR failed to fetch Cluster: failed to generate asset "Cluster": failed to create cluster: failed to create control-plane manifest: GCPMachine.infrastructure.cluster.x-k8s.io "jiwei-0613d-capi-84z69-bootstrap" is invalid: spec.rootDiskEncryptionKey.kmsKeyServiceAccount: Invalid value: "": spec.rootDiskEncryptionKey.kmsKeyServiceAccount in body should match '[-_[A-Za-z0-9]+@[-_[A-Za-z0-9]+.iam.gserviceaccount.com
Expected results:
Installation should succeed.
Additional info:
FYI the QE test case: OCP-61160 - [IPI-on-GCP] install cluster with different custom managed keys for control-plane and compute nodes https://polarion.engineering.redhat.com/polarion/#/project/OSE/workitem?id=OCP-61160
Description of problem:
When installing a disconnected and/or private cluster, an existing VPC and subnets are used and inserted into the install-config. Unfortunately, the CAPI-based installation fails with the error 'failed to generate asset "Cluster API Manifests": failed to generate GCP manifests: failed to get control plane subnet'.
Version-Release number of selected component (if applicable):
4.17.0-0.nightly-2024-07-07-131215
How reproducible:
Always
Steps to Reproduce:
1. create VPC/subnets/etc. 2. "create install-config", and insert the VPC/subnets settings (see [1]) 3. "create cluster" (or "create manifests")
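A hedged sketch of the kind of install-config fragment inserted in step 2 (names are placeholders and may differ from the actual [1] contents):
platform:
  gcp:
    projectID: my-project
    region: us-central1
    network: jiwei-0708b-vpc
    controlPlaneSubnet: jiwei-0708b-master-subnet
    computeSubnet: jiwei-0708b-worker-subnet
publish: Internal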
Actual results:
Failed with below error, although the subnet does exist (see [2]): 07-08 17:46:44.755 level=fatal msg=failed to fetch Cluster API Manifests: failed to generate asset "Cluster API Manifests": failed to generate GCP manifests: failed to get control plane subnet: failed to find subnet jiwei-0708b-master-subnet: googleapi: Error 400: Invalid resource field value in the request. 07-08 17:46:44.755 level=fatal msg=Details: 07-08 17:46:44.755 level=fatal msg=[ 07-08 17:46:44.755 level=fatal msg= { 07-08 17:46:44.755 level=fatal msg= "@type": "type.googleapis.com/google.rpc.ErrorInfo", 07-08 17:46:44.755 level=fatal msg= "domain": "googleapis.com", 07-08 17:46:44.755 level=fatal msg= "metadatas": { 07-08 17:46:44.755 level=fatal msg= "method": "compute.v1.SubnetworksService.Get", 07-08 17:46:44.755 level=fatal msg= "service": "compute.googleapis.com" 07-08 17:46:44.755 level=fatal msg= }, 07-08 17:46:44.755 level=fatal msg= "reason": "RESOURCE_PROJECT_INVALID" 07-08 17:46:44.755 level=fatal msg= } 07-08 17:46:44.756 level=fatal msg=] 07-08 17:46:44.756 level=fatal msg=, invalidParameter
Expected results:
Installation should succeed.
Additional info:
FYI one of the problem PROW CI tests: https://qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gs/qe-private-deck/logs/periodic-ci-openshift-openshift-tests-private-release-4.17-multi-nightly-gcp-ipi-disc-priv-capi-amd-mixarch-f28-destructive/1810155486213836800 QE's Flexy-install/295772/ VARIABLES_LOCATION private-templates/functionality-testing/aos-4_17/ipi-on-gcp/versioned-installer-private_cluster LAUNCHER_VARS feature_set: "TechPreviewNoUpgrade"
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
Manually approving a client CSR causes an Admission Webhook Warning.
Version-Release number of selected component (if applicable):
4.16.0-0.nightly-2024-05-23-013426
How reproducible:
Always
Steps to Reproduce:
1. setup UPI cluster and manually add more workers, don't approve node CSR 2. Navigate to Compute -> Nodes page, click on 'Discovered' status then click on 'Approve' button in the modal
Actual results:
2. Console displays `Admission Webhook Warning: CertificateSigningRequest xxx violates policy 299 - unknown field metadata.Originalname`
Expected results:
2. no warning message
Additional info:
Description of problem:
When setting up a cluster on vSphere, sometimes a machine is powered off and stuck in the "Provisioning" phase; this triggers a new machine creation, which reports the error "failed to Create machine: The name 'jima-ipi-27-d97wp-worker-7qn9b' already exists"
Version-Release number of selected component (if applicable):
4.12.0-0.ci.test-2022-09-26-235306-ci-ln-vh4qjyk-latest
How reproducible:
Sometimes; observed twice
Steps to Reproduce:
1. Setup a vsphere cluster 2. 3.
Actual results:
Cluster installation failed, machine stuck in Provisioning status. $ oc get machine NAME PHASE TYPE REGION ZONE AGE jima-ipi-27-d97wp-master-0 Running 4h jima-ipi-27-d97wp-master-1 Running 4h jima-ipi-27-d97wp-master-2 Running 4h jima-ipi-27-d97wp-worker-7qn9b Provisioning 3h56m jima-ipi-27-d97wp-worker-dsqd2 Running 3h56m $ oc edit machine jima-ipi-27-d97wp-worker-7qn9b status: conditions: - lastTransitionTime: "2022-09-27T01:27:29Z" status: "True" type: Drainable - lastTransitionTime: "2022-09-27T01:27:29Z" message: Instance has not been created reason: InstanceNotCreated severity: Warning status: "False" type: InstanceExists - lastTransitionTime: "2022-09-27T01:27:29Z" status: "True" type: Terminable lastUpdated: "2022-09-27T01:27:29Z" phase: Provisioning providerStatus: conditions: - lastTransitionTime: "2022-09-27T01:36:09Z" message: The name 'jima-ipi-27-d97wp-worker-7qn9b' already exists. reason: MachineCreationSucceeded status: "False" type: MachineCreation taskRef: task-11363480 $ govc vm.info /SDDC-Datacenter/vm/jima-ipi-27-d97wp/jima-ipi-27-d97wp-worker-7qn9b Name: jima-ipi-27-d97wp-worker-7qn9b Path: /SDDC-Datacenter/vm/jima-ipi-27-d97wp/jima-ipi-27-d97wp-worker-7qn9b UUID: 422cb686-6585-f05a-af13-b2acac3da294 Guest name: Red Hat Enterprise Linux 8 (64-bit) Memory: 16384MB CPU: 8 vCPU(s) Power state: poweredOff Boot time: <nil> IP address: Host: 10.3.32.8 I0927 01:44:42.568599 1 session.go:91] No existing vCenter session found, creating new session I0927 01:44:42.633672 1 session.go:141] Find template by instance uuid: 9535891b-902e-410c-b9bb-e6a57aa6b25a I0927 01:44:42.641691 1 reconciler.go:270] jima-ipi-27-d97wp-worker-7qn9b: already exists, but was not powered on after clone, requeue I0927 01:44:42.641726 1 controller.go:380] jima-ipi-27-d97wp-worker-7qn9b: reconciling machine triggers idempotent create I0927 01:44:42.641732 1 actuator.go:66] jima-ipi-27-d97wp-worker-7qn9b: actuator creating machine I0927 01:44:42.659651 1 reconciler.go:935] task: task-11363480, state: error, description-id: VirtualMachine.clone I0927 01:44:42.659684 1 reconciler.go:951] jima-ipi-27-d97wp-worker-7qn9b: Updating provider status E0927 01:44:42.659696 1 actuator.go:57] jima-ipi-27-d97wp-worker-7qn9b error: jima-ipi-27-d97wp-worker-7qn9b: reconciler failed to Create machine: The name 'jima-ipi-27-d97wp-worker-7qn9b' already exists. I0927 01:44:42.659762 1 machine_scope.go:101] jima-ipi-27-d97wp-worker-7qn9b: patching machine I0927 01:44:42.660100 1 recorder.go:103] events "msg"="jima-ipi-27-d97wp-worker-7qn9b: reconciler failed to Create machine: The name 'jima-ipi-27-d97wp-worker-7qn9b' already exists." "object"={"kind":"Machine","namespace":"openshift-machine-api","name":"jima-ipi-27-d97wp-worker-7qn9b","uid":"9535891b-902e-410c-b9bb-e6a57aa6b25a","apiVersion":"machine.openshift.io/v1beta1","resourceVersion":"17614"} "reason"="FailedCreate" "type"="Warning" W0927 01:44:42.688562 1 controller.go:382] jima-ipi-27-d97wp-worker-7qn9b: failed to create machine: jima-ipi-27-d97wp-worker-7qn9b: reconciler failed to Create machine: The name 'jima-ipi-27-d97wp-worker-7qn9b' already exists. E0927 01:44:42.688651 1 controller.go:326] "msg"="Reconciler error" "error"="jima-ipi-27-d97wp-worker-7qn9b: reconciler failed to Create machine: The name 'jima-ipi-27-d97wp-worker-7qn9b' already exists." 
"controller"="machine-controller" "name"="jima-ipi-27-d97wp-worker-7qn9b" "namespace"="openshift-machine-api" "object"={"name":"jima-ipi-27-d97wp-worker-7qn9b","namespace":"openshift-machine-api"} "reconcileID"="d765f02c-bd54-4e6c-88a4-c578f16c7149" ... I0927 03:18:45.118110 1 actuator.go:66] jima-ipi-27-d97wp-worker-7qn9b: actuator creating machine E0927 03:18:45.131676 1 actuator.go:57] jima-ipi-27-d97wp-worker-7qn9b error: jima-ipi-27-d97wp-worker-7qn9b: reconciler failed to Create machine: ServerFaultCode: The object 'vim.Task:task-11363480' has already been deleted or has not been completely created I0927 03:18:45.131725 1 machine_scope.go:101] jima-ipi-27-d97wp-worker-7qn9b: patching machine I0927 03:18:45.131873 1 recorder.go:103] events "msg"="jima-ipi-27-d97wp-worker-7qn9b: reconciler failed to Create machine: ServerFaultCode: The object 'vim.Task:task-11363480' has already been deleted or has not been completely created" "object"={"kind":"Machine","namespace":"openshift-machine-api","name":"jima-ipi-27-d97wp-worker-7qn9b","uid":"9535891b-902e-410c-b9bb-e6a57aa6b25a","apiVersion":"machine.openshift.io/v1beta1","resourceVersion":"17614"} "reason"="FailedCreate" "type"="Warning" W0927 03:18:45.150393 1 controller.go:382] jima-ipi-27-d97wp-worker-7qn9b: failed to create machine: jima-ipi-27-d97wp-worker-7qn9b: reconciler failed to Create machine: ServerFaultCode: The object 'vim.Task:task-11363480' has already been deleted or has not been completely created E0927 03:18:45.150492 1 controller.go:326] "msg"="Reconciler error" "error"="jima-ipi-27-d97wp-worker-7qn9b: reconciler failed to Create machine: ServerFaultCode: The object 'vim.Task:task-11363480' has already been deleted or has not been completely created" "controller"="machine-controller" "name"="jima-ipi-27-d97wp-worker-7qn9b" "namespace"="openshift-machine-api" "object"={"name":"jima-ipi-27-d97wp-worker-7qn9b","namespace":"openshift-machine-api"} "reconcileID"="5d92bc1d-2f0d-4a0b-bb20-7f2c7a2cb5af" I0927 03:18:45.150543 1 controller.go:187] jima-ipi-27-d97wp-worker-dsqd2: reconciling Machine
Expected results:
Machine is created successfully.
Additional info:
machine-controller log: http://file.rdu.redhat.com/~zhsun/machine-controller.log
Description of problem:
My CSV recently added a v1beta2 API version in addition to the existing v1beta1 version. When I create a v1beta2 CR and view it in the console, I see v1beta1 API fields and not the expected v1beta2 fields.
Version-Release number of selected component (if applicable):
4.15.14 (could affect other versions)
How reproducible:
Install 3.0.0 development version of Cryostat Operator
Steps to Reproduce:
1. operator-sdk run bundle quay.io/ebaron/cryostat-operator-bundle:ocpbugs-34901 2. cat << 'EOF' | oc create -f - apiVersion: operator.cryostat.io/v1beta2 kind: Cryostat metadata: name: cryostat-sample spec: enableCertManager: false EOF 3. Navigate to https://<openshift console>/k8s/ns/openshift-operators/clusterserviceversions/cryostat-operator.v3.0.0-dev/operator.cryostat.io~v1beta2~Cryostat/cryostat-sample 4. Observe v1beta1 properties are rendered including "Minimal Deployment" 5. Attempt to toggle "Minimal Deployment", observe that this fails.
Actual results:
v1beta1 properties are rendered in the details page instead of v1beta2 properties
Expected results:
v1beta2 properties are rendered in the details page
Additional info:
This is a clone of issue OCPBUGS-38842. The following is the description of the original issue:
—
Component Readiness has found a potential regression in the following test:
[sig-cluster-lifecycle] pathological event should not see excessive Back-off restarting failed containers for ns/openshift-image-registry
Probability of significant regression: 98.02%
Sample (being evaluated) Release: 4.17
Start Time: 2024-08-15T00:00:00Z
End Time: 2024-08-22T23:59:59Z
Success Rate: 94.74%
Successes: 180
Failures: 10
Flakes: 0
Base (historical) Release: 4.16
Start Time: 2024-05-31T00:00:00Z
End Time: 2024-06-27T23:59:59Z
Success Rate: 100.00%
Successes: 89
Failures: 0
Flakes: 0
Also hitting 4.17, I've aligned this bug to 4.18 so the backport process is cleaner.
The problem appears to be a permissions error preventing the pods from starting:
2024-08-22T06:14:14.743856620Z ln: failed to create symbolic link '/etc/pki/ca-trust/extracted/pem/directory-hash/ca-certificates.crt': Permission denied
Originating from this code: https://github.com/openshift/cluster-image-registry-operator/blob/master/pkg/resource/podtemplatespec.go#L489
Both 4.17 and 4.18 nightlies bumped rhcos and in there is an upgrade like this:
container-selinux-3-2.231.0-1.rhaos4.16.el9-noarch container-selinux-3-2.231.0-2.rhaos4.17.el9-noarch
With slightly different versions in each stream, but both were on 3-2.231.
Hits other tests too:
operator conditions image-registry
Operator upgrade image-registry
[sig-cluster-lifecycle] Cluster completes upgrade
[sig-arch][Feature:ClusterUpgrade] Cluster should remain functional during upgrade [Disruptive] [Serial]
[sig-arch][Feature:ClusterUpgrade] Cluster should be upgradeable after finishing upgrade [Late][Suite:upgrade]
This is a clone of issue OCPBUGS-38474. The following is the description of the original issue:
—
Description of problem:
AdditionalTrustedCA is not wired correctly, so the configmap is not found by its operator. This feature is meant to be exposed by XCMSTRAT-590, but at the moment it seems to be broken.
Version-Release number of selected component (if applicable):
4.16.5
How reproducible:
Always
Steps to Reproduce:
1. Create a configmap containing a registry and PEM cert, like https://github.com/openshift/openshift-docs/blob/ef75d891786604e78dcc3bcb98ac6f1b3a75dad1/modules/images-configuration-cas.adoc#L17 2. Refer to it in .spec.configuration.image.additionalTrustedCA.name 3. image-registry-config-operator is not able to find the cm and the CO is degraded
Actual results:
CO is degraded
Expected results:
certs are used.
Additional info:
I think we may be missing a copy of the configmap from the cluster namespace to the target namespace. The copy should also be deleted when the source configmap is deleted.
% oc get hc -n ocm-adecorte-2d525fsstsvtbv1h8qss14pkv171qhdd -o jsonpath="{.items[0].spec.configuration.image.additionalTrustedCA}" | jq { "name": "registry-additional-ca-q9f6x5i4" }
% oc get cm -n ocm-adecorte-2d525fsstsvtbv1h8qss14pkv171qhdd registry-additional-ca-q9f6x5i4 NAME DATA AGE registry-additional-ca-q9f6x5i4 1 16m
logs of cluster-image-registry operator
E0814 13:22:32.586416 1 imageregistrycertificates.go:141] ImageRegistryCertificatesController: unable to sync: failed to update object *v1.ConfigMap, Namespace=openshift-image-registry, Name=image-registry-certificates: image-registry-certificates: configmap "registry-additional-ca-q9f6x5i4" not found, requeuing
CO is degraded
% oc get co
NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE
console 4.16.5 True False False 3h58m
csi-snapshot-controller 4.16.5 True False False 4h11m
dns 4.16.5 True False False 3h58m
image-registry 4.16.5 True False True 3h58m ImageRegistryCertificatesControllerDegraded: failed to update object *v1.ConfigMap, Namespace=openshift-image-registry, Name=image-registry-certificates: image-registry-certificates: configmap "registry-additional-ca-q9f6x5i4" not found
ingress 4.16.5 True False False 3h59m
insights 4.16.5 True False False 4h
kube-apiserver 4.16.5 True False False 4h11m
kube-controller-manager 4.16.5 True False False 4h11m
kube-scheduler 4.16.5 True False False 4h11m
kube-storage-version-migrator 4.16.5 True False False 166m
monitoring 4.16.5 True False False 3h55m
This is a clone of issue OCPBUGS-42745. The following is the description of the original issue:
—
The deprecated flowschemas.v1beta3.flowcontrol.apiserver.k8s.io API is used in manifests/09_flowschema.yaml.
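flowcontrol.apiserver.k8s.io/v1beta3 is deprecated upstream in favor of the GA flowcontrol.apiserver.k8s.io/v1 API, so the manifest should be migrated. A minimal hedged sketch of a v1 FlowSchema (names and rules are placeholders, not the actual 09_flowschema.yaml contents):
apiVersion: flowcontrol.apiserver.k8s.io/v1
kind: FlowSchema
metadata:
  name: example-flowschema
spec:
  priorityLevelConfiguration:
    name: workload-low
  matchingPrecedence: 1000
  distinguisherMethod:
    type: ByUser
  rules:
  - subjects:
    - kind: ServiceAccount
      serviceAccount:
        name: example
        namespace: example-ns
    resourceRules:
    - verbs: ["*"]
      apiGroups: ["*"]
      resources: ["*"]
      clusterScope: true
      namespaces: ["*"]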
Description of problem:
Failed to deploy the cluster with the following error: time="2024-06-13T14:01:11Z" level=debug msg="Creating the security group rules"time="2024-06-13T14:01:19Z" level=error msg="failed to fetch Cluster: failed to generate asset \"Cluster\": failed to create cluster: failed during pre-provisioning: failed to create security groups: failed to create the security group rule on group \"cb9a607c-9799-4186-bc22-26f141ce91aa\" for IPv4 tcp on ports 1936-1936: Bad request with: [POST https://10.46.44.159:13696/v2.0/security-group-rules], error message: {\"NeutronError\": {\"type\": \"SecurityGroupRuleParameterConflict\", \"message\": \"Conflicting value ethertype IPv4 for CIDR fd2e:6f44:5dd8:c956::/64\", \"detail\": \"\"}}"time="2024-06-13T14:01:20Z" level=debug msg="OpenShift Installer 4.17.0-0.nightly-2024-06-13-083330"time="2024-06-13T14:01:20Z" level=debug msg="Built from commit 6bc75dfebaca79ecf302263af7d32d50c31f371a"time="2024-06-13T14:01:20Z" level=debug msg="Loading Install Config..."time="2024-06-13T14:01:20Z" level=debug msg=" Loading SSH Key..."time="2024-06-13T14:01:20Z" level=debug msg=" Loading Base Domain..."time="2024-06-13T14:01:20Z" level=debug msg=" Loading Platform..."time="2024-06-13T14:01:20Z" level=debug msg=" Loading Cluster Name..."time="2024-06-13T14:01:20Z" level=debug msg=" Loading Base Domain..."time="2024-06-13T14:01:20Z" level=debug msg=" Loading Platform..."time="2024-06-13T14:01:20Z" level=debug msg=" Loading Pull Secret..."time="2024-06-13T14:01:20Z" level=debug msg=" Loading Platform..."time="2024-06-13T14:01:20Z" level=debug msg="Using Install Config loaded from state file"time="2024-06-13T14:01:20Z" level=debug msg="Loading Agent Config..."time="2024-06-13T14:01:20Z" level=info msg="Waiting up to 40m0s (until 2:41PM UTC) for the cluster at https://api.ostest.shiftstack.com:6443 to initialize..."
Version-Release number of selected component (if applicable):
4.17.0-0.nightly-2024-06-13-083330
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
This is a clone of issue OCPBUGS-41111. The following is the description of the original issue:
—
Description of problem:
v4.17 baselineCapabilitySet is not recognized. # ./oc adm release extract --install-config v4.17-basecap.yaml --included --credentials-requests --from quay.io/openshift-release-dev/ocp-release:4.17.0-rc.1-x86_64 --to /tmp/test error: unrecognized baselineCapabilitySet "v4.17" # cat v4.17-basecap.yaml --- apiVersion: v1 platform: gcp: foo: bar capabilities: baselineCapabilitySet: v4.17
Version-Release number of selected component (if applicable):
4.17.0-0.nightly-2024-09-04-132247
How reproducible:
always
Steps to Reproduce:
1. Run `oc adm release extract --install-config --included` against an install-config file including baselineCapabilitySet: v4.17. 2. 3.
Actual results:
`oc adm release extract` throw unrecognized error
Expected results:
`oc adm release extract` should extract correct manifests
Additional info:
If specifying baselineCapabilitySet: v4.16, it works well.
Description of problem:
revert "force cert rotation every couple days for development" in 4.16 Below is the steps to verify this bug: # oc adm release info --commits registry.ci.openshift.org/ocp/release:4.11.0-0.nightly-2022-06-25-081133|grep -i cluster-kube-apiserver-operator cluster-kube-apiserver-operator https://github.com/openshift/cluster-kube-apiserver-operator 7764681777edfa3126981a0a1d390a6060a840a3 # git log --date local --pretty="%h %an %cd - %s" 776468 |grep -i "#1307" 08973b820 openshift-ci[bot] Thu Jun 23 22:40:08 2022 - Merge pull request #1307 from tkashem/revert-cert-rotation # oc get clusterversions.config.openshift.io NAME VERSION AVAILABLE PROGRESSING SINCE STATUS version 4.11.0-0.nightly-2022-06-25-081133 True False 64m Cluster version is 4.11.0-0.nightly-2022-06-25-081133 $ cat scripts/check_secret_expiry.sh FILE="$1" if [ ! -f "$1" ]; then echo "must provide \$1" && exit 0 fi export IFS=$'\n' for i in `cat "$FILE"` do if `echo "$i" | grep "^#" > /dev/null`; then continue fi NS=`echo $i | cut -d ' ' -f 1` SECRET=`echo $i | cut -d ' ' -f 2` rm -f tls.crt; oc extract secret/$SECRET -n $NS --confirm > /dev/null echo "Check cert dates of $SECRET in project $NS:" openssl x509 -noout --dates -in tls.crt; echo done $ cat certs.txt openshift-kube-controller-manager-operator csr-signer-signer openshift-kube-controller-manager-operator csr-signer openshift-kube-controller-manager kube-controller-manager-client-cert-key openshift-kube-apiserver-operator aggregator-client-signer openshift-kube-apiserver aggregator-client openshift-kube-apiserver external-loadbalancer-serving-certkey openshift-kube-apiserver internal-loadbalancer-serving-certkey openshift-kube-apiserver service-network-serving-certkey openshift-config-managed kube-controller-manager-client-cert-key openshift-config-managed kube-scheduler-client-cert-key openshift-kube-scheduler kube-scheduler-client-cert-key Checking the Certs, they are with one day expiry times, this is as expected. 
# ./check_secret_expiry.sh certs.txt Check cert dates of csr-signer-signer in project openshift-kube-controller-manager-operator: notBefore=Jun 27 04:41:38 2022 GMT notAfter=Jun 28 04:41:38 2022 GMT Check cert dates of csr-signer in project openshift-kube-controller-manager-operator: notBefore=Jun 27 04:52:21 2022 GMT notAfter=Jun 28 04:41:38 2022 GMT Check cert dates of kube-controller-manager-client-cert-key in project openshift-kube-controller-manager: notBefore=Jun 27 04:52:26 2022 GMT notAfter=Jul 27 04:52:27 2022 GMT Check cert dates of aggregator-client-signer in project openshift-kube-apiserver-operator: notBefore=Jun 27 04:41:37 2022 GMT notAfter=Jun 28 04:41:37 2022 GMT Check cert dates of aggregator-client in project openshift-kube-apiserver: notBefore=Jun 27 04:52:26 2022 GMT notAfter=Jun 28 04:41:37 2022 GMT Check cert dates of external-loadbalancer-serving-certkey in project openshift-kube-apiserver: notBefore=Jun 27 04:52:26 2022 GMT notAfter=Jul 27 04:52:27 2022 GMT Check cert dates of internal-loadbalancer-serving-certkey in project openshift-kube-apiserver: notBefore=Jun 27 04:52:49 2022 GMT notAfter=Jul 27 04:52:50 2022 GMT Check cert dates of service-network-serving-certkey in project openshift-kube-apiserver: notBefore=Jun 27 04:52:28 2022 GMT notAfter=Jul 27 04:52:29 2022 GMT Check cert dates of kube-controller-manager-client-cert-key in project openshift-config-managed: notBefore=Jun 27 04:52:26 2022 GMT notAfter=Jul 27 04:52:27 2022 GMT Check cert dates of kube-scheduler-client-cert-key in project openshift-config-managed: notBefore=Jun 27 04:52:47 2022 GMT notAfter=Jul 27 04:52:48 2022 GMT Check cert dates of kube-scheduler-client-cert-key in project openshift-kube-scheduler: notBefore=Jun 27 04:52:47 2022 GMT notAfter=Jul 27 04:52:48 2022 GMT # # cat check_secret_expiry_within.sh #!/usr/bin/env bash # usage: ./check_secret_expiry_within.sh 1day # or 15min, 2days, 2day, 2month, 1year WITHIN=${1:-24hours} echo "Checking validity within $WITHIN ..." oc get secret --insecure-skip-tls-verify -A -o json | jq -r '.items[] | select(.metadata.annotations."auth.openshift.io/certificate-not-after" | . != null and fromdateiso8601<='$( date --date="+$WITHIN" +%s )') | "\(.metadata.annotations."auth.openshift.io/certificate-not-before") \(.metadata.annotations."auth.openshift.io/certificate-not-after") \(.metadata.namespace)\t\(.metadata.name)"' # ./check_secret_expiry_within.sh 1day Checking validity within 1day ... 2022-06-27T04:41:37Z 2022-06-28T04:41:37Z openshift-kube-apiserver-operator aggregator-client-signer 2022-06-27T04:52:26Z 2022-06-28T04:41:37Z openshift-kube-apiserver aggregator-client 2022-06-27T04:52:21Z 2022-06-28T04:41:38Z openshift-kube-controller-manager-operator csr-signer 2022-06-27T04:41:38Z 2022-06-28T04:41:38Z openshift-kube-controller-manager-operator csr-signer-signer
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
If an infra-id (which is uniquely generated by the installer) is reused the installer will fail with: level=info msg=Creating private Hosted Zone level=error msg=failed to fetch Cluster: failed to generate asset "Cluster": failed to create cluster: failed provisioning resources after infrastructure ready: failed to create private hosted zone: error creating private hosted zone: HostedZoneAlreadyExists: A hosted zone has already been created with the specified caller reference. Users should not be reusing installer state in this manner, but we do it purposefully in our ipi-install-install step to mitigate infrastructure provisioning flakes: https://steps.ci.openshift.org/reference/ipi-install-install#line720 We can fix this by ensuring the caller ref is unique on each invocation.
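A minimal sketch of the suggested fix, assuming the aws-sdk-go Route 53 client (the function and surrounding wiring are hypothetical): generate a fresh caller reference on each invocation instead of deriving it from the reusable installer state.
package main

import (
	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/route53"
	"github.com/google/uuid"
)

// createPrivateZone creates a hosted zone with a unique caller reference so
// that re-running with reused installer state cannot hit
// HostedZoneAlreadyExists. (VPC association and other private-zone settings
// are omitted for brevity.)
func createPrivateZone(name string) (*route53.CreateHostedZoneOutput, error) {
	client := route53.New(session.Must(session.NewSession()))
	return client.CreateHostedZone(&route53.CreateHostedZoneInput{
		Name:            aws.String(name),
		CallerReference: aws.String(uuid.NewString()),
	})
}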
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
This is a clone of issue OCPBUGS-36871. The following is the description of the original issue:
—
Description of problem:
Customer has a cluster in AWS that was born on an old OCP version (4.7) and was upgraded all the way through 4.15. During the lifetime of the cluster they changed the DHCP option in AWS to "domain name". During node provisioning while scaling a MachineSet, the Machine is successfully created at the cloud provider but the Node is never added to the cluster. The CSRs remain pending and do not get auto-approved. This issue is eventually related or similar to the bug fixed via https://issues.redhat.com/browse/OCPBUGS-29290
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
CSR don't get auto approved. New nodes have a different domain name when CSR is approved manually.
Expected results:
CSR should get approved automatically and domain name scheme should not change.
Additional info:
Description of problem:
The user.openshift.io and oauth.openshift.io APIs are unavailable in an external OIDC cluster, which causes all common pulls/pushes of blobs from/to the image registry to fail.
Version-Release number of selected component (if applicable):
4.15.15
How reproducible:
always
Steps to Reproduce:
1.Create a ROSA HCP cluster which configured external oidc users 2.Push data to image registry under a project oc new-project wxj1 oc new-build httpd~https://github.com/openshift/httpd-ex.git 3.
Actual results:
$ oc logs -f build/httpd-ex-1 Cloning "https://github.com/openshift/httpd-ex.git" ... Commit: 1edee8f58c0889616304cf34659f074fda33678c (Update httpd.json) Author: Petr Hracek <phracek@redhat.com> Date: Wed Jun 5 13:00:09 2024 +0200time="2024-06-12T09:55:13Z" level=info msg="Not using native diff for overlay, this may cause degraded performance for building images: kernel has CONFIG_OVERLAY_FS_REDIRECT_DIR enabled"I0612 09:55:13.306937 1 defaults.go:112] Defaulting to storage driver "overlay" with options [mountopt=metacopy=on].Caching blobs under "/var/cache/blobs".Trying to pull image-registry.openshift-image-registry.svc:5000/openshift/httpd@sha256:765aa645587f34e310e49db7cdc97e82d34122adb0b604eea891e0f98050aa77...Warning: Pull failed, retrying in 5s ...Trying to pull image-registry.openshift-image-registry.svc:5000/openshift/httpd@sha256:765aa645587f34e310e49db7cdc97e82d34122adb0b604eea891e0f98050aa77...Warning: Pull failed, retrying in 5s ...Trying to pull image-registry.openshift-image-registry.svc:5000/openshift/httpd@sha256:765aa645587f34e310e49db7cdc97e82d34122adb0b604eea891e0f98050aa77...Warning: Pull failed, retrying in 5s ...error: build error: After retrying 2 times, Pull image still failed due to error: unauthorized: unable to validate token: NotFound oc logs -f deploy/image-registry -n openshift-image-registry time="2024-06-12T09:55:13.36003996Z" level=error msg="invalid token: the server could not find the requested resource (get users.user.openshift.io ~)" go.version="go1.20.12 X:strictfipsruntime" http.request.host="image-registry.openshift-image-registry.svc:5000" http.request.id=0c380b81-99d4-4118-8de3-407706e8767c http.request.method=GET http.request.remoteaddr="10.130.0.35:50550" http.request.uri="/openshift/token?account=serviceaccount&scope=repository%3Aopenshift%2Fhttpd%3Apull" http.request.useragent="containers/5.28.0 (github.com/containers/image)"
Expected results:
Pulling/pushing blobs from/to the image registry should work on an external OIDC cluster.
Additional info:
This is a clone of issue OCPBUGS-43428. The following is the description of the original issue:
—
As part of TRT investigations of k8s API disruptions, we have discovered there are times when haproxy considers the underlying apiserver as Down, yet from the k8s perspective the apiserver is healthy and functional.
From the customer perspective, during this time any call to the cluster API endpoint will fail. It simply looks like an outage.
Thorough investigation leads us to the following difference in how haproxy perceives apiserver being alive versus how k8s perceives it, i.e.
inter 1s fall 2 rise 3
and
readinessProbe:
  httpGet:
    scheme: HTTPS
    port: 6443
    path: readyz
  initialDelaySeconds: 0
  periodSeconds: 5
  timeoutSeconds: 10
  successThreshold: 1
  failureThreshold: 3
We can see that the first check, the one belonging to haproxy, is much stricter. As a result, haproxy sees the following
2024-10-08T12:37:32.779247039Z [WARNING] (29) : Server masters/master-2 is DOWN, reason: Layer7 wrong status, code: 500, info: "Internal Server Error", check duration: 5ms. 0 active and 0 backup servers left. 154 sessions active, 0 requeued, 0 remaining in queue.
much faster than k8s would consider something as wrong.
In order to remediate this issue, it has been agreed the haproxy checks should be softened and adjusted to the k8s readiness probe.
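For illustration only (the exact values chosen for the fix may differ, and the server line here is a simplified placeholder), aligning the haproxy check with the readiness probe's 5-second period and 3-failure threshold would look roughly like:
server master-2 192.0.2.10:6443 check check-ssl verify none inter 5s fall 3 rise 3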
This is a clone of issue OCPBUGS-35297. The following is the description of the original issue:
—
Starting about 5/24 or 5/25, we see a massive increase in the number of watch establishments from all clients to the kube-apiserver during non-upgrade jobs. While this could theoretically mean that every single client merged a bug on the same day, the more likely explanation is that the kube update exposed or produced some kind of a bug.
This is a clear regression and it is only present on 4.17, not 4.16. It is present across all platforms, though I've selected AWS for links and screenshots.
slack thread if there are questions
courtesy screen shot
Similar to OCPBUGS-20061, but for a different situation:
$ w3m -dump -cols 200 'https://search.dptools.openshift.org/?maxAge=48h&name=pull-ci-openshift-cluster-etcd-operator-master-e2e-aws-ovn-etcd-scaling&type=junit&search=clusteroperator/control-plane-machine-set+should+not+change+condition/Available' | grep 'failures match' | sort pull-ci-openshift-cluster-etcd-operator-master-e2e-aws-ovn-etcd-scaling (all) - 15 runs, 60% failed, 33% of failures match = 20% impact
In that test, since ETCD-329, the test suite deletes a control-plane Machine and waits for the ControlPlaneMachineSet controller to scale in a replacement. But in runs like this, the outgoing Node goes Ready=Unknown for not-yet-diagnosed reasons, and that somehow misses cpmso#294's inertia (maybe the running guard should be dropped?), and the ClusterOperator goes Available=False complaining about Missing 1 available replica(s).
It's not clear from the message which replica it's worried about (that would be helpful information to include in the message), but I suspect it's the Machine/Node that's in the deletion process. But regardless of the message, this does not seem like a situation worth a cluster-admin-midnight-page Available=False alarm.
Seen in dev-branch CI. I haven't gone back to check older 4.y.
CI Search shows 20% impact, see my earlier query in this message.
Run a bunch of pull-ci-openshift-cluster-etcd-operator-master-e2e-aws-ovn-etcd-scaling and check CI Search results.
20% impact
No hits.
As a dev I'd like to run the HO locally smoothly.
Usually, when you run it locally, you scale down the in-cluster deployment so the two do not interfere.
However, the HO code expects a pod to always be running.
We should improve the dev UX and remove that hard dependency.
Description of problem:
Affects only developers with a local build.
Version-Release number of selected component (if applicable):
4.15
How reproducible:
Always
Steps to Reproduce:
Build and run the console locally.
Actual results:
The user toggle menu isn't shown, so developers cannot access the user preference, such as the language or theme.
Expected results:
The user toggle should be there.
Please review the following PR: https://github.com/openshift/images/pull/183
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
This is a clone of issue OCPBUGS-37988. The following is the description of the original issue:
—
Description of problem:
In the Administrator view under Cluster Settings -> Update Status Pane, the text for the versions is black instead of white when Dark mode is selected on Firefox (128.0.3 Mac). Also happens if you choose System default theme and the system is set to Dark mode.
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. Open /settings/cluster using Firefox with Dark mode selected 2. 3.
Actual results:
The version numbers under Update status are black
Expected results:
The version numbers under Update status are white
Additional info:
Description of problem:
Even though fakefish is not a supported redfish interface, it is very useful to have it working for "special" scenarios, like NC-SI, while its support is implemented. On OCP 4.14 and later, converged flow is enabled by default, and on this configuration Ironic sends a soft power_off command to the ironic agent running on the ramdisk. Since this power operation is not going through the redfish interface, it is not processed by fakefish, preventing it from working on some NC-SI configurations, where a full power-off would mean the BMC loses power. Ironic already supports using out-of-band power off for the agent [1], so having an option to use it would be very helpful. [1]- https://opendev.org/openstack/ironic/commit/824ad1676bd8032fb4a4eb8ffc7625a376a64371
Version-Release number of selected component (if applicable):
Seen with OCP 4.14.26 and 4.14.33, expected to happen on later versions
How reproducible:
Always
Steps to Reproduce:
1. Deploy SNO node using ACM and fakefish as redfish interface 2. Check metal3-ironic pod logs
Actual results:
We can see a soft power_off command sent to the ironic agent running on the ramdisk: 2024-08-07 15:00:45.545 1 DEBUG ironic.drivers.modules.agent_client [None req-74c0c3ed-011f-4718-bdce-53f2ba412e85 - - - - - -] Executing agent command standby.power_off for node df006e90-02ee-4847-b532-be4838e844e6 with params {'wait': 'false', 'agent_token': '***'} _command /usr/lib/python3.9/site-packages/ironic/drivers/modules/agent_client.py:197 2024-08-07 15:00:45.551 1 DEBUG ironic.drivers.modules.agent_client [None req-74c0c3ed-011f-4718-bdce-53f2ba412e85 - - - - - -] Agent command standby.power_off for node df006e90-02ee-4847-b532-be4838e844e6 returned result None, error None, HTTP status code 200 _command /usr/lib/python3.9/site-packages/ironic/drivers/modules/agent_client.py:234
Expected results:
There is an option to prevent this soft power_off command, so all power actions happen via redfish. This would allow fakefish to capture them and behave as needed.
Additional info:
Description of problem: Terraform should no longer be available for vSphere.
This is a clone of issue OCPBUGS-38571. The following is the description of the original issue:
—
Description of problem:
Cluster's global address "<infra id>-apiserver" not deleted during "destroy cluster"
Version-Release number of selected component (if applicable):
4.18.0-0.nightly-multi-2024-08-15-212448
How reproducible:
Always
Steps to Reproduce:
1. "create install-config", then optionally insert interested settings (see [1]) 2. "create cluster", and make sure the cluster turns healthy finally (see [2]) 3. check the cluster's addresses on GCP (see [3]) 4. "destroy cluster", and make sure everything of the cluster getting deleted (see [4])
Actual results:
The global address "<infra id>-apiserver" is not deleted during "destroy cluster".
Expected results:
Everything belonging to the cluster should get deleted during "destroy cluster".
Additional info:
FYI we had a 4.16 bug once, see https://issues.redhat.com/browse/OCPBUGS-32306
Description of the problem:
After ACM installation, namespace `local-cluster` contains AgentClusterInstall and ClusterDeployment but not InfraEnv:
[kni@provisionhost-0-0 ~]$ oc project local-cluster Now using project "local-cluster" on server "https://api.ocp-edge-cluster-0.qe.lab.redhat.com:6443". [kni@provisionhost-0-0 ~]$ oc get aci NAME CLUSTER STATE local-cluster local-cluster adding-hosts [kni@provisionhost-0-0 ~]$ oc get clusterdeployment NAME INFRAID PLATFORM REGION VERSION CLUSTERTYPE PROVISIONSTATUS POWERSTATE AGE local-cluster agent-baremetal 4.16.0-0.nightly-2024-06-03-060250 Provisioned Running 61m [kni@provisionhost-0-0 ~]$ oc get infraenv No resources found in local-cluster namespace. [kni@provisionhost-0-0 ~]$
How reproducible:
100%
Steps to reproduce:
1. Deploy OCP 4.16
2. Deploy ACM build 2.11.0-DOWNSTREAM-2024-06-03-20-28-43
3. Execute `oc get infraenv -n local-cluster`
Actual results:
No infraenvs displayed.
Expected results:
Local-cluster infraenv displayed.
Description of problem:
The instructions for running a local console with authentication documented in https://github.com/openshift/console?tab=readme-ov-file#openshift-with-authentication appear to no longer work.
Steps to Reproduce:
1. Follow the steps in https://github.com/openshift/console?tab=readme-ov-file#openshift-with-authentication
Actual results:
console on master [$?] via 🐳 desktop-linux via 🐹 v1.19.5 via v18.20.2 via 💎 v3.1.3 on ☁️ openshift-dev (us-east-1) ❯ ./examples/run-bridge.sh ++ oc whoami --show-token ++ oc whoami --show-server ++ oc -n openshift-config-managed get configmap monitoring-shared-config -o 'jsonpath={.data.alertmanagerPublicURL}' ++ oc -n openshift-config-managed get configmap monitoring-shared-config -o 'jsonpath={.data.thanosPublicURL}' + ./bin/bridge --base-address=http://localhost:9000 --ca-file=examples/ca.crt --k8s-auth=bearer-token --k8s-auth-bearer-token=sha256~EEIDh9LGIzlQ83udktABnEEIse3bintNzKNBJwQfvNI --k8s-mode=off-cluster --k8s-mode-off-cluster-endpoint=https://api.rhamilto.devcluster.openshift.com:6443 --k8s-mode-off-cluster-skip-verify-tls=true --listen=http://127.0.0.1:9000 --public-dir=./frontend/public/dist --user-auth=openshift --user-auth-oidc-client-id=console-oauth-client --user-auth-oidc-client-secret-file=examples/console-client-secret --user-auth-oidc-ca-file=examples/ca.crt --k8s-mode-off-cluster-alertmanager=https://alertmanager-main-openshift-monitoring.apps.rhamilto.devcluster.openshift.com --k8s-mode-off-cluster-thanos=https://thanos-querier-openshift-monitoring.apps.rhamilto.devcluster.openshift.com W0515 11:18:55.835781 57122 authoptions.go:103] Flag inactivity-timeout is set to less then 300 seconds and will be ignored! F0515 11:18:55.836030 57122 authoptions.go:299] Error initializing authenticator: file examples/ca.crt contained no CA data
Expected results:
Local console with authentication should work.
Seen in a 4.17 nightly-to-nightly CI update:
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.17-e2e-aws-ovn-upgrade/1809154554084724736/artifacts/e2e-aws-ovn-upgrade/gather-extra/artifacts/events.json | jq -r '.items[] | select(.metadata.namespace == "openshift-machine-config-operator") | .reason' | sort | uniq -c | sort -n | tail -n3 82 Pulled 82 Started 2116 ValidatingAdmissionPolicyUpdated $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.17-e2e-aws-ovn-upgrade/1809154554084724736/artifacts/e2e-aws-ovn-upgrade/gather-extra/artifacts/events.json | jq -r '.items[] | select(.metadata.namespace == "openshift-machine-config-operator" and .reason == "ValidatingAdmissionPolicyUpdated").message' | sort | uniq -c 705 Updated ValidatingAdmissionPolicy.admissionregistration.k8s.io/machine-configuration-guards because it changed 705 Updated ValidatingAdmissionPolicy.admissionregistration.k8s.io/managed-bootimages-platform-check because it changed 706 Updated ValidatingAdmissionPolicy.admissionregistration.k8s.io/mcn-guards because it changed
I'm not sure what those are about (which may be a bug on its own? Would be nice to know what changed), but it smells like a hot loop to me.
Seen in 4.17. Not clear yet how to audit for exposure frequency or versions, short of teaching the origin test suite to fail if it sees too many of these kinds of events. Maybe an openshift-... namespaces version of the current "events should not repeat pathologically in e2e namespaces" test-case? We may already have one, but it's not tripping.
Besides the initial update, also seen in this 4.17.0-0.nightly-2024-07-05-091056 serial run:
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.17-e2e-aws-ovn-serial/1809154615350923264/artifacts/e2e-aws-ovn-serial/gather-extra/artifacts/events.json | jq -r '.items[] | select(.metadata.namespace == "openshift-machine-config-operator" and .reason == "ValidatingAdmissionPolicyUpdated").message' | sort | uniq -c 1006 Updated ValidatingAdmissionPolicy.admissionregistration.k8s.io/machine-configuration-guards because it changed 1006 Updated ValidatingAdmissionPolicy.admissionregistration.k8s.io/managed-bootimages-platform-check because it changed 1007 Updated ValidatingAdmissionPolicy.admissionregistration.k8s.io/mcn-guards because it changed
So possibly every time, in all 4.17 clusters?
1. Unclear. Possibly just install 4.17.
2. Run oc -n openshift-machine-config-operator get -o json events | jq -r '.items[] | select(.reason == "ValidatingAdmissionPolicyUpdated")'.
Thousands of hits.
Zero to few hits.
This is a clone of issue OCPBUGS-42115. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-41184. The following is the description of the original issue:
—
Description of problem:
The disk and instance types for gcp machines should be validated further. The current implementation provides validation for each individually, but the disk types and instance types should be checked against each other for valid combinations. The attached spreadsheet displays the combinations of valid disk and instance types.
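A minimal sketch of the kind of cross-validation intended, with a hypothetical (and deliberately incomplete) compatibility table; the real valid combinations come from the attached spreadsheet, not from this example.
package validation

import (
	"fmt"
	"strings"
)

// supportedDiskTypes maps an instance-type family prefix to the disk types it
// can use. The entries below are illustrative placeholders only.
var supportedDiskTypes = map[string][]string{
	"n2": {"pd-standard", "pd-balanced", "pd-ssd"},
	"n4": {"hyperdisk-balanced"},
}

// validateDiskInstanceCombo checks that the requested disk type is allowed for
// the machine type's family (e.g. "n2" from "n2-standard-4").
func validateDiskInstanceCombo(machineType, diskType string) error {
	family := strings.SplitN(machineType, "-", 2)[0]
	allowed, ok := supportedDiskTypes[family]
	if !ok {
		return fmt.Errorf("unknown machine type family %q", family)
	}
	for _, d := range allowed {
		if d == diskType {
			return nil
		}
	}
	return fmt.Errorf("disk type %q is not supported by machine type %q", diskType, machineType)
}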
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of the problem:
In assisted-service, we check whether a hostname contains 64 or fewer characters:
https://github.com/openshift/assisted-service/blob/master/internal/host/hostutil/host_utils.go#L28
But the kubelet fails to register a node if its hostname has 64 characters:
https://access.redhat.com/solutions/7068042
This issue was faced by a customer in case https://access.redhat.com/support/cases/#/case/03876918
How reproducible:
Always
Steps to reproduce:
1. Have a node with a hostname made of 64 characters
2. Install the cluster and notice that the node is marked as "Pending user action"
3. Look at the kubelet log on the node, and notice the following error:
Jul 22 12:53:39 synergywatson-control-plane-1.private.openshiftvcn.oraclevcn.com kubenswrapper[4193]: E0722 12:53:39.167645 4193 kubelet_node_status.go:94] "Unable to register node with API server" err="Node \"synergywatson-control-plane-1.private.openshiftvcn.oraclevcn.com\" is invalid: metadata.labels: Invalid value: \"synergywatson-control-plane-1.private.openshiftvcn.oraclevcn.com\": must be no more than 63 characters" node="synergywatson-control-plane-1.private.openshiftvcn.oraclevcn.com"
Actual results:
Cluster installation fails
Expected results:
The assisted installer should not allow installation of a cluster when a hostname is 64 characters or longer (the kubelet limit is 63 characters).
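A minimal sketch of the tightened check, with a hypothetical function name (not the actual host_utils.go code): reject hostnames above the 63-character limit that the kubelet enforces.
package hostutil

import "fmt"

// maxHostnameLen mirrors the kubelet's node-name limit: registration fails
// with "must be no more than 63 characters" beyond this length.
const maxHostnameLen = 63

func validateHostnameLength(hostname string) error {
	if len(hostname) > maxHostnameLen {
		return fmt.Errorf("hostname %q is %d characters long; it must be no more than %d characters", hostname, len(hostname), maxHostnameLen)
	}
	return nil
}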
/cc Alona Kaplan
Please review the following PR: https://github.com/openshift/azure-disk-csi-driver/pull/79
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
If a Deployment is created with BuildConfig as the build option and is then edited, the build option is shown as Shipwright, and an error appears on Save.
Version-Release number of selected component (if applicable):
4.16
How reproducible:
Always
Steps to Reproduce:
1. Install Shipwright 2. Create a deployment using BuildConfig as build option 3. Edit it and see build option dropdown 4. Click on Save
Actual results:
Build option is set as Shipwright
Expected results:
Build option should be BuildConfig
Additional info:
Description of problem:
When using the registry-overrides flag to override registries for control plane components, it seems like the current implementation propagates the override to some data plane components. Certain components such as multus, dns, and ingress get the values for their containers' images from env vars set in operators on the control plane (cno/dns operator/konnectivity), and hence also get the overridden registry propagated to them.
Version-Release number of selected component (if applicable):
How reproducible:
100%
Steps to Reproduce:
1.Input a registry override through the HyperShift Operator 2.Check registry fields for components on data plane 3.
Actual results:
Data plane components that get registry values from env vars set in dns-operator, ingress-operator, cluster-network-operator, and cluster-node-tuning-operator get overridden registries.
Expected results:
Overridden registries should not get propagated to the data plane.
Additional info:
Description of problem:
The capitalization of "Import from Git" is inconsistent between the serverless and git repository buttons in the Add page
Version-Release number of selected component (if applicable):
4.17.0
How reproducible:
Always
Steps to Reproduce:
1. Install serverless operator 2. In the developer perspective click +Add 3. Observe "Add from git" is inconsistently capitalized
Actual results:
The capitalization is inconsistent between the two buttons.
Expected results:
The capitalization is consistent.
Additional info:
This is a clone of issue OCPBUGS-38281. The following is the description of the original issue:
—
Description of problem:
We ignore errors from the existence check in https://github.com/openshift/baremetal-runtimecfg/blob/723290ec4b31bc4e032ff62198ae3dd0d0e36313/pkg/monitor/iptables.go#L116 and that can make it more difficult to debug errors in the healthchecks. In particular, this made it more difficult to debug an issue with permissions on the monitor container because there were no log messages to let us know the check had failed.
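A minimal sketch of the intended change, assuming a go-iptables style client and hypothetical names rather than the actual code at the linked line: return the error from the existence check instead of discarding it, so permission problems on the monitor container show up in the logs.
package monitor

import (
	"fmt"

	"github.com/coreos/go-iptables/iptables"
)

// ensureRule appends a rule only if it is not already present, and surfaces
// any error from the existence check rather than silently ignoring it.
func ensureRule(ipt *iptables.IPTables, table, chain string, rule ...string) error {
	exists, err := ipt.Exists(table, chain, rule...)
	if err != nil {
		return fmt.Errorf("failed to check for iptables rule in chain %s: %w", chain, err)
	}
	if exists {
		return nil
	}
	return ipt.Append(table, chain, rule...)
}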
Version-Release number of selected component (if applicable):
4.17
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Change the Installer interface to always accept (and provide) a context for clean cancellation.
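A minimal sketch of what that interface change could look like, with hypothetical method names (the real Installer interface may differ): every method takes a context.Context so callers can cancel cleanly.
package installer

import "context"

// Installer is a simplified illustration of an interface where every
// long-running operation accepts a context for cancellation.
type Installer interface {
	Prepare(ctx context.Context) error
	Install(ctx context.Context) error
	Destroy(ctx context.Context) error
}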
Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
at:
github.com/openshift/cluster-openshift-controller-manager-operator/pkg/operator/internalimageregistry/cleanup_controller.go:146 +0xd65
This is a clone of issue OCPBUGS-39232. The following is the description of the original issue:
—
Description of problem:
The smoke test for OLM run by the OpenShift e2e suite is specifying an unavailable operator for installation, causing it to fail.
Version-Release number of selected component (if applicable):
How reproducible:
Always (when using 4.17+ catalog versions)
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
This is a clone of issue OCPBUGS-32812. The following is the description of the original issue:
—
Description of problem:
When the image from a build is rolling out on the nodes, the update progress on the node is not displaying correctly.
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. Enable OCL functionality 2. Opt the pool in by MachineOSConfig 3. Wait for the image to build and roll out 4. Track mcp update status by oc get mcp
Actual results:
The MCP starts with 0 ready machines. Even when 1-2 nodes have already been updated, the ready count remains 0; it only jumps to 3 once all nodes are ready.
Expected results:
The update progress should be reflected in the mcp status correctly.
Additional info:
Description of problem:
In https://issues.redhat.com//browse/STOR-1453: TLSSecurityProfile feature, storage clustercsidriver.spec.observedConfig will get the value from APIServer.spec.tlsSecurityProfile to set cipherSuites and minTLSVersion in all corresponding csi driver, but it doesn't work well in hypershift cluster when only setting different value in the hostedclusters.spec.configuration.apiServer.tlsSecurityProfile in management cluster, the APIServer.spec in hosted cluster is not synced and CSI driver doesn't get the updated value as well.
Version-Release number of selected component (if applicable):
Pre-merge test with openshift/csi-operator#69,openshift/csi-operator#71
How reproducible:
Always
Steps to Reproduce:
1. Have a hypershift cluster; the clustercsidriver gets the default value, e.g. "minTLSVersion": "VersionTLS12"
$ oc get clustercsidriver ebs.csi.aws.com -ojson | jq .spec.observedConfig.targetcsiconfig.servingInfo
{
  "cipherSuites": [
    "TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256",
    "TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256",
    "TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384",
    "TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384",
    "TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305_SHA256",
    "TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305_SHA256"
  ],
  "minTLSVersion": "VersionTLS12"
}
2. Set the tlsSecurityProfile in hostedclusters.spec.configuration.apiServer in the management cluster, e.g. "minTLSVersion": "VersionTLS11":
$ oc -n clusters get hostedclusters hypershift-ci-14206 -o json | jq .spec.configuration
{
  "apiServer": {
    "audit": {
      "profile": "Default"
    },
    "tlsSecurityProfile": {
      "custom": {
        "ciphers": [
          "ECDHE-ECDSA-CHACHA20-POLY1305",
          "ECDHE-RSA-CHACHA20-POLY1305",
          "ECDHE-RSA-AES128-GCM-SHA256",
          "ECDHE-ECDSA-AES128-GCM-SHA256"
        ],
        "minTLSVersion": "VersionTLS11"
      },
      "type": "Custom"
    }
  }
}
3. This is not passed to the apiserver in the hosted cluster:
$ oc get apiserver cluster -ojson | jq .spec
{
  "audit": {
    "profile": "Default"
  }
}
4. The CSI driver still uses the default value, which differs from mgmtcluster.hostedclusters.spec.configuration.apiServer:
$ oc get clustercsidriver ebs.csi.aws.com -ojson | jq .spec.observedConfig.targetcsiconfig.servingInfo
{
  "cipherSuites": [
    "TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256",
    "TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256",
    "TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384",
    "TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384",
    "TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305_SHA256",
    "TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305_SHA256"
  ],
  "minTLSVersion": "VersionTLS12"
}
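As a hedged alternative sketch of step 2 (the profile type shown is illustrative, not the custom profile from this report), the setting can also be applied with a merge patch against the HostedCluster in the management cluster:
oc -n clusters patch hostedcluster hypershift-ci-14206 --type merge \
  -p '{"spec":{"configuration":{"apiServer":{"tlsSecurityProfile":{"type":"Old"}}}}}'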
Actual results:
The tlsSecurityProfile doesn't get synced
Expected results:
The tlsSecurityProfile should get synced
Additional info:
Please review the following PR: https://github.com/openshift/router/pull/602
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/cluster-csi-snapshot-controller-operator/pull/210
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
This is a clone of issue OCPBUGS-38657. The following is the description of the original issue:
—
https://github.com/kubernetes-sigs/cluster-api-provider-vsphere/issues/2781
https://kubernetes.slack.com/archives/CKFGK3SSD/p1704729665056699
https://github.com/okd-project/okd/discussions/1993#discussioncomment-10385535
Description of problem:
INFO Waiting up to 15m0s (until 2:23PM UTC) for machines [vsphere-ipi-b8gwp-bootstrap vsphere-ipi-b8gwp-master-0 vsphere-ipi-b8gwp-master-1 vsphere-ipi-b8gwp-master-2] to provision... E0819 14:17:33.676051 2162 session.go:265] "Failed to keep alive govmomi client, Clearing the session now" err="Post \"https://vctest.ars.de/sdk\": context canceled" server="vctest.ars.de" datacenter="" username="administrator@vsphere.local" E0819 14:17:33.708233 2162 session.go:295] "Failed to keep alive REST client" err="Post \"https://vctest.ars.de/rest/com/vmware/cis/session?~action=get\": context canceled" server="vctest.ars.de" datacenter="" username="administrator@vsphere.local" I0819 14:17:33.708279 2162 session.go:298] "REST client session expired, clearing session" server="vctest.ars.de" datacenter="" username="administrator@vsphere.local"
Openshift Dedicated is in the process of developing an offering of GCP clusters that uses only short-lived credentials from the end user. For these clusters to be deployed, the pod running the Openshift Installer needs to function with GCP credentials that fit the short-lived credential formats. This worked in prior Installer versions, such as 4.14, but was not an explicit requirement.
Refactor name to Dockerfile.ocp as a better, version independent alternative
This is a clone of issue OCPBUGS-29528. The following is the description of the original issue:
—
Description of problem:
Camel K provides a list of Kamelets that can act as an event source or sink for a Knative eventing message broker. Usually the Kamelets installed with the Camel K operator are displayed in the Developer Catalog list of available event sources with the provider "Apache Software Foundation" or "Red Hat Integration". When a user adds a custom Kamelet custom resource to the user namespace, the list of default Kamelets coming from the Camel K operator is gone. The Developer Catalog event source list then only displays the custom Kamelet but not the default ones.
Version-Release number of selected component (if applicable):
How reproducible:
Apply a custom Kamelet custom resource to the user namespace and open the list of available event sources in Dev Console Developer Catalog.
Steps to Reproduce:
1. Install the global Camel K operator in the operator namespace (e.g. openshift-operators)
2. List all available event sources in the "default" user namespace and see all Kamelets listed as event sources/sinks
3. Add a custom Kamelet custom resource to the default namespace (see the example below)
4. See the list of available event sources only listing the custom Kamelet; the default Kamelets are gone from that list
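A rough example of the custom Kamelet used in step 3. This is a minimal sketch; the name, property, and exact schema details are illustrative and not taken from the original report:
apiVersion: camel.apache.org/v1
kind: Kamelet
metadata:
  name: custom-timer-source
  namespace: default
  labels:
    camel.apache.org/kamelet.type: source
spec:
  definition:
    title: Custom Timer Source
    type: object
    properties:
      message:
        title: Message
        type: string
        default: hello
  template:
    from:
      uri: timer:tick
      steps:
        - setBody:
            constant: "{{message}}"
        - to: kamelet:sink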
Actual results:
Default Kamelets that act as event source/sink are only displayed in the Developer Catalog when there is no custom Kamelet added to a namespace.
Expected results:
Default Kamelets coming with the Camel K operator (installed in the operator namespace) should always be part of the Developer Catalog list of available event sources/sinks. When the user adds more custom Kamelets these should be listed, too.
Additional info:
Reproduced with Camel K operator 2.2 and OCP 4.14.8
screenshots: https://drive.google.com/drive/folders/1mTpr1IrASMT76mWjnOGuexFr9-mP0y3i?usp=drive_link
Description of the problem:
Recently, image registries listed in the assisted mirror config map trigger a validation error if they are not listed in the spoke pull secret (and the spoke fails to deploy). The error is:
message: 'The Spec could not be synced due to an input error: pull secret for new cluster is invalid: pull secret must contain auth for "registry.ci.openshift.org"'
Mirror registries should be automatically excluded per this doc, and have been up until this point. This issue just started happening, so it appears some change has caused this.
This will impact:
Versions:
So far occurs with 2.11.0-DOWNSTREAM-2024-05-17-03-42-35 but still checking for other y/z stream
How reproducible:
100%
Steps to reproduce:
1. Create a mirror config map such as in this doc and ensure registry.ci.openshift.org is an entry
2. Create an imagecontentsource policy containing
registry.ci.openshift.org/ocp/release:4.15.0-0.ci-2024-05-21-131652
3. Deploy a spoke cluster using CRDs with the above 4.15 ci image - ensure the spoke pull secret does NOT contain an entry for registry.ci.openshift.org
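For step 2, a hedged example of such an ImageContentSourcePolicy (the mirror hostname is a placeholder):
apiVersion: operator.openshift.io/v1alpha1
kind: ImageContentSourcePolicy
metadata:
  name: ci-release-mirror
spec:
  repositoryDigestMirrors:
    - source: registry.ci.openshift.org/ocp/release
      mirrors:
        - mirror.example.com:5000/ocp/release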
Actual results:
Spoke will fail deployment - aci will show
message: 'The Spec could not be synced due to an input error: pull secret for new cluster is invalid: pull secret must contain auth for "registry.ci.openshift.org"'
Expected results:
Spoke deployment proceeds and deploys successfully
Workaround:
Description of problem:
We need to bump the Kubernetes version to the latest API version OCP is using. This is what was done last time: https://github.com/openshift/cluster-samples-operator/pull/409 Find the latest stable version here: https://github.com/kubernetes/api This is described in the wiki: https://source.redhat.com/groups/public/appservices/wiki/cluster_samples_operator_release_activities
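A minimal sketch of the bump, assuming the usual Go module workflow; the version shown is a placeholder, use the Kubernetes level the target OCP release tracks:
go get k8s.io/api@v0.30.3 k8s.io/apimachinery@v0.30.3 k8s.io/client-go@v0.30.3
go mod tidy
go mod vendor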
Version-Release number of selected component (if applicable):
How reproducible:
Not really a bug, but we're using OCPBUGS so that automation can manage the PR lifecycle (SO project is no longer kept up-to-date with release versions, etc.).
Please review the following PR: https://github.com/openshift/cluster-image-registry-operator/pull/1040
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
The customer defines a proxy in its HostedCluster resource definition. The variables are propagated to some pods but not to the oauth one:
oc describe pod kube-apiserver-5f5dbf78dc-8gfgs | grep PROX
HTTP_PROXY: http://ocpproxy.corp.example.com:8080
HTTPS_PROXY: http://ocpproxy.corp.example.com:8080
NO_PROXY: .....
oc describe pod oauth-openshift-6d7b7c79f8-2cf99| grep PROX
HTTP_PROXY: socks5://127.0.0.1:8090
HTTPS_PROXY: socks5://127.0.0.1:8090
ALL_PROXY: socks5://127.0.0.1:8090
NO_PROXY: kube-apiserver
apiVersion: hypershift.openshift.io/v1beta1
kind: HostedCluster
...
spec:
autoscaling: {}
clusterID: 9c8db607-b291-4a72-acc7-435ec23a72ea
configuration:
.....
proxy:
httpProxy: http://ocpproxy.corp.example.com:8080
httpsProxy: http://ocpproxy.corp.example.com:8080
Version-Release number of selected component (if applicable): 4.14
Looks like ovn-kubernetes/pull/2233 merged and broke techpreview jobs. There is a PR up to correct that but payloads are currently blocked.
Description of problem:
When setting ENV OPENSHIFT_INSTALL_PRESERVE_BOOTSTRAP to keep bootstrap, and launch capi-based installation, installer exit with error when collecting applied cluster api manifests..., since local cluster api was already stopped. 06-20 15:26:51.216 level=debug msg=Machine jima417aws-gjrzd-bootstrap is ready. Phase: Provisioned 06-20 15:26:51.216 level=debug msg=Checking that machine jima417aws-gjrzd-master-0 has provisioned... 06-20 15:26:51.217 level=debug msg=Machine jima417aws-gjrzd-master-0 has status: Provisioned 06-20 15:26:51.217 level=debug msg=Checking that IP addresses are populated in the status of machine jima417aws-gjrzd-master-0... 06-20 15:26:51.217 level=debug msg=Checked IP InternalDNS: ip-10-0-50-47.us-east-2.compute.internal 06-20 15:26:51.217 level=debug msg=Found internal IP address: 10.0.50.47 06-20 15:26:51.217 level=debug msg=Machine jima417aws-gjrzd-master-0 is ready. Phase: Provisioned 06-20 15:26:51.217 level=debug msg=Checking that machine jima417aws-gjrzd-master-1 has provisioned... 06-20 15:26:51.217 level=debug msg=Machine jima417aws-gjrzd-master-1 has status: Provisioned 06-20 15:26:51.217 level=debug msg=Checking that IP addresses are populated in the status of machine jima417aws-gjrzd-master-1... 06-20 15:26:51.218 level=debug msg=Checked IP InternalDNS: ip-10-0-75-199.us-east-2.compute.internal 06-20 15:26:51.218 level=debug msg=Found internal IP address: 10.0.75.199 06-20 15:26:51.218 level=debug msg=Machine jima417aws-gjrzd-master-1 is ready. Phase: Provisioned 06-20 15:26:51.218 level=debug msg=Checking that machine jima417aws-gjrzd-master-2 has provisioned... 06-20 15:26:51.218 level=debug msg=Machine jima417aws-gjrzd-master-2 has status: Provisioned 06-20 15:26:51.218 level=debug msg=Checking that IP addresses are populated in the status of machine jima417aws-gjrzd-master-2... 06-20 15:26:51.218 level=debug msg=Checked IP InternalDNS: ip-10-0-60-118.us-east-2.compute.internal 06-20 15:26:51.218 level=debug msg=Found internal IP address: 10.0.60.118 06-20 15:26:51.218 level=debug msg=Machine jima417aws-gjrzd-master-2 is ready. Phase: Provisioned 06-20 15:26:51.218 level=info msg=Control-plane machines are ready 06-20 15:26:51.218 level=info msg=Cluster API resources have been created. Waiting for cluster to become ready... 06-20 15:26:51.219 level=warning msg=OPENSHIFT_INSTALL_PRESERVE_BOOTSTRAP is set, shutting down local control plane. 06-20 15:26:51.219 level=info msg=Shutting down local Cluster API control plane... 06-20 15:26:51.473 level=info msg=Stopped controller: Cluster API 06-20 15:26:51.473 level=info msg=Stopped controller: aws infrastructure provider 06-20 15:26:52.830 level=info msg=Local Cluster API system has completed operations 06-20 15:26:52.830 level=debug msg=Collecting applied cluster api manifests... 
06-20 15:26:52.831 level=error msg=failed to fetch Cluster: failed to generate asset "Cluster": failed to create cluster: [failed to get manifest openshift-cluster-api-guests: Get "https://127.0.0.1:46555/api/v1/namespaces/openshift-cluster-api-guests": dial tcp 127.0.0.1:46555: connect: connection refused, failed to get manifest default: Get "https://127.0.0.1:46555/apis/infrastructure.cluster.x-k8s.io/v1beta2/awsclustercontrolleridentities/default": dial tcp 127.0.0.1:46555: connect: connection refused, failed to get manifest jima417aws-gjrzd: Get "https://127.0.0.1:46555/apis/cluster.x-k8s.io/v1beta1/namespaces/openshift-cluster-api-guests/clusters/jima417aws-gjrzd": dial tcp 127.0.0.1:46555: connect: connection refused, failed to get manifest jima417aws-gjrzd: Get "https://127.0.0.1:46555/apis/infrastructure.cluster.x-k8s.io/v1beta2/namespaces/openshift-cluster-api-guests/awsclusters/jima417aws-gjrzd": dial tcp 127.0.0.1:46555: connect: connection refused, failed to get manifest jima417aws-gjrzd-bootstrap: Get "https://127.0.0.1:46555/apis/infrastructure.cluster.x-k8s.io/v1beta2/namespaces/openshift-cluster-api-guests/awsmachines/jima417aws-gjrzd-bootstrap": dial tcp 127.0.0.1:46555: connect: connection refused, failed to get manifest jima417aws-gjrzd-master-0: Get "https://127.0.0.1:46555/apis/infrastructure.cluster.x-k8s.io/v1beta2/namespaces/openshift-cluster-api-guests/awsmachines/jima417aws-gjrzd-master-0": dial tcp 127.0.0.1:46555: connect: connection refused, failed to get manifest jima417aws-gjrzd-master-1: Get "https://127.0.0.1:46555/apis/infrastructure.cluster.x-k8s.io/v1beta2/namespaces/openshift-cluster-api-guests/awsmachines/jima417aws-gjrzd-master-1": dial tcp 127.0.0.1:46555: connect: connection refused, failed to get manifest jima417aws-gjrzd-master-2: Get "https://127.0.0.1:46555/apis/infrastructure.cluster.x-k8s.io/v1beta2/namespaces/openshift-cluster-api-guests/awsmachines/jima417aws-gjrzd-master-2": dial tcp 127.0.0.1:46555: connect: connection refused, failed to get manifest jima417aws-gjrzd-bootstrap: Get "https://127.0.0.1:46555/apis/cluster.x-k8s.io/v1beta1/namespaces/openshift-cluster-api-guests/machines/jima417aws-gjrzd-bootstrap": dial tcp 127.0.0.1:46555: connect: connection refused, failed to get manifest jima417aws-gjrzd-master-0: Get "https://127.0.0.1:46555/apis/cluster.x-k8s.io/v1beta1/namespaces/openshift-cluster-api-guests/machines/jima417aws-gjrzd-master-0": dial tcp 127.0.0.1:46555: connect: connection refused, failed to get manifest jima417aws-gjrzd-master-1: Get "https://127.0.0.1:46555/apis/cluster.x-k8s.io/v1beta1/namespaces/openshift-cluster-api-guests/machines/jima417aws-gjrzd-master-1": dial tcp 127.0.0.1:46555: connect: connection refused, failed to get manifest jima417aws-gjrzd-master-2: Get "https://127.0.0.1:46555/apis/cluster.x-k8s.io/v1beta1/namespaces/openshift-cluster-api-guests/machines/jima417aws-gjrzd-master-2": dial tcp 127.0.0.1:46555: connect: connection refused, failed to get manifest jima417aws-gjrzd-bootstrap: Get "https://127.0.0.1:46555/api/v1/namespaces/openshift-cluster-api-guests/secrets/jima417aws-gjrzd-bootstrap": dial tcp 127.0.0.1:46555: connect: connection refused, failed to get manifest jima417aws-gjrzd-master: Get "https://127.0.0.1:46555/api/v1/namespaces/openshift-cluster-api-guests/secrets/jima417aws-gjrzd-master": dial tcp 127.0.0.1:46555: connect: connection refused]
Version-Release number of selected component (if applicable):
4.16/4.17 nightly build
How reproducible:
always
Steps to Reproduce:
1. Set ENV OPENSHIFT_INSTALL_PRESERVE_BOOTSTRAP 2. Trigger the capi-based installation 3.
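A minimal reproduction sketch, assuming a prepared install directory (the variable value and directory path are illustrative):
export OPENSHIFT_INSTALL_PRESERVE_BOOTSTRAP=true
openshift-install create cluster --dir ./install-dir --log-level debug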
Actual results:
Installer exited when collecting capi manifests.
Expected results:
Installation should be successful.
Additional info:
This is a clone of issue OCPBUGS-38966. The following is the description of the original issue:
—
Description of problem:
installing into GCP shared VPC with BYO hosted zone failed with error "failed to create the private managed zone"
Version-Release number of selected component (if applicable):
4.17.0-0.nightly-multi-2024-08-26-170521
How reproducible:
Always
Steps to Reproduce:
1. Pre-create the dns private zone in the service project, with the zone's dns name like "<cluster name>.<base domain>" and binding to the shared VPC
2. Activate the service account having minimum permissions, i.e. no permission to bind a private zone to the shared VPC in the host project (see [1])
3. "create install-config" and then insert the interested settings (e.g. see [2])
4. "create cluster"
Actual results:
It still tries to create a private zone, which is unexpected. failed to fetch Cluster: failed to generate asset "Cluster": failed to create cluster: failed provisioning resources after infrastructure ready: failed to create the private managed zone: failed to create private managed zone: googleapi: Error 403: Forbidden, forbidden
Expected results:
The installer should use the pre-configured dns private zone, rather than try to create a new one.
Additional info:
The 4.16 epic adding the support: https://issues.redhat.com/browse/CORS-2591 One PROW CI test which succeeded using Terraform installation: https://qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gs/qe-private-deck/logs/periodic-ci-openshift-openshift-tests-private-release-4.17-multi-nightly-4.17-upgrade-from-stable-4.17-gcp-ipi-xpn-mini-perm-byo-hosted-zone-arm-f28/1821177143447523328 The PROW CI test which failed: https://qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gs/qe-private-deck/logs/periodic-ci-openshift-openshift-tests-private-release-4.17-multi-nightly-gcp-ipi-xpn-mini-perm-byo-hosted-zone-amd-f28-destructive/1828255050678407168
Description of the problem:
When an installation uses a proxy whose password contains special characters, the proxy variables are not passed to the agent. The special characters are URL-encoded (e.g. %2C).
How reproducible:
Always
Steps to reproduce:
1. Environment with a proxy whose password contains special characters. The special characters should be URL-encoded.
2.
3.
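An illustrative example of step 1, with the password "p@ss,word" URL-encoded inside the proxy URL (host, port, and credentials are placeholders):
export HTTP_PROXY='http://proxyuser:p%40ss%2Cword@proxy.example.com:3128'
export HTTPS_PROXY='http://proxyuser:p%40ss%2Cword@proxy.example.com:3128'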
Actual results:
The agent ignored the proxy variables, so it tried connecting to the destinations without a proxy and failed.
Expected results:
Agent should use the proxy with the special characters.
Slack thread: https://redhat-internal.slack.com/archives/CUPJTHQ5P/p1715779618711999
There is a solution for similar issue here: https://access.redhat.com/solutions/7053351 (not assisted).
In the Quick Start guided tour, the user needs to click the "Next" button twice to move forward to the next step. If you skip the alert (Yes/No input) and click the "Next" button, it does not work.
The "Next" button does not respond to the first click.
The "Next" button should navigate to the next step whether or not the user has answered the alert message.
This is a clone of issue OCPBUGS-34849. The following is the description of the original issue:
—
Description of problem:
On 4.17, ABI jobs fail with error level=debug msg=Failed to register infra env. Error: 1 error occurred: level=debug msg= * mac-interface mapping for interface eno12399np0 is missing
Version-Release number of selected component (if applicable):
4.17.0-0.nightly-2024-05-24-193308
How reproducible:
On Prow CI ABI jobs, always
Steps to Reproduce:
1. Generate ABI ISO starting with an agent-config file defining multiple network interfaces with `enabled: false` 2. Boot the ISO 3. Wait for error
Actual results:
Install fails with error 'mac-interface mapping for interface xxxx is missing'
Expected results:
Install completes
Additional info:
The check fails on the 1st network interface defined with `enabled: false` Prow CI ABI Job: https://qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gs/qe-private-deck/logs/periodic-ci-openshift-openshift-tests-private-release-4.17-amd64-nightly-baremetal-pxe-ha-agent-ipv4-static-connected-f14/1797619997015543808 agent-config.yaml: https://gcsweb-qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/qe-private-deck/logs/periodic-ci-openshift-openshift-tests-private-release-4.17-amd64-nightly-baremetal-pxe-ha-agent-ipv4-static-connected-f14/1797619997015543808/artifacts/baremetal-pxe-ha-agent-ipv4-static-connected-f14/baremetal-lab-agent-install/artifacts/agent-config.yaml
Auth operator appears to be the problem area. Change needs reverting until the test or operator can be fixed.
Description of problem:
Network policy doesn't work properly during SDN live migration. During the migration, when the 2 CNI plugins are running in parallel. Cross-CNI traffic will be denied by ACLs generated for the network policy.
Version-release number of selected component (if applicable):
How reproducible:
Steps to reproduce:
1. Deploy a cluster with openshift-sdn
2. Create testpods in 2 different namespaces, z1 and z2.
3. In namespace z1, create a network policy that allows traffic from z2 (see the example after these steps).
4. Trigger SDN live migration
5. Monitor the accessibility between the pods in Z1 and Z2.
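A minimal sketch of the policy from step 3, assuming the default kubernetes.io/metadata.name namespace labels are present:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-from-z2
  namespace: z1
spec:
  podSelector: {}
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: z2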
Actual results:
When the pods in z1 and z2 on different nodes are using different CNI, the traffic is denied.
Expected results:
The traffic shall be allowed regardless of the CNI utilized by either pod.
Additional info:
Please fill in the following template while reporting a bug and provide as much relevant information as possible. Doing so will give us the best chance to find a prompt resolution.
Affected Platforms:
Is it an
If it is an internal RedHat testing failure:
If it is a CI failure:
If it is a customer / SD issue:
Cluster-update keys include some old Red Hat keys which are self-signed with SHA-1. The keys that we use have recently been re-signed with SHA256. We don't rely on the self-signing to establish trust in the keys (that trust is established by baking a ConfigMap manifest into release images, where it can be read by the cluster-version operator), but we do need to avoid spooking the key-loading library. Currently, Go-1.22-built CVOs in FIPS mode fail to bootstrap,
like this aws-ovn-fips run > Artifacts > install artifacts:
$ curl -s [https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.17-e2e-aws-ovn-fips/1800906552731766784/artifacts/e2e-aws-ovn-fips/ipi-install-install/artifacts/log-bundle-20240612161314.tar] | tar -tvz | grep 'cluster-version.*log' -rw-r--r-- core/core 54653 2024-06-12 09:13 log-bundle-20240612161314/bootstrap/containers/cluster-version-operator-bd9f61984afa844dcd284f68006ffc9548377c045eff840096c74bcdcbe5cca3.log $ curl -s [https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.17-e2e-aws-ovn-fips/1800906552731766784/artifacts/e2e-aws-ovn-fips/ipi-install-install/artifacts/log-bundle-20240612161314.tar] | tar -xOz log-bundle-20240612161314/bootstrap/containers/cluster-version-operator-bd9f61984afa844dcd284f68006ffc9548377c045eff840096c74bcdcbe5cca3.log | grep GPG I0612 16:06:15.952567 1 start.go:256] Failed to initialize from payload; shutting down: the config map openshift-config-managed/release-verification has an invalid key "verifier-public-key-redhat" that must be a GPG public key: openpgp: invalid data: tag byte does not have MSB set: openpgp: invalid data: tag byte does not have MSB set E0612 16:06:15.952600 1 start.go:309] Collected payload initialization goroutine: the config map openshift-config-managed/release-verification has an invalid key "verifier-public-key-redhat" that must be a GPG public key: openpgp: invalid data: tag byte does not have MSB set: openpgp: invalid data: tag byte does not have MSB set
That's this code attempting to call ReadArmoredKeyRing (which fails with a currently-unlogged "openpgp: invalid data: user ID self-signature invalid: openpgp: invalid signature: RSA verification failure", complaining about the SHA-1 signature), and then falling back to ReadKeyRing, which fails on the reported "openpgp: invalid data: tag byte does not have MSB set".
To avoid these failures, we should:
Only 4.17 will use Go 1.22, so that's the only release that needs patching. But the changes would be fine to backport if we wanted.
100%.
1. Build the CVO with Go 1.22
2. Launch a FIPS cluster.
Fails to bootstrap, with the bootstrap CVO complaining, as shown in the Description of problem section.
Successful install
Description of problem:
With the changes in https://github.com/openshift/machine-config-operator/pull/4425, RHEL worker nodes fail as follows: [root@ptalgulk-0807c-fq97t-w-a-l-rhel-1 cloud-user]# systemctl --failed UNIT LOAD ACTIVE SUB DESCRIPTION ● disable-mglru.service loaded failed failed Disables MGLRU on Openshfit LOAD = Reflects whether the unit definition was properly loaded. ACTIVE = The high-level unit activation state, i.e. generalization of SUB. SUB = The low-level unit activation state, values depend on unit type. 1 loaded units listed. Pass --all to see loaded but inactive units, too. To show all installed unit files use 'systemctl list-unit-files'. [root@ptalgulk-0807c-fq97t-w-a-l-rhel-1 cloud-user]# journalctl -u disable-mglru.service -- Logs begin at Mon 2024-07-08 06:23:03 UTC, end at Mon 2024-07-08 08:31:35 UTC. -- Jul 08 06:23:14 localhost.localdomain systemd[1]: Starting Disables MGLRU on Openshfit... Jul 08 06:23:14 localhost.localdomain bash[710]: /usr/bin/bash: /sys/kernel/mm/lru_gen/enabled: No such file or directory Jul 08 06:23:14 localhost.localdomain systemd[1]: disable-mglru.service: Main process exited, code=exited, status=1/FAILURE Jul 08 06:23:14 localhost.localdomain systemd[1]: disable-mglru.service: Failed with result 'exit-code'. Jul 08 06:23:14 localhost.localdomain systemd[1]: Failed to start Disables MGLRU on Openshfit. Jul 08 06:23:14 localhost.localdomain systemd[1]: disable-mglru.service: Consumed 4ms CPU time We should only disable mglru if it exists.
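A hedged sketch of the suggested guard (illustrative only; the real change lives in the MCO templates, and the exact value written may differ):
# only attempt to touch MGLRU when the sysfs knob exists (it was missing on the RHEL worker above)
if [ -w /sys/kernel/mm/lru_gen/enabled ]; then
  echo n > /sys/kernel/mm/lru_gen/enabled
fi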
Version-Release number of selected component (if applicable):
4.16, 4.17
How reproducible:
Attempt to bring up rhel worker node
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
This is a clone of issue OCPBUGS-42574. The following is the description of the original issue:
—
Description of problem:
On "VolumeSnapshot" list page, when project dropdown is "All Projects", click "Create VolumeSnapshot", the project "Undefined" is shown on project field.
Version-Release number of selected component (if applicable):
4.17.0-0.nightly-2024-09-27-213503 4.18.0-0.nightly-2024-09-28-162600
How reproducible:
Always
Steps to Reproduce:
1.Go to "VolumeSnapshot" list page, set "All Projects" in project dropdown list. 2.Click "Create VolumeSnapshot", check project field on the creation page. 3.
Actual results:
2. The project is "Undefined"
Expected results:
2. The project should be "default".
Additional info:
This is a clone of issue OCPBUGS-42783. The following is the description of the original issue:
—
Context
Some ROSA HCP users host their own container registries (e.g., self-hosted Quay servers) that are only accessible from inside of their VPCs. This is often achieved through the use of private DNS zones that resolve non-public domains like quay.mycompany.intranet to non-public IP addresses. The private registries at those addresses then present self-signed SSL certificates to the client that can be validated against the HCP's additional CA trust bundle.
Problem Description
A user of a ROSA HCP cluster with a configuration like the one described above is encountering errors when attempting to import a container image from their private registry into their HCP's internal registry via oc import-image. Originally, these errors showed up in openshift-apiserver logs as DNS resolution errors, i.e., OCPBUGS-36944. After the user upgraded their cluster to 4.14.37 (which fixes OCPBUGS-36944), openshift-apiserver was able to properly resolve the domain name but complains of HTTP 502 Bad Gateway errors. We suspect these 502 Bad Gateway errors are coming from the Konnectivity-agent while it proxies traffic between the control and data planes.
We've confirmed that the private registry is accessible from the HCP data plane (worker nodes) and that the certificate presented by the registry can be validated against the cluster's additional trust bundle. IOW, curl-ing the private registry from a worker node returns a HTTP 200 OK, but doing the same from a control plane node returns a HTTP 502. Notably, this cluster is not configured with a cluster-wide proxy, nor does the user's VPC feature a transparent proxy.
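For reference, a hedged sketch of the data-plane check described above (the node name and registry host follow the report's placeholders):
oc debug node/<worker-node> -- chroot /host \
  curl -sS -o /dev/null -w '%{http_code}\n' https://quay.mycompany.intranet/v2/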
Version-Release number of selected component
OCP v4.14.37
How reproducible
Can be reliably reproduced, although the network config (see Context above) is quite specific
Steps to Reproduce
oc import-image imagegroup/imagename:v1.2.3 --from=quay.mycompany.intranet/imagegroup/imagename:v1.2.3 --confirm
Actual Results
error: tag v1.2.3 failed: Internal error occurred: quay.mycompany.intranet/imagegroup/imagename:v1.2.3: Get "https://quay.mycompany.intranet/v2/": Bad Gateway imagestream.image.openshift.io/imagename imported with errors Name: imagename Namespace: mynamespace Created: Less than a second ago Labels: <none> Annotations: openshift.io/image.dockerRepositoryCheck=2024-10-01T12:46:02Z Image Repository: default-route-openshift-image-registry.apps.rosa.clustername.abcd.p1.openshiftapps.com/mynamespace/imagename Image Lookup: local=false Unique Images: 0 Tags: 1 v1.2.3 tagged from quay.mycompany.intranet/imagegroup/imagename:v1.2.3 ! error: Import failed (InternalError): Internal error occurred: quay.mycompany.intranet/imagegroup/imagename:v1.2.3: Get "https://quay.mycompany.intranet/v2/": Bad Gateway Less than a second ago error: imported completed with errors
Expected Results
Desired container image is imported from private external image registry into cluster's internal image registry without error
Please review the following PR: https://github.com/openshift/cluster-api-provider-openstack/pull/315
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
We are in a live migration scenario.
If a project has a networkpolicy to allow from the host network (more concretely, to allow from the ingress controllers and the ingress controllers are in the host network), traffic doesn't work during the live migration between any ingress controller node (either migrated or not migrated) and an already migrated application node.
I'll expand later in the description and internal comments, but the TL;DR is that the IPs of the tun0 interfaces of not-yet-migrated source nodes and the IPs of the ovn-k8s-mp0 interfaces of migrated source nodes are not added to the address sets related to the networkpolicy ACL on the target OVN-Kubernetes node, so the traffic is not allowed.
Version-Release number of selected component (if applicable):
4.16.13
How reproducible:
Always
Steps to Reproduce:
1. Before the migration: have a project with a networkpolicy that allows from the ingress controller and the ingress controller in the host network. Everything must work properly at this point.
2. Start the migration
3. During the migration, check connectivity from the host network of either a migrated node or a non-migrated node. Both will fail (checking from the same node doesn't fail)
Actual results:
The pod on the worker node is not reachable from the host network of the ingress controller node (unless the pod is on the same node as the ingress controller), which causes the ingress controller routes to return a 503 error.
Expected results:
The pod on the worker node should be reachable from the ingress controller node, even when the ingress controller node has not migrated yet and the application node has.
Additional info:
This is not a duplicate of OCPBUGS-42578. This bug refers to the host-to-pod communication path while the other one doesn't.
This is a customer issue. More details to be included in private comments for privacy.
Workaround: Creating a networkpolicy that explicitly allows traffic from tun0 and ovn-k8s-mp0 interfaces. However, note that the workaround can be problematic for clusters with hundreds or thousands of projects. Another possible workaround is to temporarily delete all the networkpolicies of the projects. But again, this may be problematic (and a security risk).
Please review the following PR: https://github.com/openshift/ironic-rhcos-downloader/pull/98
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Attempts to update a cluster to a release payload with a signature published by Red Hat fail with the CVO failing to verify the signature, signalled by the ReleaseAccepted=False condition:
Retrieving payload failed version="4.16.0-rc.4" image="quay.io/openshift-release-dev/ocp-release@sha256:5e76f8c2cdc81fa40abb809ee5e2d56cb84f409aab773aa9b9c7e8ed8811bf74" failure=The update cannot be verified: unable to verify sha256:5e76f8c2cdc81fa40abb809ee5e2d56cb84f409aab773aa9b9c7e8ed8811bf74 against keyrings: verifier-public-key-redhat
CVO shows evidence of not being able to find the proper signature in its stores:
$ grep verifier-public-key-redhat cvo.log | head I0610 07:38:16.208595 1 event.go:364] Event(v1.ObjectReference{Kind:"ClusterVersion", Namespace:"openshift-cluster-version", Name:"version", UID:"", APIVersion:"config.openshift.io/v1", ResourceVersion:"", FieldPath:""}): type: 'Warning' reason: 'RetrievePayloadFailed' Retrieving payload failed version="4.16.0-rc.4" image="quay.io/openshift-release-dev/ocp-release@sha256:5e76f8c2cdc81fa40abb809ee5e2d56cb84f409aab773aa9b9c7e8ed8811bf74" failure=The update cannot be verified: unable to verify sha256:5e76f8c2cdc81fa40abb809ee5e2d56cb84f409aab773aa9b9c7e8ed8811bf74 against keyrings: verifier-public-key-redhat // [2024-06-10T07:38:16Z: prefix sha256-5e76f8c2cdc81fa40abb809ee5e2d56cb84f409aab773aa9b9c7e8ed8811bf74 in config map signatures-managed: no more signatures to check, 2024-06-10T07:38:16Z: ClusterVersion spec.signatureStores is an empty array. Unset signatureStores entirely if you want to enable the default signature stores] ...
4.16.0-rc.3
4.16.0-rc.4
4.17.0-ec.0
Seems always. All CI build farm clusters showed this behavior when trying to update from 4.16.0-rc.3
1. Launch update to a version with a signature published by RH
ReleaseAccepted=False and update is stuck
ReleaseAccepted=True and update proceeds
Suspected culprit is https://github.com/openshift/cluster-version-operator/pull/1030/ so the fix may be a revert or an attempt to fix-forward, but revert seems safer at this point.
Evidence:
[1]
...ClusterVersion spec.signatureStores is an empty array. Unset signatureStores entirely if you want to enable the default signature store... W0610 07:58:59.095970 1 warnings.go:70] unknown field "spec.signatureStores"
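As the CVO message suggests, one hedged mitigation (not the revert ultimately proposed here) is to unset the field entirely, assuming the ClusterVersion schema on the cluster actually accepts spec.signatureStores:
oc patch clusterversion version --type json \
  -p '[{"op":"remove","path":"/spec/signatureStores"}]'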
Refactor name to Dockerfile.ocp as a better, version independent alternative
Description of problem:
monitoritoring-alertmanager-api-writer should be monitoring-alertmanager-api-writer
Port 9092 provides access the `/metrics` and `/federate` endpoints only. This port is for internal use, and no other usage is guaranteed.
should be
Port 9092 provides access to the `/metrics` and `/federate` endpoints only. This port is for internal use, and no other usage is guaranteed.
Version-Release number of selected component (if applicable):
4.16+
How reproducible:
always
Steps to Reproduce:
1. check Documentation/resources.md
Actual results:
errors in doc
use "pipelinesascode.tekton.dev/on-cel-expression" to skip Konflux builds when they are not necessary
The haproxy image here is currently just amd64. This is preventing testing of arm nodepools on azure on aks. We should use a manifest list version with both arm and amd.
Description of problem:
Update Docs links for "Learn More" in Display Warning Policy Notification Actual link:
Version-Release number of selected component (if applicable):
How reproducible:
Code: https://github.com/openshift/console/blob/master/frontend/public/components/utils/documentation.tsx#L88
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
This is a clone of issue OCPBUGS-35358. The following is the description of the original issue:
—
I'm working with the Gitops operator (1.7) and when there is a high amount of CR (38.000 applications objects in this case) the related install plan get stuck with the following error:
- lastTransitionTime: "2024-06-11T14:28:40Z" lastUpdateTime: "2024-06-11T14:29:42Z" message: 'error validating existing CRs against new CRD''s schema for "applications.argoproj.io": error listing resources in GroupVersionResource schema.GroupVersionResource{Group:"argoproj.io", Version:"v1alpha1", Resource:"applications"}: the server was unable to return a response in the time allotted, but may still be processing the request'
Even after waiting for a long time, the operator is unable to move forward, neither removing nor reinstalling its components.
In a lab, the issue was not present until we started adding load to the cluster (applications.argoproj.io); once we hit 26,000 applications we were no longer able to upgrade or reinstall the operator.
This is a clone of issue OCPBUGS-42782. The following is the description of the original issue:
—
Description of problem:
The OpenShift Pipelines operator automatically installs a OpenShift console plugin. The console plugin metrics reports this as unknown after the plugin was renamed from "pipeline-console-plugin" to "pipelines-console-plugin".
Version-Release number of selected component (if applicable):
4.14+
How reproducible:
Always
Steps to Reproduce:
Actual results:
It shows an "unknown" plugin in the metrics.
Expected results:
It should show a "pipelines" plugin in the metrics.
Additional info:
None
Currently, the assisted-service generated-code test does not ensure that Go modules are tidy. This results in untidy builds, which could potentially lead to build failures or redundant packages in builds.
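A minimal sketch of such a tidiness gate in CI, using generic Go tooling rather than the exact assisted-service script:
go mod tidy
git diff --exit-code go.mod go.sum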
Please review the following PR: https://github.com/openshift/cluster-bootstrap/pull/104
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
TestIngressControllerRouteSelectorUpdateShouldClearRouteStatus is flaking because sometimes the IngressController gets modified after the E2E test retrieved the state of the IngressController object, we get this error when trying to apply changes: === NAME TestAll/parallel/TestIngressControllerRouteSelectorUpdateShouldClearRouteStatus router_status_test.go:134: failed to update ingresscontroller: Operation cannot be fulfilled on ingresscontrollers.operator.openshift.io "ic-route-selector-test": the object has been modified; please apply your changes to the latest version and try again We should use updateIngressControllerSpecWithRetryOnConflict to repeatedly attempt to update the IngressController while refreshing the state.
Version-Release number of selected component (if applicable):
4.17
How reproducible:
Steps to Reproduce:
1. Run TestIngressControllerRouteSelectorUpdateShouldClearRouteStatus E2E test
Actual results:
=== NAME TestAll/parallel/TestIngressControllerRouteSelectorUpdateShouldClearRouteStatus router_status_test.go:134: failed to update ingresscontroller: Operation cannot be fulfilled on ingresscontrollers.operator.openshift.io "ic-route-selector-test": the object has been modified; please apply your changes to the latest version and try again
Expected results:
Test should pass
Additional info:
This is a clone of issue OCPBUGS-39226. The following is the description of the original issue:
—
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible: Always
Repro Steps:
Add: "bridge=br0:enpf0,enpf2 ip=br0:dhcp" to dracut cmdline. Make sure either enpf0/enpf2 is the primary network of the cluster subnet.
The linux bridge can be configured to add a virtual switch between one or many ports. This can be done by a simple machine config that adds:
"bridge=br0:enpf0,enpf2 ip=br0:dhcp"
to the kernel command line options, which will be processed by dracut.
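A hedged sketch of such a MachineConfig (the name and role are illustrative; the kernel arguments are the ones quoted above):
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: 99-worker-bridge-kargs
  labels:
    machineconfiguration.openshift.io/role: worker
spec:
  kernelArguments:
    - bridge=br0:enpf0,enpf2
    - ip=br0:dhcp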
The use case of adding such a virtual bridge for simple IEEE802.1 switching is to support PCIe devices that act as co-processors in a baremetal server. For example:
 --------            ---------------------
| Host   |   PCIe   | Co-processor        |
|  eth0  | <------> |  enpf0 <---> network|
 --------            ---------------------
This co-processor could be a "DPU" network interface card. Thus the co-processor can be part of the same underlay network as the cluster and pods can be scheduled on the Host and the Co-processor. This allows for pods to be offloaded to the co-processor for scaling workloads.
Actual results:
ovs-configuration service fails.
Expected results:
ovs-configuration service passes with the bridge interface added to the ovs bridge.
Description of problem:
this issue is opened to track a known issue https://github.com/openshift/console/pull/13677/files#r1567936852
Version-Release number of selected component (if applicable):
4.16.0-0.nightly-2024-04-22-023835
How reproducible:
Always
Steps to Reproduce:
Actual results:
currently we are printing: console-telemetry-plugin: telemetry disabled - ignoring telemetry event: page
Expected results:
Additional info:
Description of problem:
When creating a Serverless Function via the Web Console from a Git repository, the validation claims that the builder strategy is not s2i. However, if the build strategy is not set in func.yaml, then s2i should be assumed implicitly and there should be no error. There should be an error only if the strategy is explicitly set to something other than s2i in func.yaml.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. Try to create Serverless function from git repository where func.yaml does not explicitly specify builder. 2. The Serverless Function cannot be created because of the validation.
Actual results:
The Function cannot be created.
Expected results:
The function can be created.
Additional info:
This is a clone of issue OCPBUGS-38111. The following is the description of the original issue:
—
Description of problem:
See https://github.com/openshift/console/pull/14030/files/0eba7f7db6c35bbf7bca5e0b8eebd578e47b15cc#r1707020700
This is a clone of issue OCPBUGS-38132. The following is the description of the original issue:
—
The CPO reconciliation aborts when the OIDC/LDAP IDP validation check fails, and this results in a failure to reconcile any components that are reconciled after that point in the code.
This failure should not be fatal to the CPO reconcile and should likely be reported as a condition on the HC.
xref
Customer incident
https://issues.redhat.com/browse/OCPBUGS-38071
RFE for bypassing the check
https://issues.redhat.com/browse/RFE-5638
PR to proxy the IDP check through the data plane network
https://github.com/openshift/hypershift/pull/4273
Description of problem:
Collect number of resources in etcd with must-gather
Version-Release number of selected component (if applicable):
4.14, 4.15, 4.16, 4.17
Actual results:
The number of resources in etcd is not available in the must-gather
Expected results:
The number of resources in etcd is available in the must-gather
Additional info: RFE-5765
PR for 4.17 [1]
Please review the following PR: https://github.com/openshift/cluster-cloud-controller-manager-operator/pull/346
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of the problem:
The docs [1] on how to remove a cluster without it being destroyed for MCE don't account for clusters installed with the infrastructure operator and late-binding.
If these steps are followed the hosts will be booted back into the discovery ISO, effectively destroying the cluster.
How reproducible:
Not sure, just opening this after a conversation with a customer and a check in the docs.
Steps to reproduce:
1. Deploy a cluster using late-binding and BMHs
2. Follow the referenced doc to "detach" a cluster from management
Actual results:
ClusterDeployment is removed and agents are unbound, rebooting them back into the discovery ISO.
Expected results:
"detaching" a cluster should not reboot the agents into the infraenv and should delete them instead even if late binding was used.
Alternatively the docs should be updated to ensure late-binding users don't accidentally destroy their clusters.
Please review the following PR: https://github.com/openshift/oauth-apiserver/pull/116
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/baremetal-runtimecfg/pull/314
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
This is a clone of issue OCPBUGS-38549. The following is the description of the original issue:
—
In analytics events, console sends the Organization.id from OpenShift Cluster Manager's Account Service, rather than the Organization.external_id. The external_id is meaningful company-wide at Red Hat, while the plain id is only meaningful within OpenShift Cluster Manager. You can use id to lookup external_id in OCM, but it's an extra step we'd like to avoid if possible.
cc Ali Mobrem
Description of problem:
The etcd data store is left over in the <install dir>/.clusterapi_output/etcd dir when infrastructure provisioning fails. This takes up a lot of local storage space and is useless.
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. Start an install 2. introduce an infra provisioning failure -- I did this by editing the 02_infra-cluster.yaml manifest to point to a non existent region 3. Check .clusterapi_output dir for etcd dir
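For step 3, a trivial check (the install directory path is illustrative):
du -sh ./install-dir/.clusterapi_output/etcd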
Actual results:
etcd data store remains
Expected results:
etcd data store should be deleted during infra provisioning failures. It should only be persisted to disk if there is a failure/interrupt in between infrastructure provisioning and bootstrap destroy, in which case it can be used in conjunction with the wait-for and destroy bootstrap commands.
Additional info:
Description of problem:
https://github.com/prometheus/prometheus/pull/14446 is a fix for https://github.com/prometheus/prometheus/issues/14087 (see there for details) This was introduced in Prom 2.51.0 https://github.com/openshift/cluster-monitoring-operator/blob/master/Documentation/deps-versions.md
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
The code introduced by https://github.com/openshift/hypershift/pull/4354 is potentially disruptive for legitimate clusters. This needs to be dropped before releasing. This is a blocker, but I don't happen to be able to set prio: blocker any more.
Description of problem:
The console frontend does not respect the basePath when loading locales
Version-Release number of selected component (if applicable):
4.17.0
How reproducible:
Always
Steps to Reproduce:
1. Launch ./bin/bridge --base-path=/okd/ 2. Open browser console and observe 404 errors
Actual results:
There are 404 errors
Expected results:
There are no 404 errors
Additional info:
Copied from https://github.com/openshift/console/issues/12671
Please review the following PR: https://github.com/openshift/cluster-kube-apiserver-operator/pull/1688
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/openshift-apiserver/pull/437
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
This is a clone of issue OCPBUGS-37984. The following is the description of the original issue:
—
Starting around the beginning of June, `-bm` (real baremetal) jobs started exhibiting a high failure rate. OCPBUGS-33255 was mentioned as a potential cause, but this was filed much earlier.
The start date for this is pretty clear in Sippy, chart here:
Example job run:
More job runs
Slack thread:
https://redhat-internal.slack.com/archives/C01CQA76KMX/p1722871253737309
Affecting these tests:
install should succeed: overall
install should succeed: cluster creation
install should succeed: bootstrap
A provisioning object such as:
apiVersion: metal3.io/v1alpha1
kind: Provisioning
metadata:
  name: provisioning-configuration
spec:
  provisioningInterface: enp175s0f1
  provisioningNetwork: Disabled
Would cause Metal3 to not function properly, errors such as:
waiting for IP to be configured ens175s0f1
would be seen in the metal-ironic container logs.
A workaround is to delete all related provisioning fields, i.e.:
provisioningDHCPRange: 172.22.0.10,172.22.7.254
provisioningIP: 172.22.0.3
provisioningInterface: enp175s0f1
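A hedged sketch of that workaround using a merge patch (nulling a field removes it; verify against your Provisioning object before applying):
oc patch provisioning provisioning-configuration --type merge \
  -p '{"spec":{"provisioningIP":null,"provisioningInterface":null,"provisioningDHCPRange":null}}'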
If the provisioning network is disabled all related provisioning options should be ignored.
Remove duplicate code in storage module
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
CI Disruption during node updates:
4.18 Minor and 4.17 micro upgrades started failing with the initial 4.17 payload 4.17.0-0.ci-2024-08-09-225819
4.18 Micro upgrade failures began with the initial payload 4.18.0-0.ci-2024-08-09-234503
CI Disruption in the -out-of-change jobs in the nightlies that start with
4.18.0-0.nightly-2024-08-10-011435 and
4.17.0-0.nightly-2024-08-09-223346
The common change in all of those scenarios appears to be:
OCPNODE-2357: templates/master/cri-o: make crun as the default container runtime #4437
OCPNODE-2357: templates/master/cri-o: make crun as the default container runtime #4518
Please review the following PR: https://github.com/openshift/machine-api-provider-gcp/pull/83
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
During OpenShift 4.16 cluster installation, the installer (which uses the Terraform module) is unable to create tags for the security groups associated with master/worker nodes, since the tag is in key=value format (i.e. key=value). Error log for reference: level=error msg="failed to fetch Cluster: failed to generate asset \"Cluster\": failed to create cluster: failed during pre-provisioning: failed to create security groups: failed to tag the Control plane security group: Resource not found: [PUT https://example.cloud:443/v2.0/security-groups/sg-id/tags/openshiftClusterID=ocpclientprod2-vwgsc]
Version-Release number of selected component (if applicable):
4.16.0
How reproducible:
100%
Steps to Reproduce:
1. Create install-config 2. run the 4.16 installer 3. Observe the installation logs
Actual results:
installation fails to tag the security group
Expected results:
installation to be successful
Additional info:
Description of problem:
- Pods managed by DaemonSets are being evicted.
- This causes some pods of OCP components, for example CSI drivers (and possibly more), to be evicted before the application pods, leaving those application pods in an Error status (because the CSI pod cannot tear down the volumes).
- As application pods remain in an Error status, the drain operation also fails after the maxPodGracePeriod.
Version-Release number of selected component (if applicable):
- 4.11
How reproducible:
- Wait for a new scale-down event
Steps to Reproduce:
1. Wait for a new scale-down event 2.Monitor csi pods (or dns, or ingress...), you will notice that they are evicted, and as it come from DaemonSets, they become scheduled again as new pods. 3. More evidences could be found from kube-api audit logs.
Actual results:
- From the audit logs we can see that pods are evicted by the clusterautoscaler:
"kind": "Event",
"apiVersion": "audit.k8s.io/v1",
"level": "Metadata",
"auditID": "ec999193-2c94-4710-a8c7-ff9460e30f70",
"stage": "ResponseComplete",
"requestURI": "/api/v1/namespaces/openshift-cluster-csi-drivers/pods/aws-efs-csi-driver-node-2l2xn/eviction",
"verb": "create",
"user": {
  "username": "system:serviceaccount:openshift-machine-api:cluster-autoscaler",
  "uid": "44aa427b-58a4-438a-b56e-197b88aeb85d",
  "groups": [
    "system:serviceaccounts",
    "system:serviceaccounts:openshift-machine-api",
    "system:authenticated"
  ],
  "extra": {
    "authentication.kubernetes.io/pod-name": [
      "cluster-autoscaler-default-5d4c54c54f-dx59s"
    ],
    "authentication.kubernetes.io/pod-uid": [
      "d57837b1-3941-48da-afeb-179141d7f265"
    ]
  }
},
"sourceIPs": [
  "10.0.210.157"
],
"userAgent": "cluster-autoscaler/v0.0.0 (linux/amd64) kubernetes/$Format",
"objectRef": {
  "resource": "pods",
  "namespace": "openshift-cluster-csi-drivers",
  "name": "aws-efs-csi-driver-node-2l2xn",
  "apiVersion": "v1",
  "subresource": "eviction"
},
"responseStatus": {
  "metadata": {},
  "status": "Success",
  "code": 201
}
## Even if they come from a DaemonSet
$ oc get ds -n openshift-cluster-csi-drivers
NAME                      DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR            AGE
aws-ebs-csi-driver-node   8         8         8       8            8           kubernetes.io/os=linux   146m
aws-efs-csi-driver-node   8         8         8       8            8           kubernetes.io/os=linux   127m
Expected results:
DaemonSet Pods should not be evicted
Additional info:
In our hypershift test, we see the openshift-controller-manager undoing the work of our controllers to set an imagePullSecrets entry on our ServiceAccounts. The result is a rapid updating of ServiceAccounts as the controllers fight.
This started happening after https://github.com/openshift/openshift-controller-manager/pull/305
Please review the following PR: https://github.com/openshift/agent-installer-utils/pull/35
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
4.16 installs fail for ROSA STS installations time="2024-06-11T14:05:48Z" level=debug msg="\t[failed to apply security groups to load balancer \"jamesh-sts-52g29-int\": AccessDenied: User: arn:aws:sts::476950216884:assumed-role/ManagedOpenShift-Installer-Role/1718114695748673685 is not authorized to perform: elasticloadbalancing:SetSecurityGroups on resource: arn:aws:elasticloadbalancing:us-east-1:476950216884:loadbalancer/net/jamesh-sts-52g29-int/bf7ef748daa739ce because no identity-based policy allows the elasticloadbalancing:SetSecurityGroups action"
Version-Release number of selected component (if applicable):
4.16+
How reproducible:
Every time
Steps to Reproduce:
1. Create an installer policy with the permissions listed in the installer [here|https://github.com/openshift/installer/blob/master/pkg/asset/installconfig/aws/permissions.go] 2. Run an install in AWS IPI
Actual results:
The installer fails to install a cluster in AWS. The installer log shows AccessDenied messages for the IAM action elasticloadbalancing:SetSecurityGroups, together with the error message "failed to apply security groups to load balancer".
Expected results:
Install completes successfully
Additional info:
Managed OpenShift (ROSA) installs STS clusters with [this|https://github.com/openshift/managed-cluster-config/blob/master/resources/sts/4.16/sts_installer_permission_policy.json] permission policy for the installer, which should be what is required by the installer [policy|https://github.com/openshift/installer/blob/master/pkg/asset/installconfig/aws/permissions.go] plus the permissions needed for OCM to do pre-install validation.
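For illustration only, a minimal IAM policy statement that would grant the missing action might look like the following; the Sid and resource scoping are assumptions, and the real ROSA managed policy will differ:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowSetSecurityGroupsOnLoadBalancers",
      "Effect": "Allow",
      "Action": ["elasticloadbalancing:SetSecurityGroups"],
      "Resource": "*"
    }
  ]
}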
This is a clone of issue OCPBUGS-38479. The following is the description of the original issue:
—
Description of problem:
When using an installer with an amd64 payload, configuring the VMs to use aarch64 is possible through the install-config.yaml:
additionalTrustBundlePolicy: Proxyonly
apiVersion: v1
baseDomain: ci.devcluster.openshift.com
compute:
- architecture: arm64
  hyperthreading: Enabled
  name: worker
  platform: {}
  replicas: 3
controlPlane:
  architecture: arm64
  hyperthreading: Enabled
  name: master
  platform: {}
  replicas: 3
However, the installation will fail with ambiguous error messages:
ERROR Attempted to gather ClusterOperator status after installation failure: listing ClusterOperator objects: Get "https://api.build11.ci.devcluster.openshift.com:6443/apis/config.openshift.io/v1/clusteroperators": dial tcp 13.59.207.137:6443: connect: connection refused
The actual error hides in the bootstrap VM's system log:
Red Hat Enterprise Linux CoreOS 417.94.202407010929-0 4.17
SSH host key: SHA256:Ng1GpBIlNHcCik8VJZ3pm9k+bMoq+WdjEcMebmWzI4Y (ECDSA)
SSH host key: SHA256:Mo5RgzEmZc+b3rL0IPAJKUmO9mTmiwjBuoslgNcAa2U (ED25519)
SSH host key: SHA256:ckQ3mPUmJGMMIgK/TplMv12zobr7NKrTpmj+6DKh63k (RSA)
ens5: 10.29.3.15 fe80::1947:eff6:7e1b:baac
Ignition: ran on 2024/08/14 12:34:24 UTC (this boot)
Ignition: user-provided config was applied
Ignition: warning at $.kernelArguments: Unused key kernelArguments
Release image arch amd64 does not match host arch arm64
ip-10-29-3-15 login: [   89.141099] Warning: Unmaintained driver is detected: nft_compat
Version-Release number of selected component (if applicable):
4.16
How reproducible:
Use amd64 installer to install a cluster with aarch64 nodes
Steps to Reproduce:
1. download amd64 installer 2. generate the install-config.yaml 3. edit install-config.yaml to use aarch64 nodes 4. invoke the installer
Actual results:
installation timed out after ~30mins
Expected results:
installation should fail immediately with a proper error message indicating that the installation is not possible
Additional info:
https://redhat-internal.slack.com/archives/C68TNFWA2/p1723640243828379
This is a clone of issue OCPBUGS-39402. The following is the description of the original issue:
—
There is a typo here: https://github.com/openshift/installer/blob/release-4.18/upi/openstack/security-groups.yaml#L370
It should be os_subnet6_range.
That task is only run if os_master_schedulable is defined and greater than 0 in the inventory.yaml.
Tracker issue for bootimage bump in 4.17. This issue should block issues which need a bootimage bump to fix.
The previous bump was OCPBUGS-36318.
4.17.0-0.nightly-2024-05-17-101712 and 4.16.0-0.nightly-2024-05-17-180525 both appear to have ClusterAPIInstallAWS enabled
We don't know whether it is related, but since those payloads we have been seeing intermittent failures:
4.17.0-0.nightly-2024-05-18-041308, 4.17.0-0.nightly-2024-05-18-084343, 4.17.0-0.nightly-2024-05-18-152118
Each time we have an aggregated-aws-ovn-upgrade-4.17-micro-release-openshift-release-analysis-aggregator failure, there are 3 jobs that fail because the bootstrap ssh rule was not removed within 5m0s
4.16.0-0.nightly-2024-05-19-025142, 4.16.0-0.nightly-2024-05-19-141817
This is a clone of issue OCPBUGS-41932. The following is the description of the original issue:
—
Description of problem:
When the Insights Operator is disabled (as described in the docs here or here), the RemoteConfigurationAvailable and RemoteConfigurationValid clusteroperator conditions report the previous state (from before disabling the gathering), which might be Available=True and Valid=True.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. Disable the data gathering in the Insights Operator following the docs links above 2. Watch the clusteroperator conditions with "oc get co insights -o json | jq .status.conditions" 3.
Actual results:
Expected results:
Additional info:
Tests failing:
hosted cluster version rollout succeeds
Tanked on May 25th (Sat) but may have come in late Friday the 24th.
Failure message:
{hosted cluster version rollout never completed error: hosted cluster version rollout never completed, dumping relevant hosted cluster condition messages Degraded: [capi-provider deployment has 1 unavailable replicas, kube-apiserver deployment has 1 unavailable replicas] ClusterVersionSucceeding: Condition not found in the CVO. }
This also looks to be broadly failing hypershift presubmits all over:
Description of problem:
When a user runs the oc-mirror delete command with --generate after a (M2D + D2M) workflow, it fails with the errors below: 2024/08/02 12:18:03 [ERROR] : [OperatorImageCollector] pinging container registry localhost:55000: Get "https://localhost:55000/v2/": http: server gave HTTP response to HTTPS client 2024/08/02 12:18:03 [ERROR] : [OperatorImageCollector] pinging container registry localhost:55000: Get "https://localhost:55000/v2/": http: server gave HTTP response to HTTPS client 2024/08/02 12:18:03 [ERROR] : pinging container registry localhost:55000: Get "https://localhost:55000/v2/": http: server gave HTTP response to HTTPS client
Version-Release number of selected component (if applicable):
WARNING: This version information is deprecated and will be replaced with the output from --short. Use --output=yaml|json to get the full version. Client Version: version.Info{Major:"", Minor:"", GitVersion:"4.17.0-202407302009.p0.gdbf115f.assembly.stream.el9-dbf115f", GitCommit:"dbf115f547a19f12ab72e7b326be219a47d460a0", GitTreeState:"clean", BuildDate:"2024-07-31T00:37:18Z", GoVersion:"go1.22.5 (Red Hat 1.22.5-1.el9) X:strictfipsruntime", Compiler:"gc", Platform:"linux/amd64"}
How reproducible:
Always
Steps to Reproduce:
1. Download the latest oc-mirror binary 2. Use the ImageSetConfig below and perform (M2D + D2M) 3. oc-mirror -c config.yaml file://CLID-136 --v2 4. oc-mirror -c config.yaml --from file://CLID-136 --v2 docker://localhost:5000 --dest-tls-verify=false 5. Now create deleteImageSetConfig as shown below and run delete command with --generate 6. oc-mirror delete -c delete-config.yaml --generate --workspace file://CLID-136-delete docker://localhost:5000 --v2
Actual results:
Below errors are seen 2024/08/02 12:18:03 [ERROR] : [OperatorImageCollector] pinging container registry localhost:55000: Get "https://localhost:55000/v2/": http: server gave HTTP response to HTTPS client 2024/08/02 12:18:03 [ERROR] : [OperatorImageCollector] pinging container registry localhost:55000: Get "https://localhost:55000/v2/": http: server gave HTTP response to HTTPS client 2024/08/02 12:18:03 [ERROR] : pinging container registry localhost:55000: Get "https://localhost:55000/v2/": http: server gave HTTP response to HTTPS client
Expected results:
No errors should be seen
Additional info:
This error is resolved by using --src-tls-verify=false with the oc-mirror delete --generate command. More details in the slack thread here: https://redhat-internal.slack.com/archives/C050P27C71S/p1722601331671649?thread_ts=1722597021.825099&cid=C050P27C71S
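For reference, the workaround amounts to re-running the delete command from step 6 with the extra flag (all other arguments as in the steps above):
oc-mirror delete -c delete-config.yaml --generate --workspace file://CLID-136-delete docker://localhost:5000 --v2 --src-tls-verify=false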
This fix contains the following changes coming from updated version of kubernetes up to v1.30.5: Changelog: v1.30.5: https://github.com/kubernetes/kubernetes/blob/release-1.30/CHANGELOG/CHANGELOG-1.30.md#changelog-since-v1304
This is a clone of issue OCPBUGS-37819. The following is the description of the original issue:
—
Description of problem:
When we added the new bundle metadata encoding `olm.csv.metadata` in https://github.com/operator-framework/operator-registry/pull/1094 (downstreamed for 4.15+), we created situations where:
- konflux-onboarded operators, encouraged to use upstream:latest to generate FBC from templates, could generate the new format; and
- IIB-generated catalog images which used earlier opm versions to serve content would not be able to serve it.
One only has to `opm render` an SQLite catalog image, or expand a catalog template.
Version-Release number of selected component (if applicable):
How reproducible:
every time
Steps to Reproduce:
1. opm render an SQLite catalog image 2. 3.
Actual results:
uses `olm.csv.metadata` in the output
Expected results:
only using `olm.bundle.object` in the output
Additional info:
Please review the following PR: https://github.com/openshift/azure-disk-csi-driver/pull/83
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
In Openshift web console, Dashboards tab, data is not getting loaded for "Prometheus/Overview" Dashboard
Version-Release number of selected component (if applicable):
4.16.0-ec.5
How reproducible:
OCP 4.16.0-ec.5 cluster deployed on Power using UPI installer
Steps to Reproduce:
1. Deploy 4.16.0-ec.5 cluster using UPI installer 2. Login to web console 3. Select "Dashboards" panel under "Observe" tab 4. Select "Prometheus/Overview" from the "Dashboard" drop down
Actual results:
Data/graphs are not getting loaded. "No datapoints found." message is being displayed in all panels
Expected results:
Data/Graphs should be displayed
Additional info:
Screenshots and must-gather.log are available at https://drive.google.com/drive/folders/1XnotzYBC_UDN97j_LNVygwrc77Tmmbtx?usp=drive_link Status of Prometheus pods: [root@ha-416-sajam-bastion-0 ~]# oc get pods -n openshift-monitoring | grep prometheus prometheus-adapter-dc7f96748-mczvq 1/1 Running 0 3h18m prometheus-adapter-dc7f96748-vl4n8 1/1 Running 0 3h18m prometheus-k8s-0 6/6 Running 0 7d2h prometheus-k8s-1 6/6 Running 0 7d2h prometheus-operator-677d4c87bd-8prnx 2/2 Running 0 7d2h prometheus-operator-admission-webhook-54549595bb-gp9bw 1/1 Running 0 7d3h prometheus-operator-admission-webhook-54549595bb-lsb2p 1/1 Running 0 7d3h [root@ha-416-sajam-bastion-0 ~]# Logs of Prometheus pods are available at https://drive.google.com/drive/folders/13DhLsQYneYpouuSsxYJ4VFhVrdJfQx8P?usp=drive_link
Please review the following PR: https://github.com/openshift/images/pull/186
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
The MCO currently lays down a file at /etc/mco/internal-registry-pull-secret.json, which is extracted from the machine-os-puller SA into ControllerConfig. It is then templated down to a MachineConfig. For some reason, this SA is now being refreshed every hour or so, causing a new MachineConfig to be generated every hour. This also causes CI issues as the machineconfigpools will randomly update to a new config in the middle of a test.
More context: https://redhat-internal.slack.com/archives/C02CZNQHGN8/p1715888365021729
Please review the following PR: https://github.com/openshift/route-controller-manager/pull/42
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
If there was no DHCP Network Name, the destroy code would skip deleting the DHCP resource. We now add a check for whether the VM backing the DHCP resource is in the ERROR state and, if so, delete it.
This was necessary for kuryr, which we no longer support. We should therefore stop creating machines with trunking enabled.
Please review the following PR: https://github.com/openshift/gcp-pd-csi-driver/pull/63
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Large portions of the code for setting the kargs in the agent ISO are duplicated from the implementation in assisted-image-service. By implementing an API in a-i-s comparable to the one used for ignition images in MULTIARCH-2678, we can eliminate this duplication and ensure that there is only one library that needs to be updated if anything changes in how kargs are set.
Description of problem:
For the fix of OCPBUGS-29494, only the hosted cluster was fixed, and changes to the node pool were ignored. The node pool encountered the following error:
- lastTransitionTime: "2024-05-31T09:11:40Z"
  message: 'failed to check if we manage haproxy ignition config: failed to look up image metadata for registry.ci.openshift.org/ocp/4.14-2024-05-29-171450@sha256:9b88c6e3f7802b06e5de7cd3300aaf768e85d785d0847a70b35857e6d1000d51: failed to obtain root manifest for registry.ci.openshift.org/ocp/4.14-2024-05-29-171450@sha256:9b88c6e3f7802b06e5de7cd3300aaf768e85d785d0847a70b35857e6d1000d51: unauthorized: authentication required'
  observedGeneration: 1
  reason: ValidationFailed
  status: "False"
  type: ValidMachineConfig
Version-Release number of selected component (if applicable):
4.14, 4.15, 4.16, 4.17
How reproducible:
100%
Steps to Reproduce:
1. Try to deploy a hostedCluster on a disconnected environment without explicitly setting the hypershift.openshift.io/control-plane-operator-image annotation. 2. 3.
Expected results:
Without setting the hypershift.openshift.io/control-plane-operator-image annotation, the nodepool can become ready.
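A hedged sketch of the interim workaround implied above: pinning the control-plane-operator image explicitly on the HostedCluster. The namespace and image pullspec are placeholders, not values from this report:
apiVersion: hypershift.openshift.io/v1beta1
kind: HostedCluster
metadata:
  name: example
  namespace: clusters
  annotations:
    # annotation name taken from the description above; the value must point
    # at a control-plane-operator image reachable from the disconnected environment
    hypershift.openshift.io/control-plane-operator-image: <mirrored-cpo-image-pullspec>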
This is a clone of issue OCPBUGS-41785. The following is the description of the original issue:
—
Context: https://redhat-internal.slack.com/archives/CH98TDJUD/p1682969691044039?thread_ts=1682946070.139719&cid=CH98TDJUD
If a Neutron network MTU is too small, br-ex will be set to 1280 anyway, which might be problematic if the Neutron MTU is smaller than that. We should have some validation in the installer to prevent this.
We need an MTU of 1380 at least, where 1280 is the minimum allowed for IPv6 + 100 for the OVN-Kubernetes encapsulation overhead.
Please review the following PR: https://github.com/openshift/node_exporter/pull/147
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/kubernetes/pull/1976
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
This is a clone of issue OCPBUGS-36196. The following is the description of the original issue:
—
Description of problem:
Launch CAPI based installation on Azure Government Cloud, installer was timeout when waiting for network infrastructure to become ready. 06-26 09:08:41.153 level=info msg=Waiting up to 15m0s (until 9:23PM EDT) for network infrastructure to become ready... ... 06-26 09:09:33.455 level=debug msg=E0625 21:09:31.992170 22172 azurecluster_controller.go:231] "failed to reconcile AzureCluster" err=< 06-26 09:09:33.455 level=debug msg= failed to reconcile AzureCluster service group: reconcile error that cannot be recovered occurred: resource is not Ready: The subscription '8fe0c1b4-8b05-4ef7-8129-7cf5680f27e7' could not be found.: PUT https://management.azure.com/subscriptions/8fe0c1b4-8b05-4ef7-8129-7cf5680f27e7/resourceGroups/jima26mag-9bqkl-rg 06-26 09:09:33.456 level=debug msg= -------------------------------------------------------------------------------- 06-26 09:09:33.456 level=debug msg= RESPONSE 404: 404 Not Found 06-26 09:09:33.456 level=debug msg= ERROR CODE: SubscriptionNotFound 06-26 09:09:33.456 level=debug msg= -------------------------------------------------------------------------------- 06-26 09:09:33.456 level=debug msg= { 06-26 09:09:33.456 level=debug msg= "error": { 06-26 09:09:33.456 level=debug msg= "code": "SubscriptionNotFound", 06-26 09:09:33.456 level=debug msg= "message": "The subscription '8fe0c1b4-8b05-4ef7-8129-7cf5680f27e7' could not be found." 06-26 09:09:33.456 level=debug msg= } 06-26 09:09:33.456 level=debug msg= } 06-26 09:09:33.456 level=debug msg= -------------------------------------------------------------------------------- 06-26 09:09:33.456 level=debug msg= . Object will not be requeued 06-26 09:09:33.456 level=debug msg= > logger="controllers.AzureClusterReconciler.reconcileNormal" controller="azurecluster" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AzureCluster" AzureCluster="openshift-cluster-api-guests/jima26mag-9bqkl" namespace="openshift-cluster-api-guests" reconcileID="f2ff1040-dfdd-4702-ad4a-96f6367f8774" x-ms-correlation-request-id="d22976f0-e670-4627-b6f3-e308e7f79def" name="jima26mag-9bqkl" 06-26 09:09:33.457 level=debug msg=I0625 21:09:31.992215 22172 recorder.go:104] "failed to reconcile AzureCluster: failed to reconcile AzureCluster service group: reconcile error that cannot be recovered occurred: resource is not Ready: The subscription '8fe0c1b4-8b05-4ef7-8129-7cf5680f27e7' could not be found.: PUT https://management.azure.com/subscriptions/8fe0c1b4-8b05-4ef7-8129-7cf5680f27e7/resourceGroups/jima26mag-9bqkl-rg\n--------------------------------------------------------------------------------\nRESPONSE 404: 404 Not Found\nERROR CODE: SubscriptionNotFound\n--------------------------------------------------------------------------------\n{\n \"error\": {\n \"code\": \"SubscriptionNotFound\",\n \"message\": \"The subscription '8fe0c1b4-8b05-4ef7-8129-7cf5680f27e7' could not be found.\"\n }\n}\n--------------------------------------------------------------------------------\n. 
Object will not be requeued" logger="events" type="Warning" object={"kind":"AzureCluster","namespace":"openshift-cluster-api-guests","name":"jima26mag-9bqkl","uid":"20bc01ee-5fbe-4657-9d0b-7013bd55bf96","apiVersion":"infrastructure.cluster.x-k8s.io/v1beta1","resourceVersion":"1115"} reason="ReconcileError" 06-26 09:17:40.081 level=debug msg=I0625 21:17:36.066522 22172 helpers.go:516] "returning early from secret reconcile, no update needed" logger="controllers.reconcileAzureSecret" controller="ASOSecret" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AzureCluster" AzureCluster="openshift-cluster-api-guests/jima26mag-9bqkl" namespace="openshift-cluster-api-guests" name="jima26mag-9bqkl" reconcileID="2df7c4ba-0450-42d2-901e-683de399f8d2" x-ms-correlation-request-id="b2bfcbbe-8044-472f-ad00-5c0786ebbe84" 06-26 09:23:46.611 level=debug msg=Collecting applied cluster api manifests... 06-26 09:23:46.611 level=error msg=failed to fetch Cluster: failed to generate asset "Cluster": failed to create cluster: infrastructure is not ready: client rate limiter Wait returned an error: rate: Wait(n=1) would exceed context deadline 06-26 09:23:46.611 level=info msg=Shutting down local Cluster API control plane... 06-26 09:23:46.612 level=info msg=Stopped controller: Cluster API 06-26 09:23:46.612 level=warning msg=process cluster-api-provider-azure exited with error: signal: killed 06-26 09:23:46.612 level=info msg=Stopped controller: azure infrastructure provider 06-26 09:23:46.612 level=warning msg=process cluster-api-provider-azureaso exited with error: signal: killed 06-26 09:23:46.612 level=info msg=Stopped controller: azureaso infrastructure provider 06-26 09:23:46.612 level=info msg=Local Cluster API system has completed operations 06-26 09:23:46.612 [[1;31mERROR[0;39m] Installation failed with error code '4'. Aborting execution. From above log, Azure Resource Management API endpoint is not correct, endpoint "management.azure.com" is for Azure Public cloud, the expected one for Azure Government should be "management.usgovcloudapi.net".
Version-Release number of selected component (if applicable):
4.17.0-0.nightly-2024-06-23-145410
How reproducible:
Always
Steps to Reproduce:
1. Install cluster on Azure Government Cloud, capi-based installation 2. 3.
Actual results:
Installation failed because of the wrong Azure Resource Management API endpoint used.
Expected results:
Installation succeeded.
Additional info:
This is a clone of issue OCPBUGS-39126. The following is the description of the original issue:
—
Description of problem:
It was difficult to detect in which component I should report this bug. The description is the following. Today we can install RH operators either in one precise namespace or in all namespaces, which installs the operator in the "openshift-operators" namespace. If such an operator creates a ServiceMonitor that should be scraped by platform Prometheus, it will have token authentication and security configured in its definition. But if the operator is installed in the "openshift-operators" namespace, it is user workload monitoring that will try to scrape it, since this namespace does not have the corresponding label to be scraped by platform monitoring, and we don't want it to have that label because community operators can also be installed in this namespace. The result is that user workload monitoring will scrape this namespace and the service monitors will be skipped, since they are configured with security against platform monitoring and UWM will not handle this. A possible workaround is: oc label namespace openshift-operators openshift.io/user-monitoring=false, losing functionality since some RH operators will not be monitored if installed in openshift-operators.
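Quoting the workaround from the description above as a runnable snippet (the second command only verifies the label; the trade-off noted above still applies):
oc label namespace openshift-operators openshift.io/user-monitoring=false
oc get namespace openshift-operators --show-labels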
Version-Release number of selected component (if applicable):
4.16
Please review the following PR: https://github.com/openshift/cluster-autoscaler-operator/pull/323
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
In the Safari browser, when creating an app with either the pipeline or build option, the topology shows the status in the left-hand corner of the topology (more details can be checked in the screenshot or video).
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. Create an app 2. Go to topology 3.
Actual results:
UI is distorted with build labels, not in the appropriate position
Expected results:
UI should show labels properly
Additional info:
Safari 17.4.1
Description of problem:
A breaking API change (Catalog -> ClusterCatalog) is blocking downstreaming of operator-framework/catalogd and operator-framework/operator-controller
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
Downstreaming script fails. https://prow.ci.openshift.org/?job=periodic-auto-olm-v1-downstreaming
Actual results:
Downstreaming fails.
Expected results:
Downstreaming succeeds.
Additional info:
Please review the following PR: https://github.com/openshift/cloud-provider-openstack/pull/281
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/cluster-dns-operator/pull/413
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
Need to add the networking-console-plugin image to the OCP 4.17 release payload, so it can be consumed in the hosted CNO.
Version-Release number of selected component (if applicable):
4.17.0
How reproducible:
100%
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
done at: https://github.com/openshift/cluster-network-operator/pull/2474
This is a clone of issue OCPBUGS-32773. The following is the description of the original issue:
—
Description of problem:
In the OpenShift WebConsole, when using the Instantiate Template screen, the values entered into the form are automatically cleared. This issue occurs for users with developer roles who do not have administrator privileges, but does not occur for users with the cluster-admin cluster role. Additionally, using the developer tools of the web browser, I observed the following console logs when the values were cleared: https://console-openshift-console.apps.mmatsuta-blue.apac.aws.cee.support/api/prometheus/api/v1/rules 403 (Forbidden) https://console-openshift-console.apps.mmatsuta-blue.apac.aws.cee.support/api/alertmanager/api/v2/silences 403 (Forbidden) It appears that a script attempting to fetch information periodically from PrometheusRule and Alertmanager's silences encounters a 403 error due to insufficient permissions, which causes the script to halt and the values in the form to be reset and cleared. This bug prevents users from successfully creating instances from templates in the WebConsole.
Version-Release number of selected component (if applicable):
4.15 4.14
How reproducible:
YES
Steps to Reproduce:
1. Log in with a non-administrator account. 2. Select a template from the developer catalog and click on Instantiate Template. 3. Enter values into the initially empty form. 4. Wait for several seconds, and the entered values will disappear.
Actual results:
Entered values disappear.
Expected results:
Entered values remain displayed.
Additional info:
I could not find the appropriate component to report this issue. I reluctantly chose Dev Console, but please adjust it to the correct component.
Description of problem:
While investigating a problem with OpenShift Container Platform 4 - Node scaling, I found the below messages reported in my OpenShift Container Platform 4 - Cluster. E0513 11:15:09.331353 1 orchestrator.go:450] Couldn't get autoscaling options for ng: MachineSet/openshift-machine-api/test-12345-batch-amd64-us-east-2c E0513 11:15:09.331365 1 orchestrator.go:450] Couldn't get autoscaling options for ng: MachineSet/openshift-machine-api/test-12345-batch-arm64-us-east-2c I0513 11:15:09.331529 1 orchestrator.go:546] Pod project-100/curl-67f84bd857-h92wb can't be scheduled on MachineSet/openshift-machine-api/test-12345-batch-amd64-us-east-2c, predicate checking error: node(s) didn't match Pod's node affinity/selector; predicateName=NodeAffinity; reasons: node(s) didn't match Pod's node affinity/selector; debugInfo= I0513 11:15:09.331684 1 orchestrator.go:157] No pod can fit to MachineSet/openshift-machine-api/test-12345-batch-amd64-us-east-2c E0513 11:15:09.332076 1 orchestrator.go:507] Failed to get autoscaling options for node group MachineSet/openshift-machine-api/test-12345-batch-arm64-us-east-2c: Not implemented I0513 11:15:09.332100 1 orchestrator.go:185] Best option to resize: MachineSet/openshift-machine-api/test-12345-batch-arm64-us-east-2c I0513 11:15:09.332110 1 orchestrator.go:189] Estimated 1 nodes needed in MachineSet/openshift-machine-api/test-12345-batch-arm64-us-east-2c I0513 11:15:09.332135 1 orchestrator.go:295] Final scale-up plan: [{MachineSet/openshift-machine-api/test-12345-batch-arm64-us-east-2c 0->1 (max: 12)}] The same events are reported in must-gather reviewed from customers. Given that we have https://github.com/kubernetes/autoscaler/issues/6037 and https://github.com/kubernetes/autoscaler/issues/6676 that appear to be solved via https://github.com/kubernetes/autoscaler/pull/6677 and https://github.com/kubernetes/autoscaler/pull/6038 I'm wondering whether we should pull in those changes as they seem to eventually impact automated scaling of OpenShift Container Platform 4 - Node(s).
Version-Release number of selected component (if applicable):
OpenShift Container Platform 4.15
How reproducible:
Always
Steps to Reproduce:
1. Setup OpenShift Container Platform 4 with ClusterAutoscaler configured 2. Trigger scaling activity and verify the cluster-autoscaler-default logs
Actual results:
Logs like the below are being reported. E0513 11:15:09.331353 1 orchestrator.go:450] Couldn't get autoscaling options for ng: MachineSet/openshift-machine-api/test-12345-batch-amd64-us-east-2c E0513 11:15:09.331365 1 orchestrator.go:450] Couldn't get autoscaling options for ng: MachineSet/openshift-machine-api/test-12345-batch-arm64-us-east-2c I0513 11:15:09.331529 1 orchestrator.go:546] Pod project-100/curl-67f84bd857-h92wb can't be scheduled on MachineSet/openshift-machine-api/test-12345-batch-amd64-us-east-2c, predicate checking error: node(s) didn't match Pod's node affinity/selector; predicateName=NodeAffinity; reasons: node(s) didn't match Pod's node affinity/selector; debugInfo= I0513 11:15:09.331684 1 orchestrator.go:157] No pod can fit to MachineSet/openshift-machine-api/test-12345-batch-amd64-us-east-2c E0513 11:15:09.332076 1 orchestrator.go:507] Failed to get autoscaling options for node group MachineSet/openshift-machine-api/test-12345-batch-arm64-us-east-2c: Not implemented I0513 11:15:09.332100 1 orchestrator.go:185] Best option to resize: MachineSet/openshift-machine-api/test-12345-batch-arm64-us-east-2c I0513 11:15:09.332110 1 orchestrator.go:189] Estimated 1 nodes needed in MachineSet/openshift-machine-api/test-12345-batch-arm64-us-east-2c I0513 11:15:09.332135 1 orchestrator.go:295] Final scale-up plan: [{MachineSet/openshift-machine-api/test-12345-batch-arm64-us-east-2c 0->1 (max: 12)}]
Expected results:
Scale-up of OpenShift Container Platform 4 - Node to happen without error being reported I0513 11:15:09.331529 1 orchestrator.go:546] Pod project-100/curl-67f84bd857-h92wb can't be scheduled on MachineSet/openshift-machine-api/test-12345-batch-amd64-us-east-2c, predicate checking error: node(s) didn't match Pod's node affinity/selector; predicateName=NodeAffinity; reasons: node(s) didn't match Pod's node affinity/selector; debugInfo= I0513 11:15:09.331684 1 orchestrator.go:157] No pod can fit to MachineSet/openshift-machine-api/test-12345-batch-amd64-us-east-2c I0513 11:15:09.332100 1 orchestrator.go:185] Best option to resize: MachineSet/openshift-machine-api/test-12345-batch-arm64-us-east-2c I0513 11:15:09.332110 1 orchestrator.go:189] Estimated 1 nodes needed in MachineSet/openshift-machine-api/test-12345-batch-arm64-us-east-2c I0513 11:15:09.332135 1 orchestrator.go:295] Final scale-up plan: [{MachineSet/openshift-machine-api/test-12345-batch-arm64-us-east-2c 0->1 (max: 12)}]
Additional info:
Please review https://github.com/kubernetes/autoscaler/issues/6037 and https://github.com/kubernetes/autoscaler/issues/6676 as they seem to document the problem and also have a solution linked/merged
Please review the following PR: https://github.com/openshift/cluster-api-provider-baremetal/pull/217
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/telemeter/pull/533
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
The image registry operator and ingress operator use the `/metrics` endpoint for liveness/readiness probes, which in the case of the former results in a payload of ~100kb. At scale this can be non-performant, and it is also not best practice. The teams which own these operators should instead introduce health endpoints if these probes are needed.
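A hedged sketch of the suggested direction: probe a lightweight health endpoint instead of /metrics. The container name, path, and port below are illustrative assumptions, not the operators' actual configuration:
containers:
- name: operator
  livenessProbe:
    httpGet:
      path: /healthz   # assumed dedicated health endpoint, replacing the /metrics probe
      port: 8081
  readinessProbe:
    httpGet:
      path: /healthz
      port: 8081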
Please review the following PR: https://github.com/openshift/images/pull/182
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
This is a clone of issue OCPBUGS-38272. The following is the description of the original issue:
—
Description of problem:
When user changes Infrastructure object, e.g. adds a new vCenter, the operator generates a new driver config (Secret named vsphere-csi-config-secret), but the controller pods are not restarted and use the old config.
Version-Release number of selected component (if applicable):
4.17.0-0.nightly *after* 2024-08-09-031511
How reproducible: always
Steps to Reproduce:
Actual results: the controller pods are not restarted
Expected results: the controller pods are restarted
This is a clone of issue OCPBUGS-39246. The following is the description of the original issue:
—
Description of problem:
Alerts with non-standard severity labels are sent to Telemeter.
Version-Release number of selected component (if applicable):
All supported versions
How reproducible:
Always
Steps to Reproduce:
1. Create an always firing alerting rule with severity=foo. 2. Make sure that telemetry is enabled for the cluster. 3.
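Step 1 can be reproduced with a minimal rule like this sketch (the rule name, group name, and namespace are arbitrary choices, not taken from this report):
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: always-firing-example
  namespace: openshift-monitoring
spec:
  groups:
  - name: example
    rules:
    - alert: AlwaysFiringExample
      expr: vector(1)          # always evaluates to a firing alert
      labels:
        severity: foo          # non-standard severity that should be dropped before Telemeter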
Actual results:
The alert can be seen on the telemeter server side.
Expected results:
The alert is dropped by the telemeter allow-list.
Additional info:
Red Hat operators should use standard severities: https://github.com/openshift/enhancements/blob/master/enhancements/monitoring/alerting-consistency.md#style-guide Looking at the current data, it looks like ~2% of the alerts reported to Telemeter have an invalid severity.
AC:
Please review the following PR: https://github.com/openshift/image-registry/pull/399
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
When creating a HostedCluster that requires a request serving node larger than the ones for which we have placeholders (say a >93 node cluster in ROSA), in some cases the cluster creation does not succeed.
Version-Release number of selected component (if applicable):
HyperShift operator 0f9f686
How reproducible:
Sometimes
Steps to Reproduce:
1. In request serving scaling management cluster, create a cluster with a size that is greater than that for which we have placeholders. 2. Wait for the hosted cluster to schedule and run
Actual results:
The hosted cluster never schedules
Expected results:
The hosted cluster scales and comes up
Additional info:
It is more likely that this occurs when there are many existing hosted clusters on the management cluster already.
Please review the following PR: https://github.com/openshift/ibm-vpc-block-csi-driver/pull/67
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
This is a clone of issue OCPBUGS-42196. The following is the description of the original issue:
—
Description of problem:
When setting .spec.storage.azure.networkAccess.type: Internal (without providing vnet and subnet names), the image registry will attempt to discover the vnet by tag. Previous to the installer switching to cluster-api, the vnet tagging happened here: https://github.com/openshift/installer/blob/10951c555dec2f156fad77ef43b9fb0824520015/pkg/asset/cluster/azure/azure.go#L79-L92. After the switch to cluster-api, this code no longer seems to be in use, so the tags are no longer there. From inspection of a failed job, the new tags in use seem to be in the form of `sigs.k8s.io_cluster-api-provider-azure_cluster_$infraID` instead of the previous `kubernetes.io_cluster.$infraID`. Image registry operator code responsible for this: https://github.com/openshift/cluster-image-registry-operator/blob/master/pkg/storage/azure/azure.go?plain=1#L678-L682 More details in slack discussion with installer team: https://redhat-internal.slack.com/archives/C68TNFWA2/p1726732108990319
Version-Release number of selected component (if applicable):
4.17, 4.18
How reproducible:
Always
Steps to Reproduce:
1. Get an Azure 4.17 or 4.18 cluster 2. oc edit configs.imageregistry/cluster 3. set .spec.storage.azure.networkAccess.type to Internal
Actual results:
The operator cannot find the vnet (look for "not found" in operator logs)
Expected results:
The operator should be able to find the vnet by tag and configure the storage account as private
Additional info:
If we make the switch to look for vnet tagged with `sigs.k8s.io_cluster-api-provider-azure_cluster_$infraID`, one thing that needs to be tested is BYO vnet/subnet clusters. What I have currently observed in CI is that the cluster has the new tag key with `owned` value, but for BYO networks the value *should* be `shared`, but I have not tested it. --- Although this bug is a regression, I'm not going to mark it as such because this affects a fairly new feature (introduced on 4.15), and there's a very easy workaround (manually setting the vnet and subnet names when configuring network access to internal).
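A hedged sketch of the workaround mentioned above, naming the network resources explicitly instead of relying on tag-based discovery; the field names under "internal" are assumptions based on the operator API and should be verified against the installed CRD:
# partial spec for configs.imageregistry/cluster
spec:
  storage:
    azure:
      networkAccess:
        type: Internal
        internal:
          networkResourceGroupName: <network-resource-group>
          vnetName: <vnet-name>
          subnetName: <subnet-name>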
This is a clone of issue OCPBUGS-39375. The following is the description of the original issue:
—
Description of problem:
Given 2 images with different names, but same layers, "oc image mirror" will only mirror 1 of them. For example: $ cat images.txt quay.io/openshift/community-e2e-images:e2e-33-registry-k8s-io-e2e-test-images-resource-consumer-1-13-LT0C2W4wMzShSeGS quay.io/bertinatto/test-images:e2e-33-registry-k8s-io-e2e-test-images-resource-consumer-1-13-LT0C2W4wMzShSeGS quay.io/openshift/community-e2e-images:e2e-31-registry-k8s-io-e2e-test-images-resource-consumer-1-13-LT0C2W4wMzShSeGS quay.io/bertinatto/test-images:e2e-31-registry-k8s-io-e2e-test-images-resource-consumer-1-13-LT0C2W4wMzShSeGS $ oc image mirror -f images.txt quay.io/ bertinatto/test-images manifests: sha256:298dcd808e27fbf96614e4c6f06730f22964dce41dcdc7bf21096c42411ba773 -> e2e-33-registry-k8s-io-e2e-test-images-resource-consumer-1-13-LT0C2W4wMzShSeGS stats: shared=0 unique=0 size=0B phase 0: quay.io bertinatto/test-images blobs=0 mounts=0 manifests=1 shared=0 info: Planning completed in 2.6s sha256:298dcd808e27fbf96614e4c6f06730f22964dce41dcdc7bf21096c42411ba773 quay.io/bertinatto/test-images:e2e-33-registry-k8s-io-e2e-test-images-resource-consumer-1-13-LT0C2W4wMzShSeGS info: Mirroring completed in 240ms (0B/s)
Version-Release number of selected component (if applicable):
4.18
How reproducible:
Always
Steps to Reproduce:
1. 2. 3.
Actual results:
Only one of the images were mirrored.
Expected results:
Both images should be mirrored.
Additional info:
Description of problem:
Pending CSRs in a PowerVS HyperShift cluster cause the monitoring CO to not become available. Upon investigation, the providerID format set by CAPI is improper.
Version-Release number of selected component (if applicable):
4.17.0
How reproducible:
100%
Steps to Reproduce:
1. Create a cluster with HyperShift on PowerVS with the 4.17 release image. 2. 3.
Actual results:
Pending CSRs and monitoring CO in unavailable state
Expected results:
No pending CSRs and all CO should be available
Additional info:
Description of problem:
Sometimes a DNS name configured in an EgressFirewall is not resolved.
Version-Release number of selected component (if applicable):
Using the build from openshift/cluster-network-operator#2131
How reproducible:
Steps to Reproduce:
% for i in {1..7};do oc create ns test$i;oc create -f data/egressfirewall/eg_policy_wildcard.yaml -n test$i; oc create -f data/list-for-pod.json -n test$i;sleep 1;done namespace/test1 created egressfirewall.k8s.ovn.org/default created replicationcontroller/test-rc created service/test-service created namespace/test2 created egressfirewall.k8s.ovn.org/default created replicationcontroller/test-rc created service/test-service created namespace/test3 created egressfirewall.k8s.ovn.org/default created replicationcontroller/test-rc created service/test-service created namespace/test4 created egressfirewall.k8s.ovn.org/default created replicationcontroller/test-rc created service/test-service created namespace/test5 created egressfirewall.k8s.ovn.org/default created replicationcontroller/test-rc created service/test-service created namespace/test6 created egressfirewall.k8s.ovn.org/default created replicationcontroller/test-rc created service/test-service created namespace/test7 created egressfirewall.k8s.ovn.org/default created replicationcontroller/test-rc created service/test-service created % cat data/egressfirewall/eg_policy_wildcard.yaml kind: EgressFirewall apiVersion: k8s.ovn.org/v1 metadata: name: default spec: egress: - type: Allow to: dnsName: "*.google.com" - type: Deny to: cidrSelector: 0.0.0.0/0 Then I created namespace test8, created egressfirewall and updated dns anme,it worked well. Then I deleted test8 After that I created namespace test11 as below steps, the issue happened again. % oc create ns test11 namespace/test11 created % oc create -f data/list-for-pod.json -n test11 replicationcontroller/test-rc created service/test-service created % oc create -f data/egressfirewall/eg_policy_dnsname1.yaml -n test11 egressfirewall.k8s.ovn.org/default created % oc get egressfirewall -n test11 NAME EGRESSFIREWALL STATUS default EgressFirewall Rules applied % oc get egressfirewall -n test11 -o yaml apiVersion: v1 items: - apiVersion: k8s.ovn.org/v1 kind: EgressFirewall metadata: creationTimestamp: "2024-05-16T05:32:07Z" generation: 1 name: default namespace: test11 resourceVersion: "101288" uid: 18e60759-48bf-4337-ac06-2e3252f1223a spec: egress: - to: dnsName: registry-1.docker.io type: Allow - ports: - port: 80 protocol: TCP to: dnsName: www.facebook.com type: Allow - to: cidrSelector: 0.0.0.0/0 type: Deny status: messages: - 'hrw-0516i-d884f-worker-a-m7769: EgressFirewall Rules applied' - 'hrw-0516i-d884f-master-0.us-central1-b.c.openshift-qe.internal: EgressFirewall Rules applied' - 'hrw-0516i-d884f-worker-b-q4fsm: EgressFirewall Rules applied' - 'hrw-0516i-d884f-master-1.us-central1-c.c.openshift-qe.internal: EgressFirewall Rules applied' - 'hrw-0516i-d884f-master-2.us-central1-f.c.openshift-qe.internal: EgressFirewall Rules applied' - 'hrw-0516i-d884f-worker-c-4kvgr: EgressFirewall Rules applied' status: EgressFirewall Rules applied kind: List metadata: resourceVersion: "" % oc get pods -n test11 NAME READY STATUS RESTARTS AGE test-rc-ffg4g 1/1 Running 0 61s test-rc-lw4r8 1/1 Running 0 61s % oc rsh -n test11 test-rc-ffg4g ~ $ curl registry-1.docker.io -I ^C ~ $ curl www.facebook.com ^C ~ $ ~ $ curl www.facebook.com --connect-timeout 5 curl: (28) Failed to connect to www.facebook.com port 80 after 2706 ms: Operation timed out ~ $ curl registry-1.docker.io --connect-timeout 5 curl: (28) Failed to connect to registry-1.docker.io port 80 after 4430 ms: Operation timed out ~ $ ^C ~ $ exit command terminated with exit code 130 % oc get dnsnameresolver -n openshift-ovn-kubernetes NAME AGE 
dns-67b687cfb5 7m47s dns-696b6747d9 2m12s dns-b6c74f6f4 2m12s % oc get dnsnameresolver dns-696b6747d9 -n openshift-ovn-kubernetes -o yaml apiVersion: network.openshift.io/v1alpha1 kind: DNSNameResolver metadata: creationTimestamp: "2024-05-16T05:32:07Z" generation: 1 name: dns-696b6747d9 namespace: openshift-ovn-kubernetes resourceVersion: "101283" uid: a8546ad8-b16d-4d81-a943-46bdd0d82aa5 spec: name: www.facebook.com. % oc get dnsnameresolver dns-696b6747d9 -n openshift-ovn-kubernetes -o yaml apiVersion: network.openshift.io/v1alpha1 kind: DNSNameResolver metadata: creationTimestamp: "2024-05-16T05:32:07Z" generation: 1 name: dns-696b6747d9 namespace: openshift-ovn-kubernetes resourceVersion: "101283" uid: a8546ad8-b16d-4d81-a943-46bdd0d82aa5 spec: name: www.facebook.com. % oc get dnsnameresolver dns-696b6747d9 -n openshift-ovn-kubernetes -o yaml apiVersion: network.openshift.io/v1alpha1 kind: DNSNameResolver metadata: creationTimestamp: "2024-05-16T05:32:07Z" generation: 1 name: dns-696b6747d9 namespace: openshift-ovn-kubernetes resourceVersion: "101283" uid: a8546ad8-b16d-4d81-a943-46bdd0d82aa5 spec: name: www.facebook.com.
Actual results:
The DNS names configured in the EgressFirewall (e.g. www.facebook.com) did not get resolved to IPs.
Expected results:
EgressFirewall works as expected.
Additional info:
Refactor the name to Dockerfile.ocp as a better, version-independent alternative.
Upstream Issue: https://github.com/kubernetes/kubernetes/issues/125370
Sometimes the PodIP field can be blank or empty.
Description of problem:
The */network-status annotation does not reflect multiple interfaces
Version-Release number of selected component (if applicable):
latest release
How reproducible:
Always
Steps to Reproduce:
https://gist.github.com/dougbtv/1eb8ac2d61d494b56d65a6b236a86e61
Description of problem: After changing the value of enable_topology in the openshift-config/cloud-provider-config config map, the CSI controller pods should restart to pick up the new value. This is not happening.
It seems like our understanding in https://github.com/openshift/openstack-cinder-csi-driver-operator/pull/127#issuecomment-1780967488 was wrong.
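A rough way to observe the problem, assuming the config map and namespace names referenced above (the watch only checks whether the controller pods roll after the edit):
oc -n openshift-config edit configmap cloud-provider-config   # flip the enable_topology value
oc -n openshift-cluster-csi-drivers get pods -w               # controller pods are expected to restart, but currently do not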
Please review the following PR: https://github.com/openshift/aws-ebs-csi-driver/pull/268
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
[AWS] securityGroups and subnet are not consistent between the machine YAML and the AWS console. There is no security group huliu-aws531d-vlzbw-master-sg for masters on the AWS console, but it shows in the master machines' YAML. There is no security group huliu-aws531d-vlzbw-worker-sg for workers on the AWS console, but it shows in the worker machines' YAML. There is no subnet huliu-aws531d-vlzbw-private-us-east-2a for masters and workers on the AWS console, but it shows in the master and worker machines' YAML.
Version-Release number of selected component (if applicable):
4.16.0-0.nightly-2024-05-30-130713 This happens in the latest 4.16(CAPI) AWS cluster
How reproducible:
Always
Steps to Reproduce:
1. Install a AWS 4.16 cluster liuhuali@Lius-MacBook-Pro huali-test % oc get clusterversion NAME VERSION AVAILABLE PROGRESSING SINCE STATUS version 4.16.0-0.nightly-2024-05-30-130713 True False 46m Cluster version is 4.16.0-0.nightly-2024-05-30-130713 liuhuali@Lius-MacBook-Pro huali-test % oc get machine NAME PHASE TYPE REGION ZONE AGE huliu-aws531d-vlzbw-master-0 Running m6i.xlarge us-east-2 us-east-2a 65m huliu-aws531d-vlzbw-master-1 Running m6i.xlarge us-east-2 us-east-2b 65m huliu-aws531d-vlzbw-master-2 Running m6i.xlarge us-east-2 us-east-2c 65m huliu-aws531d-vlzbw-worker-us-east-2a-swwmk Running m6i.xlarge us-east-2 us-east-2a 62m huliu-aws531d-vlzbw-worker-us-east-2b-f2gw9 Running m6i.xlarge us-east-2 us-east-2b 62m huliu-aws531d-vlzbw-worker-us-east-2c-x6gbz Running m6i.xlarge us-east-2 us-east-2c 62m 2.Check the machines yaml, there are 4 securityGroups and 2 subnet value for master machines, 3 securityGroups and 2 subnet value for worker machines. But check on aws console, only 3 securityGroups and 1 subnet value for masters, 2 securityGroups and 1 subnet value for workers. liuhuali@Lius-MacBook-Pro huali-test % oc get machine huliu-aws531d-vlzbw-master-0 -oyaml … securityGroups: - filters: - name: tag:Name values: - huliu-aws531d-vlzbw-master-sg - filters: - name: tag:Name values: - huliu-aws531d-vlzbw-node - filters: - name: tag:Name values: - huliu-aws531d-vlzbw-lb - filters: - name: tag:Name values: - huliu-aws531d-vlzbw-controlplane subnet: filters: - name: tag:Name values: - huliu-aws531d-vlzbw-private-us-east-2a - huliu-aws531d-vlzbw-subnet-private-us-east-2a … https://drive.google.com/file/d/1YyPQjSCXOm-1gbD3cwktDQQJter6Lnk4/view?usp=sharing https://drive.google.com/file/d/1MhRIm8qIZWXdL9-cDZiyu0TOTFLKCAB6/view?usp=sharing https://drive.google.com/file/d/1Qo32mgBerWp5z6BAVNqBxbuH5_4sRuBv/view?usp=sharing https://drive.google.com/file/d/1seqwluMsPEFmwFL6pTROHYyJ_qPc0cCd/view?usp=sharing liuhuali@Lius-MacBook-Pro huali-test % oc get machine huliu-aws531d-vlzbw-worker-us-east-2a-swwmk -oyaml … securityGroups: - filters: - name: tag:Name values: - huliu-aws531d-vlzbw-worker-sg - filters: - name: tag:Name values: - huliu-aws531d-vlzbw-node - filters: - name: tag:Name values: - huliu-aws531d-vlzbw-lb subnet: filters: - name: tag:Name values: - huliu-aws531d-vlzbw-private-us-east-2a - huliu-aws531d-vlzbw-subnet-private-us-east-2a … https://drive.google.com/file/d/1FM7dxfSK0CGnm81dQbpWuVz1ciw9hgpq/view?usp=sharing https://drive.google.com/file/d/1QClWivHeGGhxK7FdBUJnGu-vHylqeg5I/view?usp=sharing https://drive.google.com/file/d/12jgyFfyP8fTzQu5wRoEa6RrXbYt_Gxm1/view?usp=sharing
Actual results:
securityGroups and subnet are not consistent between the machine YAML and the AWS console.
Expected results:
securityGroups and subnet should be consistent between the machine YAML and the AWS console.
Additional info:
Description of problem:
GCP private cluster with CCO Passthrough mode failed to install due to CCO degraded. status: conditions: - lastTransitionTime: "2024-06-24T06:04:39Z" message: 1 of 7 credentials requests are failing to sync. reason: CredentialsFailing status: "True" type: Degraded
Version-Release number of selected component (if applicable):
4.13.0-0.nightly-2024-06-21-203120
How reproducible:
Always
Steps to Reproduce:
1.Create GCP private cluster with CCO Passthrough mode, flexy template is private-templates/functionality-testing/aos-4_13/ipi-on-gcp/versioned-installer-xpn-private 2.Wait for cluster installation
Actual results:
jianpingshu@jshu-mac ~ % oc get clusterversionNAME VERSION AVAILABLE PROGRESSING SINCE STATUSversion False False 23m Error while reconciling 4.13.0-0.nightly-2024-06-21-203120: the cluster operator cloud-credential is degraded status: conditions: - lastTransitionTime: "2024-06-24T06:04:39Z" message: 1 of 7 credentials requests are failing to sync. reason: CredentialsFailing status: "True" type: Degraded jianpingshu@jshu-mac ~ % oc -n openshift-cloud-credential-operator get -o json credentialsrequests | jq -r '.items[] | select(tostring | contains("InfrastructureMismatch") | not) | .metadata.name as $n | .status.conditions // [{type: "NoConditions"}] | .[] | .type + "=" + .status + " " + $n + " " + .reason + ": " + .message' | sortCredentialsProvisionFailure=True cloud-credential-operator-gcp-ro-creds CredentialsProvisionFailure: failed to grant creds: error while validating permissions: error testing permissions: googleapi: Error 400: Permission commerceoffercatalog.agreements.list is not valid for this resource., badRequest NoConditions= openshift-cloud-network-config-controller-gcp : NoConditions= openshift-gcp-ccm : NoConditions= openshift-gcp-pd-csi-driver-operator : NoConditions= openshift-image-registry-gcs : NoConditions= openshift-ingress-gcp : NoConditions= openshift-machine-api-gcp :
Expected results:
Cluster is installed successfully without the cloud-credential operator being degraded.
Additional info:
Some problem PROW CI tests: https://qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gs/qe-private-deck/logs/periodic-ci-openshift-openshift-tests-private-release-4.14-multi-nightly-gcp-ipi-user-labels-tags-filestore-csi-tp-arm-f14/1805064266043101184 https://qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gs/qe-private-deck/logs/periodic-ci-openshift-openshift-tests-private-release-4.14-amd64-nightly-4.14-upgrade-from-stable-4.13-gcp-ipi-xpn-fips-f28/1804676149503070208
This is a clone of issue OCPBUGS-43378. The following is the description of the original issue:
—
In https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-shiftstack-ci-release-4.18-e2e-openstack-ovn-etcd-scaling/1834144693181485056 I noticed the following panic:
Undiagnosed panic detected in pod expand_less 0s { pods/openshift-monitoring_prometheus-k8s-1_prometheus_previous.log.gz:ts=2024-09-12T09:30:09.273Z caller=klog.go:124 level=error component=k8s_client_runtime func=Errorf msg="Observed a panic: &runtime.TypeAssertionError{_interface:(*abi.Type)(0x3180480), concrete:(*abi.Type)(0x34a31c0), asserted:(*abi.Type)(0x3a0ac40), missingMethod:\"\"} (interface conversion: interface {} is cache.DeletedFinalStateUnknown, not *v1.Node)\ngoroutine 13218 [running]:\nk8s.io/apimachinery/pkg/util/runtime.logPanic({0x32f1080, 0xc05be06840})\n\t/go/src/github.com/prometheus/prometheus/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:75 +0x90\nk8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0xc010ef6000?})\n\t/go/src/github.com/prometheus/prometheus/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:49 +0x6b\npanic({0x32f1080?, 0xc05be06840?})\n\t/usr/lib/golang/src/runtime/panic.go:770 +0x132\ngithub.com/prometheus/prometheus/discovery/kubernetes.NewEndpoints.func11({0x34a31c0?, 0xc05bf3a580?})\n\t/go/src/github.com/prometheus/prometheus/discovery/kubernetes/endpoints.go:170 +0x4e\nk8s.io/client-go/tools/cache.ResourceEventHandlerFuncs.OnDelete(...)\n\t/go/src/github.com/prometheus/prometheus/vendor/k8s.io/client-go/tools/cache/controller.go:253\nk8s.io/client-go/tools/cache.(*processorListener).run.func1()\n\t/go/src/github.com/prometheus/prometheus/vendor/k8s.io/client-go/tools/cache/shared_informer.go:977 +0x9f\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0x30?)\n\t/go/src/github.com/prometheus/prometheus/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:226 +0x33\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc00fc92f70, {0x456ed60, 0xc031a6ba10}, 0x1, 0xc015a04fc0)\n\t/go/src/github.com/prometheus/prometheus/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:227 +0xaf\nk8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc011678f70, 0x3b9aca00, 0x0, 0x1, 0xc015a04fc0)\n\t/go/src/github.com/prometheus/prometheus/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:204 +0x7f\nk8s.io/apimachinery/pkg/util/wait.Until(...)\n\t/go/src/github.com/prometheus/prometheus/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:161\nk8s.io/client-go/tools/cache.(*processorListener).run(0xc04c607440)\n\t/go/src/github.com/prometheus/prometheus/vendor/k8s.io/client-go/tools/cache/shared_informer.go:966 +0x69\nk8s.io/apimachinery/pkg/util/wait.(*Group).Start.func1()\n\t/go/src/github.com/prometheus/prometheus/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:72 +0x52\ncreated by k8s.io/apimachinery/pkg/util/wait.(*Group).Start in goroutine 12933\n\t/go/src/github.com/prometheus/prometheus/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:70 +0x73\n"}
This issue seems relatively common on OpenStack; these runs very frequently fail with this panic.
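The panic comes from an informer delete callback type-asserting the object directly while client-go can also hand back a cache.DeletedFinalStateUnknown tombstone when the watch missed the delete. A minimal, hypothetical sketch of the usual defensive pattern (not the actual Prometheus patch) looks like this:

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/tools/cache"
)

// onDeleteNode unwraps the tombstone before asserting the concrete type, so a
// DeletedFinalStateUnknown delivery no longer triggers an interface-conversion panic.
func onDeleteNode(obj interface{}) {
	node, ok := obj.(*corev1.Node)
	if !ok {
		tombstone, ok := obj.(cache.DeletedFinalStateUnknown)
		if !ok {
			return // neither a Node nor a tombstone; ignore
		}
		node, ok = tombstone.Obj.(*corev1.Node)
		if !ok {
			return // tombstone wrapped something other than a Node
		}
	}
	fmt.Printf("node %s deleted\n", node.Name)
}

func main() {
	// Normally registered via informer.AddEventHandler(cache.ResourceEventHandlerFuncs{DeleteFunc: onDeleteNode});
	// here we just simulate the tombstone delivery seen in the log above.
	onDeleteNode(cache.DeletedFinalStateUnknown{Key: "node-a", Obj: &corev1.Node{}})
}
```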
Linked test name: Undiagnosed panic detected in pod
Description of problem:
The OpenShift Assisted Installer reports a Dell PowerEdge C6615 node's four 960GB SATA solid state disks as removable and subsequently refuses to continue installing OpenShift on at least one of those disks. This is an issue whereby the OpenShift agent installer reports installed SATA SSDs as removable and refuses to use any of them as installation targets. The Linux kernel reports: sd 4:0:0:0 [sdb] Attached SCSI removable disk sd 5:0:0:0 [sdc] Attached SCSI removable disk sd 6:0:0:0 [sdd] Attached SCSI removable disk sd 3:0:0:0 [sda] Attached SCSI removable disk Each removable disk is clean, 894.3GiB free space, no partitions, etc. However: Insufficient - This host does not meet the minimum hardware or networking requirements and will not be included in the cluster. Hardware: Failed Warning alert: Insufficient Minimum disks of required size: No eligible disks were found, please check specific disks to see why they are not eligible.
Version-Release number of selected component (if applicable):
4.15.z
How reproducible:
100 %
Steps to Reproduce:
1. Install with the Assisted Installer 2. Generate the ISO using the console option 3. Boot the ISO on the Dell hardware mentioned in the description 4. Observe journal logs for the disk validations
Actual results:
Installation fails at disk validation
Expected results:
Installation should complete
Additional info:
Description of problem:
Because the 4.17 branching has not happened yet, the mce-2.7 Konflux application cannot merge the .tekton pipeline.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
This is a clone of issue OCPBUGS-38070. The following is the description of the original issue:
—
Description of problem:
Create cluster with publish:Mixed by using CAPZ, 1. publish: Mixed + apiserver: Internal install-config: ================= publish: Mixed operatorPublishingStrategy: apiserver: Internal ingress: External In this case, api dns should not be created in public dns zone, but it was created. ================== $ az network dns record-set cname show --name api.jima07api --resource-group os4-common --zone-name qe.azure.devcluster.openshift.com { "TTL": 300, "etag": "6b13d901-07d1-4cd8-92de-8f3accd92a19", "fqdn": "api.jima07api.qe.azure.devcluster.openshift.com.", "id": "/subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/os4-common/providers/Microsoft.Network/dnszones/qe.azure.devcluster.openshift.com/CNAME/api.jima07api", "metadata": {}, "name": "api.jima07api", "provisioningState": "Succeeded", "resourceGroup": "os4-common", "targetResource": {}, "type": "Microsoft.Network/dnszones/CNAME" } 2. publish: Mixed + ingress: Internal install-config: ============= publish: Mixed operatorPublishingStrategy: apiserver: External ingress: Internal In this case, load balance rule on port 6443 should be created in external load balancer, but it could not be found. ================ $ az network lb rule list --lb-name jima07ingress-krf5b -g jima07ingress-krf5b-rg []
Version-Release number of selected component (if applicable):
4.17 nightly build
How reproducible:
Always
Steps to Reproduce:
1. Specify publish: Mixed plus mixed External/Internal settings for api/ingress 2. Create the cluster 3. Check that public DNS records and load balancer rules in the internal/external load balancers are created as expected
Actual results:
See description: some resources are created unexpectedly and others are missing.
Expected results:
Public DNS records and load balancer rules in the internal/external load balancers are created as expected, based on the settings in install-config.
Additional info:
Description of problem:
The ingress cluster capability has been introduced in OCP 4.16 (https://github.com/openshift/enhancements/pull/1415). It includes the cluster ingress operator and all its controllers. If the ingress capability is disabled all the routes of the cluster become unavailable (no router to back them up). The console operator heavily depends on the working (admitted/active) routes to do the health checks, configure the authentication flows, client downloads, etc. The console operator goes degraded if the routes are not served by a router. The console operator needs to be able to tolerate the absence of the ingress capability.
Version-Release number of selected component (if applicable):
4.16
How reproducible:
Always
Steps to Reproduce:
1. Create ROSA HCP cluster. 2. Scale the default ingresscontroller to 0: oc -n openshift-ingress-operator patch ingresscontroller default --type='json' -p='[{"op": "replace", "path": "/spec/replicas", "value":0}]' 3. Check the status of console cluster operator: oc get co console
Actual results:
$ oc get co console NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE console 4.16.0 False False False 53s RouteHealthAvailable: failed to GET route (https://console-openshift-console.apps.49e4812b7122bc833b72.hypershift.aws-2.ci.openshift.org): Get "https://console-openshift-console.apps.49e4812b7122bc833b72.hypershift.aws-2.ci.openshift.org": EOF
Expected results:
$ oc get co console NAME VERSION AVAILABLE PROGRESSING DEGRADED console 4.16.0 True False False
Additional info:
The ingress capability cannot be disabled on standalone OpenShift (when the payload is managed by the ClusterVersionOperator). Only clusters managed by HyperShift with a HostedControlPlane are impacted.
Description of problem:
We are seeing: "dhcp server failed to create private network: unable to retrieve active status for new PER connection information after create private network". This error might be alleviated by avoiding CIDR conflicts when we create the DHCP network. One way to mitigate this is to use a random subnet address, as sketched below.
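A minimal sketch of that mitigation, assuming the DHCP private network is created with a configurable CIDR (the function and the 192.168.0.0/16 range are illustrative, not the Power VS provider's actual code):

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// randomPrivateCIDR picks a random /24 inside 192.168.0.0/16 so that repeated
// DHCP private-network creations are unlikely to collide on the same subnet.
func randomPrivateCIDR(r *rand.Rand) string {
	return fmt.Sprintf("192.168.%d.0/24", r.Intn(256))
}

func main() {
	r := rand.New(rand.NewSource(time.Now().UnixNano()))
	fmt.Println("using DHCP subnet:", randomPrivateCIDR(r))
}
```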
Version-Release number of selected component (if applicable):
How reproducible:
Sometimes
Steps to Reproduce:
1. Create a cluster
Actual results:
Expected results:
Additional info:
Please review the following PR: https://github.com/openshift/cluster-api-provider-aws/pull/513
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
I have a customer (CU) who reported that they are not able to edit the "Until" option from the Developer perspective.
Version-Release number of selected component (if applicable):
OCP v4.15.11
Screenshot
https://redhat-internal.slack.com/archives/C04BSV48DJS/p1716889816419439
Seen in a 4.15.19 cluster, the PrometheusOperatorRejectedResources alert was firing, but did not link a runbook, despite the runbook existing since MON-2358.
Seen in 4.15.19, but likely applies to all versions where the PrometheusOperatorRejectedResources alert exists.
Every time.
Check the cluster console at /monitoring/alertrules?rowFilter-alerting-rule-source=platform&name=PrometheusOperatorRejectedResources, and click through to the alert definition.
No mention of runbooks.
A Runbook section linking the runbook.
I haven't dug into the upstream/downstream sync process, but the runbook information likely needs to at least show up here, although that may or may not be the root location for injecting our canonical runbook into the upstream-sourced alert.
This is a clone of issue OCPBUGS-42528. The following is the description of the original issue:
—
Description of problem:
The created Node ISO is missing the architecture (<arch>) in its filename, which breaks consistency with other generated ISOs such as the Agent ISO.
Version-Release number of selected component (if applicable):
4.17
How reproducible:
100%
Actual results:
Currently, the Node ISO is being created with the filename node.iso.
Expected results:
Node ISO should be created as node.<arch>.iso to maintain consistency.
Description of the problem:
In my ACM 2.10.5 / MCE 2.5.6 Hub, I updated successfully to ACM 2.11.0 / MCE 2.6.0
At the completion of the update, I now see some odd aspects with my `local-cluster` object.
How reproducible:
Any existing hub that updates to 2.11 will see these changes introduced.
Steps to reproduce:
1.Have an existing OCP IPI on AWS
2.Install ACM 2.10 on it
3.Upgrade to ACM 2.11
Actual results:
notice the ACM local-cluster is an AWS provider type in the UI
notice that the ACM local-cluster is now a Host Inventory provider type in the UI
notice that the ACM local-cluster now has an 'Add hosts' menu action in the UI. This does not make sense to have for an IPI style OCP on AWS.
Expected results:
I did not expect that the Hub on AWS IPI OCP would be shown as Host Inventory, nor would I expect that I can / should use this Add hosts menu action.
Description of problem:
Autoscaler balance similar node groups failed on aws when run regression for https://issues.redhat.com/browse/OCPCLOUD-2616
Version-Release number of selected component (if applicable):
4.17.0-0.nightly-2024-06-20-165244
How reproducible:
Always
Steps to Reproduce:
1. Create clusterautoscaler with balanceSimilarNodeGroups: true 2. Create 2 machineautoscaler min/max 1/8 3. Add workload
Actual results:
Couldn't see the "splitting scale-up" message from the cluster-autoscalerlogs. must-gather: https://drive.google.com/file/d/17aZmfQHKZxJEtqPvl37HPXkXA36Yp6i8/view?usp=sharing 2024-06-21T13:21:08.678016167Z I0621 13:21:08.678006 1 compare_nodegroups.go:157] nodes template-node-for-MachineSet/openshift-machine-api/zhsun-aws21-5slwv-worker-us-east-2b-5109433294514062211 and template-node-for-MachineSet/openshift-machine-api/zhsun-aws21-5slwv-worker-us-east-2c-760092546639056043 are not similar, labels do not match 2024-06-21T13:21:08.678030474Z I0621 13:21:08.678021 1 orchestrator.go:249] No similar node groups found
Expected results:
balanceSimilarNodeGroups works well
Additional info:
_<Architect is responsible for completing this section to define the
details of the story>_
As an OpenShift user, I'd like to see the LATEST charts, sorted by release (semver) version.
But today the list is in the order they were released, which makes sense if your chart is single-stream, but not if you are releasing for multiple product version streams (1.0.z, 1.1.z, 1.2.z)
RHDH has charts for 2 concurrent release streams today, with a 3rd stream coming in June, so we'll have 1.0.z updates after 1.2, and 1.1.z after that. Mixing them together is confusing, especially with the Freshmaker CVE updates.
One of two implementations:
a) default sorting is by logical semver, with the latest release at the bottom of the list (default selected) and the oldest chart at the top (see the sketch after this list); or
b) UI to allow choosing to sort by release date or version
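A minimal sketch of option (a)'s semver-aware ordering, assuming chart versions are plain x.y.z strings (the helper is illustrative, not the console's actual implementation):

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// parseSemver splits "1.10.2" into its numeric components; anything that does
// not parse is treated as 0 so malformed versions sort first rather than panic.
func parseSemver(v string) [3]int {
	var out [3]int
	for i, p := range strings.SplitN(strings.TrimPrefix(v, "v"), ".", 3) {
		n, _ := strconv.Atoi(strings.SplitN(p, "-", 2)[0]) // ignore pre-release suffixes
		out[i] = n
	}
	return out
}

func main() {
	versions := []string{"1.0.3", "1.2.0", "1.1.5", "1.0.10", "1.2.1"}
	sort.Slice(versions, func(i, j int) bool {
		a, b := parseSemver(versions[i]), parseSemver(versions[j])
		for k := 0; k < 3; k++ {
			if a[k] != b[k] {
				return a[k] < b[k] // oldest first, latest at the bottom of the list
			}
		}
		return false
	})
	fmt.Println(versions) // [1.0.3 1.0.10 1.1.5 1.2.0 1.2.1]
}
```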
Development:
QE:
Documentation: Yes/No (needs-docs|upstream-docs / no-doc)
Upstream: <Inputs/Requirement details: Concept/Procedure>/ Not
Applicable
Downstream: <Inputs/Requirement details: Concept/Procedure>/ Not
Applicable
Release Notes Type: <New Feature/Enhancement/Known Issue/Bug
fix/Breaking change/Deprecated Functionality/Technology Preview>
Dependencies identified
Blockers noted and expected delivery timelines set
Design is implementable
Acceptance criteria agreed upon
Story estimated
Unknown
Verified
Unsatisfied
Description of problem:
Admission webhook warning on creation of Route - violates policy 299 - unknown field "metadata.defaultAnnotations". Admission webhook warning on creation of BuildConfig - violates policy 299 - unknown field "spec.source.git.type".
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. Navigate to Import from git form and create a deployment 2. See the `Admission webhook warning` toast notification
Actual results:
Admission webhook warning - violates policy 299 - unknown field "metadata.defaultAnnotations" shows up on creation of the Route, and Admission webhook warning on creation of the BuildConfig - violates policy 299 - unknown field "spec.source.git.type"
Expected results:
No Admission webhook warning should show
Additional info:
Description of problem:
Fix spelling "Rememeber" to "Remember"
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
The cloud provider feature of NTO doesn't work as expected
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. Create a cloud-provider profile like as apiVersion: tuned.openshift.io/v1 kind: Tuned metadata: name: provider-aws namespace: openshift-cluster-node-tuning-operator spec: profile: - data: | [main] summary=GCE Cloud provider-specific profile # Your tuning for GCE Cloud provider goes here. [sysctl] vm.admin_reserve_kbytes=16386 name: provider-aws 2. 3.
Actual results:
The value of vm.admin_reserve_kbytes is still the default value
Expected results:
The value of vm.admin_reserve_kbytes should change to 16386
Additional info:
This is a clone of issue OCPBUGS-43157. The following is the description of the original issue:
—
Description of problem:
When running the `make fmt` target in the repository, the command can fail due to a version mismatch between the Go toolchain and the goimports dependency.
Version-Release number of selected component (if applicable):
4.16.z
How reproducible:
always
Steps to Reproduce:
1. Check out the release-4.16 branch 2. Run `make fmt`
Actual results:
INFO[2024-10-01T14:41:15Z] make fmt make[1]: Entering directory '/go/src/github.com/openshift/cluster-cloud-controller-manager-operator' hack/goimports.sh go: downloading golang.org/x/tools v0.25.0 go: golang.org/x/tools/cmd/goimports@latest: golang.org/x/tools@v0.25.0 requires go >= 1.22.0 (running go 1.21.11; GOTOOLCHAIN=local)
Expected results:
successful completion of `make fmt`
Additional info:
Our goimports.sh script references `goimports@latest`, which means this problem will most likely affect older branches as well. We will need to pin a specific version of the goimports package for those branches. Given that the CCCMO includes golangci-lint and uses it for a test, we should run goimports through golangci-lint, which solves this problem without needing special versions of goimports.
We are pushing metadata (https://github.com/openshift/assisted-service/blob/master/internal/uploader/events_uploader.go#L179)
from assisted service (onprem mode).
We need to consume this data from assisted-events-stream and project it to OpenSearch.
Please review the following PR: https://github.com/openshift/cluster-policy-controller/pull/149
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
HCP fails to deploy with SR CSI driver failing to pull its image
Please review the following PR: https://github.com/openshift/ibm-powervs-block-csi-driver/pull/84
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
The default channel of 4.17 clusters is stable-4.16.
Version-Release number of selected component (if applicable):
4.16.0-0.nightly-2024-01-03-193825
How reproducible:
Always
Steps to Reproduce:
1. Install a 4.16 cluster 2. Check default channel ❯ oc adm upgrade Cluster version is 4.17.0-0.test-2024-07-07-082848-ci-ln-htjr9ib-latest Upstream is unset, so the cluster will use an appropriate default. Channel: stable-4.16 warning: Cannot display available updates: Reason: VersionNotFound Message: Unable to retrieve available updates: currently reconciling cluster version 4.17.0-0.test-2024-07-07-082848-ci-ln-htjr9ib-latest not found in the "stable-4.16" channel
Actual results:
Default channel is stable-4.16 in a 4.17 cluster
Expected results:
Default channel should be stable-4.17
Additional info:
similar issue was observed and fixed in previous versions
This is a clone of issue OCPBUGS-38228. The following is the description of the original issue:
—
Description of problem:
On the overview page's getting started resources card, there is an "OpenShift LightSpeed" link when this operator is available on the cluster; the text should be updated to "OpenShift Lightspeed" to keep consistent with the operator name.
Version-Release number of selected component (if applicable):
4.17.0-0.nightly-2024-08-08-013133 4.16.0-0.nightly-2024-08-08-111530
How reproducible:
Always
Steps to Reproduce:
1. Check overview page's getting started resources card, 2. 3.
Actual results:
1. There is "OpenShift LightSpeed" link in "Explore new features and capabilities"
Expected results:
1. The text should be "OpenShift Lightspeed" to keep consistent with the operator name.
Additional info:
Seen in some 4.17 update runs, like this one:
disruption_tests: [bz-Cluster Version Operator] Verify presence of admin ack gate blocks upgrade until acknowledged expand_less 2h30m29s {Your Test Panicked github.com/openshift/origin/test/extended/util/openshift/clusterversionoperator/adminack.go:153 When you, or your assertion library, calls Ginkgo's Fail(), Ginkgo panics to prevent subsequent assertions from running. Normally Ginkgo rescues this panic so you shouldn't see it. However, if you make an assertion in a goroutine, Ginkgo can't capture the panic. To circumvent this, you should call defer GinkgoRecover() at the top of the goroutine that caused this panic. ... github.com/openshift/origin/test/extended/util/openshift/clusterversionoperator.getClusterVersion({0x8b34870, 0xc0004d3b20}, 0xc0004d3b20?) github.com/openshift/origin/test/extended/util/openshift/clusterversionoperator/adminack.go:153 +0xee github.com/openshift/origin/test/extended/util/openshift/clusterversionoperator.getCurrentVersion({0x8b34870?, 0xc0004d3b20?}, 0xd18c2e2800?) github.com/openshift/origin/test/extended/util/openshift/clusterversionoperator/adminack.go:163 +0x2c github.com/openshift/origin/test/extended/util/openshift/clusterversionoperator.(*AdminAckTest).Test(0xc001599de0, {0x8b34800, 0xc005a36910}) github.com/openshift/origin/test/extended/util/openshift/clusterversionoperator/adminack.go:72 +0x28d github.com/openshift/origin/test/e2e/upgrade/adminack.(*UpgradeTest).Test(0xc0018b23a0, {0x8b34608?, 0xccc6580?}, 0xc0055a3320?, 0xc0055ba180, 0x0?) github.com/openshift/origin/test/e2e/upgrade/adminack/adminack.go:53 +0xfa ...
We should deal with that noise, and get nicer error messages out of this test-case when there are hiccups calling getClusterVersion and similar.
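A hypothetical sketch of the shape of that cleanup (names and retry budget are invented, not the actual origin code): have the helper return errors and retry transient API hiccups instead of calling Fail from inside a goroutine.

```go
// Package sketch only; in origin this would live next to the admin-ack test helpers.
package adminacksketch

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"

	configv1 "github.com/openshift/api/config/v1"
	configclient "github.com/openshift/client-go/config/clientset/versioned"
)

// getClusterVersionWithRetry polls the ClusterVersion object, tolerating
// transient API errors, and returns the last error to the caller so the test
// can fail with a useful message instead of panicking Ginkgo from a goroutine.
func getClusterVersionWithRetry(ctx context.Context, c configclient.Interface) (*configv1.ClusterVersion, error) {
	var cv *configv1.ClusterVersion
	var lastErr error
	err := wait.PollUntilContextTimeout(ctx, 5*time.Second, 2*time.Minute, true, func(ctx context.Context) (bool, error) {
		cv, lastErr = c.ConfigV1().ClusterVersions().Get(ctx, "version", metav1.GetOptions{})
		return lastErr == nil, nil // keep polling through transient hiccups
	})
	if err != nil {
		return nil, fmt.Errorf("getting ClusterVersion %q: %w (last error: %v)", "version", err, lastErr)
	}
	return cv, nil
}
```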
Seen in 4.17 CI, but it's fairly old code, so likely earlier 4.y are also exposed.
Sippy reports 18 failures for this test-case vs. 250 successes in the last week of periodic-ci-openshift-release-master-ci-4.17-e2e-gcp-ovn-upgrade runs. So fairly rare, but not crazy rare.
Run hundreds of periodic-ci-openshift-release-master-ci-4.17-e2e-gcp-ovn-upgrade and watch the Verify presence of admin ack gate blocks upgrade until acknowledged test-case.
Occasional failures complaining about Ginko Fail panics.
Reliable success.
Observed in
There was a delay provisioning one of the master nodes; we should figure out why this is happening and whether it can be prevented.
From the Ironic logs, there was a 5-minute delay during cleaning; on the other 2 masters this took a few seconds.
01:20:53 1f90131a...moved to provision state "verifying" from state "enroll" 01:20:59 1f90131a...moved to provision state "manageable" from state "verifying" 01:21:04 1f90131a...moved to provision state "inspecting" from state "manageable" 01:21:35 1f90131a...moved to provision state "inspect wait" from state "inspecting" 01:26:26 1f90131a...moved to provision state "inspecting" from state "inspect wait" 01:26:26 1f90131a...moved to provision state "manageable" from state "inspecting" 01:26:30 1f90131a...moved to provision state "cleaning" from state "manageable" 01:27:17 1f90131a...moved to provision state "clean wait" from state "cleaning" >>> whats this 5 minute gap about ?? <<< 01:32:07 1f90131a...moved to provision state "cleaning" from state "clean wait" 01:32:08 1f90131a...moved to provision state "clean wait" from state "cleaning" 01:32:12 1f90131a...moved to provision state "cleaning" from state "clean wait" 01:32:13 1f90131a...moved to provision state "available" from state "cleaning" 01:32:23 1f90131a...moved to provision state "deploying" from state "available" 01:32:28 1f90131a...moved to provision state "wait call-back" from state "deploying" 01:32:58 1f90131a...moved to provision state "deploying" from state "wait call-back" 01:33:14 1f90131a...moved to provision state "active" from state "deploying"
Please review the following PR: https://github.com/openshift/api/pull/1903
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
This is a clone of issue OCPBUGS-37663. The following is the description of the original issue:
—
Description of problem:
CAPZ creates an empty route table during installs
Version-Release number of selected component (if applicable):
4.17
How reproducible:
Very
Steps to Reproduce:
1.Install IPI cluster using CAPZ 2. 3.
Actual results:
Empty route table created and attached to worker subnet
Expected results:
No route table created
Additional info:
Description of problem:
oc command cannot be used with RHEL 8 based bastion
Version-Release number of selected component (if applicable):
4.16.0-rc.1
How reproducible:
Very
Steps to Reproduce:
1. Have a bastion for z/VM installation at Red Hat Enterprise Linux release 8.9 (Ootpa) 2. Download and install the 4.16.0-rc.1 client on the bastion 3.Attempt to use the oc command
Actual results:
oc get nodes oc: /lib64/libc.so.6: version `GLIBC_2.33' not found (required by oc) oc: /lib64/libc.so.6: version `GLIBC_2.34' not found (required by oc) oc: /lib64/libc.so.6: version `GLIBC_2.32' not found (required by oc)
Expected results:
oc command returns without error
Additional info:
This was introduced in 4.16.0-rc.1 - 4.16.0-rc.0 works fine
Please review the following PR: https://github.com/openshift/images/pull/184
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
When doing offline SDN migration, setting the parameter "spec.migration.features.egressIP" to "false" to disable automatic migration of egressIP configuration doesn't work.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. Launch a cluster with OpenShiftSDN. Configure an egressip to a node. 2. Start offline SDN migration. 3. In step-3, execute oc patch Network.operator.openshift.io cluster --type='merge' \ --patch '{ "spec": { "migration": { "networkType": "OVNKubernetes", "features": { "egressIP": false } } } }'
Actual results:
An egressip.k8s.ovn.org CR is created automatically.
Expected results:
No egressip CR shall be created for OVN-K
Additional info:
Description of problem:
The ingress operator provides the "SyncLoadBalancerFailed" status with a message that says "The kube-controller-manager logs may contain more details.". Depending on the platform, that isn't accurate as we are transitioning the CCM out-of-tree to the "cloud-controller-manager". Code Link: https://github.com/openshift/cluster-ingress-operator/blob/55780444031714fc931d90af298a4b193888977a/pkg/operator/controller/ingress/status.go#L874
Version-Release number of selected component (if applicable):
4.16
How reproducible:
Steps to Reproduce:
1. Create a IngressController with a broken LoadBalancer-type service so it produces "SyncLoadBalancerFailed" (TBD, I'll try to figure out how to produce this...) 2. 3.
Actual results:
"The kube-controller-manager logs may contain more details."
Expected results:
"The cloud-controller-manager logs may contain more details."
Additional info:
Description of problem:
When the cloud-credential operator is used in manual mode and awsSTSIAMRoleARN is not present in the secret, the operator pods throw aggressive errors every second. One of the customer's concerns is the number of errors from the operator pods. Two errors per second: ============================ time="2024-05-10T00:43:45Z" level=error msg="error syncing credentials: an empty awsSTSIAMRoleARN was found so no Secret was created" controller=credreq cr=openshift-cloud-credential-operator/aws-ebs-csi-driver-operator secret=openshift-cluster-csi-drivers/ebs-cloud-credentials time="2024-05-10T00:43:46Z" level=error msg="errored with condition: CredentialsProvisionFailure" controller=credreq cr=openshift-cloud-credential-operator/aws-ebs-csi-driver-operator secret=openshift-cluster-csi-drivers/ebs-cloud-credentials
Version-Release number of selected component (if applicable):
4.15.3
How reproducible:
Always present in managed rosa clusters
Steps to Reproduce:
1.create a rosa cluster 2.check the errors of cloud credentials operator pods 3.
Actual results:
The CCO logs continually throw errors
Expected results:
The CCO logs should not be continually throwing these errors.
Additional info:
The focus of this bug is only to remove the error lines from the logs. The underlying issue, of continually attempting to reconcile the CRs will be handled by other bugs.
Please review the following PR: https://github.com/openshift/cloud-provider-kubevirt/pull/43
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
This is a clone of issue OCPBUGS-38860. The following is the description of the original issue:
—
Description of problem:
In 4.16 we can collapse and expand the "Getting started resources" section under the Administrator perspective. In earlier versions we could also remove this section entirely with the [X] button, which is no longer available in 4.16: only expand and collapse are offered, and the option to remove the section, as in previous versions, is missing.
Version-Release number of selected component (if applicable):
How reproducible:
Every time
Steps to Reproduce:
1. Go to Web console. Click on the "Getting started resources." 2. Then you can expand and collapse this tab. 3. But there is no option to directly remove this tab.
Actual results:
Expected results:
Additional info:
Description of problem:
The network resource provisioning playbook for 4.15 dualstack UPI contains a task for adding an IPv6 subnet to the existing external router [1]. This task fails with: - ansible-2.9.27-1.el8ae.noarch & ansible-collections-openstack-1.8.0-2.20220513065417.5bb8312.el8ost.noarch in OSP 16 env (RHEL 8.5) or - openstack-ansible-core-2.14.2-4.1.el9ost.x86_64 & ansible-collections-openstack-1.9.1-17.1.20230621074746.0e9a6f2.el9ost.noarch in OSP 17 env (RHEL 9.2) Besides that we need to have a way for identifying resources for particular deployment, as it may interfere with existing one.
Version-Release number of selected component (if applicable):
4.15.0-0.nightly-2024-01-22-160236
How reproducible:
Always
Steps to Reproduce:
1. Set the os_subnet6 in the inventory file for setting dualstack 2. Run the 4.15 network.yaml playbook
Actual results:
Playbook fails: TASK [Add IPv6 subnet to the external router] ********************************** fatal: [localhost]: FAILED! => {"changed": false, "extra_data": {"data": null, "details": "Invalid input for external_gateway_info. Reason: Validation of dictionary's keys failed. Expected keys: {'network_id'} Provided keys: {'external_fixed_ips'}.", "response": "{\"NeutronError\": {\"type\": \"HTTPBadRequest\", \"message\": \"Invalid input for external_gateway_info. Reason: Validation of dictionary's keys failed. Expected keys: {'network_id'} Provided keys: {'external_fixed_ips'}.\", \"detail\": \"\"}}"}, "msg": "Error updating router 8352c9c0-dc39-46ed-94ed-c038f6987cad: Client Error for url: https://10.46.43.81:13696/v2.0/routers/8352c9c0-dc39-46ed-94ed-c038f6987cad, Invalid input for external_gateway_info. Reason: Validation of dictionary's keys failed. Expected keys: {'network_id'} Provided keys: {'external_fixed_ips'}."}
Expected results:
Successful playbook execution
Additional info:
The router can be created in two different tasks, the playbook [2] worked for me.
[1] https://github.com/openshift/installer/blob/1349161e2bb8606574696bf1e3bc20ae054e60f8/upi/openstack/network.yaml#L43
[2] https://file.rdu.redhat.com/juriarte/upi/network.yaml
This is a clone of issue OCPBUGS-37850. The following is the description of the original issue:
—
Occasional machine-config daemon panics in test-preview. For example this run has:
https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_cluster-version-operator/1076/pull-ci-openshift-cluster-version-operator-master-e2e-aws-ovn-techpreview/1819082707058036736
And the referenced logs include a full stack trace, the crux of which appears to be:
E0801 19:23:55.012345 2908 runtime.go:79] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference) goroutine 127 [running]: k8s.io/apimachinery/pkg/util/runtime.logPanic({0x2424b80, 0x4166150}) /go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:75 +0x85 k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0xc0004d5340?}) /go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:49 +0x6b panic({0x2424b80?, 0x4166150?}) /usr/lib/golang/src/runtime/panic.go:770 +0x132 github.com/openshift/machine-config-operator/pkg/helpers.ListPools(0xc0007c5208, {0x0, 0x0}) /go/src/github.com/openshift/machine-config-operator/pkg/helpers/helpers.go:142 +0x17d github.com/openshift/machine-config-operator/pkg/helpers.GetPoolsForNode({0x0, 0x0}, 0xc0007c5208) /go/src/github.com/openshift/machine-config-operator/pkg/helpers/helpers.go:66 +0x65 github.com/openshift/machine-config-operator/pkg/daemon.(*PinnedImageSetManager).handleNodeEvent(0xc000a98480, {0x27e9e60?, 0xc0007c5208}) /go/src/github.com/openshift/machine-config-operator/pkg/daemon/pinned_image_set.go:955 +0x92
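From the trace, GetPoolsForNode is being handed a nil MachineConfigPool lister, which suggests the pinned-image-set event handler can fire before its listers are wired up. A self-contained, hypothetical sketch of the kind of guard that avoids the nil dereference (the field names are assumptions, not the actual MCO fix):

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// pinnedImageSetManager is a stand-in for the MCD component in the trace; the
// field names are assumptions for illustration only.
type pinnedImageSetManager struct {
	mcpListerReady func() bool                               // e.g. informer HasSynced
	getPools       func(node *corev1.Node) ([]string, error) // e.g. helpers.GetPoolsForNode
}

// handleNodeEvent bails out early if the listers are not wired up or synced yet
// instead of letting the pool lookup dereference a nil lister.
func (p *pinnedImageSetManager) handleNodeEvent(obj interface{}) {
	node, ok := obj.(*corev1.Node)
	if !ok {
		return
	}
	if p.getPools == nil || p.mcpListerReady == nil || !p.mcpListerReady() {
		fmt.Printf("ignoring event for node %s: listers not ready\n", node.Name)
		return
	}
	if _, err := p.getPools(node); err != nil {
		fmt.Printf("getting pools for node %s: %v\n", node.Name, err)
	}
}

func main() {
	m := &pinnedImageSetManager{} // listers deliberately not initialized
	m.handleNodeEvent(&corev1.Node{})
}
```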
$ w3m -dump -cols 200 'https://search.dptools.openshift.org/?name=^periodic&type=junit&search=machine-config-daemon.*Observed+a+panic' | grep 'failures match' periodic-ci-openshift-release-master-ci-4.17-e2e-azure-ovn-techpreview (all) - 37 runs, 62% failed, 13% of failures match = 8% impact periodic-ci-openshift-release-master-ci-4.16-e2e-azure-ovn-techpreview-serial (all) - 6 runs, 83% failed, 20% of failures match = 17% impact periodic-ci-openshift-release-master-ci-4.18-e2e-azure-ovn-techpreview (all) - 5 runs, 60% failed, 33% of failures match = 20% impact periodic-ci-openshift-multiarch-master-nightly-4.17-ocp-e2e-aws-ovn-arm64-techpreview-serial (all) - 10 runs, 40% failed, 25% of failures match = 10% impact periodic-ci-openshift-release-master-ci-4.17-e2e-aws-ovn-techpreview-serial (all) - 7 runs, 29% failed, 50% of failures match = 14% impact periodic-ci-openshift-release-master-nightly-4.17-e2e-vsphere-ovn-techpreview-serial (all) - 7 runs, 100% failed, 14% of failures match = 14% impact periodic-ci-openshift-release-master-nightly-4.18-e2e-vsphere-ovn-techpreview-serial (all) - 5 runs, 100% failed, 20% of failures match = 20% impact periodic-ci-openshift-multiarch-master-nightly-4.17-ocp-e2e-aws-ovn-arm64-techpreview (all) - 10 runs, 40% failed, 25% of failures match = 10% impact periodic-ci-openshift-release-master-ci-4.18-e2e-gcp-ovn-techpreview (all) - 5 runs, 40% failed, 50% of failures match = 20% impact periodic-ci-openshift-release-master-ci-4.16-e2e-aws-ovn-techpreview-serial (all) - 6 runs, 17% failed, 200% of failures match = 33% impact periodic-ci-openshift-release-master-nightly-4.16-e2e-vsphere-ovn-techpreview (all) - 6 runs, 17% failed, 100% of failures match = 17% impact periodic-ci-openshift-release-master-nightly-4.17-e2e-aws-ovn-single-node-techpreview-serial (all) - 7 runs, 100% failed, 14% of failures match = 14% impact periodic-ci-openshift-release-master-nightly-4.17-e2e-aws-ovn-single-node-techpreview (all) - 7 runs, 57% failed, 50% of failures match = 29% impact periodic-ci-openshift-release-master-ci-4.16-e2e-aws-ovn-techpreview (all) - 6 runs, 17% failed, 100% of failures match = 17% impact periodic-ci-openshift-release-master-ci-4.17-e2e-gcp-ovn-techpreview (all) - 18 runs, 17% failed, 33% of failures match = 6% impact periodic-ci-openshift-release-master-ci-4.16-e2e-gcp-ovn-techpreview (all) - 6 runs, 17% failed, 100% of failures match = 17% impact periodic-ci-openshift-multiarch-master-nightly-4.16-ocp-e2e-aws-ovn-arm64-techpreview-serial (all) - 11 runs, 18% failed, 50% of failures match = 9% impact periodic-ci-openshift-release-master-ci-4.17-e2e-azure-ovn-techpreview-serial (all) - 7 runs, 57% failed, 25% of failures match = 14% impact
Looks like ~15% impact in the CI runs that CI Search turns up.
Run lots of CI. Look for MCD panics.
CI Search results above.
No hits.
Please review the following PR: https://github.com/openshift/oc/pull/1780
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
There are 2 problematic tests in the ImageEcosystem test suite: the Rails sample and the s2i Perl test. This issue tries to fix them both at once so that we can get a passing image ecosystem test.
Version-Release number of selected component (if applicable):
How reproducible:
always
Steps to Reproduce:
1. Run the imageecosystem testsuite 2. observe the {[Feature:ImageEcosystem][ruby]} and {[Feature:ImageEcosystem][perl]} test fail
Actual results:
The two tests fail
Expected results:
No test failures
Additional info:
This is a clone of issue OCPBUGS-42873. The following is the description of the original issue:
—
Description of problem:
The openshift-apiserver sends traffic through the konnectivity proxy, including traffic intended for the local audit-webhook service. The audit-webhook service should be included in the NO_PROXY env var of the openshift-apiserver container.
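A minimal sketch of the shape of the fix, assuming the proxy environment for the openshift-apiserver container is assembled in Go by the control-plane operator (the names here are illustrative, not HyperShift's actual code):

```go
package main

import (
	"fmt"
	"strings"
)

func main() {
	// Hosts that must bypass the konnectivity proxy; appending the local
	// audit-webhook Service keeps audit traffic from being routed through it.
	noProxy := []string{
		"localhost",
		"127.0.0.1",
		"kube-apiserver",
		"audit-webhook", // local audit-webhook Service in the control-plane namespace
	}
	fmt.Printf("NO_PROXY=%s\n", strings.Join(noProxy, ","))
}
```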
Version-Release number of selected component (if applicable):
4.14.z, 4.15.z, 4.16.z
How reproducible:
Always
Steps to Reproduce:
1. Create a rosa hosted cluster 2. Obeserve logs of the konnectivity-proxy sidecar of openshift-apiserver 3.
Actual results:
Logs include requests to the audit-webhook local service
Expected results:
Logs do not include requests to audit-webhook
Additional info:
Description of problem:
cluster-api-provider-openstack panics when fed a non-existent network ID as the additional network of a Machine, or really any network on which it tries to create a port. This issue has been fixed upstream in https://github.com/kubernetes-sigs/cluster-api-provider-openstack/pull/2064, which is present in the latest release v0.10.3.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
1. For the Linux nodes, the container runtime is CRI-O, and a crio process listens on port 9537. Windows nodes, however, don't have the CRI-O container runtime.
2. Prometheus is trying to connect to /metrics endpoint on the windows nodes on port 9537 which actually does not have any process listening on it.
3. TargetDown is alerting crio job since it cannot reach the endpoint http://windows-node-ip:9537/metrics.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. Install 4.13 cluster with windows operator 2. In the Prometheus UI, go to > Status > Targets to know which targets are down.
Actual results:
It gives the alert for targetDown
Expected results:
It should not give any such alert.
Additional info:
Description of problem:
This was discovered by a new alert that was reverted in https://issues.redhat.com/browse/OCPBUGS-36299 as the issue is making Hypershift Conformance fail.
Platform prometheus is asked to scrape targets from the namespace "openshift-operator-lifecycle-manager", but Prometheus isn't given the appropriate RBAC to do so.
The alert was revealing an RBAC issue on platform Prometheus: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-hypershift-release-4.17-periodics-e2e-aws-ovn-conformance/1806305841511403520/artifacts/e2e-aws-ovn-conformance/dump/artifacts/hostedcluster-8a4fd7515fb581e231c4/namespaces/openshift-monitoring/pods/prometheus-k8s-0/prometheus/prometheus/logs/current.log
2024-06-27T14:59:38.968032082Z ts=2024-06-27T14:59:38.967Z caller=klog.go:108 level=warn component=k8s_client_runtime func=Warningf msg="github.com/prometheus/prometheus/discovery/kubernetes/kubernetes.go:554: failed to list *v1.Endpoints: endpoints is forbidden: User \"system:serviceaccount:openshift-monitoring:prometheus-k8s\" cannot list resource \"endpoints\" in API group \"\" in the namespace \"openshift-operator-lifecycle-manager\"" 2024-06-27T14:59:38.968032082Z ts=2024-06-27T14:59:38.968Z caller=klog.go:116 level=error component=k8s_client_runtime func=ErrorDepth msg="github.com/prometheus/prometheus/discovery/kubernetes/kubernetes.go:554: Failed to watch *v1.Endpoints: failed to list *v1.Endpoints: endpoints is forbidden: User \"system:serviceaccount:openshift-monitoring:prometheus-k8s\" cannot list resource \"endpoints\" in API group \"\" in the namespace \"openshift-operator-lifecycle-manager\""
Before adding this alert, such issues went unnoticed.
https://docs.google.com/document/d/1rCKAYTrYMESjJDyJ0KvNap05NNukmNXVVi6MShISnOw/edit#heading=h.13rhihr867kk explains what should be done (cf the "Also, in order for Prometheus to be able to discover..." paragraph) in order to make Prometheus able to discover the targets.
Because no test was failing before, maybe the metrics from "openshift-operator-lifecycle-manager" are not needed and we should stop asking Prometheus to discover targets from there: delete the ServiceMonitor/PodMonitor
Expected results:
Description of problem:
Apply an NNCP to configure DNS, then edit the NNCP to update the nameserver; /etc/resolv.conf is not updated.
Version-Release number of selected component (if applicable):
OCP version: 4.16.0-0.nightly-2024-03-13-061822 knmstate operator version: kubernetes-nmstate-operator.4.16.0-202403111814
How reproducible:
always
Steps to Reproduce:
1. install knmstate operator 2. apply below nncp to configure dns on one of the node --- apiVersion: nmstate.io/v1 kind: NodeNetworkConfigurationPolicy metadata: name: dns-staticip-4 spec: nodeSelector: kubernetes.io/hostname: qiowang-031510-k4cjs-worker-0-rw4nt desiredState: dns-resolver: config: search: - example.org server: - 192.168.221.146 - 8.8.9.9 interfaces: - name: dummy44 type: dummy state: up ipv4: address: - ip: 192.0.2.251 prefix-length: 24 dhcp: false enabled: true auto-dns: false % oc apply -f dns-staticip-noroute.yaml nodenetworkconfigurationpolicy.nmstate.io/dns-staticip-4 created % oc get nncp NAME STATUS REASON dns-staticip-4 Available SuccessfullyConfigured % oc get nnce NAME STATUS STATUS AGE REASON qiowang-031510-k4cjs-worker-0-rw4nt.dns-staticip-4 Available 5s SuccessfullyConfigured 3. check dns on the node, dns configured correctly sh-5.1# cat /etc/resolv.conf # Generated by KNI resolv prepender NM dispatcher script search qiowang-031510.qe.devcluster.openshift.com example.org nameserver 192.168.221.146 nameserver 192.168.221.146 nameserver 8.8.9.9 # nameserver 192.168.221.1 sh-5.1# sh-5.1# cat /var/run/NetworkManager/resolv.conf # Generated by NetworkManager search example.org nameserver 192.168.221.146 nameserver 8.8.9.9 nameserver 192.168.221.1 sh-5.1# sh-5.1# nmcli | grep 'DNS configuration' -A 10 DNS configuration: servers: 192.168.221.146 8.8.9.9 domains: example.org interface: dummy44 ... ... 4. edit nncp, update nameserver, save the modification --- spec: desiredState: dns-resolver: config: search: - example.org server: - 192.168.221.146 - 8.8.8.8 <---- update from 8.8.9.9 to 8.8.8.8 interfaces: - ipv4: address: - ip: 192.0.2.251 prefix-length: 24 auto-dns: false dhcp: false enabled: true name: dummy44 state: up type: dummy nodeSelector: kubernetes.io/hostname: qiowang-031510-k4cjs-worker-0-rw4nt % oc edit nncp dns-staticip-4 nodenetworkconfigurationpolicy.nmstate.io/dns-staticip-4 edited % oc get nncp NAME STATUS REASON dns-staticip-4 Available SuccessfullyConfigured % oc get nnce NAME STATUS STATUS AGE REASON qiowang-031510-k4cjs-worker-0-rw4nt.dns-staticip-4 Available 8s SuccessfullyConfigured 5. check dns on the node again
Actual results:
the dns nameserver in file /etc/resolv.conf is not updated after nncp updated, file /var/run/NetworkManager/resolv.conf updated correctly: sh-5.1# cat /etc/resolv.conf # Generated by KNI resolv prepender NM dispatcher script search qiowang-031510.qe.devcluster.openshift.com example.org nameserver 192.168.221.146 nameserver 192.168.221.146 nameserver 8.8.9.9 <---- it is not updated # nameserver 192.168.221.1 sh-5.1# sh-5.1# cat /var/run/NetworkManager/resolv.conf # Generated by NetworkManager search example.org nameserver 192.168.221.146 nameserver 8.8.8.8 <---- updated correctly nameserver 192.168.221.1 sh-5.1# sh-5.1# nmcli | grep 'DNS configuration' -A 10 DNS configuration: servers: 192.168.221.146 8.8.8.8 domains: example.org interface: dummy44 ... ...
Expected results:
the dns nameserver in file /etc/resolv.conf can be updated accordingly
Additional info:
Description of problem:
When creating an application from a code sample, some of the icons are stretched out
Version-Release number of selected component (if applicable):
4.17.0
How reproducible:
Always
Steps to Reproduce:
1. With a project selected, click +Add, then click samples 2. Observe that some icons are stretched, e.g., Basic .NET, Basic Python
Actual results:
Icons are stretched horizontally
Expected results:
They are not stretched
Additional info:
Access sample page via route /samples/ns/default
Description of problem:
PAC provides a log link in Git to see the log of the PipelineRun (PLR). This link is broken on 4.15 after this change: https://github.com/openshift/console/pull/13470. That PR changed the log URL after the react-router package upgrade.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
This is a clone of issue OCPBUGS-38789. The following is the description of the original issue:
—
Description of problem:
The network section will be delivered using the networking-console-plugin through the cluster-network-operator.
So we have to remove the section from here to avoid duplication.
Version-Release number of selected component (if applicable):
4.18
How reproducible:
Always
Steps to Reproduce:
Actual results:
Service, Route, Ingress and NetworkPolicy are defined two times in the section
Expected results:
Service, Route, Ingress and NetworkPolicy are defined only one time in the section
Additional info:
Description of problem:
Re-enable the knative and A-04-TC01 tests that were disabled in the PR https://github.com/openshift/console/pull/13931
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
This is a clone of issue OCPBUGS-37953. The following is the description of the original issue:
—
Description of problem:
Specify long cluster name in install-config, ============== metadata: name: jima05atest123456789test123 Create cluster, installer exited with below error: 08-05 09:46:12.788 level=info msg=Network infrastructure is ready 08-05 09:46:12.788 level=debug msg=Creating storage account 08-05 09:46:13.042 level=debug msg=Collecting applied cluster api manifests... 08-05 09:46:13.042 level=error msg=failed to fetch Cluster: failed to generate asset "Cluster": failed to create cluster: failed provisioning resources after infrastructure ready: error creating storage account jima05atest123456789tsh586sa: PUT https://management.azure.com/subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/jima05atest123456789t-sh586-rg/providers/Microsoft.Storage/storageAccounts/jima05atest123456789tsh586sa 08-05 09:46:13.042 level=error msg=-------------------------------------------------------------------------------- 08-05 09:46:13.042 level=error msg=RESPONSE 400: 400 Bad Request 08-05 09:46:13.043 level=error msg=ERROR CODE: AccountNameInvalid 08-05 09:46:13.043 level=error msg=-------------------------------------------------------------------------------- 08-05 09:46:13.043 level=error msg={ 08-05 09:46:13.043 level=error msg= "error": { 08-05 09:46:13.043 level=error msg= "code": "AccountNameInvalid", 08-05 09:46:13.043 level=error msg= "message": "jima05atest123456789tsh586sa is not a valid storage account name. Storage account name must be between 3 and 24 characters in length and use numbers and lower-case letters only." 08-05 09:46:13.043 level=error msg= } 08-05 09:46:13.043 level=error msg=} 08-05 09:46:13.043 level=error msg=-------------------------------------------------------------------------------- 08-05 09:46:13.043 level=error 08-05 09:46:13.043 level=info msg=Shutting down local Cluster API controllers... 08-05 09:46:13.298 level=info msg=Stopped controller: Cluster API 08-05 09:46:13.298 level=info msg=Stopped controller: azure infrastructure provider 08-05 09:46:13.298 level=info msg=Stopped controller: azureaso infrastructure provider 08-05 09:46:13.298 level=info msg=Shutting down local Cluster API control plane... 08-05 09:46:15.177 level=info msg=Local Cluster API system has completed operations See azure doc[1], the naming rules on storage account name, it must be between 3 and 24 characters in length and may contain numbers and lowercase letters only. The prefix of storage account created by installer seems changed to use infraID with CAPI-based installation, it's "cluster" when installing with terraform. Is it possible to change back to use "cluster" as sa prefix to keep consistent with terraform? because there are several storage accounts being created once cluster installation is completed. One is created by installer starting with "cluster", others are created by image-registry starting with "imageregistry". And QE has some CI profiles[2] and automated test cases relying on installer sa, need to search prefix with "cluster", and not sure if customer also has similar scenarios. [1] https://learn.microsoft.com/en-us/azure/storage/common/storage-account-overview [2] https://github.com/openshift/release/blob/master/ci-operator/step-registry/ipi/install/heterogeneous/ipi-install-heterogeneous-commands.sh#L241
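A minimal sketch of enforcing the Azure naming rule when deriving the storage account name from the infra ID (the prefix/suffix handling is an assumption for illustration, not the installer's actual naming code):

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

var invalidChars = regexp.MustCompile(`[^a-z0-9]`)

// storageAccountName lowercases the infra ID, strips characters Azure rejects,
// and truncates so that name+suffix stays within the 24-character limit.
func storageAccountName(infraID, suffix string) string {
	base := invalidChars.ReplaceAllString(strings.ToLower(infraID), "")
	max := 24 - len(suffix)
	if len(base) > max {
		base = base[:max]
	}
	return base + suffix
}

func main() {
	// Illustrative input resembling the long cluster name from the bug report.
	fmt.Println(storageAccountName("jima05atest123456789test123-sh586", "sa"))
	// prints a 24-character, lowercase alphanumeric account name
}
```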
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
This is a clone of issue OCPBUGS-41228. The following is the description of the original issue:
—
Description of problem:
The console crashes when the user selects SSH as the Authentication type for the git server under add secret in the start pipeline form
Version-Release number of selected component (if applicable):
How reproducible:
Every time. Only in the Developer perspective, and only if the Pipelines dynamic plugin is enabled.
Steps to Reproduce:
1. Create a pipeline through add flow and open start pipeline page 2. Under show credentials select add secret 3. In the secret form select `Access to ` as Git server and `Authentication type` as SSH key
Actual results:
Console crashes
Expected results:
UI should work as expected
Additional info:
Attaching console log screenshot
https://drive.google.com/file/d/1bGndbq_WLQ-4XxG5ylU7VuZWZU15ywTI/view?usp=sharing
Description of problem:
The VirtualizedTable component in the console dynamic plugin SDK doesn't have a default sorting column. We need a default sorting column for list pages.
https://github.com/openshift/console/blob/master/frontend/packages/console-dynamic-plugin-sdk/docs/api.md#virtualizedtable
This is a clone of issue OCPBUGS-43360. The following is the description of the original issue:
—
Description of problem:
The "Start last run" option in the Actions menu does not work on the BuildConfig details page
Version-Release number of selected component (if applicable):
How reproducible:
Every time
Steps to Reproduce:
1. Create workloads with builds 2. Go to the Builds page from the navigation 3. Select the build config 4. Select the `Start last run` option from the Action menu
Actual results:
The option doesn't work
Expected results:
The option should work
Additional info:
Attaching video
https://drive.google.com/file/d/10shQqcFbIKfE4Jv60AxNYBXKz08EdUAK/view?usp=sharing
Searching for "Unable to obtain risk analysis from sippy after retries" indicates that sometimes the Risk Analysis request fails (which of course does not fail any tests, we just don't get RA for the job). It's pretty rare, but since we run a lot of tests, that's still a fair sample size.
Found in 0.04% of runs (0.25% of failures) across 37359 total runs and 5522 jobs
Interestingly, searching for the error that leads up to this, "error requesting risk analysis from sippy", leads to similar frequency.
Found in 0.04% of runs (0.25% of failures) across 37460 total runs and 5531 jobs
If failures were completely random and only occasionally repeated enough for retries to all fail, we would expect to see the lead-up a lot more often than the final failure. This suggests that either there's something problematic about a tiny subset of requests, or that perhaps postgres or another dependency is unusually slow for several minutes at a time.
Description of problem:
See:
event happened 183 times, something is wrong: node/ip-10-0-52-0.ec2.internal hmsg/9cff2a8527 - reason/ErrorUpdatingResource error creating gateway for node ip-10-0-52-0.ec2.internal: failed to configure the policy based routes for network "default": invalid host address: 10.0.52.0/18 (17:55:20Z) result=reject |
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1.
2.
3.
Actual results:
Expected results:
Additional info:
Please fill in the following template while reporting a bug and provide as much relevant information as possible. Doing so will give us the best chance to find a prompt resolution.
Affected Platforms:
Is it an
If it is an internal RedHat testing failure:
If it is a CI failure:
If it is a customer / SD issue:
This comes from this bug https://issues.redhat.com/browse/OCPBUGS-29940
After applying the workaround suggested in [1][2] with "oc adm must-gather --node-name", we found another issue: must-gather creates the debug pod on all master nodes and gets stuck for a while because of the loop in the gather_network_logs_basics script. Filtering out the NotReady nodes would allow us to apply the workaround.
The script gather_network_logs_basics gets the master nodes by label (node-role.kubernetes.io/master) and saves them in the CLUSTER_NODES variable. It then passes this as a parameter to the function gather_multus_logs $CLUSTER_NODES, where it loops through the list of master nodes and performs debugging for each node.
collection-scripts/gather_network_logs_basics
...
CLUSTER_NODES="${@:-$(oc get node -l node-role.kubernetes.io/master -oname)}"
/usr/bin/gather_multus_logs $CLUSTER_NODES
...
collection-scripts/gather_multus_logs
...
function gather_multus_logs {
  for NODE in "$@"; do
    nodefilename=$(echo "$NODE" | sed -e 's|node/||')
    out=$(oc debug "${NODE}" -- \
      /bin/bash -c "cat $INPUT_LOG_PATH" 2>/dev/null) && echo "$out" 1> "${OUTPUT_LOG_PATH}/multus-log-$nodefilename.log"
  done
}
This could be resolved with something similar to this:
CLUSTER_NODES="${@:-$(oc get node -l node-role.kubernetes.io/master -o json | jq -r '.items[] | select(.status.conditions[] | select(.type=="Ready" and .status=="True")).metadata.name')}"
/usr/bin/gather_multus_logs $CLUSTER_NODES
[1] - https://access.redhat.com/solutions/6962230
[2] - https://issues.redhat.com/browse/OCPBUGS-29940
This is a clone of issue OCPBUGS-36261. The following is the description of the original issue:
—
Description of problem:
In hostedcluster installations, when the following OAuthServer service is configured without any hostname parameter, the oauth route is created in the management cluster with the standard hostname, which follows the pattern of the ingresscontroller wildcard domain (oauth-<hosted-cluster-namespace>.<wildcard-default-ingress-controller-domain>): ~~~ $ oc get hostedcluster -n <namespace> <hosted-cluster-name> -oyaml - service: OAuthServer servicePublishingStrategy: type: Route ~~~ On the other hand, if any custom hostname parameter is configured, the oauth route is created in the management cluster with the following labels: ~~~ $ oc get hostedcluster -n <namespace> <hosted-cluster-name> -oyaml - service: OAuthServer servicePublishingStrategy: route: hostname: oauth.<custom-domain> type: Route $ oc get routes -n hcp-ns --show-labels NAME HOST/PORT LABELS oauth oauth.<custom-domain> hypershift.openshift.io/hosted-control-plane=hcp-ns <--- ~~~ The configured label prevents the ingresscontroller from admitting the route, as the following configuration is added by the hypershift operator to the default ingresscontroller resource: ~~~ $ oc get ingresscontroller -n openshift-ingress-default default -oyaml routeSelector: matchExpressions: - key: hypershift.openshift.io/hosted-control-plane <--- operator: DoesNotExist <--- ~~~ This configuration should be allowed as there are use cases where the route should have a customized hostname. Currently the HCP platform does not allow this configuration and the oauth route does not work.
Version-Release number of selected component (if applicable):
4.15
How reproducible:
Easily
Steps to Reproduce:
1. Install HCP cluster 2. Configure OAuthServer with type Route 3. Add a custom hostname different than default wildcard ingress URL from management cluster
Actual results:
Oauth route is not admitted
Expected results:
Oauth route should be admitted by Ingresscontroller
Additional info:
Description of problem:
When OCB is enabled and a new MC is created, nodes are drained twice when the resulting osImage build is applied.
Version-Release number of selected component (if applicable):
4.16
How reproducible:
Always
Steps to Reproduce:
1. Enable OCB in the worker pool oc create -f - << EOF apiVersion: machineconfiguration.openshift.io/v1alpha1 kind: MachineOSConfig metadata: name: worker spec: machineConfigPool: name: worker buildInputs: imageBuilder: imageBuilderType: PodImageBuilder baseImagePullSecret: name: $(oc get secret -n openshift-config pull-secret -o json | jq "del(.metadata.namespace, .metadata.creationTimestamp, .metadata.resourceVersion, .metadata.uid, .metadata.name)" | jq '.metadata.name="pull-copy"' | oc -n openshift-machine-config-operator create -f - &> /dev/null; echo -n "pull-copy") renderedImagePushSecret: name: $(oc get -n openshift-machine-config-operator sa builder -ojsonpath='{.secrets[0].name}') renderedImagePushspec: "image-registry.openshift-image-registry.svc:5000/openshift-machine-config-operator/ocb-image:latest" EOF 2. Wait for the image to be built 3. When the opt-in image has been finished and applied create a new MC apiVersion: machineconfiguration.openshift.io/v1 kind: MachineConfig metadata: labels: machineconfiguration.openshift.io/role: worker name: test-machine-config-1 spec: config: ignition: version: 3.1.0 storage: files: - contents: source: data:text/plain;charset=utf-8;base64,dGVzdA== filesystem: root mode: 420 path: /etc/test-file-1.test 4. Wait for the image to be built
Actual results:
Once the image is built it is applied to the worker nodes. If we have a look at the drain operation, we can see that every worker node was drained twice instead of once: oc -n openshift-machine-config-operator logs $(oc -n openshift-machine-config-operator get pods -l k8s-app=machine-config-controller -o jsonpath='{.items[0].metadata.name}') -c machine-config-controller | grep "initiating drain" I0430 13:28:48.740300 1 drain_controller.go:182] node ip-10-0-70-208.us-east-2.compute.internal: initiating drain I0430 13:30:08.330051 1 drain_controller.go:182] node ip-10-0-70-208.us-east-2.compute.internal: initiating drain I0430 13:32:32.431789 1 drain_controller.go:182] node ip-10-0-69-154.us-east-2.compute.internal: initiating drain I0430 13:33:50.643544 1 drain_controller.go:182] node ip-10-0-69-154.us-east-2.compute.internal: initiating drain I0430 13:48:08.183488 1 drain_controller.go:182] node ip-10-0-70-208.us-east-2.compute.internal: initiating drain I0430 13:49:01.379416 1 drain_controller.go:182] node ip-10-0-70-208.us-east-2.compute.internal: initiating drain I0430 13:50:52.933337 1 drain_controller.go:182] node ip-10-0-69-154.us-east-2.compute.internal: initiating drain I0430 13:52:12.191203 1 drain_controller.go:182] node ip-10-0-69-154.us-east-2.compute.internal: initiating drain
Expected results:
Nodes should be drained only once when applying a new MC.
Additional info:
Description of problem:
Disable serverless tests
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
This is a clone of issue OCPBUGS-37736. The following is the description of the original issue:
—
Modify the import to strip or change the bootOptions.efiSecureBootEnabled
https://redhat-internal.slack.com/archives/CLKF3H5RS/p1722368792144319
archive := &importx.ArchiveFlag{Archive: &importx.TapeArchive{Path: cachedImage}}

ovfDescriptor, err := archive.ReadOvf("*.ovf")
if err != nil {
	// Open the corrupt OVA file
	f, ferr := os.Open(cachedImage)
	if ferr != nil {
		return ferr
	}
	defer f.Close()
	// Get a sha256 on the corrupt OVA file
	// and the size of the file
	h := sha256.New()
	written, cerr := io.Copy(h, f)
	if cerr != nil {
		return cerr
	}
	return fmt.Errorf("ova %s has a sha256 of %x and a size of %d bytes, failed to read the ovf descriptor %w", cachedImage, h.Sum(nil), written, err)
}

ovfEnvelope, err := archive.ReadEnvelope(ovfDescriptor)
if err != nil {
	return fmt.Errorf("failed to parse ovf descriptor: %w", err)
}
// ovfEnvelope can now be inspected or modified (e.g. bootOptions.efiSecureBootEnabled).
Description of problem:
The Installer still requires permissions to create and delete IAM roles even when the user brings existing roles.
Version-Release number of selected component (if applicable):
4.16+
How reproducible:
always
Steps to Reproduce:
1. Specify existing IAM role in the install-config 2. 3.
Actual results:
The following permissions are required even though they are not used: "iam:CreateRole", "iam:DeleteRole", "iam:DeleteRolePolicy", "iam:PutRolePolicy", "iam:TagInstanceProfile"
Expected results:
Only actually needed permissions are required.
Additional info:
I think this is tech debt from when roles were not tagged. The fix will kind of revert https://github.com/openshift/installer/pull/5286
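For illustration, a minimal sketch of a bring-your-own-role path (ensureInstanceRole and its parameters are hypothetical, not the installer's actual API): when the install-config already names an existing role, nothing needs to be created or deleted later, so iam:CreateRole and iam:DeleteRole should not be demanded.

// ensureInstanceRole returns the IAM role to attach to instances. When an
// existing role name is supplied in the install-config, it is used as-is and
// no create/delete calls are made; only when the field is empty does the
// installer fall back to creating and tagging its own role.
func ensureInstanceRole(existingRole string, createRole func() (string, error)) (string, error) {
	if existingRole != "" {
		// Bring-your-own-role: skip iam:CreateRole and the later iam:DeleteRole.
		return existingRole, nil
	}
	return createRole()
}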
This is a clone of issue OCPBUGS-31738. The following is the description of the original issue:
—
Description of problem:
The [Jira:"Network / ovn-kubernetes"] monitor test pod-network-avalibility setup test frequently fails on OpenStack platform, which in turn also causes the [sig-network] can collect pod-to-service poller pod logs and [sig-network] can collect host-to-service poller pod logs tests to fail.
These failure happen frequently in vh-mecha, for example for all CSI jobs, such as 4.16-e2e-openstack-csi-cinder.
Description of problem:
monitor-add-nodes.sh returns Error: open .addnodesparams: permission denied.
Version-Release number of selected component (if applicable):
4.16
How reproducible:
sometimes
Steps to Reproduce:
1. Monitor adding a day2 node using monitor-add-nodes.sh 2. 3.
Actual results:
Error: open .addnodesparams: permission denied.
Expected results:
monitor-add-nodes runs successfully
Additional info:
zhenying niu found an issue in node-joiner-monitor.sh:
[core@ocp-edge49 installer]$ ./node-joiner-monitor.sh 192.168.122.6 namespace/openshift-node-joiner-mz8anfejbn created serviceaccount/node-joiner-monitor created clusterrole.rbac.authorization.k8s.io/node-joiner-monitor unchanged clusterrolebinding.rbac.authorization.k8s.io/node-joiner-monitor configured pod/node-joiner-monitor created Now using project "openshift-node-joiner-mz8anfejbn" on server "https://api.ostest.test.metalkube.org:6443". pod/node-joiner-monitor condition met time=2024-05-21T09:24:19Z level=info msg=Monitoring IPs: [192.168.122.6] Error: open .addnodesparams: permission denied Usage: node-joiner monitor-add-nodes [flags] Flags: -h, --help help for monitor-add-nodes Global Flags: --dir string assets directory (default ".") --kubeconfig string Path to the kubeconfig file. --log-level string log level (e.g. "debug | info | warn | error") (default "info") time=2024-05-21T09:24:19Z level=fatal msg=open .addnodesparams: permission denied Cleaning up Removing temporary file /tmp/nodejoiner-mZ8aNfEjbn
[~afasano@redhat.com] found the root cause: the working directory was not set, so the current directory /output is used, and it is not writable. An easy fix would be to just use /tmp, i.e.: command: ["/bin/sh", "-c", "node-joiner monitor-add-nodes $ipAddresses --dir=/tmp --log-level=info; sleep 5"]
The OCM-operator's imagePullSecretCleanupController attempts to prevent new pods from using an image pull secret that needs to be deleted, but this results in the OCM creating a new image pull secret in the meantime.
The overlap occurs when OCM-operator has detected the registry is removed, simultaneously triggering the imagePullSecretCleanup controller to start deleting and updating the OCM config to stop creating, but the OCM behavior change is delayed until its pods are restarted.
In 4.16 this churn is minimized due to the OCM naming the image pull secrets consistently, but the churn can occur during an upgrade given that the OCM-operator is updated first.
Description of problem:
There is one pod of metal3 operator in constant failure state. The cluster was acting as Hub cluster with ACM + GitOps for SNO installation. It was working well for a few days, until this moment when no other sites could be deployed. oc get pods -A | grep metal3 openshift-machine-api metal3-64cf86fb8b-fg5b9 3/4 CrashLoopBackOff 35 (108s ago) 155m openshift-machine-api metal3-baremetal-operator-84875f859d-6kj9s 1/1 Running 0 155m openshift-machine-api metal3-image-customization-57f8d4fcd4-996hd 1/1 Running 0 5h
Version-Release number of selected component (if applicable):
OCP version: 4.16.ec5
How reproducible:
Once it starts to fail, it does not recover.
Steps to Reproduce:
1. Unclear. Install Hub cluster with ACM+GitOps 2. (Perhaps: update the AgentServiceConfig)
Actual results:
Pod crashing and installation of spoke cluster fails
Expected results:
Pod running and installation of spoke cluster succeeds.
Additional info:
Logs of metal3-ironic-inspector: `[kni@infra608-1 ~]$ oc logs pods/metal3-64cf86fb8b-fg5b9 -c metal3-ironic-inspector + CONFIG=/etc/ironic-inspector/ironic-inspector.conf + export IRONIC_INSPECTOR_ENABLE_DISCOVERY=false + IRONIC_INSPECTOR_ENABLE_DISCOVERY=false + export INSPECTOR_REVERSE_PROXY_SETUP=true + INSPECTOR_REVERSE_PROXY_SETUP=true + . /bin/tls-common.sh ++ export IRONIC_CERT_FILE=/certs/ironic/tls.crt ++ IRONIC_CERT_FILE=/certs/ironic/tls.crt ++ export IRONIC_KEY_FILE=/certs/ironic/tls.key ++ IRONIC_KEY_FILE=/certs/ironic/tls.key ++ export IRONIC_CACERT_FILE=/certs/ca/ironic/tls.crt ++ IRONIC_CACERT_FILE=/certs/ca/ironic/tls.crt ++ export IRONIC_INSECURE=true ++ IRONIC_INSECURE=true ++ export 'IRONIC_SSL_PROTOCOL=-ALL +TLSv1.2 +TLSv1.3' ++ IRONIC_SSL_PROTOCOL='-ALL +TLSv1.2 +TLSv1.3' ++ export 'IPXE_SSL_PROTOCOL=-ALL +TLSv1.2 +TLSv1.3' ++ IPXE_SSL_PROTOCOL='-ALL +TLSv1.2 +TLSv1.3' ++ export IRONIC_VMEDIA_SSL_PROTOCOL=ALL ++ IRONIC_VMEDIA_SSL_PROTOCOL=ALL ++ export IRONIC_INSPECTOR_CERT_FILE=/certs/ironic-inspector/tls.crt ++ IRONIC_INSPECTOR_CERT_FILE=/certs/ironic-inspector/tls.crt ++ export IRONIC_INSPECTOR_KEY_FILE=/certs/ironic-inspector/tls.key ++ IRONIC_INSPECTOR_KEY_FILE=/certs/ironic-inspector/tls.key ++ export IRONIC_INSPECTOR_CACERT_FILE=/certs/ca/ironic-inspector/tls.crt ++ IRONIC_INSPECTOR_CACERT_FILE=/certs/ca/ironic-inspector/tls.crt ++ export IRONIC_INSPECTOR_INSECURE=true ++ IRONIC_INSPECTOR_INSECURE=true ++ export IRONIC_VMEDIA_CERT_FILE=/certs/vmedia/tls.crt ++ IRONIC_VMEDIA_CERT_FILE=/certs/vmedia/tls.crt ++ export IRONIC_VMEDIA_KEY_FILE=/certs/vmedia/tls.key ++ IRONIC_VMEDIA_KEY_FILE=/certs/vmedia/tls.key ++ export IPXE_CERT_FILE=/certs/ipxe/tls.crt ++ IPXE_CERT_FILE=/certs/ipxe/tls.crt ++ export IPXE_KEY_FILE=/certs/ipxe/tls.key ++ IPXE_KEY_FILE=/certs/ipxe/tls.key ++ export RESTART_CONTAINER_CERTIFICATE_UPDATED=false ++ RESTART_CONTAINER_CERTIFICATE_UPDATED=false ++ export MARIADB_CACERT_FILE=/certs/ca/mariadb/tls.crt ++ MARIADB_CACERT_FILE=/certs/ca/mariadb/tls.crt ++ export IPXE_TLS_PORT=8084 ++ IPXE_TLS_PORT=8084 ++ mkdir -p /certs/ironic ++ mkdir -p /certs/ironic-inspector ++ mkdir -p /certs/ca/ironic mkdir: cannot create directory '/certs/ca/ironic': Permission denied
This is a clone of issue OCPBUGS-38515. The following is the description of the original issue:
—
Description of problem:
container_network* metrics disappeared from pods
Version-Release number of selected component (if applicable):
4.17.0-0.nightly-2024-08-13-031847
How reproducible:
always
Steps to Reproduce:
1.create a pod 2.check container_network* metrics from the pod $oc get --raw /api/v1/nodes/jimabug02-95wr2-worker-westus-b2cpv/proxy/metrics/cadvisor | grep container_network_transmit | grep $pod_name
Actual results:
2. It fails to report container_network* metrics.
Expected results:
2. It should report container_network* metrics.
Additional info:
This may be a regression issue, we hit it in 4.14 https://issues.redhat.com/browse/OCPBUGS-13741
This is a clone of issue OCPBUGS-41328. The following is the description of the original issue:
—
Description of problem:
Rotating the root certificates (root CA) requires multiple certificates during the rotation process to prevent downtime as the server and client certificates are updated in the control and data planes. Currently, the HostedClusterConfigOperator uses the cluster-signer-ca from the control plane to create a kublet-serving-ca on the data plane. The cluster-signer-ca contains only a single certificate that is used for signing certificates for the kube-controller-manager. During a rotation, the kublet-serving-ca will be updated with the new CA which triggers the metrics-server pod to restart and use the new CA. This will lead to an error in the metrics-server where it cannot scrape metrics as the kublet has yet to pickup the new certificate. E0808 16:57:09.829746 1 scraper.go:149] "Failed to scrape node" err="Get \"https://10.240.0.29:10250/metrics/resource\": tls: failed to verify certificate: x509: certificate signed by unknown authority" node="pres-cqogb7a10b7up68kvlvg-rkcpsms0805-default-00000130" rkc@rmac ~> kubectl get pods -n openshift-monitoring NAME READY STATUS RESTARTS AGE metrics-server-594cd99645-g8bj7 0/1 Running 0 2d20h metrics-server-594cd99645-jmjhj 1/1 Running 0 46h The HostedClusterConfigOperator should likely be using the KubeletClientCABundle from the control plane for the kublet-serving-ca in the data plane. This CA bundle will contain both the new and old CA such that all data plane components can remain up during the rotation process.
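A rough sketch of the suggested direction, assuming client-go (the function name, namespace, and data key are illustrative, not the HostedClusterConfigOperator's actual code): publish the kubelet client CA bundle, which carries both the old and the new CA during a rotation, instead of the single-certificate signer CA.

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// reconcileKubeletServingCA writes the control plane's kubelet client CA
// bundle (old and new CA during a rotation) into the kubelet-serving-ca
// ConfigMap on the data plane, so consumers such as metrics-server keep
// verifying kubelets throughout the rotation.
func reconcileKubeletServingCA(ctx context.Context, guest kubernetes.Interface, caBundlePEM string) error {
	cm := &corev1.ConfigMap{
		ObjectMeta: metav1.ObjectMeta{
			Name:      "kubelet-serving-ca",
			Namespace: "openshift-config-managed",
		},
		Data: map[string]string{"ca-bundle.crt": caBundlePEM},
	}
	_, err := guest.CoreV1().ConfigMaps(cm.Namespace).Update(ctx, cm, metav1.UpdateOptions{})
	if apierrors.IsNotFound(err) {
		_, err = guest.CoreV1().ConfigMaps(cm.Namespace).Create(ctx, cm, metav1.CreateOptions{})
	}
	return err
}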
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
After running tests on an SNO with the Telco DU profile for a couple of hours, kubernetes.io/kubelet-serving CSRs in Pending state start showing up and accumulating over time.
Version-Release number of selected component (if applicable):
4.16.0-rc.1
How reproducible:
once so far
Steps to Reproduce:
1. Deploy SNO with DU profile with disabled capabilities: installConfigOverrides: "{\"capabilities\":{\"baselineCapabilitySet\": \"None\", \"additionalEnabledCapabilities\": [ \"NodeTuning\", \"ImageRegistry\", \"OperatorLifecycleManager\" ] }}" 2. Leave the node running tests overnight for a couple of hours 3. Check for Pending CSRs
Actual results:
oc get csr -A | grep Pending | wc -l 27
Expected results:
No pending CSRs. Also, oc logs will return a TLS internal error: oc -n openshift-cluster-machine-approver --insecure-skip-tls-verify-backend=true logs machine-approver-866c94c694-7dwks Defaulted container "kube-rbac-proxy" out of: kube-rbac-proxy, machine-approver-controller Error from server: Get "https://[2620:52:0:8e6::d0]:10250/containerLogs/openshift-cluster-machine-approver/machine-approver-866c94c694-7dwks/kube-rbac-proxy": remote error: tls: internal error
Additional info:
Checking the machine-approver-controller container logs on the node we can see the reconciliation is failing be cause it cannot find the Machine API which is disabled from the capabilities. I0514 13:25:09.266546 1 controller.go:120] Reconciling CSR: csr-dw9c8 E0514 13:25:09.275585 1 controller.go:138] csr-dw9c8: Failed to list machines in API group machine.openshift.io/v1beta1: no matches for kind "Machine" in version "machine.openshift.io/v1beta1" E0514 13:25:09.275665 1 controller.go:329] "Reconciler error" err="Failed to list machines: no matches for kind \"Machine\" in version \"machine.openshift.io/v1beta1\"" controller="certificatesigningrequest" controllerGroup="certificates.k8s.io" controllerKind="CertificateSigningRequest" CertificateSigningRequest="csr-dw9c8" namespace="" name="csr-dw9c8" reconcileID="6f963337-c6f1-46e7-80c4-90494d21653c" I0514 13:25:43.792140 1 controller.go:120] Reconciling CSR: csr-jvrvt E0514 13:25:43.798079 1 controller.go:138] csr-jvrvt: Failed to list machines in API group machine.openshift.io/v1beta1: no matches for kind "Machine" in version "machine.openshift.io/v1beta1" E0514 13:25:43.798128 1 controller.go:329] "Reconciler error" err="Failed to list machines: no matches for kind \"Machine\" in version \"machine.openshift.io/v1beta1\"" controller="certificatesigningrequest" controllerGroup="certificates.k8s.io" controllerKind="CertificateSigningRequest" CertificateSigningRequest="csr-jvrvt" namespace="" name="csr-jvrvt" reconcileID="decbc5d9-fa10-45d1-92f1-1c999df956ff"
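One possible mitigation, sketched with the client-go discovery API (machineAPIAvailable is a hypothetical helper, not the machine-approver's actual code): check whether the Machine API group version is served before listing Machines, so clusters without the MachineAPI capability do not turn every CSR reconcile into an error.

import (
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/client-go/discovery"
)

// machineAPIAvailable reports whether machine.openshift.io/v1beta1 is served
// by the API server. When the MachineAPI capability is disabled, the group
// version is absent and the approver should not treat that as a reconcile
// error for every kubelet-serving CSR.
func machineAPIAvailable(dc discovery.DiscoveryInterface) (bool, error) {
	_, err := dc.ServerResourcesForGroupVersion("machine.openshift.io/v1beta1")
	if apierrors.IsNotFound(err) {
		return false, nil
	}
	if err != nil {
		return false, err
	}
	return true, nil
}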
Description of problem:
When worker nodes are defined with DaemonSets that have PVCs attached, they are not properly deleted, and nodes stay in the deleting state as volumes are still attached.
Version-Release number of selected component (if applicable):
1.14.19
How reproducible:
Check mustGather for details.
Steps to Reproduce:
Actual results:
Nodes stay in the deletion state until we manually "remove" the DaemonSets.
Expected results:
Nodes should be deleted properly without manual action.
Additional info:
The customer is not experiencing this issue on standard ROSA.
Description of problem:
For STS, an AWS creds file is injected with credentials_process for installer to use. That usually points to a command that loads a Secret containing the creds necessary to assume role. For CAPI, installer runs in an ephemeral envtest cluster. So when it runs that credentials_process (via the black box of passing the creds file to the AWS SDK) the command ends up requesting that Secret from the envtest kube API server… where it doesn’t exist. The Installer should avoid overriding KUBECONFIG whenever possible.
Version-Release number of selected component (if applicable):
4.16+
How reproducible:
always
Steps to Reproduce:
1. Deploy cluster with STS credentials 2. 3.
Actual results:
Install fails with: time="2024-06-02T23:50:17Z" level=debug msg="failed to get the service provider secret: secrets \"shawnnightly-aws-service-provider-secret\" not foundfailed to get the service provider secret: oc get events -n uhc-staging-2blaesc1478urglmcfk3r79a17n82lm3E0602 23:50:17.324137 151 awscluster_controller.go:327] \"failed to reconcile network\" err=<" time="2024-06-02T23:50:17Z" level=debug msg="\tfailed to create new managed VPC: failed to create vpc: ProcessProviderExecutionError: error in credential_process" time="2024-06-02T23:50:17Z" level=debug msg="\tcaused by: exit status 1" time="2024-06-02T23:50:17Z" level=debug msg=" > controller=\"awscluster\" controllerGroup=\"infrastructure.cluster.x-k8s.io\" controllerKind=\"AWSCluster\" AWSCluster=\"openshift-cluster-api-guests/shawnnightly-c8zdl\" namespace=\"openshift-cluster-api-guests\" name=\"shawnnightly-c8zdl\" reconcileID=\"e7524343-f598-4b71-a788-ad6975e92be7\" cluster=\"openshift-cluster-api-guests/shawnnightly-c8zdl\"" time="2024-06-02T23:50:17Z" level=debug msg="I0602 23:50:17.324204 151 recorder.go:104] \"Failed to create new managed VPC: ProcessProviderExecutionError: error in credential_process\\ncaused by: exit status 1\" logger=\"events\" type=\"Warning\" object={\"kind\":\"AWSCluster\",\"namespace\":\"openshift-cluster-api-guests\",\"name\":\"shawnnightly-c8zdl\",\"uid\":\"f20bd7ae-a8d2-4b16-91c2-c9525256bb46\",\"apiVersion\":\"infrastructure.cluster.x-k8s.io/v1beta2\",\"resourceVersion\":\"311\"} reason=\"FailedCreateVPC\""
Expected results:
No failures
Additional info:
This is a clone of issue OCPBUGS-37560. The following is the description of the original issue:
—
Description of problem:
Console user settings are saved in a ConfigMap for each user in the namespace openshift-console-user-settings.
The console frontend uses the k8s API to read and write that ConfigMap. The console backend creates a ConfigMap with a Role and RoleBinding for each user, giving that single user read and write access to his/her own ConfigMap.
A large number of Roles and RoleBindings might decrease cluster performance. This has happened in the past, esp. on the Developer Sandbox, where a long-living cluster creates new users that are then automatically removed after a month. Keeping the Role and RoleBinding results in performance issues.
The resources had an ownerReference before 4.15 so that the 3 resources (1 ConfigMap, 1 Role, 1 RoleBinding) were automatically removed when the User resource was deleted. This ownerReference was removed with 4.15 to support external OIDC providers.
The ask in this issue is to restore that ownerReference for the OpenShift auth provider.
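A minimal sketch of what restoring the ownerReference could look like (the function name, ConfigMap name pattern, and namespace are illustrative, assuming the openshift/api user/v1 and core/v1 types): a cluster-scoped User may own a namespaced dependent, so garbage collection removes the ConfigMap once the User is deleted. The Role and RoleBinding could be built the same way, and the reference would only be set for the internal OpenShift auth provider.

import (
	"fmt"

	userv1 "github.com/openshift/api/user/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// userSettingsConfigMap builds the per-user settings ConfigMap with an
// ownerReference pointing back at the User object, so deleting the User
// garbage collects the ConfigMap.
func userSettingsConfigMap(user *userv1.User) *corev1.ConfigMap {
	return &corev1.ConfigMap{
		ObjectMeta: metav1.ObjectMeta{
			Name:      fmt.Sprintf("user-settings-%s", user.UID),
			Namespace: "openshift-console-user-settings",
			OwnerReferences: []metav1.OwnerReference{{
				APIVersion: "user.openshift.io/v1",
				Kind:       "User",
				Name:       user.Name,
				UID:        user.UID,
			}},
		},
	}
}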
History:
See also:
Version-Release number of selected component (if applicable):
4.15+
How reproducible:
Always
Steps to Reproduce:
Actual results:
The three resources weren't deleted after the user was deleted.
Expected results:
The three resources should be deleted after the user is deleted.
Additional info:
Please review the following PR: https://github.com/openshift/cluster-machine-approver/pull/232
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
Setting the feature gate UserNamespacesSupport should put a cluster in tech preview, not CustomNoUpgrade.
Version-Release number of selected component (if applicable):
4.17.0
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
This is a clone of issue OCPBUGS-23080. The following is the description of the original issue:
—
Description of problem:
This is essentially an incarnation of the bug https://bugzilla.redhat.com/show_bug.cgi?id=1312444 that was fixed in OpenShift 3 but is now present again.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
Select a template in the console web UI, try to enter a multiline value.
Actual results:
It's impossible to enter line breaks.
Expected results:
It should be possible to achieve entering a multiline parameter when creating apps from templates.
Additional info:
I also filed an issue here https://github.com/openshift/console/issues/13317. P.S. It's happening on https://openshift-console.osci.io, not sure what version of OpenShift they're running exactly.
After fixing https://issues.redhat.com/browse/OCPBUGS-29919 by merging https://github.com/openshift/baremetal-runtimecfg/pull/301, we have lost the ability to properly debug the node IP selection logic used in runtimecfg.
In order to preserve the debuggability of this component, it should be possible to selectively enable verbose logs.
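A small sketch of what selectively enabling verbose logs could look like, assuming a logrus-style logger (the --verbose flag and the standalone main are hypothetical, not the actual runtimecfg CLI):

package main

import (
	"flag"

	"github.com/sirupsen/logrus"
)

func main() {
	// Hypothetical flag; the real command would wire this into its existing
	// flag/cobra setup instead of a standalone main.
	verbose := flag.Bool("verbose", false, "enable debug logging for node IP selection")
	flag.Parse()

	if *verbose {
		logrus.SetLevel(logrus.DebugLevel)
	}

	// With -verbose set, the address-selection logic can log every candidate
	// it considers without flooding the journal by default.
	logrus.Debugf("considering candidate node IP %s", "192.0.2.10")
}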
Description of problem:
In the RBAC which is set up for networkTypes other than OVNKubernetes, the cluster-network-operator role allows access to a configmap named "openshift-service-ca.crt", but the configmap which is actually used is named "root-ca".
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
This is a clone of issue OCPBUGS-38011. The following is the description of the original issue:
—
Description of problem:
This seems to be a requirement to set a Project/namespace. However, in the CLI, RoleBinding objects can be created without a namespace with no issues.
$ oc describe rolebinding.rbac.authorization.k8s.io/monitor
Name: monitor
Labels: <none>
Annotations: <none>
Role:
Kind: ClusterRole
Name: view
Subjects:
Kind Name Namespace
---- ---- ---------
ServiceAccount monitor
—
This is inconsistent with the dev console, causing confusion for developers and administrators and making things cumbersome for administrators.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. Log in to the web console as a Developer. 2. Select a Project on the left. 3. Select the 'Project Access' tab. 4. Add access -> Select Service Account in the dropdown
Actual results:
Save button is not active when no project is selected
Expected results:
The Save button should be enabled even when no Project is selected, so that the RoleBinding can be created just as it is handled in the CLI.
Additional info:
Description of problem:
The cluster-api-provider-openstack revision used for e2e testing in cluster-capi-operator is not pinned to a release branch. As a result, the Go version used in the two projects goes out of sync, causing the tests to fail to start.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
The rhel8 build, necessary for rhel8 workers, is actually a rhel9 build.
Version-Release number of selected component (if applicable):
4.16 + where base images are now rhel9
Update the 4.17 installer to use commit c6bcd313bce0fc9866e41bb9e3487d9f61c628a3 of cluster-api-provider-ibmcloud. This includes a couple of necessary Transit Gateway fixes.
Please review the following PR: https://github.com/openshift/oauth-apiserver/pull/115
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/ovn-kubernetes/pull/2185
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
This is a clone of issue OCPBUGS-35947. The following is the description of the original issue:
—
Description of problem:
In install-config file, there is no zone/instance type setting under controlplane or defaultMachinePlatform ========================== featureSet: CustomNoUpgrade featureGates: - ClusterAPIInstallAzure=true compute: - architecture: amd64 hyperthreading: Enabled name: worker platform: {} replicas: 3 controlPlane: architecture: amd64 hyperthreading: Enabled name: master platform: {} replicas: 3 create cluster, master instances should be created in multi zones, since default instance type 'Standard_D8s_v3' have availability zones. Actually, master instances are not created in any zone. $ az vm list -g jima24a-f7hwg-rg -otable Name ResourceGroup Location Zones ------------------------------------------ ---------------- -------------- ------- jima24a-f7hwg-master-0 jima24a-f7hwg-rg southcentralus jima24a-f7hwg-master-1 jima24a-f7hwg-rg southcentralus jima24a-f7hwg-master-2 jima24a-f7hwg-rg southcentralus jima24a-f7hwg-worker-southcentralus1-wxncv jima24a-f7hwg-rg southcentralus 1 jima24a-f7hwg-worker-southcentralus2-68nxv jima24a-f7hwg-rg southcentralus 2 jima24a-f7hwg-worker-southcentralus3-4vts4 jima24a-f7hwg-rg southcentralus 3
Version-Release number of selected component (if applicable):
4.17.0-0.nightly-2024-06-23-145410
How reproducible:
Always
Steps to Reproduce:
1. CAPI-based install on azure platform with default configuration 2. 3.
Actual results:
master instances are created but not in any zone.
Expected results:
master instances should be created per zone based on selected instance type, keep the same behavior as terraform based install.
Additional info:
When setting zones under controlPlane in install-config, master instances can be created per zone. install-config: =========================== controlPlane: architecture: amd64 hyperthreading: Enabled name: master platform: azure: zones: ["1","3"] $ az vm list -g jima24b-p76w4-rg -otable Name ResourceGroup Location Zones ------------------------------------------ ---------------- -------------- ------- jima24b-p76w4-master-0 jima24b-p76w4-rg southcentralus 1 jima24b-p76w4-master-1 jima24b-p76w4-rg southcentralus 3 jima24b-p76w4-master-2 jima24b-p76w4-rg southcentralus 1 jima24b-p76w4-worker-southcentralus1-bbcx8 jima24b-p76w4-rg southcentralus 1 jima24b-p76w4-worker-southcentralus2-nmgfd jima24b-p76w4-rg southcentralus 2 jima24b-p76w4-worker-southcentralus3-x2p7g jima24b-p76w4-rg southcentralus 3
This is a clone of issue OCPBUGS-38217. The following is the description of the original issue:
—
Description of problem:
After changing LB type from CLB to NLB, the "status.endpointPublishingStrategy.loadBalancer.providerParameters.aws.classicLoadBalancer" is still there, but if we create a new NLB ingresscontroller, the "classicLoadBalancer" does not appear. // after changing default ingresscontroller to NLB $ oc -n openshift-ingress-operator get ingresscontroller/default -oyaml | yq .status.endpointPublishingStrategy.loadBalancer.providerParameters.aws classicLoadBalancer: <<<< connectionIdleTimeout: 0s <<<< networkLoadBalancer: {} type: NLB // create new ingresscontroller with NLB $ oc -n openshift-ingress-operator get ingresscontroller/nlb -oyaml | yq .status.endpointPublishingStrategy.loadBalancer.providerParameters.aws networkLoadBalancer: {} type: NLB
Version-Release number of selected component (if applicable):
4.17.0-0.nightly-2024-08-08-013133
How reproducible:
100%
Steps to Reproduce:
1. changing default ingresscontroller to NLB $ oc -n openshift-ingress-operator patch ingresscontroller/default --type=merge --patch='{"spec":{"endpointPublishingStrategy":{"type":"LoadBalancerService","loadBalancer":{"providerParameters":{"type":"AWS","aws":{"type":"NLB"}},"scope":"External"}}}}' 2. create new ingresscontroller with NLB kind: IngressController apiVersion: operator.openshift.io/v1 metadata: name: nlb namespace: openshift-ingress-operator spec: domain: nlb.<base-domain> replicas: 1 endpointPublishingStrategy: loadBalancer: providerParameters: aws: type: NLB type: AWS scope: External type: LoadBalancerService 3. check both ingresscontrollers status
Actual results:
// after changing default ingresscontroller to NLB $ oc -n openshift-ingress-operator get ingresscontroller/default -oyaml | yq .status.endpointPublishingStrategy.loadBalancer.providerParameters.aws classicLoadBalancer: connectionIdleTimeout: 0s networkLoadBalancer: {} type: NLB // new ingresscontroller with NLB $ oc -n openshift-ingress-operator get ingresscontroller/nlb -oyaml | yq .status.endpointPublishingStrategy.loadBalancer.providerParameters.aws networkLoadBalancer: {} type: NLB
Expected results:
If type=NLB, then "classicLoadBalancer" should not appear in the status, and the status should stay consistent whether an existing ingresscontroller is changed to NLB or a new one is created with NLB.
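One way the status sync could normalize this, sketched against the openshift/api operator/v1 types (normalizeAWSLBStatus is a hypothetical helper, not the ingress operator's actual code):

import (
	operatorv1 "github.com/openshift/api/operator/v1"
)

// normalizeAWSLBStatus drops the classicLoadBalancer parameters from the
// reported provider parameters when the load balancer type is NLB, so a
// controller switched from CLB to NLB ends up with the same status shape as
// one created with NLB from the start.
func normalizeAWSLBStatus(aws *operatorv1.AWSLoadBalancerParameters) {
	if aws == nil {
		return
	}
	if aws.Type == operatorv1.AWSNetworkLoadBalancer {
		aws.ClassicLoadBalancerParameters = nil
	}
}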
Additional info:
This is a clone of issue OCPBUGS-39420. The following is the description of the original issue:
—
Description of problem:
ROSA HCP allows customers to select hostedcluster and nodepool OCP z-stream versions, respecting version skew requirements. E.g.:
Version-Release number of selected component (if applicable):
Reproducible on 4.14-4.16.z, this bug report demonstrates it for a 4.15.28 hostedcluster with a 4.15.25 nodepool
How reproducible:
100%
Steps to Reproduce:
1. Create a ROSA HCP cluster, which comes with a 2-replica nodepool with the same z-stream version (4.15.28) 2. Create an additional nodepool at a different version (4.15.25)
Actual results:
Observe that while nodepool objects report the different version (4.15.25), the resulting kernel version of the node is that of the hostedcluster (4.15.28) ❯ k get nodepool -n ocm-staging-2didt6btjtl55vo3k9hckju8eeiffli8 NAME CLUSTER DESIRED NODES CURRENT NODES AUTOSCALING AUTOREPAIR VERSION UPDATINGVERSION UPDATINGCONFIG MESSAGE mshen-hyper-np-4-15-25 mshen-hyper 1 1 False True 4.15.25 False False mshen-hyper-workers mshen-hyper 2 2 False True 4.15.28 False False ❯ k get no -owide NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME ip-10-0-129-139.us-west-2.compute.internal Ready worker 24m v1.28.12+396c881 10.0.129.139 <none> Red Hat Enterprise Linux CoreOS 415.92.202408100433-0 (Plow) 5.14.0-284.79.1.el9_2.aarch64 cri-o://1.28.9-5.rhaos4.15.git674ed4c.el9 ip-10-0-129-165.us-west-2.compute.internal Ready worker 98s v1.28.12+396c881 10.0.129.165 <none> Red Hat Enterprise Linux CoreOS 415.92.202408100433-0 (Plow) 5.14.0-284.79.1.el9_2.aarch64 cri-o://1.28.9-5.rhaos4.15.git674ed4c.el9 ip-10-0-132-50.us-west-2.compute.internal Ready worker 30m v1.28.12+396c881 10.0.132.50 <none> Red Hat Enterprise Linux CoreOS 415.92.202408100433-0 (Plow) 5.14.0-284.79.1.el9_2.aarch64 cri-o://1.28.9-5.rhaos4.15.git674ed4c.el9
Expected results:
Additional info:
Please review the following PR: https://github.com/openshift/openshift-state-metrics/pull/115
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Updating secrets using the form editor displays an unknown warning message. This is caused by an incorrect request object being sent to the server by the edit Secret form.
Description of problem:
Version-Release number of selected component (if applicable):
4.16 - Always
How reproducible:
Steps to Reproduce:
1. Go to the Edit Secret form editor 2. Click Save. The warning notification is triggered because of the incorrect request object.
Actual results:
Expected results:
Additional info:
Description of problem:
The console always sends GET CSV requests to the 'openshift' namespace even when copiedCSV is not disabled.
Version-Release number of selected component (if applicable):
4.16.0-0.nightly-2024-05-15-001800
How reproducible:
Always
Steps to Reproduce:
1. Install operator into a specific namespace on the cluster(Operator will be available in a single Namespace only.) for example, subscribe APIcast into project 'test' 2. Check if copiedCSV is disabled >> window.SERVER_FLAGS.copiedCSVsDisabled <- false // copiedCSV is NOT disabled $ oc get olmconfig cluster -o json | jq .status.conditions [ { "lastTransitionTime": "2024-05-15T23:16:50Z", "message": "Copied CSVs are enabled and present across the cluster", "reason": "CopiedCSVsEnabled", "status": "False", "type": "DisabledCopiedCSVs" } ] 3. monitor the browser console errors when check operator details via Operators -> Installed Operators -> click on the operator APIcast
Actual results:
3. we can see a GET CSV request to 'openshift' namespace is sent and 404 was returned GET https://console-openshift-console.apps.qe-daily-416-0516.qe.devcluster.openshift.com/api/kubernetes/apis/operators.coreos.com/v1alpha1/namespaces/openshift/clusterserviceversions/apicast-community-operator.v0.7.1
Expected results:
3. Since copiedCSV is not disabled, we probably should not send a request to query CSVs from the 'openshift' namespace.
Additional info:
Description of problem:
When deploying nodepools on OpenStack, the Nodepool condition complains about unsupported amd64 while we actually support it.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Please review the following PR: https://github.com/openshift/network-tools/pull/129
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
https://github.com/openshift/console/pull/13769 removed the bulk of packages/kubevirt-plugin, but it orphaned packages/kubevirt-plugin/locales/* as those files were added after 13769 was authored.
Description of problem:
In the use case when worker nodes require a proxy for outside access and the control plane is external (and only accessible via the internet), ovnkube-node pods never become available because the ovnkube-controller container cannot reach the Kube APIServer.
Version-Release number of selected component (if applicable):
How reproducible: Always
Steps to Reproduce:
1. Create an AWS hosted cluster with Public access and requires a proxy to access the internet.
2. Wait for nodes to become active
Actual results:
Nodes join cluster, but never become active
Expected results:
Nodes join cluster and become active
This is a clone of issue OCPBUGS-38701. The following is the description of the original issue:
—
Description of problem:
The 'Clear all filters' button is counted as part of the resource type count.
Version-Release number of selected component (if applicable):
4.18.0-0.nightly-2024-08-19-002129
How reproducible:
Always
Steps to Reproduce:
1. navigate to Home -> Events page, choose 3 resource types, check what's shown on page 2. navigate to Home -> Search page, choose 3 resource types, check what's shown on page. Choose 4 resource types and check what's shown
Actual results:
1. It shows `1 more`, but only the 'Clear all filters' button is shown if we click on the `1 more` button 2. The `1 more` button is only displayed when 4 resource types are selected; this is working as expected
Expected results:
1. The 'Clear all filters' button should not be counted as part of the resource count; the 'N more' label should reflect the correct number of resource types.
Additional info:
Description of problem:
Power VS endpoint validation in the API only allows for lower case characters. However, the endpoint struct we check against is hardcoded to be PascalCase. We need to check against the lower case version of the string in the image registry operator.
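A minimal sketch of the case-insensitive lookup (endpointOverride is a hypothetical helper, not the image registry operator's actual code):

import (
	"strings"
)

// endpointOverride looks up a Power VS service endpoint override by name
// without caring about case, so the lowercase names the API accepts still
// match the PascalCase keys of the hardcoded endpoint struct.
func endpointOverride(overrides map[string]string, service string) (string, bool) {
	for name, url := range overrides {
		if strings.EqualFold(name, service) {
			return url, true
		}
	}
	return "", false
}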
Version-Release number of selected component (if applicable):
How reproducible:
Easily
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
After upgrading from 4.12 to 4.14, the customer reports that the pods cannot reach their service when a NetworkAttachmentDefinition is set.
How reproducible:
Create a NetworkAttachmentDefinition
Steps to Reproduce:
1.Create a pod with a service. 2. Curl the service from inside the pod. Works. 3. Create a NetworkAttachmentDefinition. 4. The same curl does not work
Actual results:
Pod does not reach service
Expected results:
Pod reaches service
Additional info:
specifically updating the bug overview for posterity here but the specific issue is that we have pods set up with an exposed port (8080 - port doesn't matter), and a service with 1 endpoint pointing to the specific pod. We can call OTHER PODS in the same namespace via their single-endpoint call service, but we cannot call OURSELVES from inside the pod. The issue is with hairpinning loopback return. Is not affected by networkpolicy and appears to be an issue with (as discovered later in this jira) asymmetric routing in that return path to the container after it leaves the local net. This behavior is only observed when a network-attachment-definition is added to the pod and appears to be an issue with the way route rules are defined. A workaround is available to inject the container with a route specicically, or modify the Net-attach-def to ensure a loopback route is available to the container space.
KCS for this problem with workarounds + patch fix versions (when available): https://access.redhat.com/solutions/7084866
This is a clone of issue OCPBUGS-4466. The following is the description of the original issue:
—
Description of problem:
Deploying a compact 3-node cluster on GCP, by setting mastersSchedulable to true and removing the worker machineset YAMLs, results in a panic.
Version-Release number of selected component (if applicable):
$ openshift-install version openshift-install 4.13.0-0.nightly-2022-12-04-194803 built from commit cc689a21044a76020b82902056c55d2002e454bd release image registry.ci.openshift.org/ocp/release@sha256:9e61cdf7bd13b758343a3ba762cdea301f9b687737d77ef912c6788cbd6a67ea release architecture amd64
How reproducible:
Always
Steps to Reproduce:
1. create manifests 2. set 'spec.mastersSchedulable' as 'true', in <installation dir>/manifests/cluster-scheduler-02-config.yml 3. remove the worker machineset YAML file from <installation dir>/openshift directory 4. create cluster
Actual results:
Got "panic: runtime error: index out of range [0] with length 0".
Expected results:
The installation should succeed, or give clear error messages.
Additional info:
$ openshift-install version openshift-install 4.13.0-0.nightly-2022-12-04-194803 built from commit cc689a21044a76020b82902056c55d2002e454bd release image registry.ci.openshift.org/ocp/release@sha256:9e61cdf7bd13b758343a3ba762cdea301f9b687737d77ef912c6788cbd6a67ea release architecture amd64 $ $ openshift-install create manifests --dir test1 ? SSH Public Key /home/fedora/.ssh/openshift-qe.pub ? Platform gcp INFO Credentials loaded from file "/home/fedora/.gcp/osServiceAccount.json" ? Project ID OpenShift QE (openshift-qe) ? Region us-central1 ? Base Domain qe.gcp.devcluster.openshift.com ? Cluster Name jiwei-1205a ? Pull Secret [? for help] ****** INFO Manifests created in: test1/manifests and test1/openshift $ $ vim test1/manifests/cluster-scheduler-02-config.yml $ yq-3.3.0 r test1/manifests/cluster-scheduler-02-config.yml spec.mastersSchedulable true $ $ rm -f test1/openshift/99_openshift-cluster-api_worker-machineset-?.yaml $ $ tree test1 test1 ├── manifests │ ├── cloud-controller-uid-config.yml │ ├── cloud-provider-config.yaml │ ├── cluster-config.yaml │ ├── cluster-dns-02-config.yml │ ├── cluster-infrastructure-02-config.yml │ ├── cluster-ingress-02-config.yml │ ├── cluster-network-01-crd.yml │ ├── cluster-network-02-config.yml │ ├── cluster-proxy-01-config.yaml │ ├── cluster-scheduler-02-config.yml │ ├── cvo-overrides.yaml │ ├── kube-cloud-config.yaml │ ├── kube-system-configmap-root-ca.yaml │ ├── machine-config-server-tls-secret.yaml │ └── openshift-config-secret-pull-secret.yaml └── openshift ├── 99_cloud-creds-secret.yaml ├── 99_kubeadmin-password-secret.yaml ├── 99_openshift-cluster-api_master-machines-0.yaml ├── 99_openshift-cluster-api_master-machines-1.yaml ├── 99_openshift-cluster-api_master-machines-2.yaml ├── 99_openshift-cluster-api_master-user-data-secret.yaml ├── 99_openshift-cluster-api_worker-user-data-secret.yaml ├── 99_openshift-machineconfig_99-master-ssh.yaml ├── 99_openshift-machineconfig_99-worker-ssh.yaml ├── 99_role-cloud-creds-secret-reader.yaml └── openshift-install-manifests.yaml2 directories, 26 files $ $ openshift-install create cluster --dir test1 INFO Consuming Openshift Manifests from target directory INFO Consuming Master Machines from target directory INFO Consuming Worker Machines from target directory INFO Consuming OpenShift Install (Manifests) from target directory INFO Consuming Common Manifests from target directory INFO Credentials loaded from file "/home/fedora/.gcp/osServiceAccount.json" panic: runtime error: index out of range [0] with length 0goroutine 1 [running]: github.com/openshift/installer/pkg/tfvars/gcp.TFVars({{{0xc000cf6a40, 0xc}, {0x0, 0x0}, {0xc0011d4a80, 0x91d}}, 0x1, 0x1, {0xc0010abda0, 0x58}, ...}) /go/src/github.com/openshift/installer/pkg/tfvars/gcp/gcp.go:70 +0x66f github.com/openshift/installer/pkg/asset/cluster.(*TerraformVariables).Generate(0x1daff070, 0xc000cef530?) 
/go/src/github.com/openshift/installer/pkg/asset/cluster/tfvars.go:479 +0x6bf8 github.com/openshift/installer/pkg/asset/store.(*storeImpl).fetch(0xc000c78870, {0x1a777f40, 0x1daff070}, {0x0, 0x0}) /go/src/github.com/openshift/installer/pkg/asset/store/store.go:226 +0x5fa github.com/openshift/installer/pkg/asset/store.(*storeImpl).Fetch(0x7ffc4c21413b?, {0x1a777f40, 0x1daff070}, {0x1dadc7e0, 0x8, 0x8}) /go/src/github.com/openshift/installer/pkg/asset/store/store.go:76 +0x48 main.runTargetCmd.func1({0x7ffc4c21413b, 0x5}) /go/src/github.com/openshift/installer/cmd/openshift-install/create.go:259 +0x125 main.runTargetCmd.func2(0x1dae27a0?, {0xc000c702c0?, 0x2?, 0x2?}) /go/src/github.com/openshift/installer/cmd/openshift-install/create.go:289 +0xe7 github.com/spf13/cobra.(*Command).execute(0x1dae27a0, {0xc000c70280, 0x2, 0x2}) /go/src/github.com/openshift/installer/vendor/github.com/spf13/cobra/command.go:876 +0x67b github.com/spf13/cobra.(*Command).ExecuteC(0xc000c3a500) /go/src/github.com/openshift/installer/vendor/github.com/spf13/cobra/command.go:990 +0x3bd github.com/spf13/cobra.(*Command).Execute(...) /go/src/github.com/openshift/installer/vendor/github.com/spf13/cobra/command.go:918 main.installerMain() /go/src/github.com/openshift/installer/cmd/openshift-install/main.go:61 +0x2b0 main.main() /go/src/github.com/openshift/installer/cmd/openshift-install/main.go:38 +0xff $
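The panic points at code that indexes the worker machine pools unconditionally. A guard along these lines (workerMachineType is a hypothetical helper, not the actual tfvars code) would let a compact cluster with zero worker MachineSets fall back to the control-plane settings, or surface a clear validation error, instead of crashing:

// workerMachineType returns the instance type to put in the Terraform
// variables. With no worker MachineSets (compact three-node clusters) it
// falls back to the control-plane type instead of indexing an empty slice.
func workerMachineType(workerTypes []string, masterType string) string {
	if len(workerTypes) == 0 {
		return masterType
	}
	return workerTypes[0]
}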
Description of problem:
In debugging recent cyclictest issues on OCP 4.16 (5.14.0-427.22.1.el9_4.x86_64+rt kernel), we have discovered that the "psi=1" kernel cmdline argument, which is now added by default due to cgroupsv2 being enabled, is causing latency issues (both cyclictest and timerlat are failing to meet the latency KPIs we commit to for Telco RAN DU deployments). See RHEL-42737 for reference.
Version-Release number of selected component (if applicable):
OCP 4.16
How reproducible:
Cyclictest and timerlat consistently fail on long duration runs (e.g. 12 hours).
Steps to Reproduce:
1. Install OCP 4.16 and configure with the Telco RAN DU reference configuration. 2. Run a long duration cyclictest or timerlat test
Actual results:
Maximum latencies are detected above 20us.
Expected results:
All latencies are below 20us.
Additional info:
See RHEL-42737 for test results and debugging information. This was originally suspected to be an RHEL issue, but it turns out that PSI is being enabled by OpenShift code (which adds psi=1 to the kernel cmdline).
This is a clone of issue OCPBUGS-39398. The following is the description of the original issue:
—
Description of problem:
When the console is loaded, there are errors in the browser console about failing to fetch networking-console-plugin locales.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
The issue is also affecting console CI.
This is a clone of issue OCPBUGS-39222. The following is the description of the original issue:
—
The on-prem-resolv-prepender.path unit is enabled in UPI setups when it should only run for IPI.
Description of problem:
RWOP accessMode is a tech preview feature starting from OCP 4.14 and GA in 4.16, but in the OCP console UI there is no option available for creating a PVC with the RWOP accessMode.
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. Login to OCP console in Administrator mode (4.14/4.15/4.16) 2. Go to 'Storage -> PersistentVolumeClaim -> Click on Create PersistentVolumeClaim' 3. Check under 'Access Mode*', RWOP option is not present
Actual results:
RWOP accessMode option is not present
Expected results:
RWOP accessMode option is present
Additional info:
Storage feature: https://issues.redhat.com/browse/STOR-1171
Description of problem:
AWS VPCs support a primary CIDR range and multiple secondary CIDR ranges: https://aws.amazon.com/about-aws/whats-new/2017/08/amazon-virtual-private-cloud-vpc-now-allows-customers-to-expand-their-existing-vpcs/
Let's pretend a VPC exists with:
and a hostedcontrolplane object like:
networking: ... machineNetwork: - cidr: 10.1.0.0/24 ... olmCatalogPlacement: management platform: aws: cloudProviderConfig: subnet: id: subnet-b vpc: vpc-069a93c6654464f03
Even though all EC2 instances will be spun up in subnet-b (10.1.0.0/24), CPO will detect the CIDR range of the VPC as 10.0.0.0/24 (https://github.com/openshift/hypershift/blob/0d10c822912ed1af924e58ccb8577d2bb1fd68be/control-plane-operator/controllers/hostedcontrolplane/hostedcontrolplane_controller.go#L4755-L4765) and create security group rules only allowing inboud traffic from 10.0.0.0/24. This specifically prevents these EC2 instances from communicating with the VPC Endpoint created by the awsendpointservice CR and reading the hosted control plane pods.
Version-Release number of selected component (if applicable):
Reproduced on a 4.14.20 ROSA HCP cluster, but the version should not matter
How reproducible:
100%
Steps to Reproduce:
1. Create a VPC with at least one secondary CIDR block 2. Install a ROSA HCP cluster providing the secondary CIDR block as the machine CIDR range and selecting the appropriate subnets within the secondary CIDR range
Actual results:
* Observe that the default security group contains inbound security group rules allowing traffic from the VPC's primary CIDR block (not a CIDR range containing the cluster's worker nodes) * As a result, the EC2 instances (worker nodes) fail to reach the ignition-server
Expected results:
The EC2 instances are able to reach the ignition-server and HCP pods
Additional info:
This bug seems like it could be fixed by using the machine CIDR range for the security group instead of the VPC CIDR range. Alternatively, we could duplicate rules for every secondary CIDR block, but the default AWS quota is 60 inbound security group rules/security group, so it's another failure condition to keep in mind if we go that route.
aws ec2 describe-vpcs output for a VPC with secondary CIDR blocks: ❯ aws ec2 describe-vpcs --region us-east-2 --vpc-id vpc-069a93c6654464f03 { "Vpcs": [ { "CidrBlock": "10.0.0.0/24", "DhcpOptionsId": "dopt-0d1f92b25d3efea4f", "State": "available", "VpcId": "vpc-069a93c6654464f03", "OwnerId": "429297027867", "InstanceTenancy": "default", "CidrBlockAssociationSet": [ { "AssociationId": "vpc-cidr-assoc-0abbc75ac8154b645", "CidrBlock": "10.0.0.0/24", "CidrBlockState": { "State": "associated" } }, { "AssociationId": "vpc-cidr-assoc-098fbccc85aa24acf", "CidrBlock": "10.1.0.0/24", "CidrBlockState": { "State": "associated" } } ], "IsDefault": false, "Tags": [ { "Key": "Name", "Value": "test" } ] } ] }
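A sketch of the fix suggested above (nodeSecurityGroupCIDR is a hypothetical helper, not the control-plane operator's actual code): base the default security group's inbound rules on the machine network rather than the VPC's primary CIDR block.

// nodeSecurityGroupCIDR picks the CIDR for the default security group's
// inbound rules. Preferring the machine network means workers placed in a
// secondary VPC CIDR block can still reach the ignition server and the VPC
// endpoint for the hosted control plane.
func nodeSecurityGroupCIDR(machineCIDRs []string, vpcPrimaryCIDR string) string {
	if len(machineCIDRs) > 0 {
		return machineCIDRs[0]
	}
	return vpcPrimaryCIDR
}

Duplicating the rules for every secondary CIDR block would also work, but as noted above that eats into the default quota of 60 inbound rules per security group.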
Please review the following PR: https://github.com/openshift/images/pull/185
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
To support different auth providers (SSO via OIDC), we needed to remove the ownerReference from the ConfigMap, Role, and Rolebinding we create for each user to store the user settings.
Keeping these resources around after the user is deleted might decrease the overall cluster performance, especially on Dev Sandbox where users are automatically removed every month.
We should make it easier to understand which user created these resources. This will help the Dev Sandbox team and maybe other customers in the future.
Version-Release number of selected component (if applicable):
4.15+
How reproducible:
Always when a user is deleted
Steps to Reproduce:
Actual results:
The user settings ConfigMap, Role, and RoleBinding in the same namespace aren't deleted and can only be found via the user UID, which we might not know anymore since the User CR is already deleted.
Expected results:
The user settings ConfigMap, Role, and RoleBinding should also have a label or annotation referencing the user who created these resources.
See also https://github.com/openshift/console/issues/13696
For example:
metadata: labels: console.openshift.io/user-settings: "true" console.openshift.io/user-settings-username: "" # escaped if the username contains characters that are not valid as label-value console.openshift.io/user-settings-uid: "..." # only if available
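As a rough illustration only (not the console's actual code), the ConfigMap could be built with those labels using client-go types; escaping of the username label value is omitted here:

```go
package main

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// userSettingsConfigMap sketches creating the per-user settings ConfigMap with
// the proposed labels, so the resources can still be found by username or UID
// after the User CR has been removed.
func userSettingsConfigMap(namespace, username, uid string) *corev1.ConfigMap {
	return &corev1.ConfigMap{
		ObjectMeta: metav1.ObjectMeta{
			GenerateName: "user-settings-",
			Namespace:    namespace,
			Labels: map[string]string{
				"console.openshift.io/user-settings":          "true",
				"console.openshift.io/user-settings-username": username,
				"console.openshift.io/user-settings-uid":      uid,
			},
		},
	}
}
```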
Additional info:
Description of problem:
Add networking-console-plugin image to CNO as an env var, so the hosted CNO can fetch the image to deploy it on the hosted cluster.
Version-Release number of selected component (if applicable):
4.17.0
How reproducible:
100%
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
release-4.17 of openshift/cluster-api-provider-openstack is missing some commits that were backported in the upstream project into the release-0.10 branch. We should import them into our downstream fork.
We need to add more people to the owners file of multus repo.
4.17.0-0.nightly-2024-08-05-063006 failed in part due to aws-ovn-upgrade-4.17-micro aggregation failures
Caused by
{Zero successful runs, we require at least one success to pass (P70=3.00s failures=[1820347983690993664=56s 1820348008856817664=80s 1820347977814773760=46s 1820347658217197568=52s 1820347967748444160=70s 1820347998836625408=52s 1820347972789997568=38s 1820347993786683392=52s 1820347988728352768=72s 1820347962715279360=80s 1820348003832041472=76s]) name: kube-api-http1-localhost-new-connections disruption P70 should not be worse
Failed: Mean disruption of openshift-api-http2-localhost-new-connections is 70.20 seconds is more than the failureThreshold of the weekly historical mean from 10 days ago: historicalMean=0.00s standardDeviation=0.00s failureThreshold=1.00s historicalP95=0.00s successes=[] failures=[1820347983690993664=68s 1820348008856817664=96s 1820347972789997568=48s 1820347658217197568=62s 1820348003832041472=88s 1820347962715279360=94s 1820347998836625408=62s 1820347993786683392=60s 1820347988728352768=86s 1820347977814773760=54s 1820347967748444160=80s] name: openshift-api-http2-localhost-new-connections mean disruption should be less than historical plus five standard deviations testsuitename: aggregated-disruption
Additionally we are seeing failures on azure upgrades that show large disruption during etcd-operator updating
Opening this bug to investigate and run payload tests against to rule it out.
Description of problem:
The valid values for installconfig.platform.vsphere.diskType are thin, thick, and eagerZeroedThick. But no matter whether diskType is set to thick or eagerZeroedThick, the actual disk type is thin. govc vm.info --json /DEVQEdatacenter/vm/wwei-511d-gtbqd/wwei-511d-gtbqd-master-1 | jq -r .VirtualMachines[].Layout.Disk[].DiskFile[] [vsanDatastore] e7323f66-86ef-9947-a2b9-507c6f3b795c/wwei-511d-gtbqd-master-1.vmdk [fedora@preserve-wwei ~]$ govc datastore.disk.info -ds /DEVQEdatacenter/datastore/vsanDatastore e7323f66-86ef-9947-a2b9-507c6f3b795c/wwei-511d-gtbqd-master-1.vmdk | grep Type Type: thin
Version-Release number of selected component (if applicable):
4.16.0-0.nightly-2024-05-07-025557
How reproducible:
Setting installconfig.platform.vsphere.diskType to thick or eagerZeroedThick and continuing the installation.
Steps to Reproduce:
1. Set installconfig.platform.vsphere.diskType to thick or eagerZeroedThick 2. Continue the installation
Actual results:
The actual disk type is thin even when the install-config sets diskType: thick/eagerZeroedThick
Expected results:
The disk type reported by the disk info check should match the diskType setting in the install-config
Additional info:
The issue was observed during testing of the k8s 1.30 rebase in which the webhook client started using http2 for loopback IPs: kubernetes/kubernetes#122558.
It looks like the issue is caused by how an HTTP/2 client handles this invalid address. I verified this change by setting up a cluster with openshift/kubernetes#1953 and this PR.
Bug of https://issues.redhat.com//browse/OCPNODE-2668.
This will display errors for Node Not Ready if there are no deleted events.
Description of problem:
When a must-gather creation fails in the middle, the clusterrolebindings created for the must-gather remain in the cluster.
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. Run a must-gather command: `oc adm must-gather` 2. Interrupt the must-gather creation 3. Search for the cluster-rolebinding: `oc get clusterrolebinding | grep -i must` 4. Try deleting the must-gather namespace 5. Search for the cluster-rolebinding again: `oc get clusterrolebinding | grep -i must`
Actual results:
The clusterrolebindings created for the must-gather remain in the cluster
Expected results:
The clusterrolebindings created for the must-gather shouldn't remain in the cluster
Additional info:
Description of problem:
When we configure a userCA or a cloudCA, MCO adds those certificates to the ignition config and the nodes. However, when we remove those certificates, MCO does not remove them from the nodes or the ignition config.
Version-Release number of selected component (if applicable):
$ oc get clusterversion NAME VERSION AVAILABLE PROGRESSING SINCE STATUS version 4.16.0-0.nightly-2024-01-24-133352 True False 5h49m Cluster version is 4.16.0-0.nightly-2024-01-24-133352
How reproducible:
Always
Steps to Reproduce:
1. Create a new certificate $ openssl genrsa -out privateKey.pem 4096 $ openssl req -new -x509 -nodes -days 3600 -key privateKey.pem -out ca-bundle.crt -subj "/OU=MCO qe/CN=example.com" 2. Configure a userCA # Create the configmap with the certificate $ oc create cm cm-test-cert -n openshift-config --from-file=ca-bundle.crt configmap/cm-test-cert created #Configure the proxy with the new test certificate $ oc patch proxy/cluster --type merge -p '{"spec": {"trustedCA": {"name": "cm-test-cert"}}}' proxy.config.openshift.io/cluster patched 3. Configure a cloudCA $ oc set data -n openshift-config ConfigMap cloud-provider-config --from-file=ca-bundle.pem=ca-bundle.crt 4. Check that the certificates have been added $ oc debug -q node/$(oc get nodes -l node-role.kubernetes.io/worker -ojsonpath="{.items[0].metadata.name}") -- chroot /host cat "/etc/pki/ca-trust/source/anchors/openshift-config-user-ca-bundle.crt" $ oc debug -q node/$(oc get nodes -l node-role.kubernetes.io/worker -ojsonpath="{.items[0].metadata.name}") -- chroot /host cat "/etc/kubernetes/static-pod-resources/configmaps/cloud-config/ca-bundle.pem" 5. Remove the configured userCA and cloudCA certificates $ oc patch proxy/cluster --type merge -p '{"spec": {"trustedCA": {"name": ""}}}' $ oc edit -n openshift-config ConfigMap cloud-provider-config ### REMOVE THE ca-bundle.pem KEY
Actual results:
Even though we have removed the certificates from the cluster config those can be found in the nodes $ oc debug -q node/$(oc get nodes -l node-role.kubernetes.io/worker -ojsonpath="{.items[0].metadata.name}") -- chroot /host cat "/etc/pki/ca-trust/source/anchors/openshift-config-user-ca-bundle.crt" $ oc debug -q node/$(oc get nodes -l node-role.kubernetes.io/worker -ojsonpath="{.items[0].metadata.name}") -- chroot /host cat "/etc/kubernetes/static-pod-resources/configmaps/cloud-config/ca-bundle.pem"
Expected results:
The certificates should be removed from the nodes and the ignition config when they are removed from the cluster config
Additional info:
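A minimal sketch of the behaviour the expected results describe, assuming a hypothetical helper on the node side that manages the CA bundle file (this is not MCO's real code path):

```go
package main

import (
	"errors"
	"os"
)

// ensureCABundle writes the configured CA bundle to disk, and removes the file
// when the bundle has been cleared from the cluster configuration, instead of
// leaving stale certificates behind on the node.
func ensureCABundle(path string, bundle []byte) error {
	if len(bundle) == 0 {
		// CA was removed from the cluster config: drop the file too.
		if err := os.Remove(path); err != nil && !errors.Is(err, os.ErrNotExist) {
			return err
		}
		return nil
	}
	return os.WriteFile(path, bundle, 0o600)
}
```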
Description of problem:
When you pass in endpoints in install config: serviceEndpoints: - name: dnsservices url: https://api.dns-svcs.cloud.ibm.com - name: cos url: https://s3.us-south.cloud-object-storage.appdomain.cloud You see the following error: failed creating minimalPowerVS client Error: setenv: invalid argument
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. Add the above serviceEndpoints to the install-config.yaml 2. Install the cluster
Actual results:
Worker nodes aren't created.
Expected results:
Worker nodes are created.
Additional info:
This is a clone of issue OCPBUGS-37052. The following is the description of the original issue:
—
Description of problem:
This is a followup of https://issues.redhat.com/browse/OCPBUGS-34996, in which comments led us to better understand the issue customers are facing. LDAP IDP traffic from the oauth pod seems to be going through the configured HTTP(S) proxy, while it should not due to it being a different protocol. This results in customers adding the ldap endpoint to their no-proxy config to circumvent the issue.
Version-Release number of selected component (if applicable):
4.15.11
How reproducible:
Steps to Reproduce:
(From the customer) 1. Configure LDAP IDP 2. Configure Proxy 3. LDAP IDP communication from the control plane oauth pod goes through proxy instead of going to the ldap endpoint directly
Actual results:
LDAP IDP communication from the control plane oauth pod goes through proxy
Expected results:
LDAP IDP communication from the control plane oauth pod should go to the ldap endpoint directly using the ldap protocol, it should not go through the proxy settings
Additional info:
For more information, see linked tickets.
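For illustration, the expected behaviour boils down to dialing the LDAP host directly rather than consulting the HTTP(S) proxy configuration. This is a hedged sketch, not the oauth-server's actual dialer:

```go
package main

import (
	"context"
	"net"
	"time"
)

// dialLDAP opens a direct TCP connection to the LDAP endpoint. Because LDAP is
// not HTTP, no proxy lookup is performed here; the cluster-wide proxy settings
// should only apply to HTTP(S) clients.
func dialLDAP(ctx context.Context, hostPort string) (net.Conn, error) {
	d := &net.Dialer{Timeout: 10 * time.Second}
	return d.DialContext(ctx, "tcp", hostPort)
}
```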
Description of problem:
When application grouping is unchecked in the display filters under the expand section, the topology display is distorted and the application name is also missing.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. Have some deployments 2. In topology unselect the application grouping in the display filter 3.
Actual results:
Topology shows a distorted UI and the application name is missing.
Expected results:
The UI should render correctly and the application name should be present.
Additional info:
Screenshot:
https://drive.google.com/file/d/1z80qLrr5v-K8ZFDa3P-n7SoDMaFtuxI7/view?usp=sharing
This is a clone of issue OCPBUGS-38599. The following is the description of the original issue:
—
Description of problem:
If folder is undefined and the datacenter exists in a datacenter-based folder, the installer will create the entire path of folders from the root of vCenter, which is incorrect.
This does not occur if folder is defined.
An upstream bug was identified when debugging this:
Some AWS installs are failing to bootstrap due to an issue where CAPA may fail to create load balancer resources, but still declare that infrastructure is ready (see upstream issue for more details).
In these cases, load balancers are failing to be created due to either rate limiting:
time="2024-05-25T21:43:07Z" level=debug msg="E0525 21:43:07.975223 356 awscluster_controller.go:280] \"failed to reconcile load balancer\" err=<" time="2024-05-25T21:43:07Z" level=debug msg="\t[failed to modify target group attribute: Throttling: Rate exceeded"
or in some cases another error:
time="2024-06-01T06:43:58Z" level=debug msg="E0601 06:43:58.902534 356 awscluster_controller.go:280] \"failed to reconcile load balancer\" err=<" time="2024-06-01T06:43:58Z" level=debug msg="\t[failed to apply security groups to load balancer \"ci-op-jnqi01di-5feef-92njc-int\": ValidationError: A load balancer ARN must be specified" time="2024-06-01T06:43:58Z" level=debug msg="\t\tstatus code: 400, request id: 77446593-03d2-40e9-93c0-101590d150c6, failed to create target group for load balancer: DuplicateTargetGroupName: A target group with the same name 'apiserver-target-1717224237' exists, but with different settings"
We have an upstream PR in progress to retry the reconcile logic for load balancers.
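The retry approach looks roughly like the sketch below; the helper name and the use of apimachinery's wait package are illustrative, not the actual upstream change:

```go
package main

import (
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// retryReconcileLB retries load-balancer reconciliation with exponential
// backoff so a throttled or transient AWS error does not leave the load
// balancer half-configured while infrastructure is still reported as ready.
func retryReconcileLB(reconcile func() error) error {
	backoff := wait.Backoff{Duration: 2 * time.Second, Factor: 2.0, Steps: 5}
	return wait.ExponentialBackoff(backoff, func() (bool, error) {
		if err := reconcile(); err != nil {
			return false, nil // treat as transient and retry
		}
		return true, nil
	})
}
```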
Original component readiness report below.
=====
Component Readiness has found a potential regression in install should succeed: cluster bootstrap.
There is no significant evidence of regression
Sample (being evaluated) Release: 4.16
Start Time: 2024-05-28T00:00:00Z
End Time: 2024-06-03T23:59:59Z
Success Rate: 96.60%
Successes: 227
Failures: 8
Flakes: 0
Base (historical) Release: 4.15
Start Time: 2024-02-01T00:00:00Z
End Time: 2024-02-28T23:59:59Z
Success Rate: 99.87%
Successes: 767
Failures: 1
Flakes: 0
This is a clone of issue OCPBUGS-35036. The following is the description of the original issue:
—
Description of problem:
The following logs are from namespaces/openshift-apiserver/pods/apiserver-6fcd57c747-57rkr/openshift-apiserver/openshift-apiserver/logs/current.log
2024-06-06T15:57:06.628216833Z E0606 15:57:06.628186 1 finisher.go:175] FinishRequest: post-timeout activity - time-elapsed: 139.823053ms, panicked: true, err: <nil>, panic-reason: runtime error: invalid memory address or nil pointer dereference 2024-06-06T15:57:06.628216833Z goroutine 192790 [running]: 2024-06-06T15:57:06.628216833Z k8s.io/apiserver/pkg/endpoints/handlers/finisher.finishRequest.func1.1() 2024-06-06T15:57:06.628216833Z k8s.io/apiserver@v0.29.2/pkg/endpoints/handlers/finisher/finisher.go:105 +0xa5 2024-06-06T15:57:06.628216833Z panic({0x498ac60?, 0x74a51c0?}) 2024-06-06T15:57:06.628216833Z runtime/panic.go:914 +0x21f 2024-06-06T15:57:06.628216833Z github.com/openshift/openshift-apiserver/pkg/image/apiserver/importer.(*ImageStreamImporter).importImages(0xc0c5bf0fc0, {0x5626bb0, 0xc0a50c7dd0}, 0xc07055f4a0, 0xc0a2487600) 2024-06-06T15:57:06.628216833Z github.com/openshift/openshift-apiserver/pkg/image/apiserver/importer/importer.go:263 +0x1cf5 2024-06-06T15:57:06.628216833Z github.com/openshift/openshift-apiserver/pkg/image/apiserver/importer.(*ImageStreamImporter).Import(0xc0c5bf0fc0, {0x5626bb0, 0xc0a50c7dd0}, 0x0?, 0x0?) 2024-06-06T15:57:06.628216833Z github.com/openshift/openshift-apiserver/pkg/image/apiserver/importer/importer.go:110 +0x139 2024-06-06T15:57:06.628216833Z github.com/openshift/openshift-apiserver/pkg/image/apiserver/registry/imagestreamimport.(*REST).Create(0xc0033b2240, {0x5626bb0, 0xc0a50c7dd0}, {0x5600058?, 0xc07055f4a0?}, 0xc08e0b9ec0, 0x56422e8?) 2024-06-06T15:57:06.628216833Z github.com/openshift/openshift-apiserver/pkg/image/apiserver/registry/imagestreamimport/rest.go:337 +0x1574 2024-06-06T15:57:06.628216833Z k8s.io/apiserver/pkg/endpoints/handlers.(*namedCreaterAdapter).Create(0x55f50e0?, {0x5626bb0?, 0xc0a50c7dd0?}, {0xc0b5704000?, 0x562a1a0?}, {0x5600058?, 0xc07055f4a0?}, 0x1?, 0x2331749?) 2024-06-06T15:57:06.628216833Z k8s.io/apiserver@v0.29.2/pkg/endpoints/handlers/create.go:254 +0x3b 2024-06-06T15:57:06.628216833Z k8s.io/apiserver/pkg/endpoints/handlers.CreateResource.createHandler.func1.1() 2024-06-06T15:57:06.628216833Z k8s.io/apiserver@v0.29.2/pkg/endpoints/handlers/create.go:184 +0xc6 2024-06-06T15:57:06.628216833Z k8s.io/apiserver/pkg/endpoints/handlers.CreateResource.createHandler.func1.2() 2024-06-06T15:57:06.628216833Z k8s.io/apiserver@v0.29.2/pkg/endpoints/handlers/create.go:209 +0x39e 2024-06-06T15:57:06.628216833Z k8s.io/apiserver/pkg/endpoints/handlers/finisher.finishRequest.func1() 2024-06-06T15:57:06.628216833Z k8s.io/apiserver@v0.29.2/pkg/endpoints/handlers/finisher/finisher.go:117 +0x84
Version-Release number of selected component (if applicable):
We applied it to all clusters in CI and checked 3 of them; all 3 share the same errors.
oc --context build09 get clusterversion NAME VERSION AVAILABLE PROGRESSING SINCE STATUS version 4.16.0-rc.3 True False 3d9h Error while reconciling 4.16.0-rc.3: the cluster operator machine-config is degraded oc --context build02 get clusterversion NAME VERSION AVAILABLE PROGRESSING SINCE STATUS version 4.16.0-rc.2 True False 15d Error while reconciling 4.16.0-rc.2: the cluster operator machine-config is degraded oc --context build03 get clusterversion NAME VERSION AVAILABLE PROGRESSING SINCE STATUS version 4.15.16 True False 34h Error while reconciling 4.15.16: the cluster operator machine-config is degraded
How reproducible:
We applied this PR https://github.com/openshift/release/pull/52574/files to the clusters.
It breaks at least 3 of them.
"qci-pull-through-cache-us-east-1-ci.apps.ci.l2s4.p1.openshiftapps.com" is a registry cache server https://github.com/openshift/release/blob/master/clusters/app.ci/quayio-pull-through-cache/qci-pull-through-cache-us-east-1.yaml
Additional info:
There are lots of image imports in OpenShift CI jobs.
It feels like the registry cache server returns unexpected results to the openshift-apiserver:
2024-06-06T18:13:13.781520581Z E0606 18:13:13.781459 1 strategy.go:60] unable to parse manifest for "sha256:c5bcd0298deee99caaf3ec88de246f3af84f80225202df46527b6f2b4d0eb3c3": unexpected end of JSON input
Our theory is that the requests of imports from all CI clusters crashed the cache server and it sent some unexpected data which caused apiserver to panic.
The expected behaviour is that if the image cannot be pulled from the first mirror in the ImageDigestMirrorSet, the pull fails over to the next one.
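A small sketch of that failover expectation; pullWithMirrors and pullManifest are placeholders, not the image API's real client:

```go
package main

// pullWithMirrors tries each mirror from the ImageDigestMirrorSet in order and
// only returns an error once every mirror has failed, instead of giving up
// after the first unexpected response.
func pullWithMirrors(mirrors []string, pullManifest func(mirror string) ([]byte, error)) ([]byte, error) {
	var lastErr error
	for _, m := range mirrors {
		manifest, err := pullManifest(m)
		if err == nil {
			return manifest, nil
		}
		lastErr = err // remember the failure and fall through to the next mirror
	}
	return nil, lastErr
}
```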
This is a clone of issue OCPBUGS-38775. The following is the description of the original issue:
—
Description of problem:
see from screen recording https://drive.google.com/file/d/1LwNdyISRmQqa8taup3nfLRqYBEXzH_YH/view?usp=sharing
dev console, "Observe -> Metrics" tab, input in in the query-browser input text-area, the cursor would focus in the project drop-down list, this issue exists in 4.17.0-0.nightly-2024-08-19-165854 and 4.18.0-0.nightly-2024-08-19-002129, no such issue with admin console
Version-Release number of selected component (if applicable):
4.17.0-0.nightly-2024-08-19-165854 and 4.18.0-0.nightly-2024-08-19-002129
How reproducible:
always
Steps to Reproduce:
1. see the description
Actual results:
cursor would focus in the project drop-down
Expected results:
cursor should not move
Additional info:
This is a clone of issue OCPBUGS-36236. The following is the description of the original issue:
—
Description of problem:
The installer for IBM Cloud currently only checks the first group of subnets (50) when searching for Subnet details by name. It should provide pagination support to search all subnets.
Version-Release number of selected component (if applicable):
4.17
How reproducible:
100%, though dependent on the order of subnets returned by the IBM Cloud APIs
Steps to Reproduce:
1. Create 50+ IBM Cloud VPC Subnets 2. Use Bring Your Own Network (BYON) configuration (with Subnet names for CP and/or Compute) in install-config.yaml 3. Attempt to create manifests (openshift-install create manifests)
Actual results:
ERROR failed to fetch Master Machines: failed to load asset "Install Config": failed to create install config: [platform.ibmcloud.controlPlaneSubnets: Not found: "eu-de-subnet-paginate-1-cp-eu-de-1", platform.ibmcloud.controlPlaneSubnets: Not found: "eu-de-subnet-paginate-1-cp-eu-de-2", platform.ibmcloud.controlPlaneSubnets: Not found: "eu-de-subnet-paginate-1-cp-eu-de-3", platform.ibmcloud.controlPlaneSubnets: Invalid value: []string{"eu-de-subnet-paginate-1-cp-eu-de-1", "eu-de-subnet-paginate-1-cp-eu-de-2", "eu-de-subnet-paginate-1-cp-eu-de-3"}: number of zones (0) covered by controlPlaneSubnets does not match number of provided or default zones (3) for control plane in eu-de, platform.ibmcloud.computeSubnets: Not found: "eu-de-subnet-paginate-1-compute-eu-de-1", platform.ibmcloud.computeSubnets: Not found: "eu-de-subnet-paginate-1-compute-eu-de-2", platform.ibmcloud.computeSubnets: Not found: "eu-de-subnet-paginate-1-compute-eu-de-3", platform.ibmcloud.computeSubnets: Invalid value: []string{"eu-de-subnet-paginate-1-compute-eu-de-1", "eu-de-subnet-paginate-1-compute-eu-de-2", "eu-de-subnet-paginate-1-compute-eu-de-3"}: number of zones (0) covered by computeSubnets does not match number of provided or default zones (3) for compute[0] in eu-de]
Expected results:
Successful manifests and cluster creation
Additional info:
IBM Cloud is working on a fix
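The shape of the needed change is a pagination loop: keep requesting pages until the API stops returning a "next" token, instead of inspecting only the first 50 results. The client interface below is hypothetical and stands in for the IBM Cloud VPC SDK call the installer actually uses.

```go
package main

// subnetPage is a simplified page of results; Next is empty on the last page.
type subnetPage struct {
	Subnets []string
	Next    string
}

// subnetClient is a hypothetical stand-in for the SDK's list call.
type subnetClient interface {
	ListSubnets(start string) (*subnetPage, error)
}

// listAllSubnets collects subnets across every page so that name lookups are
// not limited to the first group returned by the API.
func listAllSubnets(c subnetClient) ([]string, error) {
	var all []string
	start := ""
	for {
		page, err := c.ListSubnets(start)
		if err != nil {
			return nil, err
		}
		all = append(all, page.Subnets...)
		if page.Next == "" {
			return all, nil
		}
		start = page.Next
	}
}
```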
Please review the following PR: https://github.com/operator-framework/operator-marketplace/pull/563
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
This is a clone of issue OCPBUGS-39189. The following is the description of the original issue:
—
Expected results:
networking-console-plugin deployment has the required-scc annotation
Additional info:
The deployment does not have any annotation about it
CI warning
# [sig-auth] all workloads in ns/openshift-network-console must set the 'openshift.io/required-scc' annotation annotation missing from pod 'networking-console-plugin-7c55b7546c-kc6db' (owners: replicaset/networking-console-plugin-7c55b7546c); suggested required-scc: 'restricted-v2'
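A minimal sketch of what the fix amounts to; the helper is hypothetical, while the annotation key and suggested value come from the CI warning above:

```go
package main

import (
	appsv1 "k8s.io/api/apps/v1"
)

// setRequiredSCC adds the required-scc annotation to the Deployment's pod
// template so the audit check above passes for the resulting pods.
func setRequiredSCC(d *appsv1.Deployment) {
	if d.Spec.Template.Annotations == nil {
		d.Spec.Template.Annotations = map[string]string{}
	}
	d.Spec.Template.Annotations["openshift.io/required-scc"] = "restricted-v2"
}
```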
This is a clone of issue OCPBUGS-41270. The following is the description of the original issue:
—
Component Readiness has found a potential regression in the following test:
[sig-network] pods should successfully create sandboxes by adding pod to network
Probability of significant regression: 96.41%
Sample (being evaluated) Release: 4.17
Start Time: 2024-08-27T00:00:00Z
End Time: 2024-09-03T23:59:59Z
Success Rate: 88.37%
Successes: 26
Failures: 5
Flakes: 12
Base (historical) Release: 4.16
Start Time: 2024-05-31T00:00:00Z
End Time: 2024-06-27T23:59:59Z
Success Rate: 98.46%
Successes: 43
Failures: 1
Flakes: 21
Here is an example run.
We see the following signature for the failure:
namespace/openshift-etcd node/master-0 pod/revision-pruner-11-master-0 hmsg/b90fda805a - 111.86 seconds after deletion - firstTimestamp/2024-09-02T13:14:37Z interesting/true lastTimestamp/2024-09-02T13:14:37Z reason/FailedCreatePodSandBox Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_revision-pruner-11-master-0_openshift-etcd_08346d8f-7d22-4d70-ab40-538a67e21e3c_0(d4b61f9ff9f2ddfd3b64352203e8a3eafc2c3bd7c3d31a0a573bc29e4ac6da57): error adding pod openshift-etcd_revision-pruner-11-master-0 to CNI network "multus-cni-network": plugin type="multus-shim" name="multus-cni-network" failed (add): CmdAdd (shim): CNI request failed with status 400: 'ContainerID:"d4b61f9ff9f2ddfd3b64352203e8a3eafc2c3bd7c3d31a0a573bc29e4ac6da57" Netns:"/var/run/netns/97dc5eb9-19da-462f-8b2e-c301cfd7f3cf" IfName:"eth0" Args:"IgnoreUnknown=1;K8S_POD_NAMESPACE=openshift-etcd;K8S_POD_NAME=revision-pruner-11-master-0;K8S_POD_INFRA_CONTAINER_ID=d4b61f9ff9f2ddfd3b64352203e8a3eafc2c3bd7c3d31a0a573bc29e4ac6da57;K8S_POD_UID=08346d8f-7d22-4d70-ab40-538a67e21e3c" Path:"" ERRORED: error configuring pod [openshift-etcd/revision-pruner-11-master-0] networking: Multus: [openshift-etcd/revision-pruner-11-master-0/08346d8f-7d22-4d70-ab40-538a67e21e3c]: error waiting for pod: pod "revision-pruner-11-master-0" not found
The same signature has been reported for both Azure and s390x as well.
It is worth mentioning that the sdn to ovn transition adds some complication to our analysis. From the component readiness above, you will see most of the failures are for the job periodic-ci-openshift-release-master-nightly-X.X-upgrade-from-stable-X.X-e2e-metal-ipi-ovn-upgrade. This is a new job for 4.17 and therefore misses base stats in 4.16.
So we ask for:
Description of problem:
The ovnkube-sbdb route removal is missing a management cluster capabilities check and thus fails on a Kubernetes based management cluster.
Version-Release number of selected component (if applicable):
4.15.z, 4.16.0, 4.17.0
How reproducible:
Always
Steps to Reproduce:
Deploy an OpenShift version 4.16.0-rc.6 cluster control plane using HyperShift on a Kubernetes based management cluster.
Actual results:
Cluster control plane deployment fails because the cluster-network-operator pod is stuck in Init state due to the following error: {"level":"error","ts":"2024-06-19T20:51:37Z","msg":"Reconciler error","controller":"hostedcontrolplane","controllerGroup":"hypershift.openshift.io","controllerKind":"HostedControlPlane","HostedControlPlane":{"name":"cppjslm10715curja3qg","namespace":"master-cppjslm10715curja3qg"},"namespace":"master-cppjslm10715curja3qg","name":"cppjslm10715curja3qg","reconcileID":"037842e8-82ea-4f6e-bf28-deb63abc9f22","error":"failed to update control plane: failed to reconcile cluster network operator: failed to clean up ovnkube-sbdb route: error getting *v1.Route: no matches for kind \"Route\" in version \"route.openshift.io/v1\"","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/hypershift/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:329\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/hypershift/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/hypershift/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:227"}
Expected results:
Cluster control plane deployment succeeds.
Additional info:
https://ibm-argonauts.slack.com/archives/C01C8502FMM/p1718832205747529
openshift-install version /root/installer/bin/openshift-install 4.17.0-0.nightly-2024-07-16-033047 built from commit 8b7d5c6fe26a70eafc47a142666b90ed6081159e release image registry.ci.openshift.org/ocp/release@sha256:afb704dd7ab8e141c56f1da15ce674456f45c7767417e625f96a6619989e362b release architecture amd64
openshift-install image-based create image --dir tt panic: runtime error: invalid memory address or nil pointer dereference [signal SIGSEGV: segmentation violation code=0x1 addr=0x158 pc=0x5a2ed8e] ~/installer/bin/openshift-install image-based create image --dir tt panic: runtime error: invalid memory address or nil pointer dereference [signal SIGSEGV: segmentation violation code=0x1 addr=0x158 pc=0x5a2ed8e] goroutine 1 [running]: github.com/openshift/installer/pkg/asset/imagebased/image.(*RegistriesConf).Generate(0xc00150a000, {0x5?, 0x81d2708?}, 0xc0014f8d80) /go/src/github.com/openshift/installer/pkg/asset/imagebased/image/registriesconf.go:38 +0x6e github.com/openshift/installer/pkg/asset/store.(*storeImpl).fetch(0xc0014f8270, {0x275ca770, 0xc0014b5220}, {0x275a7900, 0xc00150a000}, {0xc00146cdb0, 0x4}) /go/src/github.com/openshift/installer/pkg/asset/store/store.go:227 +0x6ec github.com/openshift/installer/pkg/asset/store.(*storeImpl).fetch(0xc0014f8270, {0x275ca770, 0xc0014b5220}, {0x275a7990, 0xc000161610}, {0x8189983, 0x2}) /go/src/github.com/openshift/installer/pkg/asset/store/store.go:221 +0x54c github.com/openshift/installer/pkg/asset/store.(*storeImpl).fetch(0xc0014f8270, {0x275ca770, 0xc0014b5220}, {0x7fce0489ad20, 0x2bc3eea0}, {0x0, 0x0}) /go/src/github.com/openshift/installer/pkg/asset/store/store.go:221 +0x54c github.com/openshift/installer/pkg/asset/store.(*storeImpl).Fetch(0xc0014f8270, {0x275ca770?, 0xc0014b5220?}, {0x7fce0489ad20, 0x2bc3eea0}, {0x2bb33070, 0x1, 0x1}) /go/src/github.com/openshift/installer/pkg/asset/store/store.go:77 +0x4e github.com/openshift/installer/pkg/asset/store.(*fetcher).FetchAndPersist(0xc000aa8640, {0x275ca770, 0xc0014b5220}, {0x2bb33070, 0x1, 0x1}) /go/src/github.com/openshift/installer/pkg/asset/store/assetsfetcher.go:47 +0x16b main.newImageBasedCreateCmd.runTargetCmd.func3({0x7ffdb9f8638c?, 0x2?}) /go/src/github.com/openshift/installer/cmd/openshift-install/create.go:306 +0x6a main.newImageBasedCreateCmd.runTargetCmd.func4(0x2bc00100, {0xc00147fd00?, 0x4?, 0x818b15a?}) /go/src/github.com/openshift/installer/cmd/openshift-install/create.go:320 +0x102 github.com/spf13/cobra.(*Command).execute(0x2bc00100, {0xc00147fcc0, 0x2, 0x2}) /go/src/github.com/openshift/installer/vendor/github.com/spf13/cobra/command.go:987 +0xab1 github.com/spf13/cobra.(*Command).ExecuteC(0xc000879508) /go/src/github.com/openshift/installer/vendor/github.com/spf13/cobra/command.go:1115 +0x3ff github.com/spf13/cobra.(*Command).Execute(...) /go/src/github.com/openshift/installer/vendor/github.com/spf13/cobra/command.go:1039 main.installerMain() /go/src/github.com/openshift/installer/cmd/openshift-install/main.go:67 +0x3c6 main.main() /go/src/github.com/openshift/installer/cmd/openshift-install/main.go:39 +0x168
Description of problem:
With the release of 4.16, prometheus-adapter[0] is deprecated and there is a new alert[1], ClusterMonitoringOperatorDeprecatedConfig. Better details are needed on how these alerts can be handled, which will reduce the number of support cases. [0] https://docs.openshift.com/container-platform/4.16/release_notes/ocp-4-16-release-notes.html#ocp-4-16-prometheus-adapter-removed [1] https://docs.openshift.com/container-platform/4.16/release_notes/ocp-4-16-release-notes.html#ocp-4-16-monitoring-changes-to-alerting-rules
Version-Release number of selected component (if applicable):
4.16
How reproducible:
NA
Steps to Reproduce:
NA
Actual results:
As per the current configuration, the alert does not provide much clarifying information.
Expected results:
more information should be provided on how to fix the alert.
Additional info:
As per the discussion, a runbook will be added which will help in better understanding the alert
Please review the following PR: https://github.com/openshift/cluster-kube-scheduler-operator/pull/542
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
According to https://github.com/openshift/enhancements/pull/1502, all managed TLS artifacts (secrets, configmaps and files on disk) should have clear ownership and other necessary metadata. The `metal3-ironic-tls` secret is created by cluster-baremetal-operator but doesn't have an ownership annotation.
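For illustration, adding an ownership annotation when the secret is built could look like the sketch below; the exact annotation key expected by the enhancement's checks is an assumption here.

```go
package main

import (
	corev1 "k8s.io/api/core/v1"
)

// setOwningComponent annotates the generated TLS secret with the component that
// owns it. The "openshift.io/owning-component" key is assumed from the TLS
// artifact conventions and may differ from what the enhancement finally uses.
func setOwningComponent(s *corev1.Secret, component string) {
	if s.Annotations == nil {
		s.Annotations = map[string]string{}
	}
	s.Annotations["openshift.io/owning-component"] = component
}
```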
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of the problem:
Under config.interfaces we now have to add additional config ONLY for VLAN interfaces.
Example.
spec:
config:
interfaces:
- name: "eth0"
type: ethernet
state: up
mac-address: "52:54:00:0A:86:94"
We didn't need to add it in 4.15, and, for example, in 4.16 bond still doesn't have this config and the deployment passes.
How reproducible:
100%
Steps to reproduce:
1. Deploy vlan spoke with 4.15 nmstate config.
2.
3.
Actual results:
The deployment fails
Expected results:
The deployment should succeed, or the documentation for 4.16 should be updated
This is a clone of issue OCPBUGS-38288. The following is the description of the original issue:
—
Description of problem:
The control loop that manages /var/run/keepalived/iptables-rule-exists looks at the error returned by os.Stat and decides that the file exists as long as os.IsNotExist returns false. In other words, if the error is some non-nil error other than NotExist, the sentinel file would not be created.
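A minimal sketch of the check described above. The buggy pattern treats any non-NotExist error (for example a permission error) as "file exists"; the fixed version only reports existence when os.Stat actually succeeds. The function name is a hypothetical helper, not the monitor's real code.

```go
package main

import "os"

// fileExists reports whether the sentinel file is present, surfacing unexpected
// errors instead of silently treating them as existence.
func fileExists(path string) (bool, error) {
	_, err := os.Stat(path)
	if err == nil {
		return true, nil
	}
	if os.IsNotExist(err) {
		return false, nil
	}
	// Unknown error: report it instead of assuming the file is there.
	return false, err
}
```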
Version-Release number of selected component (if applicable):
4.17
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
This is a clone of issue OCPBUGS-39573. The following is the description of the original issue:
—
Description of problem:
Enabling the topology tests in CI
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Please review the following PR: https://github.com/openshift/coredns/pull/119
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
This is a clone of issue OCPBUGS-38226. The following is the description of the original issue:
—
Description of problem:
https://search.dptools.openshift.org/?search=Helm+Release&maxAge=168h&context=1&type=junit&name=pull-ci-openshift-console-master-e2e-gcp-console&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Component Readiness has found a potential regression in the following test:
operator conditions monitoring
Probability of significant regression: 100.00%
Sample (being evaluated) Release: 4.17
Start Time: 2024-07-17T00:00:00Z
End Time: 2024-07-23T23:59:59Z
Success Rate: 59.49%
Successes: 47
Failures: 32
Flakes: 0
Base (historical) Release: 4.16
Start Time: 2024-05-31T00:00:00Z
End Time: 2024-06-27T23:59:59Z
Success Rate: 100.00%
Successes: 147
Failures: 0
Flakes: 0
This test / pattern is actually showing up on various other variant combinations but the commonality is vsphere, so this test, and installs in general, are not going well on vsphere.
Error message:
operator conditions monitoring expand_less 0s
{Operator unavailable (UpdatingPrometheusFailed): UpdatingPrometheus: client rate limiter Wait returned an error: context deadline exceeded Operator unavailable (UpdatingPrometheusFailed): UpdatingPrometheus: client rate limiter Wait returned an error: context deadline exceeded}
This is a clone of issue OCPBUGS-41920. The following is the description of the original issue:
—
Description of problem:
When we move one node from one custom MCP to another custom MCP, the MCPs are reporting a wrong number of nodes. For example, we reach this situation (worker-perf MCP is not reporting the right number of nodes) $ oc get mcp,nodes NAME CONFIG UPDATED UPDATING DEGRADED MACHINECOUNT READYMACHINECOUNT UPDATEDMACHINECOUNT DEGRADEDMACHINECOUNT AGE machineconfigpool.machineconfiguration.openshift.io/master rendered-master-c8d23b071e1ccf6cf85c7f1b31c0def6 True False False 3 3 3 0 142m machineconfigpool.machineconfiguration.openshift.io/worker rendered-worker-36ee1fdc485685ac9c324769889c3348 True False False 1 1 1 0 142m machineconfigpool.machineconfiguration.openshift.io/worker-perf rendered-worker-perf-6b5fbffac62c3d437e307e849c44b556 True False False 2 2 2 0 24m machineconfigpool.machineconfiguration.openshift.io/worker-perf-canary rendered-worker-perf-canary-6b5fbffac62c3d437e307e849c44b556 True False False 1 1 1 0 7m52s NAME STATUS ROLES AGE VERSION node/ip-10-0-13-228.us-east-2.compute.internal Ready worker,worker-perf-canary 138m v1.30.4 node/ip-10-0-2-250.us-east-2.compute.internal Ready control-plane,master 145m v1.30.4 node/ip-10-0-34-223.us-east-2.compute.internal Ready control-plane,master 144m v1.30.4 node/ip-10-0-35-61.us-east-2.compute.internal Ready worker,worker-perf 136m v1.30.4 node/ip-10-0-79-232.us-east-2.compute.internal Ready control-plane,master 144m v1.30.4 node/ip-10-0-86-124.us-east-2.compute.internal Ready worker 139m v1.30.4 After 20 minutes or half an hour the MCPs start reporting the right number of nodes
Version-Release number of selected component (if applicable):
IPI on AWS version:
$ oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.17.0-0.nightly-2024-09-13-040101 True False 124m Cluster version is 4.17.0-0.nightly-2024-09-13-040101
How reproducible:
Always
Steps to Reproduce:
1. Create a MCP oc create -f - << EOF apiVersion: machineconfiguration.openshift.io/v1 kind: MachineConfigPool metadata: name: worker-perf spec: machineConfigSelector: matchExpressions: - { key: machineconfiguration.openshift.io/role, operator: In, values: [worker,worker-perf] } nodeSelector: matchLabels: node-role.kubernetes.io/worker-perf: "" EOF 2. Add 2 nodes to the MCP $ oc label node $(oc get nodes -l node-role.kubernetes.io/worker -ojsonpath="{.items[0].metadata.name}") node-role.kubernetes.io/worker-perf= $ oc label node $(oc get nodes -l node-role.kubernetes.io/worker -ojsonpath="{.items[1].metadata.name}") node-role.kubernetes.io/worker-perf= 3. Create another MCP oc create -f - << EOF apiVersion: machineconfiguration.openshift.io/v1 kind: MachineConfigPool metadata: name: worker-perf-canary spec: machineConfigSelector: matchExpressions: - { key: machineconfiguration.openshift.io/role, operator: In, values: [worker,worker-perf,worker-perf-canary] } nodeSelector: matchLabels: node-role.kubernetes.io/worker-perf-canary: "" EOF 3. Move one node from the MCP created in step 1 to the MCP created in step 3 $ oc label node $(oc get nodes -l node-role.kubernetes.io/worker -ojsonpath="{.items[0].metadata.name}") node-role.kubernetes.io/worker-perf-canary= $ oc label node $(oc get nodes -l node-role.kubernetes.io/worker -ojsonpath="{.items[0].metadata.name}") node-role.kubernetes.io/worker-perf-
Actual results:
The worker-perf pool is not reporting the right number of nodes. It continues reporting 2 nodes even though one of them was moved to the worker-perf-canary MCP. $ oc get mcp,nodes NAME CONFIG UPDATED UPDATING DEGRADED MACHINECOUNT READYMACHINECOUNT UPDATEDMACHINECOUNT DEGRADEDMACHINECOUNT AGE machineconfigpool.machineconfiguration.openshift.io/master rendered-master-c8d23b071e1ccf6cf85c7f1b31c0def6 True False False 3 3 3 0 142m machineconfigpool.machineconfiguration.openshift.io/worker rendered-worker-36ee1fdc485685ac9c324769889c3348 True False False 1 1 1 0 142m machineconfigpool.machineconfiguration.openshift.io/worker-perf rendered-worker-perf-6b5fbffac62c3d437e307e849c44b556 True False False 2 2 2 0 24m machineconfigpool.machineconfiguration.openshift.io/worker-perf-canary rendered-worker-perf-canary-6b5fbffac62c3d437e307e849c44b556 True False False 1 1 1 0 7m52s NAME STATUS ROLES AGE VERSION node/ip-10-0-13-228.us-east-2.compute.internal Ready worker,worker-perf-canary 138m v1.30.4 node/ip-10-0-2-250.us-east-2.compute.internal Ready control-plane,master 145m v1.30.4 node/ip-10-0-34-223.us-east-2.compute.internal Ready control-plane,master 144m v1.30.4 node/ip-10-0-35-61.us-east-2.compute.internal Ready worker,worker-perf 136m v1.30.4 node/ip-10-0-79-232.us-east-2.compute.internal Ready control-plane,master 144m v1.30.4 node/ip-10-0-86-124.us-east-2.compute.internal Ready worker 139m v1.30.4
Expected results:
MCPs should always report the right number of nodes
Additional info:
It is very similar to this other issue https://bugzilla.redhat.com/show_bug.cgi?id=2090436 That was discussed in this slack conversation https://redhat-internal.slack.com/archives/C02CZNQHGN8/p1653479831004619
Description of problem:
The presubmit test that expects an inactive CPMS to be regenerated resets the state at the end of the test. In doing so, it causes the CPMS generator to regenerate back to the original state. Part of regeneration involves deleting and recreating the CPMS. If the regeneration is not quick enough, the next part of the test can fail, as it expects the CPMS to exist. We should change this to an eventually check to avoid the race between the generator and the test. See https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_cluster-control-plane-machine-set-operator/304/pull-ci-openshift-cluster-control-plane-machine-set-operator-release-4.13-e2e-aws-operator/1801195115868327936 as an example failure
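A minimal sketch of the suggested fix: poll until the CPMS has been recreated by the generator instead of asserting its existence once. getCPMS is a hypothetical lookup standing in for the test's client call, and the timings are illustrative.

```go
package main

import (
	"context"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// waitForCPMS keeps checking for the regenerated CPMS until it appears or the
// timeout expires, tolerating the window where the generator has deleted but
// not yet recreated the resource.
func waitForCPMS(ctx context.Context, getCPMS func(context.Context) error) error {
	return wait.PollUntilContextTimeout(ctx, 10*time.Second, 5*time.Minute, true,
		func(ctx context.Context) (bool, error) {
			if err := getCPMS(ctx); err != nil {
				// Not found yet (or a transient error): keep polling.
				return false, nil
			}
			return true, nil
		})
}
```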
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
This is a clone of issue OCPBUGS-42324. The following is the description of the original issue:
—
Description of problem:
This is a spinoff of https://issues.redhat.com/browse/OCPBUGS-38012. For additional context please see that bug. The TLDR is that Restart=on-failure for oneshot units were only supported in systemd v244 and onwards, meaning any bootimage for 4.12 and previous doesn't support this on firstboot, and upgraded clusters would no longer be able to scale nodes if it references any such service. Right now this is only https://github.com/openshift/machine-config-operator/blob/master/templates/common/openstack/units/afterburn-hostname.service.yaml#L16-L24 which isn't covered by https://issues.redhat.com/browse/OCPBUGS-38012
Version-Release number of selected component (if applicable):
4.16 right now
How reproducible:
Uncertain, but https://issues.redhat.com/browse/OCPBUGS-38012 is 100%
Steps to Reproduce:
1.install old openstack cluster 2.upgrade to 4.16 3.attempt to scale node
Actual results:
Expected results:
Additional info:
Description of problem:
In PR https://github.com/openshift/console/pull/13676 we worked on improving the performance of the PipelineRun list page, and the issue https://issues.redhat.com/browse/OCPBUGS-32631 was created to further improve the performance of the PLR list page. Once this is complete, we have to improve the performance of the Pipeline list page by considering the points below: 1. TaskRuns should not be fetched for all the PLRs. 2. Use pipelinerun.status.conditions.message to get the status of TaskRuns. 3. For any PLR, if the string pipelinerun.status.conditions.message contains data about the task statuses, use that string instead of fetching TaskRuns.
Description of problem:
`oc adm prune deployments` does not work and gives the below error when using the --replica-sets option.
[root@weyb1525 ~]# oc adm prune deployments --orphans --keep-complete=1 --keep-failed=0 --keep-younger-than=1440m --replica-sets --v=6 I0603 09:55:39.588085 1540280 loader.go:373] Config loaded from file: /root/openshift-install/paas-03.build.net.intra.laposte.fr/auth/kubeconfig I0603 09:55:39.890672 1540280 round_trippers.go:553] GET https://api-int.paas-03.build.net.intra.laposte.fr:6443/apis/apps.openshift.io/v1/deploymentconfigs 200 OK in 301 milliseconds Warning: apps.openshift.io/v1 DeploymentConfig is deprecated in v4.14+, unavailable in v4.10000+ I0603 09:55:40.529367 1540280 round_trippers.go:553] GET https://api-int.paas-03.build.net.intra.laposte.fr:6443/apis/apps/v1/deployments 200 OK in 65 milliseconds I0603 09:55:41.369413 1540280 round_trippers.go:553] GET https://api-int.paas-03.build.net.intra.laposte.fr:6443/api/v1/replicationcontrollers 200 OK in 706 milliseconds I0603 09:55:43.083804 1540280 round_trippers.go:553] GET https://api-int.paas-03.build.net.intra.laposte.fr:6443/apis/apps/v1/replicasets 200 OK in 118 milliseconds I0603 09:55:43.320700 1540280 prune.go:58] Creating deployment pruner with keepYoungerThan=24h0m0s, orphans=true, replicaSets=true, keepComplete=1, keepFailed=0 Dry run enabled - no modifications will be made. Add --confirm to remove deployments panic: interface conversion: interface {} is *v1.Deployment, not *v1.DeploymentConfig goroutine 1 [running]: github.com/openshift/oc/pkg/cli/admin/prune/deployments.(*dataSet).GetDeployment(0xc007fa9bc0, {0x5052780?, 0xc00a0b67b0?}) /go/src/github.com/openshift/oc/pkg/cli/admin/prune/deployments/data.go:171 +0x3d6 github.com/openshift/oc/pkg/cli/admin/prune/deployments.(*orphanReplicaResolver).Resolve(0xc006ec87f8) /go/src/github.com/openshift/oc/pkg/cli/admin/prune/deployments/resolvers.go:78 +0x1a6 github.com/openshift/oc/pkg/cli/admin/prune/deployments.(*mergeResolver).Resolve(0x55?) /go/src/github.com/openshift/oc/pkg/cli/admin/prune/deployments/resolvers.go:28 +0xcf github.com/openshift/oc/pkg/cli/admin/prune/deployments.(*pruner).Prune(0x5007c40?, {0x50033e0, 0xc0083c19e0}) /go/src/github.com/openshift/oc/pkg/cli/admin/prune/deployments/prune.go:96 +0x2f github.com/openshift/oc/pkg/cli/admin/prune/deployments.PruneDeploymentsOptions.Run({0x0, 0x1, 0x1, 0x4e94914f0000, 0x1, 0x0, {0x0, 0x0}, {0x5002d00, 0xc000ba78c0}, ...}) /go/src/github.com/openshift/oc/pkg/cli/admin/prune/deployments/deployments.go:206 +0xa03 github.com/openshift/oc/pkg/cli/admin/prune/deployments.NewCmdPruneDeployments.func1(0xc0005f4900?, {0xc0006db020?, 0x0?, 0x6?}) /go/src/github.com/openshift/oc/pkg/cli/admin/prune/deployments/deployments.go:78 +0x118 github.com/spf13/cobra.(*Command).execute(0xc0005f4900, {0xc0006dafc0, 0x6, 0x6}) /go/src/github.com/openshift/oc/vendor/github.com/spf13/cobra/command.go:944 +0x847 github.com/spf13/cobra.(*Command).ExecuteC(0xc000e5b800) /go/src/github.com/openshift/oc/vendor/github.com/spf13/cobra/command.go:1068 +0x3bd github.com/spf13/cobra.(*Command).Execute(...) /go/src/github.com/openshift/oc/vendor/github.com/spf13/cobra/command.go:992 k8s.io/component-base/cli.run(0xc000e5b800) /go/src/github.com/openshift/oc/vendor/k8s.io/component-base/cli/run.go:146 +0x317 k8s.io/component-base/cli.RunNoErrOutput(...) /go/src/github.com/openshift/oc/vendor/k8s.io/component-base/cli/run.go:84 main.main() /go/src/github.com/openshift/oc/cmd/oc/oc.go:77 +0x365
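The panic points at an unchecked type assertion; a comma-ok assertion or a type switch over both owner kinds avoids it. The sketch below shows the pattern only and is not the actual oc fix.

```go
package main

import (
	"fmt"

	appsopenshiftv1 "github.com/openshift/api/apps/v1"
	appsv1 "k8s.io/api/apps/v1"
)

// ownerName handles both owner kinds the pruner's data set can now contain,
// instead of casting blindly to *DeploymentConfig and panicking on Deployments.
func ownerName(obj interface{}) (string, error) {
	switch o := obj.(type) {
	case *appsv1.Deployment:
		return o.Name, nil
	case *appsopenshiftv1.DeploymentConfig:
		return o.Name, nil
	default:
		return "", fmt.Errorf("unexpected owner type %T", obj)
	}
}
```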
Version-Release number of selected component (if applicable):
How reproducible:
Run oc adm prune deployments command with --replica-sets option
# oc adm prune deployments --keep-younger-than=168h --orphans --keep-complete=5 --keep-failed=1 --replica-sets=true
Actual results:
It fails with the below error: panic: interface conversion: interface {} is *v1.Deployment, not *v1.DeploymentConfig
Expected results:
It should not fail and should work as expected.
Additional info:
Slack thread https://redhat-internal.slack.com/archives/CKJR6200N/p1717519017531979
Please review the following PR: https://github.com/openshift/cloud-credential-operator/pull/705
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
When attempting to install on a provider network on PSI, I get the following pre-flight validation error:
ERROR failed to fetch Metadata: failed to load asset "Install Config": failed to create install config: platform.openstack.controlPlanePort.network: Invalid value: "316eeb47-1498-46b4-b39e-00ddf73bd2a5": network must contain subnets
The network does contain one subnet.
install-config.yaml:
# yaml-language-server: $schema=https://raw.githubusercontent.com/pierreprinetti/openshift-installconfig-schema/release-4.16/installconfig.schema.json
apiVersion: v1
baseDomain: ${BASE_DOMAIN}
compute:
- name: worker
platform:
openstack:
type: ${COMPUTE_FLAVOR}
replicas: 3
controlPlane:
name: master
platform:
openstack:
type: ${CONTROL_PLANE_FLAVOR}
replicas: 3
metadata:
name: ${CLUSTER_NAME}
platform:
openstack:
controlPlanePort:
network:
id: 316eeb47-1498-46b4-b39e-00ddf73bd2a5
cloud: ${OS_CLOUD}
clusterOSImage: rhcos-4.16
publish: External
pullSecret: |
${PULL_SECRET}
sshKey: |
${SSH_PUB_KEY}
In our vertical scaling test, after we delete a machine, we rely on the `status.readyReplicas` field of the ControlPlaneMachineSet (CPMS) to indicate that it has successfully created a new machine that lets us scale up before we scale down.
https://github.com/openshift/origin/blob/3deedee4ae147a03afdc3d4ba86bc175bc6fc5a8/test/extended/etcd/vertical_scaling.go#L76-L87
As we've seen in the past as well, that status field isn't a reliable indicator of the scale-up of machines: status.readyReplicas might stay at 3 because the soon-to-be-removed node that is pending deletion can go Ready=Unknown, as in runs such as the following: https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_cluster-etcd-operator/1286/pull-ci-openshift-cluster-etcd-operator-master-e2e-aws-ovn-etcd-scaling/1808186565449486336
The test then ends up timing out waiting for status.readyReplicas=4 while the scale-up and scale-down may already have happened.
This shows up across scaling tests on all platforms as:
fail [github.com/openshift/origin/test/extended/etcd/vertical_scaling.go:81]: Unexpected error: <*errors.withStack | 0xc002182a50>: scale-up: timed out waiting for CPMS to show 4 ready replicas: timed out waiting for the condition { error: <*errors.withMessage | 0xc00304c3a0>{ cause: <wait.errInterrupted>{ cause: <*errors.errorString | 0xc0003ca800>{ s: "timed out waiting for the condition", }, }, msg: "scale-up: timed out waiting for CPMS to show 4 ready replicas", },
In hindsight, all we care about is whether the deleted machine's member is replaced by another machine's member; we can ignore the flapping of node and machine statuses while we wait for the scale-up and then scale-down of members to happen. So we can relax or replace that check on status.readyReplicas with just looking at the membership change.
PS: We can also update the outdated Godoc comments for the test to mention that it relies on CPMSO to create a machine for us https://github.com/openshift/origin/blob/3deedee4ae147a03afdc3d4ba86bc175bc6fc5a8/test/extended/etcd/vertical_scaling.go#L34-L38
This is a clone of issue OCPBUGS-38437. The following is the description of the original issue:
—
Description of problem:
After branching, main branch still publishes Konflux builds to mce-2.7
Version-Release number of selected component (if applicable):
mce-2.7
How reproducible:
100%
Steps to Reproduce:
1. Post a PR to main
2. Check the jobs that run
Actual results:
Both mce-2.7 and main Konflux builds get triggered
Expected results:
Only main branch Konflux builds gets triggered
Additional info:
This is a clone of issue OCPBUGS-41358. The following is the description of the original issue:
—
Description of problem:
While upgrading the cluster from web-console the below warning message observed. ~~~ Warning alert:Admission Webhook Warning ClusterVersion version violates policy 299 - "unknown field \"spec.desiredUpdate.channels\"", 299 - "unknown field \"spec.desiredUpdate.url\"" ~~~ There are no such fields in the clusterVersion yaml for which the warning message fired. From the documentation here: https://docs.openshift.com/container-platform/4.16/rest_api/config_apis/clusterversion-config-openshift-io-v1.html It's possible to see that "spec.desiredUpdate" exists, but there is no mention of values "channels" or "url" under desiredUpdate. Note: This is not impacting the cluster upgrade. However creating confusion among customers due to the warning message.
Version-Release number of selected component (if applicable):
4.16
How reproducible:
Everytime
Steps to Reproduce:
1. Install cluster of version 4.16.4 2. Upgrade the cluster from web-console to the next-minor version 3.
Actual results:
The Admission Webhook Warning described above is displayed during the upgrade
Expected results:
The upgrade should proceed with no such warnings
Additional info:
Description of problem:
Running `oc exec` through a proxy doesn't work
Version-Release number of selected component (if applicable):
4.17.1
How reproducible:
100%
Additional info:
This looks to have been fixed upstream in https://github.com/kubernetes/kubernetes/pull/126253, which made it into 1.30.4 and should already be in 1.31.1 as used in 4.18. Likely oc just needs to be bumped to that version or later.
Reported by Forrest:
Looks to me like 4.16 is logging heavily now compared to the last good
4.16.0-0.nightly-2024-05-23-173505
https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.16-e2e-aws-sdn-upgrade/1793698069414416384/artifacts/e2e-aws-sdn-upgrade/openshift-e2e-test/build-log.txt - 208,414
Compared to
4.16.0-0.nightly-2024-05-24-204326
https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.16-e2e-aws-sdn-upgrade/1794107881906245632/artifacts/e2e-aws-sdn-upgrade/openshift-e2e-test/build-log.txt - 3,156,576
Description of problem:
When using nvme disks the assisted installation fails with: error Host perf-intel-6.perf.eng.bos2.dc.redhat.com: updated status from installing-in-progress to error (Failed - failed after 3 attempts, last error: failed executing /usr/bin/nsenter [--target 1 --cgroup --mount --ipc --pid -- coreos-installer install --insecure -i /opt/openshift/master.ign --append-karg ip=eno12399:dhcp /dev/nvme0n1], Error exit status 1, LastOutput "Error: checking for exclusive access to /dev/nvme0n1 Caused by: 0: couldn't reread partition table: device is in use 1: EBUSY: Device or resource busy")
Version-Release number of selected component (if applicable):
4.15 using the web assisted installer
How reproducible:
In a PowerEdge R760 with 5 disks (3 SSDs in RAID 5 and 2 NVMe with no RAID), if you use the SSD disks in RAID 5 the installer works as expected. If you disable these disks and use the NVMe storage, the installer fails with the above message. I tried other distributions booting and using only the NVMe disk and they work as expected (Fedora Rawhide and Ubuntu 22.04).
Steps to Reproduce:
1. Try the assisted installer with nvme disks
Actual results:
The installer fails
Expected results:
The installer finishes correctly
Additional info:
Description of problem:
non-existing oauth.config.openshift.io resource is listed on Global Configuration page
Version-Release number of selected component (if applicable):
4.16.0-0.nightly-2024-06-05-082646
How reproducible:
Always
Steps to Reproduce:
1. visit global configuration page /settings/cluster/globalconfig 2. check listed items on the page 3.
Actual results:
2. There are two OAuth.config.openshift.io entries; one links to /k8s/cluster/config.openshift.io~v1~OAuth/oauth-config, which returns 404: Not Found. $ oc get clusterversion NAME VERSION AVAILABLE PROGRESSING SINCE STATUS version 4.16.0-0.nightly-2024-06-05-082646 True False 171m Cluster version is 4.16.0-0.nightly-2024-06-05-082646 $ oc get oauth.config.openshift.io NAME AGE cluster 3h26m
Expected results:
From the CLI output we can see there is only one oauth.config.openshift.io resource, but we are showing one more, 'oauth-config'. Only one oauth.config.openshift.io resource should be listed.
Additional info:
Description of problem:
The runbook was added in https://issues.redhat.com/browse/MON-3862 The alert is more likely to fire in >=4.16
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
This is a clone of issue OCPBUGS-43655. The following is the description of the original issue:
—
Description of problem:
When we switched the API servers to use the /livez endpoint, we overlooked updating the audit policy to exclude this endpoint from being logged. As a result, requests to the /livez endpoint are currently being persisted in the audit log files. The issue applies to the other API servers as well (oas and oauth-apiserver)
Version-Release number of selected component (if applicable):
How reproducible:
Just download must-gather and grep for /livez endpoint.
Steps to Reproduce:
Just download must-gather and grep for /livez endpoint.
Actual results:
Requests to the /livez endpoint are being recorded in the audit log files.
Expected results:
Requests to the /livez endpoint are NOT being recorded in the audit log files.
Additional info:
This is a clone of issue OCPBUGS-42987. The following is the description of the original issue:
—
It has been observed that the esp_offload kernel module might be loaded by libreswan even if bond ESP offloads have been correctly turned off.
This might be because the ipsec service and configure-ovs run at the same time, so the ipsec service can start while the bond offloads are not yet turned off, tricking libreswan into thinking they should be used.
The potential fix would be to run the ipsec service after configure-ovs.
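A minimal sketch of a systemd ordering drop-in that expresses this, assuming the configure-ovs step is exposed as a unit named ovs-configuration.service (the unit names here are illustrative, not the actual fix shipped for this bug):

```
# /etc/systemd/system/ipsec.service.d/10-after-configure-ovs.conf
[Unit]
# Only start IPsec once the bond/OVS configuration (including offload settings) has run.
After=ovs-configuration.service
```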
Upstream just merged https://github.com/etcd-io/etcd/pull/16246,
which refactors the way the dashboards are defined. We need to verify whether our own jsonnet integration still works with that change and whether we can still display dashboards in OpenShift.
See additional challenges they faced with the helm chart:
https://github.com/prometheus-community/helm-charts/pull/3880
AC:
This is a clone of issue OCPBUGS-42143. The following is the description of the original issue:
—
Description of problem:
Another panic occurred in https://issues.redhat.com/browse/OCPBUGS-34877?focusedId=25580631&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-25580631, which should be fixed.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
This is a clone of issue OCPBUGS-36222. The following is the description of the original issue:
—
Description of problem:
The AWS Cluster API Provider (CAPA) runs a required check to resolve the DNS name for load balancers it creates. If the CAPA controller (in this case, running in the installer) cannot resolve the DNS record, CAPA will not report infrastructure ready. We are seeing that, in some cases, installations running on local hosts (we have not seen this problem in CI) cannot resolve the LB DNS name record, and the install fails like this:
DEBUG I0625 17:05:45.939796 7645 awscluster_controller.go:295] "Waiting on API server ELB DNS name to resolve" controller="awscluster" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AWSCluster" AWSCluster="openshift-cluster-api-guests/umohnani-4-16test-5ndjw" namespace="openshift-cluster-api-guests" name="umohnani-4-16test-5ndjw" reconcileID="553beb3d-9b53-4d83-b417-9c70e00e277e" cluster="openshift-cluster-api-guests/umohnani-4-16test-5ndjw"
DEBUG Collecting applied cluster api manifests...
ERROR failed to fetch Cluster: failed to generate asset "Cluster": failed to create cluster: infrastructure was not ready within 15m0s: client rate limiter Wait returned an error: context deadline exceeded
We do not know why some hosts cannot resolve these records, but it could be something like issues with the local DNS resolver cache, DNS records are slow to propagate in AWS, etc.
Version-Release number of selected component (if applicable):
4.16, 4.17
How reproducible:
Not reproducible / unknown -- this seems to be dependent on specific hosts and we have not determined why some hosts face this issue while others do not.
Steps to Reproduce:
n/a
Actual results:
Install fails because CAPA cannot resolve LB DNS name
Expected results:
As the DNS record does exist, install should be able to proceed.
Additional info:
Slack thread:
https://redhat-internal.slack.com/archives/C68TNFWA2/p1719351032090749
Description of problem:
Installing OCP with CAPI and setting bootType: "UEFI" results in an unsupported value error. Installing with terraform did not hit this issue.
platform:
nutanix:
bootType: "UEFI"
# ./openshift-install create cluster --dir cluster --log-level debug
...
ERROR failed to fetch Cluster: failed to generate asset "Cluster": failed to create cluster: failed to create control-plane manifest: NutanixMachine.infrastructure.cluster.x-k8s.io "sgao-nutanix-zonal-jwp6d-bootstrap" is invalid: spec.bootType: Unsupported value: "UEFI": supported values: "legacy", "uefi"
Setting bootType: "uefi" also does not work:
# ./openshift-install create manifests --dir cluster
...
FATAL failed to fetch Master Machines: failed to generate asset "Master Machines": failed to create master machine objects: platform.nutanix.bootType: Invalid value: "uefi": valid bootType: "", "Legacy", "UEFI", "SecureBoot".
Version-Release number of selected component (if applicable):
4.16.0-0.nightly-2024-05-08-222442
How reproducible:
always
Steps to Reproduce:
1. Create install config with bootType: "UEFI" and enable capi by setting:
   featureSet: CustomNoUpgrade
   featureGates:
   - ClusterAPIInstall=true
2. Install cluster
Actual results:
Install failed
Expected results:
Install passed
Additional info:
Description of problem:
Contrary to terraform, we do not delete the S3 bucket used for ignition during bootstrapping.
Version-Release number of selected component (if applicable):
4.16+
How reproducible:
always
Steps to Reproduce:
1. Deploy a cluster
2. Check that the openshift-bootstrap-data-$infraID bucket exists and is empty
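A quick way to check for the leftover bucket from the command line, assuming the AWS CLI is configured for the cluster's account and INFRA_ID holds the cluster's infra ID (illustrative check, not part of the reproduction steps):

```sh
# Exits 0 and prints a message if the bootstrap ignition bucket still exists.
aws s3api head-bucket --bucket "openshift-bootstrap-data-${INFRA_ID}" && echo "bucket still present"

# Manual cleanup of the leftover (empty) bucket.
aws s3 rb "s3://openshift-bootstrap-data-${INFRA_ID}"
```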
Actual results:
Empty bucket left.
Expected results:
Bucket is deleted.
Additional info:
IR-467 did not merge in time to be included in 4.17. This will be needed for OCP 4.17.z.
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
When we run opm on RHEL8, we get the following error:
./opm: /lib64/libc.so.6: version `GLIBC_2.32' not found (required by ./opm)
./opm: /lib64/libc.so.6: version `GLIBC_2.33' not found (required by ./opm)
./opm: /lib64/libc.so.6: version `GLIBC_2.34' not found (required by ./opm)
Note: this happened with 4.15.0-ec.3. The 4.14 binary works, and a binary compiled from the latest code also works.
Version-Release number of selected component (if applicable):
4.15.0-ec.3
How reproducible:
always
Steps to Reproduce:
[root@preserve-olm-env2 slavecontainer]# curl -s -k -L https://mirror2.openshift.com/pub/openshift-v4/x86_64/clients/ocp-dev-preview/candidate/opm-linux-4.15.0-ec.3.tar.gz -o opm.tar.gz && tar -xzvf opm.tar.gz opm
[root@preserve-olm-env2 slavecontainer]# ./opm version
./opm: /lib64/libc.so.6: version `GLIBC_2.32' not found (required by ./opm)
./opm: /lib64/libc.so.6: version `GLIBC_2.33' not found (required by ./opm)
./opm: /lib64/libc.so.6: version `GLIBC_2.34' not found (required by ./opm)
[root@preserve-olm-env2 slavecontainer]# curl -s -l -L https://mirror2.openshift.com/pub/openshift-v4/x86_64/clients/ocp/latest-4.14/opm-linux-4.14.5.tar.gz -o opm.tar.gz && tar -xzvf opm.tar.gz opm
[root@preserve-olm-env2 slavecontainer]# opm version
Version: version.Version{OpmVersion:"639fc1203", GitCommit:"639fc12035292dec74a16b306226946c8da404a2", BuildDate:"2023-11-21T08:03:15Z", GoOs:"linux", GoArch:"amd64"}
[root@preserve-olm-env2 kuiwang]# cd operator-framework-olm/
[root@preserve-olm-env2 operator-framework-olm]# git branch
  gs
* master
  release-4.10
  release-4.11
  release-4.12
  release-4.13
  release-4.8
  release-4.9
[root@preserve-olm-env2 operator-framework-olm]# git pull origin master
remote: Enumerating objects: 1650, done.
remote: Counting objects: 100% (1650/1650), done.
remote: Compressing objects: 100% (831/831), done.
remote: Total 1650 (delta 727), reused 1617 (delta 711), pack-reused 0
Receiving objects: 100% (1650/1650), 2.03 MiB | 12.81 MiB/s, done.
Resolving deltas: 100% (727/727), completed with 468 local objects.
From github.com:openshift/operator-framework-olm
* branch master -> FETCH_HEAD
639fc1203..85c579f9b master -> origin/master
Updating 639fc1203..85c579f9b
Fast-forward
go.mod | 120 +-
go.sum | 240 ++--
manifests/0000_50_olm_00-pprof-secret.yaml
...
create mode 100644 vendor/google.golang.org/protobuf/types/dynamicpb/types.go
[root@preserve-olm-env2 operator-framework-olm]# rm -fr bin/opm
[root@preserve-olm-env2 operator-framework-olm]# make build/opm
make bin/opm
make[1]: Entering directory '/data/kuiwang/operator-framework-olm'
go build -ldflags "-X 'github.com/operator-framework/operator-registry/cmd/opm/version.gitCommit=85c579f9be61aaea11e90b6c870452c72107300a' -X 'github.com/operator-framework/operator-registry/cmd/opm/version.opmVersion=85c579f9b' -X 'github.com/operator-framework/operator-registry/cmd/opm/version.buildDate=2023-12-11T06:12:50Z'" -mod=vendor -tags "json1" -o bin/opm github.com/operator-framework/operator-registry/cmd/opm
make[1]: Leaving directory '/data/kuiwang/operator-framework-olm'
[root@preserve-olm-env2 operator-framework-olm]# which opm
/data/kuiwang/operator-framework-olm/bin/opm
[root@preserve-olm-env2 operator-framework-olm]# opm version
Version: version.Version{OpmVersion:"85c579f9b", GitCommit:"85c579f9be61aaea11e90b6c870452c72107300a", BuildDate:"2023-12-11T06:12:50Z", GoOs:"linux", GoArch:"amd64"}
Actual results:
Expected results:
Additional info:
This is a clone of issue OCPBUGS-41610. The following is the description of the original issue:
—
Description of problem:
After click "Don's show again" on Lightspeed popup modal, it went to user preference page, and highlighted "Hide Lightspeed" part with rectangle lines, if now click "Lightspeed" popup button at the right bottom, the highlighted rectangle lines lay above the popup modal.
Version-Release number of selected component (if applicable):
4.18.0-0.nightly-2024-09-09-150616
How reproducible:
Always
Steps to Reproduce:
1.Clicked "Don's show again" on Lightspeed popup modal, it went to user preference page, and highlighted "Hide Lightspeed" part with rectangle lines. At the same time, click "Lightspeed" popup button at the right bottom. 2. 3.
Actual results:
1. The highlighted rectangular outline is drawn above the popup modal. Screenshot: https://drive.google.com/drive/folders/15te0dbavJUTGtqRYFt-rM_U8SN7euFK5?usp=sharing
Expected results:
1. The Lightspeed popup modal should be on the top layer.
Additional info:
This is a clone of issue OCPBUGS-42097. The following is the description of the original issue:
—
Example failed test:
4/1291 Tests Failed: user system:serviceaccount:openshift-infra:serviceaccount-pull-secrets-controller in ns/openshift-infra must not produce too many applies {had 7618 applies, check the audit log and operator log to figure out why; details in audit log}
Description of problem:
The ci/prow/verify-crd-schema job on openshift/api fails due to missing listType tags when adding a tech preview feature to the IngressController, because the tooling branches the CRDs into separate versions via feature sets. As an example, it fails on https://github.com/openshift/api/pull/1841. The "must set x-kubernetes-list-type" errors need to be resolved by adding either:
// +listType=atomic
or
// +listType=map
// +listMapKey=<key>
to the fields that are missing the tags.
Version-Release number of selected component (if applicable):
4.17
How reproducible:
100%
Steps to Reproduce:
1. Add a techpreview field to the IngressController API 2. make update 3. ./hack/verify-crd-schema-checker.sh
Actual results:
t.io version/v1 field/^.spec.httpHeaders.headerNameCaseAdjustments must set x-kubernetes-list-type error in operator/v1/zz_generated.crd-manifests/0000_50_ingress_00_ingresscontrollers-CustomNoUpgrade.crd.yaml: ListsMustHaveSSATags: crd/ingresscontrollers.operator.openshift.io version/v1 field/^.spec.logging.access.httpCaptureCookies must set x-kubernetes-list-type ...
Expected results:
No errors except for any errors for embedded external fields. E.g. this error is unavoidable and must always be overridden: error in operator/v1/zz_generated.featuregated-crd-manifests/ingresscontrollers.operator.openshift.io/IngressControllerLBSubnetsAWS.yaml: NoMaps: crd/ingresscontrollers.operator.openshift.io version/v1 field/^.spec.routeSelector.matchLabels may not be a map
Additional info:
This is a clone of issue OCPBUGS-36494. The following is the description of the original issue:
—
Description of problem:
If the `template:` field in the vSphere platform spec is defined, the installer should not download the OVA.
Version-Release number of selected component (if applicable):
4.16.x 4.17.x
How reproducible:
Always
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
While creating an install configuration for PowerVS IPI, the default region is not set, which leaves the survey stuck if nothing is entered at the command line.
Description of problem:
The coresPerSocket value set in install-config does not match the actual result. When setting controlPlane.platform.vsphere.cpus to 16 and controlPlane.platform.vsphere.coresPerSocket to 8, the actual result was "NumCPU": 16, "NumCoresPerSocket": 16. NumCoresPerSocket should match the install-config setting instead of NumCPU. Checking the setting in VSphereMachine-openshift-cluster-api-guests-wwei1215a-42n48-master-0.yaml, numcorespersocket is 0:
numcpus: 16
numcorespersocket: 0
Version-Release number of selected component (if applicable):
4.16.0-0.nightly-2024-05-08-222442
How reproducible:
See description
Steps to Reproduce:
1. Set coresPerSocket for the control plane in install-config (cpus must be a multiple of coresPerSocket)
2. Install the cluster
Actual results:
The NumCoresPerSocket is equal to NumCPU. In file VSphereMachine-openshift-cluster-api-guests-xxxx-xxxx-master-0.yaml, the numcorespersocket is 0. and in vm setting: "NumCoresPerSocket": 8.
Expected results:
The NumCoresPerSocket should match the setting in install-config.
Additional info:
installconfig setting:
controlPlane:
  architecture: amd64
  hyperthreading: Enabled
  name: master
  platform:
    vsphere:
      cpus: 16
      coresPerSocket: 8
check result:
"Hardware": { "NumCPU": 16, "NumCoresPerSocket": 16 }
The check result for the compute node is as expected.
installconfig setting:
compute:
- architecture: amd64
  hyperthreading: Enabled
  name: worker
  platform:
    vsphere:
      cpus: 8
      coresPerSocket: 4
check result:
"Hardware": { "NumCPU": 8, "NumCoresPerSocket": 4 }
Please review the following PR: https://github.com/openshift/ibm-vpc-block-csi-driver-operator/pull/117
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
The goal is to collect metrics about the CPU cores of ACM managed clusters, because this is one of the sources used to bill customers for product subscription usage.
acm_managed_cluster_worker_cores represents the total number of CPU cores on the worker nodes of the ACM managed clusters.
Labels
The cardinality of the metric is at most 1.
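For subscription reporting, a minimal PromQL sketch of how this metric could be rolled up on the hub (the aggregation shown is illustrative and assumes the metric is already being collected):

```promql
# Total worker CPU cores across all ACM managed clusters.
sum(acm_managed_cluster_worker_cores)
```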
Please review the following PR: https://github.com/openshift/csi-driver-nfs/pull/142
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
When performing a LocalClusterImport, the following error is seen when booting from the discovery ISO created by the provided Infraenv.
```
State Info: Host does not meet the minimum hardware requirements: This host has failed to download the ignition file from http://api.local-cluster.ocp-edge-cluster-0.qe.lab.redhat.com:22624/config/worker with the following error: ignition file download failed: request failed: Get "http://api.local-cluster.ocp-edge-cluster-0.qe.lab.redhat.com:22624/config/worker": dial tcp: lookup api.local-cluster.ocp-edge-cluster-0.qe.lab.redhat.com on 192.168.123.1:53: no such host. Please ensure the host can reach this URL
```
The URL for the discovery ISO is reported as
`api.local-cluster.ocp-edge-cluster-0.qe.lab.redhat.com`
For this use case, the URL for the discovery ISO should instead be
`api.ocp-edge-cluster-0.qe.lab.redhat.com`
Some changes to how the Day2 import is performed for the hub cluster will need to take place to ensure that when importing the local hub cluster, this issue is avoided.
Description of problem:
Mirror failed due to "manifest unknown" errors on certain images in the v2 format.
Version-Release number of selected component (if applicable):
oc-mirror version WARNING: This version information is deprecated and will be replaced with the output from --short. Use --output=yaml|json to get the full version. Client Version: version.Info{Major:"", Minor:"", GitVersion:"4.16.0-202403251146.p0.g03ce0ca.assembly.stream.el9-03ce0ca", GitCommit:"03ce0ca797e73b6762fd3e24100ce043199519e9", GitTreeState:"clean", BuildDate:"2024-03-25T16:34:33Z", GoVersion:"go1.21.7 (Red Hat 1.21.7-1.el9) X:strictfipsruntime", Compiler:"gc", Platform:"linux/amd64"}
How reproducible:
always
Steps to Reproduce:
1) Test full==true with the following imagesetconfig:
cat config-full.yaml
apiVersion: mirror.openshift.io/v1alpha2
kind: ImageSetConfiguration
mirror:
  operators:
  - catalog: registry.redhat.io/redhat/redhat-operator-index:v4.16
    full: true
`oc-mirror --config config-full.yaml file://out-full --v2`
Actual results:
The mirror command always failed with errors:
2024/04/08 02:50:52 [ERROR] : [Worker] errArray initializing source docker://registry.redhat.io/3scale-mas/zync-rhel8@sha256:8a108677b0b4100a3d58d924b2c7a47425292492df3dc6a2ebff33c58ca4e9e8: reading manifest sha256:8a108677b0b4100a3d58d924b2c7a47425292492df3dc6a2ebff33c58ca4e9e8 in registry.redhat.io/3scale-mas/zync-rhel8: manifest unknown
2024/04/08 09:12:55 [ERROR] : [Worker] errArray initializing source docker://registry.redhat.io/integration/camel-k-rhel8-operator@sha256:4796985f3efcd37b057dea0a35b526c02759f8ea63327921cdd2e504c575d3c0: reading manifest sha256:4796985f3efcd37b057dea0a35b526c02759f8ea63327921cdd2e504c575d3c0 in registry.redhat.io/integration/camel-k-rhel8-operator: manifest unknown
2024/04/08 09:12:55 [ERROR] : [Worker] errArray initializing source docker://registry.redhat.io/integration/camel-k-rhel8-operator@sha256:4796985f3efcd37b057dea0a35b526c02759f8ea63327921cdd2e504c575d3c0: reading manifest sha256:4796985f3efcd37b057dea0a35b526c02759f8ea63327921cdd2e504c575d3c0 in registry.redhat.io/integration/camel-k-rhel8-operator: manifest unknown
Expected results:
No error
The APIRemovedInNextReleaseInUse and APIRemovedInNextEUSReleaseInUse need to be updated for kube 1.30 in OCP 4.17.
Please review the following PR: https://github.com/openshift/kubernetes-autoscaler/pull/302
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
https://github.com/openshift/console/pull/13964 fixed pseudolocalization, but now the user needs to know their first preferred language code in order for pseudolocalization to work. Add information to INTERNATIONALIZATION.md on how to obtain that language code.
Please review the following PR: https://github.com/openshift/ibm-powervs-block-csi-driver/pull/83
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/installer/pull/8449
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
add-flow-ci.feature test is flaking sporadically for both console and console-operator repositories.
Running: add-flow-ci.feature (1 of 1) [23798:0602/212526.775826:ERROR:zygote_host_impl_linux.cc(273)] Failed to adjust OOM score of renderer with pid 24169: Permission denied (13) Couldn't determine Mocha version Logging in as test Create the different workloads from Add page redirect to home ensure perspective switcher is set to Developer ✓ Getting started resources on Developer perspective (16906ms) redirect to home ensure perspective switcher is set to Developer Select Template category CI/CD You are on Topology page - Graph view ✓ Deploy Application using Catalog Template "CI/CD": A-01-TC02 (example #1) (27858ms) redirect to home ensure perspective switcher is set to Developer Select Template category Databases You are on Topology page - Graph view ✓ Deploy Application using Catalog Template "Databases": A-01-TC02 (example #2) (29800ms) redirect to home ensure perspective switcher is set to Developer Select Template category Languages You are on Topology page - Graph view ✓ Deploy Application using Catalog Template "Languages": A-01-TC02 (example #3) (38286ms) redirect to home ensure perspective switcher is set to Developer Select Template category Middleware You are on Topology page - Graph view ✓ Deploy Application using Catalog Template "Middleware": A-01-TC02 (example #4) (30501ms) redirect to home ensure perspective switcher is set to Developer Select Template category Other You are on Topology page - Graph view ✓ Deploy Application using Catalog Template "Other": A-01-TC02 (example #5) (35567ms) redirect to home ensure perspective switcher is set to Developer Application Name "sample-app" is created Resource type "deployment" is selected You are on Topology page - Graph view ✓ Deploy secure image with Runtime icon from external registry: A-02-TC02 (example #1) (28896ms) redirect to home ensure perspective switcher is set to Developer Application Name "sample-app" is selected Resource type "deployment" is selected You are on Topology page - Graph view ✓ Deploy image with Runtime icon from internal registry: A-02-TC03 (example #1) (23555ms) redirect to home ensure perspective switcher is set to Developer Resource type "deployment" is selected You are on Topology page - Graph view You are on Topology page - Graph view You are on Topology page - Graph view ✓ Edit Runtime Icon while Editing Image: A-02-TC05 (47438ms) redirect to home ensure perspective switcher is set to Developer You are on Topology page - Graph view ✓ Create the Database from Add page: A-03-TC01 (19645ms) redirect to home ensure perspective switcher is set to Developer redirect to home ensure perspective switcher is set to Developer 1) Deploy git workload with devfile from topology page: A-04-TC01 redirect to home ensure perspective switcher is set to Developer Resource type "Deployment" is selected You are on Topology page - Graph view ✓ Create a workload from Docker file with "Deployment" as resource type: A-05-TC02 (example #1) (43434ms) redirect to home ensure perspective switcher is set to Developer You are on Topology page - Graph view ✓ Create a workload from YAML file: A-07-TC01 (31905ms) redirect to home ensure perspective switcher is set to Developer ✓ Upload Jar file page details: A-10-TC01 (24692ms) redirect to home ensure perspective switcher is set to Developer You are on Topology page - Graph view ✓ Create Sample Application from Add page: GS-03-TC05 (example #1) (40882ms) redirect to home ensure perspective switcher is set to Developer You are on Topology page - Graph view ✓ Create Sample Application 
from Add page: GS-03-TC05 (example #2) (52287ms) redirect to home ensure perspective switcher is set to Developer ✓ Quick Starts page when no Quick Start has started: QS-03-TC02 (23439ms) redirect to home ensure perspective switcher is set to Developer quick start is complete ✓ Quick Starts page when Quick Start has completed: QS-03-TC03 (28139ms) 17 passing (10m) 1 failing 1) Create the different workloads from Add page Deploy git workload with devfile from topology page: A-04-TC01: CypressError: `cy.focus()` can only be called on a single element. Your subject contained 14 elements. https://on.cypress.io/focus at Context.focus (https://console-openshift-console.apps.ci-op-lm9pvf4l-be832.origin-ci-int-aws.dev.rhcloud.com/__cypress/runner/cypress_runner.js:112944:70) at wrapped (https://console-openshift-console.apps.ci-op-lm9pvf4l-be832.origin-ci-int-aws.dev.rhcloud.com/__cypress/runner/cypress_runner.js:138021:19) From Your Spec Code: at Context.eval (webpack:///./support/step-definitions/addFlow/create-from-devfile.ts:10:59) at Context.resolveAndRunStepDefinition (webpack:////go/src/github.com/openshift/console/frontend/node_modules/cypress-cucumber-preprocessor/lib/resolveStepDefinition.js:217:0) at Context.eval (webpack:////go/src/github.com/openshift/console/frontend/node_modules/cypress-cucumber-preprocessor/lib/createTestFromScenario.js:26:0) [mochawesome] Report JSON saved to /go/src/github.com/openshift/console/frontend/gui_test_screenshots/cypress_report_devconsole.json (Results) ┌────────────────────────────────────────────────────────────────────────────────────────────────┐ │ Tests: 18 │ │ Passing: 17 │ │ Failing: 1 │ │ Pending: 0 │ │ Skipped: 0 │ │ Screenshots: 2 │ │ Video: false │ │ Duration: 10 minutes, 0 seconds │ │ Spec Ran: add-flow-ci.feature │ └────────────────────────────────────────────────────────────────────────────────────────────────┘ (Screenshots) - /go/src/github.com/openshift/console/frontend/gui_test_screenshots/cypress/scree (1280x720) nshots/add-flow-ci.feature/Create the different workloads from Add page -- Deplo y git workload with devfile from topology page A-04-TC01 (failed).png - /go/src/github.com/openshift/console/frontend/gui_test_screenshots/cypress/scree (1280x720) nshots/add-flow-ci.feature/Create the different workloads from Add page -- Deplo y git workload with devfile from topology page A-04-TC01 (failed) (attempt 2).pn g ==================================================================================================== (Run Finished) Spec Tests Passing Failing Pending Skipped ┌────────────────────────────────────────────────────────────────────────────────────────────────┐ │ ✖ add-flow-ci.feature 10:00 18 17 1 - - │ └────────────────────────────────────────────────────────────────────────────────────────────────┘ ✖ 1 of 1 failed (100%) 10:00 18 17 1 - -
discoverOpenIDURLs and checkOIDCPasswordGrantFlow fail if endpoints are private to the data plane.
https://issues.redhat.com/browse/HOSTEDCP-421 enabled OAuth server traffic to flow through the data plane so that private endpoints, e.g. LDAP, can be reached.
https://issues.redhat.com/browse/OCPBUGS-8073 enabled a fallback to the management cluster network so that, for public endpoints such as GitHub, we do not block on having a data plane.
This issue is to make the CPO OIDC checks flow through the data plane and fall back to the management side, satisfying both cases above.
This would cover https://issues.redhat.com/browse/RFE-5638.
Describe Konnectivity in HyperShift, its components, and how to debug it.
This is a clone of issue OCPBUGS-38026. The following is the description of the original issue:
—
Description of problem:
There are two enhancements we could have for cns-migration:
1. We can print an error message when the target datastore is not found; currently it exits as if nothing happened:
sh-5.1$ /bin/cns-migration -kubeconfig /tmp/kubeconfig -source vsanDatastore -destination invalid -volume-file /tmp/pv.txt
KubeConfig is: /tmp/kubeconfig
I0806 07:59:34.884908 131 logger.go:28] logging successfully to vcenter
I0806 07:59:36.078911 131 logger.go:28] ----------- Migration Summary ------------
I0806 07:59:36.078944 131 logger.go:28] Migrated 0 volumes
I0806 07:59:36.078960 131 logger.go:28] Failed to migrate 0 volumes
I0806 07:59:36.078968 131 logger.go:28] Volumes not found 0
Compare with the source datastore check:
sh-5.1$ /bin/cns-migration -kubeconfig /tmp/kubeconfig -source invalid -destination Datastorenfsdevqe -volume-file /tmp/pv.txt
KubeConfig is: /tmp/kubeconfig
I0806 08:02:08.719657 138 logger.go:28] logging successfully to vcenter
E0806 08:02:08.749709 138 logger.go:10] error listing cns volumes: error finding datastore invalid in datacenter DEVQEdatacenter
2. If the volume-file contains an invalid PV name that is not found (for example at the beginning of the file), the tool exits immediately and all remaining PVs are skipped; it should continue checking the other PVs.
Version-Release number of selected component (if applicable):
4.17
How reproducible:
Always
Steps to Reproduce:
See Description
Description of problem:
A Redfish exception occurred while provisioning a worker using a HW RAID configuration on an HP server with iLO 5:
step': 'delete_configuration', 'abortable': False, 'priority': 0}: Redfish exception occurred. Error: The attribute StorageControllers/Name is missing from the resource /redfish/v1/Systems/1/Storage/DE00A000
spec used:
spec:
  raid:
    hardwareRAIDVolumes:
    - name: test-vol
      level: "1"
      numberOfPhysicalDisks: 2
      sizeGibibytes: 350
      online: true
Version-Release number of selected component (if applicable):
How reproducible:
always
Steps to Reproduce:
1. Provision an HPE worker with iLO 5 using Redfish
Actual results:
Expected results:
Additional info:
As of OpenShift 4.16, CRD management is more complex. This is an artifact of improvements made to feature gates and feature sets. David Eads and I agreed that, to avoid confusion, we should aim to stop having CRDs installed via operator repos, and, if their types live in o/api, install them from there instead.
We started this by moving the ControlPlaneMachineSet back to o/api, which is part of the MachineAPI capability.
Unbeknownst to us at the time, the installer currently applies all rendered resources via the cluster-bootstrap tool (roughly here) and not via the CVO.
Cluster-bootstrap is not capability aware, so it installed the CPMS CRD, which in turn broke the check in the CSR approver that stops it from crashing on MachineAPI-less clusters.
Options for moving forward include:
I'm not sure presently which of the 2nd or 3rd options is better, nor am I sure how the capabilities would become known to the "renderers"; the installer could provide them as args in bootkube.sh.template?
Original bug below, description of what's happening above
Description of problem:
After running tests on an SNO with the Telco DU profile for a couple of hours, kubernetes.io/kubelet-serving CSRs in Pending state start showing up and accumulate over time.
Version-Release number of selected component (if applicable):
4.16.0-rc.1
How reproducible:
once so far
Steps to Reproduce:
1. Deploy SNO with DU profile with disabled capabilities:
   installConfigOverrides: "{\"capabilities\":{\"baselineCapabilitySet\": \"None\", \"additionalEnabledCapabilities\": [ \"NodeTuning\", \"ImageRegistry\", \"OperatorLifecycleManager\" ] }}"
2. Leave the node running tests overnight for a couple of hours
3. Check for Pending CSRs
Actual results:
oc get csr -A | grep Pending | wc -l
27
Expected results:
No pending CSRs.
Also, oc logs returns a TLS internal error:
oc -n openshift-cluster-machine-approver --insecure-skip-tls-verify-backend=true logs machine-approver-866c94c694-7dwks
Defaulted container "kube-rbac-proxy" out of: kube-rbac-proxy, machine-approver-controller
Error from server: Get "https://[2620:52:0:8e6::d0]:10250/containerLogs/openshift-cluster-machine-approver/machine-approver-866c94c694-7dwks/kube-rbac-proxy": remote error: tls: internal error
Additional info:
Checking the machine-approver-controller container logs on the node, we can see the reconciliation is failing because it cannot find the Machine API, which is disabled via the capabilities.
I0514 13:25:09.266546 1 controller.go:120] Reconciling CSR: csr-dw9c8
E0514 13:25:09.275585 1 controller.go:138] csr-dw9c8: Failed to list machines in API group machine.openshift.io/v1beta1: no matches for kind "Machine" in version "machine.openshift.io/v1beta1"
E0514 13:25:09.275665 1 controller.go:329] "Reconciler error" err="Failed to list machines: no matches for kind \"Machine\" in version \"machine.openshift.io/v1beta1\"" controller="certificatesigningrequest" controllerGroup="certificates.k8s.io" controllerKind="CertificateSigningRequest" CertificateSigningRequest="csr-dw9c8" namespace="" name="csr-dw9c8" reconcileID="6f963337-c6f1-46e7-80c4-90494d21653c"
I0514 13:25:43.792140 1 controller.go:120] Reconciling CSR: csr-jvrvt
E0514 13:25:43.798079 1 controller.go:138] csr-jvrvt: Failed to list machines in API group machine.openshift.io/v1beta1: no matches for kind "Machine" in version "machine.openshift.io/v1beta1"
E0514 13:25:43.798128 1 controller.go:329] "Reconciler error" err="Failed to list machines: no matches for kind \"Machine\" in version \"machine.openshift.io/v1beta1\"" controller="certificatesigningrequest" controllerGroup="certificates.k8s.io" controllerKind="CertificateSigningRequest" CertificateSigningRequest="csr-jvrvt" namespace="" name="csr-jvrvt" reconcileID="decbc5d9-fa10-45d1-92f1-1c999df956ff"
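A sketch of how the pending kubelet-serving CSRs can be approved manually while the approver cannot reconcile (standard oc commands; this is a workaround for the symptom, not the fix):

```sh
# Approve any CSRs that do not yet have a status (i.e. are still pending).
oc get csr -o go-template='{{range .items}}{{if not .status}}{{.metadata.name}}{{"\n"}}{{end}}{{end}}' \
  | xargs --no-run-if-empty oc adm certificate approve
```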
Please review the following PR: https://github.com/openshift/cloud-provider-powervs/pull/68
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Just a bug to enable to automated update.
This is a clone of issue OCPBUGS-38632. The following is the description of the original issue:
—
Description of problem:
When we add a userCA bundle to a cluster that has MCPs with yum-based RHEL nodes, the MCPs with RHEL nodes become degraded.
Version-Release number of selected component (if applicable):
$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.17.0-0.nightly-2024-08-18-131731   True        False         101m    Cluster version is 4.17.0-0.nightly-2024-08-18-131731
How reproducible:
Always. In CI we found this issue running the test case "[sig-mco] MCO security Author:sregidor-NonHyperShiftHOST-High-67660-MCS generates ignition configs with certs [Disruptive] [Serial]" on the prow job periodic-ci-openshift-openshift-tests-private-release-4.17-amd64-nightly-gcp-ipi-workers-rhel8-fips-f28-destructive.
Steps to Reproduce:
1. Create a certificate
$ openssl genrsa -out privateKey.pem 4096
$ openssl req -new -x509 -nodes -days 3600 -key privateKey.pem -out ca-bundle.crt -subj "/OU=MCO qe/CN=example.com"
2. Add the certificate to the cluster
# Create the configmap with the certificate
$ oc create cm cm-test-cert -n openshift-config --from-file=ca-bundle.crt
configmap/cm-test-cert created
# Configure the proxy with the new test certificate
$ oc patch proxy/cluster --type merge -p '{"spec": {"trustedCA": {"name": "cm-test-cert"}}}'
proxy.config.openshift.io/cluster patched
3. Check the MCP status and the MCD logs
Actual results:
The MCP is degraded
$ oc get mcp
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-3251b00997d5f49171e70f7cf9b64776   True      False      False      3              3                   3                     0                      130m
worker   rendered-worker-05e7664fa4758a39f13a2b57708807f7   False     True       True       3              0                   0                     1                      130m
We can see this message in the MCP:
- lastTransitionTime: "2024-08-19T11:00:34Z"
  message: 'Node ci-op-jr7hwqkk-48b44-6mcjk-rhel-1 is reporting: "could not apply update: restarting coreos-update-ca-trust.service service failed. Error: error running systemctl restart coreos-update-ca-trust.service: Failed to restart coreos-update-ca-trust.service: Unit coreos-update-ca-trust.service not found.\n: exit status 5"'
  reason: 1 nodes are reporting degraded status on sync
  status: "True"
  type: NodeDegraded
In the MCD logs we can see:
I0819 11:38:55.089991 7239 update.go:2665] Removing SIGTERM protection
E0819 11:38:55.090067 7239 writer.go:226] Marking Degraded due to: could not apply update: restarting coreos-update-ca-trust.service service failed. Error: error running systemctl restart coreos-update-ca-trust.service: Failed to restart coreos-update-ca-trust.service: Unit coreos-update-ca-trust.service not found.
Expected results:
No degradation should happen. The certificate should be added without problems.
Additional info:
Description of problem:
The 4.17 PowerVS CI is failing due to the following issue: https://github.com/kubernetes-sigs/cluster-api-provider-ibmcloud/pull/2029. So we need to update to 9b077049 in the 4.17 release as well.
This is a clone of issue OCPBUGS-38183. The following is the description of the original issue:
—
Description of problem:
azure-disk-csi-driver doesn't use registryOverrides.
Version-Release number of selected component (if applicable):
4.17
How reproducible:
100%
Steps to Reproduce:
1. Set the registry override on the CPO
2. Watch that azure-disk-csi-driver continues to use the default registry
Actual results:
azure-disk-csi-driver uses default registry
Expected results:
azure-disk-csi-driver should use the mirrored registry.
Additional info:
This is a clone of issue OCPBUGS-36293. The following is the description of the original issue:
—
Description of problem:
CAPA is leaking one EIP in the bootstrap life cycle when creating clusters on 4.16+ with a BYO IPv4 pool in the config. The install log shows a duplicated-EIP message; there is a kind of race condition where the EIP is created and an association is attempted while the instance isn't ready (Running state):
~~~
time="2024-05-08T15:49:33-03:00" level=debug msg="I0508 15:49:33.785472 2878400 recorder.go:104] \"Failed to associate Elastic IP for \\\"ec2-i-03de70744825f25c5\\\": InvalidInstanceID: The pending instance 'i-03de70744825f25c5' is not in a valid state for this operation.\\n\\tstatus code: 400, request id: 7582391c-b35e-44b9-8455-e68663d90fed\" logger=\"events\" type=\"Warning\" object=[...]\"name\":\"mrb-byoip-32-kbcz9\",\"[...] reason=\"FailedAssociateEIP\""
time="2024-05-08T15:49:33-03:00" level=debug msg="E0508 15:49:33.803742 2878400 controller.go:329] \"Reconciler error\" err=<"
time="2024-05-08T15:49:33-03:00" level=debug msg="\tfailed to reconcile EIP: failed to associate Elastic IP \"eipalloc-08faccab2dbb28d4f\" to instance \"i-03de70744825f25c5\": InvalidInstanceID: The pending instance 'i-03de70744825f25c5' is not in a valid state for this operation."
~~~
The EIP is deleted when the bootstrap node is removed after a successful installation, although the bug impacts any new machine with a public IP set using BYO IPv4 provisioned by CAPA. An upstream issue has been opened: https://github.com/kubernetes-sigs/cluster-api-provider-aws/issues/5038
Version-Release number of selected component (if applicable):
4.16+
How reproducible:
always
Steps to Reproduce:
1. Create install-config.yaml setting platform.aws.publicIpv4Pool=poolID
2. Create the cluster
3. Check the AWS Console, EIP page, filtering by your cluster; you will see the duplicated EIP, while only one is associated with the bootstrap instance (see the CLI sketch below for an equivalent check)
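A CLI sketch equivalent to step 3, assuming the AWS CLI is configured and that the EIPs carry the usual kubernetes.io/cluster/<infraID> ownership tag (the tag key is an assumption about CAPA's tagging, not verified here):

```sh
# List the cluster's EIPs; entries without an AssociationId are the leaked ones.
aws ec2 describe-addresses \
  --filters "Name=tag-key,Values=kubernetes.io/cluster/${INFRA_ID}" \
  --output table
```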
Actual results:
Expected results:
- installer/capa creates only one EIP for bootstrap when provisioning the cluster - no error messages for expected behavior (ec2 association errors in pending state)
Additional info:
CAPA issue: https://github.com/kubernetes-sigs/cluster-api-provider-aws/issues/5038
With the current v2 implementation there is no validation of the flags; for example, it is possible to use -v2, which is not valid. The valid form is --v2, with a double dash.
So we need to add validation to check that the flags are valid, if this is not already provided by the cobra framework.
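For illustration, the distinction the validation should catch (the config file name is illustrative):

```sh
# Valid: the v2 workflow flag uses a double dash.
oc-mirror --config imageset-config.yaml file://out --v2

# Currently accepted but not a valid flag form; validation should reject it.
oc-mirror --config imageset-config.yaml file://out -v2
```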
In our plugin documentation, there is no discussion of how to translate messages in a dynamic plugin. The only documentation we have currently is in the enhancement and in the demo plugin readme:
https://github.com/openshift/console/tree/master/dynamic-demo-plugin#i18n
Without a reference, plugin developers won't know how to handle translations.
Description of problem:
The violation warning is not displayed for `minAvailable` in the PDB Create/Edit form. Additional info: a maxUnavailable of 0% or 0, or a minAvailable of 100% or equal to the number of replicas, is permitted but can block nodes from being drained.
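A minimal PodDisruptionBudget that falls into the case described above and should therefore surface the warning in the form (the name and selector are illustrative):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: example-pdb
spec:
  # A minAvailable of 100% is permitted, but it can block nodes from being drained.
  minAvailable: "100%"
  selector:
    matchLabels:
      app: example
```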
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
This is a clone of issue OCPBUGS-38441. The following is the description of the original issue:
—
Description of problem:
Both TestAWSEIPAllocationsForNLB and TestAWSLBSubnets are flaking on verifyExternalIngressController waiting for DNS to resolve.
lb_eip_test.go:119: loadbalancer domain apps.eiptest.ci-op-d2nddmn0-43abb.origin-ci-int-aws.dev.rhcloud.com was unable to resolve:
Version-Release number of selected component (if applicable):
4.17.0
How reproducible:
50%
Steps to Reproduce:
1. Run TestAWSEIPAllocationsForNLB or TestAWSLBSubnets in CI
Actual results:
Flakes
Expected results:
Shouldn't flake
Additional info:
CI Search: FAIL: TestAll/parallel/TestAWSEIPAllocationsForNLB
CI Search: FAIL: TestAll/parallel/TestUnmanagedAWSEIPAllocations
Description of problem:
https://issues.redhat.com/browse/MGMT-15691 introduced the code restructuring related to the external platform and oci via PR https://github.com/openshift/assisted-service/pull/5787. Assisted service needs to be re-vendored in the installer in the 4.16 and 4.17 releases to make sure the assisted-service dependencies are consistent. The master branch (4.18) does not need this revendoring, as it was recently revendored via https://github.com/openshift/installer/pull/9058.
Version-Release number of selected component (if applicable):
4.17, 4.16
How reproducible:
Always
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
When installing a fresh 4.16-rc.5 on AWS, the following logs are shown:
time="2024-06-18T16:47:23+02:00" level=debug msg="I0618 16:47:23.596147 4921 logger.go:75] \"enabling EKS controllers and webhooks\" logger=\"setup\""
time="2024-06-18T16:47:23+02:00" level=debug msg="I0618 16:47:23.596154 4921 logger.go:81] \"EKS IAM role creation\" logger=\"setup\" enabled=false"
time="2024-06-18T16:47:23+02:00" level=debug msg="I0618 16:47:23.596159 4921 logger.go:81] \"EKS IAM additional roles\" logger=\"setup\" enabled=false"
time="2024-06-18T16:47:23+02:00" level=debug msg="I0618 16:47:23.596164 4921 logger.go:81] \"enabling EKS control plane controller\" logger=\"setup\""
time="2024-06-18T16:47:23+02:00" level=debug msg="I0618 16:47:23.596184 4921 logger.go:81] \"enabling EKS bootstrap controller\" logger=\"setup\""
time="2024-06-18T16:47:23+02:00" level=debug msg="I0618 16:47:23.596198 4921 logger.go:81] \"enabling EKS managed cluster controller\" logger=\"setup\""
time="2024-06-18T16:47:23+02:00" level=debug msg="I0618 16:47:23.596215 4921 logger.go:81] \"enabling EKS managed machine pool controller\" logger=\"setup\""
That is somewhat strange and may have side effects. It seems the EKS part of CAPA is enabled by default (see additional info).
Version-Release number of selected component (if applicable):
4.16-rc.5
How reproducible:
Always
Steps to Reproduce:
1. Install a cluster (even an SNO works) on AWS using IPI
Actual results:
EKS feature enabled
Expected results:
EKS feature not enabled
Additional info:
https://github.com/kubernetes-sigs/cluster-api-provider-aws/blob/main/feature/feature.go#L99
Description of problem:
Deployment of a spoke cluster with the GitOps ZTP approach fails during node introspection.
Jul 23 15:27:45 openshift-master-0 podman[3269]: 2024-07-23 15:27:45.412 1 ERROR ironic-python-agent
Jul 23 15:27:45 openshift-master-0 podman[3269]: 2024-07-23 15:27:45.412 1 ERROR ironic-python-agent Traceback (most recent call last):
Jul 23 15:27:45 openshift-master-0 podman[3269]: 2024-07-23 15:27:45.412 1 ERROR ironic-python-agent File "/usr/lib/python3.9/site-packages/requests/adapters.py", line 439, in send
Jul 23 15:27:45 openshift-master-0 podman[3269]: 2024-07-23 15:27:45.412 1 ERROR ironic-python-agent resp = conn.urlopen(
Jul 23 15:27:45 openshift-master-0 podman[3269]: 2024-07-23 15:27:45.412 1 ERROR ironic-python-agent File "/usr/lib/python3.9/site-packages/urllib3/connectionpool.py", line 756, in urlopen
Jul 23 15:27:45 openshift-master-0 ironic-agent[3305]: 2024-07-23 15:27:45.412 1 ERROR ironic-python-agent File "/usr/lib/python3.9/site-packages/urllib3/util/connection.py", line 96, in create_connection
Jul 23 15:27:45 openshift-master-0 podman[3269]: 2024-07-23 15:27:45.412 1 ERROR ironic-python-agent retries = retries.increment(
Jul 23 15:27:45 openshift-master-0 podman[3269]: 2024-07-23 15:27:45.412 1 ERROR ironic-python-agent File "/usr/lib/python3.9/site-packages/urllib3/util/retry.py", line 574, in increment
Jul 23 15:27:45 openshift-master-0 podman[3269]: 2024-07-23 15:27:45.412 1 ERROR ironic-python-agent raise MaxRetryError(_pool, url, error or ResponseError(cause))
Jul 23 15:27:45 openshift-master-0 podman[3269]: 2024-07-23 15:27:45.412 1 ERROR ironic-python-agent urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='10.46.182.10', port=5050): Max retries exceeded with url: /v1/continue (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f86e2817a60>: Failed to establish a new connection: [Errno 111] ECONNREFUSED'))
Version-Release number of selected component (if applicable):
advanced-cluster-management.v2.11.0 multicluster-engine.v2.6.0
How reproducible:
so far once
Steps to Reproduce:
1. Deploy dualstack SNO hub cluster 2. Install and configure hub cluster for GitOps ZTP deployment 3. Deploy multi node cluster with GitOps ZTP workflow
Actual results:
Deployment fails as nodes fail to be introspected
Expected results:
Deployment succeeds
Description of problem:
When using Secure Boot, TuneD reports the following error because debugfs access is restricted:
tuned.utils.commands: Writing to file '/sys/kernel/debug/sched/migration_cost_ns' error: '[Errno 1] Operation not permitted: '/sys/kernel/debug/sched/migration_cost_ns''
tuned.plugins.plugin_scheduler: Error writing value '5000000' to 'migration_cost_ns'
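For reference, one way to confirm that a node is booted with Secure Boot enabled before expecting these TuneD errors (assuming mokutil is available on the host; the check itself is illustrative):

```sh
# From a debug shell on the node, check the host's Secure Boot state.
oc debug node/<node-name> -- chroot /host mokutil --sb-state
```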
This issue has been reported with the following tickets:
As this is a confirmed limitation of the NTO due to the TuneD component, we should document this as a limitation in the OpenShift Docs:
https://docs.openshift.com/container-platform/4.16/nodes/nodes/nodes-node-tuning-operator.html
Expected Outcome:
Description of problem:
KAS labels on projects created should be consistent with OCP - enforce: privileged
Version-Release number of selected component (if applicable):
4.17.0
How reproducible:
See https://issues.redhat.com/browse/OCPBUGS-20526.
Steps to Reproduce:
See https://issues.redhat.com/browse/OCPBUGS-20526.
Actual results:
See https://issues.redhat.com/browse/OCPBUGS-20526.
Expected results:
See https://issues.redhat.com/browse/OCPBUGS-20526.
Additional info:
See https://issues.redhat.com/browse/OCPBUGS-20526.
Please review the following PR: https://github.com/openshift/machine-api-operator/pull/1244
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
The crio test for CPU affinity is failing on the RT Kernel in 4.17. We need to investigate what changed in 4.17 to cause this to start failing.
We will need to revert https://github.com/openshift/origin/pull/28854 once a solution has been found.
This is a clone of issue OCPBUGS-42671. The following is the description of the original issue:
—
Description of problem:
Prometheus write_relabel_configs in remoteWrite is unable to drop a metric before it reaches Grafana.
Version-Release number of selected component (if applicable):
How reproducible:
The customer has tried both configurations to drop the MQ metric, with source_labels (configuration 1) and without source_labels (configuration 2), but it is not working. It seems the drop configuration is not being applied.
Configuration 1:
```
remoteWrite:
- url: "https://prometheus-prod-01-eu-west-0.grafana.net/api/prom/push"
  write_relabel_configs:
  - source_labels: ['__name__']
    regex: 'ibmmq_qmgr_uptime'
    action: 'drop'
  basicAuth:
    username:
      name: kubepromsecret
      key: username
    password:
      name: kubepromsecret
      key: password
```
Configuration 2:
```
remoteWrite:
- url: "https://prometheus-prod-01-eu-west-0.grafana.net/api/prom/push"
  write_relabel_configs:
  - regex: 'ibmmq_qmgr_uptime'
    action: 'drop'
  basicAuth:
    username:
      name: kubepromsecret
      key: username
    password:
      name: kubepromsecret
      key: password
```
The customer wants to know the correct remote write configuration to drop a metric before it is sent to Grafana.
Document links:
https://prometheus.io/docs/prometheus/latest/configuration/configuration/#remote_write
https://docs.openshift.com/container-platform/4.14/observability/monitoring/configuring-the-monitoring-stack.html#configuring-remote-write-storage_configuring-the-monitoring-stack
https://docs.openshift.com/container-platform/4.14/observability/monitoring/configuring-the-monitoring-stack.html#creating-user-defined-workload-monitoring-configmap_configuring-the-monitoring-stack
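For comparison, a sketch of the same drop rule written with the camelCase field names used by the Prometheus Operator-style remoteWrite section in the cluster monitoring config; whether the snake_case keys being ignored is the root cause here is an assumption, not something confirmed in this bug:

```yaml
remoteWrite:
- url: "https://prometheus-prod-01-eu-west-0.grafana.net/api/prom/push"
  writeRelabelConfigs:
  - sourceLabels: ['__name__']
    regex: 'ibmmq_qmgr_uptime'
    action: 'drop'
  basicAuth:
    username:
      name: kubepromsecret
      key: username
    password:
      name: kubepromsecret
      key: password
```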
Steps to Reproduce:
1. 2. 3.
Actual results:
prometheus remote_write configurations NOT droppping metric in Grafana
Expected results:
prometheus remote_write configurations should drop metric in Grafana
Additional info:
This is a clone of issue OCPBUGS-38425. The following is the description of the original issue:
—
Description of problem:
When a HostedCluster is upgraded to a new minor version, its OLM catalog imagestreams are not updated to use the tag corresponding to the new minor version.
Version-Release number of selected component (if applicable):
4.15
How reproducible:
Always
Steps to Reproduce:
1. Create a HostedCluster (4.15.z) 2. Upgrade the HostedCluster to a new minor version (4.16.z)
Actual results:
OLM catalog imagestreams remain at the previous version (4.15)
Expected results:
OLM catalog imagestreams are updated to new minor version (4.16)
Additional info:
Dependabot has stopped running on the HyperShift repo. This task is to track the reason why & the fix to get it working again.
Description of problem:
Customers using OVN-K localnet topology networks for virtualization often do not define a "subnets" field in their NetworkAttachmentDefinitions. Examples in the OCP documentation virtualization section do not include that field either.
When a cluster with such NADs is upgraded from 4.16 to 4.17, the ovnkube-control-plane pods crash when CNO is upgraded and the upgrade hangs in a failing state. Once in the failing state, the cluster upgrade can be recovered by adding a subnets field to the localnet NADs
Version-Release number of selected component (if applicable): 4.16.15 > 4.17.1
How reproducible:
Start with an OCP 4.16 cluster with OVN-K localnet NADs configured per the OpenShift Virtualization documentation and attempt to upgrade the cluster to 4.17.1.
Steps to Reproduce:
1. Deploy an OCP 4.16.15 cluster, the type shouldn't matter but all testing has been done on bare metal (SNO and HA topologies)
2. Configure an OVS bridge with localnet bridge mappings and create one or more NetworkAttachmentDefinitions using the localnet topology without configuring the "subnets" field
3. Observe that this is a working configuration in 4.16 although error-level log messages appear in the ovnkube-control-plane pod (see OCPBUGS-37561)
4. Delete the ovnkube-control-plane pod on 4.16 and observe that the log messages do not prevent you from starting ovnkube on 4.16
5. Trigger an upgrade to 4.17.1
6. Once ovnkube-control-plane is restarted as part of the upgrade, observe that the ovnkube-cluster-manager container is crashing with the following message where "vlan10" is the name of a NetworkAttachmentDefinition created earlier
failed to run ovnkube: failed to start cluster manager: initial sync failed: failed to sync network vlan10: [cluster-manager network manager]: failed to create network vlan10: no cluster network controller to manage topology
7. Edit all NetworkAttachmentDefinitions to include a subnets field (see the example manifest after these steps)
8. Wait or delete the ovnkube-control-plane pods and observe that the pods come up and the upgrade resumes and completes normally
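A sketch of the workaround from step 7: a localnet NetworkAttachmentDefinition with an explicit subnets field (the name, namespace, and subnet value are illustrative and need to match the cluster's bridge mappings):

```yaml
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: vlan10
  namespace: default
spec:
  config: |
    {
      "cniVersion": "0.3.1",
      "name": "vlan10",
      "type": "ovn-k8s-cni-overlay",
      "topology": "localnet",
      "netAttachDefName": "default/vlan10",
      "subnets": "192.0.2.0/24"
    }
```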
Actual results: The upgrade fails and ovnkube-control-plane is left in a crashing state
Expected results: The upgrade succeeds and ovnkube-control-plane is running
Additional info:
Affected Platforms: tested on bare metal, but all clusters using OVN-K localnet networks should be impacted.
This shouldn't be possible; it should have a locator pointing to the node it's from.
{ "level": "Info", "display": false, "source": "KubeletLog", "locator": { "type": "", "keys": null }, "message": { "reason": "ReadinessFailed", "cause": "", "humanMessage": "Get \"https://10.177.121.252:6443/readyz\": net/http: request canceled (Client.Timeout exceeded while awaiting headers)", "annotations": { "node": "ci-op-vxiib4hx-9e8b4-wwnfx-master-2", "reason": "ReadinessFailed" } }, "from": "2024-05-02T16:58:06Z", "to": "2024-05-02T16:58:06Z", "filename": "e2e-events_20240502-163726.json" },
Please review the following PR: https://github.com/openshift/cluster-monitoring-operator/pull/2358
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
When using targetCatalog, the mirror failed with the error:
error: error rebuilding catalog images from file-based catalogs: error copying image docker://registry.redhat.io/abc/redhat-operator-index:v4.13 to docker://localhost:5000/abc/redhat-operator-index:v4.13: initializing source docker://registry.redhat.io/abc/redhat-operator-index:v4.13: (Mirrors also failed: [localhost:5000/abc/redhat-operator-index:v4.13: pinging container registry localhost:5000: Get "https://localhost:5000/v2/": http: server gave HTTP response to HTTPS client]): registry.redhat.io/abc/redhat-operator-index:v4.13: reading manifest v4.13 in registry.redhat.io/abc/redhat-operator-index: unauthorized: access to the requested resource is not authorized
Version-Release number of selected component (if applicable):
oc-mirror 4.16
How reproducible:
always
Steps to Reproduce:
1) Use the following isc to do mirror2mirror for v1:
kind: ImageSetConfiguration
apiVersion: mirror.openshift.io/v1alpha2
storageConfig:
  local:
    path: /tmp/case60597
mirror:
  operators:
  - catalog: registry.redhat.io/redhat/redhat-operator-index:v4.13
    targetCatalog: abc/redhat-operator-index
    packages:
    - name: servicemeshoperator
`oc-mirror --config config.yaml docker://localhost:5000 --dest-use-http`
Actual results:
1) mirror failed with error:
info: Mirroring completed in 420ms (0B/s)
error: error rebuilding catalog images from file-based catalogs: error copying image docker://registry.redhat.io/abc/redhat-operator-index:v4.13 to docker://localhost:5000/abc/redhat-operator-index:v4.13: initializing source docker://registry.redhat.io/abc/redhat-operator-index:v4.13: (Mirrors also failed: [localhost:5000/abc/redhat-operator-index:v4.13: pinging container registry localhost:5000: Get "https://localhost:5000/v2/": http: server gave HTTP response to HTTPS client]): registry.redhat.io/abc/redhat-operator-index:v4.13: reading manifest v4.13 in registry.redhat.io/abc/redhat-operator-index: unauthorized: access to the requested resource is not authorized
Expected results:
1) no error.
Additional information:
Compared with oc-mirror 4.15.9, this issue cannot be reproduced.
This is a clone of issue OCPBUGS-42939. The following is the description of the original issue:
—
Description of problem:
4.18 efs controller, node pods are left behind after uninstalling driver
Version-Release number of selected component (if applicable):
4.18.0-0.nightly-2024-10-08-075347
How reproducible:
Always
Steps to Reproduce:
1. Install the 4.18 EFS operator and driver on the cluster and check that the EFS pods are all up and Running 2. Uninstall the EFS driver and check that the controller and node pods get deleted
Execution on 4.16 and 4.18 clusters
4.16 cluster oc create -f og-sub.yaml oc create -f driver.yaml oc get pods | grep "efs" aws-efs-csi-driver-controller-b8858785-72tp9 4/4 Running 0 4s aws-efs-csi-driver-controller-b8858785-gvk4b 4/4 Running 0 6s aws-efs-csi-driver-node-2flqr 3/3 Running 0 9s aws-efs-csi-driver-node-5hsfp 3/3 Running 0 9s aws-efs-csi-driver-node-kxnlv 3/3 Running 0 9s aws-efs-csi-driver-node-qdshm 3/3 Running 0 9s aws-efs-csi-driver-node-ss28h 3/3 Running 0 9s aws-efs-csi-driver-node-v9zwx 3/3 Running 0 9s aws-efs-csi-driver-operator-65b55bf877-4png9 1/1 Running 0 2m53s oc get clustercsidrivers | grep "efs" efs.csi.aws.com 2m26s oc delete -f driver.yaml oc get pods | grep "efs" aws-efs-csi-driver-operator-65b55bf877-4png9 1/1 Running 0 4m40s 4.18 cluster oc create -f og-sub.yaml oc create -f driver.yaml oc get pods | grep "efs" aws-efs-csi-driver-controller-56d68dc976-847lr 5/5 Running 0 9s aws-efs-csi-driver-controller-56d68dc976-9vklk 5/5 Running 0 11s aws-efs-csi-driver-node-46tsq 3/3 Running 0 18s aws-efs-csi-driver-node-7vpcd 3/3 Running 0 18s aws-efs-csi-driver-node-bm86c 3/3 Running 0 18s aws-efs-csi-driver-node-gz69w 3/3 Running 0 18s aws-efs-csi-driver-node-l986w 3/3 Running 0 18s aws-efs-csi-driver-node-vgwpc 3/3 Running 0 18s aws-efs-csi-driver-operator-7cc9bf69b5-hj7zv 1/1 Running 0 2m55s oc get clustercsidrivers efs.csi.aws.com 2m19s oc delete -f driver.yaml oc get pods | grep "efs" aws-efs-csi-driver-controller-56d68dc976-847lr 5/5 Running 0 4m58s aws-efs-csi-driver-controller-56d68dc976-9vklk 5/5 Running 0 5m aws-efs-csi-driver-node-46tsq 3/3 Running 0 5m7s aws-efs-csi-driver-node-7vpcd 3/3 Running 0 5m7s aws-efs-csi-driver-node-bm86c 3/3 Running 0 5m7s aws-efs-csi-driver-node-gz69w 3/3 Running 0 5m7s aws-efs-csi-driver-node-l986w 3/3 Running 0 5m7s aws-efs-csi-driver-node-vgwpc 3/3 Running 0 5m7s aws-efs-csi-driver-operator-7cc9bf69b5-hj7zv 1/1 Running 0 7m44s oc get clustercsidrivers | grep "efs" => Nothing is there
Actual results:
The EFS controller and node pods are left behind
Expected results:
After uninstalling the driver, the EFS controller and node pods should get deleted
Additional info:
On the 4.16 cluster this is working fine.
EFS Operator logs:
oc logs aws-efs-csi-driver-operator-7cc9bf69b5-hj7zv
E1009 07:13:41.460469 1 base_controller.go:266] "LoggingSyncer" controller failed to sync "key", err: clustercsidrivers.operator.openshift.io "efs.csi.aws.com" not found
Discussion: https://redhat-internal.slack.com/archives/C02221SB07R/p1728456279493399
Description of problem:
A breaking API change (Catalog -> ClusterCatalog) is blocking downstreaming of operator-framework/catalogd and operator-framework/operator-controller
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
Downstreaming script fails. https://prow.ci.openshift.org/?job=periodic-auto-olm-v1-downstreaming
Actual results:
Downstreaming fails.
Expected results:
Downstreaming succeeds.
Additional info:
This is a clone of issue OCPBUGS-39531. The following is the description of the original issue:
—
-> While upgrading the cluster from 4.13.38 -> 4.14.18, it is stuck on CCO, clusterversion is complaining about
"Working towards 4.14.18: 690 of 860 done (80% complete), waiting on cloud-credential".
While checking further we see that CCO deployment is yet to rollout.
-> ClusterOperator status.versions[name=operator] isn't a narrow "CCO Deployment is updated", it's "the CCO asserts the whole CC component is updated", which requires (among other things) a functional CCO Deployment. Seems like you don't have a functional CCO Deployment, because logs have it stuck talking about asking for a leader lease. You don't have Kube API audit logs to say if it's stuck generating the Lease request, or waiting for a response from the Kube API server.
This is a clone of issue OCPBUGS-38012. The following is the description of the original issue:
—
Description of problem:
Customers are unable to scale up OCP nodes when the initial setup was done with OCP 4.8/4.9 and then upgraded to 4.15.22/4.15.23. At first the customer observed that the node scale-up failed and /etc/resolv.conf was empty on the nodes. As a workaround, the customer copied the resolv.conf content from a correct resolv.conf, which allowed setup of the new node to continue. They then examined the rendered MachineConfig assembled from 00-worker and suspected that something was wrong with the on-prem-resolv-prepender.service service definition. As a workaround, the customer manually changed this service definition, which allowed them to scale up new nodes.
Version-Release number of selected component (if applicable):
4.15 , 4.16
How reproducible:
100%
Steps to Reproduce:
1. Install OCP vSphere IPI cluster version 4.8 or 4.9 2. Check "on-prem-resolv-prepender.service" service definition 3. Upgrade it to 4.15.22 or 4.15.23 4. Check if the node scaling is working 5. Check "on-prem-resolv-prepender.service" service definition
Actual results:
Unable to scale up nodes with the default service definition. After manually making changes in the service definition, scaling works.
Expected results:
Node scaling should work without making any manual changes in the service definition.
Additional info:
on-prem-resolv-prepender.service content on clusters built with 4.8 / 4.9 and then upgraded to 4.15.22 / 4.15.23:
~~~
[Unit]
Description=Populates resolv.conf according to on-prem IPI needs
# Per https://issues.redhat.com/browse/OCPBUGS-27162 there is a problem if this is started before crio-wipe
After=crio-wipe.service

[Service]
Type=oneshot
Restart=on-failure
RestartSec=10
StartLimitIntervalSec=0
ExecStart=/usr/local/bin/resolv-prepender.sh
EnvironmentFile=/run/resolv-prepender/env
~~~
After manually correcting the service definition as below, scaling works on 4.15.22 / 4.15.23:
~~~
[Unit]
Description=Populates resolv.conf according to on-prem IPI needs
# Per https://issues.redhat.com/browse/OCPBUGS-27162 there is a problem if this is started before crio-wipe
After=crio-wipe.service
StartLimitIntervalSec=0   -----------> this

[Service]
Type=oneshot
#Restart=on-failure   -----------> this
RestartSec=10
ExecStart=/usr/local/bin/resolv-prepender.sh
EnvironmentFile=/run/resolv-prepender/env
~~~
Below is the on-prem-resolv-prepender.service on a freshly installed 4.15.23 where scaling is working fine:
~~~
[Unit]
Description=Populates resolv.conf according to on-prem IPI needs
# Per https://issues.redhat.com/browse/OCPBUGS-27162 there is a problem if this is started before crio-wipe
After=crio-wipe.service
StartLimitIntervalSec=0

[Service]
Type=oneshot
Restart=on-failure
RestartSec=10
ExecStart=/usr/local/bin/resolv-prepender.sh
EnvironmentFile=/run/resolv-prepender/env
~~~
Observed this in the rendered MachineConfig which is assembled with the 00-worker.
Description of problem:
Since 4.16.0 pods with memory limits tend to OOM very frequently when writing files larger than memory limit to PVC
Version-Release number of selected component (if applicable):
4.16.0-rc.4
How reproducible:
100% on certain types of storage (AWS FSx, certain LVMS setups, see additional info)
Steps to Reproduce:
1. Create pod/pvc that writes a file larger than the container memory limit (attached example) 2. 3.
Actual results:
OOMKilled
Expected results:
Success
Additional info:
For simplicity, I will focus on a BM setup that produces this with LVM storage. This is also reproducible on AWS clusters with NFS-backed NetApp ONTAP FSx. Further reduced to exclude the OpenShift layer, LVM on a separate (non-root) disk:

Prepare disk:
~~~
lvcreate -T vg1/thin-pool-1 -V 10G -n oom-lv
mkfs.ext4 /dev/vg1/oom-lv
mkdir /mnt/oom-lv
mount /dev/vg1/oom-lv /mnt/oom-lv
~~~
Run container:
~~~
podman run -m 600m --mount type=bind,source=/mnt/oom-lv,target=/disk --rm -it quay.io/centos/centos:stream9 bash
[root@2ebe895371d2 /]# curl https://cloud.centos.org/centos/9-stream/x86_64/images/CentOS-Stream-GenericCloud-x86_64-9-20240527.0.x86_64.qcow2 -o /disk/temp
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
 47 1157M   47  550M    0     0   111M      0  0:00:10  0:00:04  0:00:06  111M
Killed
~~~
(Notice the process gets killed; I don't think podman ever whacks the whole container over this though.)

The same process on the same hardware on a 4.15 node (RHEL 9.2) does not produce an OOM (vs 4.16 which is RHEL 9.4).

For completeness, I will provide some details about the setup behind the LVM pool, though I believe it should not impact the decision about whether this is an issue:
~~~
sh-5.1# pvdisplay
  --- Physical volume ---
  PV Name               /dev/sdb
  VG Name               vg1
  PV Size               446.62 GiB / not usable 4.00 MiB
  Allocatable           yes
  PE Size               4.00 MiB
  Total PE              114335
  Free PE               11434
  Allocated PE          102901
  PV UUID               <UUID>
~~~
Hardware: SSD (INTEL SSDSC2KG480G8R) behind a RAID 0 of a PERC H330 Mini controller.

At the very least, this seems like a change in behavior, but tbh I am leaning towards an outright bug.
It's been independently verified that setting /sys/kernel/mm/lru_gen/enabled = 0 avoids the oomkills. So verifying that nodes get this value applied is the main testing concern at this point, new installs, upgrades, and new nodes scaled after an upgrade.
If we want to go so far as to verify that the oomkills don't happen, the kernel QE team have a simplified reproducer here, which involves mounting an NFS volume and using podman to create a container with a memory limit and writing data to that NFS volume.
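As a rough illustration of the node-side check described above, here is a minimal Go sketch that reads /sys/kernel/mm/lru_gen/enabled and reports whether MGLRU is disabled. The exact textual value ("0" vs. a hex mask like "0x0000") varies by kernel, so both are treated as disabled; this is only a sketch of the verification idea, not part of any test suite.
~~~
package main

import (
	"fmt"
	"os"
	"strings"
)

func main() {
	// The sysfs knob from the workaround above; "0" / "0x0000" means MGLRU is disabled.
	const path = "/sys/kernel/mm/lru_gen/enabled"
	data, err := os.ReadFile(path)
	if err != nil {
		fmt.Fprintf(os.Stderr, "could not read %s: %v\n", path, err)
		os.Exit(1)
	}
	value := strings.TrimSpace(string(data))
	if value == "0" || value == "0x0000" {
		fmt.Println("MGLRU disabled:", value)
		return
	}
	fmt.Println("MGLRU still enabled:", value)
	os.Exit(1)
}
~~~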
This is a clone of issue OCPBUGS-43898. The following is the description of the original issue:
—
Description of problem:
OCP 4.17 requires permissions to tag network interfaces (ENIs) on instance creation in support of the Egress IP feature. ROSA HCP uses managed IAM policies, which are reviewed and gated by AWS. The current policy AWS has applied does not allow us to tag ENIs out of band, only ones that have 'red-hat-managed: true`, which are going to be tagged during instance creation. However, in order to support backwards compatibility for existing clusters, we need to roll out a CAPA patch that allows us to call `RunInstances` with or without the ability to tag ENIs. Once we backport this to the Z streams, upgrade clusters and rollout the updated policy with AWS, we can then go back and revert the backport. For more information see https://issues.redhat.com/browse/SDE-4496
Version-Release number of selected component (if applicable):
4.17
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Please review the following PR: https://github.com/openshift/aws-pod-identity-webhook/pull/193
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/cluster-olm-operator/pull/55
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
While debugging https://docs.google.com/document/d/10kcIQPsn2H_mz7dJx3lbZR2HivjnC_FAnlt2adc53TY/edit#heading=h.egy1agkrq2v1, we came across the log: 2023-07-31T16:51:50.240749863Z W0731 16:51:50.240586 1 tasks.go:72] task 3 of 15: Updating Prometheus-k8s failed: [unavailable (unknown): client rate limiter Wait returned an error: rate: Wait(n=1) would exceed context deadline, degraded (unknown): client rate limiter Wait returned an error: rate: Wait(n=1) would exceed context deadline] After some searching, we understood that the log is trying to say that ValidatePrometheus timed out waiting for prometheus to become ready. The
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
See here https://redhat-internal.slack.com/archives/C02BQCCFZPX/p1690892059971129?thread_ts=1690873617.023399&cid=C02BQCCFZPX for how to get the function time out.
Actual results:
Expected results:
- Clearer logs.
- Some info that we are logging makes more sense to be part of the error, example: https://github.com/openshift/cluster-monitoring-operator/blob/af831de434ce13b3edc0260a468064e0f3200044/pkg/client/client.go#L890
- Make info such as "unavailable (unknown):" clearer, as we cannot understand what it means without referring to the code.
Additional info:
- Do the same for the other functions that wait for other components if using the same wait mechanism (PollUntilContextTimeout...)
- https://redhat-internal.slack.com/archives/C02BQCCFZPX/p1690873617023399 for more details.
- See https://redhat-internal.slack.com/archives/C0VMT03S5/p1691069196066359?thread_ts=1690827144.818209&cid=C0VMT03S5 for the slack discussion.
Please review the following PR: https://github.com/openshift/oauth-server/pull/149
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
Oc-mirror should fail with error when operator not found
Version-Release number of selected component (if applicable):
oc-mirror version WARNING: This version information is deprecated and will be replaced with the output from --short. Use --output=yaml|json to get the full version. Client Version: version.Info{Major:"", Minor:"", GitVersion:"4.16.0-202403070215.p0.gc4f8295.assembly.stream.el9-c4f8295", GitCommit:"c4f829512107f7d0f52a057cd429de2030b9b3b3", GitTreeState:"clean", BuildDate:"2024-03-07T03:46:24Z", GoVersion:"go1.21.7 (Red Hat 1.21.7-1.el9) X:strictfipsruntime", Compiler:"gc", Platform:"linux/amd64"}
How reproducible:
always
Steps to Reproduce:
1) With following imagesetconfig: cat config-e.yaml apiVersion: mirror.openshift.io/v1alpha2 kind: ImageSetConfiguration mirror: operators: - catalog: registry.redhat.io/redhat/redhat-operator-index:v4.14 packages: - name: cincinnati-operator - name: cluster-logging channels: - name: stable minVersion: 5.7.7 maxVersion: 5.7.7 `oc-mirror --config config-e.yaml file://out2 --v2` 2) Check the operator version [root@preserve-fedora36 app1]# oc-mirror list operators --catalog registry.redhat.io/redhat/redhat-operator-index:v4.14 --package cluster-logging --channel stable-5.7 VERSIONS 5.7.6 5.7.7 5.7.0 5.7.1 5.7.10 5.7.2 5.7.4 5.7.5 5.7.9 5.7.11 5.7.3 5.7.8 [root@preserve-fedora36 app1]# oc-mirror list operators --catalog registry.redhat.io/redhat/redhat-operator-index:v4.14 --package cluster-logging --channel stable VERSIONS 5.8.0 5.8.1 5.8.2 5.8.3 5.8.4
Actual results:
2) No error when operator not found oc-mirror --config config-e.yaml file://out2 --v2 --v2 flag identified, flow redirected to the oc-mirror v2 version. PLEASE DO NOT USE that. V2 is still under development and it is not ready to be used. 2024/03/25 05:07:57 [INFO] : mode mirrorToDisk 2024/03/25 05:07:57 [INFO] : local storage registry will log to /app1/0321/out2/working-dir/logs/registry.log 2024/03/25 05:07:57 [INFO] : starting local storage on localhost:55000 2024/03/25 05:07:57 [INFO] : copying cincinnati response to out2/working-dir/release-filters 2024/03/25 05:07:57 [INFO] : total release images to copy 0 2024/03/25 05:07:57 [INFO] : copying operator image registry.redhat.io/redhat/redhat-operator-index:v4.14 2024/03/25 05:08:00 [INFO] : manifest 6839c41621e7d3aa2be40499ed1d69d833bc34472689688d8efd4e944a32469e 2024/03/25 05:08:00 [INFO] : label /configs 2024/03/25 05:08:16 [INFO] : related images length 2 2024/03/25 05:08:16 [INFO] : images to copy (before duplicates) 4 2024/03/25 05:08:16 [INFO] : total operator images to copy 4 2024/03/25 05:08:16 [INFO] : total additional images to copy 0 2024/03/25 05:08:16 [INFO] : images to mirror 4 2024/03/25 05:08:16 [INFO] : batch count 1 2024/03/25 05:08:16 [INFO] : batch index 0 2024/03/25 05:08:16 [INFO] : batch size 4 2024/03/25 05:08:16 [INFO] : remainder size 0 2024/03/25 05:08:16 [INFO] : starting batch 0 2024/03/25 05:08:27 [INFO] : completed batch 0 2024/03/25 05:08:42 [INFO] : start time : 2024-03-25 05:07:57.7405637 +0000 UTC m=+0.058744792 2024/03/25 05:08:42 [INFO] : collection time : 2024-03-25 05:08:16.069731565 +0000 UTC m=+18.387912740 2024/03/25 05:08:42 [INFO] : mirror time : 2024-03-25 05:08:42.4006485 +0000 UTC m=+44.71882960
Expected results:
2) For channel stable, we can't find the 5.7.7 version for cluster-logging, so the mirror should fail with an error.
Description of problem:
In case the interface changes, we might miss updating AWS and not realize it.
Version-Release number of selected component (if applicable):
4.16+
How reproducible:
always
Steps to Reproduce:
1. 2. 3.
Actual results:
No issue currently but could potentially break in the future.
Expected results:
Additional info:
Under heavy load(?) crictl can fail and return errors which iptables-alerter does not handle correctly, and as a result, it may accidentally end up checking for iptables rules in hostNetwork pods, and then logging events about it.
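A minimal Go sketch of the kind of error handling implied above; the helper name, the use of `crictl inspectp`, and the string check are illustrative assumptions rather than the actual iptables-alerter code. The point is that a failed crictl invocation should be surfaced as an error and the pod skipped, not silently treated as a non-hostNetwork pod.
~~~
package main

import (
	"fmt"
	"os/exec"
	"strings"
)

// podIsHostNetwork shells out to crictl (a hypothetical helper). A crictl
// failure is returned as an error instead of being interpreted as "not
// hostNetwork", which is what can lead to checking (and logging events for)
// hostNetwork pods.
func podIsHostNetwork(podSandboxID string) (bool, error) {
	out, err := exec.Command("crictl", "inspectp", podSandboxID).CombinedOutput()
	if err != nil {
		// Under heavy load crictl may time out or fail transiently; propagate that.
		return false, fmt.Errorf("crictl inspectp %s failed: %v: %s", podSandboxID, err, out)
	}
	// Simplified check; a real implementation would parse the JSON output.
	return strings.Contains(string(out), `"network": "NODE"`), nil
}

func main() {
	hostNet, err := podIsHostNetwork("example-sandbox-id")
	if err != nil {
		fmt.Println("skipping pod, crictl unavailable:", err)
		return
	}
	fmt.Println("hostNetwork:", hostNet)
}
~~~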
This is a clone of issue OCPBUGS-37782. The following is the description of the original issue:
—
Description of problem:
ci/prow/security is failing on google.golang.org/grpc/metadata
Version-Release number of selected component (if applicable):
4.15
How reproducible:
always
Steps to Reproduce:
1. run ci/pro/security job on 4.15 pr 2. 3.
Actual results:
Medium severity vulnerability found in google.golang.org/grpc/metadata
Expected results:
Additional info:
Please review the following PR: https://github.com/openshift/kubernetes-autoscaler/pull/305
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Tooltip for the Pipeline when expression is not shown in the Pipeline visualization.
When expression tooltip is not shown on hover
Should show the when expression tooltip on hover
Please review the following PR: https://github.com/openshift/aws-ebs-csi-driver/pull/273
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
This is a clone of issue OCPBUGS-32053. The following is the description of the original issue:
—
Description of problem:
The single page docs are missing the "oc adm policy add-cluster-role-to* and remove-cluster-role-from-* commands. These options exist in these docs: https://docs.openshift.com/container-platform/4.14/authentication/using-rbac.html but not in these docs: https://access.redhat.com/documentation/en-us/openshift_container_platform/4.14/html-single/cli_tools/index#oc-adm-policy-add-role-to-user
Please review the following PR: https://github.com/openshift/cluster-etcd-operator/pull/1263
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of the problem:
The REST API returns 422 if the input doesn't match the regex defined in the swagger, whereas our code returns 400 for input errors. There may be other cases where the generated errors are inconsistent with ours. We should change 422 to 400 and review the rest.
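A minimal, generic Go sketch of the desired mapping, assuming a plain net/http stack; this is not how the generated swagger code is actually wired, just an illustration of downgrading 422 responses to 400 at the edge.
~~~
package main

import "net/http"

// statusRewriter downgrades 422 (Unprocessable Entity) responses produced by
// generated validation code to 400 (Bad Request), matching the hand-written
// input-error handling.
type statusRewriter struct{ http.ResponseWriter }

func (w *statusRewriter) WriteHeader(code int) {
	if code == http.StatusUnprocessableEntity {
		code = http.StatusBadRequest
	}
	w.ResponseWriter.WriteHeader(code)
}

func rewrite422(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		next.ServeHTTP(&statusRewriter{w}, r)
	})
}

func main() {
	api := http.NewServeMux() // stand-in for the generated API handler
	http.ListenAndServe(":8080", rewrite422(api))
}
~~~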
Please review the following PR: https://github.com/openshift/kube-rbac-proxy/pull/99
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/image-registry/pull/401
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
The specified network tags are not applied to control-plane machines.
Version-Release number of selected component (if applicable):
4.16.0-0.nightly-multi-2024-06-04-172027
How reproducible:
Always
Steps to Reproduce:
1. "create install-config" 2. edit install-config.yaml to insert network tags and featureSet setting (see [1]) 3. "create cluster" and make sure it succeeds 4. using "gcloud" to check the network tags of the cluster machines (see [2])
Actual results:
The specified network tags are not applied to control-plane machines, although the compute/worker machines do have the specified network tags.
Expected results:
The specified network tags should be applied to both control-plane machines and compute/worker machines.
Additional info:
QE's Flexy-install job: /Flexy-install/288061/ VARIABLES_LOCATION private-templates/functionality-testing/aos-4_16/ipi-on-gcp/versioned-installer_techpreview LAUNCHER_VARS installer_payload_image: quay.io/openshift-release-dev/ocp-release-nightly:4.16.0-0.nightly-multi-2024-06-04-172027 num_workers: 2 control_plane_tags: ["installer-qe-tag01", "installer-qe-tag02"] compute_tags: ["installer-qe-tag01", "installer-qe-tag03"]
This is a clone of issue OCPBUGS-37506. The following is the description of the original issue:
—
Description of problem:
Install an Azure fully private IPI cluster by using CAPI with a payload built from cluster bot including openshift/installer#8727 and openshift/installer#8732, install-config:
~~~
platform:
  azure:
    region: eastus
    outboundType: UserDefinedRouting
    networkResourceGroupName: jima24b-rg
    virtualNetwork: jima24b-vnet
    controlPlaneSubnet: jima24b-master-subnet
    computeSubnet: jima24b-worker-subnet
publish: Internal
featureSet: TechPreviewNoUpgrade
~~~
Checked the storage account created by the installer; its property allowBlobPublicAccess is set to True.
~~~
$ az storage account list -g jima24b-fwkq8-rg --query "[].[name,allowBlobPublicAccess]" -o tsv
jima24bfwkq8sa  True
~~~
This is not consistent with the terraform code, https://github.com/openshift/installer/blob/master/data/data/azure/vnet/main.tf#L74
At least, the storage account should have no public access for a fully private cluster.
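For reference, a hedged Go sketch of turning off blob public access on the installer-created account with the azure-sdk-for-go armstorage module. The subscription ID is a placeholder, the resource group and account names are taken from the report above, and this only illustrates the property the installer is expected to set, not installer code.
~~~
package main

import (
	"context"
	"log"

	"github.com/Azure/azure-sdk-for-go/sdk/azcore/to"
	"github.com/Azure/azure-sdk-for-go/sdk/azidentity"
	"github.com/Azure/azure-sdk-for-go/sdk/resourcemanager/storage/armstorage"
)

func main() {
	cred, err := azidentity.NewDefaultAzureCredential(nil)
	if err != nil {
		log.Fatal(err)
	}
	client, err := armstorage.NewAccountsClient("<subscription-id>", cred, nil)
	if err != nil {
		log.Fatal(err)
	}
	// Disable public blob access on the registry storage account.
	_, err = client.Update(context.Background(), "jima24b-fwkq8-rg", "jima24bfwkq8sa",
		armstorage.AccountUpdateParameters{
			Properties: &armstorage.AccountPropertiesUpdateParameters{
				AllowBlobPublicAccess: to.Ptr(false),
			},
		}, nil)
	if err != nil {
		log.Fatal(err)
	}
	log.Println("allowBlobPublicAccess set to false")
}
~~~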
Version-Release number of selected component (if applicable):
4.17 nightly build
How reproducible:
Always
Steps to Reproduce:
1. Create fully private cluster 2. Check storage account created by installer 3.
Actual results:
The storage account has public access on a fully private cluster.
Expected results:
The storage account should have no public access on a fully private cluster.
Additional info:
This is a clone of issue OCPBUGS-43041. The following is the description of the original issue:
—
Description of problem:
A slice created as idPointers := make([]*string, len(ids)) should be corrected to idPointers := make([]*string, 0, len(ids)). When only a length (and no separate capacity) is passed to make, the slice is created with that length and filled with the zero value. For instance, _ := make([]int, 5) creates a slice {0, 0, 0, 0, 0}. If such a slice is appended to rather than having its elements set by index, there are extra values:
1. If we append to the slice, we leave behind the zero values (this could change the behavior of the function that the slice is passed to). It also allocates more memory than needed.
2. If we don't fill the slice completely (i.e. create a length of 5 and only set 4 elements), the same issue as above comes into play.
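A short, self-contained Go illustration of the difference described above (not taken from the affected component):
~~~
package main

import "fmt"

func main() {
	ids := []string{"a", "b", "c"}

	// Length 3: the slice already holds three nil pointers; append adds after them.
	wrong := make([]*string, len(ids))
	for i := range ids {
		wrong = append(wrong, &ids[i])
	}
	fmt.Println(len(wrong)) // 6 -- three nils followed by the three real pointers

	// Length 0, capacity 3: append fills the slice with no leftover zero values
	// and no reallocation.
	right := make([]*string, 0, len(ids))
	for i := range ids {
		right = append(right, &ids[i])
	}
	fmt.Println(len(right)) // 3
}
~~~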
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
secrets-store-csi-driver with AWS provider does not work in HyperShift hosted cluster, pod can't mount the volume successfully.
Version-Release number of selected component (if applicable):
secrets-store-csi-driver-operator.v4.14.0-202308281544 in 4.14.0-0.nightly-2023-09-06-235710 HyperShift hosted cluster.
How reproducible:
Always
Steps to Reproduce:
1. Follow test case OCP-66032 "Setup" part to install secrets-store-csi-driver-operator.v4.14.0-202308281544 , secrets-store-csi-driver and AWS provider successfully: $ oc get po -n openshift-cluster-csi-drivers NAME READY STATUS RESTARTS AGE aws-ebs-csi-driver-node-7xxgr 3/3 Running 0 5h18m aws-ebs-csi-driver-node-fmzwf 3/3 Running 0 5h18m aws-ebs-csi-driver-node-rgrxd 3/3 Running 0 5h18m aws-ebs-csi-driver-node-tpcxq 3/3 Running 0 5h18m csi-secrets-store-provider-aws-2fm6q 1/1 Running 0 5m14s csi-secrets-store-provider-aws-9xtw7 1/1 Running 0 5m15s csi-secrets-store-provider-aws-q5lvb 1/1 Running 0 5m15s csi-secrets-store-provider-aws-q6m65 1/1 Running 0 5m15s secrets-store-csi-driver-node-4wdc8 3/3 Running 0 6m22s secrets-store-csi-driver-node-n7gkj 3/3 Running 0 6m23s secrets-store-csi-driver-node-xqr52 3/3 Running 0 6m22s secrets-store-csi-driver-node-xr24v 3/3 Running 0 6m22s secrets-store-csi-driver-operator-9cb55b76f-7cbvz 1/1 Running 0 7m16s 2. Follow test case OCP-66032 steps to create AWS secret, set up AWS IRSA successfully. 3. Follow test case OCP-66032 steps SecretProviderClass, deployment with the secretProviderClass successfully. Then check pod, pod is stuck in ContainerCreating: $ oc get po NAME READY STATUS RESTARTS AGE hello-openshift-84c76c5b89-p5k4f 0/1 ContainerCreating 0 10m $ oc describe po hello-openshift-84c76c5b89-p5k4f ... Events: Type Reason Age From Message ---- ------ ---- ---- ------- Normal Scheduled 11m default-scheduler Successfully assigned xxia-proj/hello-openshift-84c76c5b89-p5k4f to ip-10-0-136-205.us-east-2.compute.internal Warning FailedMount 11m kubelet MountVolume.SetUp failed for volume "secrets-store-inline" : rpc error: code = Unknown desc = failed to mount secrets store objects for pod xxia-proj/hello-openshift-84c76c5b89-p5k4f, err: rpc error: code = Unknown desc = us-east-2: Failed fetching secret xxiaSecret: WebIdentityErr: failed to retrieve credentials caused by: InvalidIdentityToken: Incorrect token audience status code: 400, request id: 92d1ff5b-36be-4cc5-9b55-b12279edd78e Warning FailedMount 11m kubelet MountVolume.SetUp failed for volume "secrets-store-inline" : rpc error: code = Unknown desc = failed to mount secrets store objects for pod xxia-proj/hello-openshift-84c76c5b89-p5k4f, err: rpc error: code = Unknown desc = us-east-2: Failed fetching secret xxiaSecret: WebIdentityErr: failed to retrieve credentials caused by: InvalidIdentityToken: Incorrect token audience status code: 400, request id: 50907328-70a6-44e0-9f05-80a31acef0b4 Warning FailedMount 11m kubelet MountVolume.SetUp failed for volume "secrets-store-inline" : rpc error: code = Unknown desc = failed to mount secrets store objects for pod xxia-proj/hello-openshift-84c76c5b89-p5k4f, err: rpc error: code = Unknown desc = us-east-2: Failed fetching secret xxiaSecret: WebIdentityErr: failed to retrieve credentials caused by: InvalidIdentityToken: Incorrect token audience status code: 400, request id: 617dc3bc-a5e3-47b0-b37c-825f8dd84920 Warning FailedMount 11m kubelet MountVolume.SetUp failed for volume "secrets-store-inline" : rpc error: code = Unknown desc = failed to mount secrets store objects for pod xxia-proj/hello-openshift-84c76c5b89-p5k4f, err: rpc error: code = Unknown desc = us-east-2: Failed fetching secret xxiaSecret: WebIdentityErr: failed to retrieve credentials caused by: InvalidIdentityToken: Incorrect token audience status code: 400, request id: 8ab5fc2c-00ca-45e2-9a82-7b1765a5df1a Warning FailedMount 11m kubelet MountVolume.SetUp failed for volume 
"secrets-store-inline" : rpc error: code = Unknown desc = failed to mount secrets store objects for pod xxia-proj/hello-openshift-84c76c5b89-p5k4f, err: rpc error: code = Unknown desc = us-east-2: Failed fetching secret xxiaSecret: WebIdentityErr: failed to retrieve credentials caused by: InvalidIdentityToken: Incorrect token audience status code: 400, request id: b76019ca-dc04-4e3e-a305-6db902b0a863 Warning FailedMount 11m kubelet MountVolume.SetUp failed for volume "secrets-store-inline" : rpc error: code = Unknown desc = failed to mount secrets store objects for pod xxia-proj/hello-openshift-84c76c5b89-p5k4f, err: rpc error: code = Unknown desc = us-east-2: Failed fetching secret xxiaSecret: WebIdentityErr: failed to retrieve credentials caused by: InvalidIdentityToken: Incorrect token audience status code: 400, request id: b395e3b2-52a2-4fc2-80c6-9a9722e26375 Warning FailedMount 11m kubelet MountVolume.SetUp failed for volume "secrets-store-inline" : rpc error: code = Unknown desc = failed to mount secrets store objects for pod xxia-proj/hello-openshift-84c76c5b89-p5k4f, err: rpc error: code = Unknown desc = us-east-2: Failed fetching secret xxiaSecret: WebIdentityErr: failed to retrieve credentials caused by: InvalidIdentityToken: Incorrect token audience status code: 400, request id: ec325057-9c0a-4327-80c9-a9b6233a64dd Warning FailedMount 10m kubelet MountVolume.SetUp failed for volume "secrets-store-inline" : rpc error: code = Unknown desc = failed to mount secrets store objects for pod xxia-proj/hello-openshift-84c76c5b89-p5k4f, err: rpc error: code = Unknown desc = us-east-2: Failed fetching secret xxiaSecret: WebIdentityErr: failed to retrieve credentials caused by: InvalidIdentityToken: Incorrect token audience status code: 400, request id: 405492b2-ed52-429b-b253-6a7c098c26cb Warning FailedMount 82s (x5 over 9m35s) kubelet Unable to attach or mount volumes: unmounted volumes=[secrets-store-inline], unattached volumes=[], failed to process volumes=[]: timed out waiting for the condition Warning FailedMount 74s (x5 over 9m25s) kubelet (combined from similar events): MountVolume.SetUp failed for volume "secrets-store-inline" : rpc error: code = Unknown desc = failed to mount secrets store objects for pod xxia-proj/hello-openshift-84c76c5b89-p5k4f, err: rpc error: code = Unknown desc = us-east-2: Failed fetching secret xxiaSecret: WebIdentityErr: failed to retrieve credentials caused by: InvalidIdentityToken: Incorrect token audience status code: 400, request id: c38bbed1-012d-4250-b674-24ab40607920
Actual results:
Hit above stuck issue.
Expected results:
Pod should be Running.
Additional info:
Compared with another operator (cert-manager-operator) which also uses AWS IRSA (OCP-62500), that case works well. So secrets-store-csi-driver-operator has a bug.
OCP version: 4.15.0
We have monitoring alerts configured against a cluster in our longevity setup.
After receiving alerts for metal3 - we examined the graph for the pod.
The graph indicates a continuous steady growth of memory consumption.
Open Github Security Advisory for: containers/image
https://github.com/advisories/GHSA-6wvf-f2vw-3425
The ARO SRE team became aware of this advisory against our installer fork. Upstream installer is also pinning a vulnerable version of containerd.
The advisory recommends updating to version 5.30.1.
This is a clone of issue OCPBUGS-43520. The following is the description of the original issue:
—
Description of problem:
When installing a GCP cluster with the CAPI based method, the kube-api firewall rule that is created always uses a source range of 0.0.0.0/0. In the prior terraform based method, internal published clusters were limited to the network_cidr. This change opens up the API to additional sources, which could be problematic such as in situations where traffic is being routed from a non-cluster subnet.
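A hedged Go sketch of the intended behavior, using the google.golang.org/api/compute/v1 client: for an internal-publish cluster the rule's SourceRanges would be constrained to the machine network CIDR rather than 0.0.0.0/0. The project, network, ports, and CIDR below are placeholders, and this is not the installer's CAPI code path.
~~~
package main

import (
	"context"
	"log"

	compute "google.golang.org/api/compute/v1"
)

func main() {
	ctx := context.Background()
	svc, err := compute.NewService(ctx) // uses Application Default Credentials
	if err != nil {
		log.Fatal(err)
	}
	fw := &compute.Firewall{
		Name:    "example-kube-api-allow",                               // hypothetical rule name
		Network: "projects/my-project/global/networks/my-network",       // hypothetical network
		Allowed: []*compute.FirewallAllowed{{IPProtocol: "tcp", Ports: []string{"6443"}}},
		// Restrict the source to the machine network instead of 0.0.0.0/0.
		SourceRanges: []string{"10.0.0.0/16"},
	}
	op, err := svc.Firewalls.Insert("my-project", fw).Context(ctx).Do()
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("inserted firewall rule, operation: %s", op.Name)
}
~~~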
Version-Release number of selected component (if applicable):
4.17
How reproducible:
Always
Steps to Reproduce:
1. Install a cluster in GCP with publish: internal 2. 3.
Actual results:
Kube-api firewall rule has source of 0.0.0.0/0
Expected results:
Kube-api firewall rule has a more limited source of network_cidr
Additional info:
This is a clone of issue OCPBUGS-38118. The following is the description of the original issue:
—
Description of problem:
IHAC who is facing an issue while deploying a Nutanix IPI cluster 4.16.x with DHCP.
ENV DETAILS: Nutanix Versions: AOS: 6.5.4, NCC: 4.6.6.3, PC: pc.2023.4.0.2, LCM: 3.0.0.1
During the installation process, after the bootstrap nodes and control planes are created, the IP addresses on the nodes shown in the Nutanix Dashboard conflict, even when infinite DHCP leases are set. The installation only works successfully when using the Nutanix IPAM. The 4.14 and 4.15 releases also install successfully.
The IPs of master0 and master2 are conflicting; please check the attachment.
Sos-reports of master0 and master1: https://drive.google.com/drive/folders/140ATq1zbRfqd1Vbew-L_7N4-C5ijMao3?usp=sharing
The issue was reported via the slack thread: https://redhat-internal.slack.com/archives/C02A3BM5DGS/p1721837567181699
Version-Release number of selected component (if applicable):
How reproducible:
Use the OCP 4.16.z installer to create an OCP cluster with Nutanix using DHCP network. The installation will fail. Always reproducible.
Steps to Reproduce:
1. 2. 3.
Actual results:
The installation will fail.
Expected results:
The installation succeeds to create a Nutanix OCP cluster with the DHCP network.
Additional info:
This is a clone of issue OCPBUGS-43448. The following is the description of the original issue:
—
Description of problem:
The cluster policy controller does not get the same feature flags that other components in the control plane are getting.
Version-Release number of selected component (if applicable):
4.18
How reproducible:
Always
Steps to Reproduce:
1. Create hosted cluster 2. Get cluster-policy-controller-config configmap from control plane namespace
Actual results:
Default feature gates are not included in the config
Expected results:
Feature gates are included in the config
Additional info:
Description of problem:
When creating a serverless function in create serverless form, BuildConfig is not created
Version-Release number of selected component (if applicable):
4.15
How reproducible:
Always
Steps to Reproduce:
1.Install Serverless operator 2.Add https://github.com/openshift-dev-console/kn-func-node-cloudevents in create serverless form 3.Create the function and check BuildConfig page
Actual results:
BuildConfig is not created
Expected results:
Should create BuildConfig
Additional info:
Please review the following PR: https://github.com/openshift/cluster-api-operator/pull/38
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/openstack-cinder-csi-driver-operator/pull/169
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
The self-managed hypershift cli (hcp) reports an inaccurate OCP supported version. For example, if I have a hypershift-operator deployed which supports OCP v4.14 and I build the hcp cli from the latest source code, when I execute "hcp -v", the cli tool reports the following. $ hcp -v hcp version openshift/hypershift: 02bf7af8789f73c7b5fc8cc0424951ca63441649. Latest supported OCP: 4.16.0 This makes it appear that the hcp cli is capable of deploying OCP v4.16.0, when the backend is actually limited to v4.14.0. The cli needs to indicate what the server is capable of deploying. Otherwise it appears that v4.16.0 would be deployable in this scenario, but the backend would not allow that.
Version-Release number of selected component (if applicable):
4.14
How reproducible:
100%
Steps to Reproduce:
1. download an HCP client that does not match the hypershift-operator backend 2. execute 'hcp -v' 3. the reported "Latest supported OCP" is not representative of the version the hypershift-operator actually supports
Actual results:
Expected results:
hcp cli reports a latest OCP version that is representative of what the deployed hypershift operator is capable of deploying.
Additional info:
If not specified, those are defined by an environment variable in the image service, so connected users are not aware of the images that exist in the deployment. It can cause confusion when adding a new release image: the result is an error in the infra-env, and while the error is clear, the user will not understand what needs to be done.
Failed to create image: The requested RHCOS version (4.14, arch: x86_64) does not have a matching OpenShift release image'
This is a clone of issue OCPBUGS-38249. The following is the description of the original issue:
—
openshift/api was bumped in CNO without running codegen. Codegen needs to be run.
Description of problem:
If a cluster admin creates a new MachineOSConfig that references a legacy pull secret, the canonicalized version of this secret that gets created is not updated whenever the original pull secret changes.
How reproducible:
Always
Steps to Reproduce:
.
Actual results:
The canonicalized version of the pull secret is never updated with the contents of the legacy-style pull secret.
Expected results:
Ideally, the canonicalized version of the pull secret should be updated since BuildController created it.
Additional info:
This occurs because when the legacy pull secret is initially detected, BuildController canonicalizes it and then updates the MachineOSConfig with the name of the canonicalized secret. The next time this secret is referenced, the original secret does not get read.
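For context, a minimal Go sketch of what "canonicalizing" a legacy pull secret means: the flat .dockercfg map is wrapped under an "auths" key to produce a .dockerconfigjson payload. This is an illustration of the transformation, not BuildController's actual code; the bug is that this derived secret is produced once and never refreshed when the source secret changes.
~~~
package main

import (
	"encoding/json"
	"fmt"
)

// canonicalizePullSecret wraps a legacy .dockercfg payload (registry host ->
// credentials) under "auths", which is the .dockerconfigjson layout.
func canonicalizePullSecret(legacy []byte) ([]byte, error) {
	var entries map[string]json.RawMessage
	if err := json.Unmarshal(legacy, &entries); err != nil {
		return nil, fmt.Errorf("parsing legacy pull secret: %w", err)
	}
	return json.Marshal(map[string]map[string]json.RawMessage{"auths": entries})
}

func main() {
	legacy := []byte(`{"registry.example.com":{"auth":"dXNlcjpwYXNz"}}`)
	out, err := canonicalizePullSecret(legacy)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out)) // {"auths":{"registry.example.com":{"auth":"dXNlcjpwYXNz"}}}
}
~~~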
Please review the following PR: https://github.com/openshift/machine-api-provider-powervs/pull/78
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/cluster-api-provider-vsphere/pull/39
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
In the OCP upgrades from 4.13 to 4.14, the canary route configuration is changed as below:
Canary route configuration in OCP 4.13:
~~~
$ oc get route -n openshift-ingress-canary canary -oyaml
apiVersion: route.openshift.io/v1
kind: Route
metadata:
  labels:
    ingress.openshift.io/canary: canary_controller
  name: canary
  namespace: openshift-ingress-canary
spec:
  host: canary-openshift-ingress-canary.apps.<cluster-domain>.com   <---- canary route configured with .spec.host
~~~
Canary route configuration in OCP 4.14:
~~~
$ oc get route -n openshift-ingress-canary canary -oyaml
apiVersion: route.openshift.io/v1
kind: Route
metadata:
  labels:
    ingress.openshift.io/canary: canary_controller
  name: canary
  namespace: openshift-ingress-canary
spec:
  port:
    targetPort: 8080
  subdomain: canary-openshift-ingress-canary   <---- canary route configured with .spec.subdomain
~~~
After the upgrade, the following messages are printed in the ingress-operator pod:
2024-04-24T13:16:34.637Z ERROR operator.init controller/controller.go:265 Reconciler error {"controller": "canary_controller", "object": {"name":"default","namespace":"openshift-ingress-operator"}, "namespace": "openshift-ingress-operator", "name": "default", "reconcileID": "46290893-d755-4735-bb01-e8b707be4053", "error": "failed to ensure canary route: failed to update canary route openshift-ingress-canary/canary: Route.route.openshift.io \"canary\" is invalid: spec.subdomain: Invalid value: \"canary-openshift-ingress-canary\": field is immutable"}
The issue is resolved when the canary route is deleted.
See below the audit logs from the process:
# The route can't be updated with error 422:
{"kind":"Event","apiVersion":"audit.k8s.io/v1","level":"Metadata","auditID":"4e8bfb36-21cc-422b-9391-ef8ff42970ca","stage":"ResponseComplete","requestURI":"/apis/route.openshift.io/v1/namespaces/openshift-ingress-canary/routes/canary","verb":"update","user":{"username":"system:serviceaccount:openshift-ingress-operator:ingress-operator","groups":["system:serviceaccounts","system:serviceaccounts:openshift-ingress-operator","system:authenticated"],"extra":{"authentication.kubernetes.io/pod-name":["ingress-operator-746cd8598-hq2st"],"authentication.kubernetes.io/pod-uid":["f3ebccdf-f3b3-420d-8ea5-e33d98945403"]}},"sourceIPs":["10.128.0.93","10.128.0.2"],"userAgent":"Go-http-client/2.0","objectRef":{"resource":"routes","namespace":"openshift-ingress-canary","name":"canary","uid":"3e179946-d4e3-45ad-9380-c305baefd14e","apiGroup":"route.openshift.io","apiVersion":"v1","resourceVersion":"297888"},"responseStatus":{"metadata":{},"status":"Failure","message":"Route.route.openshift.io \"canary\" is invalid: spec.subdomain: Invalid value: \"canary-openshift-ingress-canary\": field is immutable","reason":"Invalid","details":{"name":"canary","group":"route.openshift.io","kind":"Route","causes":[{"reason":"FieldValueInvalid","message":"Invalid value: \"canary-openshift-ingress-canary\": field is immutable","field":"spec.subdomain"}]},"code":422},"requestReceivedTimestamp":"2024-04-24T13:16:34.630249Z","stageTimestamp":"2024-04-24T13:16:34.636869Z","annotations":{"authorization.k8s.io/decision":"allow","authorization.k8s.io/reason":"RBAC: allowed by ClusterRoleBinding \"openshift-ingress-operator\" of ClusterRole \"openshift-ingress-operator\" to ServiceAccount \"ingress-operator/openshift-ingress-operator\""}}
# Route is deleted manually
{"kind":"Event","apiVersion":"audit.k8s.io/v1","level":"Metadata","auditID":"70821b58-dabc-4593-ba6d-5e81e5d27d21","stage":"ResponseComplete","requestURI":"/apis/route.openshift.io/v1/namespaces/openshift-ingress-canary/routes/canary","verb":"delete","user":{"username":"system:admin","groups":["system:masters","system:authenticated"]},"sourceIPs":["10.0.91.78","10.128.0.2"],"userAgent":"oc/4.13.0 (linux/amd64) kubernetes/7780c37","objectRef":{"resource":"routes","namespace":"openshift-ingress-canary","name":"canary","apiGroup":"route.openshift.io","apiVersion":"v1"},"responseStatus":{"metadata":{},"status":"Success","details":{"name":"canary","group":"route.openshift.io","kind":"routes","uid":"3e179946-d4e3-45ad-9380-c305baefd14e"},"code":200},"requestReceivedTimestamp":"2024-04-24T13:24:39.558620Z","stageTimestamp":"2024-04-24T13:24:39.561267Z","annotations":{"authorization.k8s.io/decision":"allow","authorization.k8s.io/reason":""}}
# Route is created again
{"kind":"Event","apiVersion":"audit.k8s.io/v1","level":"Metadata","auditID":"92e6132a-aa1d-482d-a1dc-9ce021ae4c37","stage":"ResponseComplete","requestURI":"/apis/route.openshift.io/v1/namespaces/openshift-ingress-canary/routes","verb":"create","user":{"username":"system:serviceaccount:openshift-ingress-operator:ingress-operator","groups":["system:serviceaccounts","system:serviceaccounts:openshift-ingress-operator","system:authenticated"],"extra":{"authentication.kubernetes.io/pod-name":["ingress-operator-746cd8598-hq2st"],"authentication.kubernetes.io/pod-uid":["f3ebccdf-f3b3-420d-8ea5-e33d98945403"]}},"sourceIPs":["10.128.0.93","10.128.0.2"],"userAgent":"Go-http-client/2.0","objectRef":{"resource":"routes","namespace":"openshift-ingress-canary","name":"canary","apiGroup":"route.openshift.io","apiVersion":"v1"},"responseStatus":{"metadata":{},"code":201},"requestReceivedTimestamp":"2024-04-24T13:24:39.577255Z","stageTimestamp":"2024-04-24T13:24:39.584371Z","annotations":{"authorization.k8s.io/decision":"allow","authorization.k8s.io/reason":"RBAC: allowed by ClusterRoleBinding \"openshift-ingress-operator\" of ClusterRole \"openshift-ingress-operator\" to ServiceAccount \"ingress-operator/openshift-ingress-operator\""}}
Version-Release number of selected component (if applicable):
OCP upgrade from 4.13 to 4.14
How reproducible:
Upgrade the cluster from OCP 4.13 to 4.14 and check the ingress operator pod logs
Steps to Reproduce:
1. Install cluster in OCP 4.13 2. Upgrade to OCP 4.14 3. Check the ingress operator logs
Actual results:
Reported errors above
Expected results:
The ingress canary route should be updated without issues
Additional info:
Update the ironic projects in the ironic containers to the latest versions to get bug and security fixes.
This is a clone of issue OCPBUGS-38925. The following is the description of the original issue:
—
Description of problem:
periodics are failing due to a change in coreos.
Version-Release number of selected component (if applicable):
4.15,4.16,4.17,4.18
How reproducible:
100%
Steps to Reproduce:
1. Check any periodic conformance jobs 2. 3.
Actual results:
periodic conformance fails with hostedcluster creation
Expected results:
periodic conformance test succeeds
Additional info:
This is a clone of issue OCPBUGS-42732. The following is the description of the original issue:
—
Description of problem:
The operator cannot succeed removing resources when networkAccess is set to Removed. It looks like the authorization error changes from bloberror.AuthorizationPermissionMismatch to bloberror.AuthorizationFailure after the storage account becomes private (networkAccess: Internal). This is either caused by weird behavior in the azure sdk, or in the azure api itself. The easiest way to solve it is to also handle bloberror.AuthorizationFailure here: https://github.com/openshift/cluster-image-registry-operator/blob/master/pkg/storage/azure/azure.go?plain=1#L1145 The error condition is the following: status: conditions: - lastTransitionTime: "2024-09-27T09:04:20Z" message: "Unable to delete storage container: DELETE https://imageregistrywxj927q6bpj.blob.core.windows.net/wxj-927d-jv8fc-image-registry-rwccleepmieiyukdxbhasjyvklsshhee\n--------------------------------------------------------------------------------\nRESPONSE 403: 403 This request is not authorized to perform this operation.\nERROR CODE: AuthorizationFailure\n--------------------------------------------------------------------------------\n\uFEFF<?xml version=\"1.0\" encoding=\"utf-8\"?><Error><Code>AuthorizationFailure</Code><Message>This request is not authorized to perform this operation.\nRequestId:ababfe86-301e-0005-73bd-10d7af000000\nTime:2024-09-27T09:10:46.1231255Z</Message></Error>\n--------------------------------------------------------------------------------\n" reason: AzureError status: Unknown type: StorageExists - lastTransitionTime: "2024-09-27T09:02:26Z" message: The registry is removed reason: Removed status: "True" type: Available
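A minimal Go sketch of the suggested handling, assuming the azblob SDK's bloberror helpers; the function name is illustrative, not the operator's actual code.
~~~
package azure

import (
	"github.com/Azure/azure-sdk-for-go/sdk/storage/azblob/bloberror"
)

// isAuthorizationError treats both error codes the same way: with a private
// (networkAccess: Internal) storage account Azure returns AuthorizationFailure
// instead of AuthorizationPermissionMismatch, so handling only the latter
// leaves the Removed cleanup stuck.
func isAuthorizationError(err error) bool {
	if err == nil {
		return false
	}
	return bloberror.HasCode(err,
		bloberror.AuthorizationPermissionMismatch,
		bloberror.AuthorizationFailure,
	)
}
~~~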
Version-Release number of selected component (if applicable):
4.18, 4.17, 4.16 (needs confirmation), 4.15 (needs confirmation)
How reproducible:
Always
Steps to Reproduce:
1. Get an Azure cluster 2. In the operator config, set networkAccess to Internal 3. Wait until the operator reconciles the change (watch networkAccess in status with `oc get configs.imageregistry/cluster -oyaml |yq '.status.storage'`) 4. In the operator config, set management state to removed: `oc patch configs.imageregistry/cluster -p '{"spec":{"managementState":"Removed"}}' --type=merge` 5. Watch the cluster operator conditions for the error
Actual results:
Expected results:
Additional info:
When the custom AMI feature was introduced, the Installer didn't support machine pools. Now that it does, and has done for a while, we should deprecate the field `platform.aws.amiID`.
The same effect is now achieved by setting `platform.aws.defaultMachinePlatform.amiID`.
Description of problem:
checked on 4.17.0-0.nightly-2024-08-13-031847, there are 2 Metrics tab in 4.17 developer console under "Observe" section, see picture: https://drive.google.com/file/d/1x7Jm2Q9bVDOdFcctjG6WOUtIv_nsD9Pd/view?usp=sharing
Checked: https://github.com/openshift/monitoring-plugin/pull/138 is merged to 4.17, but https://github.com/openshift/console/pull/14105 is merged to 4.18 and not to 4.17.
For example, the code
const expectedTabs: string[] = ['Dashboards', 'Silences', 'Events']
is merged to 4.18 but not to 4.17.
Version-Release number of selected component (if applicable):
4.17.0-0.nightly-2024-08-13-031847
How reproducible:
always
Steps to Reproduce:
1. go to developer console, select one project and click Observe
Actual results:
2 Metrics tabs in the 4.17 developer console
Expected results:
only one
Additional info:
Should backport https://github.com/openshift/console/pull/14105 to 4.17.
Release a new stable branch of Gophercloud with the following changes:
After successfully creating a NAD of type: "OVN Kubernetes secondary localnet network", when viewing the object in the GUI, it will say that it is of type "OVN Kubernetes L2 overlay network".
When examining the objects YAML, it is still correctly configured as a NAD type of localnet.
Version-Release number of selected component:
OCP Virtualization 4.15.1
How reproducible: 100%
Steps to Reproduce:
1. Create appropriate NNCP and apply
for example:
~~~
apiVersion: nmstate.io/v1
kind: NodeNetworkConfigurationPolicy
metadata:
  name: nncp-br-ex-vlan-101
spec:
  nodeSelector:
    node-role.kubernetes.io/worker: ''
  desiredState:
    ovn:
      bridge-mappings:
      - localnet: vlan-101
        bridge: br-ex
        state: present
~~~
2. Create localnet type NAD (from GUI or YAML)
For example:
~~~
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: vlan-101
  namespace: default
spec:
  config: |2
    {
      "name": "br-ex",
      "type": "ovn-k8s-cni-overlay",
      "cniVersion": "0.4.0",
      "topology": "localnet",
      "vlanID": 101,
      "netAttachDefName": "default/vlan-101"
    }
~~~
3. View through the GUI by clicking on Networking–>NetworkAttachmentDefinitions–>the NAD you just created
4. When you look under type it will incorrectly display as Type: OVN Kubernetes L2 overlay Network
Actual results:
Type is displayed as OVN Kubernetes L2 overlay Network
If you examine the YAML for the NAD you will see that it is indeed still of type localnet
Please see attached screenshots for display of NAD type and the actual YAML of NAD.
At this point in time it looks as though this is just a display error.
Expected results:
Type should be displayed as OVN Kubernetes secondary localnet network
Please review the following PR: https://github.com/openshift/ibm-powervs-block-csi-driver-operator/pull/72
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
{ fail [github.com/openshift/origin/test/extended/prometheus/prometheus.go:409]: Unexpected error: <errors.aggregate | len:1, cap:1>: promQL query returned unexpected results: avg_over_time(cluster:telemetry_selected_series:count[43m25s]) >= 750 [ { "metric": { "prometheus": "openshift-monitoring/k8s" }, "value": [ 1721688289.455, "752.1379310344827" ] } ] [ <*errors.errorString | 0xc001a6d120>{ s: "promQL query returned unexpected results:\navg_over_time(cluster:telemetry_selected_series:count[43m25s]) >= 750\n[\n {\n \"metric\": {\n \"prometheus\": \"openshift-monitoring/k8s\"\n },\n \"value\": [\n 1721688289.455,\n \"752.1379310344827\"\n ]\n }\n]", }, ] occurred Ginkgo exit error 1: exit with code 1}
This test blocks PR merges in CMO
Description of problem:
The Perf & scale team is running scale tests to find out the maximum supported number of egress IPs and came across this issue. When we have 55339 EgressIP objects (each EgressIP object with one egress IP address) in a 118 worker node baremetal cluster, the multus-admission-controller pod is stuck in CrashLoopBackOff state.
"oc describe pod" command output is copied here: http://storage.scalelab.redhat.com/anilvenkata/multus-admission/multus-admission-controller-84b896c8-kmvdk.describe
"oc describe pod" shows that the names of all 55339 egress IPs are passed to the container's exec command:
# cat multus-admission-controller-84b896c8-kmvdk.describe | grep ignore-namespaces | tr ',' '\n' | grep -c egressip
55339
and the exec command is failing because the argument list is too long:
# oc logs -n openshift-multus multus-admission-controller-84b896c8-kmvdk
Defaulted container "multus-admission-controller" out of: multus-admission-controller, kube-rbac-proxy
exec /bin/bash: argument list too long
# oc get co network
NAME      VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
network   4.14.16   True        True          False      35d     Deployment "/openshift-multus/multus-admission-controller" update is rolling out (1 out of 3 updated)
# oc describe pod -n openshift-multus multus-admission-controller-84b896c8-kmvdk > multus-admission-controller-84b896c8-kmvdk.describe
# oc get pods -n openshift-multus | grep multus-admission-controller
multus-admission-controller-6c58c66ff9-5x9hn   2/2   Running            0                35d
multus-admission-controller-6c58c66ff9-zv9pd   2/2   Running            0                35d
multus-admission-controller-84b896c8-kmvdk     1/2   CrashLoopBackOff   26 (2m56s ago)   110m
As this environment has 55338 namespaces (each namespace with 1 pod and 1 EgressIP object), it will be hard to capture a must-gather.
Version-Release number of selected component (if applicable):
4.14.16
How reproducible:
always
Steps to Reproduce:
1. Use kube-burner to create 55339 EgressIP objects, each object with one egress IP address.
2. Observe the multus-admission-controller pod stuck in CrashLoopBackOff.
Actual results:
Expected results:
Additional info:
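A rough way to gauge how far over the limit the generated command line is, reusing the describe output captured above (the exact limits vary by kernel configuration):

# Size in bytes of the --ignore-namespaces value embedded in the container command
grep -o 'ignore-namespaces=[^ ]*' multus-admission-controller-84b896c8-kmvdk.describe | wc -c
# Per-exec limits on the node: total argv/env is bounded by ARG_MAX,
# and a single argument string by MAX_ARG_STRLEN (131072 bytes on typical Linux kernels)
getconf ARG_MAX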
This is a clone of issue OCPBUGS-38922. The following is the description of the original issue:
—
Description of problem:
With "Configuring a private storage endpoint on Azure by enabling the Image Registry Operator to discover VNet and subnet names" [1], creating a cluster with the internal Image Registry creates a storage account with a private endpoint. Once a new PVC uses the same skuName as this private storage account, it hits the mount permission issue.
[1] https://docs.openshift.com/container-platform/4.16/post_installation_configuration/configuring-private-cluster.html#configuring-private-storage-endpoint-azure-vnet-subnet-iro-discovery_configuring-private-cluster
Version-Release number of selected component (if applicable):
4.17
How reproducible:
Always
Steps to Reproduce:
1. Create a cluster with the flexy job profile aos-4_17/ipi-on-azure/versioned-installer-customer_vpc-disconnected-fully_private_cluster-arm and specify enable_internal_image_registry: "yes"
2. Create a pod and PVC with the azurefile-csi storage class
Actual results:
The pod failed to come up due to a mount error:
mount //imageregistryciophgfsnrc.file.core.windows.net/pvc-facecce9-d4b5-4297-b253-9a6200642392 on /var/lib/kubelet/plugins/kubernetes.io/csi/file.csi.azure.com/b4b5e52fb1d21057c9644d0737723e8911d9519ec4c8ddcfcd683da71312a757/globalmount failed with mount failed: exit status 32
Mounting command: mount
Mounting arguments: -t cifs -o mfsymlinks,cache=strict,nosharesock,actimeo=30,gid=1018570000,file_mode=0777,dir_mode=0777, //imageregistryciophgfsnrc.file.core.windows.net/pvc-facecce9-d4b5-4297-b253-9a6200642392 /var/lib/kubelet/plugins/kubernetes.io/csi/file.csi.azure.com/b4b5e52fb1d21057c9644d0737723e8911d9519ec4c8ddcfcd683da71312a757/globalmount
Output: mount error(13): Permission denied
Expected results:
Pod should be up
Additional info:
There are simple workarounds, like using a storage class with networkEndpointType: privateEndpoint or specifying another storage account, but using the pre-defined azurefile-csi storage class will fail, and the workaround is not easy to automate. I'm not sure whether the CSI driver could check if the reused storage account has a private endpoint before using the existing storage account.
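For reference, a minimal sketch of the workaround storage class mentioned above (the name is illustrative; networkEndpointType is the Azure File CSI parameter referenced in this report):

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: azurefile-csi-private              # illustrative name
provisioner: file.csi.azure.com
parameters:
  skuName: Standard_LRS
  networkEndpointType: privateEndpoint     # use a storage account behind a private endpoint
reclaimPolicy: Delete
volumeBindingMode: Immediate
allowVolumeExpansion: true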
Please review the following PR: https://github.com/openshift/prometheus-operator/pull/288
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
Because of a bug in upstream CAPA, the Load Balancer ingress rules are continuously revoked and then authorized, causing unnecessary AWS API calls and cluster provision delays.
Version-Release number of selected component (if applicable):
4.16+
How reproducible:
always
Steps to Reproduce:
1. 2. 3.
Actual results:
A constant loop of revoke-authorize of ingress rules.
Expected results:
Rules should be revoked only when needed (for example, when the installer removes the allow-all ssh rule). In the other cases, rules should be authorized only once.
Additional info:
Upstream issue created: https://github.com/kubernetes-sigs/cluster-api-provider-aws/issues/5023 PR submitted upstream: https://github.com/kubernetes-sigs/cluster-api-provider-aws/pull/5024
Description of problem:
In PowerVS, when I try and deploy a 4.17 cluster, I see the following ProbeError event: Liveness probe error: Get "https://192.168.169.11:10258/healthz": dial tcp 192.168.169.11:10258: connect: connection refused
Version-Release number of selected component (if applicable):
release-ppc64le:4.17.0-0.nightly-ppc64le-2024-06-14-211304
How reproducible:
Always
Steps to Reproduce:
1. Create a cluster
Description of problem:
The MultipleDefaultStorageClasses alert has an incorrect rule: it does not deactivate right after the user fixes the cluster to have only one default storage class, but stays active for another ~5 minutes after the fix is applied.
Version-Release number of selected component (if applicable):
OCP 4.11+
How reproducible:
always (platform independent, reproducible with any driver and storage class)
Steps to Reproduce:
Set an additional storage class as default:
```
$ oc patch storageclass gp2-csi -p '{"metadata": {"annotations": {"storageclass.kubernetes.io/is-default-class": "true"}}}'
storageclass.storage.k8s.io/gp2-csi patched
```
Check that the Prometheus metric is now > 1:
```
$ oc exec -c prometheus -n openshift-monitoring prometheus-k8s-0 -- curl -s --data-urlencode "query=default_storage_class_count" http://localhost:9090/api/v1/query | jq -r '.data.result[0].value[1]'
2
```
Wait at least 5 minutes for the alert to be `pending`; after 10 minutes the alert starts `firing`:
```
$ oc exec -c prometheus -n openshift-monitoring prometheus-k8s-0 -- curl -s http://localhost:9090/api/v1/alerts | jq -r '.data.alerts[] | select(.labels.alertname == "MultipleDefaultStorageClasses") | "\(.labels.alertname) - \(.state)"'
MultipleDefaultStorageClasses - firing
```
Annotate the storage class as non-default, making sure there's only one default now:
```
$ oc patch storageclass gp2-csi -p '{"metadata": {"annotations": {"storageclass.kubernetes.io/is-default-class": "false"}}}'
storageclass.storage.k8s.io/gp2-csi patched
```
The alert is still present for 5 minutes but should have disappeared immediately - this is the actual bug:
```
$ oc exec -c prometheus -n openshift-monitoring prometheus-k8s-0 -- curl -s http://localhost:9090/api/v1/alerts | jq -r '.data.alerts[] | select(.labels.alertname == "MultipleDefaultStorageClasses") | "\(.labels.alertname) - \(.state)"'
MultipleDefaultStorageClasses - firing
```
After 5 minutes the alert is gone:
```
$ oc exec -c prometheus -n openshift-monitoring prometheus-k8s-0 -- curl -s http://localhost:9090/api/v1/alerts | jq -r '.data.alerts[] | select(.labels.alertname == "MultipleDefaultStorageClasses") | "\(.labels.alertname) - \(.state)"'
```
Root cause: the alerting rule uses `max_over_time` but it should be `min_over_time` here: https://github.com/openshift/cluster-storage-operator/blob/7b4d8861d8f9364d63ad9a58347c2a7a014bff70/manifests/12_prometheusrules.yaml#L19
Additional info:
To verify changes follow the same procedure and verify that the alert is gone right after the settings are fixed (meaning there's only 1 default storage class again). Changes are tricky to test -> on a live cluster, changing the Prometheus rule won't work as it will get reconciled by CSO, but if CSO is scaled down to prevent this then metrics are not collected. I'd suggest testing this by editing CSO code, scaling down CSO+CVO and running CSO locally, see README with instructions how to do it: https://github.com/openshift/cluster-storage-operator/blob/master/README.md
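A hypothetical excerpt of the corrected rule (the exact expression, window, and for: duration live in the linked manifest and may differ; the point is swapping max_over_time for min_over_time so the alert clears as soon as the count drops back to 1):

# Hypothetical excerpt; see manifests/12_prometheusrules.yaml for the real rule
- alert: MultipleDefaultStorageClasses
  # was: max_over_time(default_storage_class_count[5m]) > 1
  expr: min_over_time(default_storage_class_count[5m]) > 1
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: More than one default StorageClass is configured.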
Component Readiness has found a potential regression in [sig-api-machinery] API data in etcd should be stored at the correct location and version for all resources [Serial] [Suite:openshift/conformance/serial].
Probability of significant regression: 100.00%
Sample (being evaluated) Release: 4.17
Start Time: 2024-05-24T00:00:00Z
End Time: 2024-05-30T23:59:59Z
Success Rate: 13.33%
Successes: 2
Failures: 13
Flakes: 0
Base (historical) Release: 4.15
Start Time: 2024-02-01T00:00:00Z
End Time: 2024-02-28T23:59:59Z
Success Rate: 100.00%
Successes: 42
Failures: 0
Flakes: 0
This issue is actively blocking payloads as no techpreview serial jobs can pass: https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.17-overall-analysis-all/1796109729718603776
Sippy test data shows this permafailing around the time the kube rebase merged.
Test failure output:
[sig-api-machinery] API data in etcd should be stored at the correct location and version for all resources [Serial] [Suite:openshift/conformance/serial] 46s
{ fail [github.com/openshift/origin/test/extended/etcd/etcd_storage_path.go:534]: test failed:
no test data for resource.k8s.io/v1alpha2, Kind=ResourceClaimParameters. Please add a test for your new type to etcdStorageData.
no test data for resource.k8s.io/v1alpha2, Kind=ResourceClassParameters. Please add a test for your new type to etcdStorageData.
no test data for resource.k8s.io/v1alpha2, Kind=ResourceSlice. Please add a test for your new type to etcdStorageData.
etcd data does not match the types we saw:
seen but not in etcd data: [
  resource.k8s.io/v1alpha2, Resource=resourceclassparameters
  resource.k8s.io/v1alpha2, Resource=resourceslices
  resource.k8s.io/v1alpha2, Resource=resourceclaimparameters]
Ginkgo exit error 1: exit with code 1}
The provisioning CR is now created with a paused annotation (since https://github.com/openshift/installer/pull/8346)
On baremetal IPI installs, this annotation is removed at the conclusion of bootstrapping.
On assisted/ABI installs there is nothing to remove it, so cluster-baremetal-operator never deploys anything.
Description of problem:
Update the PowerVS CAPI provider to v0.8.0
This is a clone of issue OCPBUGS-41852. The following is the description of the original issue:
—
Description of problem:
Update the tested instance types for IBM Cloud.
Version-Release number of selected component (if applicable):
4.17
How reproducible:
1. Some new instance types need to be added
2. Match the memory and CPU limitations
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
https://docs.openshift.com/container-platform/4.16/installing/installing_ibm_cloud_public/installing-ibm-cloud-customizations.html#installation-ibm-cloud-tested-machine-types_installing-ibm-cloud-customizations
Description of problem:
An IBM Cloud DNS zone does not go into the "Active" state unless a permitted network is added to it. If we try to use a DNS zone that does not have a VPC attached as a permitted network, the installer fails with the error "failed to get DNS Zone id". We already have code to attach a permitted network to a DNS zone, but it cannot be used unless the DNS zone is in the "Active" state. The zone does not even show up in the install-config survey.
Version-Release number of selected component (if applicable):
4.16, 4.17
How reproducible:
In the scenario where the user attempts to create a private cluster without attaching a permitted network to the DNS Zone.
Steps to Reproduce:
1. Create an IBM Cloud DNS zone in a DNS instance.
2. openshift-install create install-config [OPTIONS]
3. The user-created DNS zone won't show up in the selection for DNS Zone.
4. Proceed anyway, choosing another private DNS zone.
5. Edit the generated install-config and change baseDomain to your zone.
6. openshift-install create manifests [OPTIONS]
7. The above step will fail with "failed to get DNS Zone id".
Actual results:
DNS Zone is not visible in the survey and creating manifests fails.
Expected results:
The DNS zone without permitted networks shows up in the survey and the installation completes.
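As a manual workaround, the permitted network can be attached before running the installer. A rough sketch using the IBM Cloud DNS Services CLI plugin; the exact flags are an assumption and should be checked with `ibmcloud dns permitted-network-add --help`:

# Assumed flags; verify against the CLI plugin's help output
ibmcloud dns permitted-network-add ZONE_ID \
  --type vpc \
  --vpc-crn crn:v1:bluemix:public:is:us-south:a/ACCOUNT_ID::vpc:VPC_ID \
  --instance DNS_INSTANCE_NAME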
The default behavior of disabling automated cleaning in the non-converged flow was removed in this PR: https://github.com/openshift/assisted-service/pull/5319
This causes issues now when a customer disables the converged flow but doesn't set automatedCleaningMode to disabled manually on their BMH.
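Until that default is restored, the manual setting referenced above can be applied per host; a sketch (host name and namespace are illustrative):

# Disable automated cleaning on a BareMetalHost; the default mode is "metadata"
oc patch bmh/worker-0 -n openshift-machine-api --type merge \
  -p '{"spec":{"automatedCleaningMode":"disabled"}}'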
Description of problem:
The doc installing_ibm_cloud_public/installing-ibm-cloud-customizations.html does not have the tested instance type list.
Version-Release number of selected component (if applicable):
4.15
How reproducible:
Always
Steps to Reproduce:
1. https://docs.openshift.com/container-platform/4.15/installing/installing_ibm_cloud_public/installing-ibm-cloud-customizations.html does not list the tested VM types
Actual results:
The doc does not list the tested instance types.
Expected results:
The doc should list the tested instance types, as is done for Azure in https://docs.openshift.com/container-platform/4.15/installing/installing_azure/installing-azure-customizations.html#installation-azure-tested-machine-types_installing-azure-customizations
Additional info:
Description of problem:
[0]
$ omc -n openshift-cluster-storage-operator logs vsphere-problem-detector-operator-78cbc7fdbb-2g9mx | grep -i -e datastore.go -e E0508
2024-05-08T07:44:05.842165300Z I0508 07:44:05.839356 1 datastore.go:329] checking datastore ds:///vmfs/volumes/vsan:526390016b19d2b5-21ae3fd76fa61150/ for permissions
2024-05-08T07:44:05.842165300Z I0508 07:44:05.839504 1 datastore.go:125] CheckStorageClasses: thin-csi: storage policy openshift-storage-policy-tc01-rpdd7: unable to find datastore with URL ds:///vmfs/volumes/vsan:526390016b19d2b5-21ae3fd76fa61150/
2024-05-08T07:44:05.842165300Z I0508 07:44:05.839522 1 datastore.go:142] CheckStorageClasses checked 7 storage classes, 1 problems found
2024-05-08T07:44:05.848251057Z E0508 07:44:05.848212 1 operator.go:204] failed to run checks: StorageClass thin-csi: storage policy openshift-storage-policy-tc01-rpdd7: unable to find datastore with URL ds:///vmfs/volumes/vsan:526390016b19d2b5-21ae3fd76fa61150/
[...]
[1] https://github.com/openshift/vsphere-problem-detector/compare/release-4.13...release-4.14
[2] https://github.com/openshift/vsphere-problem-detector/blame/release-4.14/pkg/check/datastore.go#L328-L344
[3] https://github.com/openshift/vsphere-problem-detector/pull/119
[4] https://issues.redhat.com/browse/OCPBUGS-28879
4.17.0-0.nightly-2024-05-16-195932 and 4.16.0-0.nightly-2024-05-17-031643 both have resource quota issues like
failed to create iam: LimitExceeded: Cannot exceed quota for OpenIdConnectProvidersPerAccount: 100 status code: 409, request id: f69bf82c-9617-408a-b281-92c1ef0ec974
failed to create infra: failed to create VPC: VpcLimitExceeded: The maximum number of VPCs has been reached. status code: 400, request id: f90dcc5b-7e66-4a14-aa22-cec9f602fa8e
Seth has indicated he is working to clean things up in https://redhat-internal.slack.com/archives/C01CQA76KMX/p1715913603117349?thread_ts=1715557887.529169&cid=C01CQA76KMX
Description of problem:
In 4.16.0-0.nightly-2024-05-14-095225, the message "logtostderr is removed in the k8s upstream and has no effect any more." is logged by the kube-rbac-proxy-main/kube-rbac-proxy-self/kube-rbac-proxy-thanos containers.
$ oc -n openshift-monitoring logs -c kube-rbac-proxy-main openshift-state-metrics-7f78c76cc6-nfbl4
W0514 23:19:50.052015 1 deprecated.go:66] ==== Removed Flag Warning ======================logtostderr is removed in the k8s upstream and has no effect any more.===============================================
...
$ oc -n openshift-monitoring logs -c kube-rbac-proxy-self openshift-state-metrics-7f78c76cc6-nfbl4
...
W0514 23:19:50.177692 1 deprecated.go:66] ==== Removed Flag Warning ======================logtostderr is removed in the k8s upstream and has no effect any more.===============================================
...
$ oc -n openshift-monitoring get pod openshift-state-metrics-7f78c76cc6-nfbl4 -oyaml | grep logtostderr -C3
spec:
  containers:
  - args:
    - --logtostderr
    - --secure-listen-address=:8443
    - --tls-cipher-suites=TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305_SHA256,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305_SHA256
    - --upstream=http://127.0.0.1:8081/
--
    name: kube-api-access-v9hzd
    readOnly: true
  - args:
    - --logtostderr
    - --secure-listen-address=:9443
    - --tls-cipher-suites=TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305_SHA256,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305_SHA256
    - --upstream=http://127.0.0.1:8082/
$ oc -n openshift-monitoring logs -c kube-rbac-proxy-thanos prometheus-k8s-0
W0515 02:55:54.209496 1 deprecated.go:66] ==== Removed Flag Warning ======================logtostderr is removed in the k8s upstream and has no effect any more.===============================================
...
$ oc -n openshift-monitoring get pod prometheus-k8s-0 -oyaml | grep logtostderr -C3
    - --config-file=/etc/kube-rbac-proxy/config.yaml
    - --tls-cipher-suites=TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305_SHA256,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305_SHA256
    - --allow-paths=/metrics
    - --logtostderr=true
    - --tls-min-version=VersionTLS12
    env:
    - name: POD_IP
Version-Release number of selected component (if applicable):
4.16.0-0.nightly-2024-05-14-095225
How reproducible:
always
Steps to Reproduce:
1. see the description
Actual results:
logtostderr is removed in the k8s upstream and has no effect any more
Expected results:
no such info
Additional info:
Please review the following PR: https://github.com/openshift/csi-external-snapshotter/pull/150
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/csi-external-resizer/pull/162
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
This is a clone of issue OCPBUGS-42106. The following is the description of the original issue:
—
Description of problem:
Test Platform has detected a large increase in the amount of time spent waiting for pull secrets to be initialized. Monitoring the audit log, we can see nearly continuous updates to the SA pull secrets in the cluster (~2 per minute for every SA pull secret in the cluster).
The controller manager is filled with entries like:
- "Internal registry pull secret auth data does not contain the correct number of entries" ns="ci-op-tpd3xnbx" name="deployer-dockercfg-p9j54" expected=5 actual=4
- "Observed image registry urls" urls=["172.30.228.83:5000","image-registry.openshift-image-registry.svc.cluster.local:5000","image-registry.openshift-image-registry.svc:5000","registry.build01.ci.openshift.org","registry.build01.ci.openshift.org"]
In this "Observed image registry urls" log line, notice the duplicate entries for "registry.build01.ci.openshift.org". We are not sure what is causing this, but it leads to a duplicate entry; when actualized in a pull secret map, the double entry is reduced to one, so the controller-manager finds the cardinality mismatch on the next check.
The duplication is evident in OpenShiftControllerManager/cluster:
dockerPullSecret:
  internalRegistryHostname: image-registry.openshift-image-registry.svc:5000
  registryURLs:
  - registry.build01.ci.openshift.org
  - registry.build01.ci.openshift.org
But there is only one hostname in config.imageregistry.operator.openshift.io/cluster:
routes:
- hostname: registry.build01.ci.openshift.org
  name: public-routes
  secretName: public-route-tls
Version-Release number of selected component (if applicable):
4.17.0-rc.3
How reproducible:
Constant on build01 but not on other build farms
Steps to Reproduce:
1. Something ends up creating duplicate entries in the observed configuration of the openshift-controller-manager. 2. 3.
Actual results:
- Approximately 400K secret patches an hour on build01 vs ~40K on other build farms. Initialization times have increased by two orders of magnitude in new ci-operator namespaces.
- The openshift-controller-manager is hot looping and experiencing client-side throttling.
Expected results:
1. Initialization of pull secrets in a namespace should take < 1 second. On build01, it can take over 1.5 minutes.
2. openshift-controller-manager should not possess duplicate entries.
3. If duplicate entries are a configuration error, openshift-controller-manager should de-dupe the entries.
4. There should be alerting when the openshift-controller-manager experiences client-side throttling / pathological behavior.
Additional info:
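A quick way to check whether the observed configuration currently carries the duplicate entry (a sketch; the field path follows the OpenShiftControllerManager excerpt in the description):

# Inspect the observed registry URLs on the operator CR
oc get openshiftcontrollermanager cluster \
  -o jsonpath='{.spec.observedConfig.dockerPullSecret.registryURLs}'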
Please review the following PR: https://github.com/openshift/csi-external-attacher/pull/74
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
In order to safely roll out the new method of calculating the cpu system reserved values [1] , we would have to introduce versioning in auto node sizing. This way even if the new method ends up reserving more CPU, existing customers won't see any dip in the amount of CPU available for their workloads.
This section includes Jira cards that are not linked to either an Epic or a Feature. These tickets were not completed when this image was assembled
Multus is now a Pod and will be captured by normal oc adm must-gather command.
The multus.log file was removed in 4.16 and doesn't exist anymore.
The https://github.com/pkg/errors repo was archived in Dec 2021. See also https://github.com/pkg/errors/issues/245.
We should probably use `fmt.Errorf("... %w", err)` instead.
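A minimal sketch of the replacement pattern with the standard library (function and path are illustrative, not taken from any particular repo):

package main

import (
	"errors"
	"fmt"
	"os"
)

// loadConfig wraps the underlying error with fmt.Errorf and %w instead of
// errors.Wrap from github.com/pkg/errors; the error chain stays inspectable.
func loadConfig(path string) error {
	if _, err := os.Stat(path); err != nil {
		return fmt.Errorf("loading config %q: %w", path, err)
	}
	return nil
}

func main() {
	err := loadConfig("/nonexistent/config.yaml")
	fmt.Println(err)
	// errors.Is / errors.As still see the wrapped cause, so callers keep working.
	fmt.Println(errors.Is(err, os.ErrNotExist)) // true
}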
As a (user persona), I want to be able to:
so that I can achieve
Description of criteria:
Description of problem:
After deploying a node with ZTP and applying a PerformanceProfile, the PerformanceProfile continually transitions between conditions (Available, Upgradeable, Progressing, Degraded), and the cluster tuning operator logs show a cycle of Reconciling/Updating the profile every 15 minutes or so. There is no apparent impact to the cluster, but it is generating a lot of noise and causing concern with one of our partners.
Version-Release number of selected component (if applicable):
Observed in 4.14.5 and 4.14.25
How reproducible:
Always
Steps to Reproduce:
1. Deploy a node via ZTP
2. Apply a PerformanceProfile via ACM/policies
Actual results:
PerformanceProfile is applied, but logs show repeated reconcile/update attempts, generating noise in the logs
Expected results:
PerformanceProfile is applied and reconciled, but without the repeated updates and state transitions.
Additional info:
Logs show that the MachineConfig for the perf profile is getting updated every 15 minutes, but nothing has been changed, i.e. no change in the applied PerformanceProfile:
I0605 18:52:08.786257 1 performanceprofile_controller.go:390] Reconciling PerformanceProfile
I0605 18:52:08.786568 1 resources.go:41] checking staged MachineConfig "rendered-master-478a5ff5b6e20bdac08368b380ae69c8"
I0605 18:52:08.786899 1 performanceprofile_controller.go:461] using "crun" as high-performance runtime class container-runtime for profile "openshift-node-performance-profile"
I0605 18:52:08.788604 1 resources.go:109] Update machine-config "50-performance-openshift-node-performance-profile"
I0605 18:52:08.812015 1 status.go:83] Updating the performance profile "openshift-node-performance-profile" status
I0605 18:52:08.823836 1 performanceprofile_controller.go:390] Reconciling PerformanceProfile
I0605 18:52:08.824049 1 resources.go:41] checking staged MachineConfig "rendered-master-478a5ff5b6e20bdac08368b380ae69c8"
I0605 18:52:08.824994 1 performanceprofile_controller.go:461] using "crun" as high-performance runtime class container-runtime for profile "openshift-node-performance-profile"
I0605 18:52:08.826478 1 resources.go:109] Update machine-config "50-performance-openshift-node-performance-profile"
I0605 18:52:29.069218 1 performanceprofile_controller.go:390] Reconciling PerformanceProfile
I0605 18:52:29.069349 1 resources.go:41] checking staged MachineConfig "rendered-master-478a5ff5b6e20bdac08368b380ae69c8"
I0605 18:52:29.069571 1 performanceprofile_controller.go:461] using "crun" as high-performance runtime class container-runtime for profile "openshift-node-performance-profile"
I0605 18:52:29.074617 1 resources.go:109] Update machine-config "50-performance-openshift-node-performance-profile"
I0605 18:52:29.088866 1 status.go:83] Updating the performance profile "openshift-node-performance-profile" status
I0605 18:52:29.096390 1 performanceprofile_controller.go:390] Reconciling PerformanceProfile
I0605 18:52:29.096506 1 resources.go:41] checking staged MachineConfig "rendered-master-478a5ff5b6e20bdac08368b380ae69c8"
I0605 18:52:29.096834 1 performanceprofile_controller.go:461] using "crun" as high-performance runtime class container-runtime for profile "openshift-node-performance-profile"
I0605 18:52:29.097912 1 resources.go:109] Update machine-config "50-performance-openshift-node-performance-profile"
# oc get performanceprofile -o yaml
<...snip...>
status:
  conditions:
  - lastHeartbeatTime: "2024-06-05T19:09:08Z"
    lastTransitionTime: "2024-06-05T19:09:08Z"
    message: cgroup=v1;
    status: "True"
    type: Available
  - lastHeartbeatTime: "2024-06-05T19:09:08Z"
    lastTransitionTime: "2024-06-05T19:09:08Z"
    status: "True"
    type: Upgradeable
  - lastHeartbeatTime: "2024-06-05T19:09:08Z"
    lastTransitionTime: "2024-06-05T19:09:08Z"
    status: "False"
    type: Progressing
  - lastHeartbeatTime: "2024-06-05T19:09:08Z"
    lastTransitionTime: "2024-06-05T19:09:08Z"
    status: "False"
    type: Degraded
Caught by the test (among others):
[sig-network] there should be reasonably few single second disruptions for kube-api-http2-localhost-new-connections
Beginning with CI payload 4.17.0-0.ci-2024-07-20-200703 and continuing into nightlies with 4.17.0-0.nightly-2024-07-21-065611 the aggregated tests started recording weird disruption results. Most of the runs (as in the sample above) report success but the aggregated test doesn't count them all, e.g.
Test Failed! suite=[root openshift-tests], testCase=[sig-network] there should be reasonably few single second disruptions for openshift-api-http2-service-network-reused-connections
Message: Passed 2 times, failed 0 times, skipped 0 times: we require at least 6 attempts to have a chance at success
name: '[sig-network] there should be reasonably few single second disruptions for openshift-api-http2-service-network-reused-connections'
testsuitename: openshift-tests
summary: 'Passed 2 times, failed 0 times, skipped 0 times: we require at least 6 attempts to have a chance at success'
https://github.com/openshift/origin/pull/28277 is the sole PR in that first payload and it certainly seems related, so I will put up a revert.
Starting with payload 4.17.0-0.nightly-2024-06-25-103421 we are seeing aggregated failures for aws due to
[sig-network-edge][Conformance][Area:Networking][Feature:Router][apigroup:route.openshift.io][apigroup:config.openshift.io] The HAProxy router should pass the http2 tests [apigroup:image.openshift.io][apigroup:operator.openshift.io] [Suite:openshift/conformance/parallel/minimal]
This test was recently reenabled for aws via https://github.com/openshift/origin/pull/28515
Description of problem:
compact agent e2e jobs are consistently failing the e2e test (when they manage to install):
[sig-node] Managed cluster should verify that nodes have no unexpected reboots [Late] [Suite:openshift/conformance/parallel]
Examining CI search I noticed that this failure is also occurring on many other jobs:
Version-Release number of selected component (if applicable):
In CI search, we can see failures in 4.15-4.17
How reproducible:
See CI search results
Steps to Reproduce:
1. 2. 3.
Actual results:
fail e2e
Expected results:
Pass
Additional info:
Description of problem:
Using IPI on AWS. When we create a new worker using a 4.1 cloud image, the "node-valid-hostname.service" service fails with this error:
# systemctl status node-valid-hostname.service
× node-valid-hostname.service - Wait for a non-localhost hostname
     Loaded: loaded (/etc/systemd/system/node-valid-hostname.service; enabled; preset: disabled)
     Active: failed (Result: timeout) since Mon 2023-10-16 08:37:50 UTC; 1h 13min ago
   Main PID: 1298 (code=killed, signal=TERM)
        CPU: 330ms
Oct 16 08:32:50 localhost.localdomain mco-hostname[1298]: waiting for non-localhost hostname to be assigned
Oct 16 08:32:50 localhost.localdomain systemd[1]: Starting Wait for a non-localhost hostname...
Oct 16 08:37:50 localhost.localdomain systemd[1]: node-valid-hostname.service: start operation timed out. Terminating.
Oct 16 08:37:50 localhost.localdomain systemd[1]: node-valid-hostname.service: Main process exited, code=killed, status=15/TERM
Oct 16 08:37:50 localhost.localdomain systemd[1]: node-valid-hostname.service: Failed with result 'timeout'.
Oct 16 08:37:50 localhost.localdomain systemd[1]: Failed to start Wait for a non-localhost hostname.
The configured hostname is:
sh-5.1# hostname
localhost.localdomain
Version-Release number of selected component (if applicable):
IPI on AWS 4.15.0-0.nightly-2023-10-15-214132
How reproducible:
Always
Steps to Reproduce:
1. Create a machineset using a 4.1 cloud image 2. Scale the machineset to create a new worker node 3. When the worker node is added, check the hostname and the service
Actual results:
The "node-valid-hostname.service" is failed and the configured hostname is sh-5.1# hostname localhost.localdomain
Expected results:
No service should fail and the new worker should have a valid hostname.
Additional info:
Cluster API Provider IBM (CAPI) provides the ability to override the endpoints it interacts with. When we start CAPI, we should pass along any endpoint overrides from the install config.
Service endpoints were added to the install config here
CAPI accepts endpoint overrides as described here
Pass any endpoint overrides we can from the installer to CAPI
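A hedged sketch of what the install-config overrides might look like; the field path (platform.powervs.serviceEndpoints) and the service names are assumptions based on the note above that service endpoints were added to the install config:

platform:
  powervs:
    region: us-south                  # illustrative
    serviceEndpoints:                 # assumed field name; these would be passed through to CAPI
    - name: iam                       # illustrative service name
      url: https://private.iam.cloud.ibm.com
    - name: cos
      url: https://s3.direct.us-south.cloud-object-storage.appdomain.cloud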
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
The openshift/origin test/extended/router/http2.go tests don't run on AWS. We disabled this some time ago. Let's enable this to see if it is still an issue. I have been running the http2 tests on AWS this week and I have not run into the original issue highlighted by the original bugzilla bug.
// platformHasHTTP2LoadBalancerService returns true where the default
// router is exposed by a load balancer service and can support http/2
// clients.
func platformHasHTTP2LoadBalancerService(platformType configv1.PlatformType) bool {
	switch platformType {
	case configv1.AzurePlatformType, configv1.GCPPlatformType:
		return true
	case configv1.AWSPlatformType:
		e2e.Logf("AWS support waiting on https://bugzilla.redhat.com/show_bug.cgi?id=1912413")
		fallthrough
	default:
		return false
	}
}
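A minimal sketch of the proposed re-enable, assuming the function otherwise stays as above (AWS is simply treated like Azure and GCP):

func platformHasHTTP2LoadBalancerService(platformType configv1.PlatformType) bool {
	switch platformType {
	case configv1.AzurePlatformType, configv1.GCPPlatformType, configv1.AWSPlatformType:
		// AWS re-enabled; the issue from https://bugzilla.redhat.com/show_bug.cgi?id=1912413
		// has not reproduced in recent runs.
		return true
	default:
		return false
	}
}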
Description of problem:
kube-state-metrics fails with "network is unreachable" and needs to be restarted manually.
Version-Release number of selected component (if applicable):
How reproducible:
I have not been able to reproduce this in a lab cluster
Steps to Reproduce:
N/A
Actual results:
Metrics not being served
Expected results:
kube-state-metrics probe fails and the pod restarts
Additional info:
Even if the --namespace arg is specified to hypershift install render, the openshift-config-managed-trusted-ca-bundle configmap's namespace is always set to "hypershift".
Description of problem:
FDP released a new OVS 3.4 version that will be used on the host.
We want to maintain the same version in the container.
This is mostly needed for OVN observability feature.
Description of problem:
Configuring mTLS on the default IngressController breaks the ingress canary check and the console health checks, which in turn puts the ingress and console cluster operators into a Degraded state.
OpenShift release version:
OCP-4.9.5
Cluster Platform:
UPI on Baremetal (Disconnected cluster)
How reproducible:
Configure mutual TLS/mTLS using default IngressController as described in the doc(https://docs.openshift.com/container-platform/4.9/networking/ingress-operator.html#nw-mutual-tls-auth_configuring-ingress)
Steps to Reproduce (in detail):
1. Create a config map that is in the openshift-config namespace.
2. Edit the IngressController resource in the openshift-ingress-operator project
3.Add the spec.clientTLS field and subfields to configure mutual TLS:
~~~
apiVersion: operator.openshift.io/v1
kind: IngressController
metadata:
  name: default
  namespace: openshift-ingress-operator
spec:
  clientTLS:
    clientCertificatePolicy: Required
    clientCA:
      name: router-ca-certs-default
    allowedSubjectPatterns:
~~~
Expected results:
mTLS setup should work properly without degrading the Ingress and Console operators.
Impact of the problem:
Unstable cluster with the Ingress and Console operators in a Degraded state.
Additional info:
The following is the Error message for your reference:
The "default" ingress controller reports Degraded=True: DegradedConditions: One or more other status conditions indicate a degraded state: CanaryChecksSucceeding=False (CanaryChecksRepetitiveFailures: Canary route checks for the default ingress controller are failing)
// Canary checks looking for required tls certificate.
2021-11-19T17:17:58.237Z ERROR operator.canary_controller wait/wait.go:155 error performing canary route check
// Console operator:
RouteHealthDegraded: failed to GET route (https://console-openshift-console.apps.bruce.openshift.local): Get "https://console-openshift-console.apps.bruce.openshift.local": remote error: tls: certificate required
Along with disruption monitoring via external endpoint we should add in-cluster monitors which run the same checks over:
These tests should be implemented as deployments with anti-affinity so they land on different nodes. Deployments are chosen so that the nodes can properly be drained. These deployments write to the host disk, and on restart the pod picks up the existing data. When a special ConfigMap is created, the pod stops collecting disruption data.
The external part of the test will create the deployments (and the necessary RBAC objects) when the test is started, create the stop ConfigMap when it ends, and collect data from the nodes. The test will expose the results on the intervals chart, so that the data can be used to find the source of disruption. A hedged sketch of such a deployment is shown below.
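A minimal sketch of one in-cluster disruption monitor deployment; all names, the image, flags, and paths are illustrative, not the actual test implementation:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: in-cluster-disruption-monitor   # hypothetical
  namespace: e2e-disruption-monitor     # hypothetical
spec:
  replicas: 2
  selector:
    matchLabels:
      app: in-cluster-disruption-monitor
  template:
    metadata:
      labels:
        app: in-cluster-disruption-monitor
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: in-cluster-disruption-monitor
            topologyKey: kubernetes.io/hostname   # land replicas on different nodes
      containers:
      - name: monitor
        image: quay.io/example/disruption-monitor:latest   # illustrative image
        args:
        - --output-dir=/var/log/disruption-data            # hypothetical flag
        - --stop-configmap=stop-collecting                 # hypothetical flag
        volumeMounts:
        - name: data
          mountPath: /var/log/disruption-data
      volumes:
      - name: data
        hostPath:                                          # persist data across pod restarts
          path: /var/log/disruption-data
          type: DirectoryOrCreate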
This is a clone of issue OCPBUGS-36670. The following is the description of the original issue:
—
Description of problem:
Using a payload built with https://github.com/openshift/installer/pull/8666/ so that master instances can be provisioned from the gen2 image, which is required when configuring a security type in install-config.
Enable the TrustedLaunch security type in install-config:
==================
controlPlane:
  architecture: amd64
  hyperthreading: Enabled
  name: master
  platform:
    azure:
      encryptionAtHost: true
      settings:
        securityType: TrustedLaunch
        trustedLaunch:
          uefiSettings:
            secureBoot: Enabled
            virtualizedTrustedPlatformModule: Enabled
Launch the capi-based installation; the installer failed after waiting 15 minutes for machines to provision:
INFO GalleryImage.ID=/subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/jima08conf01-9vgq5-rg/providers/Microsoft.Compute/galleries/gallery_jima08conf01_9vgq5/images/jima08conf01-9vgq5
INFO GalleryImage.ID=/subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/jima08conf01-9vgq5-rg/providers/Microsoft.Compute/galleries/gallery_jima08conf01_9vgq5/images/jima08conf01-9vgq5-gen2
INFO Created manifest *v1beta1.AzureMachine, namespace=openshift-cluster-api-guests name=jima08conf01-9vgq5-bootstrap
INFO Created manifest *v1beta1.AzureMachine, namespace=openshift-cluster-api-guests name=jima08conf01-9vgq5-master-0
INFO Created manifest *v1beta1.AzureMachine, namespace=openshift-cluster-api-guests name=jima08conf01-9vgq5-master-1
INFO Created manifest *v1beta1.AzureMachine, namespace=openshift-cluster-api-guests name=jima08conf01-9vgq5-master-2
INFO Created manifest *v1beta1.Machine, namespace=openshift-cluster-api-guests name=jima08conf01-9vgq5-bootstrap
INFO Created manifest *v1beta1.Machine, namespace=openshift-cluster-api-guests name=jima08conf01-9vgq5-master-0
INFO Created manifest *v1beta1.Machine, namespace=openshift-cluster-api-guests name=jima08conf01-9vgq5-master-1
INFO Created manifest *v1beta1.Machine, namespace=openshift-cluster-api-guests name=jima08conf01-9vgq5-master-2
INFO Created manifest *v1.Secret, namespace=openshift-cluster-api-guests name=jima08conf01-9vgq5-bootstrap
INFO Created manifest *v1.Secret, namespace=openshift-cluster-api-guests name=jima08conf01-9vgq5-master
INFO Waiting up to 15m0s (until 6:26AM UTC) for machines [jima08conf01-9vgq5-bootstrap jima08conf01-9vgq5-master-0 jima08conf01-9vgq5-master-1 jima08conf01-9vgq5-master-2] to provision...
ERROR failed to fetch Cluster: failed to generate asset "Cluster": failed to create cluster: control-plane machines were not provisioned within 15m0s: client rate limiter Wait returned an error: context deadline exceeded
INFO Shutting down local Cluster API control plane...
INFO Stopped controller: Cluster API
INFO Stopped controller: azure infrastructure provider
INFO Stopped controller: azureaso infrastructure provider
INFO Local Cluster API system has completed operations
In openshift-install.log:
time="2024-07-08T06:25:49Z" level=debug msg="\tfailed to reconcile AzureMachine: failed to reconcile AzureMachine service virtualmachine: failed to create or update resource jima08conf01-9vgq5-rg/jima08conf01-9vgq5-bootstrap (service: virtualmachine): PUT https://management.azure.com/subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/jima08conf01-9vgq5-rg/providers/Microsoft.Compute/virtualMachines/jima08conf01-9vgq5-bootstrap"
time="2024-07-08T06:25:49Z" level=debug msg="\t--------------------------------------------------------------------------------"
time="2024-07-08T06:25:49Z" level=debug msg="\tRESPONSE 400: 400 Bad Request"
time="2024-07-08T06:25:49Z" level=debug msg="\tERROR CODE: BadRequest"
time="2024-07-08T06:25:49Z" level=debug msg="\t--------------------------------------------------------------------------------"
time="2024-07-08T06:25:49Z" level=debug msg="\t{"
time="2024-07-08T06:25:49Z" level=debug msg="\t \"error\": {"
time="2024-07-08T06:25:49Z" level=debug msg="\t \"code\": \"BadRequest\","
time="2024-07-08T06:25:49Z" level=debug msg="\t \"message\": \"Use of TrustedLaunch setting is not supported for the provided image. Please select Trusted Launch Supported Gen2 OS Image. For more information, see https://aka.ms/TrustedLaunch-FAQ.\""
time="2024-07-08T06:25:49Z" level=debug msg="\t }"
time="2024-07-08T06:25:49Z" level=debug msg="\t}"
time="2024-07-08T06:25:49Z" level=debug msg="\t--------------------------------------------------------------------------------"
time="2024-07-08T06:25:49Z" level=debug msg=" > controller=\"azuremachine\" controllerGroup=\"infrastructure.cluster.x-k8s.io\" controllerKind=\"AzureMachine\" AzureMachine=\"openshift-cluster-api-guests/jima08conf01-9vgq5-bootstrap\" namespace=\"openshift-cluster-api-guests\" name=\"jima08conf01-9vgq5-bootstrap\" reconcileID=\"bee8a459-c3c8-4295-ba4a-f3d560d6a68b\""
It looks like the capi-based installer missed enabling security features when creating the gen2 image, which can be found in the terraform code: https://github.com/openshift/installer/blob/master/data/data/azure/vnet/main.tf#L166-L169
Gen2 image definition created by terraform:
$ az sig image-definition show --gallery-image-definition jima08conf02-4mrnz-gen2 -r gallery_jima08conf02_4mrnz -g jima08conf02-4mrnz-rg --query 'features'
[
  {
    "name": "SecurityType",
    "value": "TrustedLaunch"
  }
]
It's empty when querying the gen2 image created by using CAPI:
$ az sig image-definition show --gallery-image-definition jima08conf01-9vgq5-gen2 -r gallery_jima08conf01_9vgq5 -g jima08conf01-9vgq5-rg --query 'features'
$
Version-Release number of selected component (if applicable):
4.17 payload built from cluster-bot with PR https://github.com/openshift/installer/pull/8666/
How reproducible:
Always
Steps to Reproduce:
1. Enable security type in install-config 2. Create cluster by using CAPI 3.
Actual results:
Install failed.
Expected results:
Install succeeded.
Additional info:
It impacts installation with security type ConfidentialVM or TrustedLaunch enabled.
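For comparison, a hedged sketch of how a gallery image definition can be created with the TrustedLaunch feature flag set; resource names are taken from the failing example above, while the publisher/offer/sku values are illustrative:

# Publisher/offer/sku values here are illustrative; --features is the Azure CLI flag for gallery image security features
az sig image-definition create \
  --resource-group jima08conf01-9vgq5-rg \
  --gallery-name gallery_jima08conf01_9vgq5 \
  --gallery-image-definition jima08conf01-9vgq5-gen2 \
  --publisher RedHat --offer rhcos --sku gen2 \
  --os-type Linux --hyper-v-generation V2 \
  --features SecurityType=TrustedLaunch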
METAL-904 / https://github.com/openshift/cluster-baremetal-operator/pull/406 changed CBO to create a secret that contains the ironic image and service URLs.
If this secret is present we should use it instead of using vendored (and possibly out of date) CBO code to find these values ourselves.
The ironic agent image in the secret should take precedence over detecting the image assuming the spoke CPU architecture matches the hub. If the architecture is different we'll need to fall back to some other solution (annotations, or default image).
This secret will only be present in hub versions that have this new version of the CBO code so we also will need to maintain our current solutions for older hub cluster versions.
This is a clone of issue OCPBUGS-42535. The following is the description of the original issue:
—
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info: