Note: this page shows the Feature-Based Change Log for a release
These features were completed when this image was assembled
This outcome tracks the overall CoreOS Layering story as well as the technical items needed to converge CoreOS with RHEL image mode. This will provide operational consistency across the platforms.
ROADMAP for this Outcome: https://docs.google.com/document/d/1K5uwO1NWX_iS_la_fLAFJs_UtyERG32tdt-hLQM8Ow8/edit?usp=sharing
In the initial delivery of CoreOS Layering, administrators are required to provide their own build environment to customize RHCOS images. That could be a traditional RHEL environment, or an enterprising administrator with some knowledge of OCP Builds could potentially set one up on-cluster.
The primary virtue of an on-cluster build path is to continue using the cluster to manage the cluster. No external dependency, batteries-included.
On-cluster, automated RHCOS Layering builds are important for multiple reasons:
This work describes the tech preview state of On Cluster Builds. Major interfaces should be agreed upon at the end of this state.
Description of problem:
In clusters with OCB functionality enabled, sometimes the machine-os-builder pod is not restarted when we update the imageBuilderType. What we have observed is that the pod is restarted if a build is running, but it is not restarted if we are not building anything.
Version-Release number of selected component (if applicable):
$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.14.0-0.nightly-2023-09-12-195514   True        False         88m     Cluster version is 4.14.0-0.nightly-2023-09-12-195514
How reproducible:
Always
Steps to Reproduce:
1. Create the configuration resources needed by the OCB functionality. To reproduce this issue we use an on-cluster-build-config configmap with an empty imageBuilderType:
   oc patch cm/on-cluster-build-config -n openshift-machine-config-operator -p '{"data":{"imageBuilderType": ""}}'
2. Create an infra pool and label it so that it can use the OCB functionality:
   apiVersion: machineconfiguration.openshift.io/v1
   kind: MachineConfigPool
   metadata:
     name: infra
   spec:
     machineConfigSelector:
       matchExpressions:
         - {key: machineconfiguration.openshift.io/role, operator: In, values: [worker,infra]}
     nodeSelector:
       matchLabels:
         node-role.kubernetes.io/infra: ""
   oc label mcp/infra machineconfiguration.openshift.io/layering-enabled=
3. Wait for the build pod to finish.
4. Once the build has finished and has been cleaned up, update the imageBuilderType so that we use the "custom-pod-builder" type now:
   oc patch cm/on-cluster-build-config -n openshift-machine-config-operator -p '{"data":{"imageBuilderType": "custom-pod-builder"}}'
Actual results:
We waited for one hour, but the pod is never restarted.
$ oc get pods |grep build
machine-os-builder-6cfbd8d5d-xk6c5   1/1   Running   0   56m
$ oc logs machine-os-builder-6cfbd8d5d-xk6c5 |grep Type
I0914 08:40:23.910337       1 helpers.go:330] imageBuilderType empty, defaulting to "openshift-image-builder"
$ oc get cm on-cluster-build-config -o yaml |grep Type
  imageBuilderType: custom-pod-builder
Expected results:
When we update the imageBuilderType value, the machine-os-builder pod should be restarted.
Additional info:
Test and verify: MCO-1042: ocb-api implementation in MCO
Areas of concern:
Done when:
MCO-1042 PR can be verified and merged.
Description of problem:
MachineConfigs that use ignition 3.4.0 with a kernelArguments section are not currently allowed by the MCO. In on-cluster build pools, when we create a 3.4.0 MC with kernelArguments, the pool is not degraded. No new rendered MC is created either.
Version-Release number of selected component (if applicable):
4.14.0-0.nightly-2023-09-06-065940
How reproducible:
Always
Steps to Reproduce:
1. Enable on-cluster build in the "worker" pool
2. Create a MC using the 3.4.0 ignition version with kernelArguments:
   apiVersion: machineconfiguration.openshift.io/v1
   kind: MachineConfig
   metadata:
     creationTimestamp: "2023-09-07T12:52:11Z"
     generation: 1
     labels:
       machineconfiguration.openshift.io/role: worker
     name: mco-tc-66376-reject-ignition-kernel-arguments-worker
     resourceVersion: "175290"
     uid: 10b81a5f-04ee-4d7b-a995-89f319968110
   spec:
     config:
       ignition:
         version: 3.4.0
       kernelArguments:
         shouldExist:
           - enforcing=0
Actual results:
The build process is triggered and a new image is built and deployed. The pool is never degraded.
Expected results:
MCs with ignition 3.4.0 kernelArguments are not currently allowed. The MCP should be degraded, reporting a message similar to this one (this is the error reported if we deploy the MC in the master pool, which is a normal pool):
$ oc get mcp -o yaml
....
  - lastTransitionTime: "2023-09-07T12:16:55Z"
    message: 'Node sregidor-s10-7pdvl-master-1.c.openshift-qe.internal is reporting: "can''t reconcile config rendered-master-57e85ed95604e3de944b0532c58c385e with rendered-master-24b982c8b08ab32edc2e84e3148412a3: ignition kargs section contains changes"'
    reason: 1 nodes are reporting degraded status on sync
    status: "True"
    type: NodeDegraded
Additional info:
When the image is deployed (it shouldn't be deployed) the kernel argument enforcing=0 is not present:
sh-5.1# cat /proc/cmdline
BOOT_IMAGE=(hd0,gpt3)/ostree/rhcos-05f51fadbc7fe74fa1e2ba3c0dbd0268c6996f0582c05dc064f137e93aa68184/vmlinuz-5.14.0-284.30.1.el9_2.x86_64 ostree=/ostree/boot.0/rhcos/05f51fadbc7fe74fa1e2ba3c0dbd0268c6996f0582c05dc064f137e93aa68184/0 ignition.platform.id=gcp console=tty0 console=ttyS0,115200n8 root=UUID=95083f10-c02f-4d94-a5c9-204481ce3a91 rw rootflags=prjquota boot=UUID=0440a909-3e61-4f7c-9f8e-37fe59150665 systemd.unified_cgroup_hierarchy=1 cgroup_no_v1=all psi=1
Description of problem:
When opting into on-cluster builds on both the worker and control plane MachineConfigPools, the maxUnavailable value on the MachineConfigPools is not respected when the newly built image is rolled out to all of the nodes in a given pool.
Version-Release number of selected component (if applicable):
How reproducible:
Sometimes reproducible. I'm still working on figuring out what conditions need to be present for this to occur.
Steps to Reproduce:
1. Opt an OpenShift cluster into on-cluster builds by following these instructions: https://github.com/openshift/machine-config-operator/blob/master/docs/OnClusterBuildInstructions.md
2. Ensure that both the worker and control plane MachineConfigPools are opted in.
Actual results:
Multiple nodes in both the control plane and worker MachineConfigPools are drained and cordoned simultaneously, irrespective of the maxUnavailable value. This is particularly problematic for control plane nodes, since draining more than one control plane node at a time can cause etcd issues, and PDBs (Pod Disruption Budgets) can make the config change take substantially longer or block completely. I've mostly seen this issue affect control plane nodes, but I've also seen it impact both control plane and worker nodes.
Expected results:
I would have expected the new OS image to be rolled out in a similar fashion as new MachineConfigs are rolled out. In other words, a single node (or nodes up to maxUnavailable for non-control-plane nodes) is cordoned, drained, updated, and uncordoned at a time.
Additional info:
I suspect the bug may be someplace within the NodeController since that's the part of the MCO that controls which nodes update at a given time. That said, I've had difficulty reliably reproducing this issue, so finding a root cause could be more involved. This also seems to be mostly confined to the initial opt-in process. Subsequent updates seem to follow the original "rules" more closely.
The current OCB approach is a private, MCO-only API. Making it a public API would introduce the following benefits:
1. Transparent update information linked with the proposed MachineOSUpdater API
2. Follow the MCO migration to openshift/api. We should not have private APIs in the MCO anymore, especially if the feature is publicly used.
3. Consolidate build information into one place that both the MCO and other users can pull from
The general proposal of changes here is as follows:
1. Move global build settings to ControllerConfig object or to this object. These include `finalImagePushSecret` and `finalImagePullspec`
2. Create a MachineOSBuild CRD which will include the Dockerfile field, the MachineConfig to build from, etc.
3. Add these fields to MCP as well. Rather than thinking of this as two sources of truth, you can view the MCP fields as triggers to create or modify an existing MachineOSBuild object. This is similar to the mechanism that OpenShift BuildV1 uses with its BuildConfigs and Builds CRDs; the BuildConfig houses all of the necessary configs and a new Build is created with those configs. One does not need a BuildConfig to do a build, but one can use a BuildConfig to launch multiple builds.
Making these changes will enforce a system for builds rather than the appendage that the build API is currently to the MCO. The aim here is visibility rather than hidden operations.
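A rough sketch of what such a MachineOSBuild object could look like, purely to illustrate the proposal above (all field names are illustrative; the real schema would be settled as part of the openshift/api work):

apiVersion: machineconfiguration.openshift.io/v1alpha1
kind: MachineOSBuild
metadata:
  name: worker-build-example
spec:
  # Rendered MachineConfig the build starts from (illustrative field)
  machineConfig:
    name: rendered-worker-abcdef0123456789
  # Custom Dockerfile content layered on top of the base OS image (illustrative field)
  containerFile: |
    FROM configs AS final
    RUN echo "example customization"
  # Final image pushspec and push secret; per item 1 these could instead live on ControllerConfig
  finalImagePullspec: quay.io/example/os-image:latest
  finalImagePushSecret:
    name: push-secret

In this model the MCP fields act only as triggers that create or update such an object, mirroring the BuildConfig/Build relationship described above.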
Description of problem:
In a cluster with a pool using the OCB functionality, if we update the imageBuilderType value while an openshift-image-builder pod is building an image, the build fails. It can fail in 2 ways:
1. The running pod that is building the image is removed, and what we get is a failed build reporting "Error (BuildPodDeleted)".
2. The machine-os-builder pod is restarted but the build pod is not removed. Then the build is never removed.
Version-Release number of selected component (if applicable):
$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.14.0-0.nightly-2023-09-12-195514   True        False         154m    Cluster version is 4.14.0-0.nightly-2023-09-12-195514
How reproducible:
Steps to Reproduce:
1. Create the needed resources to make the OCB functionality work (the on-cluster-build-config configmap, the secrets and the imageSpec). We reproduced it using imageBuilderType="":
   oc patch cm/on-cluster-build-config -n openshift-machine-config-operator -p '{"data":{"imageBuilderType": ""}}'
2. Create an infra pool and label it so that it can use the OCB functionality:
   apiVersion: machineconfiguration.openshift.io/v1
   kind: MachineConfigPool
   metadata:
     name: infra
   spec:
     machineConfigSelector:
       matchExpressions:
         - {key: machineconfiguration.openshift.io/role, operator: In, values: [worker,infra]}
     nodeSelector:
       matchLabels:
         node-role.kubernetes.io/infra: ""
   oc label mcp/infra machineconfiguration.openshift.io/layering-enabled=
3. Wait until the triggered build has finished.
4. Create a new MC to trigger a new build. This one, for example:
   kind: MachineConfig
   metadata:
     labels:
       machineconfiguration.openshift.io/role: worker
     name: test-machine-config
   spec:
     config:
       ignition:
         version: 3.1.0
       storage:
         files:
           - contents:
               source: data:text/plain;charset=utf-8;base64,dGVzdA==
             filesystem: root
             mode: 420
             path: /etc/test-file.test
5. Just after a new build pod is created, configure the on-cluster-build-config configmap to use the "custom-pod-builder" imageBuilderType:
   oc patch cm/on-cluster-build-config -n openshift-machine-config-operator -p '{"data":{"imageBuilderType": "custom-pod-builder"}}'
Actual results:
We have observed 2 behaviors after step 5:

1. The machine-os-builder pod is restarted and the build is never removed.

build.build.openshift.io/build-rendered-infra-b2473d404d9ddfa1536d2fb32b54d855   Docker   Dockerfile   Running   10 seconds ago

NAME                                                              READY   STATUS              RESTARTS   AGE
pod/build-rendered-infra-b2473d404d9ddfa1536d2fb32b54d855-build   1/1     Running             0          12s
pod/machine-config-controller-5bdd7b66c5-dl4hh                    2/2     Running             0          90m
pod/machine-config-daemon-5wbw4                                   2/2     Running             0          90m
pod/machine-config-daemon-fqr8x                                   2/2     Running             0          90m
pod/machine-config-daemon-g77zd                                   2/2     Running             0          83m
pod/machine-config-daemon-qzmvv                                   2/2     Running             0          83m
pod/machine-config-daemon-w8mnz                                   2/2     Running             0          90m
pod/machine-config-operator-7dd564556d-mqc5w                      2/2     Running             0          92m
pod/machine-config-server-28lnp                                   1/1     Running             0          89m
pod/machine-config-server-5csjz                                   1/1     Running             0          89m
pod/machine-config-server-fv4vk                                   1/1     Running             0          89m
pod/machine-os-builder-6cfbd8d5d-2f7kd                            0/1     Terminating         0          3m26s
pod/machine-os-builder-6cfbd8d5d-h2ltd                            0/1     ContainerCreating   0          1s

NAME                                                                              TYPE     FROM         STATUS    STARTED          DURATION
build.build.openshift.io/build-rendered-infra-b2473d404d9ddfa1536d2fb32b54d855   Docker   Dockerfile   Running   12 seconds ago

2. The build pod is removed and the build fails with Error (BuildPodDeleted):

NAME                                                                              TYPE     FROM         STATUS    STARTED          DURATION
build.build.openshift.io/build-rendered-infra-b2473d404d9ddfa1536d2fb32b54d855   Docker   Dockerfile   Running   10 seconds ago

NAME                                                              READY   STATUS        RESTARTS   AGE
pod/build-rendered-infra-b2473d404d9ddfa1536d2fb32b54d855-build   1/1     Terminating   0          12s
pod/machine-config-controller-5bdd7b66c5-dl4hh                    2/2     Running       0          159m
pod/machine-config-daemon-5wbw4                                   2/2     Running       0          159m
pod/machine-config-daemon-fqr8x                                   2/2     Running       0          159m
pod/machine-config-daemon-g77zd                                   2/2     Running       8          152m
pod/machine-config-daemon-qzmvv                                   2/2     Running       16         152m
pod/machine-config-daemon-w8mnz                                   2/2     Running       0          159m
pod/machine-config-operator-7dd564556d-mqc5w                      2/2     Running       0          161m
pod/machine-config-server-28lnp                                   1/1     Running       0          159m
pod/machine-config-server-5csjz                                   1/1     Running       0          159m
pod/machine-config-server-fv4vk                                   1/1     Running       0          159m
pod/machine-os-builder-6cfbd8d5d-g62b6                            1/1     Running       0          2m11s

NAME                                                                              TYPE     FROM         STATUS    STARTED          DURATION
build.build.openshift.io/build-rendered-infra-b2473d404d9ddfa1536d2fb32b54d855   Docker   Dockerfile   Running   12 seconds ago
.....
NAME                                                                              TYPE     FROM         STATUS                    STARTED          DURATION
build.build.openshift.io/build-rendered-infra-b2473d404d9ddfa1536d2fb32b54d855   Docker   Dockerfile   Error (BuildPodDeleted)   17 seconds ago   13s
Expected results:
Updating the imageBuilderType while a build is running should not leave the OCB functionality in a broken status.
Additional info:
Must-gather files are provided in the first comment in this ticket.
There are a few situations in which a cluster admin might want to trigger a rebuild of their OS image in addition to situations where cluster state may dictate that we should perform a rebuild. For example, if the custom Dockerfile changes or the machine-config-osimageurl changes, it would be desirable to perform a rebuild in that case. To that end, this particular story covers adding the foundation for a rebuild mechanism in the form of an annotation that can be applied to the target MachineConfigPool. What is out of scope for this story is applying this annotation in response to a change in cluster state (e.g., custom Dockerfile change).
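As a sketch of the intended foundation, triggering a rebuild would amount to something like the following (the annotation key is illustrative only; this story just establishes that an annotation on the target MachineConfigPool starts a rebuild):

$ oc annotate machineconfigpool/worker machineconfiguration.openshift.io/rebuild-requested=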
Done When:
Description of problem:
When an MCP has the on-cluster-build functionality enabled and we configure a valid imageBuilderType in the on-cluster-build configmap, if we later update this configmap with an invalid imageBuilderType, the machine-config ClusterOperator is not degraded.
Version-Release number of selected component (if applicable):
$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.14.0-0.nightly-2023-09-12-195514   True        False         3h56m   Cluster version is 4.14.0-0.nightly-2023-09-12-195514
How reproducible:
Always
Steps to Reproduce:
1. Create a valid OCB configmap, and 2 valid secrets. Like this:
   apiVersion: v1
   data:
     baseImagePullSecretName: mco-global-pull-secret
     finalImagePullspec: quay.io/mcoqe/layering
     finalImagePushSecretName: mco-test-push-secret
     imageBuilderType: ""
   kind: ConfigMap
   metadata:
     creationTimestamp: "2023-09-13T15:10:37Z"
     name: on-cluster-build-config
     namespace: openshift-machine-config-operator
     resourceVersion: "131053"
     uid: 1e0c66de-7a9a-4787-ab98-ce987a846f66
3. Label the "worker" MCP in order to enable the OCB functionality in it.
   $ oc label mcp/worker machineconfiguration.openshift.io/layering-enabled=
4. Wait for the machine-os-builder pod to be created, and for the build to be finished. Just wait for the pods, do not wait for the MCPs to be updated. As soon as the build pod has finished the build, go to step 5.
5. Patch the on-cluster-build configmap to use an invalid imageBuilderType:
   oc patch cm/on-cluster-build-config -n openshift-machine-config-operator -p '{"data":{"imageBuilderType": "fake"}}'
Actual results:
The machine-os-builder pod crashes:
$ oc get pods
NAME                                         READY   STATUS             RESTARTS        AGE
machine-config-controller-5bdd7b66c5-6l7sz   2/2     Running            2 (45m ago)     63m
machine-config-daemon-5ttqh                  2/2     Running            0               63m
machine-config-daemon-l95rj                  2/2     Running            0               63m
machine-config-daemon-swtc6                  2/2     Running            2               57m
machine-config-daemon-vq594                  2/2     Running            2               57m
machine-config-daemon-zrf4f                  2/2     Running            0               63m
machine-config-operator-7dd564556d-9smk4     2/2     Running            2 (45m ago)     65m
machine-config-server-9sxjv                  1/1     Running            0               62m
machine-config-server-m5sdl                  1/1     Running            0               62m
machine-config-server-zb2hr                  1/1     Running            0               62m
machine-os-builder-6cfbd8d5d-t6g8w           0/1     CrashLoopBackOff   6 (3m11s ago)   9m16s

But the machine-config ClusterOperator is not degraded:
$ oc get co machine-config
NAME             VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
machine-config   4.14.0-0.nightly-2023-09-12-195514   True        False         False      63m
Expected results:
The machine-config ClusterOperator should become degraded when an invalid imageBuilderType is configured.
Additional info:
If we configure an invalid imageBuilderType directly (not by patching/editing the configmap), then the machine-config CO is degraded, but when we edit the configmap it is not. A link to the must-gather file is provided in the first comment in this issue.
PS: If we wait for the MCPs to be updated in step 4, the machine-os-builder pod is not restarted with the new "fake" imageBuilderType, but the machine-config CO is not degraded either, though it should be. Does that make sense?
Only start the buildcontroller if the tech preview feature gate is enabled.
Proposed title of this feature request
Add support to OpenShift Telemetry to report the provider that has been added via "platform: external"
What is the nature and description of the request?
There is a new platform we added support for in OpenShift 4.14 called "external", which has been added for partners to enable and support their own integrations with OpenShift rather than requiring Red Hat to develop and support them.
When deploying OpenShift using "platform: external" we don't have the ability right now to identify the provider where the platform has been deployed, which is key for the product team to analyze demand and other metrics.
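For reference, the relevant install-config stanza looks roughly like this (assuming the fields documented for the external platform; the platformName value is what Telemetry would need to surface):

platform:
  external:
    platformName: my-partner-cloud   # provider identifier set by the partner
    cloudControllerManager: External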
Why does the customer need this? (List the business requirements)
OpenShift Product Management needs this information to analyze adoption of these new platforms as well as other metrics specifically for these platforms to help us to make decisions for the product development.
List any affected packages or components.
Telemetry for OpenShift
There is some additional information in the following Slack thread --> https://redhat-internal.slack.com/archives/CEG5ZJQ1G/p1698758270895639
As a result of Hashicorp's license change to BSL, Red Hat OpenShift needs to remove the use of Hashicorp's Terraform from the installer – specifically for IPI deployments which currently use Terraform for setting up the infrastructure.
To avoid an increased support overhead once the license changes at the end of the year, we want to provision PowerVS infrastructure without the use of Terraform.
Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.
Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.
High-level list of items that are out of scope. Initial completion during Refinement status.
Provide any additional context is needed to frame the feature. Initial completion during Refinement status.
Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.
Provide information that needs to be considered and planned so that documentation will meet customer needs. If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.
Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.
As a result of Hashicorp's license change to BSL, Red Hat OpenShift needs to remove the use of Hashicorp's Terraform from the installer – specifically for OpenStack deployments which currently use Terraform for setting up the infrastructure.
To avoid an increased support overhead once the license changes at the end of the year, we want to provision OpenStack infrastructure without the use of Terraform.
Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.
Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.
High-level list of items that are out of scope. Initial completion during Refinement status.
Provide any additional context is needed to frame the feature. Initial completion during Refinement status.
Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.
Provide information that needs to be considered and planned so that documentation will meet customer needs. If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.
Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.
Rebase Installer onto the development branch of cluster-api-provider-openstack to provide CI signal to the CAPO maintainers.
Right now, when trying an installation with this work https://github.com/openshift/installer/pull/7939, the bootstrap machine is not getting deleted. We need to ensure it's gone once bootstrap is finalized.
Essentially: bring the upstream-master branch of shiftstack/cluster-api-provider-openstack under the github.com/openshift organisation.
We need to get CI on this PR in good shape https://github.com/openshift/installer/pull/7939 so we can look for reviews
This is needed to identify if masters are schedulable and to upload the rhos image to glance.
Today we expose two main APIs for HyperShift, namely `HostedCluster` and `NodePool`. We also have metrics to gauge adoption by reporting the # of hosted clusters and nodepools.
But we are still missing other metrics to be able to make correct inference about what we see in the data.
Today we have hypershift_hostedcluster_nodepools as a metric exposed to provide information on the # of nodepools used per cluster.
Additional NodePools metrics such as hypershift_nodepools_size and hypershift_nodepools_available_replicas are available but not ingested in Telemetry.
In addition to knowing how many nodepools per hosted cluster, we would like to expose the knowledge of the nodepool size.
This will help inform our decision making and provide some insights on how the product is being adopted/used.
The main goal of this epic is to show the following NodePools metrics on Telemeter, ideally as recording rules:
The implementation involves creating updates to the following GitHub repositories:
similar PRs:
https://github.com/openshift/hypershift/pull/1544
https://github.com/openshift/cluster-monitoring-operator/pull/1710
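For illustration, a recording rule for one of these metrics could look roughly like this (rule and label names are placeholders, not the final ones):

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: hypershift-nodepool-telemetry   # illustrative name
  namespace: hypershift
spec:
  groups:
    - name: hypershift.nodepools.rules
      rules:
        # aggregate nodepool size so Telemeter only ingests one series per hosted cluster
        - record: cluster:hypershift_nodepools_size:sum
          expr: sum by (namespace, name) (hypershift_nodepools_size)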
As a result of Hashicorp's license change to BSL, Red Hat OpenShift needs to remove the use of Hashicorp's Terraform from the installer - specifically for IPI deployments which currently use Terraform for setting up the infrastructure.
To avoid an increased support overhead once the license changes at the end of the year, we want to provision GCP infrastructure without the use of Terraform.
Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.
Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.
High-level list of items that are out of scope. Initial completion during Refinement status.
Provide any additional context is needed to frame the feature. Initial completion during Refinement status.
Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.
Provide information that needs to be considered and planned so that documentation will meet customer needs. If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.
Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.
When testing GCP using the CAPG provider (not Terraform) in 4.16, it was found that the master VM instances were not distributed across instance groups but were all assigned to the same instance group.
Here is a (partial) CAPG install vs an installation completed using Terraform. The CAPG installation (bfournie-capg-test-5ql8j) has VMs all using us-east1-b:
$ gcloud compute instances list | grep bfournie
bfournie-capg-test-5ql8j-bootstrap      us-east1-b   n2-standard-4   10.0.0.4     34.75.212.239   RUNNING
bfournie-capg-test-5ql8j-master-0       us-east1-b   n2-standard-4   10.0.0.5                     RUNNING
bfournie-capg-test-5ql8j-master-1       us-east1-b   n2-standard-4   10.0.0.6                     RUNNING
bfournie-capg-test-5ql8j-master-2       us-east1-b   n2-standard-4   10.0.0.7                     RUNNING
bfournie-test-tf-pdrsw-master-0         us-east4-a   n2-standard-4   10.0.0.4                     RUNNING
bfournie-test-tf-pdrsw-worker-a-vxjbk   us-east4-a   n2-standard-4   10.0.128.2                   RUNNING
bfournie-test-tf-pdrsw-master-1         us-east4-b   n2-standard-4   10.0.0.3                     RUNNING
bfournie-test-tf-pdrsw-worker-b-ksxfg   us-east4-b   n2-standard-4   10.0.128.3                   RUNNING
bfournie-test-tf-pdrsw-master-2         us-east4-c   n2-standard-4   10.0.0.5                     RUNNING
bfournie-test-tf-pdrsw-worker-c-jpzd5   us-east4-c   n2-standard-4   10.0.128.4                   RUNNING
As a (user persona), I want to be able to:
so that I can achieve
Description of criteria:
Detail about what is specifically not being delivered in the story
This requires/does not require a design proposal.
This requires/does not require a feature gate.
When using the CAPG provider the ServiceAccounts created by the installer for the master and worker nodes do not have the role bindings added correctly.
For example this query shows that the SA for the master nodes has no role bindings.
$ gcloud projects get-iam-policy openshift-dev-installer --flatten="bindings[].members" --format='table(bindings.role)' --filter='bindings.members:bfournie-capg-test-lk5t5-m@openshift-dev-installer.iam.gserviceaccount.com'
$
I want to destroy the load balancers created by capg
Description of criteria:
Detail about what is specifically not being delivered in the story
This requires/does not require a design proposal.
This requires/does not require a feature gate.
CAPG has been updated to 1.6, see https://github.com/kubernetes-sigs/cluster-api-provider-gcp/releases/tag/v1.6.0
We need to pick this up to get the latest features including disk encryption.
Machines for GCP need to be generated for use in CAPI. This will be similar to the AWS machine implementation
(https://github.com/openshift/installer/blob/master/pkg/asset/machines/aws/awsmachines.go) added in
https://github.com/openshift/installer/pull/7771
I want to create the public and private DNS records using one of the CAPI interface SDK hooks.
Description of criteria:
Detail about what is specifically not being delivered in the story
This requires/does not require a design proposal.
This requires/does not require a feature gate.
As an installer user, I want my gcp creds used for install to be used by the CAPG controller when provisioning resources.
Acceptance Criteria:
This requires/does not require a design proposal.
This requires/does not require a feature gate.
When installing on GCP, I want control-plane (including bootstrap) machines to bootstrap using ignition.
I want bootstrap ignition to be secured so that sensitive data is not publicly available.
Description of criteria:
Destroying bootstrap ignition can be handled separately.
This requires/does not require a design proposal.
This requires/does not require a feature gate.
https://issues.redhat.com/browse/CORS-3217 covers the upstream changes to CAPG needed to add disk encryption. In addition, changes will be needed in the installer to set the GCPMachine disk encryption based on the machine pool settings.
Notes on the required changes are at https://docs.google.com/document/d/1kVgqeCcPOrq4wI5YgcTZKuGJo628dchjqCrIrVDS83w/edit?usp=sharing
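For reference, the machine pool disk-encryption settings that would need to be carried over to the GCPMachine spec are expressed in the install-config roughly like this (a sketch based on the existing GCP IPI fields; the exact mapping is part of this work):

controlPlane:
  platform:
    gcp:
      osDisk:
        encryptionKey:
          kmsKey:
            name: my-key          # values are placeholders
            keyRing: my-keyring
            location: global
            projectID: my-project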
Once the upstream changes from CORS-3217 have been accepted:
I want to create a load balancer to provide split-horizon DNS for the cluster.
Description of criteria:
Detail about what is specifically not being delivered in the story
This requires/does not require a design proposal.
This requires/does not require a feature gate.
Description of problem:
The bootstrap machine never gets a public IP address. When the publish strategy is set to External, the bootstrap machine should have a public IP address.
Version-Release number of selected component (if applicable):
How reproducible:
always
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Create the GCP Infrastructure controller in /pkg/clusterapi/system.go.
It will be based on the AWS controller in that file, which was added in https://github.com/openshift/installer/pull/7630.
I want the installer to create the service accounts that would be assigned to control plane and compute machines, similar to what is done in terraform now.
Description of criteria:
Detail about what is specifically not being delivered in the story
This requires/does not require a design proposal.
This requires/does not require a feature gate.
When a GCP cluster is created using CAPI, upon destroy the addresses associated with the apiserver LoadBalancer are not removed. For example here are addresses left over after previous installations
$ gcloud compute addresses list --uri | grep bfournie
https://www.googleapis.com/compute/v1/projects/openshift-dev-installer/global/addresses/bfournie-capg-test-27kzq-apiserver
https://www.googleapis.com/compute/v1/projects/openshift-dev-installer/global/addresses/bfournie-capg-test-6jrwz-apiserver
https://www.googleapis.com/compute/v1/projects/openshift-dev-installer/global/addresses/bfournie-capg-test-gn6g7-apiserver
https://www.googleapis.com/compute/v1/projects/openshift-dev-installer/global/addresses/bfournie-capg-test-h96j2-apiserver
https://www.googleapis.com/compute/v1/projects/openshift-dev-installer/global/addresses/bfournie-capg-test-k7fdj-apiserver
https://www.googleapis.com/compute/v1/projects/openshift-dev-installer/global/addresses/bfournie-capg-test-nh4z5-apiserver
https://www.googleapis.com/compute/v1/projects/openshift-dev-installer/global/addresses/bfournie-capg-test-nls2h-apiserver
https://www.googleapis.com/compute/v1/projects/openshift-dev-installer/global/addresses/bfournie-capg-test-qrhmr-apiserver
Here is one of the addresses:
$ gcloud compute addresses describe https://www.googleapis.com/compute/v1/projects/openshift-dev-installer/global/addresses/bfournie-capg-test-27kzq-apiserver
address: 34.107.255.76
addressType: EXTERNAL
creationTimestamp: '2024-04-15T15:17:56.626-07:00'
description: ''
id: '2697572183218067835'
ipVersion: IPV4
kind: compute#address
labelFingerprint: 42WmSpB8rSM=
name: bfournie-capg-test-27kzq-apiserver
networkTier: PREMIUM
selfLink: https://www.googleapis.com/compute/v1/projects/openshift-dev-installer/global/addresses/bfournie-capg-test-27kzq-apiserver
status: RESERVED

[bfournie@bfournie installer-patrick-new]$ gcloud compute addresses describe https://www.googleapis.com/compute/v1/projects/openshift-dev-installer/global/addresses/bfournie-capg-test-6jrwz-apiserver
address: 34.149.208.133
addressType: EXTERNAL
creationTimestamp: '2024-03-27T09:35:00.607-07:00'
description: ''
id: '1650865645042660443'
ipVersion: IPV4
kind: compute#address
labelFingerprint: 42WmSpB8rSM=
name: bfournie-capg-test-6jrwz-apiserver
networkTier: PREMIUM
selfLink: https://www.googleapis.com/compute/v1/projects/openshift-dev-installer/global/addresses/bfournie-capg-test-6jrwz-apiserver
status: RESERVED
Now that https://issues.redhat.com/browse/CORS-3447 provides the ability to override the APIServer instance group to be compatible with MAPI, we need to set the override in the installer when the Internal LoadBalancer is created.
As a (user persona), I want to be able to:
so that I can achieve
Description of criteria:
Detail about what is specifically not being delivered in the story
This requires/does not require a design proposal.
This requires/does not require a feature gate.
Once e2e tests are passing, include the GCP CAPI installer in the Tech Preview feature set.
similar to https://github.com/openshift/api/pull/1880
When GCP workers are created, they are not able to pull ignition over the internal subnet because it is not allowed by the firewall rules created by CAPG. The allow-<infraID>-cluster rule allows all TCP traffic with tags for <infraID>-node and <infraID>-control-plane, but the workers that are created have the tag <infraID>-worker.
We need to either add the worker tags to this firewall rule or add node tags to the worker. We should decide on a general use of CAPG firewall rules.
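For example, the first option could be as simple as adding the worker tag to the existing rule (illustrative gcloud invocation; <infraID> is a placeholder):

$ gcloud compute firewall-rules update allow-<infraID>-cluster \
    --target-tags=<infraID>-node,<infraID>-control-plane,<infraID>-worker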
As a result of Hashicorp's license change to BSL, Red Hat OpenShift needs to remove the use of Hashicorp's Terraform from the installer – specifically for IPI deployments which currently use Terraform for setting up the infrastructure.
To avoid an increased support overhead once the license changes at the end of the year, we want to provision AWS infrastructure without the use of Terraform.
Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.
Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.
High-level list of items that are out of scope. Initial completion during Refinement status.
Provide any additional context is needed to frame the feature. Initial completion during Refinement status.
Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.
Provide information that needs to be considered and planned so that documentation will meet customer needs. If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.
Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.
Use cases to ensure:
As a (user persona), I want to be able to:
Description of criteria:
Detail about what is specifically not being delivered in the story
This requires/does not require a design proposal.
This requires/does not require a feature gate.
iam role is correctly attached to control plane node when installconfig.controlPlane.platform.aws.iamRole is specified
As a (user persona), I want to be able to:
Description of criteria:
Detail about what is specifically not being delivered in the story
This requires/does not require a design proposal.
This requires/does not require a feature gate.
when installconfig.controlPlane.platform.aws.metadataService is set, the metadataservice is correctly configured for control plane machines
security group ids are added to control plane nodes when installconfig.controlPlane.platform.aws.additionalSecurityGroupIDs is specified
As a (user persona), I want to be able to:
so that I can achieve
Description of criteria:
Detail about what is specifically not being delivered in the story
This requires/does not require a design proposal.
This requires/does not require a feature gate.
CAPA shows
I0312 18:00:13.602972 109 s3.go:220] "Deleting S3 object" controller="awsmachine" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AWSMachine" AWSMachine="openshift-cluster-api-guests/rdossant-installer-03-jjf6b-master-2" namespace="openshift-cluster-api-guests" name="rdossant-installer-03-jjf6b-master-2" reconcileID="9cda22be-5acd-4670-840f-8a6708437385" machine="openshift-cluster-api-guests/rdossant-installer-03-jjf6b-master-2" cluster="openshift-cluster-api-guests/rdossant-installer-03-jjf6b" bucket="openshift-bootstrap-data-rdossant-installer-03-jjf6b" key="control-plane/rdossant-installer-03-jjf6b-master-2"
I0312 18:00:13.608919 109 s3.go:220] "Deleting S3 object" controller="awsmachine" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AWSMachine" AWSMachine="openshift-cluster-api-guests/rdossant-installer-03-jjf6b-master-0" namespace="openshift-cluster-api-guests" name="rdossant-installer-03-jjf6b-master-0" reconcileID="1ed0ad52-ffc1-4b62-97e4-876f8e8c3242" machine="openshift-cluster-api-guests/rdossant-installer-03-jjf6b-master-0" cluster="openshift-cluster-api-guests/rdossant-installer-03-jjf6b" bucket="openshift-bootstrap-data-rdossant-installer-03-jjf6b" key="control-plane/rdossant-installer-03-jjf6b-master-0"
[...]
E0312 18:04:25.282967 109 awsmachine_controller.go:576] "controllers/AWSMachine: unable to delete secrets" err=<
    deleting bootstrap data object: deleting S3 object: NotFound: Not Found
    status code: 404, request id: 9QYY3QSWKBBDZ7R8, host id: 2f3HawFbPheaptP9E+WRbu3fhEXTMwyZQ1DBPGBG7qlg74ssQR0XISM4OSlxvrn59GeFREtN4hp9C+S5LgQD2g==
 >
E0312 18:04:25.284197 109 controller.go:329] "Reconciler error" err=<
    deleting bootstrap data object: deleting S3 object: NotFound: Not Found
    status code: 404, request id: 9QYY3QSWKBBDZ7R8, host id: 2f3HawFbPheaptP9E+WRbu3fhEXTMwyZQ1DBPGBG7qlg74ssQR0XISM4OSlxvrn59GeFREtN4hp9C+S5LgQD2g==
 > controller="awsmachine" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AWSMachine" AWSMachine="openshift-cluster-api-guests/rdossant-installer-03-jjf6b-master-0" namespace="openshift-cluster-api-guests" name="rdossant-installer-03-jjf6b-master-0" reconcileID="7fac94a1-772a-4c7b-a631-5ef7fc015d5b"
E0312 18:04:25.286152 109 awsmachine_controller.go:576] "controllers/AWSMachine: unable to delete secrets" err=<
    deleting bootstrap data object: deleting S3 object: NotFound: Not Found
    status code: 404, request id: 9QYPFY0EQBM42VYH, host id: nJZakAhLrbZ1xrSNX3tyk0IKmMgFjsjMSs/D9nzci90GfRNNfUnvwZTbcaUBQYiuSlY5+aysCuwejWpvi8FmGusbQCK1Qtjr9pjqDQfxzY4=
 >
E0312 18:04:25.287353 109 controller.go:329] "Reconciler error" err=<
    deleting bootstrap data object: deleting S3 object: NotFound: Not Found
    status code: 404, request id: 9QYPFY0EQBM42VYH, host id: nJZakAhLrbZ1xrSNX3tyk0IKmMgFjsjMSs/D9nzci90GfRNNfUnvwZTbcaUBQYiuSlY5+aysCuwejWpvi8FmGusbQCK1Qtjr9pjqDQfxzY4=
 > controller="awsmachine" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AWSMachine" AWSMachine="openshift-cluster-api-guests/rdossant-installer-03-jjf6b-master-2" namespace="openshift-cluster-api-guests" name="rdossant-installer-03-jjf6b-master-2" reconcileID="b6c792ad-5519-48d5-a994-18dda76d8a93"
E0312 18:04:25.291383 109 awsmachine_controller.go:576] "controllers/AWSMachine: unable to delete secrets" err=<
    deleting bootstrap data object: deleting S3 object: NotFound: Not Found
    status code: 404, request id: 9QYGWSJDR35Q4GWX, host id: Qnltg++ia3VapXjtENZOQIwfAxbxfwVLPlC0DwcRBx+L60h52ENiNqMOkvuNwJyYnPxbo/CaawzMT11oIKGO9g==
 >
E0312 18:04:25.292132 109 controller.go:329] "Reconciler error" err=<
    deleting bootstrap data object: deleting S3 object: NotFound: Not Found
    status code: 404, request id: 9QYGWSJDR35Q4GWX, host id: Qnltg++ia3VapXjtENZOQIwfAxbxfwVLPlC0DwcRBx+L60h52ENiNqMOkvuNwJyYnPxbo/CaawzMT11oIKGO9g==
 > controller="awsmachine" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AWSMachine" AWSMachine="openshift-cluster-api-guests/rdossant-installer-03-jjf6b-master-1" namespace="openshift-cluster-api-guests" name="rdossant-installer-03-jjf6b-master-1" reconcileID="92e1f8ed-b31f-4f75-9083-59aad15efe79"
E0312 18:04:25.679859 109 awsmachine_controller.go:576] "controllers/AWSMachine: unable to delete secrets" err=<
    deleting bootstrap data object: deleting S3 object: NotFound: Not Found
    status code: 404, request id: 9QYSBZGYPC7SNJEX, host id: EplmtNQ+RxmbU88z+4App6YEVvniJpyCeMiMZuUegJIMqZgbkA1lmCjHntSLDm4eA857OdhtHsn+zD6AX7uelGIsogzN2ZziiAZXZrbIIEg=
 >
E0312 18:04:25.680663 109 controller.go:329] "Reconciler error" err=<
    deleting bootstrap data object: deleting S3 object: NotFound: Not Found
    status code: 404, request id: 9QYSBZGYPC7SNJEX, host id: EplmtNQ+RxmbU88z+4App6YEVvniJpyCeMiMZuUegJIMqZgbkA1lmCjHntSLDm4eA857OdhtHsn+zD6AX7uelGIsogzN2ZziiAZXZrbIIEg=
 > controller="awsmachine" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AWSMachine" AWSMachine="openshift-cluster-api-guests/rdossant-installer-03-jjf6b-master-0" namespace="openshift-cluster-api-guests" name="rdossant-installer-03-jjf6b-master-0" reconcileID="9e436c67-aca0-409c-9179-0ce4cccce9ad"
This happens even though we are not creating S3 buckets for the master nodes, and it's preventing the bootstrap process from finishing.
Because of the assumption that subnets have auto-assign public IPs turned on, which is how CAPA configures the subnets it creates, supplying your own VPC where that is not the case causes the bootstrap node to not get a public IP and therefore not be able to download the release image (no internet connection).
The bootstrap node needs a public IP because the public subnets are connected only to the internet gateway, which does not provide NAT.
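For comparison, auto-assignment of public IPs can be enabled on a user-supplied public subnet with something like the following (illustrative; the subnet ID is a placeholder, and the eventual fix may instead be for the installer to explicitly request a public IP for the bootstrap machine):

$ aws ec2 modify-subnet-attribute --subnet-id subnet-0123456789abcdef0 --map-public-ip-on-launch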
Destroy all bootstrap resources created through the new non-terraform provider.
Acceptance Criteria:
Description of criteria:
Detail about what is specifically not being delivered in the story
This requires/does not require a design proposal.
This requires/does not require a feature gate.
Issue:
Steps to reproduce:
Actual results:
Expected results:
References:
Goal:
Issue:
Steps to reproduce:
Actual results:
Expected results:
References:
CAPA creates 4 security groups:
$ aws ec2 describe-security-groups --region us-east-2 --filters "Name = group-name, Values = *rdossant*" --query "SecurityGroups[*].[GroupName]" --output text
rdossant-installer-03-tvcbd-lb
rdossant-installer-03-tvcbd-controlplane
rdossant-installer-03-tvcbd-apiserver-lb
rdossant-installer-03-tvcbd-node
Given that the maximum number of SGs in a network interface is 16, we should update the max number validation in the installer:
https://github.com/openshift/installer/blob/master/pkg/types/aws/validation/machinepool.go#L66
Patrick says:
I think we want to update this to cap the user limit to 10 additional security groups:
More context: https://redhat-internal.slack.com/archives/C68TNFWA2/p1697764210634529?thread_ts=1697471429.293929&cid=C68TNFWA2
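For context, the user-supplied security groups being capped are specified in the install-config like this (IDs are placeholders; the validation change would limit how many entries this list may contain):

controlPlane:
  platform:
    aws:
      additionalSecurityGroupIDs:
        - sg-0123456789abcdef0
        - sg-0fedcba9876543210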
when installconfig.platform.aws.userTags is specified, all taggable resources should have the specified user tags.
When using Wavelength zones, networks are cidr'd differently than in vanilla installs. Ensure wavelength support
Private hosted zone and cross-account shared vpc works when installconfig.platform.aws.hostedZone is specified
AWS Local zone support works as expected when local zones are specified
The schema check[1] in the LB reconciliation is hardcoded to check the primary Load Balancer only; as a result, it always filters the subnets by the schema of the primary, ignoring additional Load Balancers ("SecondaryControlPlaneLoadBalancer").
How to reproduce:
Actual results:
Expected results:
References:
bootstrap ignition is not deleted when installconfig.platform.aws.preserveBootstrapIgnition is specified
As an OpenShift admin I want to leverage /dev/fuse in unprivileged containers so that I can integrate cloud storage into OpenShift applications in a secure, efficient, and scalable manner. This approach simplifies application architecture and allows developers to interact with cloud storage as if it were a local filesystem, all while maintaining strong security practices.
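A minimal sketch of how /dev/fuse is typically exposed to an unprivileged pod via a CRI-O device annotation (this assumes the cluster's CRI-O/RuntimeClass configuration permits the annotation; it is an illustration, not the committed design):

apiVersion: v1
kind: Pod
metadata:
  name: fuse-example
  annotations:
    # asks CRI-O to inject the device into the container without privileged mode
    io.kubernetes.cri-o.Devices: "/dev/fuse"
spec:
  containers:
    - name: app
      image: registry.access.redhat.com/ubi9/ubi
      command: ["sleep", "infinity"]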
Epic Goal
Why is this important?
Acceptance Criteria
Done Checklist
As a multiarch CI-focused engineer, I want to create a workflow in `openshift/release` that will enable creating the backend nodes for a cluster installation.
The customer has escalated the following issues where ports don't have TLS support. This Feature request lists all the component ports raised by the customer.
Details here https://docs.google.com/document/d/1zB9vUGB83xlQnoM-ToLUEBtEGszQrC7u-hmhCnrhuXM/edit
Currently, we are serving the metrics as plain HTTP on port 9537; we need to upgrade this endpoint to use TLS.
Related to https://docs.google.com/document/d/1zB9vUGB83xlQnoM-ToLUEBtEGszQrC7u-hmhCnrhuXM/edit
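One way to observe the current plain-HTTP behaviour from a node (illustrative check; assumes curl is available on the host and that 9537 is the port in question):

$ oc debug node/<node-name> -- chroot /host curl -s http://localhost:9537/metrics | head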
Reduce the resource footprint of LVMS with regard to CPU, memory, and image count and size by collapsing the current assortment of containers and deployments into a small number of highly integrated ones.
tbd
The idea was created during a ShiftWeek project. Potential savings / reductions are documented here: https://docs.google.com/presentation/d/1j646hJDVNefFfy1Z7glYx5sNBnSZymDjCbUQVOZJ8CE/edit#slide=id.gdbe984d017_0_0
Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.
Resource Requirements should be added/updated to the documentation in the requirements section.
Interoperability with MicroShift is a challenge, as it allows way more detailed configuration of topolvm with direct access to lvmd.conf
Enable developers to perform actions in a faster and better way than before.
Developers will be able to reduce the time or clicks spent on the UI to perform specific set of actions.
Search->Resources option should show 5 of the recently searched resources across all sessions by the user.
The recently searched resources should be clearly visible and separated from the rest.
Pinning resources capability should be removed.
Getting Started menu on Add page can be collapsed and expanded
Use Cases (Optional):
Developers have to repeatedly search for resources and pin them separately in order to view details of a resource that have seen in the past.
Provide developers with the ability to see the last 5 resources that they have seen in the past so they can quickly view their details without any further actions.
Provides a better user experience for developers when using the console.
As a user, I want the console to remember the resources I have recently searched so that I don't have to type the names of the same resources I use frequently in the Search Page.
The Getting Started menu on the Add page cannot be restored after users click on the "X" symbol once it is hidden.
Users should be able to collapse and expand the Getting Started menu on Add Page.
The current behavior causes confusion.
As of now, the Getting Started section can be hidden and enabled back using a button, but the user can close that button and then it will not show up again. This is confusing for users. So add an expandable section instead of the hide/show button, similar to the Functions List page.
Check Functions list page in Dev perspective for the design
GA support for a generic interface for administrators to define custom reboot/drain suppression rules.
Follow up epic to https://issues.redhat.com/browse/MCO-507, aiming to graduate the feature from tech preview and GA'ing the functionality.
This status was added as a legacy field and isn't currently used for anything, nor should it be there. We'd like to remove this, so:
As a result of Hashicorp's license change to BSL, Red Hat OpenShift needs to remove the use of Hashicorp's Terraform from the installer – specifically for IPI deployments which currently use Terraform for setting up the infrastructure.
To avoid an increased support overhead once the license changes at the end of the year, we want to provision OpenShift on the existing supported providers' infrastructure without the use of Terraform.
This feature will be used to track all the CAPI preparation work that is common for all the supported providers
Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.
Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.
High-level list of items that are out of scope. Initial completion during Refinement status.
Provide any additional context is needed to frame the feature. Initial completion during Refinement status.
Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.
Provide information that needs to be considered and planned so that documentation will meet customer needs. If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.
Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.
PoC & design for running CAPI control plane using binaries.
As a CAPI install user, I want to be able to:
so that I can achieve
Description of criteria:
This is intended to be platform agnostic. If there is a common way for obtaining ip addresses from capi manifests, this should be sufficient. Otherwise, this should enable other platforms to implement their specific logic.
This requires/does not require a design proposal.
This requires/does not require a feature gate.
I want hack/build.sh to embed the kube-apiserver and etcd dependencies in openshift-install without making external network calls so that ART/OSBS can build the installer with CAPI dependencies.
Description of criteria:
This requires/does not require a design proposal.
This requires/does not require a feature gate.
Extract the needed binaries during the Installer container image build and copy them to an appropriate location so they can be used by CAPI.
We need build configs for the binaries so they are built as dependencies of the Installer images in the release pipeline.
Write CAPI manifests to disk during create manifests so that they can be user edited and users can also provide their own set of manifests. In general, we think of manifests as an escape hatch that should be used when a feature is missing from the install config, and users accept the degraded user experience of editing manifests in order to achieve non-install-config-supported functionality.
Acceptance criteria:
Manifests should be generated correctly (and applied correctly to the control plane):
There is some WIP for this, but there are issues with the serialization/deserialization flow when writing the GVK in the manifests.
all capi artifacts should be collected in a hidden dir
they should always be written (regardless of success or fail)
they should be collected in the installer log bundle
Right now the CAPI providers will run indefinitely. We need to stop the installer if installs fail, based on either a timeout or more sophisticated analysis.
We cannot execute cross-compiled binaries, otherwise we get an error saying:
hack/build-cluster-api.sh: line 26: /go/src/github.com/openshift/installer/cluster-api/bin/darwin_amd64/kube-apiserver: cannot execute binary file: Exec format error
The check should be skipped in those cases.
Fit provisioning via the CAPI system into the infrastructure.Provider interface
so that:
We want CAPI providers to be built and embedded in the installer binary when ART builds the installer with OSBS.
Currently CAPI providers are only built when the OPENSHIFT_INSTALL_CLUSTER_API env var is set for hack/build.sh. One way of resolving this would be to always build the providers, but gate the kas/etcd dependencies on the environment variable. Some prior art for that here: https://github.com/openshift/installer/pull/8273
The main issue with enabling CAPI provider builds will be the effect on build time, which is already long-running and somewhat unstable.
The 100.88.0.0/14 IPv4 subnet is currently reserved for the transit switch in OVN-Kubernetes for east west traffic in the OVN Interconnect architecture. We need to make this value configurable so that users can avoid conflicts with their local infrastructure. We need to support this config both prior to installation and post installation (day 2).
This epic will include stories for the upstream ovn-org work, getting that work downstream, an api change, and a cno change to consume the new api
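As a sketch of the day-2 configuration this should enable (the field name is an assumption about the proposed API, not its final shape):

$ oc patch network.operator.openshift.io cluster --type=merge \
    -p '{"spec":{"defaultNetwork":{"ovnKubernetesConfig":{"ipv4":{"internalTransitSwitchSubnet":"100.70.0.0/16"}}}}}'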
The scope of this card is to track the work around getting the required pieces for the transit switch subnet into CNO, which will let users apply custom transit switch subnet configurations on both day 0 (install) and day 2 (post-install).
This card will complement https://issues.redhat.com/browse/SDN-4156
You can create the cluster-bot cluster with Ben's PR and do CNO changes locally and test them out.
After the upstream pr merges it needs to get into openshift ovn-k via a downstream merge
Allow setting custom tags to machines created during the installation of an OpenShift cluster on vSphere.
Just as labeling is important in Kubernetes for organizing API objects and compute workloads (pods/containers), the same is true for the Kube/OCP node VMs running on the underlying infrastructure in any hosted or cloud platform.
Reporting, auditing, troubleshooting and internal organization processes all require ways of easily filtering on and referencing servers by naming, labels or tags. Ensuring appropriate tagging is added to all OCP nodes in VMware ensures those troubleshooting, reporting or auditing can easily identify and filter Openshift node VMs.
For example we can use tags for prod vs. non-prod, VMs that should have backup snapshots vs. those that shouldn't, VMs that fall under certain regulatory constraints, etc.
As an OpenShift administrator, when I run must-gather in a large cluster, the output tends to be GBs in size, which fills up my master node. I want the ability to define the time window during which the problem happened; that way we can have a targeted log gather that takes less space.
Have "--since" and "--until" options in must-gather.
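A sketch of the intended usage (flag names as requested here; the final flag names and formats may differ):

$ oc adm must-gather --since=2h
$ oc adm must-gather --since=2024-01-01T00:00:00Z --until=2024-01-01T06:00:00Z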
Epic Goal*
As an OpenShift administrator, when I run must-gather in a large cluster, the output tends to be GBs in size, which fills up my master node. I want the ability to define the time window during which the problem happened; that way we can have a targeted log gather that takes less space.
Why is this important? (mandatory)
Reduce must-gather size
Scenarios (mandatory)
Must-gather running over a cluster with many logs can produce tens of GBs of data in cases where only a few MBs are needed. Such a huge must-gather archive takes too long to collect and too long to upload, which makes the process impractical.
Dependencies (internal and external) (mandatory)
The default must-gather images need to implement this functionality. Custom images will be asked to implement the same.
Contributing Teams(and contacts) (mandatory)
Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.
Acceptance Criteria (optional)
Must-gather contains only logs from the requested time interval.
Done - Checklist (mandatory)
The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.
Must-gather currently does not allow the customer to limit the amount of data collected, which can lead to collecting tens of GBs even when only a limited set of data (e.g. the last 2 days) is required, in some cases ending with a master node down.
Suggested solution:
Acceptance criteria:
rotated-pod-logs does not interact with since/since-time. This causes inspect (and must-gather) outputs to increase considerably in size, even when users attempt to filter them by using time constraints.
Suggested solution:
Acceptance criteria:
An elevator pitch (value statement) that describes the Feature in a clear, concise way. Complete during New status.
We have decided to remove IPI and UPI support for Alibaba Cloud, which until recently has been in Tech Preview, for the following reasons:
(1) Low customer interest in using OpenShift on Alibaba Cloud
(2) Removal of Terraform usage
(3) MAPI to CAPI migration
(4) CAPI adoption for installation (Day 1) in OpenShift 4.16 and Day 2 (post-OpenShift 4.16)
The observable functionality that the user now has as a result of receiving this feature. Include the anticipated primary user type/persona and which existing features, if any, will be expanded. Complete during New status.
Impacted areas based on CI:
A list of specific needs or objectives that a feature must deliver in order to be considered complete. Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc. Initial completion during Refinement status.
Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed. Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.
Deployment considerations | List applicable specific needs (N/A = not applicable) |
Self-managed, managed, or both | Self-managed |
Classic (standalone cluster) | Classic (standalone) |
Hosted control planes | N/A |
Multi node, Compact (three node), or Single node (SNO), or all | All |
Connected / Restricted Network | All |
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) | All |
Operator compatibility | N/A |
Backport needed (list applicable versions) | N/A |
Other (please specify) | N/A |
Done - Checklist (mandatory)
The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.
Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.
<your text here>
Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.
<your text here>
High-level list of items that are out of scope. Initial completion during Refinement status.
<your text here>
Provide any additional context is needed to frame the feature. Initial completion during Refinement status.
<your text here>
Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.
<your text here>
Provide information that needs to be considered and planned so that documentation will meet customer needs. If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.
<your text here>
Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.
<your text here>
Epic Goal*
Per OCPSTRAT-1042, we are removing Alibaba Cloud IPI/UPI support in 4.16 and removing the code in 4.16. This epic tracks the necessary actions to remove the Alibaba Cloud Disk CSI driver.
Why is this important? (mandatory)
Since we are removing Alibaba Cloud as a supported provider, we need to clean up all the storage artifacts related to it.
Dependencies (internal and external) (mandatory)
Alibaba Cloud IPI/UPI support removal must be confirmed. The IPI/UPI code should be removed.
Contributing Teams(and contacts) (mandatory)
Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.
Done - Checklist (mandatory)
The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.
OCP/Telco Definition of Done
Epic Template descriptions and documentation.
<--- Cut-n-Paste the entire contents of this description into your new Epic --->
An elevator pitch (value statement) that describes the Feature in a clear, concise way. Complete during New status.
We have decided to remove IPI and UPI support for Alibaba Cloud, which until recently has been in Tech Preview, for the following reasons:
(1) Low customer interest in using OpenShift on Alibaba Cloud
(2) Removal of Terraform usage
(3) MAPI to CAPI migration
(4) CAPI adoption for installation (Day 1) in OpenShift 4.16 and Day 2 (post-OpenShift 4.16)
Impacted areas based on CI:
alibaba-cloud-csi-driver/openshift-alibaba-cloud-csi-driver-release-4.16.yaml
alibaba-disk-csi-driver-operator/openshift-alibaba-disk-csi-driver-operator-release-4.16.yaml
cloud-provider-alibaba-cloud/openshift-cloud-provider-alibaba-cloud-release-4.16.yaml
cluster-api-provider-alibaba/openshift-cluster-api-provider-alibaba-release-4.16.yaml
cluster-cloud-controller-manager-operator/openshift-cluster-cloud-controller-manager-operator-release-4.16.yaml
machine-config-operator/openshift-machine-config-operator-release-4.16.yaml
Add support for snapshots into kubevirt-csi when the underlying infra csi driver supports snapshots.
Who | What | Reference |
---|---|---|
DEV | Upstream roadmap issue (or individual upstream PRs) | <link to GitHub Issue> |
DEV | Upstream documentation merged | <link to meaningful PR> |
DEV | gap doc updated | <name sheet and cell> |
DEV | Upgrade consideration | <link to upgrade-related test or design doc> |
DEV | CEE/PX summary presentation | label epic with cee-training and add a <link to your support-facing preso> |
QE | Test plans in Polarion | <link or reference to Polarion> |
QE | Automated tests merged | <link or reference to automated tests> |
DOC | Downstream documentation merged | <link to meaningful PR> |
This task is to add infra snapshot support into upstream kubevirt-csi driver.
This should include unit and functional testing upstream.
This is a new section to the configuring storage for HCP OpenShift Virtualization section in the ACM docs. https://access.redhat.com/documentation/en-us/red_hat_advanced_cluster_management_for_kubernetes/2.10/html/clusters/cluster_mce_overview#configuring-storage-kubevirt
There's a new feature landing in 4.16 that gives HCP OpenShift Virtualization's kubevirt-csi component the ability to perform CSI snapshots. We need this feature to be documented downstream.
The downstream documentation should follow closely to upstream hypershift documentation that is being introduced in this issue, https://issues.redhat.com/browse/CNV-36075
Streamline and secure CLI interactions, improve backend validation processes, and refine user guidance and documentation for the hcp command and related functionalities.
Future:
As a customer, I would like to deploy OpenShift on OpenStack using the IPI workflow, where my control plane would have 3 machines and each machine would use a root volume (a Cinder volume attached to the Nova server) plus an attached ephemeral disk using local storage, which would only be used by etcd.
As this feature will be TechPreview in 4.15, this will only be implemented as a day 2 operation for now. This might or might not change in the future.
We know that etcd requires storage with strong performance capabilities and currently a root volume backed by Ceph has difficulties to provide these capabilities.
Attaching local storage to the machine and mounting it for etcd would solve the performance issues that we saw when customers were using Ceph as the backend for the control plane disks.
Gophercloud already supports creating a server with multiple ephemeral disks:
We need to figure out how we want to address that in CAPO, probably involving a new API; that later would be used in openshift (MAPO, and probably installer).
We'll also have to update the OpenStack Failure Domain in CPMS.
ARO (Azure) has conducted some benchmarks and is now recommending putting etcd on a separate data disk:
https://docs.google.com/document/d/1O_k6_CUyiGAB_30LuJFI6Hl93oEoKQ07q1Y7N2cBJHE/edit
Also interesting thread: https://groups.google.com/u/0/a/redhat.com/g/aos-devel/c/CztJzGWdsSM/m/jsPKZHSRAwAJ
Add it to the OOTB runtime classes to allow access without a custom MC
Explore providing a consistent developer experience within OpenShift and Kubernetes via OpenShift console.
Enable running OpenShift console on Kubernetes clusters and explore the user experience that can be consistent across clusters.
Install and run OpenShift console on Kubernetes
Enable selection of menu items in Dev and Admin perspective
Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.
Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.
Fully working OpenShift console on Kubernetes.
Provide a simple way to install and access OpenShift console on Kubernetes.
None
N/A
Enable running OpenShift console on Kubernetes clusters and explore the user experience that can be consistent across clusters.
Explore providing a consistent developer experience within OpenShift and Kubernetes via OpenShift console.
Out of Scope: The OpenShift console is fully working on Kubernetes.
Description of problem:
When running the console against a non-OpenShift/non-OKD cluster the admin perspective works almost fine, but the developer perspective has a lot of small issues.
Version-Release number of selected component (if applicable):
Almost all versions
How reproducible:
Always
Steps to Reproduce:
You need a non-OpenShift cluster. You can test this with any other Kubernetes distribution. I used kubeadm to create a local cluster, but it takes some time until you can start a Pod locally. (That's the precondition.)
You might want to test kind or k3s instead.
To run the console on your local machine against a non-OpenShift k8s cluster on your local machine you can use this script: https://github.com/jerolimov/openshift/commit/e6fe0924807017ff1320cfc8d82bde23c162eba3
Actual results:
Expected results:
Additional info:
Graduate the new PV access mode ReadWriteOncePod to GA.
Such a PV/PVC can be used only in a single pod on a single node, compared to the traditional ReadWriteOnce access mode, where such a PV/PVC can be used on a single node by many pods.
The customers can start using the new ReadWriteOncePod access mode.
This new mode allows customers to provision and attach a PV with the guarantee that it cannot be used by another pod on the same node.
This new mode should support the same operations as regular ReadWriteOnce PVs therefore it should pass the regression tests. We should also ensure that this PV can't be accessed by another local-to-node pod.
As a user I want to attach a PV to a pod and ensure that it can't be accessed by another local pod.
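For illustration, a standard Kubernetes PVC requesting the new access mode (name, size and storage class are placeholders):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: single-pod-pvc
spec:
  accessModes:
    - ReadWriteOncePod   # only a single pod may use the volume, even on the same node
  resources:
    requests:
      storage: 10Gi
  storageClassName: my-storage-class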
We are getting this feature from upstream as GA. We need to test it and fully support it.
Check that there are no limitations / regressions.
Remove tech preview warning. No additional change.
N/A
Support upstream feature "New RWO access mode " in OCP as GA, i.e. test it and have docs for it.
This is continuation of STOR-1171 (Beta/Tech Preview in 4.14), now we just need to mark it as GA and remove all TechPreview notes from docs.
Currently it's not possible to modify the BMC address of a BareMetalHost after it has been set.
Users need to be able to update the BMC entries in the BareMetalHost objects after they have been created.
There are at least two scenarios in which this causes problems:
1. When deploying baremetal clusters with the Assisted Installer, Metal3 is deployed and BareMetalHost objects are created but with empty BMC entries. We can modify these BMC entries after the cluster has come up to bring the nodes "under management", but if you make a mistake like I did with the URI then it's not possible to fix that mistake.
2. BMC addresses may change over time - whilst it may appear unlikely, network readdressing, or changes in DNS names may mean that BMC addresses will need to be updated.
Currently it's not possible to modify the BMC address of a BareMetalHost after it has been set. It's understood that this was an initial design choice, and there's a webhook that prevents any modification of the object once it has been set for the first time. There are at least two scenarios in which this causes problems:
1. When deploying baremetal clusters with the Assisted Installer, Metal3 is deployed and BareMetalHost objects are created but with empty BMC entries. We can modify these BMC entries after the cluster has come up to bring the nodes "under management", but if you make a mistake like I did with the URI then it's not possible to fix that mistake.
2. BMC addresses may change over time - whilst it may appear unlikely, network readdressing, or changes in DNS names may mean that BMC addresses will need to be updated.
Thanks!
Currently, the baremetal operator does not allow the BMC address of a node to be updated after BMH creation. This ability needs to be added in BMO.
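Once the webhook allows it, updating the address would be an ordinary patch of the BareMetalHost spec; a sketch (host name, namespace and BMC URL are placeholders):

oc patch baremetalhost/worker-0 -n openshift-machine-api --type merge \
  -p '{"spec":{"bmc":{"address":"redfish-virtualmedia://10.0.0.10/redfish/v1/Systems/1"}}}'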
Currently the maximum number of snapshots per volume in vSphere CSI is set to 3 and cannot be configured. Customers find this default limit too low and are asking us to make this setting configurable.
The maximum number of snapshots is 32 per volume.
Customers can override the default (three) value and set it to a custom value.
Make sure we document (or link) the VMware recommendations in terms of performance.
https://kb.vmware.com/s/article/1025279
The setting can be easily configurable by the OCP admin and the configuration is automatically updated. Test that the setting is indeed applied and the maximum number of snapshots per volume is indeed changed.
No change in the default
As an OCP admin I would like to change the maximum number of snapshots per volume.
Anything outside of
The default value can't be overwritten, reconciliation prevents it.
Make sure the customers understand the impact of increasing the number of snapshots per volume.
https://kb.vmware.com/s/article/1025279
Document how to change the value as well as a link to the best practice. Mention that there is a 32 hard limit. Document other limitations if any.
N/A
Epic Goal*
The goal of this epic is to allow admins to configure the maximum number of snapshots per volume in vSphere CSI and find a way to add such an extension to the OCP API.
Possible future candidates:
Why is this important? (mandatory)
Currently the maximum number of snapshots per volume in vSphere CSI is set to 3 and cannot be configured. Customers find this default limit too low and are asking us to make this setting configurable.
The maximum number of snapshots is 32 per volume.
https://kb.vmware.com/s/article/1025279
Scenarios (mandatory)
Provide details for user scenarios including actions to be performed, platform specifications, and user personas.
Dependencies (internal and external) (mandatory)
1) Write OpenShift enhancement (STOR-1759)
2) Extend ClusterCSIDriver API (TechPreview) (STOR-1803)
3) Update vSphere operator to use the new snapshot options (STOR-1804)
4) Promote feature from Tech Preview to Accessible-by-default (STOR-1839)
Contributing Teams(and contacts) (mandatory)
Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.
Acceptance Criteria (optional)
Configure the maximum number of snapshots to a higher value. Check the config has been updated and verify that the maximum number of snapshots per volume maps to the new setting value.
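A sketch of the intended opt-in, assuming the ClusterCSIDriver extension from STOR-1803 exposes a field along these lines (the field name is an assumption):

apiVersion: operator.openshift.io/v1
kind: ClusterCSIDriver
metadata:
  name: csi.vsphere.vmware.com
spec:
  driverConfig:
    driverType: vSphere
    vSphere:
      # assumed field name; raises the per-volume snapshot limit above the default of 3
      globalMaxSnapshotsPerBlockVolume: 10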
Drawbacks or Risk (optional)
Setting this config option to a high value can introduce performance issues. This needs to be documented.
https://kb.vmware.com/s/article/1025279
Done - Checklist (mandatory)
The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.
To support volume provisioning and usage in multi-zonal clusters, the deployment should match certain requirements imposed by CSI driver - https://docs.vmware.com/en/VMware-vSphere-Container-Storage-Plug-in/3.0/vmware-vsphere-csp-getting-started/GUID-4E5F8F65-8845-44EE-9485-426186A5E546.html
The requirements have slightly changed in 3.1.0 - https://docs.vmware.com/en/VMware-vSphere-Container-Storage-Plug-in/3.0/vmware-vsphere-csp-getting-started/GUID-162E7582-723B-4A0F-A937-3ACE82EAFD31.html
We need to ensure that the cluster is compliant with the topology requirements and, if not, vsphere-problem-detector should detect the invalid configuration and create warnings and alerts.
A patch to the vSphere CSI driver was added to accept both the old and new tagging methods in order to avoid regressions. A warning is thrown if the old way is used.
Ensure that if the customer's configuration is not compliant, OCP raises a warning. The goal is to validate the customer's config. Improve VPD to detect these misconfigurations.
Ensure the driver keeps working with the old configuration. Raise a warning if customers are still using the old tagging way.
This feature should be able to detect any anomalies when customers are configuring vSphere topology.
The driver should work with both the new and the old way of defining zones.
This epic is not about testing topology which is already supported.
The vSphere CSI driver changed the way tags are applied to nodes and clusters. This feature ensures that the customer's config matches what is expected by the driver.
This will help customers get the guarantee that their configuration is compliant, especially those who are used to the old way of configuring topology.
Update the topology documentation to match the new driver requirements.
OCP on vSphere
Epic Goal*
To support volume provisioning and usage in multi-zonal clusters, the deployment should match certain requirements imposed by CSI driver - https://docs.vmware.com/en/VMware-vSphere-Container-Storage-Plug-in/3.0/vmware-vsphere-csp-getting-started/GUID-4E5F8F65-8845-44EE-9485-426186A5E546.html
The requirements have slightly changed in 3.1.0 - https://docs.vmware.com/en/VMware-vSphere-Container-Storage-Plug-in/3.0/vmware-vsphere-csp-getting-started/GUID-162E7582-723B-4A0F-A937-3ACE82EAFD31.html
We need to ensure that the cluster is compliant with the topology requirements and, if not, vsphere-problem-detector should detect the invalid configuration and create warnings and alerts.
What is our purpose in implementing this? What new capability will be available to customers?
Why is this important? (mandatory)
This is important because clusters could be misconfigured and it would be tricky to tell whether volume provisioning is failing because of misconfiguration or because of some other error. Having a way to validate the customer's topology will ensure that we have the right topology.
We already have checks in VPD, but we need to enhance those checks to ensure we are compliant.
Scenarios (mandatory)
In 4.15: make cluster Upgradeable=False when:
4.16: mark the cluster degraded in the conditions above.
These scenarios will result in invalid cluster configuration.
Dependencies (internal and external) (mandatory)
Contributing Teams(and contacts) (mandatory)
Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.
Acceptance Criteria (optional)
Provide some (testable) examples of how we will know if we have achieved the epic goal.
Drawbacks or Risk (optional)
Reasons we should consider NOT doing this such as: limited audience for the feature, feature will be superseded by other work that is planned, resulting feature will introduce substantial administrative complexity or user confusion, etc.
Done - Checklist (mandatory)
The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.
Support the SMB CSI driver through an OLM operator as Tech Preview. The SMB CSI driver allows OCP to consume SMB/CIFS storage with a dynamic CSI driver. This enables customers to leverage their existing storage infrastructure with either a Samba or a Microsoft environment.
https://github.com/kubernetes-csi/csi-driver-smb
Customers can start testing connecting OCP to their backends exposing CIFS. This allows consuming net-new volumes or existing data produced outside OCP.
The driver already exists and is under the storage SIG umbrella. We need to make sure the driver meets OCP quality requirements and, if so, develop an operator to deploy and maintain it.
Review and clearly define all driver limitations and corner cases.
Review the different authentication methods.
Windows containers support.
Only the storage class login/password authentication method is in scope. Other methods can be reviewed and considered for GA.
Customers expect to consume storage, and possibly existing data, via SMB/CIFS. As of today vendors' driver support is really limited in terms of CIFS, whereas this protocol is widely used on premise, especially with MS/AD customers.
Need to understand what customers expect in terms of authentication.
How to extend this feature to windows containers.
Document the operator and driver installation, usage capabilities and limitations.
Future: How to manage interoperability with windows containers (not for TP)
Add a new provider here:
https://github.com/openshift/api/blob/6d48d55c0598ec78adacdd847dcf934035ec2e1b/operator/v1/types_csi_cluster_driver.go#L86
It should be "smb.csi.k8s.io", from the CSIDriver manifest:
https://github.com/kubernetes-csi/csi-driver-smb/blob/master/deploy/csi-smb-driver.yaml
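With the new provider constant in place, the operator would own a ClusterCSIDriver object named after the driver; a minimal sketch under that assumption:

apiVersion: operator.openshift.io/v1
kind: ClusterCSIDriver
metadata:
  name: smb.csi.k8s.io   # must match the provider name added to openshift/api
spec:
  managementState: Managed
  logLevel: Normal
  operatorLogLevel: Normal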
Hypershift-provisioned clusters, regardless of the cloud provider support the proposed integration for OLM-managed integration outlined in OCPBU-559 and OCPBU-560.
There is no degradation in capability or coverage of OLM-managed operators support short-lived token authentication on cluster, that are lifecycled via Hypershift.
Currently, Hypershift lacks support for CCO.
Currently, Hypershift will be limited to deploying clusters in which the cluster core operators are leveraging short-lived token authentication exclusively.
If we are successful, no special documentation should be needed for this.
Outcome Overview
Operators on guest clusters can take advantage of the new tokenized authentication workflow that depends on CCO.
Success Criteria
CCO is included in HyperShift and its footprint is minimal while meeting the above outcome.
Expected Results (what, how, when)
Post Completion Review – Actual Results
After completing the work (as determined by the "when" in Expected Results above), list the actual results observed / measured during Post Completion review(s).
Every guest cluster should have a running CCO pod with its kubeconfig attached to it.
Enhancement doc: https://github.com/openshift/enhancements/blob/master/enhancements/cloud-integration/tokenized-auth-enablement-operators-on-cloud.md
Provide a way to automatically recover a cluster with expired etcd server and peer certificates.
A cluster has etcd serving, peer, and serving-metrics certificates that are expired. There should be a way to either trigger certificate rotation or have a process that automatically does the rotation.
Deliver rotation and recovery requirements from OCPSTRAT-714
Epic Goal*
Provide a way to automatically recover a cluster with expired etcd server and peer certs
Why is this important? (mandatory)
Currently, the EtcdCertSigner controller, which is part of the CEO, renews the aforementioned certificates roughly every 3 years. However, if the cluster is offline for a period longer than the certificate's validity, upon restarting the cluster, the controller won't be able to renew the certificates since the operator won't be running at all.
We have scenarios where the customer, partner, or service delivery needs to recover a cluster that is offline, suspended, or shutdown, and as part of the process requires a supported way to force certificate and key rotation or replacement.
See the following doc for more use cases of when such clusters need to be recovered:
https://docs.google.com/document/d/198C4xwi5td_V-yS6w-VtwJtudHONq0tbEmjknfccyR0/edit
Required to enable emergency certificate rotation.
https://issues.redhat.com/browse/API-1613
https://issues.redhat.com/browse/API-1603
Scenarios (mandatory)
A cluster has etcd serving, peer and serving-metrics certificates that are expired. There should be a way to either trigger certificate rotation or have a process that automatically does the rotation.
This does not cover the expiration of etcd-signer certificates at this time.
That will be covered under https://issues.redhat.com/browse/ETCD-445
Dependencies (internal and external) (mandatory)
While the etcd team will implement the automatic recovery for the etcd certificates, other control-plane operators will be handling their own certificate recovery.
Contributing Teams(and contacts) (mandatory)
Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.
Acceptance Criteria (optional)
When an OpenShift etcd cluster that has expired etcd server and peer certs is restarted, it is able to regenerate those certs.
Drawbacks or Risk (optional)
Reasons we should consider NOT doing this such as: limited audience for the feature, feature will be superseded by other work that is planned, resulting feature will introduce substantial administrative complexity or user confusion, etc.
Done - Checklist (mandatory)
The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.
Given the scope creep of the work required to enable an offline cert rotation (or an automated restore), we are going to rely on online cert rotation to ensure that etcd certs don't expire during a cluster shutdown/hibernation.
Slack thread for background:
https://redhat-internal.slack.com/archives/C851TKLLQ/p1712533437483709?thread_ts=1712526244.614259&cid=C851TKLLQ
The estimated maximum shutdown period is 9 months. The refresh rate for the etcd certs can be increased so that there are always, e.g., 10 months left on the cert validity in the worst case, i.e. we shut down right before the controller does its rotation.
The etcd CA must be rotatable both on-demand and automatically when expiry approaches.
Requirements (aka. Acceptance Criteria):
Deliver rotation and recovery requirements from OCPSTRAT-714
Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed. Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.
Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.
<your text here>
Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.
<your text here>
High-level list of items that are out of scope. Initial completion during Refinement status.
<your text here>
Provide any additional context is needed to frame the feature. Initial completion during Refinement status.
<your text here>
Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.
<your text here>
Provide information that needs to be considered and planned so that documentation will meet customer needs. If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.
<your text here>
Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.
<your text here>
Refactoring in ETCD-512 creates a rotation due to incompatible certificate creation processes. We should update the render [1] the same way the controller manages the certificates. Keep in mind that important information is always stored in annotations, which means we also need to update the manifest template itself (just exchanging file bytes isn't enough).
AC:
[1] https://github.com/openshift/cluster-etcd-operator/blob/master/pkg/cmd/render/render.go#L347-L365
This spike explores using the library-go cert rotation utils in the etcd-operator to replace or augment the existing etcdcertsigner controller.
https://github.com/openshift/library-go/blob/master/pkg/operator/certrotation/client_cert_rotation_controller.go
https://github.com/openshift/cluster-etcd-operator/pull/1177
The goal of this spike is to evaluate if the library-go cert rotation util gives us rotation capabilities for the signer cert along with the peer and server certs.
There are a couple of issues to explore with the use of the library-go cert signer controller:
Given the manual rotation of signers for 4.16, we should add an alert that proactively tells customers to run the manual rotation procedure.
AC:
After merging ETCD-512, we need to ensure the certs are regenerated when the signer changes.
Current logic in library-go only changes when the bundle is updated, which is not a sufficient criterion for the etcd rotation.
Some initial take: https://github.com/openshift/library-go/pull/1674
discussion in: https://redhat-internal.slack.com/archives/CC3CZCQHM/p1706889759638639
AC:
Refactoring in ETCD-512 does not clean up certificates that are dynamically generated. Imagine you're recreating all your master nodes every day; we would create new peer/serving/metrics certificates for each node and never clean them up.
We should try to be conservative when cleaning them up, so keep them around for a certain retention period (7-10 days?) after the node went away.
AC:
Testing in ETCD-512 revealed that CEO does not react to changes in the CA bundle or the client certificates.
The current mounts are defined here:
https://github.com/openshift/cluster-etcd-operator/blob/60b7665a26610a095722d3b12b2bb08dcae6965f/manifests/0000_20_etcd-operator_06_deployment.yaml#L90-L106
A simple fix would be to watch the respective resources in a controller and exit the container on changes. This is how we did it with feature gates as well: (https://github.com/openshift/cluster-etcd-operator/blob/60b7665a26610a095722d3b12b2bb08dcae6965f/pkg/operator/starter.go#L174C1-L174C1)
If hot-reload is feasible we should take a look at it, but it seems like a larger refactoring.
AC:
To make better decisions on revision rollouts and on when to re-issue leaf certificates, we should store the revision at which a given CA was rotated.
AC:
As a cluster admin I would like to configure MachineSets to allocate instances from a pre-existing Capacity Reservation in Azure.
I want to create a pool of reserved resources that can be shared between clusters of different teams based on their priorities. I want this pool of resources to remain available for my company and not get allocated to another Azure customer.
Additional background on the feature for considering additional use cases
Machine API support for Azure Capacity Reservation Groups
The customer would like to configure machinesets to allocate instances from pre-existing Capacity Reservation Groups, see Azure docs below
This would allow the customer to create a pool of reserved resources which can be shared between clusters of different priorities. Imagine a test and prod cluster where the demands of the prod cluster suddenly grow. The test cluster is scaled down freeing resources and the prod cluster is scaled up with assurances that those resources remain available, not allocated to another Azure customer.
MAPI/CAPI Azure
In this use case, there's no immediate need for install time support to designate reserved capacity group for control plane resources, however we should consider whether that's desirable from a completeness standpoint. We should also consider whether or not this should be added as an attribute for the installconfig compute machinepool or whether altering generated MachineSet manifests is sufficient, this appears to be a relatively new Azure feature which may or may not see wider customer demand. This customer's primary use case is centered around scaling up and down existing clusters, however others may have different uses for this feature.
Additional background on the feature for considering additional use cases
Update the vendoring in the cluster-control-plane-machine-set-operator repository for the capacity reservation changes.
As a developer I want to add support for capacity reservation groups in openshift/machine-api-provider-azure so that Azure VMs can be associated with a capacity reservation group during VM creation.
CFE-1036 adds support for Capacity Reservation in upstream CAPZ (PR). The same support needs to be added downstream as well. Please refer to the upstream PR when adding support downstream.
As a developer I want to add the webhook validation for the "CapacityReservationGroupID" field of "AzureMachineProviderSpec" in openshift/machine-api-operator so that Azure capacity reservation can be supported.
CFE-1036 adds support for Capacity Reservation in upstream CAPZ (PR). The same support needs to be added downstream as well. Please refer to the upstream PR when adding support downstream.
Slack discussion regarding the same: https://redhat-internal.slack.com/archives/CBZHF4DHC/p1713249202780119?thread_ts=1712582367.529309&cid=CBZHF4DHC_
As a developer I want to add the field "CapacityReservationGroupID" to "AzureMachineProviderSpec" in openshift/api so that Azure capacity reservation can be supported.
CFE-1036 adds support for Capacity Reservation in upstream CAPZ (PR). The same support needs to be added downstream as well. Please refer to the upstream PR when adding support downstream.
Slack discussion regarding the same: https://redhat-internal.slack.com/archives/CBZHF4DHC/p1713249202780119?thread_ts=1712582367.529309&cid=CBZHF4DHC_
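A hedged sketch of how a MachineSet providerSpec might reference a pre-existing reservation once the new field lands (the resource ID is a placeholder):

# fragment of a MachineSet template, kind AzureMachineProviderSpec
providerSpec:
  value:
    apiVersion: machine.openshift.io/v1beta1
    kind: AzureMachineProviderSpec
    # new field proposed in this epic; the value below is a placeholder resource ID
    capacityReservationGroupID: /subscriptions/<subscription-id>/resourceGroups/<rg>/providers/Microsoft.Compute/capacityReservationGroups/<group-name>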
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-autoscaler-priority-expander
  namespace: kube-system
data:
  priorities: |-
    10:
      - .*t2\.large.*
      - .*t3\.large.*
    50:
      - .*m4\.4xlarge.*
https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/expander/priority/readme.md
As a user, I would like to specify different expanders for the cluster autoscaler. Having an API field on the ClusterAutoscaler resource to specify the expanders and their precedence will solve this issue for me.
The cluster autoscaler allows users to specify a list of expanders to use when creating new nodes. This list is expressed as a command line flag that takes multiple comma-separated options, eg "--expander=priority,least-waste,random".
We need to add a new field to the ClusterAutoscaler resource that allows users to specify an ordered list of expanders to use. We should limit values in that list to "priority", "least-waste", and "random" only. We should limit the length of the list to 3 items.
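A sketch of what the new field could look like on the ClusterAutoscaler resource (the field name follows this story; the exact schema is an assumption):

apiVersion: autoscaling.openshift.io/v1
kind: ClusterAutoscaler
metadata:
  name: default
spec:
  # ordered list, values limited to priority, least-waste and random, at most 3 entries
  expanders:
    - priority
    - least-waste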
Due to historical reasons, the etcd operator uses a static definition of an 8GB limit. Nowadays, customers with dense cluster configurations regularly reach the limit. OpenShift should provide a mechanism for increasing the database size while maintaining a validated configuration.
This feature aims to provide validated selectable sizes for the etcd database, allowing cluster admins to opt-in for larger sizes.
Since using larger etcd database sizes may impact the defragmentation process, causing more noticeable transaction "pauses", this should be an opt-in configuration.
Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed. Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.
Deployment considerations | List applicable specific needs (N/A = not applicable) |
Self-managed, managed, or both | Yes |
Classic (standalone cluster) | Yes |
Hosted control planes | No |
Multi node, Compact (three node), or Single node (SNO), or all | Yes |
Connected / Restricted Network | Yes |
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) | Yes |
Operator compatibility | N/A |
Backport needed (list applicable versions) | N/A |
UI need (e.g. OpenShift Console, dynamic plugin, OCM) | No |
Other (please specify) | N/A |
Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.
<your text here>
Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.
<your text here>
High-level list of items that are out of scope. Initial completion during Refinement status.
<your text here>
Provide any additional context is needed to frame the feature. Initial completion during Refinement status.
<your text here>
Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.
<your text here>
Provide information that needs to be considered and planned so that documentation will meet customer needs. If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.
<your text here>
Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.
<your text here>
Epic Goal*
Provide a way to change the etcd database size limit from the default 8GB which is non-configurable.
https://etcd.io/docs/v3.5/dev-guide/limit/#storage-size-limit
This will likely be done through the API as a new field in the `cluster` `etcds.operator.openshift.io` CustomResource object. Similar to the etcd latency tuning profiles, which allow a selectable set of configurations, this limit should also be bounded within reasonable limits or levels.
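One possible shape for the opt-in, assuming the new field is added to the cluster etcd operator CR (the field name is an assumption):

apiVersion: operator.openshift.io/v1
kind: Etcd
metadata:
  name: cluster
spec:
  # assumed field; selects a validated database size above the 8GiB default
  backendQuotaGiB: 16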
Why is this important? (mandatory)
Due to historical reasons, the etcd operator uses a static definition of an 8GB limit. Nowadays, customers with dense cluster configurations regularly reach the limit. OpenShift should provide a mechanism for increasing the database size while maintaining a validated configuration.
The 8GB limit was due to historical limitations of the bbolt store, which have since been improved, and there are examples and discussion upstream suggesting that the quota limit can be set higher.
See:
Scenarios (mandatory)
Provide details for user scenarios including actions to be performed, platform specifications, and user personas.
Dependencies (internal and external) (mandatory)
What items must be delivered by other teams/groups to enable delivery of this epic.
Contributing Teams(and contacts) (mandatory)
Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.
Acceptance Criteria (optional)
Provide some (testable) examples of how we will know if we have achieved the epic goal.
Drawbacks or Risk (optional)
Reasons we should consider NOT doing this such as: limited audience for the feature, feature will be superseded by other work that is planned, resulting feature will introduce substantial administrative complexity or user confusion, etc.
Done - Checklist (mandatory)
The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.
This is to track changes in OCP API for custom etcd db size https://issues.redhat.com/browse/ETCD-513
Include the version of the oc binary used, and the logs generated by the command into the must-gather directory when running the oc adm must-gather command.
The version of the oc binary could be included in the oc adm must-gather output [1], and if it differs from the running cluster by 2 or more minor versions, a warning should be shown.
1. Proposed title of this feature request
Include additional info into must-gather directory
2. What is the nature and description of the request?
Include the version of the oc binary used, and the logs generated by the command into the must-gather directory when running the oc adm must-gather command.
The version of the oc binary can help identify whether an issue could be caused by the oc version being different from the version of the cluster.
The logs generated by the oc adm must-gather command will help identify whether some information could not be collected, and also the exact image used.
4. List any affected packages or components.
oc
Include logs generated by the command into the must-gather directory when running the oc adm must-gather command.
An elevator pitch (value statement) that describes the Feature in a clear, concise way. Complete during New status.
The Azure File CSI driver currently lacks cloning and snapshot restore features. The goal of this feature is to support the cloning feature as Technology Preview. This will help support snapshot restore in a future release.
The observable functionality that the user now has as a result of receiving this feature. Include the anticipated primary user type/persona and which existing features, if any, will be expanded. Complete during New status.
As a user I want to easily clone Azure File volume by creating a new PVC with spec.DataSource referencing origin volume.
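For illustration, the standard Kubernetes cloning pattern this feature enables (names, size and storage class are placeholders):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: cloned-azurefile-pvc
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: azurefile-csi
  resources:
    requests:
      storage: 100Gi
  dataSource:
    kind: PersistentVolumeClaim
    name: origin-azurefile-pvc   # existing PVC to clone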
A list of specific needs or objectives that a feature must deliver in order to be considered complete. Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc. Initial completion during Refinement status.
This feature only applies to OCP running on Azure / ARO and File CSI.
The usual CSI cloning CI must pass.
Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed. Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.
Deployment considerations | List applicable specific needs (N/A = not applicable) |
Self-managed, managed, or both | both |
Classic (standalone cluster) | yes |
Hosted control planes | yes |
Multi node, Compact (three node), or Single node (SNO), or all | all although SNO is rare on Azure |
Connected / Restricted Network | both |
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) | x86 |
Operator compatibility | Azure File CSI operator |
Backport needed (list applicable versions) | No |
UI need (e.g. OpenShift Console, dynamic plugin, OCM) | No |
Other (please specify) | ship downstream images built from the forked azcopy |
High-level list of items that are out of scope. Initial completion during Refinement status.
Restoring snapshots is out of scope for now.
Provide any additional context is needed to frame the feature. Initial completion during Refinement status.
<your text here>
Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.
<your text here>
Provide information that needs to be considered and planned so that documentation will meet customer needs. If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.
Update the CSI capability matrix and any language that mentions that Azure File CSI does not support cloning.
Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.
Not impact but benefit Azure / ARO customers.
Epic Goal*
Azure File added support for cloning volumes, which relies on the azcopy command upstream. We need to fork azcopy so we can build and ship downstream images from the forked azcopy. The AWS driver does the same with efs-utils.
Upstream repo: https://github.com/Azure/azure-storage-azcopy
NOTE: using snapshots as a source is currently not supported: https://github.com/kubernetes-sigs/azurefile-csi-driver/blob/7591a06f5f209e4ef780259c1631608b333f2c20/pkg/azurefile/controllerserver.go#L732
Why is this important? (mandatory)
This is required for adding Azure File cloning feature support.
Scenarios (mandatory)
1. As a user I want to easily clone Azure File volume by creating a new PVC with spec.DataSource referencing origin volume.
Dependencies (internal and external) (mandatory)
1) Write OpenShift enhancement (STOR-1757)
2) Fork upstream repo (STOR-1716)
3) Add ART definition for OCP Component (STOR-1755)
4) Use the new image as base image for Azure File driver (STOR-1794)
5) Ensure e2e cloning tests are in CI (STOR-1818)
Contributing Teams(and contacts) (mandatory)
Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.
Acceptance Criteria (optional)
The downstream Azure File driver image must include azcopy, and the cloning feature must be tested.
Drawbacks or Risk (optional)
No risks detected so far.
Done - Checklist (mandatory)
The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.
Once the azure-file clone is supported, we should add a clone test to our pre-submit/periodic CI.
The "pvcDataSource: true" should be added.
Add support for standalone secondary networks for HCP kubevirt.
Advanced multus integration involves the following scenarios
1. Secondary network as single interface for VM
2. Multiple Secondary Networks as multiple interfaces for VM
Users of HCP KubeVirt should be able to create a guest cluster that is completely isolated on a secondary network outside of the default pod network.
A list of specific needs or objectives that a feature must deliver in order to be considered complete. Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc. Initial completion during Refinement status.
<enter general Feature acceptance here>
Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed. Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.
Deployment considerations | List applicable specific needs (N/A = not applicable) |
Self-managed, managed, or both | self-managed |
Classic (standalone cluster) | na |
Hosted control planes | yes |
Multi node, Compact (three node), or Single node (SNO), or all | na |
Connected / Restricted Network | yes |
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) | x86 |
Operator compatibility | na |
Backport needed (list applicable versions) | na |
UI need (e.g. OpenShift Console, dynamic plugin, OCM) | na |
Other (please specify) | na |
ACM documentation should include how to configure secondary standalone networks.
This is a continuation of CNV-33392.
Multus Integration for HCP KubeVirt has three scenarios.
1. Secondary network as single interface for VM
2. Multiple Secondary Networks as multiple interfaces for VM
3. Secondary network + pod network (default for kubelet) as multiple interfaces for VM
Item 3 is the simplest use case because it does not require any additional considerations for ingress and load balancing. This scenario [item 3] is covered by CNV-33392.
Items [1,2] are what this epic is tracking, which we are considering advanced use cases.
Now that the HyperShift KubeVirt provider has a way to expose secondary network services by generating EndpointSlices, we should document it.
Enabling the --attach-default-network option in the "hcp" command tool is also needed.
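A hedged example of the intended CLI flow (only --attach-default-network is named in this story; the other flags and values are assumptions for illustration):

hcp create cluster kubevirt \
  --name my-guest-cluster \
  --node-pool-replicas 3 \
  --attach-default-network=false \
  --additional-network name:default/my-secondary-nad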
When creating a cluster with a secondary network, ingress is broken since the created service cannot reach the VMs' secondary addresses.
We need to create a controller that manually creates and updates the service endpoints so they always point to the VMs' IPs.
When no default pod network is used, we need the LB mirroring that cloud-provider-kubevirt performs to create custom endpoints that map to the secondary network. This will allow the LB service to route to the secondary network. Otherwise, the LB service will not be able to pass traffic if the pod network interface is not attached to the VM.
An elevator pitch (value statement) that describes the Feature in a clear, concise way. Complete during New status.
Workload partitioning currently does not support mutating containers that have CPU limits. That can cause platform/control-plane components to use the wrong core budget (application instead of platform).
Apply workload partitioning also to containers that have CPU limits.
1) Add another annotation to the crio drop in to represent the limit, "cpuquota"
[crio.runtime.workloads.management]
activation_annotation = "target.workload.openshift.io/management"
annotation_prefix = "resources.workload.openshift.io"
resources = { "cpushares" = 0, "cpuquota"=0, "cpuset" = "0-1,52-53" }
2) Modify the admission plugin to take CPU limits and add a cpuquota annotation. The CPU limits would be stripped. Either add a new annotation or extend the existing one, i.e.
annotations:
resources.workload.openshift.io/foo: {"cpushares": 20, "cpuquota": 50}
3) Modify crio to set cfs.quota for the container to the value of cpuquota (a pod-level sketch follows below)
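To make the flow concrete, a sketch of a management pod with a CPU limit that the admission plugin would now handle (values are illustrative and match the annotation example above):

apiVersion: v1
kind: Pod
metadata:
  annotations:
    target.workload.openshift.io/management: '{"effect": "PreferredDuringScheduling"}'
spec:
  containers:
    - name: foo
      resources:
        requests:
          cpu: 20m   # converted to the cpushares annotation today
        limits:
          cpu: 50m   # would be stripped and converted to the new cpuquota annotation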
High-level list of items that are out of scope. Initial completion during Refinement status.
n/a
The original premise was that all OCP pods did not set limits; however, it has been found that at least one klusterlet container does set limits. This will be worked around in 4.8, but in 4.9 a proper solution is required to deal with these exceptions. In addition, the desire in the future is to support different workload types, which would require limit support.
Requested by telco customers
Documentation Considerations
No docs changes expected
No impact on other projects expected
OCP/Telco Definition of Done
Epic Template descriptions and documentation.
<--- Cut-n-Paste the entire contents of this description into your new Epic --->
Workload partitioning currently does not support mutating containers that have CPU limits. The workload partitioning admission plugin will skip over those containers.
The original premise was that all OCP pods did not set limits; however, it has been found that at least one klusterlet container does set limits. This will be worked around in 4.8, but in 4.9 a proper solution is required to deal with these exceptions. In addition, the desire in the future is to support different workload types, which would require limit support.
Summary of changes:
1) Add another annotation to the crio drop in to represent the limit, "cpuquota"
[crio.runtime.workloads.management]
activation_annotation = "target.workload.openshift.io/management"
annotation_prefix = "resources.workload.openshift.io"
resources = { "cpushares" = 0, "cpuquota"=0, "cpuset" = "0-1,52-53" }
2) Modify the admission plugin to take CPU limits and add a cpuquota annotation. The CPU limits would be stripped. Either add a new annotation or extend the existing one, i.e.
annotations:
resources.workload.openshift.io/foo: {"cpushares": 20, "cpuquota": 50}
3) Modify crio to set cfs.quota for the container to the value of cpuquota
Assumption:
4.8 -> 4.9 SNO upgrades are not supported; therefore the upgrade scenario will not have to be dealt with.
The default OpenShift installation on AWS uses multiple IPv4 public IPs which Amazon will start charging for starting in February 2024. As a result, there is a requirement to find an alternative path for OpenShift to reduce the overall cost of a public cluster while this is deployed on AWS public cloud.
Provide an alternative path to reduce the new costs associated with public IPv4 addresses when deploying OpenShift on AWS public cloud.
There is a new path for "external" OpenShift deployments on AWS public cloud where the new costs associated with public IPv4 addresses have a minimum impact on the total cost of the required infrastructure on AWS.
Ongoing discussions on this topic are happening in Slack in the #wg-aws-ipv4-cost-mitigation private channel
Usual documentation will be required in case there are any new user-facing options available as a result of this feature.
Resources which consume public IPv4 addresses: bootstrap, API public NLB, NAT gateways
USER STORY:
DESCRIPTION:
Required:
Nice to have:
...
ACCEPTANCE CRITERIA:
ENGINEERING DETAILS:
-
Networking Definition of Planned
Epic Template descriptions and documentation
Support EgressIP feature with ExternalTrafficPolicy=Local and External2Pod direct routing in OVNKubernetes.
We see a lot of customers using Multi-Egress Gateway with EgressIP.
Currently, connections which reach the pod via the OVN routing gateway are sent back via the EgressIP if it is associated with the specific namespace.
Multiple bugs have been reported by customers:
https://issues.redhat.com/browse/OCPBUGS-16792
https://issues.redhat.com/browse/OCPBUGS-7454
https://issues.redhat.com/browse/OCPBUGS-18400
This has also resulted in RFEs being filed, as it was too complicated to fix via a bug.
https://issues.redhat.com/browse/RFE-4614
https://issues.redhat.com/browse/RFE-3944
This is observed by multiple customers using MetalLB and F5 load balancers. We haven't really tested this combination.
From the initial discussion, it looks like the fix is needed in OVN. We request the team to expedite this fix, given the number of customers hitting it.
Additional information on each of the above items can be found here: Networking Definition of Planned
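For orientation, the EgressIP configuration at play in the bug reports above is the standard OVN-Kubernetes CR, roughly like the following (address and selector values are placeholders):
apiVersion: k8s.ovn.org/v1
kind: EgressIP
metadata:
  name: example-egressip
spec:
  egressIPs:
  - 192.0.2.100
  namespaceSelector:
    matchLabels:
      env: example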
...
...
1. …
1. …
Epic Goal*
Rename “supported but not recommended” to "known issues"
Why is this important? (mandatory)
What are the benefits to the customer or Red Hat? Does it improve security, performance, supportability, etc? Why is work a priority?
Scenarios (mandatory)
Provide details for user scenarios including actions to be performed, platform specifications, and user personas.
Dependencies (internal and external) (mandatory)
What items must be delivered by other teams/groups to enable delivery of this epic.
Contributing Teams(and contacts) (mandatory)
Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.
Acceptance Criteria (optional)
Provide some (testable) examples of how we will know if we have achieved the epic goal.
Drawbacks or Risk (optional)
Reasons we should consider NOT doing this such as: limited audience for the feature, feature will be superseded by other work that is planned, resulting feature will introduce substantial administrative complexity or user confusion, etc.
Done - Checklist (mandatory)
The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.
oc adm upgrade --include-not-recommended today includes a "Supported but not recommended updates:" header when introducing those updates. It also renders the Recommended condition. OTA-1191 is about what to do with the --include-not-recommended flag. This ticket is about addressing the header and possibly about adjusting/contextualizing/something the Recommended condition type.
Here is a current output
$ oc adm upgrade --include-not-recommended Cluster version is 4.10.0-0.nightly-2021-12-23-153012 Upstream: https://raw.githubusercontent.com/wking/cincinnati-graph-data/cincinnati-graph-for-targeted-edge-blocking-demo/cincinnati-graph.json Channel: stable-4.10 Recommended updates: VERSION IMAGE 4.10.0-always-recommended quay.io/openshift-release-dev/ocp-release@sha256:0000000000000000000000000000000000000000000000000000000000000000 Supported but not recommended updates: Version: 4.10.0-conditionally-recommended Image: quay.io/openshift-release-dev/ocp-release@sha256:1111111111111111111111111111111111111111111111111111111111111111 Recommended: Unknown Reason: EvaluationFailed Message: Exposure to SomeChannelThing is unknown due to an evaluation failure: client-side throttling: only 16.3µs has elapsed since the last match call completed for this cluster condition backend; this cached cluster condition request has been queued for later execution On clusters with the channel set to 'buggy', this imaginary bug can happen. https://bug.example.com/b Version: 4.10.0-fc.2 Image: quay.io/openshift-release-dev/ocp-release@sha256:85c6ce1cffe205089c06efe363acb0d369f8df7ad48886f8c309f474007e4faf Recommended: False Reason: ModifiedAWSLoadBalancerServiceTags Message: On AWS clusters for Services in the openshift-ingress namespace… This will not cause issues updating between 4.10 releases. This conditional update is just a demonstration of the conditional update system. https://bugzilla.redhat.com/show_bug.cgi?id=2039339
Definition of done:
After this change the output will look similar to below
$ oc adm upgrade --include-not-recommended Cluster version is 4.10.0-0.nightly-2021-12-23-153012 Upstream: https://raw.githubusercontent.com/wking/cincinnati-graph-data/cincinnati-graph-for-targeted-edge-blocking-demo/cincinnati-graph.json Channel: stable-4.10 Recommended updates: VERSION IMAGE 4.10.0-always-recommended quay.io/openshift-release-dev/ocp-release@sha256:0000000000000000000000000000000000000000000000000000000000000000 Updates with known issues: Version: 4.10.0-conditionally-recommended Image: quay.io/openshift-release-dev/ocp-release@sha256:1111111111111111111111111111111111111111111111111111111111111111 Recommended: Unknown Reason: EvaluationFailed Message: Exposure to SomeChannelThing is unknown due to an evaluation failure: client-side throttling: only 16.3µs has elapsed since the last match call completed for this cluster condition backend; this cached cluster condition request has been queued for later execution On clusters with the channel set to 'buggy', this imaginary bug can happen. https://bug.example.com/b Version: 4.10.0-fc.2 Image: quay.io/openshift-release-dev/ocp-release@sha256:85c6ce1cffe205089c06efe363acb0d369f8df7ad48886f8c309f474007e4faf Recommended: False Reason: ModifiedAWSLoadBalancerServiceTags Message: On AWS clusters for Services in the openshift-ingress namespace… This will not cause issues updating between 4.10 releases. This conditional update is just a demonstration of the conditional update system. https://bugzilla.redhat.com/show_bug.cgi?id=2039339
OCP 4 clusters still maintain pinned boot images. We have numerous clusters installed that have boot media pinned to first boot images as early as 4.1. In the future these boot images may not be certified by the OEM and may fail to boot on updated datacenter or cloud hardware platforms. These "pinned" boot images should be updateable so that customers can avoid this problem and better still scale out nodes with boot media that matches the running cluster version.
Phase 1 provided tech preview support for GCP.
In phase 2, GCP support goes to GA. Support for other IPI footprints is new and tech preview.
This will pick up stories left off from the initial Tech Preview(Phase 1): https://issues.redhat.com/browse/MCO-589
We'll want to add some tests to make sure that managing boot images hasn't broken our existing functionality and that our new feature works. Proposed flow:
1/30/24: Updated based on enhancement discussions
This is the MCO side of the changes. Once the API PR lands, the MSBIC should start watching for the new API object.
It is also important to note that MachineSets that have an ownerReference should not be opted in to this mechanism, even if they are opted in via the API. See discussion here: https://github.com/openshift/enhancements/pull/1496#discussion_r1463386593
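As a quick sanity check of that exclusion, one way to see whether a given MachineSet is owned by another controller (and therefore should be skipped) is to inspect its ownerReferences; the MachineSet name here is a placeholder:
oc get machineset.machine.openshift.io <machineset-name> -n openshift-machine-api -o jsonpath='{.metadata.ownerReferences}'
# A non-empty result means another controller owns this MachineSet, so the boot image management logic should leave it alone.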
Done when:
Update 3/26/24 - Moved ValidatingAdmissionPolicy bit into a separate story as that got a bit more involved.
The MachineSetBootImage Controller will create an alert if there are excessive failures to patch a MachineSet.
This will be implemented via a global knob in the API. This is required in addition to the feature gate, as we expect customers to still want to toggle this feature when it leaves tech preview.
Done when:
A ValidatingAdmissionPolicy should be implemented (via an MCO manifest) for changes to this new API object, so that the feature is not turned on on unsupported platforms. The only platform currently supported is GCP. The ValidatingAdmissionPolicy is kube native and is behind its own feature gate, so this will have to be checked while applying these manifests. Here is what the YAML for these manifests would look like:
---
apiVersion: admissionregistration.k8s.io/v1beta1
kind: ValidatingAdmissionPolicy
metadata:
  name: "managed-bootimages-platform-check"
spec:
  failurePolicy: Fail
  paramKind:
    apiVersion: config.openshift.io/v1
    kind: Infrastructure
  matchConstraints:
    resourceRules:
    - apiGroups: ["operator.openshift.io"]
      apiVersions: ["v1"]
      operations: ["CREATE", "UPDATE"]
      resources: ["MachineConfiguration"]
  validations:
  - expression: "has(object.spec.ManagedBootImages) && param.status.platformStatus.Type == `GCP`"
    message: "This feature is only supported on these platforms: GCP"
---
apiVersion: admissionregistration.k8s.io/v1beta1
kind: ValidatingAdmissionPolicyBinding
metadata:
  name: "managed-bootimages-platform-check-binding"
spec:
  policyName: "managed-bootimages-platform-check"
  validationActions: [Deny]
  paramRef:
    name: "cluster"
    parameterNotFoundAction: "Deny"
In 4.15, before conducting the live migration, CNO will check if a cluster is managed by the SD team. We need to remove this check to support unmanaged clusters.
An elevator pitch (value statement) that describes the Feature in a clear, concise way. Complete during New status.
As a cluster administrator, I would like to migrate my existing (that does not currently use Azure AD Workload Identity) cluster to use Azure AD Workload Identity
The observable functionality that the user now has as a result of receiving this feature. Include the anticipated primary user type/persona and which existing features, if any, will be expanded. Complete during New status.
Many customers would like to migrate to Azure AD Workload Identity with minimal downtime, but they have numerous existing clusters and an aversion to supporting two concurrent operational requirements. Therefore they would like to migrate existing Azure clusters to take advantage of Azure AD Workload Identity in a safe manner after they have been upgraded to a version of OCP supporting that feature (4.14+).
A list of specific needs or objectives that a feature must deliver in order to be considered complete. Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc. Initial completion during Refinement status.
Provide a documented method for migration to Azure AD Workload Identity for OpenShift 4.14+ with minimal downtime, and without customers having to start over with a new cluster using Azure AD Workload Identity and migrating their workloads over. If there is a risk of workload disruption or downtime, we will inform customers of this risk and have them accept it.
Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed. Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.
Deployment considerations | List applicable specific needs (N/A = not applicable) |
Self-managed, managed, or both | Self-managed |
Classic (standalone cluster) | Classic |
Hosted control planes | N/A |
Multi node, Compact (three node), or Single node (SNO), or all | All |
Connected / Restricted Network, or all | All |
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) | All applicable architectures |
Operator compatibility | |
Backport needed (list applicable versions) | 4.14+ |
UI need (e.g. OpenShift Console, dynamic plugin, OCM) | N/A |
Other (please specify) |
Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.
<your text here>
Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.
<your text here>
High-level list of items that are out of scope. Initial completion during Refinement status.
<your text here>
Provide any additional context is needed to frame the feature. Initial completion during Refinement status.
<your text here>
Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.
<your text here>
Provide information that needs to be considered and planned so that documentation will meet customer needs. If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.
<your text here>
Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.
<your text here>
Goal
Spike to evaluate if we can provide an automated way to support migration to Azure Managed Identity (preferred), or alternatively a manual method (second option) for customers to perform the migration themselves that is documented and supported, or not at all.
This spike will evaluate, scope the level of effort (sizing), and make recommendation on next steps.
Feature request
Support migration to Azure Managed Identity
Feature description
Many customers would like to migrate to Azure Managed Identity but have numerous existing clusters and an aversion to supporting two concurrent operational requirements. Therefore they would like to migrate existing Azure clusters to Managed Identity in a safe manner after they have been upgraded to a version of OCP supporting that feature (4.14+).
Why?
Provide a uniform operational experience for all clusters running versions which support Azure Managed Identity without having to decommission long running clusters
Other considerations
The cloud-credential-operator repository documentation can be updated for installing and/or migrating a cluster with Azure workload identity integration.
Add support for Johannesburg, South Africa (africa-south1) in GCP
As a user I'm able to deploy OpenShift in Johannesburg, South Africa (africa-south1) in GCP and this region is fully supported
A user can deploy OpenShift in GCP Johannesburg, South Africa (africa-south1) using all the supported installation tools for self-managed customers.
The support of this region is backported to the previous OpenShift EUS release.
Google Cloud has added support for a new region in their public cloud offering and this region needs to be supported for OpenShift deployments as other regions.
The information of the new region needs to be added to the documentation so this is supported.
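For illustration, a minimal install-config.yaml platform stanza selecting the new region might look like the following (the project ID is a placeholder):
platform:
  gcp:
    projectID: example-project
    region: africa-south1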
Add support for Dammam, Saudi Arabia, Middle East (me-central2) region in GCP
As a user I'm able to deploy OpenShift in Dammam, Saudi Arabia, Middle East (me-central2) region in GCP and this region is fully supported
A user can deploy OpenShift in GCP Dammam, Saudi Arabia, Middle East (me-central2) region using all the supported installation tools for self-managed customers.
The support of this region is backported to the previous OpenShift EUS release.
Google Cloud has added support for a new region in their public cloud offering and this region needs to be supported for OpenShift deployments as other regions.
The information of the new region needs to be added to the documentation so this is supported.
For more details: https://redhat-internal.slack.com/archives/C01CQA76KMX/p1710302543729539
Epic Goal*
Drive the technical part of the Kubernetes 1.29 upgrade, including rebasing the openshift/kubernetes repository and coordination across the OpenShift organization to get e2e tests green for the OCP release.
Why is this important? (mandatory)
OpenShift 4.17 cannot be released without Kubernetes 1.30
Scenarios (mandatory)
Dependencies (internal and external) (mandatory)
What items must be delivered by other teams/groups to enable delivery of this epic.
Contributing Teams(and contacts) (mandatory)
Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.
Acceptance Criteria (optional)
Provide some (testable) examples of how we will know if we have achieved the epic goal.
Drawbacks or Risk (optional)
Reasons we should consider NOT doing this such as: limited audience for the feature, feature will be superseded by other work that is planned, resulting feature will introduce substantial administrative complexity or user confusion, etc.
Done - Checklist (mandatory)
The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.
PRs:
Base RHCOS for OpenShift 4.16 on RHEL 9.4 content.
New RHEL minors bring additional features and performance optimizations, and most importantly, new hardware enablement. Customers and partners need to be able to install on the latest hardware.
We want to start looking at testing OpenShift on RHCOS built from RHEL 9.4 packages. While it is currently possible to do most of that testing work using OKD-SCOS, only the x86_64 architecture is available there. We'll publish OCP release images and boot images that include RHCOS builds made out of CentOS Stream packages to prepare for the RHEL 9.4 release.
Summary of the steps:
1. Add manifests for a new rhel-9.4 variant in openshift/os, based on the existing manifests for rhel-9.2 and c9s. We'll use as many C9S packages as possible and re-use existing OpenShift-specific packages.
2. Update the staging pipeline configuration to add a 4.15-9.4 stream.
3. Trigger an RHCOS build and use it to replace an existing image from a nightly OCP release image. Upload this release image to the rhcos-devel namespace on registry.ci.openshift.org
4. Ask for as many people as possible to test this release image. Write an email to aos-devel, publish updated instructions in this Epic, publish the same instructions in the Slack channel.
5. Repeat steps 3 and 4 every two weeks
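For step 3 above, a sketch of the usual way a single image is swapped into an existing nightly payload with oc adm release new; the payload image name (rhel-coreos), pullspecs, and output tag here are illustrative assumptions:
oc adm release new \
  --from-release=registry.ci.openshift.org/ocp/release:4.15.0-0.nightly-example \
  rhel-coreos=quay.io/example/rhcos@sha256:<digest> \
  --to-image=registry.ci.openshift.org/rhcos-devel/ocp-release:4.15-9.4-example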
Discussed this with Michael Nguyen
This would work and ensure that we don't promote a release publicly with 9.4 GA content prior to RHEL 9.4 GA on April 30th
The thing we want to be careful about is checking on the selection of 4.16.0-ec.6, which should be tracked via https://issues.redhat.com/browse/FDN-623, but we could just reach out to the TRT team on #forum-ocp-release-oversight to confirm. This is because we don't want to switch to 9.4 GA content until the EC build that will be made public ahead of 9.4 GA has been selected.
Graduate the etcd tuning profiles feature delivered in https://issues.redhat.com/browse/ETCD-456 to GA
Remove the feature gate flag and make the feature accessible to all customers
Requires fixes to apiserver to handle etcd client retries correctly
Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed. Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.
Deployment considerations | List applicable specific needs (N/A = not applicable) |
Self-managed, managed, or both | yes |
Classic (standalone cluster) | yes |
Hosted control planes | no |
Multi node, Compact (three node), or Single node (SNO), or all | Multi node and compact clusters |
Connected / Restricted Network | Yes |
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) | Yes |
Operator compatibility | N/A |
Backport needed (list applicable versions) | N/A |
UI need (e.g. OpenShift Console, dynamic plugin, OCM) | N/A |
Other (please specify) | N/A |
Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.
<your text here>
Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.
<your text here>
High-level list of items that are out of scope. Initial completion during Refinement status.
<your text here>
Provide any additional context is needed to frame the feature. Initial completion during Refinement status.
<your text here>
Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.
<your text here>
Provide information that needs to be considered and planned so that documentation will meet customer needs. If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.
<your text here>
Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.
<your text here>
Epic Goal*
Graduate the etcd tuning profiles feature delivered in https://issues.redhat.com/browse/ETCD-456 to GA
https://github.com/openshift/api/pull/1538
https://github.com/openshift/enhancements/pull/1447
Why is this important? (mandatory)
Graduating the feature to GA makes it accessible to all customers and not hidden behind a feature gate.
As further outlined in the linked stories the major roadblock for this feature to GA is to ensure that the API server has the necessary capability to configure its etcd client for longer retries on platforms with slower latency profiles. See: https://issues.redhat.com/browse/OCPBUGS-18149
Scenarios (mandatory)
Provide details for user scenarios including actions to be performed, platform specifications, and user personas.
Dependencies (internal and external) (mandatory)
What items must be delivered by other teams/groups to enable delivery of this epic.
Contributing Teams(and contacts) (mandatory)
Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.
Acceptance Criteria (optional)
Once the cluster is installed, we should be able to change the default latency profile on the API to a slower one and verify that etcd is rolled out with the updated leader election and heartbeat timeouts. During this rollout there should be no disruption or unavailability to the control-plane.
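For illustration, this acceptance check could be driven with a patch along the following lines (a sketch only; the exact field name and allowed values come from the linked API PR and may differ in the final GA shape):
oc patch etcd/cluster --type=merge -p '{"spec": {"controlPlaneHardwareSpeed": "Slower"}}'
# Watch the etcd pods roll out with the slower leader-election and heartbeat timeouts,
# and confirm the control plane stays available throughout the rollout.
oc get etcd/cluster -o jsonpath='{.spec.controlPlaneHardwareSpeed}'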
Drawbacks or Risk (optional)
Reasons we should consider NOT doing this such as: limited audience for the feature, feature will be superseded by other work that is planned, resulting feature will introduce substantial administrative complexity or user confusion, etc.
Done - Checklist (mandatory)
The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.
The linked issue was slated to be released in 4.16, but missed the branch date due to a late merge conflict. This bug tracks the backporting of this slated feature to 4.16
Once https://issues.redhat.com/browse/ETCD-473 is done this story will track the work required to move the "operator/v1 etcd spec.hardwareSpeed" field from behind the feature gate to GA.
Graduate the etcd tuning profiles feature delivered in https://issues.redhat.com/browse/ETCD-456 to GA
Remove the feature gate flag and make the feature accessible to all customers
Requires fixes to apiserver to handle etcd client retries correctly
Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed. Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.
Deployment considerations | List applicable specific needs (N/A = not applicable) |
Self-managed, managed, or both | yes |
Classic (standalone cluster) | yes |
Hosted control planes | no |
Multi node, Compact (three node), or Single node (SNO), or all | Multi node and compact clusters |
Connected / Restricted Network | Yes |
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) | Yes |
Operator compatibility | N/A |
Backport needed (list applicable versions) | N/A |
UI need (e.g. OpenShift Console, dynamic plugin, OCM) | N/A |
Other (please specify) | N/A |
Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.
<your text here>
Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.
<your text here>
High-level list of items that are out of scope. Initial completion during Refinement status.
<your text here>
Provide any additional context is needed to frame the feature. Initial completion during Refinement status.
<your text here>
Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.
<your text here>
Provide information that needs to be considered and planned so that documentation will meet customer needs. If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.
<your text here>
Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.
<your text here>
Enable Hosted Control Planes guest clusters to support up to 500 worker nodes. This enable customers to have clusters with large amount of worker nodes.
Max cluster size 250+ worker nodes (mainly about control plane). See XCMSTRAT-371 for additional information.
Service components should not be overwhelmed by additional customer workloads; they should use larger cloud instances when the worker node count exceeds the threshold and smaller cloud instances when it is below the threshold.
Deployment considerations | List applicable specific needs (N/A = not applicable) |
Self-managed, managed, or both | Managed |
Classic (standalone cluster) | N/A |
Hosted control planes | Yes |
Multi node, Compact (three node), or Single node (SNO), or all | N/A |
Connected / Restricted Network | Connected |
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) | x86_64 ARM |
Operator compatibility | N/A |
Backport needed (list applicable versions) | N/A |
UI need (e.g. OpenShift Console, dynamic plugin, OCM) | N/A |
Other (please specify) |
Check with OCM and CAPI requirements to expose larger worker node count.
Description of problem:
HyperShift operator crashes with size tagging enabled and a clustersizingconfiguration that does not have an effects section under the size configuration.
Version-Release number of selected component (if applicable):
4.16.0
How reproducible:
Always
Steps to Reproduce:
1. Install hypershift operator with size tagging enabled 2. Create a hosted cluster with request serving isolation topology 3.
Actual results:
HyperShift operator crashes
Expected results:
Cluster creation succeeds
Additional info:
When a HostedCluster is given a size label, it needs to ensure that request serving nodes exist for that size label and when they do, reschedule request serving pods to the appropriate nodes.
Description of criteria:
This does not require a design proposal.
This does not require a feature gate.
Description of problem
When provisioning an HCP on an MC with size tagging enabled (one that has no existing HCPs), the HCP install can get stuck trying to schedule the kube-apiserver for the hosted control plane. It seems that the placeholder deployment cannot be created because of an empty selector value in the NodeAffinity:
operator-56b7ccb598-4hqz4 operator {"level":"error","ts":"2024-04-23T13:35:42Z","msg":"Reconciler error","controller":"DedicatedServingComponentSchedulerAndSizer","controllerGroup":"hypershift.openshift.io","controllerKind":"HostedCluster","HostedCluster":{"name":"dry3","namespace":"ocm-staging-2aqkcjamdtbcmjtp0lk1il3vo9hfd4n1"},"namespace":"ocm-staging-2aqkcjamdtbcmjtp0lk1il3vo9hfd4n1","name":"dry3","reconcileID":"0772c093-ceef-46c1-a450-6bc8184ba633","error":"failed to ensure placeholder deployment: Deployment.apps \"ocm-staging-2aqkcjamdtbcmjtp0lk1il3vo9hfd4n1-dry3\" is invalid: spec.template.spec.affinity.nodeAffinity.requiredDuringSchedulingIgnoredDuringExecution.nodeSelectorTerms[0].matchExpressions[0].values: Required value: must be specified when `operator` is 'In' or 'NotIn'","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/opt/app-root/src/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:329\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/opt/app-root/src/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/opt/app-root/src/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:227"}
This appears to be due to the In selector of the nodeAffinity being populated with an empty list in the source: https://github.com/openshift/hypershift/blob/main/hypershift-operator/controllers/scheduler/dedicated_request_serving_nodes.go#L704
The value of unavailableNodePairs can be an empty list in the case that no HCPs exist on the cluster already, and therefore no nodes are labelled with both a cluster and a serving pair label. In this case, the empty list is passed in the NodeAffinity and results in the error above
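Reconstructed from the error above, the generated placeholder Deployment presumably carries a node affinity term like the following, which the API server rejects because In/NotIn operators require at least one value (a sketch based on the error message, not the actual generated manifest; the label key is a placeholder):
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: example.openshift.io/serving-pair   # placeholder key
          operator: NotIn
          values: []   # empty because unavailableNodePairs is empty when no HCPs exist yet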
As a service provider, I want to be able to:
so that I can achieve
Description of criteria:
This requires/does not require a design proposal.
This requires/does not require a feature gate.
Description of problem:
When a single placeholder is created for a request serving node (because a previous node existed of the same size), no machineset scale up occurs.
Version-Release number of selected component (if applicable):
4.16.0
How reproducible:
Always
Steps to Reproduce:
1. Setup a management cluster with request serving and autoscaling machinesets 2. Create a HostedCluster of size small 3. Manually scale up the corresponding large machineset (in same osdfleetmanager pair) 4. Increase the HostedCluster size to large
Actual results:
The large machinesets are never scaled up for the cluster
Expected results:
The hosted cluster moves up to large machines for request serving
Additional info:
When a single placeholder is created for request serving nodes, no machineset scale up occurs.
Description of problem:
For large cluster sizes, non-request serving pods such as OVN, etcd, etc. require more resources. Because these pods live on the same nodes as other hosted clusters' non-request serving pods, we can run into resource exhaustion unless the requests for these pods are properly sized.
Version-Release number of selected component (if applicable):
4.16
How reproducible:
Always
Steps to Reproduce:
1. Create large hosted cluster (200+ nodes) 2. Observe resource usage of non-request-serving pods
Actual results:
Resource usage is large, while resource requests remain the same
Expected results:
Resource request grows corresponding to cluster size
Additional info:
As the HyperShift scheduler, I want to be able to:
so that I can achieve
As the HyperShift administrator, I want to be able to:
so that I can achieve
Description of criteria:
This requires a design proposal.
This does not require a feature gate.
As a HyperShift controller in the management plane, I want to be able to:
so that I can achieve
Description of criteria:
This requires a design proposal.
This does not require a feature gate.
Description of problem:
When using cluster size tagging and related request serving node scheduler, if a cluster is deleted while in the middle of resizing its request serving pods, the placeholder deployment that was created for it is not cleaned up.
Version-Release number of selected component (if applicable):
HyperShift main (4.16)
How reproducible:
Always
Steps to Reproduce:
1. Setup a management cluster with request-serving machinesets 2. Create a hosted cluster 3. Add workers to the hosted cluster so that it changes size 4. Delete the hosted cluster after it's tagged with the new size but before nodes for its corresponding placeholder pods are created.
Actual results:
The placeholder deployment is never removed from the `hypershift-request-serving-autosizing-placeholder` namespace
Expected results:
The placeholder deployment is removed when the cluster is deleted.
Additional info:
Allow supporting RWX block PVCs with kubevirt csi when the underlying infra storageclass supports RWX Block
Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed. Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.
Deployment considerations | List applicable specific needs (N/A = not applicable) |
Self-managed, managed, or both | self-managed |
Classic (standalone cluster) | no |
Hosted control planes | yes |
Multi node, Compact (three node), or Single node (SNO), or all | |
Connected / Restricted Network | |
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) | |
Operator compatibility | |
Backport needed (list applicable versions) | |
UI need (e.g. OpenShift Console, dynamic plugin, OCM) | |
Other (please specify) |
This feature should be documented as a capability of HCP OpenShift Virtualization in the ACM HCP docs
Currently, kubevirt-csi is limited to ReadWriteOnce for the pvcs within the guest cluster. This is true even when the infra storageclass supports RWX.
We should expand the ability for the guest cluster to use RWX block when the underlying infra storage class supports RWX block
Currently, kubevirt-csi is limited to ReadWriteOnce for the pvcs within the guest cluster. This is true even when the infra storageclass supports RWX.
We should expand the ability for the guest cluster to use RWX when the underlying infra storage class supports RWX
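For reference, a guest-cluster PVC exercising this capability might look like the following (the storage class name is a placeholder for whatever kubevirt-csi class is provisioned in the guest):
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-block-pvc
spec:
  accessModes:
  - ReadWriteMany
  volumeMode: Block
  resources:
    requests:
      storage: 10Gi
  storageClassName: kubevirt-csi-example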
Move to using the upstream Cluster API (CAPI) in place of the current implementation of the Machine API for standalone Openshift
prerequisite work Goals completed in OCPSTRAT-1122
Complete the design of the Cluster API (CAPI) architecture and build the core operator logic needed for Phase-1, incorporating the assets from different repositories to simplify asset management.
Phases 1 & 2 cover implementing base functionality for CAPI.
There must be no negative effect on customers/users of the MAPI; this API must continue to be accessible to them, though how it is implemented "under the covers", and whether that implementation leverages CAPI, is open.
sets up CAPI ecosystem for vSphere
So far we haven't tested this provider at all. We have to run it and spot if there are any issues with it.
Steps:
Outcome:
The vSphere provider is not present in the downstream operator; someone has to add it there.
This will include adding a tech preview e2e job in release repo and going through the process described here https://github.com/openshift/cluster-capi-operator/blob/main/docs/provideronboarding.md
Enable selective management of HostedCluster resources via annotations, allowing hypershift operators to operate concurrently on a management cluster without interfering with each other. This feature facilitates testing new operator versions or configurations in a controlled manner, ensuring that production workloads remain unaffected.
Deployment considerations | List applicable specific needs (N/A = not applicable) |
Self-managed, managed, or both | |
Classic (standalone cluster) | |
Hosted control planes | Applicable |
Multi node, Compact (three node), or Single node (SNO), or all | |
Connected / Restricted Network | |
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) | |
Operator compatibility | |
Backport needed (list applicable versions) | |
UI need (e.g. OpenShift Console, dynamic plugin, OCM) | |
Other (please specify) |
Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.
<your text here>
Current hypershift operator functionality does not allow for selective management of HostedClusters, limiting the ability to test new operator versions or configurations in a live environment without affecting production workloads.
Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.
<your text here>
Upstream only for now
As a mgmt cluster admin, I want to be able to run multiple hypershift-operators that operate on a disjoint set of HostedClusters.
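A sketch of what opting a HostedCluster into a specific operator instance might look like; the annotation key and value here are purely illustrative, since the actual name is being decided upstream:
apiVersion: hypershift.openshift.io/v1beta1
kind: HostedCluster
metadata:
  name: example
  namespace: clusters
  annotations:
    hypershift.openshift.io/managed-by: ho-canary   # hypothetical annotation and value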
Create an Installer RHEL9-based build for FIPS-enabled OpenShift installations
As a user, I want to enable FIPS while deploying OpenShift on any platform that supports this standard, so the resultant cluster is compliant with FIPS security standards
Provide a dynamically linked build of the Installer for RHEL 9 in the release payload
Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed. Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.
Deployment considerations | List applicable specific needs (N/A = not applicable) |
Self-managed, managed, or both | both |
Classic (standalone cluster) | yes |
Hosted control planes | n/a |
Multi node, Compact (three node), or Single node (SNO), or all | all |
Connected / Restricted Network | all |
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) | all |
Operator compatibility | |
Backport needed (list applicable versions) | no |
UI need (e.g. OpenShift Console, dynamic plugin, OCM) | OCM |
Other (please specify) |
Docs will need to guide the Installer binary to use for FIPS-enabled clusters
As a user with no FIPS requirement, I want to be able to use the same openshift-installer binary on both RHEL 8 and RHEL 9, as well as other common Linux distributions.
Currently to use baremetal IPI, a user must retrieve the openshift-baremetal-installer binary from the release payload. Historically, this was due to it needing to dynamically link to libvirt. This is no longer the case, so we can make baremetal IPI available in the standard openshift-installer binary.
libvirt is not a supported platform for openshift-installer. Nonetheless, it appears in the platforms list (likely inadvertently?) when running the openshift-baremetal-installer binary because the code for it was enabled in order to link against libvirt.
Now that linking against libvirt is no longer required, there is no reason to continue shipping this unsupported code.
We will need to come up with a separate build tag to distinguish between the openshift-baremetal-install (dynamic) and openshift-install (static) builds. Currently these are distinguished by the libvirt tag.
Allow the user to do oc adm release extract --command=openshift-install-fips to obtain an installer binary that is FIPS-ready.
The binary extracted will be the same one as is extracted when the command is openshift-baremetal-install; this name is provided for convenience.
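For example, the extraction described here would presumably look something like the following (the release pullspec is a placeholder):
oc adm release extract --command=openshift-install-fips --to=./bin quay.io/openshift-release-dev/ocp-release:<version>-x86_64
# The resulting ./bin/openshift-install-fips binary is the dynamically linked, FIPS-capable installer.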
As a user, I want to know how to download and use the correct installer binary to install a cluster with FIPS mode enabled. If I use the wrong binary or don't have FIPS enabled, I need instructions at the point I am trying to create a FIPS-mode cluster.
As a customer, I want to be able to:
so that I can achieve
Description of criteria:
Detail about what is specifically not being delivered in the story
This does not require a design proposal.
This does not require a feature gate.
Adding nodes to on-prem clusters in OpenShift is in general a complex task. We have numerous methods and the field keeps adding automation around these methods with a variety of solutions, sometimes unsupported (see "why is this important" below). Making cluster expansions easier will let users add nodes often and fast, leading to a much improved UX.
This feature adds nodes to any on-prem clusters, regardless of their installation method (UPI, IPI, Assisted, Agent), by booting an ISO image that will add the node to the cluster specified by the user, regardless of how the cluster was installed.
1. Create image:
$ export KUBECONFIG=kubeconfig-of-target-cluster
$ oc adm node-image -o agent.iso --network-data=worker-n.nmstate --role=worker
2. Boot image
3. Check progress
$ oc adm add-node
An important goal of this feature is to unify and eliminate some of the existing options to add nodes, aiming to provide much simpler experience (See "Why is this important below"). We have official and field-documented ways to do this, that could be removed once this feature is in place, simplifying the experience, our docs and the maintenance of said official paths:
With this proposed workflow we eliminate the need of using the UPI method in the vast majority of the cases. We also eliminate the field-documented methods that keep popping up trying to solve this in multiple formats, and the need to recommend using MCE to all on-prem users, and finally we add a simpler option for IPI-deployed clusters.
In addition, all the built-in validations in the assisted service would be run, improving the installation success rate and overall UX.
This work would have an initial impact on bare metal, vSphere, Nutanix and platform-agnostic clusters, regardless of how they were installed.
This feature is essential for several reasons. Firstly, it enables easy day2 installation without burdening the user with additional technical knowledge. This simplifies the process of scaling the cluster resources with new nodes, which today is overly complex and presents multiple options (https://docs.openshift.com/container-platform/4.13/post_installation_configuration/cluster-tasks.html#adding-worker-nodes_post-install-cluster-tasks).
Secondly, it establishes a unified experience for expanding clusters, regardless of their installation method. This streamlines the deployment process and enhances user convenience.
Another advantage is the elimination of the requirement to install the Multicluster Engine and Infrastructure Operator, which, besides demanding additional system resources, are overkill for use cases where the user simply wants to add nodes to their existing cluster but isn't managing multiple clusters yet. This results in a more efficient and lightweight cluster scaling experience.
Additionally, in the case of IPI-deployed bare metal clusters, this feature eradicates the need for nodes to have a Baseboard Management Controller (BMC) available, simplifying the expansion of bare metal clusters.
Lastly, this problem is often brought up in the field, where examples of different custom solutions have been put in place by redhatters working with customers trying to solve the problem with custom automations, adding to inconsistent processes to scale clusters.
This feature will solve the problem of cluster expansion for OCI. OCI doesn't have MAPI, and CAPI isn't in the mid-term plans. Mitsubishi shared feedback that makes solving the lack of cluster expansion a requirement for Red Hat and Oracle.
We already have the basic technologies to do this with the assisted-service and the agent-based installer, which already do this work for new clusters, and from which we expect to leverage the foundations for this feature.
Day 2 node addition with agent image.
Yet Another Day 2 Node Addition Commands Proposal
Enable day2 add node using agent-install: AGENT-682
Modify dev-scripts and add a new job that tries to add a node to an existing cluster (using the scripts)
The first and second CSRs pending approval have the node name (hostname) embedded in their specs. monitor-add-nodes should only show CSRs pending approval for a specific node. Currently it shows all CSRs pending approval.
We can assume the hostname is provided in node-config.yaml. This allows us to tie a hostname to an IP address as both are required in node-config.yaml. If not provided, we can make best efforts to find the ip address of the node name using nslookup or an equivalent in golang.
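A rough sketch of how monitor-add-nodes could narrow this down, assuming the hostname appears in the certificate subject of the pending CSRs; the CSR name and hostname are placeholders:
# List CSRs that are still pending (no status set yet)
oc get csr -o json | jq -r '.items[] | select(.status == {}) | .metadata.name'
# Decode one CSR's embedded request to check which node its subject refers to
oc get csr <csr-name> -o jsonpath='{.spec.request}' | base64 -d | openssl req -noout -subject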
Promote secure authentication methods in HCP CLI by deprecating the use of long-term `AWS credentials` and favoring STS assume role flows. This change aims to enhance security by encouraging the adoption of short-term token-based authentication, reducing the risk associated with promoting insecure usage patterns.
Goals (aka. expected user outcomes)
Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed. Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.
Deployment considerations | List applicable specific needs (N/A = not applicable) |
Self-managed, managed, or both | self-managed |
Classic (standalone cluster) | |
Hosted control planes | Yes |
Multi node, Compact (three node), or Single node (SNO), or all | |
Connected / Restricted Network | all |
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) | all |
Operator compatibility | |
Backport needed (list applicable versions) | |
UI need (e.g. OpenShift Console, dynamic plugin, OCM) | CLI |
Other (please specify) |
Admin initiates a setup command via the HCP CLI. The CLI automatically uses STS to assume a temporary role that has permission to create the necessary roles and policies for the new environment. Once the roles are in place, the CLI seamlessly continues the setup process.
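Under the hood this corresponds to a standard STS assume-role exchange rather than long-lived keys; for illustration (the role ARN and session name are placeholders, and the exact flags the hcp CLI will expose are not defined here):
aws sts assume-role \
  --role-arn arn:aws:iam::123456789012:role/example-hcp-installer \
  --role-session-name hcp-cli-setup
# Returns temporary AccessKeyId/SecretAccessKey/SessionToken that expire, instead of long-term credentials.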
Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.
In light of security best practices and evolving compliance requirements, transitioning to STS assume role flows is important. This shift aims to align the HCP CLI with industry standards and security best practices.
Support must be provided to assist customers in transitioning from using long-term AWS credentials to STS. This includes comprehensive documentation and responsive support channels.
Revise documentation sections related to authentication to remove references to long-term credential usage `--aws-creds` / deprecate it and emphasize STS assume role processes. Include examples and common troubleshooting tips.
With this feature MCE will be an additional operator ready to be enabled with the creation of clusters for both the AI SaaS and disconnected installations with Agent.
Currently 4 operators have been enabled for the Assisted Service SaaS create cluster flow: Local Storage Operator (LSO), OpenShift Virtualization (CNV), OpenShift Data Foundation (ODF), Logical Volume Manager (LVM)
The Agent-based installer doesn't leverage this framework yet.
When a user performs the creation of a new OpenShift cluster with the Assisted Installer (SaaS) or with the Agent-based installer (disconnected), provide the option to enable the multicluster engine (MCE) operator.
The cluster deployed can add itself to be managed by MCE.
Deploying an on-prem cluster 0 easily is a key operation for the rest of the OpenShift infrastructure.
While MCE/ACM are strategic in the lifecycle management of OpenShift, including the provisioning of all the clusters, deploying the first cluster where MCE/ACM are hosted, along with other tools supporting the rest of the clusters (GitOps, Quay, log centralisation, monitoring...), must be easy and have a high success rate.
The Assisted Installer and the Agent-based installers cover this gap and must present the option to enable MCE to keep making progress in this direction.
MCE engineering is responsible for adding the appropriate definition as an olm-operator-plugin
See https://github.com/openshift/assisted-service/blob/master/docs/dev/olm-operator-plugins.md for more details
Feature Goal
As an OpenShift administrator I want to deploy OpenShift clusters with Assisted Installer that have the Multicluster Engine Operator (MCE) enabled with support for managing bare metal clusters.
As an OpenShift administrator I want to have the bare metal clusters deployed with the Assisted Installer managed by MCE, i.e. MCE managing its local cluster.
Definition of Done
Feature Origin
MCE is strategic to OpenShift adoption in different scenarios. For Edge use cases it has ZTP to automate the provisioning of OpenShift clusters from a central cluster (hub cluster). MCE is also key for lifecycle management of OpenShift clusters. MCE is also available with the OpenShift subscriptions to every customer.
Additionally MCE will be key in the deployment of Hypershift, so it serves a double strategic purpose.
Lastly, day-2 operations on newly deployed clusters (without the need to manage multiple clusters), can be covered with MCE too.
We expect MCE to enable our customers to grow their OpenShift installation-base more easily and manage their lifecycle.
Reasoning
When enabling the MCE operator in the Assisted Installer we need to add the required storage with the installation to be able to use the Infrastructure Operator to create bare metal/vSphere/Nutanix clusters.
Automated storage configuration workflows
The Infrastructure Operator, a dependency of MCE to deploy bare metal, vSphere and Nutanix clusters, requires storage.
There are multiple scenarios:
Note from planning: alternatively, we can use a new feature in install-config that allows enabling some operators on day 2 and let the user configure it that way
When MCE and a storage operator are selected, enable the Infrastructure Operator.
Red Hat allows the following roles for the system:anonymous user and the system:unauthenticated group:
oc get clusterrolebindings -o json | jq '.items[] | select(.subjects[]?.kind == "Group" and .subjects[]?.name == "system:unauthenticated") | .metadata.name' | uniq
Returns what unauthenticated users can do, which is the following:
"self-access-reviewers"
"system:oauth-token-deleters"
"system:openshift:public-info-viewer"
"system:public-info-viewer"
"system:scope-impersonation"
"system:webhooks"
Customers would like to minimize the allowed permissions to unauthenticated groups and users.
It was determined after initial analysis that the following roles are necessary for OIDC access and version information and will not change
"system:openshift:public-info-viewer"
"system:public-info-viewer"
Workaround available: Gating the access with policy engines
Expected: Minimize the allowed roles for unauthenticated access
Reduce use of cluster-wide permissions for the system:anonymous user and system:unauthenticated group for the following roles (an example removal command is sketched after this list):
"self-access-reviewers"
"system:oauth-token-deleters"
"system:scope-impersonation"
"system:webhooks"
Requirements (aka. Acceptance Criteria):
Customers would like to minimize the allowed permissions to unauthenticated groups and users.
Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed. Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.
Deployment considerations | List applicable specific needs (N/A = not applicable) |
Self-managed, managed, or both | |
Classic (standalone cluster) | Yes |
Hosted control planes | tbd |
Multi node, Compact (three node), or Single node (SNO), or all | |
Connected / Restricted Network | |
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) | |
Operator compatibility | |
Backport needed (list applicable versions) | |
UI need (e.g. OpenShift Console, dynamic plugin, OCM) | |
Other (please specify) |
Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.
<your text here>
Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.
<your text here>
High-level list of items that are out of scope. Initial completion during Refinement status.
It was determined that the following roles will not change
"system:openshift:public-info-viewer"
"system:public-info-viewer"
Provide any additional context is needed to frame the feature. Initial completion during Refinement status.
Many security teams have flagged anonymous access on apiserver as a security risk. Reducing the permissions granted at cluster level helps in hardening access to apiserver.
Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.
This feature should not impact upgrade from previous versions.
This feature will allow enabling the new functionality for fresh installs.
Documentation Considerations
Provide information that needs to be considered and planned so that documentation will meet customer needs. If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.
Impact to existing usecases should be documented
Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.
<your text here>
Red Hat allows the following roles for the system:anonymous user and the system:unauthenticated group:
oc get clusterrolebindings -o json | jq '.items[] | select(.subjects[]?.kind == "Group" and .subjects[]?.name == "system:unauthenticated") | .metadata.name' | uniq
Returns what unauthenticated users can do, which is the following:
"self-access-reviewers"
"system:oauth-token-deleters"
"system:openshift:public-info-viewer"
"system:public-info-viewer"
"system:scope-impersonation"
"system:webhooks"
Customers would like to minimize the allowed permissions to unauthenticated groups and users.
Workaround available: Gating the access with policy engines
Outcome: Minimize the allowed roles for unauthenticated access
Goals of spike:
As a result of Hashicorp's license change to BSL, Red Hat OpenShift needs to remove the use of Hashicorp's Terraform from the installer - specifically for IPI deployments which currently use Terraform for setting up the infrastructure.
To avoid an increased support overhead once the license changes at the end of the year, we want to provision GCP infrastructure without the use of Terraform.
Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.
Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.
High-level list of items that are out of scope. Initial completion during Refinement status.
Provide any additional context is needed to frame the feature. Initial completion during Refinement status.
Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.
Provide information that needs to be considered and planned so that documentation will meet customer needs. If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.
Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.
OCP/Telco Definition of Done
Epic Template descriptions and documentation.
<--- Cut-n-Paste the entire contents of this description into your new Epic --->
In order to SSH to the bootstrap node, we'll need to create a firewall rule. This should be cleaned up with bootstrap destroy.
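A minimal sketch of what that could look like with gcloud (resource names, network, and tag are placeholders; the installer would do the equivalent through the GCP API rather than the CLI):

# Illustrative only: open SSH to the bootstrap node, scoped by a bootstrap-only network tag
gcloud compute firewall-rules create <infra-id>-bootstrap-ssh \
  --network=<infra-id>-network \
  --allow=tcp:22 \
  --target-tags=<infra-id>-bootstrap
# Removed again as part of bootstrap destroy
gcloud compute firewall-rules delete <infra-id>-bootstrap-ssh --quiet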
As a product manager or business owner of OpenShift Lightspeed. I want to track who is using what feature of OLS and WHY. I also want to track the product adoption rate so that I can make decision about the product ( add/remove feature , add new investment )
Enable monitoring of OLS by default when a user installs the OLS operator ---> check the box by default
Users will have the ability to disable the monitoring ----> by unchecking the box
Refer to this Slack conversation: https://redhat-internal.slack.com/archives/C068JAU4Y0P/p1723564267962489
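For reference, the console checkbox maps to the namespace label that cluster monitoring keys on; a hedged example, assuming the operator is installed in the openshift-lightspeed namespace:

# Illustrative: enable operator-recommended cluster monitoring for the namespace
oc label namespace openshift-lightspeed openshift.io/cluster-monitoring=true
# Disabling it again removes the label
oc label namespace openshift-lightspeed openshift.io/cluster-monitoring-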
Description of problem:
When installing the OpenShift Lightspeed operator, cluster monitoring should be enabled by default.
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. Click OpenShift Lightspeed in operator catalog 2. Click Install
Actual results:
"Enable Operator recommended cluster monitoring on this Namespace" checkbox is not selected by default.
Expected results:
"Enable Operator recommended cluster monitoring on this Namespace" checkbox should be selected by default.
Additional info:
This feature is a follow-up to OCPBU-186, image mirroring by tags.
OCPBU-186 implemented the new ImageDigestMirrorSet and ImageTagMirrorSet APIs and their rollout through the MCO.
This feature will update the components using ImageContentSourcePolicy to use ImageDigestMirrorSet.
The list of the components: https://docs.google.com/document/d/11FJPpIYAQLj5EcYiJtbi_bNkAcJa2hCLV63WvoDsrcQ/edit?usp=sharing.
Migrate OpenShift Components to use the new Image Digest Mirror Set (IDMS)
This doc lists the OpenShift components that currently use ICSP: https://docs.google.com/document/d/11FJPpIYAQLj5EcYiJtbi_bNkAcJa2hCLV63WvoDsrcQ/edit?usp=sharing
Plan for ImageDigestMirrorSet Rollout :
Epic: https://issues.redhat.com/browse/OCPNODE-521
4.13: Enable ImageDigestMirrorSet, both ICSP and ImageDigestMirrorSet objects are functional
4.14: Update OpenShift components to use IDMS
4.17: Remove support for ICSP within MCO
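For illustration, an ImageDigestMirrorSet that replaces an equivalent ICSP entry might look like the following (registry names are placeholders):

apiVersion: config.openshift.io/v1
kind: ImageDigestMirrorSet
metadata:
  name: example-idms
spec:
  imageDigestMirrors:
  - source: registry.example.com/team/app
    mirrors:
    - mirror.internal.example.net/team/app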
As an OpenShift developer, I want an --idms-file flag so that I can fetch image info from an alternative mirror if --icsp-file gets deprecated.
Goal:
Support enablement of dual-stack VIPs on existing clusters created as dual-stack but at a time when it was not possible to have both v4 and v6 VIPs at the same time.
Why is this important?
This is a followup to SDN-2213 ("Support dual ipv4 and ipv6 ingress and api VIPs").
We expect that customers with existing dual stack clusters will want to make use of the new dual stack VIPs fixes/enablement, but it's unclear how this will work because we've never supported modifying on-prem networking configuration after initial deployment. Once we have dual stack VIPs enabled, we will need to investigate how to alter the configuration to add VIPs to an existing cluster.
We will need to make changes to the VIP fields in the Infrastructure and/or ControllerConfig objects. Infrastructure would be the first option since that would make all of the fields consistent, but that relies on the ability to change that object and have the changes persist and be propagated to the ControllerConfig. If that's not possible, we may need to make changes just in ControllerConfig.
For epics https://issues.redhat.com/browse/OPNET-14 and https://issues.redhat.com/browse/OPNET-80 we need a mechanism to change configuration values related to our static pods. Today that is not possible because all of the values are put in the status field of the Infrastructure object.
We had previously discussed this as part of https://issues.redhat.com/browse/OPNET-21 because there was speculation that people would want to move from internal LB to external, which would require mutating a value in Infrastructure. In fact, there was a proposal to put that value in the spec directly and skip the status field entirely, but that was discarded because a migration would be needed in that case and we need separate fields to indicate what was requested and what the current state actually is.
There was some followup discussion about that with Joel Speed from the API team (which unfortunately I have not been able to find a record of yet) where it was concluded that if/when we want to modify Infrastructure values we would add them to the Infrastructure spec and when a value was changed it would trigger a reconfiguration of the affected services, after which the status would be updated.
This means we will need new logic in MCO to look at the spec field (currently there are only fields in the status, so spec is ignored completely) and determine the correct behavior when they do not match. This will mean the values in ControllerConfig will not always match those in Infrastructure.Status. That's about as far as the design has gone so far, but we should keep the three use cases we know of (internal/external LB, VIP addition, and DNS record overrides) in mind as we design the underlying functionality to allow mutation of Infrastructure status values.
Depending on how the design works out, we may only track the design phase in this epic and do the implementation as part of one of the other epics. If there is common logic that is needed by all and can be implemented independently we could do that under this epic though.
Infrastructure.Spec will be modified by end-user. CNO needs to validate those changes and if valid, propagate them to Infrastructure.Status
For clusters that are installed fresh on 4.15, o/installer will populate Infrastructure.Spec and Infrastructure.Status based on the install-config. However, for clusters that are upgraded, this code in o/installer will never run.
In order to have a consistent state at upgrade, we will make the CNO propagate Status back to Spec when the cluster is upgraded to OCP 4.15.
As we already did this when introducing multiple VIPs (the API change that created the plural field next to the singular one), all the necessary code scaffolding is already in place.
TLDR: cluster-bootstrap and network-operator fight over the same resource via the `fieldManager` property of k8s objects. We need to take over everything that has been owned by cluster-bootstrap and manage it ourselves.
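As a rough sketch of the mechanism (standard server-side apply behavior, not the actual controller code), a component can take ownership of fields held by another field manager by force-applying under its own manager name:

# Illustrative only: re-apply the manifest under the network-operator's field manager,
# taking ownership of fields currently owned by cluster-bootstrap
oc apply --server-side --field-manager=network-operator --force-conflicts -f <manifest>.yaml
# Current ownership can be inspected via managedFields metadata
oc get infrastructure cluster --show-managed-fields -o yaml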
Tasks to do here
The Agent Based installer is a clean and simple way to install new instances of OpenShift in disconnected environments, guiding the user through the questions and information needed to successfully install an OpenShift cluster. We need to bring this highly useful feature to the IBM Power and IBM zSystem architectures
Agent based installer on Power and zSystems should reflect what is available for x86 today.
Able to use the agent based installer to create OpenShift clusters on Power and zSystem architectures in disconnected environments
Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.
Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.
High-level list of items that are out of scope. Initial completion during Refinement status.
Provide any additional context that is needed to frame the feature. Initial completion during Refinement status.
Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.
Provide information that needs to be considered and planned so that documentation will meet customer needs. Initial completion during Refinement status.
Which other projects and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.
Epic Goal
Why is this important?
Scenarios
1. …
Acceptance Criteria
Dependencies (internal and external)
1. …
Previous Work (Optional):
1. …
Open questions::
1. …
Done Checklist
As the multi-arch engineer, I would like to build an environment and deploy using Agent Based installer, so that I can confirm if the feature works per spec.
Acceptance Criteria
Ensure CSI Stack for Azure is running on management clusters with hosted control planes, allowing customers to associate a cluster as "Infrastructure only" and move the following parts of the stack:
This feature enables customers to run their Azure infrastructure more efficiently and cost-effectively by using hosted control planes and supporting infrastructure without incurring additional charges from Red Hat. Additionally, customers should care most about workloads not the management stack to operate their clusters, this feature gets us closer to this goal.
Non-CSI Stack for Azure-related functionalities are out of scope for this feature.
Workload identity authentication is not covered by this feature - see STOR-1748
This feature is designed to enable customers to run their Azure infrastructure more efficiently and cost-effectively by using HyperShift control planes and supporting infrastructure without incurring additional charges from Red Hat.
Documentation for this feature should provide clear instructions on how to enable the CSI Stack for Azure on management clusters with hosted control planes and associate a cluster as "Infrastructure only." It should also include instructions on how to move the Azure Disk CSI driver, Azure File CSI driver, and Azure File CSI driver operator to the appropriate clusters.
This feature impacts the CSI Stack for Azure and any layered products that interact with it. Interoperability test scenarios should be factored by the layered products.
Epic Goal*
What is our purpose in implementing this? What new capability will be available to customers?
Run Azure File CSI driver operator + Azure File CSI driver control-plane Pods in the management cluster, run the driver DaemonSet in the hosted cluster allowing customers to associate a cluster as "Infrastructure only".
Why is this important? (mandatory)
This allows customers to run their Azure infrastructure more efficiently and cost-effectively by using hosted control planes and supporting infrastructure without incurring additional charges from Red Hat. Additionally, customers should care most about workloads not the management stack to operate their clusters, this feature gets us closer to this goal.
Scenarios (mandatory)
When leveraging Hosted control planes, the Azure File CSI driver operator + Azure File CSI driver control-plane Pods should run in the management cluster. The driver DaemonSet should run on the managed cluster. This deployment model should provide the same feature set as the regular OCP deployment.
Dependencies (internal and external) (mandatory)
Hosted control plane on Azure.
Contributing Teams(and contacts) (mandatory)
Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.
Done - Checklist (mandatory)
The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.
We need to make changes to CSO so that it can run the azure-file operator on HyperShift.
As part of this story, we will simply move building and CI of the existing code to the combined csi-operator.
We need to modify csi-operator so that it can be run as the azure-file operator on both HyperShift and standalone clusters.
Epic Goal*
What is our purpose in implementing this? What new capability will be available to customers?
Run Azure Disk CSI driver operator + Azure Disk CSI driver control-plane Pods in the management cluster, run the driver DaemonSet in the hosted cluster allowing customers to associate a cluster as "Infrastructure only".
Why is this important? (mandatory)
This allows customers to run their Azure infrastructure more efficiently and cost-effectively by using hosted control planes and supporting infrastructure without incurring additional charges from Red Hat. Additionally, customers should care most about workloads not the management stack to operate their clusters, this feature gets us closer to this goal.
Scenarios (mandatory)
When leveraging Hosted control planes, the Azure Disk CSI driver operator + Azure Disk CSI driver control-plane Pods should run in the management cluster. The driver DaemonSet should run on the managed cluster. This deployment model should provide the same feature set as the regular OCP deployment.
Dependencies (internal and external) (mandatory)
Hosted control plane on Azure.
Contributing Teams(and contacts) (mandatory)
Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.
Done - Checklist (mandatory)
As part of this epic, Engineers working on Azure Hypershift should be able to build and use Azure Disk storage on hypershift guests via developer preview custom build images.
For this story, we are going to enable deployment of azure disk driver and operator by default in hypershift environment.
Add a knob in CNO to allow users to modify the changes made in ovn-k
Create custom roles for GCP with minimal set of required permissions.
Enable customers to better scope credential permissions and create custom roles on GCP that only include the minimum subset of what is needed for OpenShift.
Some of the service accounts that CCO creates, e.g. a service account with the role roles/iam.serviceAccountUser, provide elevated permissions that are not required or used by the requesting OpenShift components. This is because we use predefined GCP roles that come with a bunch of additional permissions. The goal is to create custom roles with only the required permissions.
TBD
Evaluate if any of the GCP predefined roles in the credentials request manifest of Cluster Network Operator give elevated permissions. Remove any such predefined role from spec.predefinedRoles field and replace it with required permissions in the new spec.permissions field.
The new GCP provider spec for credentials request CR is as follows:
type GCPProviderSpec struct {
	metav1.TypeMeta `json:",inline"`
	// PredefinedRoles is the list of GCP pre-defined roles
	// that the CredentialsRequest requires.
	PredefinedRoles []string `json:"predefinedRoles"`
	// Permissions is the list of GCP permissions required to
	// create a more fine-grained custom role to satisfy the
	// CredentialsRequest.
	// When both Permissions and PredefinedRoles are specified,
	// the service account will have the union of permissions from
	// both fields.
	Permissions []string `json:"permissions"`
	// SkipServiceCheck can be set to true to skip the check whether the requested
	// roles or permissions have the necessary services enabled.
	// +optional
	SkipServiceCheck bool `json:"skipServiceCheck,omitempty"`
}
We can use the following command to check the permissions associated with a GCP predefined role:
gcloud iam roles describe <role_name>
The sample output for the roleViewer role is as follows. The permissions are listed in the "includedPermissions" field.
[akhilrane@localhost cloud-credential-operator]$ gcloud iam roles describe roles/iam.roleViewer
description: Read access to all custom roles in the project.
etag: AA==
includedPermissions:
- iam.roles.get
- iam.roles.list
- resourcemanager.projects.get
- resourcemanager.projects.getIamPolicy
name: roles/iam.roleViewer
stage: GA
title: Role Viewer
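Once the minimal permission set for an operator is known, a matching custom role can be created directly; for example (project and role ID are placeholders):

gcloud iam roles create openshift_example_minimal \
  --project=<project-id> \
  --title="OpenShift example minimal role" \
  --permissions=iam.roles.get,iam.roles.list,resourcemanager.projects.get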
Update GCP Credentials Request manifest of the Cluster Network Operator to use new API field for requesting permissions.
Evaluate if any of the GCP predefined roles in the credentials request manifests of OpenShift cluster operators give elevated permissions. Remove any such predefined role from spec.predefinedRoles field and replace it with required permissions in the new spec.permissions field.
The new GCP provider spec for credentials request CR is as follows:
type GCPProviderSpec struct {
	metav1.TypeMeta `json:",inline"`
	// PredefinedRoles is the list of GCP pre-defined roles
	// that the CredentialsRequest requires.
	PredefinedRoles []string `json:"predefinedRoles"`
	// Permissions is the list of GCP permissions required to
	// create a more fine-grained custom role to satisfy the
	// CredentialsRequest.
	// When both Permissions and PredefinedRoles are specified,
	// the service account will have the union of permissions from
	// both fields.
	Permissions []string `json:"permissions"`
	// SkipServiceCheck can be set to true to skip the check whether the requested
	// roles or permissions have the necessary services enabled.
	// +optional
	SkipServiceCheck bool `json:"skipServiceCheck,omitempty"`
}
We can use the following command to check the permissions associated with a GCP predefined role:
gcloud iam roles describe <role_name>
The sample output for the roleViewer role is as follows. The permissions are listed in the "includedPermissions" field.
[akhilrane@localhost cloud-credential-operator]$ gcloud iam roles describe roles/iam.roleViewer
description: Read access to all custom roles in the project.
etag: AA==
includedPermissions:
- iam.roles.get
- iam.roles.list
- resourcemanager.projects.get
- resourcemanager.projects.getIamPolicy
name: roles/iam.roleViewer
stage: GA
title: Role Viewer
Evaluate if any of the GCP predefined roles in the credentials request manifest of Cloud Controller Manager Operator give elevated permissions. Remove any such predefined role from spec.predefinedRoles field and replace it with required permissions in the new spec.permissions field.
The new GCP provider spec for credentials request CR is as follows:
type GCPProviderSpec struct {
	metav1.TypeMeta `json:",inline"`
	// PredefinedRoles is the list of GCP pre-defined roles
	// that the CredentialsRequest requires.
	PredefinedRoles []string `json:"predefinedRoles"`
	// Permissions is the list of GCP permissions required to
	// create a more fine-grained custom role to satisfy the
	// CredentialsRequest.
	// When both Permissions and PredefinedRoles are specified,
	// the service account will have the union of permissions from
	// both fields.
	Permissions []string `json:"permissions"`
	// SkipServiceCheck can be set to true to skip the check whether the requested
	// roles or permissions have the necessary services enabled.
	// +optional
	SkipServiceCheck bool `json:"skipServiceCheck,omitempty"`
}
We can use the following command to check the permissions associated with a GCP predefined role:
gcloud iam roles describe <role_name>
The sample output for the roleViewer role is as follows. The permissions are listed in the "includedPermissions" field.
[akhilrane@localhost cloud-credential-operator]$ gcloud iam roles describe roles/iam.roleViewer
description: Read access to all custom roles in the project.
etag: AA==
includedPermissions:
- iam.roles.get
- iam.roles.list
- resourcemanager.projects.get
- resourcemanager.projects.getIamPolicy
name: roles/iam.roleViewer
stage: GA
title: Role Viewer
Evaluate if any of the GCP predefined roles in the credentials request manifest of machine api operator give elevated permissions. Remove any such predefined role from spec.predefinedRoles field and replace it with required permissions in the new spec.permissions field.
The new GCP provider spec for credentials request CR is as follows:
type GCPProviderSpec struct {
	metav1.TypeMeta `json:",inline"`
	// PredefinedRoles is the list of GCP pre-defined roles
	// that the CredentialsRequest requires.
	PredefinedRoles []string `json:"predefinedRoles"`
	// Permissions is the list of GCP permissions required to
	// create a more fine-grained custom role to satisfy the
	// CredentialsRequest.
	// When both Permissions and PredefinedRoles are specified,
	// the service account will have the union of permissions from
	// both fields.
	Permissions []string `json:"permissions"`
	// SkipServiceCheck can be set to true to skip the check whether the requested
	// roles or permissions have the necessary services enabled.
	// +optional
	SkipServiceCheck bool `json:"skipServiceCheck,omitempty"`
}
We can use the following command to check the permissions associated with a GCP predefined role:
gcloud iam roles describe <role_name>
The sample output for the roleViewer role is as follows. The permissions are listed in the "includedPermissions" field.
[akhilrane@localhost cloud-credential-operator]$ gcloud iam roles describe roles/iam.roleViewer
description: Read access to all custom roles in the project.
etag: AA==
includedPermissions:
- iam.roles.get
- iam.roles.list
- resourcemanager.projects.get
- resourcemanager.projects.getIamPolicy
name: roles/iam.roleViewer
stage: GA
title: Role Viewer
Evaluate if any of the GCP predefined roles in the credentials request manifest of Cluster CAPI Operator give elevated permissions. Remove any such predefined role from spec.predefinedRoles field and replace it with required permissions in the new spec.permissions field.
The new GCP provider spec for credentials request CR is as follows:
type GCPProviderSpec struct {
	metav1.TypeMeta `json:",inline"`
	// PredefinedRoles is the list of GCP pre-defined roles
	// that the CredentialsRequest requires.
	PredefinedRoles []string `json:"predefinedRoles"`
	// Permissions is the list of GCP permissions required to
	// create a more fine-grained custom role to satisfy the
	// CredentialsRequest.
	// When both Permissions and PredefinedRoles are specified,
	// the service account will have the union of permissions from
	// both fields.
	Permissions []string `json:"permissions"`
	// SkipServiceCheck can be set to true to skip the check whether the requested
	// roles or permissions have the necessary services enabled.
	// +optional
	SkipServiceCheck bool `json:"skipServiceCheck,omitempty"`
}
We can use the following command to check the permissions associated with a GCP predefined role:
gcloud iam roles describe <role_name>
The sample output for the roleViewer role is as follows. The permissions are listed in the "includedPermissions" field.
[akhilrane@localhost cloud-credential-operator]$ gcloud iam roles describe roles/iam.roleViewer
description: Read access to all custom roles in the project.
etag: AA==
includedPermissions:
- iam.roles.get
- iam.roles.list
- resourcemanager.projects.get
- resourcemanager.projects.getIamPolicy
name: roles/iam.roleViewer
stage: GA
title: Role Viewer
These are phase 2 items from CCO-188
Moving items from other teams that need to be committed to for 4.13 in order for this work to complete.
Evaluate if any of the GCP predefined roles in the credentials request manifest of Cluster Storage Operator give elevated permissions. Remove any such predefined role from spec.predefinedRoles field and replace it with required permissions in the new spec.permissions field.
The new GCP provider spec for credentials request CR is as follows:
type GCPProviderSpec struct {
	metav1.TypeMeta `json:",inline"`
	// PredefinedRoles is the list of GCP pre-defined roles
	// that the CredentialsRequest requires.
	PredefinedRoles []string `json:"predefinedRoles"`
	// Permissions is the list of GCP permissions required to
	// create a more fine-grained custom role to satisfy the
	// CredentialsRequest.
	// When both Permissions and PredefinedRoles are specified,
	// the service account will have the union of permissions from
	// both fields.
	Permissions []string `json:"permissions"`
	// SkipServiceCheck can be set to true to skip the check whether the requested
	// roles or permissions have the necessary services enabled.
	// +optional
	SkipServiceCheck bool `json:"skipServiceCheck,omitempty"`
}
We can use the following command to check the permissions associated with a GCP predefined role:
gcloud iam roles describe <role_name>
The sample output for the roleViewer role is as follows. The permissions are listed in the "includedPermissions" field.
[akhilrane@localhost cloud-credential-operator]$ gcloud iam roles describe roles/iam.roleViewer
description: Read access to all custom roles in the project.
etag: AA==
includedPermissions:
- iam.roles.get
- iam.roles.list
- resourcemanager.projects.get
- resourcemanager.projects.getIamPolicy
name: roles/iam.roleViewer
stage: GA
title: Role Viewer
Evaluate if any of the GCP predefined roles in the credentials request manifest of Cluster Ingress Operator give elevated permissions. Remove any such predefined role from spec.predefinedRoles field and replace it with required permissions in the new spec.permissions field.
The new GCP provider spec for credentials request CR is as follows:
type GCPProviderSpec struct {
	metav1.TypeMeta `json:",inline"`
	// PredefinedRoles is the list of GCP pre-defined roles
	// that the CredentialsRequest requires.
	PredefinedRoles []string `json:"predefinedRoles"`
	// Permissions is the list of GCP permissions required to
	// create a more fine-grained custom role to satisfy the
	// CredentialsRequest.
	// When both Permissions and PredefinedRoles are specified,
	// the service account will have the union of permissions from
	// both fields.
	Permissions []string `json:"permissions"`
	// SkipServiceCheck can be set to true to skip the check whether the requested
	// roles or permissions have the necessary services enabled.
	// +optional
	SkipServiceCheck bool `json:"skipServiceCheck,omitempty"`
}
We can use the following command to check the permissions associated with a GCP predefined role:
gcloud iam roles describe <role_name>
The sample output for the roleViewer role is as follows. The permissions are listed in the "includedPermissions" field.
[akhilrane@localhost cloud-credential-operator]$ gcloud iam roles describe roles/iam.roleViewer
description: Read access to all custom roles in the project.
etag: AA==
includedPermissions:
- iam.roles.get
- iam.roles.list
- resourcemanager.projects.get
- resourcemanager.projects.getIamPolicy
name: roles/iam.roleViewer
stage: GA
title: Role Viewer
Evaluate if any of the GCP predefined roles in the credentials request manifest of Cluster Image Registry Operator give elevated permissions. Remove any such predefined role from spec.predefinedRoles field and replace it with required permissions in the new spec.permissions field.
The new GCP provider spec for credentials request CR is as follows:
type GCPProviderSpec struct {
	metav1.TypeMeta `json:",inline"`
	// PredefinedRoles is the list of GCP pre-defined roles
	// that the CredentialsRequest requires.
	PredefinedRoles []string `json:"predefinedRoles"`
	// Permissions is the list of GCP permissions required to
	// create a more fine-grained custom role to satisfy the
	// CredentialsRequest.
	// When both Permissions and PredefinedRoles are specified,
	// the service account will have the union of permissions from
	// both fields.
	Permissions []string `json:"permissions"`
	// SkipServiceCheck can be set to true to skip the check whether the requested
	// roles or permissions have the necessary services enabled.
	// +optional
	SkipServiceCheck bool `json:"skipServiceCheck,omitempty"`
}
We can use the following command to check the permissions associated with a GCP predefined role:
gcloud iam roles describe <role_name>
The sample output for the roleViewer role is as follows. The permissions are listed in the "includedPermissions" field.
[akhilrane@localhost cloud-credential-operator]$ gcloud iam roles describe roles/iam.roleViewer
description: Read access to all custom roles in the project.
etag: AA==
includedPermissions:
- iam.roles.get
- iam.roles.list
- resourcemanager.projects.get
- resourcemanager.projects.getIamPolicy
name: roles/iam.roleViewer
stage: GA
As an Infrastructure Administrator, I want to deploy OpenShift on Nutanix distributing the control plane and compute nodes across multiple regions and zones, forming different failure domains.
As an Infrastructure Administrator, I want to configure an existing OpenShift cluster to distribute the nodes across regions and zones, forming different failure domains.
Install OpenShift on Nutanix using IPI / UPI in multiple regions and zones.
This implementation would follow the same idea that has been done for vSphere. The following are the main PRs for vSphere:
https://github.com/openshift/enhancements/blob/master/enhancements/installer/vsphere-ipi-zonal.md
Nutanix Zonal: Multiple regions and zones support for Nutanix IPI and Assisted Installer
Note
As a user, I want to be able to spread control plane nodes for an OCP clusters across Prism Elements (zones).
Feature Overview (aka. Goal Summary):
This feature will allow an x86 control plane to function with compute nodes exclusively of type Power in a HyperShift environment.
Goals (aka. expected user outcomes):
Enable an x86 control plane to operate with a Power data-plane in a HyperShift environment.
Requirements (aka. Acceptance Criteria):
Customer Considerations:
Customers who require a mix of x86 control plane and Power data-plane for their HyperShift environment will benefit from this feature.
Documentation Considerations:
Interoperability Considerations:
Currently only amd64 and arm64 are considered supported processor architectures in HyperShift; we need to add ppc64le as a supported arch with the agent platform.
Make it possible to entirely disable the Ingress Operator by leveraging the OCPPLAN-9638 Composable OpenShift capability.
OCP/Telco Definition of Done
Epic Template descriptions and documentation.
RFE: https://issues.redhat.com/browse/RFE-3395
Enhancement PR: https://github.com/openshift/enhancements/pull/1415
API PR: https://github.com/openshift/api/pull/1516
Ingress Operator PR: https://github.com/openshift/cluster-ingress-operator/pull/950
Feature Goal: Make it possible to entirely disable the Ingress Operator by leveraging the Composable OpenShift capability.
Implement the ingress capability focusing on the HyperShift users.
As described in the EP PR.
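For illustration, with the capability in place a cluster could be created without the ingress operator by leaving it out of the enabled capabilities in the install-config (capability names shown are indicative only; on HyperShift the enabled set is projected into the hosted cluster's ClusterVersion):

capabilities:
  baselineCapabilitySet: None
  additionalEnabledCapabilities:
  - Console
  - Insights
  - Storage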
# ...
* Release Technical Enablement - Provide necessary release enablement details and documents.
* Ingress Operator can be disabled on HyperShift.
# The install-config and ClusterVersion API have been updated with the capability feature.
# The console operator.
#
* CI - CI is running, tests are automated and merged.
* Release Enablement <link to Feature Enablement Presentation>
* DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
* DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
* DEV - Downstream build attached to advisory: <link to errata>
* QE - Test plans in Polarion: <link or reference to Polarion>
* QE - Automated tests merged: <link or reference to automated tests>
* DOC - Downstream documentation merged: <link to meaningful PR>
Context
HyperShift uses the cluster-version-operator (CVO) to manage part of the ingress operator's payload, namely CRDs and RBACs. The ingress operator's deployment, though, is reconciled directly by the control plane operator. That's why HyperShift projects the ClusterVersion resource into the hosted cluster, and the enabled capabilities have to be set properly to enable/disable the ingress operator's payload and to let users as well as the other operators be aware of the state of the capabilities.
Goal
The goal of this user story is to implement a new capability in the OpenShift API repository. Use this procedure as example.
Goal
The goal of this user story is to add the new (ingress) capability to the cluster operator's payload (manifests: CRDs, RBACs, deployment, etc.).
Out of scope
Acceptance criteria
Links
Goal
The goal of this user story is to bump the openshift api which contains the ingress capability.
Acceptance criteria
Links
This epic is another epic under the "reduce workload disruptions" umbrella.
This is now updated to get us most of the way to MCO-200 (Admin-Defined reboot & drain), but not necessarily with all the final features in place.
This epic aims to create a reboot/drain policy object and an MCO-management apparatus for initial functionality with MachineConfig-backed updates, with a restricted set of actions for the user. We also need reboot/drain policy objects for ImageContentSourcePolicy, ImageTagMirrorSet and ImageDigestMirrorSet to avoid drains/reboots when admins use these APIs and have other ways of ensuring image integrity.
This mostly focuses on the user interface for defining reboot/drain policies. We will also need this for the layering "live apply" cases and bifrost-backed updates, to be implemented in a future update.
The MCO's reboot and drain rules are currently hard-coded in the machine-config-daemon here.
Node drains also occur even beyond OCP 4.9 when not just adding but also removing ICSP, ITMS or IDMS objects, or single mirroring rules in their configuration, according to RFE-3667.
This causes at least three problems:
Done when:
Description of problem:
The MCO logic today allows users to not reboot when changing the registries.conf file (through ICSP/IDMS/ITMS objects), but the MCO will sometimes drain the node if the change is deemed "unsafe" (deleting a mirror, for example). This behaviour is very disruptive for some customers who wish to make all image registry changes non-disruptive. We will address this long term with admin-defined policies via the API properly, but we would like to have a backportable solution (as a support exception) for users to do so.
Version-Release number of selected component (if applicable):
4.14->4.16
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Have the MCC validate the correctness of the user-provided spec, and render the final object into the status for the daemon to use.
Have either a separate RebootPolicy or as part of MachineConfiguration spec
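A minimal sketch of what an admin-defined policy could look like if carried on the MachineConfiguration spec; the field names, action types, and file paths below are illustrative assumptions, not the agreed API:

apiVersion: operator.openshift.io/v1
kind: MachineConfiguration
metadata:
  name: cluster
spec:
  # Hypothetical policy section; the MCC would validate it and render the
  # effective policy into status for the daemon to consume.
  nodeDisruptionPolicy:
    files:
    - path: /etc/containers/registries.conf
      actions:
      - type: None          # e.g. no drain and no reboot for mirror changes
    sshkey:
      actions:
      - type: Reload
        reload:
          serviceName: sshd.service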
An elevator pitch (value statement) that describes the Feature in a clear, concise way. Complete during New status.
The observable functionality that the user now has as a result of receiving this feature. Complete during New status.
A list of specific needs or objectives that a feature must deliver in order to be considered complete. Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc. Initial completion during Refinement status.
Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.
Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.
High-level list of items that are out of scope. Initial completion during Refinement status.
Provide any additional context that is needed to frame the feature. Initial completion during Refinement status.
Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.
Provide information that needs to be considered and planned so that documentation will meet customer needs. Initial completion during Refinement status.
Which other projects and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.
Display console graphs for the following:
1. Specifics on testing framework- adding tests to https://github.com/openshift/console- pending story creation.
2. Feasibility of https://issues.redhat.com/browse/WINC-530
As a WMCO user, I want filesystem graphs to be visible for Windows nodes.
WMCO relies on existing console queries to display graphs for Windows nodes. Changes in the filesystem queries (https://github.com/openshift/console/pull/7201) on the console side prevent filesystem graphs from being displayed for Windows nodes.
The root cause for the filesystem queries not working is that windows_exporter does not return any value for the `mountpoint` field used in the console queries.
Filesystem graph is populated for Windows nodes in the OpenShift console.
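For context, the console's filesystem graphs build on node filesystem metrics keyed by that label; a query of roughly this shape (illustrative, not the exact console query) returns no datapoints when mountpoint is missing from the Windows metrics:

sum(node_filesystem_size_bytes{instance='<node>'} - node_filesystem_avail_bytes{instance='<node>'}) by (mountpoint)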
When inspecting a pod in the OCP console, the metrics tab shows a number of graphs. The networking graphs do not show data for Windows pods.
Using windows exporter 0.24.0 I am getting `No datapoints found` for the query made by the console:
(sum(irate(container_network_receive_bytes_total{pod='win-webserver-685cd6c5cc-8298l'}[5m])) by (pod, namespace, interface)) + on(namespace,pod,interface) group_left(network_name) (pod_network_name_info)
There is data returned from the query `irate(container_network_receive_bytes_total{pod='windows-machine-config-operator-7c8bcc7b64-sjqxw'}[5m])`
Which makes me believe the error is due to pod_network_name_info not having data for the Windows pods I am looking at.
I'm confirming that by checking in the namespace the workloads are deployed to via the query: pod_network_name_info{namespace="openshift-windows-machine-config-operator"}
I only see metrics for the Linux pods in the namespace.
Looking into this it seems like these metrics are coming from https://github.com/openshift/network-metrics-daemon which runs on each Linux node, and creates a metric for applicable pods running on the node.
We need to create CI tests to ensure that, during the live migration, when a cluster has both OVN-K and OpenShift SDN deployed on some nodes, the cluster network can still work as normal. We need to run the disruptive tests as we do for upgrades.
In https://issues.redhat.com/browse/OPNET-10 we did some initial investigation of a new design for handling creation of br-ex for OVNK. We successfully came up with a high-level design, but we still need to translate that into an implementable OpenShift feature. The goal for this epic would be to come up with an accepted enhancement. The implementation of the design would likely need to happen in the next release.
Notable obstacles to this work are:
This is the followup epic to https://issues.redhat.com/browse/OPNET-265 Once we have an approved enhancement (which we expect to be a significant piece of work in itself), this will track the implementation. We expect this to happen in 4.15 or later unless things go very smoothly with the enhancement epic in 4.14.
Image and artifact signing is a key part of a DevSecOps model. The Red Hat-sponsored sigstore project aims to simplify signing of cloud-native artifacts and sees increasing interest and uptake in the Kubernetes community. This document proposes to incrementally invest in OpenShift support for sigstore-style signed images and be public about it. The goal is to give customers a practical and scalable way to establish content trust. It will strengthen OpenShift’s security philosophy and value-add in the light of the recent supply chain security crisis.
CRIO
https://docs.google.com/document/d/12ttMgYdM6A7-IAPTza59-y2ryVG-UUHt-LYvLw4Xmq8/edit#
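For illustration, CRI-O consumes the containers-image signature policy format for sigstore-style verification; a hedged example entry in /etc/containers/policy.json (registry path and key location are placeholders):

{
  "transports": {
    "docker": {
      "registry.example.com/myorg": [
        {
          "type": "sigstoreSigned",
          "keyPath": "/etc/pki/containers/cosign.pub",
          "signedIdentity": {"type": "matchRepository"}
        }
      ]
    }
  }
}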
As a hub cluster admin, I want to be able to:
so that I can prevent
Description of criteria:
Implementing managed-only configs is out of scope for this story
This does not require a design proposal.
This does not require a feature gate.
As a hub cluster admin, I want to be able to:
so that I can prevent
Description of criteria:
Implementing managed-only configs is out of scope for this story
This does not require a design proposal.
This does not require a feature gate.
We need to be able to install the HO with external DNS and create HCPs on AKS clusters
The cloud-network-config-operator is being deployed on HyperShift with `runAsNonRoot` set to true. When HCP is deployed on non-OpenShift management clusters, such as AKS, this needs to be unset so the pod can run as root.
This is currently causing issues deploying this pod on HCP on AKS with the following error:
state:
waiting:
message: 'container has runAsNonRoot and image will run as root (pod: "cloud-network-config-controller-59d4677589-bpkfp_clusters-brcox-hypershift-arm(62a4b447-1df7-4e4a-9716-6e10ec55d8fd)", container: hosted-cluster-kubecfg-setup)'
reason: CreateContainerConfigError
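One possible shape of the fix (a sketch, not necessarily the implemented change) is to render the restrictive security context only when the management cluster is OpenShift and to relax it on non-OpenShift managers such as AKS:

# As rendered today; fails on AKS because the image runs as root and no runAsUser is set
securityContext:
  runAsNonRoot: true
# On non-OpenShift management clusters the constraint would be left unset
# (or explicitly relaxed) so the container may run as the image's root user
securityContext: {}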
As a user of HyperShift on Azure, I want to be able to:
so that I can achieve
Description of criteria:
This requires/does not require a design proposal.
This requires/does not require a feature gate.
CSI pods are failing to create on HCP on an AKS cluster.
% k get pods | grep -v Running
NAME                                      READY   STATUS                       RESTARTS   AGE
csi-snapshot-controller-cfb96bff7-7tc94   0/1     CreateContainerConfigError   0          17h
csi-snapshot-webhook-57f9799848-mlh8k     0/1     CreateContainerConfigError   0          17h
The issue is
Events:
  Type     Reason     Age                 From               Message
  ----     ------     ----                ----               -------
  Normal   Scheduled  34m                 default-scheduler  Successfully assigned clusters-brcox-hypershift-arm/csi-snapshot-controller-cfb96bff7-7tc94 to aks-nodepool1-24902778-vmss000001
  Normal   Pulling    34m                 kubelet            Pulling image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:805280104b2cc8d8799b14b2da0bd1751074a39c129c8ffe5fc1b370671ecb83"
  Normal   Pulled     34m                 kubelet            Successfully pulled image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:805280104b2cc8d8799b14b2da0bd1751074a39c129c8ffe5fc1b370671ecb83" in 3.036610768s (10.652709193s including waiting)
  Warning  Failed     32m (x12 over 34m)  kubelet            Error: container has runAsNonRoot and image will run as root (pod: "csi-snapshot-controller-cfb96bff7-7tc94_clusters-brcox-hypershift-arm(45ab89f2-9c00-4afa-bece-7846505edbfc)", container: snapshot-controller)
As a HyperShift CLI user, I want to be able to:
so that I can achieve
Description of criteria:
This requires/does not require a design proposal.
This requires/does not require a feature gate.
As of OpenShift 4.14, this functionality is Tech Preview for all platforms but OpenStack, where it is GA. This Feature is to bring the functionality to GA for all remaining platforms.
Allow configuring control plane nodes across multiple subnets for on-premise IPI deployments. With nodes separated into subnets, also allow using an external load balancer, instead of the built-in one (keepalived/haproxy) that the IPI workflow installs, so that the customer can configure their own load balancer with the ingress and API VIPs pointing to nodes in the separate subnets.
I want to install OpenShift with IPI on an on-premise platform (high priority for bare metal and vSphere) and I need to distribute my control plane and nodes across multiple subnets.
I want to use IPI automation but I will configure an external load balancer for the API and Ingress VIPs, instead of using the built-in keepalived/haproxy-based load balancer that come with the on-prem platforms.
Customers require using multiple logical availability zones to define their architecture and topology for their datacenter. OpenShift clusters are expected to fit in this architecture for the high availability and disaster recovery plans of their datacenters.
Customers want the benefits of IPI and automated installations (and avoid UPI) and at the same time when they expect high traffic in their workloads they will design their clusters with external load balancers that will have the VIPs of the OpenShift clusters.
Load balancers can distribute incoming traffic across multiple subnets, which is something our built-in load balancers aren't able to do and which represents a big limitation for the topologies customers are designing.
While this is possible with IPI AWS, this isn't available with on-premise platforms installed with IPI (for the control plane nodes specifically), and customers see this as a gap in OpenShift for on-premise platforms.
Epic | Control Plane with Multiple Subnets | Compute with Multiple Subnets | Doesn't need external LB | Built-in LB |
---|---|---|---|---|
✓ | ✓ | ✓ | ✓ | |
✓ | ✓ | ✓ | ✕ | |
✓ | ✓ | ✓ | ✓ | |
✓ | ✓ | ✓ | ✓ | |
✓ | ✓ | ✓ | ||
✓ | ✓ | ✓ | ✕ | |
✓ | ✓ | ✓ | ✕ | |
✓ | ✓ | ✓ | ✓ | |
✕ | ✓ | ✓ | ✓ | |
✕ | ✓ | ✓ | ✓ | |
✕ | ✓ | ✓ | ✓ |
Workers on separate subnets with IPI documentation
We can already deploy compute nodes on separate subnets by preventing the built-in LBs from running on the compute nodes. This is documented for bare metal only for the Remote Worker Nodes use case: https://docs.openshift.com/container-platform/4.11/installing/installing_bare_metal_ipi/ipi-install-installation-workflow.html#configure-network-components-to-run-on-the-control-plane_ipi-install-installation-workflow
This procedure works on vSphere too, albeit with no QE CI coverage and no documentation.
External load balancer with IPI documentation
Currently o/installer validates that an external load balancer is used only in TechPreview clusters. This validation needs to be removed so that external load balancers can be consumed as GA.
Goal:
Enable and support Multus CNI for microshift.
Background:
Customers with advanced networking requirement need to be able to attach additional networks to a pod, e.g. for high-performance requirements using SR-IOV or complex VLAN setups etc.
Requirements:
Documentation:
Testing:
Customer Considerations:
Out of scope:
(Contacting ART to set up the image build is another task.)
It should include all CNIs (bridge, macvlan, ipvlan, etc.) - if we decide to support something, we'll just update scripting to copy those CNIs
It needs to include IPAMs: static, dynamic (DHCP), host-local (we might just not copy it to host)
RHEL9 binaries only to save space
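For illustration, once Multus and the bridge/host-local plugins are shipped, attaching a secondary network to a pod would follow the usual NetworkAttachmentDefinition pattern (names, bridge, and subnet below are placeholders):

apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: bridge-conf
spec:
  config: '{
    "cniVersion": "0.4.0",
    "type": "bridge",
    "bridge": "br-test",
    "ipam": {
      "type": "host-local",
      "subnet": "10.10.0.0/24"
    }
  }'

A pod would then reference it with the annotation k8s.v1.cni.cncf.io/networks: bridge-conf.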
Unify and update hosted control planes storage operators so that they have similar code patterns and can run properly in both standalone OCP and HyperShift's control plane.
Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.
Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.
High-level list of items that are out of scope. Initial completion during Refinement status.
Provide any additional context that is needed to frame the feature. Initial completion during Refinement status.
Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.
Provide information that needs to be considered and planned so that documentation will meet customer needs. Initial completion during Refinement status.
Which other projects and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.
Epic Goal*
Our current design of the EBS driver operator to support HyperShift does not scale well to other drivers. The existing design will lead to more code duplication between driver operators and the possibility of errors.
Why is this important? (mandatory)
An improved design will allow more storage drivers and their operators to be added to hypershift without requiring significant changes in the code internals.
Scenarios (mandatory)
Dependencies (internal and external) (mandatory)
What items must be delivered by other teams/groups to enable delivery of this epic.
Contributing Teams(and contacts) (mandatory)
Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.
Acceptance Criteria (optional)
Provide some (testable) examples of how we will know if we have achieved the epic goal.
Drawbacks or Risk (optional)
Reasons we should consider NOT doing this such as: limited audience for the feature, feature will be superseded by other work that is planned, resulting feature will introduce substantial administrative complexity or user confusion, etc.
Done - Checklist (mandatory)
The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.
Our CSI driver YAML files are mostly copy-paste from the initial CSI driver (AWS EBS?).
As an OCP engineer, I want the YAML files to be generated, so we can keep consistency among the CSI drivers easily and make them less error-prone.
It should have no visible impact on the resulting operator behavior.
Finally switch both CI and ART to the refactored aws-ebs-csi-driver-operator.
The functionality and behavior should be the same as the existing operator, however, the code is completely new. There could be some rough edges. See https://github.com/openshift/enhancements/blob/master/enhancements/storage/csi-driver-operator-merge.md
CI should catch the most obvious errors; however, we need to test features that we do not have in CI, like:
Add the DNSNameResolver Controller to Cluster DNS Operator as proposed in the enhancement https://github.com/openshift/enhancements/pull/1335
Implement the changes in Cluster DNS Operator as proposed in the enhancement https://github.com/openshift/enhancements/pull/1335
Make the changes as per the proposed enhancement https://github.com/openshift/enhancements/pull/1335
Note: The flag should be added to OVN-K after checking if the feature-gate DNSNameResolver is enabled.
- apiGroups: ["network.openshift.io"]
resources:
- dnsnameresolvers
verbs:
- create
- delete
- get
- list
- patch
- update
- watch
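For reference, the object this RBAC covers is the DNSNameResolver CR proposed in the enhancement; a hedged example (API version, fields, and namespace as proposed there, subject to change):

apiVersion: network.openshift.io/v1alpha1
kind: DNSNameResolver
metadata:
  name: example
  namespace: openshift-ovn-kubernetes
spec:
  name: "www.example.com."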
Add the new external plugin from https://github.com/openshift/coredns-ocp-dnsnameresolver to CoreDNS as proposed in the enhancement https://github.com/openshift/enhancements/pull/1335
While trying to block requests going from the pods to different domain names, for example:
Here, the EgressNetworkPolicy works for `registry.access.redhat.com` and `registry.access.redhat.com.edgekey.net`; however, for `registry-1.docker.io` it is not denying access despite the deny entry.
"Domain name updates are polled based on the TTL (time to live) value of the domain returned by the local non-authoritative servers. The pod should also resolve the domain from the same local nameservers when necessary, otherwise, the IP addresses for the domain perceived by the egress network policy controller and the pod will be different, and the egress network policy may not be enforced as expected. Since egress network policy controller and pod are asynchronously polling the same local nameserver, there could be a race condition where pod may get the updated IP before the egress controller. Due to this current limitation, domain name usage in EgressNetworkPolicy is only recommended for domains with infrequent IP address changes."
The aim of this feature is to fix this and also support wildcard entries for EgressNetworkPolicy.
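For context, an OpenShift SDN EgressNetworkPolicy using DNS names looks like the following; the commented wildcard entry at the end is the kind of rule this feature would add and is illustrative only:

apiVersion: network.openshift.io/v1
kind: EgressNetworkPolicy
metadata:
  name: default
  namespace: my-project
spec:
  egress:
  - type: Deny
    to:
      dnsName: registry-1.docker.io
  - type: Allow
    to:
      dnsName: registry.access.redhat.com
  # Hypothetical wildcard entry this feature would enable:
  # - type: Deny
  #   to:
  #     dnsName: "*.docker.io"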
Goal
As an OpenShift installer I want to update the firmware of the hosts I use for OpenShift on day 1 and day 2.
As an OpenShift installer I want to integrate the firmware update in the ZTP workflow.
Description
The firmware updates are required in BIOS, GPUs, NICs, DPUs, on hosts that will often be used as DUs in Edge locations (commonly installed with ZTP).
Acceptance criteria
Out of Scope
Description of problem:
After running a firmware update the new version is not displayed in the status of the HostFirmwareComponents
Version-Release number of selected component (if applicable):
How reproducible:
100%
Steps to Reproduce:
1. Execute a firmware update, after it succeeds check the Status to find the information about the new version installed.
Actual results:
Status only show the initial information about the firmware components.
Expected results:
Status should show the newer information about the firmware components.
Additional info:
When executing a firmware update for BMH, there is a problem updating the Status of the HostFirmwareComponents CRD, causing the BMH to repeat the update multiple times since it stays in Preparing state.
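For reference, the flow being tested drives updates through the HostFirmwareComponents resource; a hedged sketch of the spec and status involved (field names per the metal3 API as we understand it, values are placeholders):

apiVersion: metal3.io/v1alpha1
kind: HostFirmwareComponents
metadata:
  name: worker-0
  namespace: openshift-machine-api
spec:
  updates:
  - component: bios
    url: http://firmware.example.com/bios-v2.10.bin
status:
  components:
  - component: bios
    initialVersion: "2.05"
    currentVersion: "2.05"   # expected to reflect 2.10 once the update succeeds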
As a customer of self-managed OpenShift or an SRE managing a fleet of OpenShift clusters, I should be able to determine the progress and state of an OCP upgrade and only be alerted if the cluster is unable to progress. Support a CLI status command and a status API which can be used by cluster admins to monitor the progress. The status command/API should also contain data to alert users about potential issues which can make the updates problematic.
Here are common update improvements from customer interactions on Update experience
oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.12.0    True        True          16s     Working towards 4.12.4: 9 of 829 done (1% complete)
Update docs for UX and CLI changes
Reference : https://docs.google.com/presentation/d/1cwxN30uno_9O8RayvcIAe8Owlds-5Wjr970sDxn7hPg/edit#slide=id.g2a2b8de8edb_0_22
Epic Goal*
Add a new command `oc adm upgrade status` command which is backed by an API. Please find the mock output of the command output attached in this card.
Why is this important? (mandatory)
Scenarios (mandatory)
Provide details for user scenarios including actions to be performed, platform specifications, and user personas.
Dependencies (internal and external) (mandatory)
What items must be delivered by other teams/groups to enable delivery of this epic.
Contributing Teams(and contacts) (mandatory)
Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.
Acceptance Criteria (optional)
Provide some (testable) examples of how we will know if we have achieved the epic goal.
Drawbacks or Risk (optional)
Reasons we should consider NOT doing this such as: limited audience for the feature, feature will be superseded by other work that is planned, resulting feature will introduce substantial administrative complexity or user confusion, etc.
Done - Checklist (mandatory)
The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.
Add node status as mentioned in the sample output to "oc adm upgrade status" in OpenShift Update Concepts
With this output, users will be able to see the state of the nodes that are not part of the master machine config pool and what version of RHCOS they are currently on. I am not sure if it is possible to also show corresponding OCP version. If possible we should display that as well.
=Worker Upgrade=
Worker Pool:     openshift-machine-api/machineconfigpool/worker
Assessment:      Admin required
Completion:      25% (Est Time Remaining: N/A - Manual action required)
Mode:            Manual | Assisted [- No Disruption][- Disruption Permitted][- Scaling Permitted]
Worker Status:   4 Total, 4 Available, 0 Progressing, 3 Outdated, 0 Draining, 0 Excluded, 0 Degraded

Worker Pool Node(s)
NAME                           ASSESSMENT   PHASE     VERSION   EST   MESSAGE
ip-10-0-134-165.ec2.internal   Manual       N/A       4.12.1    ?     Awaiting manual reboot
ip-10-0-135-128.ec2.internal   Complete     Updated   4.12.16   -
ip-10-0-138-184.ec2.internal   Manual       N/A       4.12.1    ?     Awaiting manual reboot
ip-10-0-139-31.ec2.internal    Manual       N/A       4.12.1    ?     Awaiting manual reboot
CVO story: persistence for "how long has this ClusterOperator been updating?". For a first pass, we can just hold this in memory in the CVO, and possibly expose it to our managers via the ResourceReconcilliationIssues structure created in OTA-1159. But to persist between CVO restarts, we could have the CVO annotate the ClusterOperator with "here's when I first precreated this ClusterOperator during update $KEY". Update keys are probably (a hash of?) (startTime, targetImage). And the CVO would set the annotation when it pre-created the cluster operator for the release, unless an annotation already existed with the same update key. And the reconciling-mode CVO could clear off the annotations.
We are not sure yet whether we want to do this with CO annotations or OTA-1159 "result of work", we defer this decision to after we implement OTA-1159.
I ended up doing this via the OTA-1159 result-of-work route, and I just did the "expose" side. If we want to cover persistence between CVO-container-restarts, we'd need a follow-up ticket for that. The CVO is only likely to restart when machine-config is moving though, and giving that component a bit more time doesn't seem like a terrible thing, so we might not need a follow-up ticket at all.
Add node status as mentioned in the sample output to "oc adm upgrade status".
With this output, users will be able to see the state of the nodes that are part of the master machine config pool and what version of RHCOS they are currently on. I am not sure if it is possible to also show corresponding OCP version. If possible we should display that as well.
Control Plane Node(s)
NAME                           ASSESSMENT    PHASE       VERSION
ip-10-0-128-174.ec2.internal   Complete      Updated     4.12.16
ip-10-0-142-240.ec2.internal   Progressing   Rebooting   -
ip-10-0-137-108.ec2.internal   Outdated      Pending     4.12.1
Definition of done:
= Control Plane = ... Operator Status: 33 Total, 31 Available, 1 Progressing, 4 Degraded
1. "33 Total" confuses people into thinking this number is a sum of the others (e.g. OCPBUGS-24643)
2. "Available" is useless most of the time: in happy path it equals total, in error path the "unavailable" would be more useful, we should not require the user to do a mental subtraction when we can just tell them
3. "1 Progressing" does not mean much, I think we can relay similar information by saying "1 upgrading" on the completion line (see OTA-1153)
4. "0 degraded" is not useful (and 0 unavailable would be too), we can hide it on happy path
5. somehow relay that operators can be both unavailable and degraded
// Happy path, all is well
Operator status: 33 Healthy

// Two operators Available=False
Operator status: 31 Healthy, 2 Unavailable

// Two operators Available=False, one degraded
Operator status: 30 Healthy, 2 Unavailable, 1 Available but degraded

// Two operators Available=False, one degraded, one both => unavailable trumps degraded
Operator status: 29 Healthy, 3 Unavailable, 1 Available but degraded
1. How to handle COs briefly unavailable / degraded? OTA-1087 PoC does not show insights about them if they are briefly down / degraded to avoid noise, so we can either be inconsistent between counts and health, or between "adm status" and "get co". => extracted to OTA-1175
This is a clone of issue OCPBUGS-33898. The following is the description of the original issue:
—
The mockup output does not contain the version the cluster is being upgraded to. The existing CVO condition shows it, so we should find a place for it in the output (or explicitly decide we don't want it).
Definition of Done:
oc adm upgrade status tells that the cluster is upgrading from x'.y'.z' (possibly via a partial intermediate version) to x.y.z.
For example, with a history like:
we might show something that mentioned 4.15.0, 4.15.2, 4.15.3, and 4.15.5. Or we could decide that those mid-update retargets are rare enough to not be worth spending code-time on, and only mention 4.15.0 and 4.15.5?
OpenShift Update Concepts proposes a --details option that should supply the insights with SOP/documentation links, but does not give an example of what the output would look like:
=Update Health=
SINCE      LEVEL     IMPACT             MESSAGE
3h         Warning   API Availability   High control plane CPU during upgrade
Resolved   Info      Update Speed       Pod is slow to terminate
4h         Info      Update Speed       Pod was slow to terminate
30m        Info      None               Worker node version skew
1h         Info      None               Update worker node

Run with --details for additional description and links to online documentation.
Justin's design ideas contain a remediation struct for this which I like.
Definition of Done:
Implementing RFE-928 would help a number of update-team use-cases:
The updates team is not well positioned to maintain oc access long-term; that seems like a better fit for the monitoring team (who maintain Alertmanager) or the workloads team (who maintain the bulk of oc). But we can probably hack together a proof-of-concept which we could later hand off to those teams, and in the meantime it would unblock our work on tech-preview commands consuming the firing-alert information.
The proof-of-concept could follow the following process:
$ OC_ENABLE_CMD_INSPECT_ALERTS=true oc adm inspect-alerts ...dump of firing alerts...
and a backing Go function that other subcommands like oc adm upgrade status can consume internally.
oc adm upgrade status currently renders Progressing and Failing!=False directly, instead of feeding them in through updateInsight. OTA-1154 is removing those. But Failing has useful information about the cluster-version operator's direct dependents which isn't available via ClusterOperator, MachineConfigPools, or the other resources we consume. This ticket is about adding logic to assessControlPlaneStatus to convert Failing!=False into an updateInsight, so it can be rendered via the consolidated insights-rendering pathways, and not via the one-off printout.
This is a clone of issue OCPBUGS-33896. The following is the description of the original issue:
—
Using the alerts-in-CLI PoC (OTA-1080), show relevant firing alerts in the OTA-1087 section. Probably do not show all firing alerts.
I propose showing
Impact can probably be a simple alertname -> impact type classifier. Message can be "Alert name: Alert message":
=Update Health=
SINCE   LEVEL     IMPACT             MESSAGE
3h      Warning   API Availability   KubeDaemonSetRolloutStuck: DaemonSet openshift-ingress-canary/ingress-canary has not finished or progressed for at least 30 minutes.
In https://github.com/openshift/oc/pull/1554 we scaffolded the status command with existing CVO condition message. We should stop printing this message once the standard command output relays this information well enough.
An update is in progress for 59m13s: Unable to apply 4.14.1: the cluster operator control-plane-machine-set is not available

Failing=True:

  Reason: ClusterOperatorNotAvailable
  Message: Cluster operator control-plane-machine-set is not available
The above is CVO condition message: we should remove once all information it presents is presented in the actual new status output.
None of this specially processed CVO / ClusterVersion condition content is emitted; the first section in the output is "= Control Plane =". We should only do this once its possible content is already surfaced via an assessment or a health insight.
As a PoC (proof of concept), warn about Available=False and Degraded=True ClusterOperators, smoothing flapping conditions (like discussed on a refinement call on Nov 29).
Follow Justin's direction from design ideas, make this easily pluggable for more "warnings"
=Update Health=
SINCE      LEVEL     IMPACT             MESSAGE
3h         Warning   API Availability   High control plane CPU during upgrade
Resolved   Info      Update Speed       Pod is slow to terminate
4h         Info      Update Speed       Pod was slow to terminate
30m        Info      None               Worker node version skew
1h         Info      None               Update worker node
This is a clone of issue OCPBUGS-33903. The following is the description of the original issue:
—
Description of problem:
The cluster version is not updating (Progressing=False).

  Reason: <none>
  Message: Cluster version is 4.16.0-0.nightly-2024-05-08-222442
When the cluster is not updating, it shows the Failing=True condition content, which is potentially confusing. I think we can just show "The cluster version is not updating".
Expose 'Result of work' as structured JSON in a ClusterVersion condition (ResourceReconciliationIssues? ReconciliationIssues). This would allow oc to explain what the CVO is having trouble with, without needing to reach around the CVO and look at ClusterOperators directly (avoiding the risk of latency/skew between what the CVO thinks it's waiting on and the current ClusterOperator content). And we could replace the current Progressing string with more information, and iterate quickly on the oc side. Although if we find ways we like more to stringify this content, we probably want to push that back up to the CVO so all clients can benefit.
We want to do this work behind a feature gate OTA-1169.
Note that a condition with a Message containing JSON instead of a human-readable message is against apimachinery conventions and is VERY poor API practice in general. The purpose of this story is simply to experiment with what such an API could look like, and it will inform how we build it in the future. Do not worry too much about tech debt; this is exploratory code.
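For illustration only, a sketch of how such an exploratory condition might look on ClusterVersion; the condition type and the JSON payload shape are assumptions, not a settled API.

~~~
apiVersion: config.openshift.io/v1
kind: ClusterVersion
metadata:
  name: version
status:
  conditions:
  - type: ResourceReconciliationIssues   # hypothetical condition type
    status: "True"
    reason: ResourceReconciliationIssues
    # JSON-in-message, per the exploratory approach described above
    message: '[{"group":"config.openshift.io","resource":"clusteroperators","name":"control-plane-machine-set","message":"cluster operator control-plane-machine-set is not available"}]'
~~~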
Enable support to bring your own encryption key (BYOK) for OpenShift on IBM Cloud VPC.
As a user I want to be able to provide my own encryption key when deploying OpenShift on IBM Cloud VPC so the cluster infrastructure objects, VM instances and storage objects, can use that user-managed key to encrypt the information.
The Installer will provide a mechanism to specify a user-managed key that will be used to encrypt the data on the virtual machines that are part of the OpenShift cluster as well as any other persistent storage managed by the platform via Storage Classes.
This feature is a required component for IBM's OpenShift replatforming effort.
The feature will be documented as usual to guide the user while using their own key to encrypt the data on the OpenShift cluster running on IBM Cloud VPC
As a (user persona), I want to be able to:
so that I can achieve
Description of criteria:
Detail about what is specifically not being delivered in the story
This requires/does not require a design proposal.
This requires/does not require a feature gate.
Support deploying an OpenShift cluster across multiple vSphere clusters, i.e. configuring multiple vCenter servers in one OpenShift cluster.
Multiple vCenter support in the Cloud Provider Interface (CPI) and the Cloud Storage Interface (CSI).
Customers want to deploy OpenShift across multiple vSphere clusters (vCenters) primarily for high availability.
Feature Overview
Support deploying an OpenShift cluster across multiple vSphere clusters, i.e. configuring multiple vCenter servers in one OpenShift cluster.
Multiple vCenter support in the Cloud Provider Interface (CPI) and the Cloud Storage Interface (CSI).
Customers want to deploy OpenShift across multiple vSphere clusters (vCenters) primarily for high availability.
This section contains all the test cases that we need to make sure work as part of the done^3 criteria.
This section contains all scenarios that are considered out of scope for this enhancement that will be done via a separate epic / feature / story.
As an OpenShift administrator, I would like the Cluster Config Operator (CCO) to not block the install of a new cluster due to vSphere Multi vCenter feature gate being enabled so that I can begin to install my cluster across multiple vcenters.
The purpose of this story is to make the changes needed for the CCO to allow configuration of the new Feature Gate for vSphere Multi vCenter support. This operator takes the infrastructure config generated by the installer, updates it for the cluster, and applies it. The only change to this operator should be updating the version of openshift/api; however, this operator has a lot of legacy code that is being transitioned to openshift/api, which is currently causing issues when updating the openshift/api version. We will update the version and address any other modifications as needed. We will need to work with the API team on what can be removed.
The CCO during installation will need to allow multiple vCenters to be configured. Any other failure reported based on issues performing operator tasks is valid and should be addressed via a new story.
We will need to do the following:
We will need to enhance all logic that hard-codes the vCenter count to check whether the vSphere Multi vCenter feature gate is enabled. If it is enabled, the vCenter count may be larger than 1; otherwise it must still fail with the error message that the vCenter count may not be greater than 1.
USER STORY:
As an OpenShift user, I need to span my clusters across multiple vCenters so that I can take advantage of multi-vCenter support. This will help me achieve better utilization, availability, and redundancy across vCenter sites.
DESCRIPTION:
Presently, only a single vCenter is allowed to be defined. The maximum number of allowed vCenters will take some thought, but let's start with 3 for now.
see: https://github.com/openshift/api/blob/master/config/v1/types_infrastructure.go#L1325-L1334
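For reference, a minimal install-config sketch with more than one vCenter defined; the values are illustrative and the exact field layout should be checked against the linked types.

~~~
platform:
  vsphere:
    vcenters:
    - server: vcenter-1.example.com
      user: administrator@vsphere.local
      password: <redacted>
      datacenters:
      - dc-east
    - server: vcenter-2.example.com
      user: administrator@vsphere.local
      password: <redacted>
      datacenters:
      - dc-west
~~~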
ACCEPTANCE CRITERIA:
ADDITIONAL INFORMATION:
As an OpenShift on vSphere administrator, I want to specify static IP assignments to my VMs.
As an OpenShift on vSphere administrator, I want to completely avoid using a DHCP server for the VMs of my OpenShift cluster.
Customers want the convenience of IPI deployments for vSphere without having to use DHCP. As in bare metal, where METAL-1 added this capability, some of the reasons are the security implications of DHCP (customers report that, for example, depending on configuration it allows any device to join the network). At the same time, IPI deployments require only our OpenShift installation software, while with UPI they would need automation software that, in secure environments, they would have to certify along with OpenShift.
Bare metal related work:
CoreOS Afterburn:
https://github.com/coreos/afterburn/blob/main/src/providers/vmware/amd64.rs#L28
https://github.com/openshift/installer/blob/master/upi/vsphere/vm/main.tf#L34
As an OpenShift on vSphere administrator, I want to specify static IP assignments to my VMs.
As an OpenShift on vSphere administrator, I want to completely avoid using a DHCP server for the VMs of my OpenShift cluster.
Customers want the convenience of IPI deployments for vSphere without having to use DHCP. As in bare metal, where METAL-1 added this capability, some of the reasons are the security implications of DHCP (customers report that, for example, depending on configuration it allows any device to join the network). At the same time, IPI deployments require only our OpenShift installation software, while with UPI they would need automation software that, in secure environments, they would have to certify along with OpenShift.
Bare metal related work:
CoreOS Afterburn:
https://github.com/coreos/afterburn/blob/main/src/providers/vmware/amd64.rs#L28
https://github.com/openshift/installer/blob/master/upi/vsphere/vm/main.tf#L34
We need to ensure that when a vSphere static IP IPI install is performed, the masters that are generated are treated as valid machines and do not get recreated by the CPMS operator.
USER STORY:
As a system admin, I would like the static IP support for vSphere to use IPAddressClaims to provide IP addresses during installation so that, after the install, the machines are defined in a way that is intended for use with IPAM controllers.
DESCRIPTION:
Currently the installer for vSphere will directly set the static IPs into the machine object yaml files. We would like to enhance the installer to create IPAddress, IPAddressClaim for each machine as well as update the machinesets to use addressesFromPools to request the IPAddress. Also, we should create a custom CRD that is the basis for the pool defined in the addressesFromPools field.
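A hedged sketch of the intended shape: a MachineSet network device requesting its address from a pool via addressesFromPools, plus a matching IPAddressClaim. The group/resource names for the custom pool CRD mentioned above are assumptions.

~~~
apiVersion: machine.openshift.io/v1beta1
kind: MachineSet
spec:
  template:
    spec:
      providerSpec:
        value:
          network:
            devices:
            - networkName: ci-segment
              addressesFromPools:
              - group: vsphere.platform.openshift.io   # hypothetical group for the custom pool CRD
                resource: ippools
                name: static-ip-pool-0
---
apiVersion: ipam.cluster.x-k8s.io/v1beta1
kind: IPAddressClaim
metadata:
  name: worker-0-claim-0
spec:
  poolRef:
    apiGroup: vsphere.platform.openshift.io
    kind: IPPool
    name: static-ip-pool-0
~~~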
ACCEPTANCE CRITERIA:
After installing static IP for vSphere IPI, the cluster should contain machines, machinesets, crd, ipaddresses and ipaddressclaims related to static IP assignment.
ENGINEERING DETAILS:
These changes should all be contained in the installer project. We will need to be sure to cover static IP for zonal and non-zonal installs. Additionally, we need to have this work for all control-plane and compute machines.
Add authentication to the internal components of the Agent Installer so that the cluster install is secure.
Requirements
Are there any requirements specific to the auth token?
Actors:
Do we need more than one auth scheme?
Agent-admin - agent-read-write
Agent-user - agent-read
Options for Implementation:
As a user, when running agent create image, agent create pxe-files and agent create config iso commands, I want to be able to:
so that I can achieve
Description of criteria:
Detail about what is specifically not being delivered in the story
This requires/does not require a design proposal.
This requires/does not require a feature gate.
As a user, when running agent create image, agent create pxe-files and agent create config iso commands, I want to be able to:
so that I can achieve
Description of criteria:
Detail about what is specifically not being delivered in the story
This requires/does not require a design proposal.
This requires/does not require a feature gate.
Traditionally we did these updates as bugfixes, because we did them after the feature freeze (FF).
Update all OCP and kubernetes libraries in storage operators to the appropriate version for OCP release.
This includes (but is not limited to):
Operators:
EOL, do not upgrade:
Update the driver to the latest upstream release. Notify QE and docs with any new features and important bugfixes that need testing or documentation.
(Using separate cards for each driver because these updates can be more complicated)
Goal
Hardware RAID support on Dell, Supermicro and HPE with Metal3.
Why is this important
Setting up RAID devices is a common operation in the hardware for OpenShift nodes. While there's been work at Fujitsu for configuring RAID in Fujitsu servers with Metal3, we don't support any generic interface with Redfish to extend this support and set it up in Metal3.
Dell, Supermicro and HPE, which are the most common hardware platforms we find in our customers' environments, are the main targets.
Goal
Hardware RAID support on Dell with Metal3.
Why is this important
Setting up RAID devices is a common operation in the hardware for OpenShift nodes. While there's been work at Fujitsu for configuring RAID in Fujitsu servers with Metal3, we don't support any generic interface with Redfish to extend this support and set it up in Metal3 for Dell, which is the most common hardware platform we find in our customers' environments.
Before implementing generic support, we need to understand the implications of enabling an interface in Metal3 to allow it on multiple hardware types.
Scope questions
While rendering BMO in https://issues.redhat.com/browse/METAL-829, the node cpu_arch was hardcoded to x86_64.
We should use bmh.Spec.Architecture instead to be more future-proof.
Requirement | Notes | isMvp? |
---|---|---|
CI - MUST be running successfully with test automation | This is a requirement for ALL features. | YES |
Release Technical Enablement | Provide necessary release enablement details and documents. | YES |
This Section:
This is part of the overall multi-release Composable OpenShift effort (OCPPLAN-9638), which is being delivered in multiple phases:
Phase 1 (OpenShift 4.11): OCPPLAN-7589 Provide a way with CVO to allow disabling and enabling of operators
Phase 2 (OpenShift 4.12): OCPPLAN-7589 Provide a way with CVO to allow disabling and enabling of operators
Phase 3 (OpenShift 4.13): OCPBU-117
Phase 4 (OpenShift 4.14): OCPSTRAT-36 (formerly OCPBU-236)
Phase 5 (OpenShift 4.15): OCPSTRAT-421 (formerly OCPBU-519)
Phase 6 (OpenShift 4.16): OCPSTRAT-731
Phase 7 (OpenShift 4.17): OCPSTRAT-1308
Questions to be addressed:
Extend OpenShift on IBM Cloud integration with additional features to pair the capabilities offered for this provider integration to the ones available in other cloud platforms.
Extend the existing features while deploying OpenShift on IBM Cloud.
This top-level feature is a placeholder for the IBM team, who are working on new features for this integration, in an effort to keep their existing internal backlog in sync with the corresponding Features/Epics in Red Hat's Jira.
A user currently is not able to create a Disconnected cluster, using IPI, on IBM Cloud.
Currently, support for BYON and Private clusters does exist on IBM Cloud, but support to override IBM Cloud Service endpoints does not exist, which is required to allow for Disconnected support to function (reach IBM Cloud private endpoints).
IBM dependent components of OCP will need to add support to use a set of endpoint override values in order to reach IBM Cloud Services in Disconnected environments.
The Image Registry components will need to allow all API calls to IBM Cloud Services to be directed to these endpoint values, in order to communicate in environments where the public or default IBM Cloud Service endpoint is not available.
The endpoint overrides are available via the infrastructure/cluster (.status.platformStatus.ibmcloud.serviceEndpoints) resource, which is how a majority of components consume cluster-specific configurations (Ingress, MAPI, etc.). It will be structured as follows:
apiVersion: config.openshift.io/v1
kind: Infrastructure
metadata:
  creationTimestamp: "2023-10-04T22:02:15Z"
  generation: 1
  name: cluster
  resourceVersion: "430"
  uid: b923c3de-81fc-4a0e-9fdb-8c4c337fba08
spec:
  cloudConfig:
    key: config
    name: cloud-provider-config
  platformSpec:
    type: IBMCloud
status:
  apiServerInternalURI: https://api-int.us-east-disconnect-21.ipi-cjschaef-dns.com:6443
  apiServerURL: https://api.us-east-disconnect-21.ipi-cjschaef-dns.com:6443
  controlPlaneTopology: HighlyAvailable
  cpuPartitioning: None
  etcdDiscoveryDomain: ""
  infrastructureName: us-east-disconnect-21-gtbwd
  infrastructureTopology: HighlyAvailable
  platform: IBMCloud
  platformStatus:
    ibmcloud:
      dnsInstanceCRN: 'crn:v1:bluemix:public:dns-svcs:global:a/fa4fd9fa0695c007d1fdcb69a982868c:f00ac00e-75c2-4774-a5da-44b2183e31f7::'
      location: us-east
      providerType: VPC
      resourceGroupName: us-east-disconnect-21-gtbwd
      serviceEndpoints:
      - name: iam
        url: https://private.us-east.iam.cloud.ibm.com
      - name: vpc
        url: https://us-east.private.iaas.cloud.ibm.com/v1
      - name: resourcecontroller
        url: https://private.us-east.resource-controller.cloud.ibm.com
      - name: resourcemanager
        url: https://private.us-east.resource-controller.cloud.ibm.com
      - name: cis
        url: https://api.private.cis.cloud.ibm.com
      - name: dnsservices
        url: https://api.private.dns-svcs.cloud.ibm.com/v1
      - name: cis
        url: https://s3.direct.us-east.cloud-object-storage.appdomain.cloud
    type: IBMCloud
The CCM is currently relying on updates to the openshift-cloud-controller-manager/cloud-conf configmap, in order to override its required IBM Cloud Service endpoints, such as:
data:
  config: |+
    [global]
    version = 1.1.0
    [kubernetes]
    config-file = ""
    [provider]
    accountID = ...
    clusterID = temp-disconnect-7m6rw
    cluster-default-provider = g2
    region = eu-de
    g2Credentials = /etc/vpc/ibmcloud_api_key
    g2ResourceGroupName = temp-disconnect-7m6rw
    g2VpcName = temp-disconnect-7m6rw-vpc
    g2workerServiceAccountID = ...
    g2VpcSubnetNames = temp-disconnect-7m6rw-subnet-compute-eu-de-1,temp-disconnect-7m6rw-subnet-compute-eu-de-2,temp-disconnect-7m6rw-subnet-compute-eu-de-3,temp-disconnect-7m6rw-subnet-control-plane-eu-de-1,temp-disconnect-7m6rw-subnet-control-plane-eu-de-2,temp-disconnect-7m6rw-subnet-control-plane-eu-de-3
    iamEndpointOverride = https://private.iam.cloud.ibm.com
    g2EndpointOverride = https://eu-de.private.iaas.cloud.ibm.com
    rmEndpointOverride = https://private.resource-controller.cloud.ibm.com
Installer validates and injects user provided endpoint overrides into cluster deployment process and the Image Registry components use specified endpoints and start up properly.
In the initial delivery of CoreOS Layering, it is required that administrators provide their own build environment to customize RHCOS images. That could be a traditional RHEL environment or potentially an enterprising administrator with some knowledge of OCP Builds could set theirs up on-cluster.
The primary virtue of an on-cluster build path is to continue using the cluster to manage the cluster. No external dependency, batteries-included.
On-cluster, automated RHCOS Layering builds are important for multiple reasons:
Allow attaching an ISO image that will be used for data on an already provisioned system using a BMH.
Currently this can be achieved using the existing BMH.Spec.Image fields, but this attempts to change the boot order of the system and relies on the host to fall back to the installed system when booting the image fails.
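One possible shape for this (hedged: the kind and fields below are assumptions, not a settled API) is a separate resource that references the BMH by name and points at the data ISO, leaving the boot order untouched:

~~~
apiVersion: metal3.io/v1alpha1
kind: DataImage           # hypothetical kind for an attach-only data ISO
metadata:
  name: worker-0          # matches the BareMetalHost name
  namespace: openshift-machine-api
spec:
  url: http://example.com/config-data.iso
~~~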
Scope questions:
Neither logging nor node.last_error is handled.
while the basic implementation of the API is done, we need to add the functionality to the redfish driver
Create a GCP cloud-specific spec.resourceTags entry in the infrastructure CRD. This should create and update tags (or labels in GCP) on any OpenShift cloud resource that we create and manage. The behaviour should also tag existing resources that do not yet have the tags, and once the tags in the infrastructure CRD are changed, all the resources should be updated accordingly.
Tag deletes continue to be out of scope, as the customer can still have custom tags applied to the resources that we do not want to delete.
Due to the ongoing intree/out of tree split on the cloud and CSI providers, this should not apply to clusters with intree providers (!= "external").
Once confident we have all components updated, we should introduce an end2end test that makes sure we never create resources that are untagged.
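For illustration, a minimal sketch of how the user-defined labels could surface on the infrastructure resource; the field names are assumptions modeled on how the other platforms expose this and may differ in the final API.

~~~
apiVersion: config.openshift.io/v1
kind: Infrastructure
metadata:
  name: cluster
status:
  platformStatus:
    type: GCP
    gcp:
      resourceLabels:          # consumed by components to label resources they create
      - key: cost-center
        value: "7536"
      - key: team
        value: platform
~~~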
Goals
Requirement | Notes | isMvp? |
---|---|---|
CI - MUST be running successfully with test automation | This is a requirement for ALL features. | YES |
Release Technical Enablement | Provide necessary release enablement details and documents. | YES |
List any affected packages or components.
The Installer creates the below list of GCP resources during the create-cluster phase, and these resources should have the user-defined tags applied.
Resources List
Resource | Terraform API |
---|---|
VM Instance | google_compute_instance |
Storage Bucket | google_storage_bucket |
Acceptance Criteria:
The enhancement proposed for GCP tags support in OCP requires machine-api-provider-gcp to add the userTags available in the status sub-resource of the infrastructure CR to the GCP virtual machine resources created.
Acceptance Criteria
This Feature covers the effort, in person-weeks of meetings in #wg-managed-ocp-versions, where OTA helped SD refine how doing the work in OCM would help, and what that OCM work might look like: https://issues.redhat.com/browse/OTA-996?focusedId=25608383&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-25608383.
Currently the ROSA/ARO versions are not managed by the OTA team.
This Feature covers the engineering effort to transfer responsibility for OCP version management in OSD, ROSA and ARO from SRE-P to OTA.
Here is the design document for the effort: https://docs.google.com/document/d/1hgMiDYN9W60BEIzYCSiu09uV4CrD_cCCZ8As2m7Br1s/edit?skip_itp2_check=true&pli=1
Here are some objectives :
Presentation from Jeremy Eder :
Epic Goal*
What is our purpose in implementing this? What new capability will be available to customers?
This epic is to transfer the responsibility of OCP version management in OSD, ROSA and ARO from SRE-P to OTA.
Why is this important? (mandatory)
What are the benefits to the customer or Red Hat? Does it improve security, performance, supportability, etc? Why is work a priority?
Scenarios (mandatory)
Provide details for user scenarios including actions to be performed, platform specifications, and user personas.
Dependencies (internal and external) (mandatory)
What items must be delivered by other teams/groups to enable delivery of this epic.
Contributing Teams(and contacts) (mandatory)
Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.
Acceptance Criteria (optional)
Provide some (testable) examples of how we will know if we have achieved the epic goal.
Drawbacks or Risk (optional)
Reasons we should consider NOT doing this such as: limited audience for the feature, feature will be superseded by other work that is planned, resulting feature will introduce substantial administrative complexity or user confusion, etc.
Done - Checklist (mandatory)
The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.
To deliver RFE-5160 without passing data through the customer-accessible ClusterVersion resource (which lives in the hosted Kubernetes API), the cluster-version operator should grow a new command-line switch that allows its caller to override ClusterVersion with a custom upstream.
Definition of done:
Or similar knob to deliver RFE-5160, attaching the new parameter to OTA-1210's command-line flag.
Definition of done:
Management cluster admins can set spec.updateService on HostedCluster and have their status.version.availableUpdates and similar populated based on advice from that upstream. At no point in the implementation is the update service data pulled from anywhere accessible from the hosted cluster.
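A sketch of the management-side configuration described above; the spec.updateService field name follows this description and should be treated as an assumption until the API change merges.

~~~
apiVersion: hypershift.openshift.io/v1beta1
kind: HostedCluster
metadata:
  name: example
  namespace: clusters
spec:
  # Hypothetical value pointing at an update service reachable from the management cluster only
  updateService: https://osus.mgmt.example.com/api/upgrades_info/v1/graph
  release:
    image: quay.io/openshift-release-dev/ocp-release:4.16.0-x86_64
~~~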
ABI uses the assisted installer + kube-api (or any other client that communicates with the service); all the building blocks related to day2 installation exist in those components.
The assisted installer can create an installed cluster and use it to perform day2 operations.
A doc that explains how it's done with kube-api
Parameters that are required from the user:
Actions required from the user
To keep a similar flow between day1 and day2, I suggest running the service on each node that the user is trying to add; it will create the cluster definition and start the installation, and after the first reboot it will pull the ignition from the day1 cluster.
It will need to create the container, set it up with an appropriate kubeconfig, extract and appropriately direct the output (ISO+any errors or streamed status), and delete the container again when complete. Initially could be a script, but potentially could be implemented directly in the code. To be distributed within the installer image
Deploy a command to generate a suitable ISO for adding a node to an existing cluster
Not all the required info is provided by the user (in reality, we want to minimize as much as possible the amount of configuration provided by the user). Some of the required info needs to be extracted from the existing cluster, or from the existing kubeconfig. A dedicated asset could be useful for such an operation.
The ignition asset currently assembles the ignition file with the required files and services to install a cluster. In the add-node case, this needs to be modified to support the new workflow.
A new workflow will be required to talk to assisted-service to import an existing cluster / add the node. New services could be required in the ignition assets to handle that properly.
Usually the manifests assets (ClusterImageSet / AgentPullSecret / InfraEnv / NMStateConfig / AgentClusterInstall / ClusterDeployment) depend on OptionalInstallConfig (or possibly a file in the asset dir, in the case of ZTP manifests). We'll need to change the assets code so that it is possible to retrieve the required info from the ClusterInfo asset instead of OptionalInstallConfig. This may impact the asset framework itself.
Another approach could be to stick this info directly into OptionalInstallConfig, if possible
Create a new asset to manage the content of the nodes-config.yaml file
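As a rough illustration of what this asset might manage; the nodes-config.yaml schema is exactly what this story will define, so every field name below is an assumption.

~~~
hosts:
- hostname: extra-worker-0
  interfaces:
  - name: eth0
    macAddress: 00:ef:44:21:e6:a5
- hostname: extra-worker-1
  interfaces:
  - name: eth0
    macAddress: 00:ef:44:21:e6:a6
~~~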
The two commands, one for adding the nodes (ISO generation) and the other to monitor the process, should be exposed by a new CLI tool (name to be defined) built using the installer source. This task will be used to add the main of the CLI tool and the two (empty) command entry points.
Networking Definition of Planned
Epic Template descriptions and documentation
With ovn-ic we have multiple actors (zones) setting status on some CRs. We need to make sure individual zone statuses are reported and then optionally merged into a single status.
Without that change, zones will overwrite each other's statuses.
Additional information on each of the above items can be found here: Networking Definition of Planned
The MCO does not reap any old rendered machineconfigs. Each time a user or controller applies a new machineconfig, there are new rendered configs for each affected machineconfigpool. Over time, this leads to a very large number of rendered configs which are a UX annoyance but also could possibly contribute to space and performance issues with etcd.
Administrators should have a simple way to set a maximum number of rendered configs to maintain, but there should also be a minimum set as there are many cases where support or engineering needs to be able to look back at previous configs.
This story involves implementing the main "deletion" command for rendered machineconfigs under the oc adm prune subcommand. It will only support deleting rendered MCs that are not in use.
It would support the following options to start:
Important: If the admin specifies options that select any rendered MCs that are in use by an MCP, those should not be deleted. In such cases, the output should indicate why the rendered MC has been skipped over for deletion.
Sample user workflow:
$ oc adm prune renderedmachineconfigs --pool-name=worker
# lists and deletes all unused rendered MCs for the worker pool in a dry run mode

$ oc adm prune renderedmachineconfigs --count=10 --pool-name=worker
# lists and deletes the 10 oldest unused rendered MCs for the worker pool in a dry run mode

$ oc adm prune renderedmachineconfigs --count=10 --pool-name=worker --confirm
# actually deletes the rendered configs with the above options
Cluster administrators need an in-product experience to discover and install new Red Hat offerings that can add high value to developer workflows.
Requirements | Notes | IS MVP |
Discover new offerings in Home Dashboard | Y | |
Access details outlining value of offerings | Y | |
Access step-by-step guide to install offering | N | |
Allow developers to easily find and use newly installed offerings | Y | |
Support air-gapped clusters | Y |
< What are we making, for who, and why/what problem are we solving?>
Discovering solutions that are not available for installation on cluster
No known dependencies
Background, and strategic fit
None
Quick Starts
Cluster admins need to be guided to install RHDH on the cluster.
Enable admins to discover RHDH, be guided to installing it on the cluster, and verifying its configuration.
RHDH is a key multi-cluster offering for developers. This will enable customers to self-discover and install RHDH.
RHDH operator
Description of problem:
The OpenShift Console QuickStart that promotes RHDH was written in generic terms and doesn't include information on how to use the CRD-based installation.
We have removed this specific information because the operator wasn't ready at that time. As soon as the RHDH operator is available in the OperatorHub we should update the QuickStarts with some more detailed information.
With a simple CR example and some info on how to customize the base URL or colors.
Version-Release number of selected component (if applicable):
4.15
How reproducible:
Always
Steps to Reproduce:
Just navigate to Quick starts and select the "Install Red Hat Developer Hub (RHDH) with an Operator" quick starts
Actual results:
The RHDH Operator Quick start exists but is written in a generic way.
Expected results:
The RHDH Operator Quick start should contain some more specific information.
Additional info:
Initial PR: https://github.com/openshift/console-operator/pull/806
Description of problem:
The OpenShift Console QuickStarts promotes RHDH but also includes Janus IDP information.
The Janus IDP quick starts should be removed and all information about Janus IDP should be removed.
Version-Release number of selected component (if applicable):
4.15
How reproducible:
Always
Steps to Reproduce:
Just navigate to Quick starts and select the "Install Red Hat Developer Hub (RHDH) with an Operator" quick starts
Actual results:
Expected results:
Additional info:
Initial PR: https://github.com/openshift/console-operator/pull/806
We need to ensure we have parity with OCP and support heterogeneous clusters
https://github.com/openshift/enhancements/pull/1014
As a user of the HyperShift CLI, I would like the CLI to fail early if these conditions are all true:
so that we can prevent a HostedCluster from being created that will have errors due to mismatches between the release image, the management cluster's CPU architecture, and the NodePool's CPU architecture.
This should be done for the API as well but will be covered thru HOSTEDCP-1105.
This requires/does not require a design proposal.
This requires/does not require a feature gate.
As a user of the HCP CLI, I want to be able to specify a NodePool's arch via a flag so that I can easily create NodePools of different CPU architectures in AWS HostedClusters.
Other CPU arches are not being considered for the arch flag since they are unavailable in AWS.
This requires/does not require a design proposal.
This requires/does not require a feature gate.
As a user of HyperShift, I would like the API to fail early if these conditions are all true:
so that we can prevent a HostedCluster from continuing to be created that will have errors due to mismatches between the release image, the management cluster's CPU architecture, and the NodePool's CPU architecture.
This should be done for the CLI as well but will be covered thru HOSTEDCP-1104.
This requires/does not require a design proposal.
This requires/does not require a feature gate.
Using a multi-arch NodePool requires the HC to be multi-arch as well. This is a good recipe for letting users shoot themselves in the foot. We need to automate the required input via the CLI for multi-arch NodePools to work, e.g. on HC creation enable a multi-arch flag which sets the right release image.
Acceptance Criteria:
Description of criteria:
Detail about what is specifically not being delivered in the story
This requires/does not require a design proposal.
This requires/does not require a feature gate.
Using a multi-arch NodePool requires the HC to be multi-arch as well. This is a good recipe for letting users shoot themselves in the foot. We need to automate the required input via the CLI for multi-arch NodePools to work, e.g. on HC creation enable a multi-arch flag which sets the right release image.
Acceptance Criteria:
As a user of multi-arch HyperShift, I would like a CEL validation to be added to the NodePool types to prevent the arch field from being changed from `amd64` when the platform is not supported (AWS is currently the only supported platform).
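A sketch of what such a rule could look like as an x-kubernetes-validations entry on the NodePool spec schema; the exact CEL expression and its placement are assumptions, not the final validation.

~~~
x-kubernetes-validations:
- rule: "self.platform.type == 'AWS' || self.arch == 'amd64'"
  message: "arch may only be set to a non-amd64 value on supported platforms (currently AWS)"
~~~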
Description of criteria:
Detail about what is specifically not being delivered in the story
This requires/does not require a design proposal.
This requires/does not require a feature gate.
Migrate every occurrence of iptables in OpenShift to use nftables, instead.
Implement a full migration from iptables to nftables within a series of "normal" upgrades of OpenShift with the goal of not causing any more network disruption than would normally be required for an OpenShift upgrade. (Different components may migrate from iptables to nftables in different releases; no coordination is needed between unrelated components.)
Template:
Networking Definition of Planned
Epic Template descriptions and documentation
Additional information on each of the above items can be found here: Networking Definition of Planned
Consolidated Enhancement of HyperShift/KubeVirt Provider Post GA
This feature aims to provide a comprehensive enhancement to the HyperShift/KubeVirt provider integration post its GA release.
By consolidating CSI plugin improvements, core improvements, and networking enhancements, we aim to offer a more robust, efficient, and user-friendly experience.
Post GA quality of life improvements for the HyperShift/KubeVirt core
Who | What | Reference |
---|---|---|
DEV | Upstream roadmap issue (or individual upstream PRs) | <link to GitHub Issue> |
DEV | Upstream documentation merged | <link to meaningful PR> |
DEV | gap doc updated | <name sheet and cell> |
DEV | Upgrade consideration | <link to upgrade-related test or design doc> |
DEV | CEE/PX summary presentation | label epic with cee-training and add a <link to your support-facing preso> |
QE | Test plans in Polarion | <link or reference to Polarion> |
QE | Automated tests merged | <link or reference to automated tests> |
DOC | Downstream documentation merged | <link to meaningful PR> |
Currently there is no option to influence the placement of the VMs of a hosted cluster with the kubevirt provider. The existing NodeSelector in HostedCluster influences only the pods in the hosted control plane namespace.
The goal is to introduce a new field in the .spec.platform.kubevirt stanza in NodePool for a node selector, propagate it to the VirtualMachineSpecTemplate, and expose this in the hypershift and hcp CLIs.
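A sketch of the proposed NodePool shape; the nodeSelector placement under .spec.platform.kubevirt follows the description above and should be treated as an assumption until the API merges.

~~~
apiVersion: hypershift.openshift.io/v1beta1
kind: NodePool
metadata:
  name: example-pool
  namespace: clusters
spec:
  platform:
    type: KubeVirt
    kubevirt:
      nodeSelector:
        topology.kubernetes.io/zone: zone-a   # propagated to the VirtualMachine template
~~~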
The RBAC required for external infra needs to be documented on this page.
https://hypershift-docs.netlify.app/how-to/kubevirt/external-infrastructure/
Technology Preview of the oc mirror enclaves support (Dev Preview was OCPSTRAT-765 in OpenShift 4.15).
Feature description
oc-mirror already focuses on mirroring content to disconnected environments for installing and upgrading OCP clusters.
This specific feature addresses use cases where mirroring is needed for several enclaves (disconnected environments), that are secured behind at least one intermediate disconnected network.
In this context, enclave users are interested in:
This epic covers the work for RFE-3800 (includes RFE-3393 and RFE-3733) for mirroring operators and additonal images
The full description / overview of the enclave support is best described here
The design document can be found here
Architecture Overview (diagram)
User Stories
All user stories are in the form :
Acceptance Criteria
OCI for ibm
Overview
Consider this as part of the separate discussions and design of the upgrade path/introspection tool
Acceptance Criteria
Tasks
Acceptance Criteria
Tasks
I need an interface and implementation that can read the history file and present it in a map of digests
If the archived content exceeds the max size specified, oc-mirror shall generate as many archive chunks as needed.
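For reference, a minimal ImageSetConfiguration showing where the maximum size is expressed, assuming the existing v1alpha2 archiveSize field continues to drive the chunking:

~~~
kind: ImageSetConfiguration
apiVersion: mirror.openshift.io/v1alpha2
archiveSize: 4   # maximum size of each generated archive chunk, in GiB
mirror:
  platform:
    channels:
    - name: stable-4.15
~~~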
Clear out all tars before starting the next mirrorToDisk, so that newly generated archives are not mixed with archives generated in previous runs.
To reduce the amount of change for end users, we should consider reusing the existing --dry-run flag.
Today the --help output is not aligned with how v2 works. It needs to be updated to avoid confusing customers.
Acceptance Criteria
Tasks
Acceptance Criteria
Tasks
Overview
Ensure that the current v2 respects the v1 TargetCatalog and TargetTag fields (if set) for oci catalog and registry catalogs.
Also TargetCatalog and TargetTag should not be mutually exclusive.
Invalid example:
kind: ImageSetConfiguration
apiVersion: mirror.openshift.io/v1alpha2
mirror:
  operators:
  - catalog: registry.redhat.io/redhat/redhat-operator-index:v4.14
    targetCatalog: abc/def:v5.5
    packages:
    - name: aws-load-balancer-operator
Valid example:
kind: ImageSetConfiguration
apiVersion: mirror.openshift.io/v1alpha2
mirror:
  operators:
  - catalog: registry.redhat.io/redhat/redhat-operator-index:v4.14
    targetCatalog: abc/def
    targetTag: v4.4
    packages:
    - name: aws-load-balancer-operator
Acceptance Criteria
Tasks
Acceptance Criteria
Tasks
Acceptance Criteria
Tasks
Consume the newly introduced API and apply the scheduling configuration (taints and node selectors) to network-check-source and network-check-target.
In order to avoid an increased support overhead once the license changes at the end of the year, we should replace the instances in which metal IPI uses Terraform.
When we used Terraform to provision the control plane, the Terraform deployment could eventually time out and report an error. The installer was monitoring the Terraform output and could pass the error on to the user, e.g.
level=debug msg=ironic_node_v1.openshift-master-host[0]: Still creating... [2h9m1s elapsed]
level=error
level=error msg=Error: could not inspect: inspect failed , last error was 'timeout reached while inspecting the node'
level=error
level=error msg= with ironic_node_v1.openshift-master-host[2],
level=error msg= on main.tf line 13, in resource "ironic_node_v1" "openshift-master-host":
level=error msg= 13: resource "ironic_node_v1" "openshift-master-host" {
Now that provisioning is managed by Metal³, we have nothing monitoring it for errors:
level=info msg=Waiting up to 1h0m0s (until 1:05AM UTC) for bootstrapping to complete...
level=debug msg=Bootstrap status: complete
By this stage the bootstrap API is up (and this is a requirement for BMO to do its thing). The installer is capable of monitoring the API for the appearance of the bootstrap complete ConfigMap, so it is equally capable of monitoring the BaremetalHost status. This should actually be an improvement on Terraform, as we can monitor in real time as the hosts progress through the various stages, and report on errors and retries.
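An illustrative excerpt of the BareMetalHost status fields the installer could watch to report progress and surface errors in real time (the values below are examples, not output from a real run):

~~~
status:
  operationalStatus: error
  errorType: inspection error
  errorMessage: timeout reached while inspecting the node
  provisioning:
    state: inspecting
~~~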
We will re-implement the functionality of terraform-provider-libvirt using Go libraries.
What we do in static mode, we need to do in k8s mode.
As a result of Hashicorp's license change to BSL, Red Hat OpenShift needs to remove the use of Hashicorp's Terraform from the installer – specifically for IPI deployments which currently use Terraform for setting up the infrastructure.
To avoid an increased support overhead once the license changes at the end of the year, we want to provision vSphere infrastructure without the use of Terraform.
Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.
Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.
High-level list of items that are out of scope. Initial completion during Refinement status.
Provide any additional context is needed to frame the feature. Initial completion during Refinement status.
Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.
Provide information that needs to be considered and planned so that documentation will meet customer needs. If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.
Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.
OCP/Telco Definition of Done
Epic Template descriptions and documentation.
level=info msg=Running process: vsphere infrastructure provider with args [-v=2 --metrics-bind-addr=0 --health-addr=127.0.0.1:37167 --webhook-port=38521 --webhook-cert-dir=/tmp/envtest-serving-certs-445481834 --leader-elect=false] and env [...]
may contain sensitive data (passwords, logins, etc.) and should be filtered. It causes password leaking in CI.
Replace https://github.com/openshift/installer/tree/master/upi/vsphere with powercli. Keep terraform in place until powercli installations are working.
example of updates to be made to the upi image:
~~~
FROM upi-installer-image
RUN curl https://packages.microsoft.com/config/rhel/8/prod.repo | tee /etc/yum.repos.d/microsoft.repo
RUN yum install -y powershell
RUN pwsh -Command 'Install-Module VMware.PowerCLI -Force -Scope CurrentUser'
~~~
Description of problem:
the installer downloads the RHCOS image locally to the cache multiple times when using failure domains
Version-Release number of selected component (if applicable):
4.16
How reproducible:
always
Steps to Reproduce:
1. 2. 3.
Actual results:
time="2024-02-22T14:34:42-05:00" level=debug msg="Generating Cluster..." time="2024-02-22T14:34:42-05:00" level=warning msg="FeatureSet \"CustomNoUpgrade\" is enabled. This FeatureSet does not allow upgrades and may affect the supportability of the cluster." time="2024-02-22T14:34:42-05:00" level=info msg="Creating infrastructure resources..." time="2024-02-22T14:34:43-05:00" level=debug msg="Obtaining RHCOS image file from 'https://rhcos.mirror.openshift.com/art/storage/prod/streams/4.16-9.4/builds/416.94.202402130130-0/x86_64/rhcos-416.94.202402130130-0-vmware.x86_64.ova?sha256=fb823b5c31afbb0b9babd46550cb9fe67109a46f48c04ce04775edbb6a309dd3'" time="2024-02-22T14:34:43-05:00" level=debug msg="The file was found in cache: /home/ngirard/.cache/openshift-installer/image_cache/rhcos-416.94.202402130130-0-vmware.x86_64.ova. Reusing..." time="2024-02-22T14:36:02-05:00" level=debug msg="Obtaining RHCOS image file from 'https://rhcos.mirror.openshift.com/art/storage/prod/streams/4.16-9.4/builds/416.94.202402130130-0/x86_64/rhcos-416.94.202402130130-0-vmware.x86_64.ova?sha256=fb823b5c31afbb0b9babd46550cb9fe67109a46f48c04ce04775edbb6a309dd3'" time="2024-02-22T14:36:02-05:00" level=debug msg="The file was found in cache: /home/ngirard/.cache/openshift-installer/image_cache/rhcos-416.94.202402130130-0-vmware.x86_64.ova. Reusing..." time="2024-02-22T14:37:22-05:00" level=debug msg="Obtaining RHCOS image file from 'https://rhcos.mirror.openshift.com/art/storage/prod/streams/4.16-9.4/builds/416.94.202402130130-0/x86_64/rhcos-416.94.202402130130-0-vmware.x86_64.ova?sha256=fb823b5c31afbb0b9babd46550cb9fe67109a46f48c04ce04775edbb6a309dd3'" time="2024-02-22T14:37:22-05:00" level=debug msg="The file was found in cache: /home/ngirard/.cache/openshift-installer/image_cache/rhcos-416.94.202402130130-0-vmware.x86_64.ova. Reusing..." time="2024-02-22T14:38:39-05:00" level=debug msg="Obtaining RHCOS image file from 'https://rhcos.mirror.openshift.com/art/storage/prod/streams/4.16-9.4/builds/416.94.202402130130-0/x86_64/rhcos-416.94.202402130130-0-vmware.x86_64.ova?sha256=fb823b5c31afbb0b9babd46550cb9fe67109a46f48c04ce04775edbb6a309dd3'" time="2024-02-22T14:38:39-05:00" level=debug msg="The file was found in cache: /home/ngirard/.cache/openshift-installer/image_cache/rhcos-416.94.202402130130-0-vmware.x86_64.ova. Reusing..." time="2024-02-22T14:39:33-05:00" level=debug msg="Obtaining RHCOS image file from 'https://rhcos.mirror.openshift.com/art/storage/prod/streams/4.16-9.4/builds/416.94.202402130130-0/x86_64/rhcos-416.94.202402130130-0-vmware.x86_64.ova?sha256=fb823b5c31afbb0b9babd46550cb9fe67109a46f48c04ce04775edbb6a309dd3'" time="2024-02-22T14:39:33-05:00" level=debug msg="The file was found in cache: /home/ngirard/.cache/openshift-installer/image_cache/rhcos-416.94.202402130130-0-vmware.x86_64.ova. Reusing..." time="2024-02-22T14:39:33-05:00" level=error msg="failed to fetch Cluster: failed to generate asset \"Cluster\": failed to create cluster: failed during pre-provisioning: unable to initialize folders and templates: failed to import ova: failed to lease wait: The name 'ngirard-dev-pr89z-rhcos-us-west-us-west-1a' already exists."
Expected results:
should only download once
Additional info:
OKD requires guestinfo domain and stealclock accounting.
USER STORY:
The assisted installer should be able to use CAPI-based vsphere installs without requiring access to vcenter.
DESCRIPTION:
The installer makes calls to vcenter to determine the networks, which are required for CAPI based installs, but vcenter access is not guaranteed in the assisted installer.
See:
which were lovingly lifted from this slack thread.
Required:
In cases where the installer calls vcenter to obtain values to populate manifests, the installer should leave empty fields (or a default value) if it is unable to access vcenter. It should produce partial manifests, rather than throw an error.
Nice to have:
...
ACCEPTANCE CRITERIA:
Continued compatibility with agent installer, particularly producing capi manifests when access to vcenter fails.
ENGINEERING DETAILS:
WARNING FeatureSet "CustomNoUpgrade" is enabled. This FeatureSet does not allow upgrades and may affect the supportability of the cluster.
INFO Creating infrastructure resources...
DEBUG Obtaining RHCOS image file from 'https://rhcos.mirror.openshift.com/art/storage/prod/streams/4.16-9.4/builds/416.94.202402130130-0/x86_64/rhcos-416.94.202402130130-0-vmware.x86_64.ova?sha256=fb823b5c31afbb0b9babd46550cb9fe67109a46f48c04ce04775edbb6a309dd3'
DEBUG The file was found in cache: /home/jcallen/.cache/openshift-installer/image_cache/rhcos-416.94.202402130130-0-vmware.x86_64.ova. Reusing...
<very long pause here with no indication>
As a result of Hashicorp's license change to BSL, Red Hat OpenShift needs to remove the use of Hashicorp's Terraform from the installer – specifically for IPI deployments which currently use Terraform for setting up the infrastructure.
To avoid an increased support overhead once the license changes at the end of the year, we want to provision Azure infrastructure without the use of Terraform.
Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.
Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.
High-level list of items that are out of scope. Initial completion during Refinement status.
Provide any additional context is needed to frame the feature. Initial completion during Refinement status.
Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.
Provide information that needs to be considered and planned so that documentation will meet customer needs. If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.
Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.
OCP/Telco Definition of Done
Epic Template descriptions and documentation.
Launch a cluster provisioned with CAPI using a minimal (mostly default) install config.
As a developer, I want to:
so that I can achieve
Description of criteria:
This requires/does not require a design proposal.
This requires/does not require a feature gate.
The `image` field in the AzureMachineSpec needs to point to an RHCOS image. For marketplace images, those images should already be available.
For non-marketplace images, we need to create an image for the users, using the VHD from the RHCOS stream.
The image could be created in the PreProvision hook: https://github.com/openshift/installer/blob/master/pkg/infrastructure/clusterapi/types.go#L26
Technically it could also be done in the InfraAvailable hook, if that is needed.
Create private zone and DNS records in the resource group specified by baseDomainResourceGroupName. The records should be cleaned up with destroy cluster.
The ControlPlaneEndpoint will be available in the Cluster spec and can be used to populate the DNS records.
Currently we create both A and CNAME records in different scenarios: https://github.com/openshift/installer/blob/master/data/data/azure/cluster/dns/dns.tf
Ideally we do this in the InfraReady hook, before machine creation, so that control plane machines can pull ignition immediately.
The install config allows users to specify a `diskEncryptionSet` in machinepools.
CAPZ has existing support for disk encryption sets:
Note that CAPZ says the encryption set must belong to the same subscription, whereas our docs may not indicate that. We should point this out to the docs team.
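For reference, the existing install-config shape for a machine pool's disk encryption set (values are placeholders; note the same-subscription constraint mentioned above):

~~~
controlPlane:
  platform:
    azure:
      osDisk:
        diskEncryptionSet:
          subscriptionId: 00000000-0000-0000-0000-000000000000
          resourceGroup: des-resource-group
          name: my-disk-encryption-set
~~~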
In the ignition hook, we need to upload the bootstrap ignition data to the bootstrap storage account (created to hold the rhcos image). Then create an ignition stub containing the SAS link for the object.
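A hedged sketch of the kind of stub this describes, shown in YAML form for readability even though the actual stub would be JSON; the storage account name and SAS token are placeholders.

~~~
ignition:
  version: 3.2.0
  config:
    replace:
      # Points the machine at the full bootstrap ignition object via its SAS link
      source: https://<storage-account>.blob.core.windows.net/ignition/bootstrap.ign?<SAS token>
~~~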
As a result of Hashicorp's license change to BSL, Red Hat OpenShift needs to remove the use of Hashicorp's Terraform from the installer – specifically for IPI deployments which currently use Terraform for setting up the infrastructure.
To avoid an increased support overhead once the license changes at the end of the year, we want to provision Nutanix infrastructure without the use of Terraform.
Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.
Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.
High-level list of items that are out of scope. Initial completion during Refinement status.
Provide any additional context that is needed to frame the feature. Initial completion during Refinement status.
Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.
Provide information that needs to be considered and planned so that documentation will meet customer needs. If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.
Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.
Console enhancements based on customer RFEs that improve customer user experience.
Requirement | Notes | isMvp? |
---|---|---|
CI - MUST be running successfully with test automation | This is a requirement for ALL features. | YES |
Release Technical Enablement | Provide necessary release enablement details and documents. | YES |
This Section:
This Section: What does the person writing code, testing, documenting need to know? What context can be provided to frame this feature.
Questions to be addressed:
RFE: https://issues.redhat.com/browse/RFE-1772
In OpenShift v3 we have been displaying the pod's last termination state, and we need to display the same info in v4. Here is the v3 HTML interpretation of the kubernetes-object-describe-container-state directive.
AC: Add the Last State logic to the pod's details page by rewriting the v3 implementation.
Based on this old feature request
https://issues.redhat.com/browse/RFE-1530
we do have impersonation in place for gaining access to other users' permissions via the console. But the only documentation we currently have covers how to impersonate system:admin via the CLI; see
https://docs.openshift.com/container-platform/4.14/authentication/impersonating-system-admin.html
Please provide documentation for the console feature and the required prerequisites for the users/groups accordingly.
AC:
More info on the impersonate access role - https://github.com/openshift/console/pull/13345/files
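Part of the prerequisites to document is the RBAC that grants impersonation; a minimal sketch of such a role (names are illustrative, and the exact set of resources the console needs may be broader):
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: console-impersonator
rules:
- apiGroups: [""]
  resources: ["users", "groups"]
  verbs: ["impersonate"]
Bound to a user or group with a ClusterRoleBinding, this is roughly what lets that user impersonate others from the console.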
Implement a toast notification feature in the console UI to notify the user that their action for creating/updating a resource violated a warn policy even though the request was allowed.
See the Configure OpenShift Console to display warnings from apiserver when creating/updating resources spike for how to reproduce the warn policy response in the `oc` CLI; a minimal reproduction sketch also follows the A.C. below.
A.C.
Display a warning toast notification after create/update resource action for a resource
Add support for returning `response.header` in `consoleFetchCommon` function in the dynamic-plugin-sdk package
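As an illustration of the warn-policy response in question (not part of the A.C.), a pod-security warn label on a namespace is one easy way to get the apiserver to attach Warning headers, which oc prints to stderr:
$ oc create namespace warn-demo
$ oc label namespace warn-demo pod-security.kubernetes.io/warn=restricted
$ oc run demo --image=registry.access.redhat.com/ubi9/ubi -n warn-demo
If the resulting pod spec does not satisfy the restricted profile, the create still succeeds but oc prints a line starting with "Warning:"; the console would surface the same response header as a toast.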
Problem:
The `consoleFetchCommon` function in the dynamic-plugin-sdk package lacks support for retrieving the HTTP `response.header`.
Justification:
The policy warning responses are visible in the `oc` CLI but not on the console UI. The customer wants similar behavior on the UI. The policy warning responses are returned in the HTTP `response.header`, which the `consoleFetchCommon` function does not currently expose.
Proposed Solution
Add logic for extracting all `response.headers` along with `response.json` in the `consoleFetchCommon` function using `options` or something else.
A.C.
Add an option parameter to `consoleFetchCommon` to conditionally return the full `response` or `response.json`, so that k8s functions like `k8sCreate` can consume either, preventing a breaking change for dynamic plugins
A guest cluster can use an external OIDC token issuer. This will allow machine-to-machine authentication workflows
A guest cluster can configure OIDC providers to support the current capability: https://kubernetes.io/docs/reference/access-authn-authz/authentication/#openid-connect-tokens and the future capability: https://github.com/kubernetes/kubernetes/blob/2b5d2cf910fd376a42ba9de5e4b52a53b58f9397/staging/src/k8s.io/apiserver/pkg/apis/apiserver/types.go#L164 with an API that
A list of specific needs or objectives that a feature must deliver in order to be considered complete. Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc. Initial completion during Refinement status.
Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.
Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.
High-level list of items that are out of scope. Initial completion during Refinement status.
Provide any additional context that is needed to frame the feature. Initial completion during Refinement status.
Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.
Provide information that needs to be considered and planned so that documentation will meet customer needs. If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.
Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.
A guest cluster can use an external OIDC token issuer. This will allow machine-to-machine authentication workflows
A guest cluster can configure OIDC providers to support the current capability: https://kubernetes.io/docs/reference/access-authn-authz/authentication/#openid-connect-tokens and the future capability: https://github.com/kubernetes/kubernetes/blob/2b5d2cf910fd376a42ba9de5e4b52a53b58f9397/staging/src/k8s.io/apiserver/pkg/apis/apiserver/types.go#L164 with an API that
Description of problem:
When issuerCertificateAuthority is set, kube-apiserver pod is CrashLoopBackOff. Tried RCA debugging, found the cause is: the path /etc/kubernetes/certs/oidc-ca/ca.crt is incorrect. The expected path should be /etc/kubernetes/certs/oidc-ca/ca-bundle.crt .
Version-Release number of selected component (if applicable):
4.16.0-0.nightly-2024-03-13-061822
How reproducible:
Always
Steps to Reproduce:
1. Create fresh HCP cluster. 2. Create keycloak as OIDC server exposed as a Route which uses cluster's default ingress certificate as the serving certificate. 3. Configure clients necessarily on keycloak admin UI. 4. Configure external OIDC: $ oc create configmap keycloak-oidc-ca --from-file=ca-bundle.crt=router-ca/ca.crt --kubeconfig $MGMT_KUBECONFIG -n clusters $ oc patch hc $HC_NAME -n clusters --kubeconfig $MGMT_KUBECONFIG --type=merge -p=" spec: configuration: authentication: oidcProviders: - claimMappings: groups: claim: groups prefix: 'oidc-groups-test:' username: claim: email prefixPolicy: Prefix prefix: prefixString: 'oidc-user-test:' issuer: audiences: - $AUDIENCE_1 - $AUDIENCE_2 issuerCertificateAuthority: name: keycloak-oidc-ca issuerURL: $ISSUER_URL name: keycloak-oidc-server oidcClients: - clientID: $CONSOLE_CLIENT_ID clientSecret: name: $CONSOLE_CLIENT_SECRET_NAME componentName: console componentNamespace: openshift-console type: OIDC " 5. Check pods should be renewed, but new pod is CrashLoopBackOff: $ oc get po -n clusters-$HC_NAME --kubeconfig $MGMT_KUBECONFIG --sort-by metadata.creationTimestamp | tail -n 4 openshift-apiserver-65f8c5f545-x2vdf 3/3 Running 0 5h8m community-operators-catalog-57dd5886f7-jq25f 1/1 Running 0 4h1m kube-apiserver-5d75b5b848-c9c8r 4/5 CrashLoopBackOff 25 (3m9s ago) 107m $ oc logs --timestamps -n clusters-$HC_NAME --kubeconfig $MGMT_KUBECONFIG -c kube-apiserver kube-apiserver-5d75b5b848-gk2t8 ... 2024-03-18T09:11:14.836540684Z I0318 09:11:14.836495 1 dynamic_cafile_content.go:119] "Loaded a new CA Bundle and Verifier" name="client-ca-bundle::/etc/kubernetes/certs/client-ca/ca.crt" 2024-03-18T09:11:14.837725839Z E0318 09:11:14.837695 1 run.go:74] "command failed" err="jwt[0].issuer.certificateAuthority: Invalid value: \"<omitted>\": data does not contain any valid RSA or ECDSA certificates"
Actual results:
5. New kube-apiserver pod is CrashLoopBackOff. `oc explain` for issuerCertificateAuthority says the configmap data should use ca-bundle.crt. But I also tried to use ca.crt in configmap's data, got same result.
Expected results:
6. No CrashLoopBackOff.
Additional info:
Below is my RCA for the CrashLoopBackOff kube-apiserver pod:
Check if it is valid RSA certificate, it is valid:
$ openssl x509 -noout -text -in router-ca/ca.crt | grep -i rsa Signature Algorithm: sha256WithRSAEncryption Public Key Algorithm: rsaEncryption Signature Algorithm: sha256WithRSAEncryption
So, the CA certificate has no issue.
Above pod logs show "/etc/kubernetes/certs/oidc-ca/ca.crt" is used. Double checked the configmap:
$ oc get cm auth-config -n clusters-$HC_NAME --kubeconfig $MGMT_KUBECONFIG -o jsonpath='{.data.auth\.json}' | jq | ~/auto/json2yaml.sh --- kind: AuthenticationConfiguration apiVersion: apiserver.config.k8s.io/v1alpha1 jwt: - issuer: url: https://keycloak-keycloak.apps..../realms/master certificateAuthority: "/etc/kubernetes/certs/oidc-ca/ca.crt" ...
Then debug the CrashLoopBackOff pod:
The used path /etc/kubernetes/certs/oidc-ca/ca.crt does not exist! The correct path should be /etc/kubernetes/certs/oidc-ca/ca-bundle.crt:
$ oc debug -n clusters-$HC_NAME --kubeconfig $MGMT_KUBECONFIG -c kube-apiserver kube-apiserver-5d75b5b848-gk2t8 Starting pod/kube-apiserver-5d75b5b848-gk2t8-debug-kpmlf, command was: hyperkube kube-apiserver --openshift-config=/etc/kubernetes/config/config.json -v2 --encryption-provider-config=/etc/kubernetes/secret-encryption/config.yaml sh-5.1$ cat /etc/kubernetes/certs/oidc-ca/ca.crt cat: /etc/kubernetes/certs/oidc-ca/ca.crt: No such file or directory sh-5.1$ ls /etc/kubernetes/certs/oidc-ca/ ca-bundle.crt sh-5.1$ cat /etc/kubernetes/certs/oidc-ca/ca-bundle.crt -----BEGIN CERTIFICATE----- MIIDPDCCAiSgAwIBAgIIM3E0ckpP750wDQYJKoZIhvcNAQELBQAwJjESMBAGA1UE ...
As a (user persona), I want to be able to:
so that I can achieve
Description of criteria:
Detail about what is specifically not being delivered in the story
This requires/does not require a design proposal.
This requires/does not require a feature gate.
$ oc logs --previous --timestamps -n openshift-console console-64df9b5bcb-8h8xk
2024-03-22T11:17:07.824396015Z I0322 11:17:07.824332 1 main.go:210] The following console plugins are enabled:
2024-03-22T11:17:07.824574844Z I0322 11:17:07.824558 1 main.go:212] - monitoring-plugin
2024-03-22T11:17:07.824613918Z W0322 11:17:07.824603 1 authoptions.go:99] Flag inactivity-timeout is set to less then 300 seconds and will be ignored!
2024-03-22T11:22:07.828873678Z I0322 11:22:07.828819 1 main.go:634] Binding to [::]:8443...
2024-03-22T11:22:07.828982852Z I0322 11:22:07.828967 1 main.go:636] using TLS
2024-03-22T11:22:07.833771847Z E0322 11:22:07.833726 1 asynccache.go:62] failed a caching attempt: Get "https://keycloak-keycloak.apps.xxxx/realms/master/.well-known/openid-configuration": tls: failed to verify certificate: x509: certificate signed by unknown authority
2024-03-22T11:22:10.831644728Z I0322 11:22:10.831598 1 metrics.go:128] serverconfig.Metrics: Update ConsolePlugin metrics...
2024-03-22T11:22:10.848238183Z I0322 11:22:10.848187 1 metrics.go:138] serverconfig.Metrics: Update ConsolePlugin metrics: &map[monitoring:map[enabled:1]] (took 16.490288ms)
2024-03-22T11:22:12.829744769Z I0322 11:22:12.829697 1 metrics.go:80] usage.Metrics: Count console users...
2024-03-22T11:22:13.236378460Z I0322 11:22:13.236318 1 metrics.go:156] usage.Metrics: Update console users metrics: 0 kubeadmin, 0 cluster-admins, 0 developers, 0 unknown/errors (took 406.580502ms)
The cause is that the HCCO is not copying the issuerCertificateAuthority configmap into the openshift-config namespace of the HC.
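Until that is fixed, the obvious manual workaround is presumably to create the same configmap in the hosted cluster's openshift-config namespace yourself (names taken from the reproduction above):
$ oc create configmap keycloak-oidc-ca -n openshift-config \
    --from-file=ca-bundle.crt=router-ca/ca.crt \
    --kubeconfig $HOSTED_KUBECONFIG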
Description of problem:
HCP does not honor the oauthMetadata field of hc.spec.configuration.authentication, making console crash and oc login fail.
Version-Release number of selected component (if applicable):
HyperShift management cluster: 4.16.0-0.nightly-2024-01-29-233218 HyperShift hosted cluster: 4.16.0-0.nightly-2024-01-29-233218
How reproducible:
Always
Steps to Reproduce:
1. Install HCP env. Export KUBECONFIG: $ export KUBECONFIG=/path/to/hosted-cluster/kubeconfig 2. Create keycloak applications. Then get the route: $ KEYCLOAK_HOST=https://$(oc get -n keycloak route keycloak --template='{{ .spec.host }}') $ echo $KEYCLOAK_HOST https://keycloak-keycloak.apps.hypershift-ci-18556.xxx $ curl -sSk "$KEYCLOAK_HOST/realms/master/.well-known/openid-configuration" > oauthMetadata $ cat oauthMetadata {"issuer":"https://keycloak-keycloak.apps.hypershift-ci-18556.xxx/realms/master" $ oc create configmap oauth-meta --from-file ./oauthMetadata -n clusters --kubeconfig /path/to/management-cluster/kubeconfig ... 3. Set hc.spec.configuration.authentication: $ CLIENT_ID=openshift-test-aud $ oc patch hc hypershift-ci-18556 -n clusters --kubeconfig /path/to/management-cluster/kubeconfig --type=merge -p=" spec: configuration: authentication: oauthMetadata: name: oauth-meta oidcProviders: - claimMappings: ... issuer: audiences: - $CLIENT_ID issuerCertificateAuthority: name: keycloak-oidc-ca issuerURL: $KEYCLOAK_HOST/realms/master name: keycloak-oidc-test type: OIDC " Check KAS indeed already picks up the setting: $ oc logs -c kube-apiserver kube-apiserver-5c976d59f5-zbrwh -n clusters-hypershift-ci-18556 --kubeconfig /path/to/management-cluster/kubeconfig | grep "oidc-" ... I0130 08:07:24.266247 1 flags.go:64] FLAG: --oidc-ca-file="/etc/kubernetes/certs/oidc-ca/ca.crt" I0130 08:07:24.266251 1 flags.go:64] FLAG: --oidc-client-id="openshift-test-aud" ... I0130 08:07:24.266261 1 flags.go:64] FLAG: --oidc-issuer-url="https://keycloak-keycloak.apps.hypershift-ci-18556.xxx/realms/master" ... Wait about 15 mins. 4. Check COs and check oc login. Both show the same error: $ oc get co | grep -v 'True.*False.*False' NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE console 4.16.0-0.nightly-2024-01-29-233218 True True False 4h57m SyncLoopRefreshProgressing: Working toward version 4.16.0-0.nightly-2024-01-29-233218, 1 replicas available $ oc get po -n openshift-console NAME READY STATUS RESTARTS AGE console-547cf6bdbb-l8z9q 1/1 Running 0 4h55m console-54f88749d7-cv7ht 0/1 CrashLoopBackOff 9 (3m18s ago) 14m console-54f88749d7-t7x96 0/1 CrashLoopBackOff 9 (3m32s ago) 14m $ oc logs console-547cf6bdbb-l8z9q -n openshift-console I0130 03:23:36.788951 1 metrics.go:156] usage.Metrics: Update console users metrics: 0 kubeadmin, 0 cluster-admins, 0 developers, 0 unknown/errors (took 406.059196ms) E0130 06:48:32.745179 1 asynccache.go:43] failed a caching attempt: request to OAuth issuer endpoint https://:0/oauth/token failed: Head "https://:0": dial tcp :0: connect: connection refused E0130 06:53:32.757881 1 asynccache.go:43] failed a caching attempt: request to OAuth issuer endpoint https://:0/oauth/token failed: Head "https://:0": dial tcp :0: connect: connection refused ... $ oc login --exec-plugin=oc-oidc --client-id=openshift-test-aud --extra-scopes=email,profile --callback-port=8080 error: oidc authenticator error: oidc discovery error: Get "https://:0/.well-known/openid-configuration": dial tcp :0: connect: connection refused error: oidc authenticator error: oidc discovery error: Get "https://:0/.well-known/openid-configuration": dial tcp :0: connect: connection refused Unable to connect to the server: getting credentials: exec: executable oc failed with exit code 1 5. 
Check root cause, the configured oauthMetadata is not picked up well: $ curl -k https://a6e149f24f8xxxxxx.elb.ap-east-1.amazonaws.com:6443/.well-known/oauth-authorization-server { "issuer": "https://:0", "authorization_endpoint": "https://:0/oauth/authorize", "token_endpoint": "https://:0/oauth/token", ... }
Actual results:
As shown in steps 4 and 5 above, the configured oauthMetadata is not picked up, causing the console and oc login to hit the error.
Expected results:
The configured oauthMetadata is picked up well. No error.
Additional info:
For oc, if I manually use `oc config set-credentials oidc --exec-api-version=client.authentication.k8s.io/v1 --exec-command=oc --exec-arg=get-token --exec-arg="--issuer-url=$KEYCLOAK_HOST/realms/master" ...` instead of using `oc login --exec-plugin=oc-oidc ...`, oc authentication works well. This means my configuration is correct. $ oc whoami Please visit the following URL in your browser: http://localhost:8080 oidc-user-test:xxia@redhat.com
Description of problem:
In https://issues.redhat.com/browse/OCPBUGS-28625?focusedId=24056681&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-24056681 , Seth Jennings states "It is not required to set the oauthMetadata to enable external OIDC".
Today, having a chance to try without setting oauthMetadata, I hit an oc login failure with the error:
$ oc login --exec-plugin=oc-oidc --client-id=$CLIENT_ID --client-secret=$CLIENT_SECRET_VALUE --extra-scopes=email --callback-port=8080 error: oidc authenticator error: oidc discovery error: Get "https://:0/.well-known/openid-configuration": dial tcp :0: connect: connection refused error: oidc authenticator error: oidc discovery error: Get "https://:0/.well-known/openid-configuration": dial tcp :0: connect: connection refused Unable to connect to the server: getting credentials: exec: executable oc failed with exit code 1
Console login can succeed, though.
Note: OCM QE also encounters this when using the ocm CLI to test ROSA HCP external OIDC. Whether the fix belongs in oc, HCP, or elsewhere (as a tester I'm not sure, TBH), it is worth fixing; otherwise oc login is affected.
Version-Release number of selected component (if applicable):
[xxia@2024-03-01 21:03:30 CST my]$ oc version --client Client Version: 4.16.0-0.ci-2024-03-01-033249 Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3 [xxia@2024-03-01 21:03:50 CST my]$ oc get clusterversion NAME VERSION AVAILABLE PROGRESSING SINCE STATUS version 4.16.0-0.ci-2024-02-29-213249 True False 8h Cluster version is 4.16.0-0.ci-2024-02-29-213249
How reproducible:
Always
Steps to Reproduce:
1. Launch fresh HCP cluster. 2. Login to https://entra.microsoft.com. Register application and set properly. 3. Prepare variables. HC_NAME=hypershift-ci-267920 MGMT_KUBECONFIG=/home/xxia/my/env/xxia-hs416-2-267920-4.16/kubeconfig HOSTED_KUBECONFIG=/home/xxia/my/env/xxia-hs416-2-267920-4.16/hypershift-ci-267920.kubeconfig AUDIENCE=7686xxxxxx ISSUER_URL=https://login.microsoftonline.com/64dcxxxxxxxx/v2.0 CLIENT_ID=7686xxxxxx CLIENT_SECRET_VALUE="xxxxxxxx" CLIENT_SECRET_NAME=console-secret 4. Configure HC without oauthMetadata. [xxia@2024-03-01 20:29:21 CST my]$ oc create secret generic console-secret -n clusters --from-literal=clientSecret=$CLIENT_SECRET_VALUE --kubeconfig $MGMT_KUBECONFIG [xxia@2024-03-01 20:34:05 CST my]$ oc patch hc $HC_NAME -n clusters --kubeconfig $MGMT_KUBECONFIG --type=merge -p=" spec: configuration: authentication: oauthMetadata: name: '' oidcProviders: - claimMappings: groups: claim: groups prefix: 'oidc-groups-test:' username: claim: email prefixPolicy: Prefix prefix: prefixString: 'oidc-user-test:' issuer: audiences: - $AUDIENCE issuerURL: $ISSUER_URL name: microsoft-entra-id oidcClients: - clientID: $CLIENT_ID clientSecret: name: $CLIENT_SECRET_NAME componentName: console componentNamespace: openshift-console type: OIDC " Wait pods to renew: [xxia@2024-03-01 20:52:41 CST my]$ oc get po -n clusters-$HC_NAME --kubeconfig $MGMT_KUBECONFIG --sort-by metadata.creationTimestamp ... certified-operators-catalog-7ff9cffc8f-z5dlg 1/1 Running 0 5h44m kube-apiserver-6bd9f7ccbd-kqzm7 5/5 Running 0 17m kube-apiserver-6bd9f7ccbd-p2fw7 5/5 Running 0 15m kube-apiserver-6bd9f7ccbd-fmsgl 5/5 Running 0 13m openshift-apiserver-7ffc9fd764-qgd4z 3/3 Running 0 11m openshift-apiserver-7ffc9fd764-vh6x9 3/3 Running 0 10m openshift-apiserver-7ffc9fd764-b7znk 3/3 Running 0 10m konnectivity-agent-577944765c-qxq75 1/1 Running 0 9m42s hosted-cluster-config-operator-695c5854c-dlzwh 1/1 Running 0 9m42s cluster-version-operator-7c99cf68cd-22k84 1/1 Running 0 9m42s konnectivity-agent-577944765c-kqfpq 1/1 Running 0 9m40s konnectivity-agent-577944765c-7t5ds 1/1 Running 0 9m37s 5. Check console login and oc login. $ export KUBECONFIG=$HOSTED_KUBECONFIG $ curl -ksS $(oc whoami --show-server)/.well-known/oauth-authorization-server { "issuer": "https://:0", "authorization_endpoint": "https://:0/oauth/authorize", "token_endpoint": "https://:0/oauth/token", ... } Check console login, it succeeds, console upper right shows correctly user name oidc-user-test:xxia@redhat.com. Check oc login: $ rm -rf ~/.kube/cache/oc/ $ oc login --exec-plugin=oc-oidc --client-id=$CLIENT_ID --client-secret=$CLIENT_SECRET_VALUE --extra-scopes=email --callback-port=8080 error: oidc authenticator error: oidc discovery error: Get "https://:0/.well-known/openid-configuration": dial tcp :0: connect: connection refused error: oidc authenticator error: oidc discovery error: Get "https://:0/.well-known/openid-configuration": dial tcp :0: connect: connection refused Unable to connect to the server: getting credentials: exec: executable oc failed with exit code 1
Actual results:
Console login succeeds. oc login fails.
Expected results:
oc login should also succeed.
Additional info:
Description of problem:
Updating oidcProviders does not take effect. See details below.
Version-Release number of selected component (if applicable):
4.16.0-0.nightly-2024-02-26-155043
How reproducible:
Always
Steps to Reproduce:
1. Install fresh HCP env and configure external OIDC as steps 1 ~ 4 of https://issues.redhat.com/browse/OCPBUGS-29154 (to avoid repeated typing those steps, only referencing as is here). 2. Pods renewed: $ oc get po -n clusters-$HC_NAME --kubeconfig $MGMT_KUBECONFIG --sort-by metadata.creationTimestamp ... network-node-identity-68b7b8dd48-4pvvq 3/3 Running 0 170m oauth-openshift-57cbd9c797-6hgzx 2/2 Running 0 170m kube-controller-manager-66f68c8bd8-tknvc 1/1 Running 0 164m kube-controller-manager-66f68c8bd8-wb2x9 1/1 Running 0 164m kube-controller-manager-66f68c8bd8-kwxxj 1/1 Running 0 163m kube-apiserver-596dcb97f-n5nqn 5/5 Running 0 29m kube-apiserver-596dcb97f-7cn9f 5/5 Running 0 27m kube-apiserver-596dcb97f-2rskz 5/5 Running 0 25m openshift-apiserver-c9455455c-t7prz 3/3 Running 0 22m openshift-apiserver-c9455455c-jrwdf 3/3 Running 0 22m openshift-apiserver-c9455455c-npvn5 3/3 Running 0 21m konnectivity-agent-7bfc7cb9db-bgrsv 1/1 Running 0 20m cluster-version-operator-675745c9d6-5mv8m 1/1 Running 0 20m hosted-cluster-config-operator-559644d45b-4vpkq 1/1 Running 0 20m konnectivity-agent-7bfc7cb9db-hjqlf 1/1 Running 0 20m konnectivity-agent-7bfc7cb9db-gl9b7 1/1 Running 0 20m 3. oc login can succeed: $ oc login --exec-plugin=oc-oidc --client-id=$CLIENT_ID --client-secret=$CLIENT_SECRET_VALUE --extra-scopes=email --callback-port=8080 Please visit the following URL in your browser: http://localhost:8080 Logged into "https://a4af9764....elb.ap-southeast-1.amazonaws.com:6443" as "oidc-user-test:xxia@redhat.com" from an external oidc issuer.You don't have any projects. Contact your system administrator to request a project. 4. Update HC by changing claim: email to claim: sub: $ oc edit hc $HC_NAME -n clusters --kubeconfig $MGMT_KUBECONFIG ... username: claim: sub ... Update is picked up: $ oc get authentication.config cluster -o yaml ... spec: oauthMetadata: name: tested-oauth-meta oidcProviders: - claimMappings: groups: claim: groups prefix: 'oidc-groups-test:' username: claim: sub prefix: prefixString: 'oidc-user-test:' prefixPolicy: Prefix issuer: audiences: - 76863fb1-xxxxxx issuerCertificateAuthority: name: "" issuerURL: https://login.microsoftonline.com/xxxxxxxx/v2.0 name: microsoft-entra-id oidcClients: - clientID: 76863fb1-xxxxxx clientSecret: name: console-secret componentName: console componentNamespace: openshift-console serviceAccountIssuer: https://xxxxxx.s3.us-east-2.amazonaws.com/hypershift-ci-267402 type: OIDC status: oidcClients: - componentName: console componentNamespace: openshift-console conditions: - lastTransitionTime: "2024-02-28T10:51:17Z" message: "" reason: OIDCConfigAvailable status: "False" type: Degraded - lastTransitionTime: "2024-02-28T10:51:17Z" message: "" reason: OIDCConfigAvailable status: "False" type: Progressing - lastTransitionTime: "2024-02-28T10:51:17Z" message: "" reason: OIDCConfigAvailable status: "True" type: Available currentOIDCClients: - clientID: 76863fb1-xxxxxx issuerURL: https://login.microsoftonline.com/xxxxxxxx/v2.0 oidcProviderName: microsoft-entra-id 4. Check pods again: $ oc get po -n clusters-$HC_NAME --kubeconfig $MGMT_KUBECONFIG --sort-by metadata.creationTimestamp ... 
kube-apiserver-596dcb97f-n5nqn 5/5 Running 0 108m kube-apiserver-596dcb97f-7cn9f 5/5 Running 0 106m kube-apiserver-596dcb97f-2rskz 5/5 Running 0 104m openshift-apiserver-c9455455c-t7prz 3/3 Running 0 102m openshift-apiserver-c9455455c-jrwdf 3/3 Running 0 101m openshift-apiserver-c9455455c-npvn5 3/3 Running 0 100m konnectivity-agent-7bfc7cb9db-bgrsv 1/1 Running 0 100m cluster-version-operator-675745c9d6-5mv8m 1/1 Running 0 100m hosted-cluster-config-operator-559644d45b-4vpkq 1/1 Running 0 100m konnectivity-agent-7bfc7cb9db-hjqlf 1/1 Running 0 99m konnectivity-agent-7bfc7cb9db-gl9b7 1/1 Running 0 99m No new pods renewed. 5. Check login again, it does not use "sub", still use "email": $ rm -rf ~/.kube/cache/ $ oc login --exec-plugin=oc-oidc --client-id=$CLIENT_ID --client-secret=$CLIENT_SECRET_VALUE --extra-scopes=email --callback-port=8080 Please visit the following URL in your browser: http://localhost:8080 Logged into "https://xxxxxxx.elb.ap-southeast-1.amazonaws.com:6443" as "oidc-user-test:xxia@redhat.com" from an external oidc issuer. You don't have any projects. Contact your system administrator to request a project. $ cat ~/.kube/cache/oc/* | jq -r '.id_token' | jq -R 'split(".") | .[] | @base64d | fromjson' ... { ... "email": "xxia@redhat.com", "groups": [ ... ], ... "sub": "EEFGfgPXr0YFw_ZbMphFz6UvCwkdFS20MUjDDLdTZ_M", ...
Actual results:
Steps 4 ~ 5: after editing HC field value from "claim: email" to "claim: sub", even if `oc get authentication cluster -o yaml` shows the edited change is propagated: 1> The pods like kube-apiserver are not renewed. 2> After clean-up ~/.kube/cache, `oc login ...` relogin still prints 'Logged into "https://xxxxxxx.elb.ap-southeast-1.amazonaws.com:6443" as "oidc-user-test:xxia@redhat.com" from an external oidc issuer', i.e. still uses old claim "email" as user name, instead of using the new claim "sub".
Expected results:
Steps 4 ~ 5: Pods like the kube-apiserver should be renewed after an HC edit that changes the user claim. The login should show that the new claim is used as the user name.
Additional info:
This card tracks removing the feature gate and making it GA in OCP.
Tracks work for https://github.com/kubernetes-sigs/network-policy-api/pull/209 and then consuming that in OVNKubernetes.
This card tracks getting ANP and BANP into must-gather.
This card adds support for implementing ANP.Egress.Networks Peer in OVNKubernetes:
oc, the OpenShift CLI, needs to get as close to feature parity as we can without the built-in oauth server and its associated user and group management. This will enable scripts, documentation, blog posts, and knowledge base articles to function across all form factors, and across the same form factor with different configurations.
CLI users and scripts should be usable in a consistent way regardless of the token issuer configuration.
A list of specific needs or objectives that a feature must deliver in order to be considered complete. Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc. Initial completion during Refinement status.
Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.
Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.
High-level list of items that are out of scope. Initial completion during Refinement status.
Provide any additional context that is needed to frame the feature. Initial completion during Refinement status.
Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.
Provide information that needs to be considered and planned so that documentation will meet customer needs. If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.
Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.
Epic Goal*
What is our purpose in implementing this? What new capability will be available to customers?
oc login needs to work without the embedded oauth server
Why is this important? (mandatory)
We are removing the embedded oauth-server and we utilize a special oauthclient in order to make our login flows functional
This allows documentation, scripts, etc to be functional and consistent with the last 10 years of our product.
This may require vendoring entire CLI plugins. It may require new kubeconfig shapes.
Scenarios (mandatory)
Provide details for user scenarios including actions to be performed, platform specifications, and user personas.
Dependencies (internal and external) (mandatory)
What items must be delivered by other teams/groups to enable delivery of this epic.
Contributing Teams(and contacts) (mandatory)
Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.
Acceptance Criteria (optional)
Provide some (testable) examples of how we will know if we have achieved the epic goal.
Drawbacks or Risk (optional)
Reasons we should consider NOT doing this such as: limited audience for the feature, feature will be superseded by other work that is planned, resulting feature will introduce substantial administrative complexity or user confusion, etc.
Done - Checklist (mandatory)
The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.
Description of problem:
Separate oidc certificate authority and cluster certificate authority.
Version-Release number of selected component (if applicable):
oc 4.16 / 4.15
How reproducible:
Always
Steps to Reproduce:
1. Launch HCP external OIDC cluster. The external OIDC uses keycloak. The keycloak server is created outside of the cluster and its serving certificate is not trusted, its CA is separate than cluster's any CA. 2. Test oc login $ curl -sSI --cacert $ISSUER_CA_FILE $ISSUER_URL/.well-known/openid-configuration | head -n 1 HTTP/1.1 200 OK $ oc login --exec-plugin=oc-oidc --issuer-url=$ISSUER_URL --client-id=$CLI_CLIENT_ID --extra-scopes=email,profile --callback-port=8080 --certificate-authority $ISSUER_CA_FILE The server uses a certificate signed by an unknown authority. You can bypass the certificate check, but any data you send to the server could be intercepted by others. Use insecure connections? (y/n): n error: The server uses a certificate signed by unknown authority. You may need to use the --certificate-authority flag to provide the path to a certificate file for the certificate authority, or --insecure-skip-tls-verify to bypass the certificate check and use insecure connections.
Actual results:
2. oc login with --certificate-authority pointing to $ISSUER_CA_FILE fails. The reason is that oc login not only communicates with the oidc server, but also communicates with the test cluster's kube-apiserver, which is also self-signed. More handling is needed for the --certificate-authority flag, i.e. the test cluster's kube-apiserver CA and $ISSUER_CA_FILE need to be combined: $ grep certificate-authority-data $KUBECONFIG | grep -Eo "[^ ]+$" | base64 -d > hostedcluster_kubeconfig_ca.crt $ cat $ISSUER_CA_FILE hostedcluster_kubeconfig_ca.crt > combined-ca.crt $ oc login --exec-plugin=oc-oidc --issuer-url=$ISSUER_URL --client-id=$CLI_CLIENT_ID --extra-scopes=email,profile --callback-port=8080 --certificate-authority combined-ca.crt Please visit the following URL in your browser: http://localhost:8080
Expected results:
For step 2, per https://redhat-internal.slack.com/archives/C060D1W96LB/p1711624413149659?thread_ts=1710836566.326359&cid=C060D1W96LB discussion, separate trust like:
$ oc login api-server --oidc-certificate-authority=$ISSUER_CA_FILE [--certificate-authority=hostedcluster_kubeconfig_ca.crt]
The [--certificate-authority=hostedcluster_kubeconfig_ca.crt] should be optional if it is included in $KUBECONFIG's certificate-authority-data already.
Description of problem:
Introduce the --issuer-url flag in oc login.
Version-Release number of selected component (if applicable):
[xxia@2024-03-01 21:03:30 CST my]$ oc version --client Client Version: 4.16.0-0.ci-2024-03-01-033249 Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3 [xxia@2024-03-01 21:03:50 CST my]$ oc get clusterversion NAME VERSION AVAILABLE PROGRESSING SINCE STATUS version 4.16.0-0.ci-2024-02-29-213249 True False 8h Cluster version is 4.16.0-0.ci-2024-02-29-213249
How reproducible:
Always
Steps to Reproduce:
1. Launch fresh HCP cluster. 2. Login to https://entra.microsoft.com. Register application and set properly. 3. Prepare variables. HC_NAME=hypershift-ci-267920 MGMT_KUBECONFIG=/home/xxia/my/env/xxia-hs416-2-267920-4.16/kubeconfig HOSTED_KUBECONFIG=/home/xxia/my/env/xxia-hs416-2-267920-4.16/hypershift-ci-267920.kubeconfig AUDIENCE=7686xxxxxx ISSUER_URL=https://login.microsoftonline.com/64dcxxxxxxxx/v2.0 CLIENT_ID=7686xxxxxx CLIENT_SECRET_VALUE="xxxxxxxx" CLIENT_SECRET_NAME=console-secret 4. Configure HC without oauthMetadata. [xxia@2024-03-01 20:29:21 CST my]$ oc create secret generic console-secret -n clusters --from-literal=clientSecret=$CLIENT_SECRET_VALUE --kubeconfig $MGMT_KUBECONFIG [xxia@2024-03-01 20:34:05 CST my]$ oc patch hc $HC_NAME -n clusters --kubeconfig $MGMT_KUBECONFIG --type=merge -p=" spec: configuration: authentication: oauthMetadata: name: '' oidcProviders: - claimMappings: groups: claim: groups prefix: 'oidc-groups-test:' username: claim: email prefixPolicy: Prefix prefix: prefixString: 'oidc-user-test:' issuer: audiences: - $AUDIENCE issuerURL: $ISSUER_URL name: microsoft-entra-id oidcClients: - clientID: $CLIENT_ID clientSecret: name: $CLIENT_SECRET_NAME componentName: console componentNamespace: openshift-console type: OIDC " Wait pods to renew: [xxia@2024-03-01 20:52:41 CST my]$ oc get po -n clusters-$HC_NAME --kubeconfig $MGMT_KUBECONFIG --sort-by metadata.creationTimestamp ... certified-operators-catalog-7ff9cffc8f-z5dlg 1/1 Running 0 5h44m kube-apiserver-6bd9f7ccbd-kqzm7 5/5 Running 0 17m kube-apiserver-6bd9f7ccbd-p2fw7 5/5 Running 0 15m kube-apiserver-6bd9f7ccbd-fmsgl 5/5 Running 0 13m openshift-apiserver-7ffc9fd764-qgd4z 3/3 Running 0 11m openshift-apiserver-7ffc9fd764-vh6x9 3/3 Running 0 10m openshift-apiserver-7ffc9fd764-b7znk 3/3 Running 0 10m konnectivity-agent-577944765c-qxq75 1/1 Running 0 9m42s hosted-cluster-config-operator-695c5854c-dlzwh 1/1 Running 0 9m42s cluster-version-operator-7c99cf68cd-22k84 1/1 Running 0 9m42s konnectivity-agent-577944765c-kqfpq 1/1 Running 0 9m40s konnectivity-agent-577944765c-7t5ds 1/1 Running 0 9m37s 5. Check console login and oc login. $ export KUBECONFIG=$HOSTED_KUBECONFIG $ curl -ksS $(oc whoami --show-server)/.well-known/oauth-authorization-server { "issuer": "https://:0", "authorization_endpoint": "https://:0/oauth/authorize", "token_endpoint": "https://:0/oauth/token", ... } Check console login, it succeeds, console upper right shows correctly user name oidc-user-test:xxia@redhat.com. Check oc login: $ rm -rf ~/.kube/cache/oc/ $ oc login --exec-plugin=oc-oidc --client-id=$CLIENT_ID --client-secret=$CLIENT_SECRET_VALUE --extra-scopes=email --callback-port=8080 error: oidc authenticator error: oidc discovery error: Get "https://:0/.well-known/openid-configuration": dial tcp :0: connect: connection refused error: oidc authenticator error: oidc discovery error: Get "https://:0/.well-known/openid-configuration": dial tcp :0: connect: connection refused Unable to connect to the server: getting credentials: exec: executable oc failed with exit code 1
Actual results:
Console login succeeds. oc login fails.
Expected results:
oc login should also succeed.
Additional info:
When the internal oauth-server and oauth-apiserver are removed and replaced with an external OIDC issuer (like azure AD), the console must work for human users of the external OIDC issuer.
An end user can use the openshift console without a notable difference in experience. This must eventually work on both hypershift and standalone, but hypershift is the first priority if it impacts delivery
OCP/Telco Definition of Done
Epic Template descriptions and documentation.
<--- Cut-n-Paste the entire contents of this description into your new Epic --->
The console needs to be able to authenticate against an external OIDC IdP. For that, the console-operator needs to configure it accordingly.
AC:
OCP/Telco Definition of Done
Epic Template descriptions and documentation.
<--- Cut-n-Paste the entire contents of this description into your new Epic --->
The web console should behave like a generic OIDC client when requesting tokens from an OIDC provider.
User API may not always be available. K8S now has a stable API to query for user information - https://github.com/kubernetes/enhancements/tree/master/keps/sig-auth/3325-self-subject-attributes-review-api. See if it can be used and replace all `user/~` calls with it.
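For a quick sanity check of the stable API from the CLI (assuming a recent oc that vendors the kubectl auth whoami subcommand):
$ oc auth whoami
Under the hood this POSTs a SelfSubjectReview to authentication.k8s.io/v1 and prints the authenticated username and groups, broadly the same information the console currently derives from `user/~`.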
Enable a "Break Glass Mechanism" in ROSA (Red Hat OpenShift Service on AWS) and other OpenShift cloud-services in the future (e.g., ARO and OSD) to provide customers with an alternative method of cluster access via short-lived certificate-based kubeconfig when the primary IDP (Identity Provider) is unavailable.
Customers need to be able to request the revocation of the signer cert for their break-glass credentials and know when it's been done. See design here: https://docs.google.com/document/d/19l48HB7-4_8p96b2kvlpYFpbZXCh5eXN_tEmM8QhaIc/edit#heading=h.1a08yt9sno82 and diagram here: https://miro.com/app/board/uXjVNUPFcQA=/
As a (user persona), I want to be able to:
so that I can achieve
Description of criteria:
Detail about what is specifically not being delivered in the story
This requires/does not require a design proposal.
This requires/does not require a feature gate.
As part of the deprecation progression of the openshift-sdn CNI plug-in, remove it as an install-time option for new 4.15+ release clusters.
The openshift-sdn CNI plug-in is sunsetting according to the following progression:
All development effort is directed to the default primary CNI plug-in, ovn-kubernetes, which has feature parity with the older openshift-sdn CNI plug-in that has been feature frozen for the entire 4.x timeframe. In order to best serve our customers now and in the future, we are reducing our support footprint to the dominant plug-in, only.
The openshift-sdn CNI plug-in is sunsetting according to the following progression:
All development effort is directed to the default primary CNI plug-in, ovn-kubernetes, which has feature parity with the older openshift-sdn CNI plug-in that has been feature frozen for the entire 4.x timeframe. In order to best serve our customers now and in the future, we are reducing our support footprint to the dominant plug-in, only.
The openshift-sdn CNI plug-in is sunsetting according to the following progression:
All development effort is directed to the default primary CNI plug-in, ovn-kubernetes, which has feature parity with the older openshift-sdn CNI plug-in that has been feature frozen for the entire 4.x timeframe. In order to best serve our customers now and in the future, we are reducing our support footprint to the dominant plug-in, only.
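Concretely, the install-time option being removed is the network type selection in install-config.yaml; for new clusters only the default remains valid, roughly:
networking:
  networkType: OVNKubernetes   # OpenShiftSDN is no longer accepted for new 4.15+ clusters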
As a (user persona), I want to be able to:
so that I can achieve
Description of criteria:
Detail about what is specifically not being delivered in the story
This requires/does not require a design proposal.
This requires/does not require a feature gate.
Users of the OpenShift Console leverage a streamlined, visual experience when discovering and installing OLM-managed operators in clusters that run on cloud providers with support for short-lived token authentication enabled. Users intuitively become aware when this is the case and are put on the happy path to configure OLM-managed operators with the necessary information to support GCP Workload Identity Federation (WIF).
Customers do not need to re-learn how to enable GCP WIF authentication support for each and every OLM-managed operator that supports it. The experience is standardized and repeatable, so customers spend less time on initial configuration and more time implementing business value. The process is so easy that OpenShift is perceived as an enabler for an increased security posture.
The OpenShift Console today provides little to no support for configuring OLM-managed operators for short-lived token authentication. Users are generally unaware if their cluster runs on a cloud provider and is set up to use short-lived tokens for its core functionality and users are not aware which operators have support for that by implementing the respective flows defined in OCPSTRAT-922.
Customers may or may not be aware of short-lived token authentication support. They need proper context and pointers to follow-up documentation to explain the general concept and the specific configuration flow the Console supports. It needs to become clear that the Console cannot 100% automate the overall process and some steps need to be run outside of the cluster/Console using cloud-provider-specific tooling.
https://issues.redhat.com/browse/PORTENABLE-525 adds new annotations, including features.operators.openshift.io/token-auth-gcp, but console does not yet support this new annotation. We need to determine what changes are necessary to support this new annotation and implement them.
AC:
https://issues.redhat.com/browse/PORTENABLE-525 adds new annotations, including features.operators.openshift.io/tls-profiles, but console does not yet support this new annotation. We need to determine what changes are necessary to support this new annotation and implement them.
AC:
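For context, these feature annotations are carried on the operator's ClusterServiceVersion, alongside the existing features.operators.openshift.io annotations the console already reads; an illustrative (not authoritative) snippet:
apiVersion: operators.coreos.com/v1alpha1
kind: ClusterServiceVersion
metadata:
  annotations:
    features.operators.openshift.io/token-auth-gcp: "true"
    features.operators.openshift.io/tls-profiles: "false"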
Address technical debt around self-managed HCP deployments, including but not limited to
The CLI cannot create dual-stack clusters with the default values. We need to add the proper flags to enable the HostedCluster to be a dual-stack one using the default values
Users are encountering an issue when attempting to "Create hostedcluster on BM+disconnected+ipv6 through MCE." This issue is related to the default setting of `--enable-uwm-telemetry-remote-write` being true, which might mean that in the default disconnected case whatever endpoint is configured in the UWM configmap (e.g. minBackoff: 1s, url: https://infogw.api.openshift.com/metrics/v1/receive) is not reachable. So we should look into reporting the issue and remediating, rather than fataling on it, for disconnected scenarios.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
In MCE 2.4, we currently document to disable `--enable-uwm-telemetry-remote-write` if the hosted control plane feature is used in a disconnected environment. https://github.com/stolostron/rhacm-docs/blob/lahinson-acm-7739-disconnected-bare-[…]s/hosted_control_planes/monitor_user_workload_disconnected.adoc Once this Jira is fixed, that documentation needs to be removed; users will not need to disable `--enable-uwm-telemetry-remote-write`. The HO is expected to fail gracefully on `--enable-uwm-telemetry-remote-write` and continue to be operational.
This can be based on the existing CAPI agent provider workflow, which already has an env var flag for disconnected
OCP/Telco Definition of Done
Epic Template descriptions and documentation.
<--- Cut-n-Paste the entire contents of this description into your new Epic --->
While monitoring the payload job failures, open a parallel openshift/origin bump.
Note: There is a high chance of job failures in openshift/origin bump until the openshift/kubernetes PR merges as we only update the test and not the actual kube.
Benefit of opening this PR before ocp/k8s merge is to identify and fix the issues beforehand.
Follow the rebase doc[1] and update the spreadsheet[2] that tracks the required commits to be cherry-picked. Rebase the o/k repo with the "merge=ours" strategy as mentioned in the rebase doc.
Save the last commit id in the spreadsheet for future references.
Update the rebase doc if required.
[1] https://github.com/openshift/kubernetes/blob/master/REBASE.openshift.md
[2] https://docs.google.com/spreadsheets/d/10KYptJkDB1z8_RYCQVBYDjdTlRfyoXILMa0Fg8tnNlY/edit#gid=1957024452
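The mechanical part of that step is roughly the following (remote and tag names are illustrative; the rebase doc [1] is authoritative):
$ git remote add upstream https://github.com/kubernetes/kubernetes.git
$ git fetch upstream --tags
$ git checkout -b rebase-1.29 openshift/master
$ git merge -s ours v1.29.3    # record the upstream tag while keeping our tree
$ # ...then cherry-pick the carried openshift commits tracked in the spreadsheet [2]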
Prev. Ref:
https://github.com/openshift/kubernetes/pull/1646
Bump the following libraries in an order with the latest kube and the dependent libraries
Prev Ref:
https://github.com/openshift/api/pull/1534
https://github.com/openshift/client-go/pull/250
https://github.com/openshift/library-go/pull/1557
https://github.com/openshift/apiserver-library-go/pull/118
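In each of those repositories the bump itself is the usual go.mod update, roughly (versions are illustrative):
$ go get k8s.io/api@v0.29.2 k8s.io/apimachinery@v0.29.2 k8s.io/client-go@v0.29.2
$ go get github.com/openshift/api@master github.com/openshift/client-go@master
$ go mod tidy && go mod vendor
$ make verify    # or the repository's equivalent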
OCP/Telco Definition of Done
Epic Template descriptions and documentation.
<--- Cut-n-Paste the entire contents of this description into your new Epic --->
To align with the 4.16 release, dependencies need to be updated to 1.28. This should be done by rebasing/updating as appropriate for the repository
To align with the 4.16 release, dependencies need to be updated to 1.28. This should be done by rebasing/updating as appropriate for the repository
To align with the 4.16 release, dependencies need to be updated to 1.28. This should be done by rebasing/updating as appropriate for the repository
To align with the 4.16 release, dependencies need to be updated to 1.28. This should be done by rebasing/updating as appropriate for the repository
To align with the 4.16 release, dependencies need to be updated to 1.28. This should be done by rebasing/updating as appropriate for the repository
To align with the 4.16 release, dependencies need to be updated to 1.28. This should be done by rebasing/updating as appropriate for the repository
To align with the 4.16 release, dependencies need to be updated to 1.28. This should be done by rebasing/updating as appropriate for the repository
This epic will own all of the usual update, rebase and release chores which must be done during the OpenShift 4.16 timeframe for Custom Metrics Autoscaler, Vertical Pod Autoscaler and Cluster Resource Override Operator
Shepherd and merge operator automation PR https://github.com/openshift/vertical-pod-autoscaler-operator/pull/151
Update operator like was done in https://github.com/openshift/vertical-pod-autoscaler-operator/pull/146
Shepherd and merge operand automation PR https://github.com/openshift/kubernetes-autoscaler/pull/277
Coordinate with cluster autoscaler team on upstream rebase as in https://github.com/openshift/kubernetes-autoscaler/pull/250
OCP/Telco Definition of Done
Epic Template descriptions and documentation.
<--- Cut-n-Paste the entire contents of this description into your new Epic --->
To align with the 4.16 release, dependencies need to be updated to 1.29. This should be done by rebasing/updating as appropriate for the repository
To align with the 4.16 release, dependencies need to be updated to 1.29. This should be done by rebasing/updating as appropriate for the repository
To align with the 4.16 release, dependencies need to be updated to 1.29. This should be done by rebasing/updating as appropriate for the repository
To align with the 4.16 release, dependencies need to be updated to 1.29. This should be done by rebasing/updating as appropriate for the repository
To align with the 4.16 release, dependencies need to be updated to 1.29. This should be done by rebasing/updating as appropriate for the repository
To align with the 4.16 release, dependencies need to be updated to 1.29. This should be done by rebasing/updating as appropriate for the repository
To align with the 4.16 release, dependencies need to be updated to 1.29. This should be done by rebasing/updating as appropriate for the repository
To align with the 4.16 release, dependencies need to be updated to 1.29. This should be done by rebasing/updating as appropriate for the repository
To align with the 4.16 release, dependencies need to be updated to 1.29. This should be done by rebasing/updating as appropriate for the repository
To align with the 4.16 release, dependencies need to be updated to 1.29. This should be done by rebasing/updating as appropriate for the repository
To align with the 4.16 release, dependencies need to be updated to 1.29. This should be done by rebasing/updating as appropriate for the repository
To align with the 4.16 release, dependencies need to be updated to 1.29. This should be done by rebasing/updating as appropriate for the repository
To align with the 4.16 release, dependencies need to be updated to 1.29. This should be done by rebasing/updating as appropriate for the repository
To align with the 4.16 release, dependencies need to be updated to 1.29. This should be done by rebasing/updating as appropriate for the repository
To align with the 4.16 release, dependencies need to be updated to 1.29. This should be done by rebasing/updating as appropriate for the repository
To align with the 4.16 release, dependencies need to be updated to 1.29. This should be done by rebasing/updating as appropriate for the repository
This should be the last SDN kube rebase, but we need to work with the Windows team to find a way for them to get the latest kube-proxy without depending on this rebase, as SDN is deprecated
This story relates to this PR https://github.com/openshift/machine-config-operator/pull/4275
A new PR has been opened to investigate the issues found in the original PR (this is the link to the new PR): https://github.com/openshift/machine-config-operator/pull/4306
The original PR exceeded the watch request limits when merged. When discovered, the CNTO team needed to revert it. (see https://redhat-internal.slack.com/archives/C01CQA76KMX/p1711711538689689).
To investigate if exceeding the watch request limit was introduced from the API bump and its associated changes, or the kubeconfig changes, an additional PR was opened just for looking at removing the hardcoded values from the kubelet template, and payload tests were run against it: https://github.com/openshift/machine-config-operator/pull/4270. The payload tests passed, and it was concluded that the watch request limit issue was introduced in the portion of the PR that included the API bump and its associated changes.
It was discovered that the CNTO team was using an outdated form of openshift deps, so they were asked to bump. https://redhat-internal.slack.com/archives/CQNBUEVM2/p1712171079685139?thread_ts=1711712855.478249&cid=CQNBUEVM2
https://github.com/openshift/cluster-node-tuning-operator/pull/990
was opened in the past to address the kube bump (this just merged), and https://github.com/openshift/cluster-node-tuning-operator/pull/1022
was opened as well (still open)
CURRENT STATUS: waiting for https://github.com/openshift/cluster-node-tuning-operator/pull/1022 to merge so we can rerun payload tests against the open revert PR.
As an MCO developer, I want to pick up the openshift/kubernetes updates for the 1.29 k8s rebase to track the k8s version as the rest of the OpenShift 1.29 cluster does.
This feature is now re-opened because we want to run z-rollback CI. This feature doesn't block the release of 4.17. This is not going to be exposed as a customer-facing feature and will not be documented within OpenShift documentation. This is strictly going to be covered as a RH Support guided solution with a KCS article providing guidance. A public-facing KCS will basically point to contacting Support for help on z-stream rollback, and y-stream rollback is not supported.
NOTE:
Previously this was closed as "won't do" because we didn't have a plan to support y-stream and z-stream rollbacks in standalone OpenShift.
For single-node OpenShift please check TELCOSTRAT-160. The "won't do" decision came after further discussion with leadership.
The e2e tests are tracked in https://docs.google.com/spreadsheets/d/1mr633YgQItJ0XhbiFkeSRhdLlk6m9vzk1YSKQPHgSvw/edit?gid=0#gid=0 We have identified a few bugs that need to be resolved before the General Availability (GA) release. Ideally, these should be addressed in the final month before GA, when all features are development complete. However, asking component teams to commit to fixing critical rollback bugs during this time could potentially delay the GA date.
------
An elevator pitch (value statement) that describes the Feature in a clear, concise way. Complete during New status.
Red Hat Support assisted z-stream rollback from 4.16+
The observable functionality that the user now has as a result of receiving this feature. Include the anticipated primary user type/persona and which existing features, if any, will be expanded. Complete during New status.
Red Hat Support may, at their discretion, assist customers with z-stream rollback once it's determined to be the best option for restoring a cluster to the desired state whenever a z-stream update compromises cluster functionality.
Engineering will take a “no regressions, no promises” approach, ensuring there are no major regressions between z-streams, but not testing specific combinations or addressing case-specific bugs.
A list of specific needs or objectives that a feature must deliver in order to be considered complete. Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc. Initial completion during Refinement status.
Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed. Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.
Deployment considerations | List applicable specific needs (N/A = not applicable) |
Self-managed | all |
Multi node, Compact (three node) | all |
Connected and Restricted Network | all |
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) | all |
Release payload only | all |
Starting with 4.16, including all future releases | all |
Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.
As an admin who has determined that a z-stream update has compromised cluster functionality I have clear documentation that explains that unassisted rollback is not supported and that I should consult with Red Hat Support on the best path forward.
As a support engineer I have a clear plan for responding to problems which occur during or after a z-stream upgrade, including the process for rolling back specific components, applying workarounds, or rolling the entire cluster back to the previously running z-stream version.
Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.
Should we allow rollbacks whenever an upgrade doesn’t complete? No, not without fully understanding the root cause. If it’s simply a situation where workers are in process of updating but stalled, that should never yield a rollback without credible evidence that rollback will fix that.
Similar to our "foolproof command" to initiate a rollback to the previous z-stream, should we also craft a foolproof command to override select operators to previous z-stream versions? Part of the goal of the foolproof command is to avoid the potential for moving to an unintended version; the same risk may apply at the single-operator level, and though the impact would be smaller, it could still be catastrophic.
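For context, the rollback itself remains Support-guided; a sketch of what initiating it can look like (the dedicated rollback subcommand is an assumption here and may not exist in all oc versions, and the release digest is a placeholder):
$ oc adm upgrade rollback    # assumed "foolproof" subcommand, previous z-stream only
$ # or, under Support guidance, an explicit pinned release image:
$ oc adm upgrade --allow-explicit-upgrade --to-image quay.io/openshift-release-dev/ocp-release@sha256:<previous-z-digest>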
High-level list of items that are out of scope. Initial completion during Refinement status.
Non-HA clusters, Hosted Control Planes – those may be handled via separately scoped features
Provide any additional context that is needed to frame the feature. Initial completion during Refinement status.
Occasionally clusters either upgrade successfully and encounter issues after the upgrade or may run into problems during the upgrade. Many customers assume that a rollback will fix their concerns but without understanding the root cause we cannot assume that’s the case. Therefore, we recommend anyone who has encountered a negative outcome associated with a z-stream upgrade contact support for guidance.
Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.
It’s expected that customers should have adequate testing and rollout procedures to protect against most regressions, i.e. roll out a z-stream update in pre-production environments where it can be adequately tested prior to updating production environments.
Provide information that needs to be considered and planned so that documentation will meet customer needs. If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.
This is largely a documentation effort, i.e. we should create either a KCS article or new documentation section which describes how customers should respond to loss of functionality during or after an upgrade.
KCS Solution : https://access.redhat.com/solutions/7083335
Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.
Given we test as many upgrade configurations as possible and for whatever reason the upgrade still encounters problems, we should not strive to comprehensively test all configurations for rollback success. We will only test a limited set of platforms and configurations necessary to ensure that we believe the platform is generally able to roll back a z-stream update.
Identify and fill gaps related to CI/CD for HyperShift-ARO integration.
We want to make sure ARO/HCP development happens while satisfying e2e expectations
There's a running, blocking test for azure in presubmits
Detail about what is specifically not being delivered in the story
This does not require a design proposal.
This does not require a feature gate.
As a (user persona), I want to be able to:
so that I can achieve
Description of criteria:
Detail about what is specifically not being delivered in the story
This requires/does not require a design proposal.
This requires/does not require a feature gate.
Currently, the resource group is deleted 24 hours after it is created. We can change this by setting an expiry tag on the resource group once it has been created.
The ability to set this expiry tag when creating the resource group, as part of the cluster creation command, would be nice.
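A minimal sketch of setting such a tag with the Azure CLI; the expiry tag key/value the cleanup tooling honors is an assumption here:

# Hypothetical: tag an existing resource group with an expiry date
az group update --name my-hcp-resource-group --set tags.expiry=2024-12-31

# Hypothetical: set the tag at creation time instead
az group create --name my-hcp-resource-group --location eastus --tags expiry=2024-12-31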
OCP 4 clusters still maintain pinned boot images. We have numerous clusters installed that have boot media pinned to first boot images as early as 4.1. In the future these boot images may not be certified by the OEM and may fail to boot on updated datacenter or cloud hardware platforms. These "pinned" boot images should be updateable so that customers can avoid this problem and better still scale out nodes with boot media that matches the running cluster version.
The bootimage references are currently saved off in the machineset by the openshift installer and are thereafter unmanaged. This machineset object is not updated on an upgrade, so any node scaled up from it will boot with the original “install” bootimage.
The “new” boot image references are available in the configmap/coreos-bootimages in the MCO namespace. Here is the PR that implemented this; it’s basically a CVO manifest that pulls from this file in the installer binary. Hence, they are updated on an upgrade. They can also be printed to the console with the following installer command: openshift-install coreos print-stream-json.
Implementing this portion should be as simple as iterating through each machineset and updating the disk image by cross-referencing the configmap against the architecture, region, and platform used in the machineset. This is where the installer figures out the bootimage during an install, so we could model this a bit after it.
It looks like we have Machine API objects for every platform-specific providerSpec (formerly called providerConfig) that we support here. We'd still have to special-case the actual image/AMI portion of this, but we should be able to leverage some of the work done in the installer (to generate machinesets, for example for GCP) to understand how the image reference is stored for every platform.
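As a rough illustration of the pieces involved, the commands below inspect the boot image references published by the CVO manifest and the boot image currently pinned in a machineset; the machineset name is a placeholder and the jsonpath shown is the AWS case (other platforms store the image reference under different providerSpec fields):

# Boot image references published for the current release
oc get configmap/coreos-bootimages -n openshift-machine-config-operator -o yaml

# Boot image (AMI) currently pinned in an AWS machineset
oc get machineset <machineset-name> -n openshift-machine-api \
  -o jsonpath='{.spec.template.spec.providerSpec.value.ami.id}'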
Done when:
For MVP, the goal is to
This feature is dedicated to enhancing data security and implementing encryption best practices across control planes, Etcd, and nodes for HyperShift with Azure. The objective is to ensure that all sensitive data, including secrets, is encrypted, thereby safeguarding against unauthorized access and ensuring compliance with data protection regulations.
Expose and propagate input for kms secret encryption similar to what we do in AWS.
See related discussion:
https://redhat-internal.slack.com/archives/CCV9YF9PD/p1696950850685729
Detail about what is specifically not being delivered in the story
This requires/does not require a design proposal.
This requires/does not require a feature gate.
As a (user persona), I want to be able to:
so that I can achieve
Description of criteria:
Detail about what is specifically not being delivered in the story
This requires/does not require a design proposal.
This requires/does not require a feature gate.
As a user of HCP on Azure, I would like to be able to pass a customer-managed key when creating a HC so that the disks for the VMs in the NodePool are encrypted.
Description of criteria:
Detail about what is specifically not being delivered in the story
This requires/does not require a design proposal.
This requires/does not require a feature gate.
As a user of HCP on Azure, I want to be able to provide a DiskEncryptionSet ID to encrypt the OS disks for the VMs in the NodePool so that the data on the OS disks will be protected by encryption.
Description of criteria:
N/A
Epic Goal*
There was an epic / enhancement to create a cluster-wide TLS config that applies to all OpenShift components:
https://issues.redhat.com/browse/OCPPLAN-4379
https://github.com/openshift/enhancements/blob/master/enhancements/kube-apiserver/tls-config.md
For example, this is how KCM sets --tls-cipher-suites and --tls-min-version based on the observed config:
https://issues.redhat.com/browse/WRKLDS-252
https://github.com/openshift/cluster-kube-controller-manager-operator/pull/506/files
The cluster admin can change the config based on their risk profile, but if they don't change anything, there is a reasonable default.
We should update all CSI driver operators to use this config. Right now we have a hard-coded cipher list in library-go. See OCPBUGS-2083 and OCPBUGS-4347 for background context.
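For reference, this is the kind of cluster-wide TLS profile the CSI driver operators would observe, using the config.openshift.io/v1 APIServer fields; the cipher values here are illustrative:

apiVersion: config.openshift.io/v1
kind: APIServer
metadata:
  name: cluster
spec:
  tlsSecurityProfile:
    type: Custom
    custom:
      ciphers:
        - ECDHE-ECDSA-AES128-GCM-SHA256
        - ECDHE-RSA-AES128-GCM-SHA256
      minTLSVersion: VersionTLS12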
Why is this important? (mandatory)
This will keep the cipher list consistent across many OpenShift components. If the default list is changed, we get that change "for free".
It will reduce support calls from customers and backport requests when the recommended defaults change.
It will provide flexibility to the customer, since they can set their own TLS profile settings without requiring code change for each component.
Scenarios (mandatory)
As a cluster admin, I want to use TLSSecurityProfile to control the cipher list and minimum TLS version for all CSI driver operator sidecars, so that I can adjust the settings based on my own risk assessment.
Dependencies (internal and external) (mandatory)
None, the changes we depend on were already implemented.
Contributing Teams(and contacts) (mandatory)
Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.
Acceptance Criteria (optional)
Provide some (testable) examples of how we will know if we have achieved the epic goal.
Drawbacks or Risk (optional)
Reasons we should consider NOT doing this such as: limited audience for the feature, feature will be superseded by other work that is planned, resulting feature will introduce substantial administrative complexity or user confusion, etc.
Done - Checklist (mandatory)
The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.
This feature introduces automatic Etcd snapshot functionality for self-managed hosted control planes, expanding control and flexibility for users. Unlike managed hosted control planes, self-managed environments allow for customized configurations. This feature aims to enable users to leverage any S3-compatible storage for etcd snapshot storage, ensuring high availability and resilience for their OpenShift clusters.
As an engineer I need to perform a spike on backup and restore procedures on a HostedCluster using OADP, in order to know if that could be an alternative or a resource in the new ETCD Backup API for HCP.
Goal:
As an administrator, I would like to use my own managed DNS solution instead of only specific openshift-install supported DNS services (such as AWS Route53, Google Cloud DNS, etc...) for my OpenShift deployment.
Problem:
While cloud-based DNS services provide convenient hostname management, there's a number of regulatory (ITAR) and operational constraints customers face prohibiting the use of those DNS hosting services on public cloud providers.
Why is this important:
Dependencies (internal and external):
Prioritized epics + deliverables (in scope / not in scope):
Estimate (XS, S, M, L, XL, XXL):
Previous Work:
Open questions:
Link to Epic: https://docs.google.com/document/d/1OBrfC4x81PHhpPrC5SEjixzg4eBnnxCZDr-5h3yF2QI/edit?usp=sharing
Append an Infra CR with only the GCP PlatformStatus field set with the LB IPs (without any other fields, especially the Spec) at the end of the bootstrap ignition. The theory is that when the Infra CR is applied from the bootstrap ignition, the infra manifest is applied first. As we progress through the other assets in the ignition files, the Infra CR appears again, but with only the LB IPs set. That way it will update the existing Infra CR already applied to the cluster.
As a (user persona), I want to be able to:
so that I can achieve
Description of criteria:
Detail about what is specifically not being delivered in the story
This requires/does not require a design proposal.
This requires/does not require a feature gate.
OCP/Telco Definition of Done
Epic Template descriptions and documentation.
<--- Cut-n-Paste the entire contents of this description into your new Epic --->
This Epic details the work required to augment this CoreDNS pod to also resolve the *.apps URL. In addition, it will include changes to prevent the Ingress Operator from configuring the cloud DNS after the ingress LBs have been created.
As a (user persona), I want to be able to:
so that I can achieve
Description of criteria:
Detail about what is specifically not being delivered in the story
This requires/does not require a design proposal.
This requires/does not require a feature gate.
Stop generating long-lived service account tokens. Long-lived service account tokens are currently generated in order to then create an image pull secret for the internal image registry. This feature calls for using the TokenRequest API to generate a bound service account token for use in the image pull secret.
Use TokenRequest API to create image pull secrets.
Performance benefits:
One less secret created per service account. This will result in at least three fewer secrets generated per namespace.
Security benefits:
Long-lived tokens are no longer recommended, as they present a possible security risk; this change removes the need to generate them.
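For illustration, a bound token of the kind the TokenRequest API produces can be requested like this; the service account and namespace names are placeholders:

# Request a short-lived, bound token for a service account via the TokenRequest API
oc create token <service-account> -n <namespace> --duration=1h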
Requirements (aka. Acceptance Criteria):
A list of specific needs or objectives that a feature must deliver in order to be considered complete. Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc. Initial completion during Refinement status.
Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.
Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.
High-level list of items that are out of scope. Initial completion during Refinement status.
Provide any additional context that is needed to frame the feature. Initial completion during Refinement status.
Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.
Provide information that needs to be considered and planned so that documentation will meet customer needs. If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.
Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.
The upstream test `ServiceAccounts no secret-based service account token should be auto-generated` was previously patched to allow for the internal image registry's managed image pull secret to be present in the `Secrets` field. This will no longer be the case as of 4.16.
Post merge of API-1644, we can remove the patch entirely.
Provide mechanisms for the builder service account to be made optional in core OpenShift.
< Who benefits from this feature, and how? What is the difference between today’s current state and a world with this feature? >
Requirements | Notes | IS MVP |
Disable service account controller related to Build/BuildConfig when Build capability is disabled | When the API is marked as removed or disabled, stop creating the "builder" service account and its associated RBAC | Yes |
Option to disable the "builder" service account | Even if the Build capability is enabled, allow admins to disable the "builder" service account generation. Admins will need to bring their own service accounts/RBAC for builds to work | Yes |
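For context, cluster capabilities are chosen at install time; a sketch of the relevant install-config stanza, assuming the "Build" capability name used by recent releases, would look roughly like this:

capabilities:
  baselineCapabilitySet: None
  additionalEnabledCapabilities:
    - MachineAPI
    # "Build" deliberately left out, so the Build capability (and, with this
    # feature, the "builder" service account and its RBAC) is not enabled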
< What are we making, for who, and why/what problem are we solving?>
<Defines what is not included in this story>
< Link or at least explain any known dependencies. >
Background, and strategic fit
< What does the person writing code, testing, documenting need to know? >
< Are there assumptions being made regarding prerequisites and dependencies?>
< Are there assumptions about hardware, software or people resources?>
< Are there specific customer environments that need to be considered (such as working with existing h/w and software)?>
< What educational or reference material (docs) is required to support this product feature? For users/admins? Other functions (security officers, etc)? >
< Does this feature have doc impact? Possible values are: New Content, Updates to existing content, Release Note, or No Doc Impact?>
< Are there assumptions being made regarding prerequisites and dependencies?>
< Are there assumptions about hardware, software or people resources?>
< If the feature is ordered with other work, state the impact of this feature on the other work>
As a cluster admin trying to disable the Build, DeploymentConfig, and Image Registry capabilities I want the RBAC controllers for the builder and deployer service accounts and default image-registry rolebindings disabled when their respective capability is disabled.
<Describes high level purpose and goal for this story. Answers the questions: Who is impacted, what is it and why do we need it? How does it improve the customer's experience?>
<Describes the context or background related to this story>
In WRKLDS-695, ocm-o was enhanced to disable the Build and DeploymentConfig controllers when the respective capability was disabled. This logic should be extended to include the controllers that set up the service accounts and role bindings for these respective features.
<Defines what is not included in this story>
<Description of the general technical path on how to achieve the goal of the story. Include details like json schema, class definitions>
<Describes what this story depends on. Dependent Stories and EPICs should be linked to the story.>
Dependencies identified
Blockers noted and expected delivery timelines set
Design is implementable
Acceptance criteria agreed upon
Story estimated
Unknown
Verified
Unsatisfied
As an OpenShift engineer trying to use capabilities to enable/disable the Build and DeploymentConfig systems, I want to refactor the default rolebindings controller so that each respective capability runs a separate controller.
<Describes high level purpose and goal for this story. Answers the questions: Who is impacted, what is it and why do we need it? How does it improve the customer’s experience?>
<Describes the context or background related to this story>
OpenShift has a controller that automatically creates role bindings for service accounts in every namespace. Though only one controller operates, its logic contains forks that are specific to the Build and DeploymentConfig systems.
The goal is to refactor this into separate controllers so that individual ones can be disabled by the cluster-openshift-controller-manager-operator.
<Defines what is not included in this story>
<Description of the general technical path on how to achieve the goal of the story. Include details like json schema, class definitions>
<Describes what this story depends on. Dependent Stories and EPICs should be linked to the story.>
<Describe edge cases to consider when implementing the story and defining tests>
<Provides a required and minimum list of acceptance tests for this story. More is expected as the engineer implements this story>
Dependencies identified
Blockers noted and expected delivery timelines set
Design is implementable
Acceptance criteria agreed upon
Story estimated
Unknown
Verified
Unsatisfied
A 5G Core cluster communicates with many other clusters and remote hosts, aka remote sites. These remote sites are added, and sometimes removed temporarily or permanently, from the telecom operator's network. As of today, we need to add static routes on OCP nodes to be able to reach those remote sites, and updating static routes for each and every remote-site change is unacceptable for Telco operators, in particular as they do not have to configure static routes anywhere else in their network because they rely on BGP to propagate routes. They want OCP nodes to learn those BGP routes rather than to create a "script" that translates BGP announcements into an NMSTATE configuration to be pushed onto OCP nodes.
More insights in (recording link on the title page) https://docs.google.com/presentation/d/1zW0-wmvtrU7dbApIaNWfvMbSZvLxoSMLa_gzRYn0Y3Y/edit#slide=id.gfb215b3717_0_3299
VRF: in this feature, a VRF is a Network VRF, not a Linux kernel VRF. A Network VRF is often implemented as a VLAN in a datacenter, but it is a purely logical entity on the routers/DCGW and is not visible on the OCP nodes. Please do not confuse this with another Feature that relies on Linux VRFs to map the Network VRFs on OCP nodes (https://issues.redhat.com/browse/TELCOSTRAT-76).
Any Kubernetes object/component learning/announcing routes, including pods, is not in this feature's scope. This feature is about OCP nodes (i.e., the Linux hosts) learning routes via BGP, and eventually monitoring their next hop with BFD (datacenter router, DCGW, fabric leaf, ...).
A route's next hop can be on any OCP node interface (any VLAN, bond, physical NIC): next hops are not necessarily reachable from the baremetal network. This means that we will have one BGP peer per VRF.
We want to be able to learn routes with the same weight/local-preference, translating to ECMP routes on the OCP nodes, and also routes with different weight/local-preference, translating to active/backup routes on OCP nodes. In all cases, we want to be able to monitor the routes via BFD as BGP timeouts are too high for some customers.
This feature is limited to BGP and BFD, and must support IPv4, IPv6, and the commonly used BGP attributes, typically the ones supported by MetalLB: https://docs.openshift.com/container-platform/4.12/networking/metallb/metallb-configure-bgp-peers.html#nw-metallb-bgppeer-cr_configure-metallb-bgp-peers
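For a sense of the attributes in question, a MetalLB BGPPeer CR of the kind referenced above looks roughly like this; the addresses, ASNs, and BFD profile name are illustrative:

apiVersion: metallb.io/v1beta2
kind: BGPPeer
metadata:
  name: dcgw-vrf1
  namespace: metallb-system
spec:
  myASN: 64520
  peerASN: 64512
  peerAddress: 192.0.2.1
  bfdProfile: fast-bfd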
The OCP nodes will learn routes but will not announce routes, except the metalLB ones.
The expectation is to have 3 VRFs and two routers per VRF. We should scale beyond that, in particular for the number of VRFs, but the number of routers per VRF should be 4 at a maximum, as best and sane practice is 2 and we do not want to encourage faulty/wrong network designs. Of course, we can amend this Feature's scope in the future if best practices evolve.
FRR is interoperable with all/most existing routers, and as Red Hat is upstream based, interoperability tests are not required but any interoperability issue will be fixed (upstream first, like for any code change at Red Hat).
This feature is only relevant for local gateway mode (not shared gateway).
OCP/Telco Definition of Done
Epic Template descriptions and documentation.
<--- Cut-n-Paste the entire contents of this description into your new Epic --->
Telecommunications providers look to displace Physical Network Functions (PNFs) with modern Virtual Network Functions (VNFs) at the Far Edge. Single Node OpenShift, as the CaaS layer in the vRAN vDU architecture, must achieve a higher standard with regard to OpenShift upgrade speed and efficiency in comparison to PNFs.
Telecommunications providers currently deploy Firmware-based Physical Network Functions (PNFs) in their RAN solutions. These PNFs can be upgraded quickly due to their monolithic nature and image-based download-and-reboot upgrades. Furthermore they often have the ability to retry upgrades and to rollback to the previous image if the new image fails. These Telcos are looking to displace PNFs with virtual solutions, but will not do so unless the virtual solutions have comparable operational KPIs to the PNFs.
Service (vDU) Downtime is the time when the CNF is not operational and therefore no traffic is passing through the vDU. This has a significant impact as it degrades the customer’s service (5G->4G) or there’s an outright service outage. These disruptions are scheduled into Maintenance Windows (MWs), but the Telecommunications Operator’s primary goal is to keep service running, so getting vRAN solutions with OpenShift to near-PNF-like Service Downtime is, and always will be, a primary requirement.
Upgrading OpenShift is only one of many operations that occur during a Maintenance Window. Reducing the CaaS upgrade duration is meaningful to many teams within a Telecommunications Operator’s organization, as this duration fits into a larger set of activities that put pressure on the duration time for Red Hat software. OpenShift must reduce the upgrade duration significantly to compete with existing PNF solutions.
As mentioned above, the Service Downtime disruption duration must be as small as possible, including when there are failures. Hardware failures fall into a category called Break+Fix and are covered by TELCOSTRAT-165. Software failures must be detected, and remediation must occur.
Detection includes monitoring the upgrade for stalls and failures and remediation would require the ability to rollback to the previously well-known-working version, prior to the failed upgrade.
The OpenShift product support terms are too short for Telco use cases, in particular vRAN deployments. The risk of Service Downtime drives Telecommunications Operators to a certify-deploy-and-then-don’t-touch model. One specific request from our largest Telco Edge customer is for 4 years of support.
These longer support needs drive a misalignment with the EUS->EUS upgrade path and drive the requirement that the Single Node OpenShift deployment can be upgraded from OCP X.y.z to any future minor and z-stream release, where the target minor version is decided by the Telecommunications Operator depending on timing and the desired feature set, and the target z-stream is determined through Red Hat, vDU vendor, and customer maintenance and engineering validation.
Red Hat is challenged with improving multiple OpenShift Operational KPIs by our telecommunications partners and customers. Improved Break+Fix is tracked in TELCOSTRAT-165 and improved Installation is tracked in TELCOSTRAT-38.
Whatever methodology achieves the above requirements must ensure that the customer has a pleasant experience via RHACM and Red Hat GitOps. Red Hat’s current install and upgrade methodology is via RHACM and any new technologies used to improve Operational KPIs must retain the seamless experience from the cluster management solution. For example, after a cluster is upgraded it must look the same to a RHACM Operator.
Whatever methodology achieves the above requirements must ensure that a technician troubleshooting a Single Node OpenShift deployment has a pleasant experience. All commands issued on the node must return output as it would before performing an upgrade.
Run TuneD in a container in one-shot mode and read the output kernel arguments in order to apply them using a MachineConfig (MC).
This would be run in the bootstrap procedure of the OpenShift Installer, just before the MachineConfigOperator (MCO) procedure here.
Initial considerations: https://docs.google.com/document/d/1zUpcpFUp4D5IM4GbM4uWbzbjr57h44dS0i4zP-hek2E/edit
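A minimal sketch of the second half of that flow, wrapping the kernel arguments harvested from the one-shot TuneD run in a MachineConfig; the argument values below are placeholders:

apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: 99-master-tuned-kernel-args
  labels:
    machineconfiguration.openshift.io/role: master
spec:
  kernelArguments:
    - hugepagesz=1G   # values would come from the one-shot TuneD output
    - hugepages=16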
Webhooks created during bootstrap-in-place will lead to failure in applying their admission subject resources. A permanent solution will be provided by resolving https://issues.redhat.com/browse/OCPBUGS-28550
We need a temporary workaround to unblock the development. This is easiest to do by changing the validating webhook failure policy to "Ignore"
A systemd service that runs on a golden image's first boot and configures the following (see the sketch after this list):
1. networking ( the internal IP address require special attention)
2. Update the hostname (MGMT-15775)
3. Execute recert (regenerate certs, cluster name and base domain; MGMT-15533)
4. Start kubelet
5. Apply the personalization info:
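A rough, purely illustrative shell sketch of what such a first-boot service could execute; the helper script names are hypothetical:

#!/bin/bash
set -euo pipefail
configure-networking.sh    # 1. networking, incl. the internal IP address (hypothetical helper)
update-hostname.sh         # 2. update the hostname (MGMT-15775) (hypothetical helper)
recert                     # 3. regenerate certs, cluster name and base domain (MGMT-15533)
systemctl start kubelet    # 4. start kubelet
apply-personalization.sh   # 5. apply the personalization info (hypothetical helper)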
If the answer is "yes", please make sure to check the corresponding option.
The following features depend on this functionality:
In IBI and IBU flows we need a way to change the nodeip-configuration hint file without a reboot and before MCO even starts. In order for MCO to be happy we need to remove this file from its management; to do that we will stop using a machine config and move to ignition.
Currently the behavior is to only use the default security group if no security groups were specified for a NodePool. This makes it difficult to implement additional security groups in ROSA because there is no way to know the default security group on cluster creation. By always appending the default security group, any security groups specified on the NodePool become additional security groups.
As a ROSA HyperShift customer I want to enforce that IMDSv2 is always the default, to ensure that I have the most secure setting by default.
Integration Testing:
Beta:
GA:
GREEN | YELLOW | RED
GREEN = On track, minimal risk to target date.
YELLOW = Moderate risk to target date.
RED = High risk to target date, or blocked and need to highlight potential
risk to stakeholders.
Links to Gdocs, github, and any other relevant information about this epic.
When this image was assembled, these features were not yet completed. Therefore, only the Jira Cards included here are part of this release
We want to add a new CLI option named `--all-images` to the `oc adm must-gather` command.
The oc client will scan all the CSVs on the cluster looking for the `operators.openshift.io/must-gather-image=pullUrlOfTheImage` annotation, build a list of must-gather images to be used to collect logs for all the products installed on the cluster, and then execute all of them, aggregating the collection results.
Operator authors that want to opt in to this mechanism should explicitly annotate their CSV with the `operators.openshift.io/must-gather-image` annotation.
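For illustration (the CSV name, namespace, and image URL are placeholders), opting in and then collecting everything would look roughly like this:

# Operator authors opt in by annotating their CSV with their must-gather image
oc annotate csv my-operator.v1.2.3 -n my-operator-namespace \
  operators.openshift.io/must-gather-image=registry.example.com/my-operator/must-gather:v1.2.3

# Cluster admins then collect logs for every opted-in product in one pass
oc adm must-gather --all-images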
As a stakeholder aiming to adopt KubeSaw as a Namespace-as-a-Service solution, I want the project to provide streamlined tooling and a clear code-base, ensuring seamless adoption and integration into my clusters.
Efficient adoption of KubeSaw, especially as a Namespace-as-a-Service solution, relies on intuitive tooling and a transparent codebase. Improving these aspects will empower stakeholders to effortlessly integrate KubeSaw into their Kubernetes clusters, ensuring a smooth transition to enhanced namespace management.
As a Stakeholder, I want a streamlined setup of the KubeSaw project and a fully automated way of upgrading this setup along with updates to the installation.
The expected outcome within the market is both growth and retention. The improved tooling and codebase will attract new stakeholders (growth) and enhance the experience for existing users (retention) by providing a straightforward path to adopting KubeSaw's Namespace-as-a-Service features in their clusters.
This epic is to track all the unplanned work related to security incidents, fixing flaky e2e tests, and other urgent and unplanned efforts that may arise during the sprint.
Placeholder feature for ccx-ocp-core maintenance tasks.
This epic tracks "business as usual" requirements / enhancements / bug fixing of the Insights Operator.
Description of problem:
InsightsRecommendationActive firing, description link results in "Invalid parameter: redirect_uri" on sso.redhat.com. Insights recommendation "OpenShift cluster with more or less than 3 control plane node replicas is not supported by Red Hat" with total risk "Moderate" was detected on the cluster. More information is available at https://console.redhat.com/openshift/insights/advisor/clusters/<UID>?first=ccx_rules_ocp.external.rules.control_plane_replicas|CONTROL_PLANE_NODE_REPLICAS.
Version-Release number of selected component (if applicable):
4.15.14
How reproducible:
unknown
Steps to Reproduce:
1. Install 4.15.14 on a cluster that triggers this alert 2. Log out of Red Hat SSO 3. Click the link in the alert description
Actual results:
"Invalid parameter: redirect_uri" on sso.redhat.com
Expected results:
Link successfully navigates through SSO
Additional info:
This epic tracks "business as usual" requirements / enhancements / bug fixing of the Insights Operator.
Description of problem:
The operator panics in HyperShift hosted cluster with OVN and with enabled networking obfuscation:
1 runtime.go:79] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference) goroutine 858 [running]: k8s.io/apimachinery/pkg/util/runtime.logPanic({0x26985e0?, 0x454d700}) /go/src/github.com/openshift/insights-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:75 +0x99 k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0xc0010d67e0?}) /go/src/github.com/openshift/insights-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:49 +0x75 panic({0x26985e0, 0x454d700}) /usr/lib/golang/src/runtime/panic.go:884 +0x213 github.com/openshift/insights-operator/pkg/anonymization.getNetworksFromClusterNetworksConfig(...) /go/src/github.com/openshift/insights-operator/pkg/anonymization/anonymizer.go:292 github.com/openshift/insights-operator/pkg/anonymization.getNetworksForAnonymizer(0xc000556700, 0xc001154ea0, {0x0, 0x0, 0x0?}) /go/src/github.com/openshift/insights-operator/pkg/anonymization/anonymizer.go:253 +0x202 github.com/openshift/insights-operator/pkg/anonymization.(*Anonymizer).readNetworkConfigs(0xc0005be640) /go/src/github.com/openshift/insights-operator/pkg/anonymization/anonymizer.go:180 +0x245 github.com/openshift/insights-operator/pkg/anonymization.(*Anonymizer).AnonymizeMemoryRecord.func1() /go/src/github.com/openshift/insights-operator/pkg/anonymization/anonymizer.go:354 +0x25 sync.(*Once).doSlow(0xc0010d6c70?, 0x21a9006?) /usr/lib/golang/src/sync/once.go:74 +0xc2 sync.(*Once).Do(...) /usr/lib/golang/src/sync/once.go:65 github.com/openshift/insights-operator/pkg/anonymization.(*Anonymizer).AnonymizeMemoryRecord(0xc0005be640, 0xc000cf0dc0) /go/src/github.com/openshift/insights-operator/pkg/anonymization/anonymizer.go:353 +0x78 github.com/openshift/insights-operator/pkg/recorder.(*Recorder).Record(0xc00075c4b0, {{0x2add75b, 0xc}, {0x0, 0x0, 0x0}, {0x2f38d28, 0xc0009c99c0}}) /go/src/github.com/openshift/insights-operator/pkg/recorder/recorder.go:87 +0x49f github.com/openshift/insights-operator/pkg/gather.recordGatheringFunctionResult({0x2f255c0, 0xc00075c4b0}, 0xc0010d7260, {0x2adf900, 0xd}) /go/src/github.com/openshift/insights-operator/pkg/gather/gather.go:157 +0xb9c github.com/openshift/insights-operator/pkg/gather.collectAndRecordGatherer({0x2f50058?, 0xc001240c90?}, {0x2f30880?, 0xc000994240}, {0x2f255c0, 0xc00075c4b0}, {0x0?, 0x8dcb80?, 0xc000a673a2?}) /go/src/github.com/openshift/insights-operator/pkg/gather/gather.go:113 +0x296 github.com/openshift/insights-operator/pkg/gather.CollectAndRecordGatherer({0x2f50058, 0xc001240c90}, {0x2f30880, 0xc000994240?}, {0x2f255c0, 0xc00075c4b0}, {0x0, 0x0, 0x0}) /go/src/github.com/openshift/insights-operator/pkg/gather/gather.go:89 +0xe5 github.com/openshift/insights-operator/pkg/controller/periodic.(*Controller).Gather.func2(0xc000a678a0, {0x2f50058, 0xc001240c90}, 0xc000796b60, 0x26f0460?) 
/go/src/github.com/openshift/insights-operator/pkg/controller/periodic/periodic.go:206 +0x1a8 github.com/openshift/insights-operator/pkg/controller/periodic.(*Controller).Gather(0xc000796b60) /go/src/github.com/openshift/insights-operator/pkg/controller/periodic/periodic.go:222 +0x450 github.com/openshift/insights-operator/pkg/controller/periodic.(*Controller).periodicTrigger(0xc000796b60, 0xc000236a80) /go/src/github.com/openshift/insights-operator/pkg/controller/periodic/periodic.go:265 +0x2c5 github.com/openshift/insights-operator/pkg/controller/periodic.(*Controller).Run.func1() /go/src/github.com/openshift/insights-operator/pkg/controller/periodic/periodic.go:161 +0x25 k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0x30?) /go/src/github.com/openshift/insights-operator/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:226 +0x3e k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc00007d7c0?, {0x2f282a0, 0xc0012cd800}, 0x1, 0xc000236a80) /go/src/github.com/openshift/insights-operator/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:227 +0xb6 k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc001381fb0?, 0x3b9aca00, 0x0, 0x0?, 0x449705?) /go/src/github.com/openshift/insights-operator/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:204 +0x89 k8s.io/apimachinery/pkg/util/wait.Until(0xabfaca?, 0x88d6e6?, 0xc00078a360?) /go/src/github.com/openshift/insights-operator/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:161 +0x25 created by github.com/openshift/insights-operator/pkg/controller/periodic.(*Controller).Run /go/src/github.com/openshift/insights-operator/pkg/controller/periodic/periodic.go:161 +0x1ea
Version-Release number of selected component (if applicable):
How reproducible:
Enable networking obfuscation for the Insights Operator and wait for gathering to happen in the operator. You will see the above stacktrace.
Steps to Reproduce:
1. Create a HyperShift hosted cluster with OVN 2. Enable networking obfuscation for the Insights Operator 3. Wait for data gathering to happen in the operator
Actual results:
operator panics
Expected results:
there's no panic
Additional info:
By having the OpenShift Insights Operator gather data from OSP CRDs, this data can be ingested into Insights and plugged into the existing tools and processes under the Insights/CCX teams.
This will allow us to create dedicated Superset dashboards or query the data using SQL via the Trino API to, for example:
Besides, customers will benefit from any Insights rules that we'll be adding over time to, for example, anticipate issues or detect misconfigurations, suggest parameter tunings, etcetera.
Examples of how OCP Insights uses this data can be seen in the "Let's Do The Numbers" series of monthly presentations.
This epic is targeted only at RHOSO (so OSP 18 and newer). There are no changes nor support for this planned in OSP 17.1 or older.
It is an implementation of solution 1 from the document https://docs.google.com/document/d/1r3sC_7ZU7qkxvafpEkAJKMTmtcWAwGOI6W_SZGkvP6s/edit#heading=h.kfjcs2uvui3g
We need to, based on Yatin Karel's patch, properly integrate our CRs with the insights-operator. It needs to collect data from the 'OpenstackControlPlane', 'OpenstackDataPlaneDeployment' and 'OpenstackDataPlaneNodeSet' CRs with proper anonymization of data like IP addresses etc. It also needs to set a suitable ID to identify the OpenStack cluster, as we cannot rely on the OpenShift clusterID because we may have more than one OpenStack cluster on the same OCP cluster.
To identify the OpenStack cluster, maybe the UUID of the OpenstackControlPlane CR can be used. If not, we will need to figure out something else.
Definition of done:
As a result of Hashicorp's license change to BSL, Red Hat OpenShift needs to remove the use of Hashicorp's Terraform from the installer – specifically for IPI deployments which currently use Terraform for setting up the infrastructure.
To avoid an increased support overhead once the license changes at the end of the year, we want to provision IBM Cloud VPC infrastructure without the use of Terraform.
Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.
Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.
High-level list of items that are out of scope. Initial completion during Refinement status.
Provide any additional context that is needed to frame the feature. Initial completion during Refinement status.
Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.
Provide information that needs to be considered and planned so that documentation will meet customer needs. If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.
Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.
Enhance Dynamic plugin with similar capabilities as Static page. Add new control and security related enhancements to Static page.
The Dynamic plugin should list pipelines similar to the current static page.
The Static page should allow users to override task and sidecar task parameters.
The Static page should allow users to control tasks that are setup for manual approval.
The TSSC security and compliance policies should be visible in Dynamic plugin.
Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.
Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.
High-level list of items that are out of scope. Initial completion during Refinement status.
Provide any additional context that is needed to frame the feature. Initial completion during Refinement status.
Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.
Provide information that needs to be considered and planned so that documentation will meet customer needs. If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.
Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.
As a DevOps Engineer, I want to add manual approval points in my pipeline so that the pipeline pauses at that point and waits for a manual approval before continuing execution. Manual approvals are commonly used for approving deploying to production or modeling activities that are not automated (e.g. manual testing) in the pipeline.
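A hedged sketch of what a pipeline with such a manual approval point could look like, using Tekton's custom-task reference; the apiVersion/kind of the approval task are assumptions, and the surrounding task names are placeholders:

apiVersion: tekton.dev/v1
kind: Pipeline
metadata:
  name: build-and-deploy
spec:
  tasks:
    - name: build
      taskRef:
        name: build-image                              # placeholder task
    - name: wait-for-approval                          # pipeline pauses here until approved
      runAfter:
        - build
      taskRef:
        apiVersion: openshift-pipelines.org/v1alpha1   # assumption
        kind: ApprovalTask                             # assumption
    - name: deploy-to-prod
      runAfter:
        - wait-for-approval
      taskRef:
        name: deploy                                   # placeholder task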
Acceptance Criteria
As a developer, I want to remove the feature that introduces ApprovalTasks into the Developer Console from the Console repository, as it will be shipped as a dynamic plugin instead.
As a user, I want a list of all approvals needed for all my pipeline runs. From this page, I can approve or reject if I am an approver for the pipelines.
Description of problem:
Update the Pipeline/PipelineRun List and Details Pages to acknowledge Custom Task
Version-Release number of selected component (if applicable):
4.16.0
How reproducible:
Always when Custom Task is used
Steps to Reproduce:
1.Create a Pipeline with a CustomTask like the Approval Task 2.Check the Tasks List in the Pipeline/Run Details Page 3.Check the Progress Bar in the PipelineRun List page
Actual results:
CustomTask is not recognized. Either throwing undefined or showing Task as pending always.
Expected results:
Just like normal Tasks, CustomTasks should be fully integrated into the mentioned pages.
Additional info:
As a user, I need to properly distinguish between a classic Task and a CustomTask in the Pipeline Topology once I have created a Pipeline using the YAML view.
Also, I need to see the details of the CustomTask on hovering over the node.
As a developer, I need to look into the recent changes proposed by UX and apply those changes.
With the OpenShift Pipelines operator 1.2x we added support for a dynamic console plugin to the operator. In the first version it is only responsible for the Dashboard and the Pipeline/Repository Metrics tab. We want to move more and more code to the dynamic plugin and later remove it from the console repository.
Description of problem:
Multiple Output tabs is present on PipelineRun details page if dynamic Pipeline console-plugin is enabled.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Add TaskRun tab in PLR details page using plugin
Enable the OCP Console to send back user analytics to our existing endpoints in console.redhat.com. Please refer to doc for details of what we want to capture in the future:
Collect desired telemetry of user actions within OpenShift console to improve knowledge of user behavior.
The OpenShift console should be able to send telemetry to a pre-configured Red Hat proxy, which can forward it to 3rd-party services for analysis.
User analytics should respect the existing telemetry mechanism used to disable data being sent back
Need to update existing documentation with what user data we track from the OCP Console: https://docs.openshift.com/container-platform/4.14/support/remote_health_monitoring/about-remote-health-monitoring.html
Capture and send desired user analytics from OpenShift console to Red Hat proxy
Red Hat proxy to forward telemetry events to appropriate Segment workspace and Amplitude destination
Use existing setting to opt out of sending telemetry: https://docs.openshift.com/container-platform/4.14/support/remote_health_monitoring/opting-out-of-remote-health-reporting.html#opting-out-remote-health-reporting
Also, allow disabling just user analytics without affecting the rest of telemetry: add an annotation to the Console to disable just user analytics.
Update docs to show this method as well.
We will require a mechanism to store all the segment values
We need to be able to pass back orgID that we receive from the OCM subscription API call
Sending telemetry from OpenShift cluster nodes
Console already has support for sending analytics to segment.io in Dev Sandbox and OSD environments. We should reuse this existing capability, but default to http://console.redhat.com/connections/api for analytics and http://console.redhat.com/connections/cdn to load the JavaScript in other environments. We must continue to allow Dev Sandbox and OSD clusters a way to configure their own segment key, whether telemetry is enabled, segment API host, and other options currently set as annotations on the console operator configuration resource.
Console will need a way to determine the org-id to send with telemetry events. Likely the console operator will need to read this from the cluster pull secret.
Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.
API the Console uses:
const apiUrl = `/api/accounts_mgmt/v1/subscriptions?page=1&search=external_cluster_id%3D%27${clusterID}%27`;
Reference: Original Console PR
High Level Feature Details can be found here
Currently, we can get the organization ID from the OCM server by querying subscription and adding the fetchOrganization=true query parameter based on the comment.
We should be passing this ID as SERVER_FLAG.telemetry.ORGANIZATION_ID to the frontend, and as organizationId to Segment.io
Fetching should be done by the console-operator due to its RBAC permissions. Once the Organization_ID is retrieved, console operator should set it on the console-config.yaml, together with other telemetry variables.
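A hypothetical sketch of the resulting console-config.yaml excerpt, assuming a flat key/value telemetry section as described above; the organization ID value is a placeholder:

# Hypothetical excerpt of console-config.yaml rendered by the console operator
telemetry:
  ORGANIZATION_ID: "1234567"      # fetched from OCM via fetchOrganization=true
  SEGMENT_API_KEY: <segment-key>  # existing telemetry value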
AC:
Based on a discussion with Ali Mobrem and Parag Dave yesterday, we want to hide the analytics option from the cluster configuration.
We keep the option for a cluster admin to activate the opt-in or opt-out banner in the UI.
For clarification: This is an undocumented feature that we keep for customers that might request such a feature.
Meeting notes: https://docs.google.com/document/d/11gxr_7kxMqm1zSJC5pPVHZhvLJthZGLH-QvqCqAyFdc/edit
As a user, I want to opt-in or opt-out from the telemetry.
Whether the cluster prefers opt-in or opt-out should be configurable via SERVER_FLAGS.
The console telemetry plugin needs to send data to a new Red Hat ingress point that will then forward it to Segment for analysis.
Goal:
Update console telemetry plugin to send data to the appropriate ingress point.
Ingress point created for console.redhat.com
As an administrator, I want to disable all telemetry on my cluster including UI analytics sent to segment.
We should honor the existing telemetry configuration so that we send no analytics when an admin opts out of telemetry. See the documentation here:
Yes, this is the officially supported way to disable telemetry, though we also have a hidden flag in the CMO configmap that CI clusters use to disable telemetry (it depends on whether you want to push analytics for CI clusters).
CMO configmap is set to
data:
  config.yaml: |-
    telemeterClient:
      enabled: false
the CMO code that reads the cloud.openshift.com token:
https://github.com/openshift/cluster-monitoring-operator/blob/b7e3f50875f2bb1fed912b23fb80a101d3a786c0/pkg/manifests/config.go#L358-L386
Slack discussion https://redhat-internal.slack.com/archives/C0VMT03S5/p1707753976034809
# [1] Check cluster pull secret for cloud.openshift.com creds
oc get secret pull-secret -n openshift-config -o json | jq -r '.data.".dockerconfigjson"' | base64 -d | jq -r '.auths."cloud.openshift.com"'
# [2] Check cluster monitoring operator config for 'telemeterClient.enabled == false'
oc get configmap cluster-monitoring-config -n openshift-monitoring -o json | jq -r '.data."config.yaml"'
# [3] Check console operator config telemetry disabled annotation
oc get console.operator.openshift.io cluster -o json | jq -r '.metadata.annotations."telemetry.console.openshift.io/DISABLED"'
We want to enable segment analytics by default on all (incl. self-managed) OCP clusters using a known segment API key and the console.redhat.com proxy. We still want to honor the segment-related annotations on the console operator config for overriding these values.
Most likely the segment key should be defaulted in the console operator, otherwise we would need a separate console flag for disabling analytics. If the operator provides the key, then the console backend can use the presence of the key to determine when to enable analytics. We can likely change the segment URL and CDN default values directly in the console code, however.
ODC-7517 tracks disabling segment analytics when cluster telemetry is disabled, which is a separate change, but required for this work.
OpenShift UI Telemetry Implementation details
These three keys should have new default values:
Defaults:
stringData:
  SEGMENT_API_KEY: BnuS1RP39EmLQjP21ko67oDjhbl9zpNU
  SEGMENT_API_HOST: console.redhat.com/connections/api/v1
  SEGMENT_JS_HOST: console.redhat.com/connections/cdn
AC:
The console telemetry plugin needs to send data to a new Red Hat ingress point that will then forward it to Segment for analysis.
For that the telemetry-console-plugin must have options to configure where it loads the analytics.js and where to send the API calls (analytics events).
Volume Group Snapshots is a key new Kubernetes storage feature that allows multiple PVs to be grouped together and snapshotted at the same time. This enables customers to takes consistent snapshots of applications that span across multiple PVs.
This is also a key requirement for backup and DR solutions.
https://kubernetes.io/blog/2023/05/08/kubernetes-1-27-volume-group-snapshot-alpha/
https://github.com/kubernetes/enhancements/tree/master/keps/sig-storage/3476-volume-group-snapshot
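To illustrate the API shape described in the upstream KEP/blog linked above (a sketch; the class name, namespace, and labels are illustrative):

apiVersion: groupsnapshot.storage.k8s.io/v1alpha1
kind: VolumeGroupSnapshot
metadata:
  name: app-group-snapshot
  namespace: my-app
spec:
  volumeGroupSnapshotClassName: csi-group-snapclass
  source:
    selector:
      matchLabels:
        app: my-app   # snapshot all PVCs carrying this label together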
Productise the volume group snapshots feature as Tech Preview: provide docs and testing, as well as a feature gate to enable it, so that customers and partners can test it in advance.
The feature should graduate to beta upstream in order to become Tech Preview in OCP. Tests and CI must pass, and a feature gate should allow customers and partners to easily enable it. We should identify all CSI drivers shipped with OCP that support this feature and configure them accordingly.
CSI drivers development/support of this feature.
Provide any additional context that is needed to frame the feature. Initial completion during Refinement status.
Drivers must support this feature and enable it. Partners may need to change their operator and/or doc to support it.
Document how to enable the feature, what this feature does and how to use it. Update the OCP driver's table to include this capability.
Can be leveraged by ODF and OCP virt, especially around backup and DR scenarios.
Epic Goal*
Create an OCP feature gate that allows customers and partners to use the VolumeGroupSnapshot feature while the feature is in alpha & beta upstream.
Why is this important? (mandatory)
Volume group snapshot is an important feature for ODF, OCP virt and backup partners. It requires driver support so partners need early access to the feature to confirm their driver works as expected before GA. The same applies to backup partners.
Scenarios (mandatory)
Provide details for user scenarios including actions to be performed, platform specifications, and user personas.
Dependencies (internal and external) (mandatory)
This depends on the driver's support, the feature gate will enable it in the drivers that support it (OCP shipped drivers).
The feature gate should
Contributing Teams(and contacts) (mandatory)
Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.
Acceptance Criteria (optional)
By enabling the feature gate, partners should be able to use the VolumeGroupSnapshot API. Drivers not shipped with OCP may need to be configured.
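A sketch of how such a gate would typically be enabled via the cluster FeatureGate resource; the gate name itself is an assumption until the implementation lands:

apiVersion: config.openshift.io/v1
kind: FeatureGate
metadata:
  name: cluster
spec:
  featureSet: CustomNoUpgrade
  customNoUpgrade:
    enabled:
      - VolumeGroupSnapshot   # gate name is an assumption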
Drawbacks or Risk (optional)
Reasons we should consider NOT doing this such as: limited audience for the feature, feature will be superseded by other work that is planned, resulting feature will introduce substantial administrative complexity or user confusion, etc.
Done - Checklist (mandatory)
The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.
From apiserver audit logs:
customresourcedefinitions.apiextensions.k8s.io "volumegroupsnapshotclasses.groupsnapshot.storage.k8s.io" not found
customresourcedefinitions.apiextensions.k8s.io "volumegroupsnapshotcontents.groupsnapshot.storage.k8s.io" not found
customresourcedefinitions.apiextensions.k8s.io "volumegroupsnapshots.groupsnapshot.storage.k8s.io" not found
clusterrolebindings.rbac.authorization.k8s.io "csi-snapshot-webhook-clusterrolebinding" not found
clusterrolebindings.rbac.authorization.k8s.io "csi-snapshot-controller-clusterrolebinding" not found
clusterroles.rbac.authorization.k8s.io "csi-snapshot-controller-clusterrole" not found
clusterroles.rbac.authorization.k8s.io "csi-snapshot-webhook-clusterrole" not found
customresourcedefinitions.apiextensions.k8s.io "volumegroupsnapshotclasses.groupsnapshot.storage.k8s.io" not found
customresourcedefinitions.apiextensions.k8s.io "volumegroupsnapshotcontents.groupsnapshot.storage.k8s.io" not found
customresourcedefinitions.apiextensions.k8s.io "volumegroupsnapshots.groupsnapshot.storage.k8s.io" not found
clusterrolebindings.rbac.authorization.k8s.io "csi-snapshot-webhook-clusterrolebinding" not found
clusterrolebindings.rbac.authorization.k8s.io "csi-snapshot-controller-clusterrolebinding" not found
clusterroles.rbac.authorization.k8s.io "csi-snapshot-controller-clusterrole" not found
clusterroles.rbac.authorization.k8s.io "csi-snapshot-webhook-clusterrole" not found
This Feature covers the pending tasks from OCPBU-16 to be covered in openshift-4.14.
Goal: Control plane nodes in the cluster can be scaled up or down, lost and recovered, with no more importance or special procedure than that of a data plane node.
Problem: There is a lengthy special procedure to recover from a failed control plane node (or majority of nodes) and to add new control plane nodes.
Why is this important: Increased operational simplicity and scale flexibility of the cluster's control plane deployment.
See slack working group: #wg-ctrl-plane-resize
As a user I'd like to be warned when I'm setting up a Control Plane Machine Set for my control plane machines on GCP and I don't have the necessary TargetPools requirement for it to work correctly.
We previously had a validating webhook on the CPMSO for GCP that would check if the control plane machine provider config set on the CPMS did have TargetPools, otherwise it would error.
This unfortunately collides with GCP Private clusters, see https://issues.redhat.com/browse/OCPBUGS-6760.
As such we decided to remove the check until warnings are supported in controller-runtime.
Once those land we can re-add the check and throw a warning in those situations, to still inform the user without disrupting normal operations where that's not an issue.
See: https://redhat-internal.slack.com/archives/C68TNFWA2/p1675105892589279.
Webhook warnings available for 0.16.0 release of controller-runtime
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-autoscaler-priority-expander
  namespace: kube-system
data:
  priorities: |-
    10:
      - .*t2\.large.*
      - .*t3\.large.*
    50:
      - .*m4\.4xlarge.*
Documentation will need to be updated to point out the new maximum for ROSA HCP clusters, and any expectations to set with customers.
The following flag needs to be set on the cluster-autoscaler: `--expander=priority`.
The configuration is based on the values stored in a ConfigMap called `cluster-autoscaler-priority-expander`, which will be created by the user/OCM.
Feature description
oc-mirror v2 focuses on major enhancements that make oc-mirror faster and more robust, introduce caching, and address more complex air-gapped scenarios. oc-mirror v2 is a rewritten version with three goals:
Currently, oc-mirror v2 uses distribution/distribution v3, which is not a stable version. This sub-task is to check the feasibility of using a stable version.
Support using a proxy in oc-mirror.
We have customers who want to use Docker Registries as proxies / pull-through caches.
This means that customers would need a way to get the ICSP/IDMS/ITMS and image list, which seems relevant to the "generating mapping files" work for the V2 tooling. We would like to make sure this is addressed in your use cases.
From our IBM sync
"We have customers who want to use Docker Registries as proxies / pull-through cache's. This means that customers would need a way to get the ICSP/IDMS/ITMS and image list which seems relevant to the "generating mapping files" for “V2 tooling”. Would like to make sure this is addressed in your use cases."
Description of problem:
When recovering signatures for releases, the http connection doesn't use the system proxy configuration
Version-Release number of selected component (if applicable):
WARNING: This version information is deprecated and will be replaced with the output from --short. Use --output=yaml|json to get the full version.
Client Version: version.Info
How reproducible:
Always

Image set config:

kind: ImageSetConfiguration
apiVersion: mirror.openshift.io/v2alpha1
mirror:
  platform:
    graph: true
    channels:
    - name: stable-4.16
    - name: stable-4.15
Steps to Reproduce:
1. Run oc-mirror with the above ImageSetConfiguration in mirror-to-mirror mode, in an environment that requires a proxy, for example as sketched below.
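For reference, a sketch of such a run (the proxy address and target registry are illustrative, and the exact flags may vary by oc-mirror version):

# Assumed environment: standard proxy variables exported before the run (illustrative proxy address).
export HTTP_PROXY=http://10.18.7.5:3128
export HTTPS_PROXY=http://10.18.7.5:3128
export NO_PROXY=localhost,127.0.0.1
# Mirror-to-mirror run using the ImageSetConfiguration above; the destination registry is hypothetical.
oc-mirror --v2 --config imageset-config.yaml docker://registry.example.com:5000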
Actual results:
2024/07/15 14:02:11 [WARN] : ⚠️ --v2 flag identified, flow redirected to the oc-mirror v2 version. This is Tech Preview, it is still under development and it is not production ready. 2024/07/15 14:02:11 [INFO] : 👋 Hello, welcome to oc-mirror 2024/07/15 14:02:11 [INFO] : ⚙️ setting up the environment for you... 2024/07/15 14:02:11 [INFO] : 🔀 workflow mode: mirrorToMirror 2024/07/15 14:02:11 [INFO] : 🕵️ going to discover the necessary images... 2024/07/15 14:02:11 [INFO] : 🔍 collecting release images... I0715 14:02:11.770186 2475426 core-cincinnati.go:477] Using proxy 10.18.7.5:3128 to request updates from https://api.openshift.com/api/upgrades_info/v1/graph?arch=amd64&channel=stable-4.16&id=3808de83-dfe4-42f6-8d5b-196ed1b5bbc6 2024/07/15 14:02:12 [INFO] : detected minimum version as 4.16.1 2024/07/15 14:02:12 [INFO] : detected minimum version as 4.16.1 I0715 14:02:12.321748 2475426 core-cincinnati.go:477] Using proxy 10.18.7.5:3128 to request updates from https://api.openshift.com/api/upgrades_info/v1/graph?arch=amd64&arch=amd64&channel=stable-4.16&channel=stable-4.16&id=3808de83-dfe4-42f6-8d5b-196ed1b5bbc6&id=3808de83-dfe4-42f6-8d5b-196ed1b5bbc6 I0715 14:02:12.485330 2475426 core-cincinnati.go:477] Using proxy 10.18.7.5:3128 to request updates from https://api.openshift.com/api/upgrades_info/v1/graph?arch=amd64&arch=amd64&arch=amd64&channel=stable-4.16&channel=stable-4.16&channel=stable-4.15&id=3808de83-dfe4-42f6-8d5b-196ed1b5bbc6&id=3808de83-dfe4-42f6-8d5b-196ed1b5bbc6&id=3808de83-dfe4-42f6-8d5b-196ed1b5bbc6 2024/07/15 14:02:12 [INFO] : detected minimum version as 4.15.20 2024/07/15 14:02:12 [INFO] : detected minimum version as 4.15.20 I0715 14:02:12.844366 2475426 core-cincinnati.go:477] Using proxy 10.18.7.5:3128 to request updates from https://api.openshift.com/api/upgrades_info/v1/graph?arch=amd64&arch=amd64&arch=amd64&arch=amd64&channel=stable-4.16&channel=stable-4.16&channel=stable-4.15&channel=stable-4.15&id=3808de83-dfe4-42f6-8d5b-196ed1b5bbc6&id=3808de83-dfe4-42f6-8d5b-196ed1b5bbc6&id=3808de83-dfe4-42f6-8d5b-196ed1b5bbc6&id=3808de83-dfe4-42f6-8d5b-196ed1b5bbc6 I0715 14:02:13.115004 2475426 core-cincinnati.go:477] Using proxy 10.18.7.5:3128 to request updates from https://api.openshift.com/api/upgrades_info/v1/graph?arch=amd64&channel=stable-4.15&id=00000000-0000-0000-0000-000000000000 I0715 14:02:13.784795 2475426 core-cincinnati.go:477] Using proxy 10.18.7.5:3128 to request updates from https://api.openshift.com/api/upgrades_info/v1/graph?arch=amd64&arch=amd64&channel=stable-4.15&channel=stable-4.15&id=00000000-0000-0000-0000-000000000000&id=00000000-0000-0000-0000-000000000000&version=4.15.20 I0715 14:02:13.965936 2475426 core-cincinnati.go:477] Using proxy 10.18.7.5:3128 to request updates from https://api.openshift.com/api/upgrades_info/v1/graph?arch=amd64&arch=amd64&arch=amd64&channel=stable-4.15&channel=stable-4.15&channel=stable-4.16&id=00000000-0000-0000-0000-000000000000&id=00000000-0000-0000-0000-000000000000&id=00000000-0000-0000-0000-000000000000&version=4.15.20 I0715 14:02:14.136625 2475426 core-cincinnati.go:477] Using proxy 10.18.7.5:3128 to request updates from https://api.openshift.com/api/upgrades_info/v1/graph?arch=amd64&arch=amd64&arch=amd64&arch=amd64&channel=stable-4.15&channel=stable-4.15&channel=stable-4.16&channel=stable-4.16&id=00000000-0000-0000-0000-000000000000&id=00000000-0000-0000-0000-000000000000&id=00000000-0000-0000-0000-000000000000&id=00000000-0000-0000-0000-000000000000&version=4.15.20&version=4.16.1 W0715 14:02:14.301982 2475426 
core-cincinnati.go:282] No upgrade path for 4.15.20 in target channel stable-4.16 2024/07/15 14:02:14 [ERROR] : http request Get "https://mirror.openshift.com/pub/openshift-v4/signatures/openshift/release/sha256=c17d4489c1b283ee71c76dda559e66a546e16b208a57eb156ef38fb30098903a/signature-1": dial tcp: lookup mirror.openshift.com: no such host panic: runtime error: invalid memory address or nil pointer dereference [signal SIGSEGV: segmentation violation code=0x1 addr=0x10 pc=0x3d02c56] goroutine 1 [running]: github.com/openshift/oc-mirror/v2/internal/pkg/release.SignatureSchema.GenerateReleaseSignatures({{0x54bb930, 0xc000c80738}, {{{0x4c6edb1, 0x15}, {0xc00067ada0, 0x1c}}, {{{...}, {...}, {...}, {...}, ...}, ...}}, ...}, ...) /go/src/github.com/openshift/oc-mirror/vendor/github.com/openshift/oc-mirror/v2/internal/pkg/release/signature.go:96 +0x656 github.com/openshift/oc-mirror/v2/internal/pkg/release.(*CincinnatiSchema).GetReleaseReferenceImages(0xc000fdc000, {0x54aef28, 0x74cf1c0}) /go/src/github.com/openshift/oc-mirror/vendor/github.com/openshift/oc-mirror/v2/internal/pkg/release/cincinnati.go:208 +0x70a github.com/openshift/oc-mirror/v2/internal/pkg/release.(*LocalStorageCollector).ReleaseImageCollector(0xc000184e00, {0x54aef28, 0x74cf1c0}) /go/src/github.com/openshift/oc-mirror/vendor/github.com/openshift/oc-mirror/v2/internal/pkg/release/local_stored_collector.go:62 +0x47f github.com/openshift/oc-mirror/v2/internal/pkg/cli.(*ExecutorSchema).CollectAll(0xc000ace000, {0x54aef28, 0x74cf1c0}) /go/src/github.com/openshift/oc-mirror/vendor/github.com/openshift/oc-mirror/v2/internal/pkg/cli/executor.go:942 +0x115 github.com/openshift/oc-mirror/v2/internal/pkg/cli.(*ExecutorSchema).RunMirrorToMirror(0xc000ace000, 0xc0007a5800, {0xc000f3f038?, 0x17dcbb3?, 0x2000?}) /go/src/github.com/openshift/oc-mirror/vendor/github.com/openshift/oc-mirror/v2/internal/pkg/cli/executor.go:748 +0x73 github.com/openshift/oc-mirror/v2/internal/pkg/cli.(*ExecutorSchema).Run(0xc000ace000, 0xc0004f9730?, {0xc0004f9730?, 0x0?, 0x0?}) /go/src/github.com/openshift/oc-mirror/vendor/github.com/openshift/oc-mirror/v2/internal/pkg/cli/executor.go:443 +0x1b6 github.com/openshift/oc-mirror/v2/internal/pkg/cli.NewMirrorCmd.func1(0xc000ad0e00?, {0xc0004f9730, 0x1, 0x7}) /go/src/github.com/openshift/oc-mirror/vendor/github.com/openshift/oc-mirror/v2/internal/pkg/cli/executor.go:203 +0x32a github.com/spf13/cobra.(*Command).execute(0xc0007a5800, {0xc000052110, 0x7, 0x7}) /go/src/github.com/openshift/oc-mirror/vendor/github.com/spf13/cobra/command.go:987 +0xaa3 github.com/spf13/cobra.(*Command).ExecuteC(0xc0007a5800) /go/src/github.com/openshift/oc-mirror/vendor/github.com/spf13/cobra/command.go:1115 +0x3ff github.com/spf13/cobra.(*Command).Execute(0x72d7738?) /go/src/github.com/openshift/oc-mirror/vendor/github.com/spf13/cobra/command.go:1039 +0x13 main.main() /go/src/github.com/openshift/oc-mirror/cmd/oc-mirror/main.go:10 +0x18
Expected results:
Additional info:
Continue scale testing and performance improvements for ovn-kubernetes
Networking Definition of Planned
Epic Template descriptions and documentation
Manage Openshift Virtual Machines IP addresses from within the SDN solution provided by OVN-Kubernetes.
Customers want to offload IPAM from their custom solutions (e.g. custom DHCP server running on their cluster network) to SDN.
Additional information on each of the above items can be found here: Networking Definition of Planned
Investigate options to identify OVN scale problems that usually cause high CPU usage for ovn-controller and vswitchd.
https://docs.google.com/document/d/15PLDLKB9tGnbGYMhdHjlOsvlvXzT9TV7VfKmTJCwMRk/edit
Add support for the GCP N4 Machine Series to be used for Control Plane and Compute Nodes when deploying OpenShift on Google Cloud.
As a user, I want to deploy OpenShift on Google Cloud using the N4 Machine Series for the Control Plane and Compute Nodes so that I can take advantage of these new machine types.
OpenShift can be deployed in Google Cloud using the new N4 Machine Series for the Control Plane and Compute Nodes
Deployment considerations | List applicable specific needs (N/A = not applicable) |
Self-managed, managed, or both | both |
Classic (standalone cluster) | |
Hosted control planes | |
Multi node, Compact (three node), or Single node (SNO), or all | all |
Connected / Restricted Network | |
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) | |
Operator compatibility | |
Backport needed (list applicable versions) | |
UI need (e.g. OpenShift Console, dynamic plugin, OCM) | |
Other (please specify) |
Google has made the N4 Machine Series available in their cloud offering. These Machine Series use "hyperdisk-balanced" disks for the boot device, which are not currently supported.
The documentation will be updated to add the new disk type that needs to be supported as part of this enablement. The N4 Machine Series will also be added to the tested machine types for Google Cloud when deploying OpenShift.
The hyperdisk-balanced and pd-balanced disk types must be allowed as valid GCP disk types, as sketched below.
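A minimal sketch of what this enablement would let users express in install-config (machine type and disk type values are illustrative; the exact accepted values are governed by the installer validation this feature changes):

# Sketch only: an install-config fragment selecting an N4 machine type with a hyperdisk-balanced boot disk.
cat > controlplane-gcp-snippet.yaml <<'EOF'
controlPlane:
  name: master
  platform:
    gcp:
      type: n4-standard-4
      osDisk:
        diskType: hyperdisk-balanced
        diskSizeGB: 200
EOF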
Generally speaking, customers and partners should not be installing packages client-side, i.e. `rpm-ostree install $pkg` directly on the nodes. It's not officially supported outside of troubleshooting situations, but the documentation is not very explicit on this point and we have anecdotal data that customers and partners do in fact install packages directly on hosts.
Adding some telemetry to help understand how common this is among data-reporting clusters. Hopefully such data will help us understand how important it is to preserve this ability in the bootc-world. While it's not a pattern we want to encourage, we should be careful about dropping it without considering how to avoid breaking users' clusters in unexpected ways.
Understand what % of machines (or a proxy thereof) have locally layered packages which aren't CoreOS extensions.
This needs to be backported to 4.14 so we have a better sense of the fleet as it is.
4.12 might be useful as well, but is optional.
Why not simply block upgrades if there are locally layered packages?
That is indeed an option. This card is only about gathering data.
Some customers are known to layer packages locally but it's worse if the issue is a third party integration. In such a case, if the add-on breaks, the customer will call the 3rd party first because that's what appears to be broken. It may be a long, undelightful trip to get to a satisfying resolution. If they are blocked on upgrade due to that 3rd party integration they may not be able to upgrade the OCP y-version. That could be a lengthy delay.
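For context, a sketch of how locally layered packages show up on a node today (run from a node debug shell; field names are from rpm-ostree's JSON status output):

# List locally layered packages on the booted deployment; CoreOS extensions would still need to be filtered out separately.
rpm-ostree status --json | jq '.deployments[] | select(.booted) | {packages: .packages, requested: .["requested-packages"]}'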
Description copied from attached feature card: https://issues.redhat.com/browse/OCPSTRAT-1521
Generally speaking, customers and partners should not be installing packages client-side, i.e. `rpm-ostree install $pkg` directly on the nodes. It's not officially supported outside of troubleshooting situations, but the documentation is not very explicit on this point and we have anecdotal data that customers and partners do in fact install packages directly on hosts.
Adding some telemetry to help understand how common this is among data-reporting clusters. Hopefully such data will help us understand how important it is to preserve this ability in the bootc-world. While it's not a pattern we want to encourage, we should be careful about dropping it without considering how to avoid breaking users' clusters in unexpected ways.
Understand what % of machines (or a proxy thereof) have locally layered packages which aren't CoreOS extensions.
This needs to be backported to 4.14 so we have a better sense of the fleet as it is.
4.12 might be useful as well, but is optional.
Why not simply block upgrades if there are locally layered packages?
That is indeed an option. This card is only about gathering data.
Some customers are known to layer packages locally but it's worse if the issue is a third party integration. In such a case, if the add-on breaks, the customer will call the 3rd party first because that's what appears to be broken. It may be a long, undelightful trip to get to a satisfying resolution. If they are blocked on upgrade due to that 3rd party integration they may not be able to upgrade the OCP y-version. That could be a lengthy delay.
We aim to continue establishing a comprehensive testing strategy for Hosted Control Planes (HCP) that aligns with Red Hat’s support requirements and ensures customer satisfaction. This involves testing across various permutations, including providers, lifecycle, upgrades, and version compatibility. The testing must span management clusters, hubs, MCE, control planes, and nodepools, while coordinating across multiple QE teams to avoid duplication and inefficiencies. We aim to sustain an evolving testing matrix to meet product demands, especially as new versions and extended OCP lifecycles are introduced.
See: https://docs.google.com/spreadsheets/d/1j8TjMfyCfEt8OzTgvrAG3tuC6WMweBh5ElzWu6oAvUw/edit?gid=0#gid=0
The HCP architecture introduces decoupled control planes and worker nodes, significantly increasing the number of testing permutations. Ensuring these scenarios are tested is crucial to maintaining product quality and customer satisfaction, and to staying compliant as an OpenShift form factor.
This was attempted once before
https://github.com/openshift/release/pull/47599
Then reverted
https://github.com/openshift/release/pull/48326
ROSA HCP prod currently runs with the HO from main but 4.14 and 4.15 HCs; however, we do not test these combinations together in presubmit testing, which increases the chance of an escape.
Review, refine and harden the CAPI-based Installer implementation introduced in 4.16
From the implementation of the CAPI-based Installer started with OpenShift 4.16 there is some technical debt that needs to be reviewed and addressed to refine and harden this new installation architecture.
Review the existing implementation, refine as required, and harden where possible to remove the existing technical debt.
Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed. Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.
Deployment considerations | List applicable specific needs (N/A = not applicable) |
Self-managed, managed, or both | |
Classic (standalone cluster) | |
Hosted control planes | |
Multi node, Compact (three node), or Single node (SNO), or all | |
Connected / Restricted Network | |
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) | |
Operator compatibility | |
Backport needed (list applicable versions) | |
UI need (e.g. OpenShift Console, dynamic plugin, OCM) | |
Other (please specify) |
There should not be any user-facing documentation required for this work
OCP/Telco Definition of Done
Epic Template descriptions and documentation.
The OpenShift IPsec implementation will be enhanced for a growing set of enterprise use cases, and for larger scale deployments.
The OpenShift IPsec implementation was originally built for purpose-driven use cases from telco NEPs, but was also useful for a specific set of other customer use cases outside of that context. As customer adoption grew and it was adopted by some of the largest (by number of cluster nodes) deployments in the field, it became obvious that some redesign was necessary in order to continue to deliver enterprise-grade IPsec, for both East-West and North-South traffic, and for some of our most-demanding customer deployments.
Key enhancements include observability, and blocking traffic across paths where IPsec encryption is not functioning properly.
Requirement | Notes | isMvp? |
---|---|---|
CI - MUST be running successfully with test automation | This is a requirement for ALL features. | YES |
Release Technical Enablement | Provide necessary release enablement details and documents. | YES |
The OpenShift IPsec feature is fundamental to customer deployments for ensuring that all traffic between cluster nodes (East-West) and between cluster nodes and external-to-the-cluster entities that also are configured for IPsec (North-South) is encrypted by default. This encryption must scale to the largest of deployments.
Questions to be addressed:
In the current version, the router does not support loading secrets directly and uses the route resource to load the private key and certificates, which exposes these security artifacts.
Acceptance criteria :
Description of problem:
should reduce error message details for Not Found secret when edit/patch route with spec.tls.externalCertificate
Version-Release number of selected component (if applicable):
4.16.0-0.ci.test-2024-05-13-005506-ci-ln-05s0z32-latest
How reproducible:
100%
Steps to Reproduce:
1. enable TP feature "RouteExternalCertificate"
2. create pod, svc and route
3. oc -n hongli patch route myedge --type=merge --patch='{"spec":{"tls":{"externalCertificate":{"name": "newtls"}}}}'
Actual results:
the error message: The Route "myedge" is invalid: spec.tls.externalCertificate: Not found: errors.StatusError{ErrStatus:v1.Status{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ListMeta:v1.ListMeta{SelfLink:"", ResourceVersion:"", Continue:"", RemainingItemCount:(*int64)(nil)}, Status:"Failure", Message:"secrets \"newtls\" not found", Reason:"NotFound", Details:(*v1.StatusDetails)(0xc0077e25a0), Code:404}}
Expected results:
something like: `spec.tls.externalCertificate: Not found: "secrets \"newtls\" not found"`
Additional info:
discuss in thread of https://redhat-internal.slack.com/archives/C06EK9ZH3Q8/p1715243443244879
Update cluster-ingress-operator to bootstrap router with required featuregates
The cluster-ingress-operator will propagate the relevant Tech-Preview feature gate down to the router. This feature gate will be added as a command-line argument called ROUTER_EXTERNAL_CERTIFICATE to the router and will not be user configurable.
Refer:
Acceptance criteria
Bump Kubernetes in openshift-apiserver to 1.29.2 to unblock CFE-885.
Background:
We need to bump openshift/library-go to the latest commit in openshift/openshift-apiserver in order to vendor the Route validation changes done in https://github.com/openshift/library-go/pull/1625, but due to the kube version mismatch between library-go and openshift-apiserver, there are some dependency issues.
library-go is at 1.29, but openshift-apiserver is still using 1.28.
References:
As part of this EP, there is a use case where we need to trigger a re-sync of routes based on observed secret changes. The caveat here is that we are not using secret informers, but rather a new interface, aka the secret monitor (reasons are in the EP but don't pertain to this query). Since the router uses the RouterController and not specific controllers for each resource (routes, namespaces, endpoints, etc.), it doesn't have access to the lower-level components of a controller (e.g. the workqueue), and without this I don't really see a way to integrate the router with the secret monitor. Is re-designing the RouterController the way forward here? I'm open to suggestions on other ways to integrate here.
Router will take feature-gate info from CFE-987
Router will integrate secret-monitor done in CFE-866
Validations required on the router (a sketch of the objects involved follows this list):
- The secret created should be in the same namespace as that of the route.
- The secret created is of type `kubernetes.io/tls`.
- Verify certificate and key (PEM encode/decode)
- Verify private key matches public certificate
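A minimal sketch of the objects these validations apply to (namespace, names and hostname are illustrative; note that the router service account also needs RBAC permission to read the secret):

# The secret must live in the route's namespace and be of type kubernetes.io/tls.
oc -n demo create secret tls mytls --cert=tls.crt --key=tls.key

cat <<'EOF' | oc -n demo apply -f -
apiVersion: route.openshift.io/v1
kind: Route
metadata:
  name: myedge
spec:
  host: myedge.apps.example.com
  to:
    kind: Service
    name: myapp
  tls:
    termination: edge
    externalCertificate:
      name: mytls
EOF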
This enhances the EgressQoS CRD with status information and provides an implementation to update this field with relevant information while creating/updating EgressQoS.
This epic will encompass work that wasn't required for the MVP of tlsSecurityProfile for the MCO.
The ability in OpenShift to create trust and directly consume access tokens issued by external OIDC Authentication Providers using an authentication approach similar to upstream Kubernetes.
BYO Identity will help facilitate CLI only workflows and capabilities of the Authentication Provider (such as Keycloak, Dex, Azure AD) similar to upstream Kubernetes.
Ability in OpenShift to provide a direct, pluggable Authentication workflow such that the OpenShift/K8s API server can consume access tokens issued by external OIDC identity providers. Kubernetes provides this integration as described here. Customer/Users can then configure their IDPs to support the OIDC protocols and workflows they desire such as Client credential flow.
OpenShift OAuth server is still available as default option, with the ability to tune in the external OIDC provider as a Day-2 configuration.
Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.
High-level list of items that are out of scope. Initial completion during Refinement status.
Provide any additional context is needed to frame the feature. Initial completion during Refinement status.
Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.
Provide information that needs to be considered and planned so that documentation will meet customer needs. Initial completion during Refinement status.
Which other projects and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.
Now that the code is delivered in the console, we need to focus on adding unit tests to the console code base.
Code changes:
AC:
Feature Overview
This is a TechDebt and doesn't impact OpenShift Users.
As the autoscaler has become a key feature of OpenShift, there is the requirement to continue to expand its use, bringing all the features to all the cloud platforms and contributing to the community upstream. This feature is to track the initiatives associated with the Autoscaler in OpenShift.
Goals
Requirements
Requirement | Notes | isMvp? |
---|---|---|
vSphere autoscaling from zero | No | |
Upstream E2E testing | No | |
Upstream adapt scale from zero replicas | No | |
Out of Scope
n/a
Background and strategic fit
Autoscaling is a key benefit of the Machine API and should be made available on all providers
Assumptions
Customer Considerations
Documentation Considerations
Please note: the changes described by this epic will happen in OpenShift controllers, and as such there is no "upstream" relationship in the same sense as for the Kubernetes-based controllers.
As a developer, in order to deprecate the old annotations, we will need to carry both for at least one release cycle. Updating the CAO to apply the upstream annotations, and the CAS to accept both (preferring upstream), will allow me to properly deprecate the old annotations.
Background
As part of the effort to migrate to the upstream scale-from-zero annotations, we should add e2e tests which confirm the presence of the annotations. This can be an addition to our current scale-from-zero tests.
As a developer, in order to deprecate the old annotations, we will need to carry both for at least one release cycle. Updating the CAO to apply the upstream annotations, and the CAS to accept both (preferring upstream), will allow me to properly deprecate the old annotations.
During the process of making the CAO recognize the annotations, we need to enable it to modify the machineset to have the new annotation. Similarly, we want the autoscaler to recognize both sets of annotations in the short term while we switch.
Upstream K8s deprecated PodSecurityPolicy and replaced it with a new built-in admission controller that enforces the Pod Security Standards (see here for the motivations for deprecation). There is an OpenShift-specific dedicated pod admission system called Security Context Constraints. Our aim is to keep the Security Context Constraints pod admission system while also allowing users to have access to the Kubernetes Pod Security Admission.
With OpenShift 4.11, we turned on Pod Security Admission with global "privileged" enforcement. Additionally, we set the "restricted" profile for warnings and audit. This configuration made it possible for users to opt their namespaces in to Pod Security Admission with the per-namespace labels. We also introduced a new mechanism that automatically synchronizes the Pod Security Admission "warn" and "audit" labels.
With OpenShift 4.15, we intend to move the global configuration to enforce the "restricted" pod security profile globally. With this change, the label synchronization mechanism will also switch into a mode where it synchronizes the "enforce" Pod Security Admission label rather than the "audit" and "warn".
Epic Goal
Get Pod Security admission to be run in "restricted" mode globally by default alongside with SCC admission.
When creating a custom SCC, it is possible to assign a priority that is higher than existing SCCs. This means that any SA with access to all SCCs might use the higher priority custom SCC, and this might mutate a workload in an unexpected/unintended way.
To protect platform workloads from such an effect (which, combined with PSa, might result in rejecting the workload once we start enforcing the "restricted" profile) we must pin the required SCC to all workloads in platform namespaces (openshift-, kube-, default).
Each workload should pin the SCC with the least-privilege, except workloads in runlevel 0 namespaces that should pin the "privileged" SCC (SCC admission is not enabled on these namespaces, but we should pin an SCC for tracking purposes).
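For illustration, pinning is done with the openshift.io/required-scc annotation on the workload's pod template (the deployment name and namespace below are illustrative):

# Pin the least-privileged SCC on a platform workload's pod template.
oc -n openshift-example patch deployment example-operator --type=merge \
  -p '{"spec":{"template":{"metadata":{"annotations":{"openshift.io/required-scc":"restricted-v2"}}}}}'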
The following tables track progress.
# namespaces | 4.19 | 4.18 | 4.17 | 4.16 | 4.15 | 4.14 |
---|---|---|---|---|---|---|
monitored | 82 | 82 | 82 | 82 | 82 | 82 |
fix needed | 68 | 68 | 68 | 68 | 68 | 68 |
fixed | 39 | 39 | 35 | 32 | 39 | 1 |
remaining | 29 | 29 | 33 | 36 | 29 | 67 |
~ remaining non-runlevel | 8 | 8 | 12 | 15 | 8 | 46 |
~ remaining runlevel (low-prio) | 21 | 21 | 21 | 21 | 21 | 21 |
~ untested | 5 | 2 | 2 | 2 | 82 | 82 |
# | namespace | 4.19 | 4.18 | 4.17 | 4.16 | 4.15 | 4.14 |
---|---|---|---|---|---|---|---|
1 | oc debug node pods | #1763 | #1816 | #1818 | |||
2 | openshift-apiserver-operator | #573 | #581 | ||||
3 | openshift-authentication | #656 | #675 | ||||
4 | openshift-authentication-operator | #656 | #675 | ||||
5 | openshift-catalogd | #50 | #58 | ||||
6 | openshift-cloud-credential-operator | #681 | #736 | ||||
7 | openshift-cloud-network-config-controller | #2282 | #2490 | #2496 | |||
8 | openshift-cluster-csi-drivers | #118 #5310 #135 | #524 #131 #306 #265 #75 | #170 #459 | #484 | ||
9 | openshift-cluster-node-tuning-operator | #968 | #1117 | ||||
10 | openshift-cluster-olm-operator | #54 | n/a | n/a | |||
11 | openshift-cluster-samples-operator | #535 | #548 | ||||
12 | openshift-cluster-storage-operator | #516 | #459 #196 | #484 #211 | |||
13 | openshift-cluster-version | #1038 | #1068 | ||||
14 | openshift-config-operator | #410 | #420 | ||||
15 | openshift-console | #871 | #908 | #924 | |||
16 | openshift-console-operator | #871 | #908 | #924 | |||
17 | openshift-controller-manager | #336 | #361 | ||||
18 | openshift-controller-manager-operator | #336 | #361 | ||||
19 | openshift-e2e-loki | #56579 | #56579 | #56579 | #56579 | ||
20 | openshift-image-registry | #1008 | #1067 | ||||
21 | openshift-ingress | #1032 | |||||
22 | openshift-ingress-canary | #1031 | |||||
23 | openshift-ingress-operator | #1031 | |||||
24 | openshift-insights | #1033 | #1041 | #1049 | #915 | #967 | |
25 | openshift-kni-infra | #4504 | #4542 | #4539 | #4540 | ||
26 | openshift-kube-storage-version-migrator | #107 | #112 | ||||
27 | openshift-kube-storage-version-migrator-operator | #107 | #112 | ||||
28 | openshift-machine-api | #1308 #1317 | #1311 | #407 | #315 #282 #1220 #73 #50 #433 | #332 #326 #1288 #81 #57 #443 | |
29 | openshift-machine-config-operator | #4636 | #4219 | #4384 | #4393 | ||
30 | openshift-manila-csi-driver | #234 | #235 | #236 | |||
31 | openshift-marketplace | #578 | #561 | #570 | |||
32 | openshift-metallb-system | #238 | #240 | #241 | |||
33 | openshift-monitoring | #2298 #366 | #2498 | #2335 | #2420 | ||
34 | openshift-network-console | #2545 | |||||
35 | openshift-network-diagnostics | #2282 | #2490 | #2496 | |||
36 | openshift-network-node-identity | #2282 | #2490 | #2496 | |||
37 | openshift-nutanix-infra | #4504 | #4539 | #4540 | |||
38 | openshift-oauth-apiserver | #656 | #675 | ||||
39 | openshift-openstack-infra | #4504 | #4539 | #4540 | |||
40 | openshift-operator-controller | #100 | #120 | ||||
41 | openshift-operator-lifecycle-manager | #703 | #828 | ||||
42 | openshift-route-controller-manager | #336 | #361 | ||||
43 | openshift-service-ca | #235 | #243 | ||||
44 | openshift-service-ca-operator | #235 | #243 | ||||
45 | openshift-sriov-network-operator | #995 | #999 | #1003 | |||
46 | openshift-user-workload-monitoring | #2335 | #2420 | ||||
47 | openshift-vsphere-infra | #4504 | #4542 | #4539 | #4540 | ||
48 | (runlevel) kube-system | ||||||
49 | (runlevel) openshift-cloud-controller-manager | ||||||
50 | (runlevel) openshift-cloud-controller-manager-operator | ||||||
51 | (runlevel) openshift-cluster-api | ||||||
52 | (runlevel) openshift-cluster-machine-approver | ||||||
53 | (runlevel) openshift-dns | ||||||
54 | (runlevel) openshift-dns-operator | ||||||
55 | (runlevel) openshift-etcd | ||||||
56 | (runlevel) openshift-etcd-operator | ||||||
57 | (runlevel) openshift-kube-apiserver | ||||||
58 | (runlevel) openshift-kube-apiserver-operator | ||||||
59 | (runlevel) openshift-kube-controller-manager | ||||||
60 | (runlevel) openshift-kube-controller-manager-operator | ||||||
61 | (runlevel) openshift-kube-proxy | ||||||
62 | (runlevel) openshift-kube-scheduler | ||||||
63 | (runlevel) openshift-kube-scheduler-operator | ||||||
64 | (runlevel) openshift-multus | ||||||
65 | (runlevel) openshift-network-operator | ||||||
66 | (runlevel) openshift-ovn-kubernetes | ||||||
67 | (runlevel) openshift-sdn | ||||||
68 | (runlevel) openshift-storage |
Workloads running in platform namespaces (openshift-, kube-, default) must have the required-scc annotation defined in order to pin a specific SCC (see AUTH-482) for more details. This task adds a monitor test that analyzes all such workloads and tests the existence of the "openshift.io/required-scc" annotation.
OpenShift-prefixed namespaces should all define their required PSa labels (an example of the labels themselves is sketched after the table). Currently, the list of namespaces that are missing some or all PSa labels is the following:
namespace | in review | merged |
---|---|---|
openshift | ||
openshift-apiserver-operator | PR | |
openshift-cloud-credential-operator | PR | |
openshift-cloud-network-config-controller | PR | |
openshift-cluster-samples-operator | PR | |
openshift-cluster-storage-operator | PR | |
openshift-config | PR | |
openshift-config-managed | PR | |
openshift-config-operator | PR | |
openshift-console | PR | |
openshift-console-operator | PR | |
openshift-console-user-settings | PR | |
openshift-controller-manager | PR | |
openshift-controller-manager-operator | PR | |
openshift-dns-operator | PR | |
openshift-etcd-operator | PR | |
openshift-host-network | PR | |
openshift-ingress-canary | PR | |
openshift-ingress-operator | PR | |
openshift-insights | PR | |
openshift-kube-apiserver-operator | PR | |
openshift-kube-controller-manager-operator | PR | |
openshift-kube-scheduler-operator | PR | |
openshift-kube-storage-version-migrator | PR | |
openshift-kube-storage-version-migrator-operator | PR | |
openshift-network-diagnostics | PR | |
openshift-node | ||
openshift-operator-lifecycle-manager | PR | |
openshift-operators | PR | |
openshift-route-controller-manager | PR | |
openshift-service-ca | PR | |
openshift-service-ca-operator | PR | |
openshift-user-workload-monitoring | PR |
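As referenced above, a sketch of the PSa labels an openshift-* namespace manifest is expected to carry (the namespace name and profile values are illustrative; runlevel namespaces typically use "privileged"):

cat <<'EOF' | oc apply -f -
apiVersion: v1
kind: Namespace
metadata:
  name: openshift-example
  labels:
    pod-security.kubernetes.io/enforce: privileged
    pod-security.kubernetes.io/audit: privileged
    pod-security.kubernetes.io/warn: privileged
EOF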
Currently, the existing procedure for full rotation of all cluster CAs/certs/keys is not suitable for Hypershift. Several oc helper commands added for this flow are not functional in Hypershift. Therefore, a separate and tailored procedure is required specifically for Hypershift post its General Availability (GA) stage.
Most of the rotation procedure can be performed on the management side, given the decoupling between the control-plane and workers in the HyperShift architecture.
That said, it is important to ensure and assess the potential impacts on customers and guests during the rotation process, especially on how they affect SLOs and disruption budgets.
As a hypershift QE, I want to be able to:
so that I can achieve
This does not require a design proposal.
This does not require a feature gate.
As an engineer I would like to customize the self-signed certificates expiration used in the HCP components using an annotation over the HostedCluster object.
As an engineer I would like to customize the self-signed certificates rotation used in the HCP components using an annotation over the HostedCluster object.
TBD
Known affected components:
[2] https://github.com/openshift/cluster-kube-controller-manager-operator/blob/351c1193c7eebb49054a289a17fc25dfc0e0cd73/bindata/bootkube/manifests/secret-csr-signer-signer.yaml#L10
[3] https://github.com/openshift/cluster-etcd-operator/pull/1234
Note: This feature will be a TechPreview in 4.16 since the newly introduced API must graduate to v1.
Overarching Goal
Customers should be able to update and boot a cluster without a container registry in disconnected environments. This feature is for bare-metal disconnected clusters.
Background
Describes the work needed from the MCO team to take Pinned Image Sets to GA.
Description of problem:
MCO takes too much time to update the node count for an MCP after removing, from a node, the labels that the MCP uses to match nodes.
Version-Release number of selected component (if applicable):
How reproducible:
100%
Steps to Reproduce:
1. Remove `node-role.kubernetes.io/worker=` label from any worker node.
~~~
# oc label node worker-0.sharedocp4upi411ovn.lab.upshift.rdu2.redhat.com node-role.kubernetes.io/worker-
~~~
2. Check MCP worker for correct node count.
~~~
# oc get mcp worker
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
worker   rendered-worker-6916abae250ad092875791f8297c13e1   True      False      False      3              3                   3                     0                      5d7h
~~~
3. Check after 10-15 mins
~~~
# oc get mcp worker
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
worker   rendered-worker-6916abae250ad092875791f8297c13e1   True      False      False      2              2                   2                     0                      5d7h
~~~
Actual results:
It took 10-15 mins for MCP to detect node removal.
Expected results:
It should detect node removal as soon as the appropriate label goes missing from the node.
Additional info:
Similar to bug 1955300, but seen in a recent 4.11-to-4.11 update [1]:
: [bz-Machine Config Operator] clusteroperator/machine-config should not change condition/Available
Run #0: Failed expand_less 47m16s
1 unexpected clusteroperator state transitions during e2e test run
Feb 05 22:15:40.430 - 1044s E clusteroperator/machine-config condition/Available status/False reason/Cluster not available for [
{operator 4.11.0-0.nightly-2022-02-05-152519}]]
Feb 05 17:21:15.357 - 1087s E clusteroperator/machine-config condition/Available status/False reason/Cluster not available for [
]
Feb 05 09:31:14.667 - 1632s E clusteroperator/machine-config condition/Available status/False reason/Cluster not available for [
]
Feb 05 12:29:22.119 - 1060s E clusteroperator/machine-config condition/Available status/False reason/Cluster not available for 4.8.0-0.okd-2022-02-05-101655
Feb 05 17:43:45.938 - 1380s E clusteroperator/machine-config condition/Available status/False reason/Cluster not available for 4.6.54
Feb 06 02:35:34.300 - 1085s E clusteroperator/machine-config condition/Available status/False reason/Cluster not available for [
]
Feb 06 06:15:23.991 - 1135s E clusteroperator/machine-config condition/Available status/False reason/Cluster not available for [
]
Feb 05 09:25:22.083 - 1071s E clusteroperator/machine-config condition/Available status/False reason/Cluster not available for [
]
Breaking down by job name:
$ w3m -dump -cols 200 'https://search.ci.openshift.org?maxAge=24h&type=junit&context=0&search=s+E+clusteroperator/machine-config+condition/Available+status/False' | grep 'failures match' | sort
periodic-ci-openshift-multiarch-master-nightly-4.8-upgrade-from-nightly-4.7-ocp-remote-libvirt-s390x (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.10-e2e-aws-ovn-upgrade (all) - 70 runs, 47% failed, 6% of failures match = 3% impact
periodic-ci-openshift-release-master-ci-4.10-e2e-azure-ovn-upgrade (all) - 40 runs, 60% failed, 4% of failures match = 3% impact
periodic-ci-openshift-release-master-ci-4.10-e2e-gcp-upgrade (all) - 76 runs, 42% failed, 9% of failures match = 4% impact
periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-aws-ovn-upgrade (all) - 77 runs, 65% failed, 4% of failures match = 3% impact
periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-aws-ovn-upgrade-rollback (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-aws-upgrade-rollback (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-gcp-ovn-upgrade (all) - 41 runs, 61% failed, 12% of failures match = 7% impact
periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-ovirt-upgrade (all) - 4 runs, 75% failed, 33% of failures match = 25% impact
periodic-ci-openshift-release-master-ci-4.11-e2e-aws-ovn-upgrade (all) - 80 runs, 59% failed, 4% of failures match = 3% impact
periodic-ci-openshift-release-master-ci-4.11-e2e-gcp-upgrade (all) - 82 runs, 51% failed, 7% of failures match = 4% impact
periodic-ci-openshift-release-master-ci-4.11-upgrade-from-stable-4.10-e2e-aws-ovn-upgrade (all) - 88 runs, 55% failed, 8% of failures match = 5% impact
periodic-ci-openshift-release-master-ci-4.11-upgrade-from-stable-4.10-e2e-azure-upgrade (all) - 79 runs, 54% failed, 2% of failures match = 1% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-from-stable-4.7-from-stable-4.6-e2e-aws-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-nightly-4.10-e2e-aws-upgrade (all) - 45 runs, 44% failed, 25% of failures match = 11% impact
periodic-ci-openshift-release-master-nightly-4.11-e2e-aws-upgrade (all) - 33 runs, 45% failed, 13% of failures match = 6% impact
periodic-ci-openshift-release-master-nightly-4.11-e2e-metal-ipi-upgrade (all) - 3 runs, 100% failed, 33% of failures match = 33% impact
periodic-ci-openshift-release-master-okd-4.10-e2e-vsphere (all) - 6 runs, 100% failed, 17% of failures match = 17% impact
pull-ci-openshift-cluster-authentication-operator-master-e2e-agnostic-upgrade (all) - 8 runs, 100% failed, 13% of failures match = 13% impact
pull-ci-openshift-cluster-version-operator-master-e2e-agnostic-upgrade (all) - 6 runs, 83% failed, 20% of failures match = 17% impact
pull-ci-openshift-machine-config-operator-master-e2e-agnostic-upgrade (all) - 31 runs, 100% failed, 3% of failures match = 3% impact
release-openshift-okd-installer-e2e-aws-upgrade (all) - 8 runs, 75% failed, 17% of failures match = 13% impact
release-openshift-origin-installer-e2e-aws-upgrade-4.6-to-4.7-to-4.8-to-4.9-ci (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
Those impact percentages are just matches; this particular test-case is non-fatal.
The Available=False conditions also lack a 'reason', although they do contain a 'message', which is the same state we had back when I'd filed bug 1948088. Maybe we can pass through the Degraded reason around [4]?
Going back to the run in [1], the Degraded condition had a few minutes at RenderConfigFailed, while [4] only has a carve out for RequiredPools. And then the Degraded condition went back to False, but for reasons I don't understand we remained Available=False until 22:33, when the MCO declared its portion of the update complete:
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.11-e2e-aws-upgrade/1490071797725401088/artifacts/e2e-aws-upgrade/openshift-e2e-test/artifacts/e2e.log | grep 'clusteroperator/machine-config '
Feb 05 22:15:40.029 E clusteroperator/machine-config condition/Degraded status/True reason/RenderConfigFailed changed: Failed to resync 4.11.0-0.nightly-2022-02-05-152519 because: refusing to read images.json version "4.11.0-0.nightly-2022-02-05-211325", operator version "4.11.0-0.nightly-2022-02-05-152519"
Feb 05 22:15:40.029 - 147s E clusteroperator/machine-config condition/Degraded status/True reason/Failed to resync 4.11.0-0.nightly-2022-02-05-152519 because: refusing to read images.json version "4.11.0-0.nightly-2022-02-05-211325", operator version "4.11.0-0.nightly-2022-02-05-152519"
Feb 05 22:15:40.430 E clusteroperator/machine-config condition/Available status/False changed: Cluster not available for [
]
Feb 05 22:18:07.150 W clusteroperator/machine-config condition/Progressing status/True changed: Working towards 4.11.0-0.nightly-2022-02-05-211325
Feb 05 22:18:07.150 - 898s W clusteroperator/machine-config condition/Progressing status/True reason/Working towards 4.11.0-0.nightly-2022-02-05-211325
Feb 05 22:18:07.178 W clusteroperator/machine-config condition/Degraded status/False changed:
Feb 05 22:18:21.505 W clusteroperator/machine-config condition/Upgradeable status/False reason/PoolUpdating changed: One or more machine config pools are updating, please see `oc get mcp` for further details
Feb 05 22:33:04.574 W clusteroperator/machine-config condition/Available status/True changed: Cluster has deployed [
]
Feb 05 22:33:04.584 W clusteroperator/machine-config condition/Upgradeable status/True changed:
Feb 05 22:33:04.931 I clusteroperator/machine-config versions: operator 4.11.0-0.nightly-2022-02-05-152519 -> 4.11.0-0.nightly-2022-02-05-211325
Feb 05 22:33:05.531 W clusteroperator/machine-config condition/Progressing status/False changed: Cluster version is 4.11.0-0.nightly-2022-02-05-211325
[bz-Machine Config Operator] clusteroperator/machine-config should not change condition/Available
[bz-Machine Config Operator] clusteroperator/machine-config should not change condition/Degraded
[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.11-e2e-aws-upgrade/1490071797725401088
[2]: https://github.com/openshift/cluster-version-operator/blob/06ec265e3a3bf47b599e56aec038022edbe8b5bb/install/0000_90_cluster-version-operator_02_servicemonitor.yaml#L79-L87
[3]: https://github.com/openshift/cluster-version-operator/pull/643
[4]: https://github.com/openshift/machine-config-operator/blob/2add8f323f396a2063257fc283f8eed9038ea0cd/pkg/operator/status.go#L122-L126
Add OwnerReferences to the MCN ObjectMeta so that it gets garbage collected.
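A sketch of the intended shape, assuming the owning object is the corresponding Node (API version per the tech-preview CRD; the name and uid are illustrative):

# With an ownerReference to its Node, the MachineConfigNode is garbage collected when the Node is deleted.
cat > machineconfignode-sketch.yaml <<'EOF'
apiVersion: machineconfiguration.openshift.io/v1alpha1
kind: MachineConfigNode
metadata:
  name: worker-0
  ownerReferences:
  - apiVersion: v1
    kind: Node
    name: worker-0
    uid: 1234abcd-0000-0000-0000-000000000000
EOF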
Given enhancement - https://github.com/openshift/enhancements/pull/1481
Design Review Doc: https://docs.google.com/document/d/1-XuHN6_jvJMLULFwwAThfIcHqY32s32lU6m4bx7BiBE/edit
We want to allow the relevant APIs to pin images and make sure those don't get garbage collected.
Here is a summary of what will be required:
It is important that, when a new CRI-O pinned-image configuration is applied via machine config, the net result is a reload of the crio systemd unit rather than a node reboot.
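A sketch of such a pinned-image configuration (API group/version and field names per the enhancement proposal; the image reference is illustrative, and the association with a pool is omitted here). Applying or changing it should result in a crio reload rather than a node reboot:

cat > pinned-image-set-sketch.yaml <<'EOF'
apiVersion: machineconfiguration.openshift.io/v1alpha1
kind: PinnedImageSet
metadata:
  name: example-pinned-images
spec:
  pinnedImages:
  - name: quay.io/example/app@sha256:0000000000000000000000000000000000000000000000000000000000000000
EOF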
As an OpenShift admin who wants to make my OCP more secure and stable, I want to prevent anyone from scheduling their workloads on master nodes so that master nodes only run OCP management-related workloads.
Secure the OCP master nodes by preventing scheduling of customer workloads on master nodes.
Anyone applying toleration(s) in a pod spec can unintentionally tolerate master taints which protect master nodes from receiving application workload when master nodes are configured to repel application workload. An admission plugin needs to be configured to protect master nodes from this scenario. Besides the taint/toleration, users can also set spec.NodeName directly, which this plugin should also consider protecting master nodes against.
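For illustration, a pod spec like the following (a blanket toleration; names are illustrative) would also tolerate the control-plane taints and could land on master nodes unless an admission plugin intervenes:

cat > tolerate-everything-pod.yaml <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: example-app
spec:
  containers:
  - name: app
    image: quay.io/example/app:latest
  tolerations:
  - operator: Exists   # empty key + Exists tolerates every taint, including the master/control-plane taints
EOF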
Needed so we can provide this workflow to customers following the proposal at https://github.com/openshift/enhancements/pull/1583
Reference https://issues.redhat.com/browse/WRKLDS-1015
kube-scheduler pods are created by code residing in controllers provided by the kube-scheduler operator. So changes are required in that repo to add a toleration to the node-role.kubernetes.io/control-plane:NoExecute taint.
The operator itself does not run in the control-plane nodes, but if that change is necessary it would be here: https://github.com/openshift/cluster-kube-scheduler-operator/blob/4be4e433eec566df60d6d89f09a13b706e93f2a3/manifests/0000_25_kube-scheduler-operator_06_deployment.yaml#L12
Needed so we can provide this workflow to customers following the proposal at https://github.com/openshift/enhancements/pull/1583
Reference https://issues.redhat.com/browse/WRKLDS-1015
kube-controller-manager pods are created by code residing in controllers provided by the kube-controller-manager operator. So changes are required in that repo to add a toleration to the node-role.kubernetes.io/control-plane:NoExecute taint.
Migrate every occurrence of iptables in OpenShift to use nftables, instead.
Implement a full migration from iptables to nftables within a series of "normal" upgrades of OpenShift with the goal of not causing any more network disruption than would normally be required for an OpenShift upgrade. (Different components may migrate from iptables to nftables in different releases; no coordination is needed between unrelated components.)
Template:
Networking Definition of Planned
Epic Template descriptions and documentation
Additional information on each of the above items can be found here: Networking Definition of Planned
This feature focuses on the optimization of resource allocation and image management within NodePools. This will include enabling users to specify resource groups at NodePool creation, integrating external DNS support, ensuring Cluster API (CAPI) and other images are sourced from the payload, and utilizing Image Galleries for Azure VM creation.
Add additional details to the NodePool API for Azure. Replace SubnetName with SubnetID in the NodePool API.
Description of criteria:
Detail about what is specifically not being delivered in the story
This requires/does not require a design proposal.
This requires/does not require a feature gate.
As a (user persona), I want to be able to:
so that I can achieve
Description of criteria:
Detail about what is specifically not being delivered in the story
This requires/does not require a design proposal.
This requires/does not require a feature gate.
As a (user persona), I want to be able to:
so that I can achieve
Description of criteria:
Detail about what is specifically not being delivered in the story
This requires/does not require a design proposal.
This requires/does not require a feature gate.
As ARO/HCP provider, I want to be able to:
so that I can achieve
Description of criteria:
This requires a design proposal so OCM knows where to specify the resource group
This might require a feature gate in case we don't want it for self-managed.
As ARO/HCP provider, I want to be able to:
so that I can achieve
Description of criteria:
This requires a design proposal so OCM knows where to specify the resource group
This might require a feature gate in case we don't want it for self-managed.
As ARO/HCP provider, I want to be able to:
so that I can achieve
Description of criteria:
This requires a design proposal so OCM knows where to specify the resource group
This might require a feature gate in case we don't want it for self-managed.
The SubnetID and NsgID flags are ignored if a VNET ID is specified in the HyperShift CLI. Those flags should not be ignored if a value was passed to them in the CLI.
As a HyperShift, I want to use the service principal as the identity type for CAPZ, so that the warning message about using manual service principal in the capi-provider pod goes away.
Description of criteria:
This does not require a design proposal.
This does not require a feature gate.
As a user of HyperShift, I want to be able to create Azure VMs with ephemeral disks, so that I can achieve higher IOPS.
Note: AKS also defaults to using them.
Description of criteria:
N/A
This does not require a design proposal.
This does not require a feature gate.
The ability to create an Azure HCP has been broken since the k8s v1.29 bump.
As a (user persona), I want to be able to:
so that I can achieve
Description of criteria:
Detail about what is specifically not being delivered in the story
This requires/does not require a design proposal.
This requires/does not require a feature gate.
As a (user persona), I want to be able to:
so that I can achieve
Description of criteria:
Detail about what is specifically not being delivered in the story
This requires/does not require a design proposal.
This requires/does not require a feature gate.
As a user of ARO HCP, I want to be able to:
so that I can achieve
Description of criteria:
This requires/does not require a design proposal.
This requires/does not require a feature gate.
The current HCP implementation lets the Azure CCM pod create a load balancer (LB) and public IP address for guest cluster egress. The outbound SNAT is using default port allocation.
ARO HCP needs more control over how the LB is created and set up. Ideally, it would be nice to have CAPZ create and manage the LB. ARO HCP would also like the ability to utilize a LB with user-defined routing (UDR).
Utilizing a LB for guest cluster egress is the better option cost wise and availability wise compared to NAT Gateway. NAT Gateways are more expensive and also zonal.
Investigate the possibility of CAPZ creating and managing a LB for guest cluster egress.
As a hosted cluster deployer, I want to be able to:
so that I can achieve
Detail about what is specifically not being delivered in the story
This does not require a design proposal.
This does not require a feature gate.
As a (user persona), I want to be able to:
so that I can achieve
Description of criteria:
Detail about what is specifically not being delivered in the story
This requires/does not require a design proposal.
This requires/does not require a feature gate.
As a cluster admin trying to install the Shared Resource CSI driver, I want all images to come from the Red Hat container catalog so that I run fully supported and secured installations of the CSI driver.
Akin to BUILD-109.
We have decided to onboard Builds for OpenShift to Konflux for v1.1 (instead of CPaaS)
Legend
DPDK applications require dedicated CPUs, isolated from any preemption (other processes, kernel threads, interrupts), and this can be achieved with the “static” policy of the CPU manager: the container resources need to include an integer number of CPUs of equal value in “limits” and “requests”. For instance, to get six exclusive CPUs:
spec:
  containers:
  - name: CNF
    image: myCNF
    resources:
      limits:
        cpu: "6"
      requests:
        cpu: "6"
The six CPUs are dedicated to that container; however, non-trivial (i.e. real) DPDK applications do not use all of those CPUs, as there is always at least one CPU running a slow path: processing configuration, printing logs (among DPDK coding rules: no syscalls in PMD threads, or you are in trouble). Even the DPDK PMD drivers and core libraries include pthreads which are intended to sleep; they are infrastructure pthreads processing link change interrupts, for instance.
Can we envision going with two processes, one with the isolated cores, one with the slow-path ones, so we can have two containers? Unfortunately no: going with a multi-process design, where only dedicated pthreads would run on a process, is not an option, as DPDK multi-process is being deprecated upstream and has never picked up because it never properly worked. Fixing it and changing the DPDK architecture to systematically have two processes is absolutely not possible within a year, and would require all DPDK applications to be re-written. Knowing that the first and current multi-process implementation is a failure, nothing guarantees that a second one would be successful.
The slow-path CPUs only consume a fraction of a real CPU and can safely be run on the “shared” CPU pool of the CPU Manager; however, container specifications do not allow requesting two kinds of CPUs, for instance:
spec:
  containers:
  - name: CNF
    image: myCNF
    resources:
      limits:
        cpu_dedicated: "4"
        cpu_shared: "20m"
      requests:
        cpu_dedicated: "4"
        cpu_shared: "20m"
Why do we care about allocating one extra CPU per container?
Let’s take a realistic example, based on a real RAN CNF: running 6 containers with dedicated CPUs on a worker node, with a slow path requiring 0.1 CPUs, means that we waste 5 CPUs, meaning 3 physical cores. With real-life numbers:
Intel public CPU price per core is around 150 US$, not even taking into account the ecological aspect of the waste of (rare) materials and the electricity and cooling…
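A back-of-the-envelope version of that arithmetic, under the assumptions above (6 containers, roughly 0.1 CPU of slow-path work each, SMT-2 so two vCPUs per physical core, ~150 US$ per core):

# Rough waste estimate; all inputs are the assumptions stated in the text above.
awk 'BEGIN { c=6; slow=0.1; vcpus=c*(1-slow); cores=vcpus/2; printf "wasted vCPUs: %.1f, physical cores: %.1f, cost per node: ~$%.0f\n", vcpus, cores, cores*150 }'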
Requirement | Notes | isMvp? |
CI - MUST be running successfully with test automation | This is a requirement for ALL features. | YES |
Release Technical Enablement | Provide necessary release enablement details and documents. | YES |
This issue has been addressed lately by OpenStack.
Test_Description: Create a performance profile with shared CPUs enabled. Create 1 guaranteed (gu) pod which has the shared CPU device enabled. Verify the pod has the shared CPUs exported, then disable the feature functionality by modifying the PP, then verify that the cpuset of the pod doesn’t include the shared CPUs.
OCP/Telco Definition of Done
Epic Template descriptions and documentation.
N/A
TBD
But this should probably go under NTO as a separate lane.
Telecommunications providers continue to deploy OpenShift at the Far Edge. The acceleration of this adoption and the nature of existing Telecommunication infrastructure and processes drive the need to improve OpenShift provisioning speed at the Far Edge site and the simplicity of preparation and deployment of Far Edge clusters, at scale.
A list of specific needs or objectives that a Feature must deliver to satisfy the Feature. Some requirements will be flagged as MVP. If an MVP gets shifted, the feature shifts. If a non MVP requirement slips, it does not shift the feature.
requirement | Notes | isMvp? |
Telecommunications Service Provider Technicians will be rolling out OCP w/ a vDU configuration to new Far Edge sites, at scale. They will be working from a service depot where they will pre-install/pre-image a set of Far Edge servers to be deployed at a later date. When ready for deployment, a technician will take one of these generic-OCP servers to a Far Edge site, enter the site specific information, wait for confirmation that the vDU is in-service/online, and then move on to deploy another server to a different Far Edge site.
Retail employees in brick-and-mortar stores will install SNO servers and it needs to be as simple as possible. The servers will likely be shipped to the retail store, cabled and powered by a retail employee and the site-specific information needs to be provided to the system in the simplest way possible, ideally without any action from the retail employee.
Q: how challenging will it be to support multi-node clusters with this feature?
< What does the person writing code, testing, documenting need to know? >
< Are there assumptions being made regarding prerequisites and dependencies?>
< Are there assumptions about hardware, software or people resources?>
< Are there specific customer environments that need to be considered (such as working with existing h/w and software)?>
< Are there Upgrade considerations that customers need to account for or that the feature should address on behalf of the customer?>
<Does the Feature introduce data that could be gathered and used for Insights purposes?>
< What educational or reference material (docs) is required to support this product feature? For users/admins? Other functions (security officers, etc)? >
< What does success look like?>
< Does this feature have doc impact? Possible values are: New Content, Updates to existing content, Release Note, or No Doc Impact>
< If unsure and no Technical Writer is available, please contact Content Strategy. If yes, complete the following.>
< Which other products and versions in our portfolio does this feature impact?>
< What interoperability test scenarios should be factored by the layered product(s)?>
Question | Outcome |
As an operations person, I would like tooling to flash a generic seed image to disk so that it is bootable, and to run precaching as well. At scale.
The initial idea is to base the flow on Anaconda: https://issues.redhat.com/browse/RHEL-2250
https://redhat-internal.slack.com/archives/C05JHD9QYTC/p1702496826548079
When performing IBI, we do not currently wipe the installation disk beforehand.
As our experience with assisted-installer shows, this can create problems.
We should see how to reuse the disk-wiping logic that we already have in assisted: https://github.com/openshift/assisted-installer/blob/f4b8cfd85dfe8194aac489bacfb93ef8501fd290/src/installer/installer.go#L780
Feature Overview
Requirement | Notes | isMvp? |
---|---|---|
CI - MUST be running successfully with test automation | This is a requirement for ALL features. | YES |
Release Technical Enablement | Provide necessary release enablement details and documents. | YES |
This Section:
This Section: What does the person writing code, testing, documenting need to know? What context can be provided to frame this feature.
Questions to be addressed:
This section includes Jira cards that are linked to an Epic, but the Epic itself is not linked to any Feature. These epics were completed when this image was assembled
The goal of this epic is to guarantee that all pods running within the ACM (Advanced Cluster Management) cluster adhere to Kubernetes Security Context Constraints (SCC). The implementation of a comprehensive SCC compliance checking system will proactively maintain a secure and compliant environment, mitigating security risks.
Ensuring SCC compliance is critical for the security and stability of a Kubernetes cluster.
A customer who is responsible for overseeing the operations of their cluster faces the challenge of maintaining a secure and compliant Kubernetes environment. The organization relies on the ACM cluster to run a variety of critical workloads across multiple namespaces. Security and compliance are top priorities, especially considering the sensitive nature of the data and applications hosted in the cluster.
As an ACM admin, I want to add Kubernetes Security Context Constraints (SCC) V2 options to the component's resource YAML configuration to ensure that the Pod runs with the 'readonlyrootfilesystem' and 'privileged' settings, in order to enhance the security and functionality of our application.
In the resource config YAML, we need to add the following context:
securityContext:
  privileged: false
  readOnlyRootFilesystem: true
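For illustration, in a Deployment this would sit under the container's securityContext, roughly as below (the component name and image are placeholders, not the actual ACM resources):
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-acm-component      # placeholder name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: example-acm-component
  template:
    metadata:
      labels:
        app: example-acm-component
    spec:
      containers:
        - name: manager
          image: example-image:latest   # placeholder image
          securityContext:
            privileged: false
            readOnlyRootFilesystem: true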
Affected resources:
Description
This epic tracks install server errors which require investigation into whether the RP or ARO-Installer can be more resilient or return a User Error instead of a Server Error.
This lives under shift improvement as it will reduce the number of incidents we get from customers due to `Internal Server Error` being returned.
How to Create Stories Under Epic
During the weekly SLO/SLA/SLI meeting, we will examine install failures on the cluster installation failure dashboard. We will aggregate the top occurring items which show up as Server Errors, and each story underneath will be an investigation required to figure out root cause and how we can either prevent it, be resilient to failures, or return a User Error.
AS AN ARO SRE
I WANT either{}
SO THAT
1. Decorate the error in the ARO installer to return a more informative message as to why the SKU was not found.
2. Ensure that the OpenShift installer and the RP are validating the SKU in the same manner
3. If the validation is the same between the Installer and ARO installer, we have the option to remove the ARO installer validation step
Acceptance Criteria
Given: The RP and the ARO Installer validates the SKU in the same manner
When: The RP validates
Then: The ARO Installer does not
Given: The ARO Installer validates additional or improved information validation than the RP
When: The ARO Installer validation fails due to missing SKU (failed validation)
Then: Enhance the log to include the SKU that was not found, providing us with more information to troubleshoot
Breadcrumbs
OCP/Telco Definition of Done
Epic Template descriptions and documentation.
<--- Cut-n-Paste the entire contents of this description into your new Epic --->
On HyperShift there are two different clusters that we need to communicate with during the test run.
The existing test client only communicates with a single cluster.
We should add an additional client in order to support the HyperShift case.
More details in the design doc:
https://docs.google.com/document/d/1_NFonPShbi1kcybaH1NXJO4ZojC6q7ChklCxKKz6PIs/edit#heading=h.ocz97kq0reax
OCP/Telco Definition of Done
Epic Template descriptions and documentation.
<--- Cut-n-Paste the entire contents of this description into your new Epic --->
Some changes are needed in the NodePool controller to enable running the Performance Profile controller as part of Node Tuning Operator in the HyperShift hosted control planes to manage node tuning of hosted nodes.
PerformanceProfile objects will be created in the management cluster, embedded into a ConfigMap, and referenced in a field of the NodePool API; the NodePool controller will then handle these objects and create a ConfigMap in the hosted cluster namespace for the Performance Profile controller to read.
More information in the enhancement proposal
When PAO controller tries to create an event we get the following error:
E0402 09:41:10.578920 1 event.go:280] Server rejected event '&v1.Event{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"perfprofile-hostedcluster01.17c0ad00e8c2abc7", GenerateName:"", Namespace:"clusters-hostedcluster01", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:time.Date(1, time.January, 1, 0, 0, 0, 0, time.UTC), DeletionTimestamp:<nil>, DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Finalizers:[]string(nil), ManagedFields:[]v1.ManagedFieldsEntry(nil)}, InvolvedObject:v1.ObjectReference{Kind:"ConfigMap", Namespace:"clusters-hostedcluster01", Name:"perfprofile-hostedcluster01", UID:"e8d4883f-9944-4827-acba-21f757444a21", APIVersion:"v1", ResourceVersion:"21684945", FieldPath:""}, Reason:"Creation succeeded", Message:"[hypershift:perfprofile-hostedcluster01] Succeeded to create all components", Source:v1.EventSource{Component:"performance-profile-controller", Host:""}, FirstTimestamp:time.Date(2024, time.March, 27, 16, 47, 57, 817465799, time.Local), LastTimestamp:time.Date(2024, time.April, 2, 9, 41, 10, 577809390, time.Local), Count:14, Type:"Normal", EventTime:time.Date(1, time.January, 1, 0, 0, 0, 0, time.UTC), Series:(*v1.EventSeries)(nil), Action:"", Related:(*v1.ObjectReference)(nil), ReportingController:"", ReportingInstance:""}': 'events "perfprofile-hostedcluster01.17c0ad00e8c2abc7" is forbidden: User "system:serviceaccount:clusters-hostedcluster01:cluster-node-tuning-operator" cannot patch resource "events" in API group "" in the namespace "clusters-hostedcluster01"' (will not retry!)
This means we should add the "events" resource to the NTO operator's `Role`.
PerformanceProfile objects are handled in a different way on HyperShift, so modifications to the Performance Profile controller are needed to handle this.
Basically, the Performance Profile controller has to reconcile ConfigMaps that have PerformanceProfile objects embedded in them, create the different manifests as usual, and then hand them off to the hosted cluster using different methods.
More info in the enhancement proposal
Target: To have a feature equivalence in Hypershift and Standalone deployments
[UPDATE] This story is about setting up the scaffolding for the actual HyperShift implementation.
A separate PR was requested from the NTO team to document the rationale for setting intel_pstate by default. The idea of this story is to set the pstate mode depending on the underlying hardware generation. Per the recommendation from I///, newer hardware from the Ice Lake+ generation comes with HWP enabled, so the recommendation is to keep intel_pstate set to active.
This PR should also modify all the e2e tests where rendered output is verified, because setting pstate to active requires modifying those files.
This epic tracks any part of our codebase / solutions we implemented taking shortcuts.
Whenever a shortcut is taken, we should add a story here so that we do not forget to improve it in a safer and more maintainable way.
Maintainability and debuggability, and in general fighting technical debt, are critical to keeping velocity and ensuring overall high quality.
https://issues.redhat.com/browse/CNF-796
https://issues.redhat.com/browse/CNF-1479
https://issues.redhat.com/browse/CNF-2134
https://issues.redhat.com/browse/CNF-6745
https://issues.redhat.com/browse/CNF-8036
Set of API updates, best practices adoption and cleanup to the e2e suite:
This epic is to track any stories for 4.16 hypershift kubevirt development that do not fit cleanly within a larger effort.
Here are some examples of tasks that this "catch all" epic can capture
This is a followup for CNV-29003 in which we've enabled the NodePoolUpgrade test, but just for replace strategy, in which a node pool release image update triggers a machine rollout resulting in creation of new nodes with the updated RHCOS/kubelet version.
With the InPlace strategy, the machines are not recreated but are upgraded in place with only a soft reboot. We've noticed it doesn't work for KubeVirt as expected: the updated node gets stuck at SchedulingDisabled and nodepool.status.version is not updated.
This task is to find the root cause, fix it, and make the InPlace NodePool test pass on the presubmit CI.
Building CI Images has recently increased in duration, sometimes hitting 2 hours, which causes multiple problems:
More importantly, the build times have gotten to a point where OSBS is failing to build the installer due to timeouts, which is making it impossible for ART to deliver the product or critical fixes.
Create a new repo for the providers, build them into a container image and import the image in the installer container image.
Hopefully this will save resources and decrease build times for CI jobs in Installer PRs.
Rebase openshift/etcd to latest 3.5.11 upstream release.
Rebase openshift/etcd to latest 3.5.12 upstream release.
Rebase openshift/etcd to latest 3.5.13 upstream release.
This epic tracks the rebase of openshift/etcd to 3.5.14
This update includes the following changes:
https://github.com/etcd-io/etcd/blob/main/CHANGELOG/CHANGELOG-3.5.md#v3514-2024-05-29
Most notably this includes the experimental flag to stop serving requests on an etcd member that is undergoing defragmentation which would help address https://issues.redhat.com/browse/OCPSTRAT-319
Rebase openshift/etcd to latest 3.5.14 upstream release.
After Layering and Hypershift GAs in 4.12, we need to remove the code and builds that are no longer associated with mainline OpenShift.
This describes non-customer-facing work.
Removing code from the MCO where we may be referencing machine-os-content, such as the legacy OS update path.
Links to the effort:
The MCO and ART bits can be done ahead of time (except for the final config knob flip). Then we would merge the openshift/os, openshift/driver-toolkit PRs and do the ART config knob flip at roughly the same time.
This is not a user-visible change.
METAL-119 provides the upstream ironic functionality
They're hardcoding URLs :5050/v1/continue. The easiest way is probably to add this as an alias to both httpd and ironic-proxy.
As discussed in https://issues.redhat.com/browse/METAL-856 we need a way to stop assisted service vendoring CBO code as a way for detecting the Ironic URL.
Since we're moving toward a source-based model for the OCP ironic-image using downstream packages, we're starting to see more and more discrepancies with the OKD version, which is based on CS9 and upstream packages, causing conflicts and issues due to missing or too-old dependencies.
For this reason we'd like to split the lists of installed packages between OCP and OKD, as was done for the ironic-agent-image.
With the Cachito configuration in place, we can now start converting the Ironic packages and their dependencies to install from source.
In CI and local builds, if REMOTE_SOURCES and REMOTE_SOURCES_DIR are not defined, they assume the value of ".", which effectively enables the COPY . . command in the Dockerfile.
To avoid potential issues we need to find an alternative and default REMOTE_SOURCES to something safer.
To avoid mistakes like using hashes from the wrong releases, we need a way to test them.
Ideally this should also be automated in a CI job.
As was done for the ironic-image, we'd like to migrate the ironic-agent-image to a hybrid model, using RPMs for dependencies and source code for the ironic-python-agent installation.
In SaaS, allow users of assisted-installer UI or API, to install any published OCP version out of a supported list of x.y options.
This feature probably originates from our own team. It will enhance the current workflow we follow to allow users to selectively install versions in the assisted-installer SaaS.
Until now we had to be contacted by individual users to allow a specific version (usually one that had been replaced by us with a newer version). In such cases, we would add the version to the relevant configuration file.
It's not possible to quantify the relevant numbers here, because users might be missing certain versions in assisted and just give up on using it. In addition, it's not possible to know whether users intended to use a certain "old" version, or whether it was just an arbitrary decision.
Osher De Paz can we know how many requests we had for "out-of-supported-list"?
It's essential to include this feature in the UI. Otherwise, users will get very confused about the feature parity between API and UI.
Osher De Paz there will always be features that exist in the API and not in the UI. We usually show in the UI features that are more common and we know that users will be interacting with them.
Description of the problem:
In the current user interface, when you select a multiarch OpenShift version, you must specify a CPU architecture from those available in the manifest. After making this selection and clicking "next," the user interface attempts to register a cluster with the chosen version and CPU architecture. However, the assisted-service encounters an error and fails to find the requested release image.
How reproducible:
Always.
Steps to reproduce:
1. Choose multiarch OpenShift version
2. Press "next"
Actual results:
Expected results:
Successfully registering a cluster.
When enabling the Infrastructure Operator, automatically import the cluster and enable users to add nodes to the cluster itself via the Infrastructure Operator.
Yes, it's a new functionality that will need to be documented
Description of the problem:
Imported local cluster doesn't inherit proxy settings from AI.
How reproducible:
100%
Steps to reproduce:
1. Install 4.15 hub cluster with proxy enabled (IPv6)
2. Install 2.10 ACM
Actual results:
No proxy settings in `local-cluster` ACI.
Expected results:
ACI `local-cluster` inherits the proxy settings from the AI pod.
The external platform was created to allow cloud providers to supply their own integration components (cloud controller manager, etc.) without prior integration into openshift release artifacts. We need to support this new platform in assisted-installer in order to provide a user friendly way to enable such clusters, and to enable new-to-openshift cloud providers to quickly establish an installation process that is robust and will guide them toward success.
Please describe what conditions must be met in order to mark this feature as "done".
If the answer is "yes", please make sure to check the corresponding option.
Currently, the UI shows a badge saying dev preview for OCI. We need it to be tech preview for 4.15.
GA date for 4.15 is Feb 27th, but releasing this change a week or so earlier is also fine. We're flexible.
CC Ju Lim Montse Ortega Gallart Tomas Jelinek liat gamliel Adrien Gentil Marcos Entenza Garcia
deprecate
platform.type: oci
in favor of
platform {
  type: external
  external: {
    platformName: oci
    CloudControllerManager: External
  }
}
This is an existing feature that was added but didn't go through any testing or documentation; it only works internally when the assisted installer uses it itself, but is completely broken when users try to use it.
See MGMT-12435 , MGMT-15999 , MGMT-16000 and MGMT-16002
Please describe what this feature is going to do.
This feature already exists, MGMT-16000 covers it and gives an example of what it looks like
Please describe what conditions must be met in order to mark this feature as "done".
Yes
If the answer is "yes", please make sure to check the corresponding option.
Description of the problem:
This pattern: https://github.com/openshift/assisted-service/blob/efc20a5ea46368da70143455dc0300aebd79ce18/swagger.yaml#L6687
is too strict; it does not allow users to create patch manifests.
See related MGMT-15999 and MGMT-16000
How reproducible:
100%
Steps to reproduce:
1. Try to create a patch manifest:
curl http://localhost:8080/api/assisted-install/v2/clusters/${CLUSTER_ID}/manifests \
  --json '
  {
    "folder": "openshift",
    "file_name": "cluster-network-02-config.yml.patch_custom_ovn_v4internalsubnet",
    "content": "---\n- op: add\npath: /spec/defaultNetwork\nvalue:\novnKubernetesConfig:\nv4InternalSubnet: 100.65.0.0/16"
  } '
Actual results:
Error
{"code":605,"message":"file_name in body should match '^[^/]*\\.(yaml|yml|json)$'"}
Expected results:
No error
Allow using late binding via BMH, to support late binding via ZTP, part of RFE-4769
Please describe what conditions must be met in order to mark this feature as "done".
If the answer is "yes", please make sure to check the corresponding option.
In order to allow declarative association of a host with a cluster in the late-binding flow, annotation support will be added to the BMH.
The annotation is
bmac.agent-install.openshift.io/cluster-reference.
The annotation value is a JSON-encoded string containing a dictionary with the keys [name, namespace], which correspond to the name and namespace of the cluster deployment.
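For illustration, the annotation would be applied on the BareMetalHost roughly as in the sketch below (the host, namespace, and cluster names are placeholders):
apiVersion: metal3.io/v1alpha1
kind: BareMetalHost
metadata:
  name: worker-0                                   # placeholder host name
  namespace: my-infraenv-namespace                 # placeholder namespace
  annotations:
    bmac.agent-install.openshift.io/cluster-reference: '{"name": "my-cluster", "namespace": "my-cluster-namespace"}'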
Please describe what this feature is going to do.
Please describe what conditions must be met in order to mark this feature as "done".
If the answer is "yes", please make sure to check the corresponding option.
Manage the effort for adding jobs for release-ocm-2.10 on assisted installer
https://docs.google.com/document/d/1WXRr_-HZkVrwbXBFo4gGhHUDhSO4-VgOPHKod1RMKng
Merge order:
Update the BUNDLE_CHANNELS in the Makefile in assisted-service and run bundle generation.
There were several issues found in customer sites concerning connectivity checks:
Currently, when computing L3 connectivity groups, there must be symmetric connectivity between all nodes. This imposes the restriction that nodes cannot have networks that do not participate in cluster networking.
Instead, the L3 connectivity check will change to use connected addresses. A connected address is an address that can be reached from all other hosts. So if a host has a connected address, it is regarded as a host that belongs to the majority group (the validation).
I created a SNO cluster through the SaaS. A minor issue prevented one of the ClusterOperators from reporting "Available = true", and after one hour, the install process hit a timeout and was marked as failed. When I found the failed install, I was able to easily resolve the issue and get both the ClusterOperator and ClusterVersion to report "Available = true", but the SaaS no longer cared; as far as it was concerned, the installation failed and was permanently marked as such. It also would not give me the kubeadmin password, which is an important feature of the install experience and tricky to obtain otherwise. A hard timeout can cause more harm than good, especially when applied to a system (openshift) that continuously tries to get to desired state without an absolute concept of an operation succeeding or failing; desired state just hasn't been achieved yet. We should consider softening this timeout to be a warning that installation hasn't achieved completion as quickly as expected, without actively preventing a successful outcome.
Late binding scenario (kube-api):
The user tries to install a cluster with the late-binding feature enabled (deleting the cluster will return the hosts to the InfraEnv); the installation times out and the cluster goes into an error state; the user connects to the cluster and fixes the issue.
AI will still think that there is an error in the cluster. If the user tries to perform day-2 operations on a cluster in the error state, it will fail; the only option is to delete the cluster and create another one that is marked as installed, but that will cause the hosts to boot from the discovery ISO.
assisted-service should take care of timeouts.
Then, the API can be removed.
1. Proposed title of this feature request
A) Getting VPAs metrics in Openshift's Prometheus/kube-state-metrics
2. Who is the customer behind the request?
Account: name: xxx
TAM customer: no
CSM customer: no
Strategic: no
3. What is the nature and description of the request?
A) Feature request so that the data in the VPA is available in Prometheus via Kube State Metrics
4. Why does the customer need this? (List the business requirements here)
A) To view the data related to VPA
7. Is there already an existing RFE upstream or in Red Hat Bugzilla?
A) No
8. Does the customer have any specific time-line dependencies and which release would they like to target (i.e. RHEL5, RHEL6)?
A) No
9. Is the sales team involved in this request and do they have any additional input?
A) No
10. List any affected packages or components.
11. Would the customer be able to assist in testing this functionality if implemented?
A) Yes
Introduce VPA metrics using the Custom Resource State (CRS) feature in KSM, i.e., the --custom-resource-state-config flag, since there's a need to expose them after they were recently dropped from the set of native resources in KSM.
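A Custom Resource State configuration for VPA recommendations could take roughly the following shape. This is a sketch based on the generic CRS config format; the metric name, labels, and value path are illustrative assumptions, not the final design:
kind: CustomResourceStateMetrics
spec:
  resources:
    - groupVersionKind:
        group: autoscaling.k8s.io
        version: v1
        kind: VerticalPodAutoscaler
      labelsFromPath:
        namespace: [metadata, namespace]
        verticalpodautoscaler: [metadata, name]
      metrics:
        - name: verticalpodautoscaler_status_recommendation_target_cpu   # assumed metric name
          help: "Target CPU recommendation per container"
          each:
            type: Gauge
            gauge:
              path: [status, recommendation, containerRecommendations]
              labelsFromPath:
                container: [containerName]
              valueFrom: [target, cpu]
The config would then be passed to kube-state-metrics via the --custom-resource-state-config (inline) or --custom-resource-state-config-file flag.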
This has no link to a planning session, as this predates our Epic workflow definition.
Base Collection Profile enablement on dedicated feature gate introduced in: https://github.com/openshift/api/pull/1670. The effort will entail utilizing https://github.com/openshift/cluster-monitoring-operator/pull/2151 machinery.
Add a custom TP feature gate for Collection Profiles, instead of depending on "TechPreviewNoUpgrade".
Our documentation suggests creating an alert after configuring scrape sample limits.
That PrometheusRule object has two alerts configured within it [1]
`ApproachingEnforcedSamplesLimit`
`TargetDown`
The `TargetDown` alert is designed to fire after `ApproachingEnforcedSamplesLimit`, because the target is dropped once the enforced sample limit is reached.
The TargetDown alert is creating false positives: it fires for reasons other than pods in the namespace having reached their enforced sample limit (e.g. the metrics endpoint may be down).
User-defined monitoring should provide out-of-the-box metrics that will help with troubleshooting:
[2] - https://prometheus.io/docs/prometheus/latest/feature_flags/#extra-scrape-metrics
Update Prometheus user-workload to enable additional scrape metrics
https://prometheus.io/docs/prometheus/latest/feature_flags/#extra-scrape-metrics
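Once the extra-scrape-metrics feature flag is enabled, Prometheus exposes per-target metrics such as scrape_sample_limit, which would allow a more precise rule than the TargetDown workaround. A sketch of the kind of rule this enables follows; the namespace and thresholds are placeholders, and the expression assumes scrape_sample_limit is present:
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: approaching-enforced-samples-limit
  namespace: my-namespace                 # placeholder user namespace
spec:
  groups:
    - name: sample-limit.rules
      rules:
        - alert: ApproachingEnforcedSamplesLimit
          # the "> 0" guard restricts the rule to targets that actually have a limit set
          expr: (scrape_samples_post_metric_relabeling / (scrape_sample_limit > 0)) > 0.9
          for: 10m
          labels:
            severity: warning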
Make sure the statefulsets are not recreated after upgrade.
Recreating a statefulset should be an exception.
https://github.com/rhobs/handbook/pull/59/files
https://github.com/openshift/cluster-monitoring-operator/pull/1631
https://github.com/openshift/origin/pull/27031
https://github.com/openshift/cluster-monitoring-operator/pull/1580
https://github.com/openshift/cluster-monitoring-operator/pull/1552
Require read-only access to Alertmanager in developer view.
https://issues.redhat.com/browse/RFE-4125
Common user should not see alerts in UWM.
https://issues.redhat.com/browse/OCPBUGS-17850
Related ServiceAccounts.
Interconnection diagram in monitoring stack.
https://docs.google.com/drawings/d/16TOFOZZLuawXMQkWl3T9uV2cDT6btqcaAwtp51dtS9A/edit?usp=sharing
None.
In CMO, Alertmanager pods in the openshift-monitoring namespace have an Oauth-proxy on port 9095 for web access on all paths.
We are going to replace it with kube-rbac-proxy and constrain access to the /api/v2 paths.
The current behavior is to allow access to the Alertmanager web server for any user having "get" access to "namespace" resources. We do not have to keep the same logic, but we have to make sure no regression happens. We may need to use a stubbed custom resource to authorize both "post" and "get" HTTP requests from certain users.
There is a request to allow read-only access to alerts in Developer view. kube-rbac-proxy can facilitate this functionality.
In CMO, ThanosRuler pods have an Oauth-proxy on port 9091 for web access on all paths.
We are going to replace it with kube-rbac-proxy and constrain access to the /api/v1 paths.
The current behavior is to allow access to the ThanosRuler web server for any user having "get" access to "namespace" resources. We do not have to keep the same logic, but we have to make sure no regression happens. We may need to use a stub custom resource to authorize both "post" and "get" HTTP requests from certain users.
In CMO, Prometheus pods have an Oauth-proxy on port 9091 for web access on all paths.
We are going to replace it with kube-rbac-proxy and constrain access to the /api/v1 paths.
The current behavior is to allow access to the Prometheus web server for any user having "get" access to "namespace" resources. We do not have to keep the same logic, but we have to make sure no regression happens. We may need to use a stub custom resource to authorize both "post" and "get" HTTP requests from certain users.
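For reference, kube-rbac-proxy supports a static authorization configuration of roughly the following shape. Which resource (or stub custom resource) is used as the SubjectAccessReview target is exactly the design decision described above, so the attributes below are placeholders, not the chosen solution:
authorization:
  resourceAttributes:
    apiGroup: monitoring.coreos.com
    resource: prometheuses        # placeholder stub resource
    subresource: api
    namespace: openshift-monitoring
    name: k8s
A user would then need a role granting "get" on that resource/subresource to reach the proxied endpoints.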
The Insights component is using this port; figure out how to keep its access after replacing the OAuth proxy: https://github.com/openshift/insights-operator/blob/master/pkg/controller/const.go
Its service accounts "gather" and "operator" should use the Prometheus endpoint. https://redhat-internal.slack.com/archives/CLABA9CHY/p1701345127689009
$ oc get clusterrole,role -n openshift-monitoring -o name | egrep -i -e 'monitoring{-}(alertmanager|rules)-(edit|view)' -e cluster-monitoring-view
clusterrole.rbac.authorization.k8s.io/cluster-monitoring-view
clusterrole.rbac.authorization.k8s.io/monitoring-rules-edit
clusterrole.rbac.authorization.k8s.io/monitoring-rules-view
clusterrole.rbac.authorization.k8s.io/openshift-cluster-monitoring-view
role.rbac.authorization.k8s.io/monitoring-alertmanager-edit
we have cluster roles for viewing metrics and editing alerts/silences but not a local role for viewing all.
I have a customer requesting read only access to alerts in the developer console.
Create E2E Tests for Metrics Server in the CMO repository.
openshift/origin e2e tests don't explicitly exercise the metrics.k8s.io API group (they do verify the HPA functionality, though, which uses the metrics API underneath). We should add tests under test/extended/prometheus proving that the API works before/after upgrade.
It will help show that the prometheus-adapter -> metrics-server migration happens seamlessly when we turn the feature gate on by default.
Skip Prometheus Adapter tests once MetricsServer Featuregate is default
Allow OpenShift users to configure audit logs for Metrics Server
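If the configuration ends up mirroring the prometheus-adapter audit settings, it might be exposed through the cluster-monitoring-config ConfigMap roughly as below; the metricsServer stanza and field names are assumptions, not a confirmed API:
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    metricsServer:          # assumed stanza, mirroring the former k8sPrometheusAdapter one
      audit:
        profile: Request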
Enable collection of metrics-server audit logs in must-gather similar to prometheus-adapter https://github.com/openshift/must-gather/pull/266
Need to wait till MON-3748 is completed
Graduate MetricsServer FeatureGate to GA
In CMO we have static YAML assets that are added to the CMO container image and applied by CMO. We read them once from the file system and after that they are cached in memory.
Currently we use loose unmarshaling, i.e. fields that are superfluous (not part of the type being unmarshaled into) are silently dropped.
We should use strict unmarshaling in order to catch config mistakes like https://issues.redhat.com//browse/OCPBUGS-24630.
Additionally we could consider adding the static assets to our CMO binary via Go's embed package.
Discussed, probably not worth it.
sigs.k8s.io/yaml offers a k8s compatible UnmarshalStrict.
Go's yaml implementation does not work well with optional fields. See http://web.archive.org/web/20190603050330/http://ghodss.com/2014/the-right-way-to-handle-yaml-in-golang/ for more details
OCP/Telco Definition of Done
Epic Template descriptions and documentation.
<--- Cut-n-Paste the entire contents of this description into your new Epic --->
NOTE:
Epic Goal
Why is this important?
Scenarios
1. …
Acceptance Criteria
Dependencies (internal and external)
1. …
Previous Work (Optional):
1. …
Open questions::
1. …
Done Checklist
Currently, various service IDs are created by the create-infra command to be used by various cluster operators; among these, the storage operator's service ID is exposed in the guest cluster, which should not happen. We need to reduce the scope of all the service IDs so that they are specific to the infra resources created for that cluster alone.
Currently a cloud connection is used to connect the Power VS DHCP private network and the VPC network, in order to create a load balancer in the VPC for the ingress controller.
Cloud connections have a limitation of only 2 per zone.
With Transit Gateway, 5 instances can be created per zone, and it is also a lot faster than a cloud connection since it uses PER.
https://cloud.ibm.com/docs/power-iaas?topic=power-iaas-per
https://cloud.ibm.com/docs/transit-gateway?topic=transit-gateway-getting-started&interface=ui
Epic Goal
Why is this important?
Scenarios
1. …
Acceptance Criteria
Dependencies (internal and external)
1. …
Previous Work (Optional):
1. …
Open questions::
1. …
Done Checklist
Description of the problem:
On s390x with DHCP, the user should not set a static IP on IPL nodes.
This would lead to duplicate IP settings in the case of coreos-installer.
E.g.: --installer-args '[\"--append-karg\",\"ip=encbdd0:dhcp\",\"--append-karg\",\"rd.neednet=1\",\"--append-karg\",\"ip=10.14.6.3::10.14.6.1:255.255.255.0:master-0.boea3e06.lnxero1.boe:encbdd0:none\",\"--append-karg\",\"nameserver=10.14.6.1\",\"--append-karg\",\"ip=[fd00::3]::[fd00::1]:64::encbdd0:none\",\"--append-karg\",\"nameserver=[fd00::1]\"
How reproducible:
Create a cluster with DHCP and IPL node(s) using a parm line containing static IP settings (IPv4 and/or IPv6).
Steps to reproduce:
1.
2.
3.
Actual results:
The karg setting for ip using the same device appears twice (DHCP and static IP).
Expected results:
If DHCP is used, then no static IP should be present in the parm line.
In the case of an LPAR installation, it's possible that the command systemd-detect-virt --vm returns an exit code < 0, even though, when logged into the booted VM and executed from the command line, it returns "none" without an error code.
This leads to an abort of further processing in system_vendor.go, and Manufacturer will not be set to IBM/S390.
To fix this glitch, the code that determines the Manufacturer should be moved before the call to the systemd-detect-virt --vm command.
Epic Goal
Running doc to describe terminologies and concepts which are specific to Power VS - https://docs.google.com/document/d/1Kgezv21VsixDyYcbfvxZxKNwszRK6GYKBiTTpEUubqw/edit?usp=sharing
Epic Goal
`sao04`,`wdc07`,`eu-de-1`, and `eu-de-2` have all added PER capability. Add them to the installer
Epic Goal
On s390x there are some network configurations where MAC addresses are not static, which leads to issues when using the Assisted Installer, Agent-based Installer and HCP (see attached net conf).
For AI and HCP there is a possibility to patch kernel arguments, but when using the UI a separate manual step via the API is needed. This would be a bad user experience.
In addition, patching the kernel arguments for ABI is not possible.
To solve this, a config-override parameter needs to be added to the parm file by the user, and the IP settings will be automatically passed to the coreos installer regardless of what the user configures (DHCP or static IP using nmstate).
New configuration for parm file in cases marked red in the network configuration matrix.
The new parameter will be ai.ip_cfg_override and can be set to 0 or 1. If set to 1, the network configuration will be taken from the parm file regardless of what the user configures, for example via the Assisted Installer web UI. This issue affects the Agent-based Installer and HCP.
The parameters for IP configuration will be "ip" and "nameserver".
Example of the parm line looks like:
rd.neednet=1 ai.ip_cfg_override=1 console=ttysclp0 coreos.live.rootfs_url=http://172.23.236.156:8080/assisted-installer/rootfs.img ip=10.14.6.3::10.14.6.1:255.255.255.0:master-0.boea3e06.lnxero1.boe:encbdd0:none nameserver=10.14.6.1 ip=[fd00::3]::[fd00::1]:64::encbdd0:none nameserver=[fd00::1] zfcp.allow_lun_scan=0 rd.znet=qeth,0.0.bdd0,0.0.bdd1,0.0.bdd2,layer2=1 rd.zfcp=0.0.8002,0x500507630400d1e3,0x4000404600000000 random.trust_cpu=on rd.luks.options=discard ignition.firstboot ignition.platform.id=metal console=tty1 console=ttyS1,115200n8
Networking Definition of Planned
Epic Template descriptions and documentation
Bump OpenShift Router's HAProxy from 2.6 to 2.8.
As a cluster administrator, I want OpenShift to include a recent HAProxy version, so that I have the latest available performance and security fixes.
Additional information on each of the above items can be found here: Networking Definition of Planned
...
1. Perf & Scale
...
1. …
1. …
Push the bump and build the new HaProxy RPM after it has been Perf & Scale tested and reviewed.
Networking Definition of Planned
Epic Template descriptions and documentation
Bump the openshift-router container image to RHEL9 with help from the ART Team.
OpenShift is transitioning to RHEL9 base images and we must move to RHEL9 by 4.16. RHEL9 introduces OpenSSL 3.0. OpenSSL 3.0 has shown
Additional information on each of the above items can be found here: Networking Definition of Planned
...
...
1. …
The origin test "when FIPS is disabled the HAProxy router should serve routes when configured with a 1024-bit RSA key" is failing due to a hardcoded certificate using SHA1 as the hash algorithm.
The certificate needs to be updated to use SHA256, as SHA1 isn't supported anymore.
More details: https://github.com/openshift/router/pull/538#issuecomment-1831925290
Work with ART team to have https://github.com/openshift-eng/ocp-build-data/pull/3895 merged, verified that everything is working appropriately, and ensure we keep our openshift-router CI jobs up-to-date by using RHEL9 since we disabled automatic base image definition from ART.
OCP/Telco Definition of Done
Epic Template descriptions and documentation.
<--- Cut-n-Paste the entire contents of this description into your new Epic --->
Setting up the distgit (for production brew builds) depends on the ART team. This should be tackled early.
The PR to ocp-build-data should also be prioritised, as it blocks the PR to openshift/os. There is a separate CI Mirror used to run CI for openshift/os in order to merge, which can take a day to sync.
To enable the use of the new Azure and GCP image credential providers, we need to enable the DisableKubeletCloudCredentialProviders feature gate in all cluster profiles.
As a user I want kubelet to know how to authenticate with acr automatically so that I don't have to roll credentials every 12h
This functionality is being removed in tree from the kubelet, so we now need to provide it via a credential provider plugin
Before this can be completed, we will need to create and ship an rpm within RHCOS to provide the binary kubelet will exec.
See https://github.com/openshift/machine-config-operator/pull/4103/files for an example PR
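For context, kubelet execs such a plugin based on a CredentialProviderConfig; a minimal sketch for ACR might look like the following (the plugin binary name is an assumption, not taken from the linked PR):
apiVersion: kubelet.config.k8s.io/v1
kind: CredentialProviderConfig
providers:
  - name: acr-credential-provider          # assumed plugin binary name on the node
    matchImages:
      - "*.azurecr.io"
    defaultCacheDuration: "10m"
    apiVersion: credentialprovider.kubelet.k8s.io/v1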
As an openshift maintainer I want our build tooling to produce the gcr credential provider plugin so that it can be distributed in RHCOS to be used by kubelet.
We need to ship the gcr credential provider via an rpm, so it is available to kubelet when it first starts.
To ship an rpm we must create a .spec file that provides information on how the package should be built
A working example for AWS is provided in this PR: https://github.com/openshift/cloud-provider-aws/pull/63
As an openshift maintainer I want our build tooling to produce the acr credential provider plugin so that it can be distributed in RHCOS to be used by kubelet.
We need to ship the acr credential provider via an rpm, so it is available to kubelet when it first starts.
To ship an rpm we must create a .spec file that provides information on how the package should be built
A working example for AWS is provided in this PR: https://github.com/openshift/cloud-provider-aws/pull/63
As a user I want kubelet to know how to authenticate with gcr automatically so that I don't have to roll credentials every 12h
This functionality is being removed in tree from the kubelet, so we now need to provide it via a credential provider plugin
Before this can be completed, we will need to create and ship an rpm within RHCOS to provide the binary kubelet will exec.
See https://github.com/openshift/machine-config-operator/pull/4103/files for an example PR
OCP/Telco Definition of Done
Epic Template descriptions and documentation.
<--- Cut-n-Paste the entire contents of this description into your new Epic --->
Create separate SNO alert with templating engine to adjust alerting rules based on workload partitioning mechanism
Here is our overall tech debt backlog: ODC-6711
See included tickets, we want to clean up in 4.16.
Keep the OWNERS file from ODC up to date and help the ci bot to ask our active team members for reviews.
As a developer, I want to go through the e2e tests and remove PatternFly classnames being used as selectors so that future PatternFly upgrades do not break integration tests.
AC:
As a developer, I want to go through the e2e tests and remove PatternFly classnames being used as selectors so that future PatternFly upgrades do not break integration tests.
AC:
As a developer, I want to go through the e2e tests and remove PatternFly classnames being used as selectors so that future PatternFly upgrades do not break integration tests.
AC:
As a developer, I want to go through the e2e tests and remove PatternFly classnames being used as selectors so that future PatternFly upgrades do not break integration tests.
AC:
As a developer, I want to go through the e2e tests and remove PatternFly classnames being used as selectors so that future PatternFly upgrades do not break integration tests.
AC:
As a developer, I want to go through the e2e tests and remove PatternFly classnames being used as selectors so that future PatternFly upgrades do not break integration tests.
AC:
As a developer, I want to go through the e2e tests and remove PatternFly classnames being used as selectors so that future PatternFly upgrades do not break integration tests.
AC:
Increase and improve the automation health of ODC
Improve automation coverage for ODC
Automating the Recently Used Resources Section in the Search Resource page Dropdown
As a user,
We released 4.15 with broken quick starts, see OCPBUGS-29992
We must ensure that our main features are covered by e2e tests.
The Service Binding Operator will reach its end of life, and we want customers to be informed about that change. The first step for this is showing some deprecation messages if customers use SBOs.
This only affects Service Bindings in our UI and not the general Operator based bindings.
Description of problem:
Since the Service Binding Operator is deprecated and will be removed with the OpenShift Container Platform 4.16 release, users should be notified about this in the console on the pages below:
1. Add page / flow
2. Creating an SB YAML
3. SB list page
4. Topology when creating an SB / binding a component
5. Topology if we find any SB in the current namespace?
Note:
Confirm the warning text with UX.
Additional info:
https://docs.openshift.com/container-platform/4.15/release_notes/ocp-4-15-release-notes.html#ocp-4-15-deprecation-sbo https://docs.google.com/document/d/1_L05xy7ZSK2xCLiqrrJBPwDoahmi78Ox6mw-l-IIT9M/edit#heading=h.jcsa7gh4tupt
We are undertaking an effort to build OKD on top of CentOS Stream.
The current status of containers can be seen at https://docs.google.com/spreadsheets/d/1s3PWv9ukytTuAqeb2Y6eXg0nbmW46q5NCmo_73Cd_lU/edit#gid=595447113
Some of the work done to produce a build for arm64 and to produce custom builds in https://github.com/okd-project/okd-centos9-rebuild required Dockerfiles and similar assets from the cluster operators repositories to be forked.
This story is to track the eventual backport that should be achieved soon to get rid of most of the forks in the repo by merging the "upstream".
As an engineer, I want to know how OLS is being used, so that I can know what to focus on and improve it.
As a OLS product owner, I want metric data about OLS usage to be reported to RH's telemetry system so I can see how users are making use of OLS and potentially identify issues
How to add a metric to telemetry:
https://rhobs-handbook.netlify.app/products/openshiftmonitoring/telemetry.md/
Notes:
When the console dashboards plugin is present, the metrics tab does not respect a custom datasource.
console dashboards plugin: 0.1.0
Openshift: 4.16
Currently the alert details page supports adding links through the plugin extension; we also need buttons to add actions that do not redirect but instead trigger code, for example triggering the troubleshooting panel.
The monitoring plugin renders buttons and links provided by the actions plugin extension
Provide the ability to export data in a CSV format from the various Observability pages in the OpenShift console.
Initially this will include exporting data from any tables that we use.
Product Requirements:
A user will have the ability to click a button which will download the data in the current table in a CSV format. The user will then be able to take this downloaded file and use it to import the data into their system.
Provide the ability of users to export csv data from the dashboard and metrics tables
Based on the results of the spike as detailed in this document: https://docs.google.com/document/d/1EPEYd94NYS_LbFRFT4d1Mj9-ebibNYJYu_kHzwIwxaA/edit?usp=sharing we need to implement one of the suggested solutions for dashboards.
This story will focus on fixing timeout issues for label selectors and line charts. This will be done by breaking the long queries down into queries that can be answered within the timeout period.
Suggested steps are:
OCP/Telco Definition of Done
Epic Template descriptions and documentation.
<--- Cut-n-Paste the entire contents of this description into your new Epic --->
OCP/Telco Definition of Done
Epic Template descriptions and documentation.
<--- Cut-n-Paste the entire contents of this description into your new Epic --->
User Story:
As a software developer, I need to be better informed about the process of contributing to NTO, so that people do not mistakenly contribute to https://github.com/openshift/cluster-node-tuning-operator/tree/master/assets/tuned instead of to
https://github.com/redhat-performance/tuned
I also want to be able to create custom NTO images just by following HACKING.md file, which needs to be updated.
Acceptance criteria:
Additional information
There will likely be changes necessary in the NTO Makefile too.
Remark that:
Slack thread:
https://redhat-internal.slack.com/archives/CCX9DB894/p1696515395395939
Acceptance criteria
Implement shared selector-based functionality for port groups and address sets, to be re-used by network policy, ANP, egress firewall, multicast and possibly other features in the future.
This is intended to
This is required to use shared port groups in the future
We need to bump the Kubernetes version and run a library sync for OCP 4.13. Two stories will be created, one for each activity.
As a Sample Operator Developer, I would like to run the library sync process, so that the new libraries can be pushed to OCP 4.16.
This is a runbook we need to execute on every release of OpenShift
NA
NA
NA
Follow instructions here: https://source.redhat.com/groups/public/appservices/wiki/cluster_samples_operator_release_activities
Library Repo
Library sync PR is merged in master
Dependencies identified
Blockers noted and expected delivery timelines set
Design is implementable
Acceptance criteria agreed upon
Story estimated
Unknown
Verified
Unsatisfied
Traditionally we did these updates as bugfixes, because we did them after the feature freeze (FF).
Update the driver to the latest upstream release. Notify QE and docs with any new features and important bugfixes that need testing or documentation.
(Using separate cards for each driver because these updates can be more complicated)
Update the driver to the latest upstream release. Notify QE and docs with any new features and important bugfixes that need testing or documentation.
(Using separate cards for each driver because these updates can be more complicated)
Update the driver to the latest upstream release. Notify QE and docs with any new features and important bugfixes that need testing or documentation.
(Using separate cards for each driver because these updates can be more complicated)
We get too many false positive bugs like https://issues.redhat.com/browse/OCPBUGS-25333 from SAST scans, especially from the vendor directory. Add a .snyk file like https://github.com/openshift/oc/blob/master/.snyk to each repo to ignore them.
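A minimal .snyk policy for this purpose could look roughly like the following (a sketch; check the linked oc file for the exact content used there):
# Ignore SAST findings coming from vendored dependencies
exclude:
  global:
    - vendor/**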
After https://github.com/openshift/ibm-vpc-block-csi-driver-operator/pull/89, ibm-vpc-node-label-updater is no longer deployed.
Once we're certain there is no risk of reverting to an older version of the driver/operator that requires the node labeler, we should remove it from the payload.
Update all OCP and kubernetes libraries in storage operators to the appropriate version for OCP release.
This includes (but is not limited to):
Operators:
EOL, do not upgrade:
Update all CSI sidecars to the latest upstream release from https://github.com/orgs/kubernetes-csi/repositories
Corresponding downstream repos have `csi-` prefix, e.g. github.com/openshift/csi-external-attacher.
This includes update of VolumeSnapshot CRDs in cluster-csi-snapshot-controller- operator assets and client API in go.mod. I.e. copy all snapshot CRDs from upstream to the operator assets + go get -u github.com/kubernetes-csi/external-snapshotter/client/v6 in the operator repo.
Update the driver to the latest upstream release. Notify QE and docs with any new features and important bugfixes that need testing or documentation.
This includes ibm-vpc-node-label-updater!
(Using separate cards for each driver because these updates can be more complicated)
Update the driver to the latest upstream release. Notify QE and docs with any new features and important bugfixes that need testing or documentation.
(Using separate cards for each driver because these updates can be more complicated)
Update the driver to the latest upstream release. Notify QE and docs with any new features and important bugfixes that need testing or documentation.
(Using separate cards for each driver because these updates can be more complicated)
Skip default monitor tests for cloud providers, such as:
"[Jira:"Test Framework"] monitor test external-gcp-cloud-service-availability setup"
"[Jira:"Test Framework"] monitor test external-azure-cloud-service-availability setup"
"[Jira:"Test Framework"] monitor test external-aws-cloud-service-availability setup"
"[Jira:"Test Framework"] monitor test external-gcp-cloud-service-availability collection"
"[Jira:"Test Framework"] monitor test external-azure-cloud-service-availability collection"
"[Jira:"Test Framework"] monitor test external-aws-cloud-service-availability collection"
Automatically skip tests tagged with OCPFeatureGate in origin. This API is not available in MicroShift and therefore any test using it should be skipped.
Filtering by test name will avoid having to update each of the tests.
MicroShift does not support metrics yet. This monitor test is failing because the namespace (openshift-monitoring) and the possible deployments (prometheus-adapter and metrics-server) do not exist.
Skip it until a metrics feature is implemented.
The termination message policy requires access to ClusterVersion resources to determine past versions to see if it should skip the test entirely. Since this option does not exist in MicroShift, never skip it by ignoring past versions (and checking ClusterVersion altogether).
Move all patches in https://github.com/openshift/microshift/tree/main/origin/patches except the router image patch; that one needs more work.
1. Proposed title of this feature request
sosreport(sos rpm) command to be included in the tools imagestream.
2. What is the nature and description of the request?
There is an imagestream called tools which is used for oc debug node/<node>. There are several tools to debug the node, but sos is not one of them. The sos report command is the most necessary command for getting support from Red Hat, and it should be included for debugging the node.
3. Why does the customer need this? (List the business requirements here)
Telco operators build their system in disconnected env. Therefore, it is hard to get additional rpms or images for their env. If the sos command is included in OCP platform with tools imagestream, it would be very useful for them. There is toolbox image for sosreport but it is not included in openshift release manifests. Some telco operator does not allow to bring additional packages or images other than OpenShift platform itself.
4. List any affected packages or components.
tools imagestream in Openshift release manifests
1. Proposed title of this feature request
sosreport(sos rpm) command to be included in the tools imagestream.
2. What is the nature and description of the request?
There is an imagestream called tools which is used for oc debug node/<node>. There are several tools to debug the node, but sos is not one of them. The sos report command is the most necessary command for getting support from Red Hat, and it should be included for debugging the node.
3. Why does the customer need this? (List the business requirements here)
Telco operators build their system in disconnected env. Therefore, it is hard to get additional rpms or images for their env. If the sos command is included in OCP platform with tools imagestream, it would be very useful for them. There is toolbox image for sosreport but it is not included in openshift release manifests. Some telco operator does not allow to bring additional packages or images other than OpenShift platform itself.
4. List any affected packages or components.
tools imagestream in Openshift release manifests
Epic Goal*
What is our purpose in implementing this? What new capability will be available to customers?
Due to FIPS compatibility reasons, oc compiled on a specific RHEL version does not work on other versions (i.e. oc compiled on RHEL8 does not work on RHEL9). Therefore, as customers migrate to newer RHEL versions, oc's base image should be RHEL9.
This work covers changing the base image of tools, cli, deployer, cli-artifacts.
Why is this important? (mandatory)
What are the benefits to the customer or Red Hat? Does it improve security, performance, supportability, etc? Why is work a priority?
This improves the supportability and default oc will work on RHEL9 as default regardless of FIPS enabled or not on that host.
Scenarios (mandatory)
Provide details for user scenarios including actions to be performed, platform specifications, and user personas.
Dependencies (internal and external) (mandatory)
What items must be delivered by other teams/groups to enable delivery of this epic.
ART has to perform some updates in ocp-build-data repository.
Contributing Teams(and contacts) (mandatory)
Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.
Acceptance Criteria (optional)
Provide some (testable) examples of how we will know if we have achieved the epic goal.
Drawbacks or Risk (optional)
Reasons we should consider NOT doing this such as: limited audience for the feature, feature will be superseded by other work that is planned, resulting feature will introduce substantial administrative complexity or user confusion, etc.
Done - Checklist (mandatory)
The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.
The base image of the cli image that is used by various components needs to be switched to RHEL9.
https://github.com/openshift/must-gather/pull/420
https://github.com/openshift/must-gather/pull/418
https://github.com/openshift/release/pull/51578
These 3 PRs merged to force must-gather to explicitly use `oc.rhel8`.
When must-gather's base image is switched to RHEL9, these PRs should be reverted.
Similar attempt https://github.com/openshift/must-gather/pull/402
This section includes Jira cards that are linked to an Epic, but the Epic itself is not linked to any Feature. These epics were not completed when this image was assembled
We need to find a good upstream home for the Shared Resource CSI Driver. Candidate organizations/projects to host it:
This is upstream-oriented work aimed at convincing a community to adopt the CSI driver.
This epic is to track any stories for hypershift kubevirt development that do not fit cleanly within a larger effort.
Here are some examples of tasks that this "catch all" epic can capture
we need to document that the pod network cidr for the tenant cluster cannot overlap with the infra cidr range.
https://hypershift-docs.netlify.app/how-to/kubevirt/configuring-network/
An epic we can duplicate for each release to ensure we have a place to catch things we ought to be doing regularly but can tend to fall by the wayside.
Console frontend uses Yarn 1.22.15 from Sep 2021. We should update the version to v1 latest.
There is currently no automated way to check for consistency of PatternFly package version resolutions. We should add a script (executed as part of the build) to ensure that each PatternFly package has exactly one version resolution.
Recent regressions in the EventStream component have exposed some error-prone patterns. We should refactor the component to remove these patterns. This will harden the component and make it easier to maintain.
AC:
AC:
After upgrading Cypress to v13, switching from the Chrome browser to Electron, and publishing the deprecation of the Chrome browser, we should remove Chrome from the builder image:
https://github.com/openshift/console/blob/master/Dockerfile.builder#L59
Find instances of chrome: https://github.com/search?q=org%3Aopenshift+--browser+%24%7BBRIDGE_E2E_BROWSER_NAME%3A%3Dchrome%7D&type=code
AC:
Console-operator currently uses mostly listers for fetching data. There are still controllers fetching live data using clients; in those cases we should switch to using listers.
AC:
A hasty bug fix resulted in another bug. We should add integration tests that utilize cy.intercept in order to prevent such bugs from happening in the future.
AC:
Convert legacy ListPage to dynamic-plugin-sdk ListPage- components in Console VolumeSnapshots Storage
The legacy ListPage components are located in /frontend/packages/console-app/src/components/
Justification: A recent replacement of the legacy ListPage to dynamic-plugin-sdk ListPage- components in VolumeSnapshotPVC tab component led to the duplication of the RowFilter logic in snapshotStatusFilters function due to incompatible type in RowFilter. Also, converting to dynamic-plugin-sdk ListPage- components would make the code more readable and simplify debugging of VolumeSnapshot components.
A.C.
Find and replace legacy ListPage volume-snapshot pages with dynamic-plugin-sdk ListPage- components
As part of the spike to determine outdated plugins, the file-loader dev dependency is out of date and needs to be updated.
Acceptance criteria:
Once PR #12983 gets merged, the Console application itself will use PatternFly v5 while also providing the following PatternFly v4 packages as shared modules to existing dynamic plugins:
The above-mentioned PR will allow dynamic plugins to bring in their own PatternFly code if the relevant Console-provided shared modules (v4) are not sufficient.
Let's say we have a dynamic plugin that uses PatternFly v5 - since Console only provides v4 implementations of the above shared modules, the plugin would bring in the whole v5 package(s) listed above.
There are two main issues here:
1. CSS consistency
2. Optimal code sharing
This story should address both issues mentioned above.
Acceptance criteria:
#13679 - how to setup devel env. with Console and plugin servers running locally
#13521 - improve docs on shared modules (CONSOLE-3328)
#13586 - ensure that Console vs. SDK compat table is up-to-date
#13637 - how to migrate from PatternFly 4 to 5 (CONSOLE-3908)
#13637 - using correct MIME types when serving plugin assets - use "text/javascript" for all JS assets to ensure that "X-Content-Type-Options: nosniff" security header doesn't cause JS scripts to be blocked in the browser
#13637 - disable caching of plugin manifest JSON resource
The Lightspeed team wants to integrate with the console via events; Ben Parees is leading the effort. For that we need to update the Events page so the Lightspeed dynamic plugin can add a button and pass it a callback that explains the given event.
AC:
The observability team is doing a similar integration with Alerts:
A working group has begun migrating common components to @patternfly/react-component-groups. We need to determine whether the recently migrated CloseButton component can easily be swapped for the existing one and, if not, identify any issues.
AC:
This is a clone of issue OCPBUGS-44062. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-38713. The following is the description of the original issue:
—
: [sig-network-edge] DNS should answer A and AAAA queries for a dual-stack service [apigroup:config.openshift.io] [Suite:openshift/conformance/parallel]
failed log
[sig-network-edge] DNS should answer A and AAAA queries for a dual-stack service [apigroup:config.openshift.io] [Suite:openshift/conformance/parallel] github.com/openshift/origin/test/extended/dns/dns.go:499 STEP: Creating a kubernetes client @ 08/12/24 15:55:02.255 STEP: Building a namespace api object, basename dns @ 08/12/24 15:55:02.257 STEP: Waiting for a default service account to be provisioned in namespace @ 08/12/24 15:55:02.517 STEP: Waiting for kube-root-ca.crt to be provisioned in namespace @ 08/12/24 15:55:02.581 STEP: Creating a kubernetes client @ 08/12/24 15:55:02.646 Aug 12 15:55:03.941: INFO: configPath is now "/tmp/configfile2098808007" Aug 12 15:55:03.941: INFO: The user is now "e2e-test-dns-dualstack-9bgpm-user" Aug 12 15:55:03.941: INFO: Creating project "e2e-test-dns-dualstack-9bgpm" Aug 12 15:55:04.299: INFO: Waiting on permissions in project "e2e-test-dns-dualstack-9bgpm" ... Aug 12 15:55:04.632: INFO: Waiting for ServiceAccount "default" to be provisioned... Aug 12 15:55:04.788: INFO: Waiting for ServiceAccount "deployer" to be provisioned... Aug 12 15:55:04.972: INFO: Waiting for ServiceAccount "builder" to be provisioned... Aug 12 15:55:05.132: INFO: Waiting for RoleBinding "system:image-pullers" to be provisioned... Aug 12 15:55:05.213: INFO: Waiting for RoleBinding "system:image-builders" to be provisioned... Aug 12 15:55:05.281: INFO: Waiting for RoleBinding "system:deployers" to be provisioned... Aug 12 15:55:05.641: INFO: Project "e2e-test-dns-dualstack-9bgpm" has been fully provisioned. STEP: creating a dual-stack service on a dual-stack cluster @ 08/12/24 15:55:05.775 STEP: Running these commands:for i in `seq 1 10`; do [ "$$(dig +short +notcp +noall +answer +search v4v6.e2e-dns-2700.svc A | sort | xargs echo)" = "172.31.255.230" ] && echo "test_endpoints@v4v6.e2e-dns-2700.svc"; [ "$$(dig +short +notcp +noall +answer +search v4v6.e2e-dns-2700.svc AAAA | sort | xargs echo)" = "fd02::7321" ] && echo "test_endpoints_v6@v4v6.e2e-dns-2700.svc"; [ "$$(dig +short +notcp +noall +answer +search ipv4.v4v6.e2e-dns-2700.svc A | sort | xargs echo)" = "3.3.3.3 4.4.4.4" ] && echo "test_endpoints@ipv4.v4v6.e2e-dns-2700.svc"; [ "$$(dig +short +notcp +noall +answer +search ipv6.v4v6.e2e-dns-2700.svc AAAA | sort | xargs echo)" = "2001:4860:4860::3333 2001:4860:4860::4444" ] && echo "test_endpoints_v6@ipv6.v4v6.e2e-dns-2700.svc";sleep 1; done @ 08/12/24 15:55:05.935 STEP: creating a pod to probe DNS @ 08/12/24 15:55:05.935 STEP: submitting the pod to kubernetes @ 08/12/24 15:55:05.935 STEP: deleting the pod @ 08/12/24 16:00:06.034 [FAILED] in [It] - github.com/openshift/origin/test/extended/dns/dns.go:251 @ 08/12/24 16:00:06.074 STEP: Collecting events from namespace "e2e-test-dns-dualstack-9bgpm". @ 08/12/24 16:00:06.074 STEP: Found 0 events. @ 08/12/24 16:00:06.207 Aug 12 16:00:06.239: INFO: POD NODE PHASE GRACE CONDITIONS Aug 12 16:00:06.239: INFO: Aug 12 16:00:06.334: INFO: skipping dumping cluster info - cluster too large Aug 12 16:00:06.469: INFO: Deleted {user.openshift.io/v1, Resource=users e2e-test-dns-dualstack-9bgpm-user}, err: <nil> Aug 12 16:00:06.506: INFO: Deleted {oauth.openshift.io/v1, Resource=oauthclients e2e-client-e2e-test-dns-dualstack-9bgpm}, err: <nil> Aug 12 16:00:06.544: INFO: Deleted {oauth.openshift.io/v1, Resource=oauthaccesstokens sha256~4QgFXAn8lyosshoHOjJeddr3MJbIL2DnCsoIvJVOGb4}, err: <nil> STEP: Destroying namespace "e2e-test-dns-dualstack-9bgpm" for this suite. 
@ 08/12/24 16:00:06.544 STEP: dump namespace information after failure @ 08/12/24 16:00:06.58 STEP: Collecting events from namespace "e2e-dns-2700". @ 08/12/24 16:00:06.58 STEP: Found 2 events. @ 08/12/24 16:00:06.615 Aug 12 16:00:06.615: INFO: At 0001-01-01 00:00:00 +0000 UTC - event for dns-test-d93fff7e-90a3-408e-a197-fc4ff0738b30: { } FailedScheduling: 0/3 nodes are available: 3 node(s) didn't match Pod's node affinity/selector. preemption: 0/3 nodes are available: 3 Preemption is not helpful for scheduling. Aug 12 16:00:06.615: INFO: At 0001-01-01 00:00:00 +0000 UTC - event for dns-test-d93fff7e-90a3-408e-a197-fc4ff0738b30: { } FailedScheduling: skip schedule deleting pod: e2e-dns-2700/dns-test-d93fff7e-90a3-408e-a197-fc4ff0738b30 Aug 12 16:00:06.648: INFO: POD NODE PHASE GRACE CONDITIONS Aug 12 16:00:06.648: INFO: Aug 12 16:00:06.743: INFO: skipping dumping cluster info - cluster too large STEP: Destroying namespace "e2e-dns-2700" for this suite. @ 08/12/24 16:00:06.743 • [FAILED] [304.528 seconds] [sig-network-edge] DNS [It] should answer A and AAAA queries for a dual-stack service [apigroup:config.openshift.io] [Suite:openshift/conformance/parallel] github.com/openshift/origin/test/extended/dns/dns.go:499 [FAILED] Failed: timed out waiting for the condition In [It] at: github.com/openshift/origin/test/extended/dns/dns.go:251 @ 08/12/24 16:00:06.074 ------------------------------ Summarizing 1 Failure: [FAIL] [sig-network-edge] DNS [It] should answer A and AAAA queries for a dual-stack service [apigroup:config.openshift.io] [Suite:openshift/conformance/parallel] github.com/openshift/origin/test/extended/dns/dns.go:251 Ran 1 of 1 Specs in 304.528 seconds FAIL! -- 0 Passed | 1 Failed | 0 Pending | 0 Skipped fail [github.com/openshift/origin/test/extended/dns/dns.go:251]: Failed: timed out waiting for the condition Ginkgo exit error 1: exit with code 1
failure reason
TODO
This is a clone of issue OCPBUGS-35535. The following is the description of the original issue:
—
: [sig-network-edge][Conformance][Area:Networking][Feature:Router] The HAProxy router should be able to connect to a service that is idled because a GET on the route will unidle it [Skipped:Disconnected] [Suite:openshift/conformance/parallel/minimal] expand_more
The reason for the failure is the incorrect configuration of the proxy.
failed log
Will run 1 of 1 specs ------------------------------ [sig-network-edge][Conformance][Area:Networking][Feature:Router] The HAProxy router should be able to connect to a service that is idled because a GET on the route will unidle it [Skipped:Disconnected] [Suite:openshift/conformance/parallel/minimal] github.com/openshift/origin/test/extended/router/idle.go:49 STEP: Creating a kubernetes client @ 06/14/24 10:24:21.443 Jun 14 10:24:21.752: INFO: configPath is now "/tmp/configfile3569155902" Jun 14 10:24:21.752: INFO: The user is now "e2e-test-router-idling-8pjjg-user" Jun 14 10:24:21.752: INFO: Creating project "e2e-test-router-idling-8pjjg" Jun 14 10:24:21.958: INFO: Waiting on permissions in project "e2e-test-router-idling-8pjjg" ... Jun 14 10:24:22.039: INFO: Waiting for ServiceAccount "default" to be provisioned... Jun 14 10:24:22.149: INFO: Waiting for ServiceAccount "deployer" to be provisioned... Jun 14 10:24:22.271: INFO: Waiting for ServiceAccount "builder" to be provisioned... Jun 14 10:24:22.400: INFO: Waiting for RoleBinding "system:image-pullers" to be provisioned... Jun 14 10:24:22.419: INFO: Waiting for RoleBinding "system:image-builders" to be provisioned... Jun 14 10:24:22.440: INFO: Waiting for RoleBinding "system:deployers" to be provisioned... Jun 14 10:24:22.740: INFO: Project "e2e-test-router-idling-8pjjg" has been fully provisioned. STEP: creating test fixtures @ 06/14/24 10:24:22.809 STEP: Waiting for pods to be running @ 06/14/24 10:24:23.146 Jun 14 10:24:24.212: INFO: Waiting for 1 pods in namespace e2e-test-router-idling-8pjjg Jun 14 10:24:26.231: INFO: All expected pods in namespace e2e-test-router-idling-8pjjg are running STEP: Getting a 200 status code when accessing the route @ 06/14/24 10:24:26.231 Jun 14 10:24:28.315: INFO: GET#1 "http://idle-test-e2e-test-router-idling-8pjjg.apps.2c2012a4ac53d705fb5e.ostest.test.metalkube.org" error=Get "http://idle-test-e2e-test-router-idling-8pjjg.apps.2c2012a4ac53d705fb5e.ostest.test.metalkube.org": dial tcp: lookup idle-test-e2e-test-router-idling-8pjjg.apps.2c2012a4ac53d705fb5e.ostest.test.metalkube.org on 172.30.0.10:53: no such host Jun 14 10:25:05.256: INFO: GET#38 "http://idle-test-e2e-test-router-idling-8pjjg.apps.2c2012a4ac53d705fb5e.ostest.test.metalkube.org" error=Get "http://idle-test-e2e-test-router-idling-8pjjg.apps.2c2012a4ac53d705fb5e.ostest.test.metalkube.org": dial tcp: lookup idle-test-e2e-test-router-idling-8pjjg.apps.2c2012a4ac53d705fb5e.ostest.test.metalkube.org on 172.30.0.10:53: no such host Jun 14 10:39:04.256: INFO: GET#877 "http://idle-test-e2e-test-router-idling-8pjjg.apps.2c2012a4ac53d705fb5e.ostest.test.metalkube.org" error=Get "http://idle-test-e2e-test-router-idling-8pjjg.apps.2c2012a4ac53d705fb5e.ostest.test.metalkube.org": dial tcp: lookup idle-test-e2e-test-router-idling-8pjjg.apps.2c2012a4ac53d705fb5e.ostest.test.metalkube.org on 172.30.0.10:53: no such host Jun 14 10:39:05.256: INFO: GET#878 "http://idle-test-e2e-test-router-idling-8pjjg.apps.2c2012a4ac53d705fb5e.ostest.test.metalkube.org" error=Get "http://idle-test-e2e-test-router-idling-8pjjg.apps.2c2012a4ac53d705fb5e.ostest.test.metalkube.org": dial tcp: lookup idle-test-e2e-test-router-idling-8pjjg.apps.2c2012a4ac53d705fb5e.ostest.test.metalkube.org on 172.30.0.10:53: no such host Jun 14 10:39:06.257: INFO: GET#879 "http://idle-test-e2e-test-router-idling-8pjjg.apps.2c2012a4ac53d705fb5e.ostest.test.metalkube.org" error=Get "http://idle-test-e2e-test-router-idling-8pjjg.apps.2c2012a4ac53d705fb5e.ostest.test.metalkube.org": dial tcp: 
lookup idle-test-e2e-test-router-idling-8pjjg.apps.2c2012a4ac53d705fb5e.ostest.test.metalkube.org on 172.30.0.10:53: no such host Jun 14 10:39:07.256: INFO: GET#880 "http://idle-test-e2e-test-router-idling-8pjjg.apps.2c2012a4ac53d705fb5e.ostest.test.metalkube.org" error=Get "http://idle-test-e2e-test-router-idling-8pjjg.apps.2c2012a4ac53d705fb5e.ostest.test.metalkube.org": dial tcp: lookup idle-test-e2e-test-router-idling-8pjjg.apps.2c2012a4ac53d705fb5e.ostest.test.metalkube.org on 172.30.0.10:53: no such host Jun 14 10:39:08.256: INFO: GET#881 "http://idle-test-e2e-test-router-idling-8pjjg.apps.2c2012a4ac53d705fb5e.ostest.test.metalkube.org" error=Get "http://idle-test-e2e-test-router-idling-8pjjg.apps.2c2012a4ac53d705fb5e.ostest.test.metalkube.org": dial tcp: lookup idle-test-e2e-test-router-idling-8pjjg.apps.2c2012a4ac53d705fb5e.ostest.test.metalkube.org on 172.30.0.10:53: no such host Jun 14 10:39:09.256: INFO: GET#882 "http://idle-test-e2e-test-router-idling-8pjjg.apps.2c2012a4ac53d705fb5e.ostest.test.metalkube.org" error=Get "http://idle-test-e2e-test-router-idling-8pjjg.apps.2c2012a4ac53d705fb5e.ostest.test.metalkube.org": dial tcp: lookup idle-test-e2e-test-router-idling-8pjjg.apps.2c2012a4ac53d705fb5e.ostest.test.metalkube.org on 172.30.0.10:53: no such host Jun 14 10:39:10.256: INFO: GET#883 "http://idle-test-e2e-test-router-idling-8pjjg.apps.2c2012a4ac53d705fb5e.ostest.test.metalkube.org" error=Get "http://idle-test-e2e-test-router-idling-8pjjg.apps.2c2012a4ac53d705fb5e.ostest.test.metalkube.org": dial tcp: lookup idle-test-e2e-test-router-idling-8pjjg.apps.2c2012a4ac53d705fb5e.ostest.test.metalkube.org on 172.30.0.10:53: no such host Jun 14 10:39:11.256: INFO: GET#884 "http://idle-test-e2e-test-router-idling-8pjjg.apps.2c2012a4ac53d705fb5e.ostest.test.metalkube.org" error=Get "http://idle-test-e2e-test-router-idling-8pjjg.apps.2c2012a4ac53d705fb5e.ostest.test.metalkube.org": dial tcp: lookup idle-test-e2e-test-router-idling-8pjjg.apps.2c2012a4ac53d705fb5e.ostest.test.metalkube.org on 172.30.0.10:53: no such host Jun 14 10:39:12.256: INFO: GET#885 "http://idle-test-e2e-test-router-idling-8pjjg.apps.2c2012a4ac53d705fb5e.ostest.test.metalkube.org" error=Get "http://idle-test-e2e-test-router-idling-8pjjg.apps.2c2012a4ac53d705fb5e.ostest.test.metalkube.org": dial tcp: lookup idle-test-e2e-test-router-idling-8pjjg.apps.2c2012a4ac53d705fb5e.ostest.test.metalkube.org on 172.30.0.10:53: no such host Jun 14 10:39:13.257: INFO: GET#886 "http://idle-test-e2e-test-router-idling-8pjjg.apps.2c2012a4ac53d705fb5e.ostest.test.metalkube.org" error=Get "http://idle-test-e2e-test-router-idling-8pjjg.apps.2c2012a4ac53d705fb5e.ostest.test.metalkube.org": dial tcp: lookup idle-test-e2e-test-router-idling-8pjjg.apps.2c2012a4ac53d705fb5e.ostest.test.metalkube.org on 172.30.0.10:53: no such host Jun 14 10:39:14.256: INFO: GET#887 "http://idle-test-e2e-test-router-idling-8pjjg.apps.2c2012a4ac53d705fb5e.ostest.test.metalkube.org" error=Get "http://idle-test-e2e-test-router-idling-8pjjg.apps.2c2012a4ac53d705fb5e.ostest.test.metalkube.org": dial tcp: lookup idle-test-e2e-test-router-idling-8pjjg.apps.2c2012a4ac53d705fb5e.ostest.test.metalkube.org on 172.30.0.10:53: no such host ... ... ... 
Jun 14 10:39:19.256: INFO: GET#892 "http://idle-test-e2e-test-router-idling-8pjjg.apps.2c2012a4ac53d705fb5e.ostest.test.metalkube.org" error=Get "http://idle-test-e2e-test-router-idling-8pjjg.apps.2c2012a4ac53d705fb5e.ostest.test.metalkube.org": dial tcp: lookup idle-test-e2e-test-router-idling-8pjjg.apps.2c2012a4ac53d705fb5e.ostest.test.metalkube.org on 172.30.0.10:53: no such host Jun 14 10:39:20.256: INFO: GET#893 "http://idle-test-e2e-test-router-idling-8pjjg.apps.2c2012a4ac53d705fb5e.ostest.test.metalkube.org" error=Get "http://idle-test-e2e-test-router-idling-8pjjg.apps.2c2012a4ac53d705fb5e.ostest.test.metalkube.org": dial tcp: lookup idle-test-e2e-test-router-idling-8pjjg.apps.2c2012a4ac53d705fb5e.ostest.test.metalkube.org on 172.30.0.10:53: no such host [INTERRUPTED] in [It] - github.com/openshift/origin/test/extended/router/idle.go:49 @ 06/14/24 10:39:20.461 ------------------------------ Interrupted by User First interrupt received; Ginkgo will run any cleanup and reporting nodes but will skip all remaining specs. Interrupt again to skip cleanup. Here's a current progress report: [sig-network-edge][Conformance][Area:Networking][Feature:Router] The HAProxy router should be able to connect to a service that is idled because a GET on the route will unidle it [Skipped:Disconnected] [Suite:openshift/conformance/parallel/minimal] (Spec Runtime: 14m59.024s) github.com/openshift/origin/test/extended/router/idle.go:49 In [It] (Node Runtime: 14m57.721s) github.com/openshift/origin/test/extended/router/idle.go:49 At [By Step] Getting a 200 status code when accessing the route (Step Runtime: 14m54.229s) github.com/openshift/origin/test/extended/router/idle.go:175 Spec Goroutine goroutine 307 [select] k8s.io/apimachinery/pkg/util/wait.waitForWithContext({0x95f5188, 0xda30720}, 0xc004cfbcf8, 0x30?) k8s.io/apimachinery@v0.29.0/pkg/util/wait/wait.go:205 k8s.io/apimachinery/pkg/util/wait.poll({0x95f5188, 0xda30720}, 0x1?, 0xc0045c2a80?, 0xc0045c2a87?) k8s.io/apimachinery@v0.29.0/pkg/util/wait/poll.go:260 k8s.io/apimachinery/pkg/util/wait.PollWithContext({0x95f5188?, 0xda30720?}, 0xc004cfbd90?, 0x88699b3?, 0x7?) k8s.io/apimachinery@v0.29.0/pkg/util/wait/poll.go:85 k8s.io/apimachinery/pkg/util/wait.Poll(0xc004cfbd00?, 0x88699b3?, 0x1?) k8s.io/apimachinery@v0.29.0/pkg/util/wait/poll.go:66 > github.com/openshift/origin/test/extended/router.waitHTTPGetStatus({0xc003d8fbc0, 0x5a}, 0xc8, 0x0?) github.com/openshift/origin/test/extended/router/idle.go:306 > github.com/openshift/origin/test/extended/router.glob..func7.2.1() github.com/openshift/origin/test/extended/router/idle.go:178 github.com/onsi/ginkgo/v2/internal.extractBodyFunction.func3({0x2e24138, 0xc0014f2d80}) github.com/onsi/ginkgo/v2@v2.13.0/internal/node.go:463 github.com/onsi/ginkgo/v2/internal.(*Suite).runNode.func3() github.com/onsi/ginkgo/v2@v2.13.0/internal/suite.go:896 github.com/onsi/ginkgo/v2/internal.(*Suite).runNode in goroutine 1 github.com/onsi/ginkgo/v2@v2.13.0/internal/suite.go:883 -----------------------------
Today, when we create an AKS cluster, we provide the catalog images like so:
--annotations hypershift.openshift.io/certified-operators-catalog-image=registry.redhat.io/redhat/certified-operator-index@sha256:fc68a3445d274af8d3e7d27667ad3c1e085c228b46b7537beaad3d470257be3e \
--annotations hypershift.openshift.io/community-operators-catalog-image=registry.redhat.io/redhat/community-operator-index@sha256:4a2e1962688618b5d442342f3c7a65a18a2cb014c9e66bb3484c687cfb941b90 \
--annotations hypershift.openshift.io/redhat-marketplace-catalog-image=registry.redhat.io/redhat/redhat-marketplace-index@sha256:ed22b093d930cfbc52419d679114f86bd588263f8c4b3e6dfad86f7b8baf9844 \
--annotations hypershift.openshift.io/redhat-operators-catalog-image=registry.redhat.io/redhat/redhat-operator-index@sha256:59b14156a8af87c0c969037713fc49be7294401b10668583839ff2e9b49c18d6 \
We need to fix this so that we don't need to override those images on the create command when we are in AKS.
The current reason we annotate the catalog images when creating an AKS cluster is that the HCP controller will try to source the images from an ImageStream if there are no overrides here - https://github.com/openshift/hypershift/blob/64149512a7a1ea21cb72d4473f46210ac1d3efe0/control-plane-operator/controllers/hostedcontrolplane/hostedcontrolplane_controller.go#L3672. In AKS, ImageStreams are not available.
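A quick, illustrative way to confirm the underlying constraint on the management cluster (this is generic oc usage, not something referenced by this card):

oc api-resources --api-group=image.openshift.io | grep imagestreams \
  || echo "ImageStream API is not available on this cluster"

On AKS the ImageStream API group is absent, which is why the controller cannot source the catalog images and the annotations are needed today.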
Done when:
We have an enhancement drafted and socialized that
Should be reviewed by/contain provisions for
Spin off of OCPBUGS-30192
The daemon process can exit due to health check failures in 4.16+, after we added API server CA rotation handling. This came with the side effect that if the MCD happens to exit in the middle of the update (e.g. the image pull portion), the files/units would have been updated but the OS upgrade would not, blocking the upgrade indefinitely when the new container comes up.
4.16
Only in BM CI so far, unsure if other issues contribute to this.
Get lucky and have api-int DNS break while the machine-config daemon is deploying updated files to disk. Unclear how to reliably trigger this, or distinguish from OCPBUGS-30192 and other failure modes.
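An illustrative way to spot the half-applied state described above (the node name is a placeholder; the annotations and commands are standard MCO/RHCOS tooling, not specific to this bug):

oc get node <node> -o jsonpath='{.metadata.annotations.machineconfiguration\.openshift\.io/desiredConfig}{"\n"}{.metadata.annotations.machineconfiguration\.openshift\.io/currentConfig}{"\n"}'
oc debug node/<node> -- chroot /host rpm-ostree status
oc -n openshift-machine-config-operator logs -l k8s-app=machine-config-daemon -c machine-config-daemon --tail=50

If the desired config has moved ahead while rpm-ostree still shows only the old deployment staged, the MCD most likely exited between writing files/units and completing the OS update.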
Older clusters updating into or running 4.15.0-rc.0 (and possibly Engineering Candidates?) can have the Kube API server operator initiate certificate rollouts, including the api-int CA. Missing pieces in the pipeline to roll out the new CA to kubelets and other consumers lead the cluster to lock up when the Kubernetes API servers transition to using the new cert/CA pair when serving incoming requests. For example, nodes may go NotReady with kubelets unable to call in their status to an api-int signed by the new CA that they don't yet trust.
Seen in two updates from 4.14.6 to 4.15.0-rc0. Unclear if Engineering Candidates were also exposed. 4.15.0-rc.1 and later will not be exposed because they have the fix for OCPBUGS-18761. They may still have broken logic for these CA rotations in place, but until the certs are 8y or more old, they will not trigger that broken logic.
We're working on it. Maybe cluster-kube-apiserver-operator#1615.
Nodes go NotReady with kubelet failing to communicate with api-int because of tls: failed to verify certificate: x509: certificate signed by unknown authority.
Happy certificate rollout.
Rolling the api-int CA is complicated, and we seem to be missing a number of steps. It's probably worth working out details in a GDoc or something where we have a shared space to fill out the picture.
One piece is getting the api-int certificates out to the kubelet, where the flow seems to be:
That handles new-node creation, but not "Kube API-server operator rolled the CA, and now we need to update existing nodes, and systemctl restart their kubelets. And any pods using ServiceAccount kubeconfigs? And...?". This bug is about filling in those missing pieces in the cert-rolling pipeline (including having the Kube API server not use the new CA until it has been sufficiently rolled out to api-int clients, possibly including every ServiceAccount-consuming pod on the cluster?), and anything else that seems broken with the early cert-rolls.
Somewhat relevant here is OCPBUGS-15367 currently managing /etc/kubernetes/kubeconfig permissions in the machine-config daemon to backstop for the file existing in the MCS-served Ignition config but not being a part of the rendered MachineConfig or the ControllerConfig stack.
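An illustrative way to see the mismatch from a node's point of view (cluster domain and node name are placeholders):

echo | openssl s_client -connect api-int.<cluster-domain>:6443 2>/dev/null | openssl x509 -noout -issuer -enddate
oc debug node/<node> -- chroot /host journalctl -u kubelet --since "1 hour ago" | grep "x509: certificate signed by unknown authority"

If the issuer shown on api-int is the freshly rolled CA while kubelets are still logging the x509 error, the serving side has moved ahead of its clients, which is exactly the sequencing this bug wants to fix.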
This is intended to be a place to capture general "tech debt" items so they don't get lost. I very much doubt that this will ever get completed as a feature, but that's okay; the desire is more that stories get pulled out of here and put with feature work "opportunistically" when it makes sense.
If you find a "tech debt" item, and it doesn't have an obvious home with something else (e.g. with MCO-1 if it's metrics and alerting) then put it here, and we can start splitting these out/marrying them up with other epics when it makes sense.
As an OpenShift developer, I want to know that my code is as secure as possible by running static analysis on each PR.
Periodically, scans are performed on all OpenShift repositories and the container images produced by those repositories. These scans usually result in numerous OCP bugs being opened into our queue (see linked bugs as an example), putting us in a more reactive state. Instead, we can perform these scans on each PR by following these instructions https://docs.ci.openshift.org/docs/how-tos/add-security-scanning/ to add this to our OpenShift CI configurations.
Done When:
During my work on Project Labrador, I learned that there are advanced caching directives that one can add to their Containerfiles. These do things such as allowing the package manager cache to be kept out of the image build while remaining available after the build, so that subsequent builds don't have to re-download the packages. Golang has a great incremental build story as well, provided that one leaves the caches intact.
To begin with, my Red Hat-issued ThinkPad P16v takes approximately 2 minutes and 42 seconds to perform an MCO image build (assuming the builder and base images are already prefetched).
A preliminary test shows that by using advanced caching directives, incremental builds can be reduced to as little as 45 seconds. Additionally, by moving the nmstate binary installation into a separate build stage and limiting what files are copied into that stage, we can achieve a cache hit under most conditions. This cache hit has the additional advantage that the stage no longer requires VPN access to reach the appropriate RPM repository.
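A sketch of the kind of caching directive meant here, not the MCO's actual Containerfile (the base image, paths, and build command are illustrative):

cat > Containerfile.cached <<'EOF'
FROM docker.io/library/golang:1.22 AS builder
WORKDIR /src
COPY . .
# Cache mounts keep the Go build/module caches out of the image layers but
# persistent across builds on the same host, enabling fast incremental builds.
RUN --mount=type=cache,target=/root/.cache/go-build \
    --mount=type=cache,target=/go/pkg/mod \
    go build -o /tmp/mybinary ./cmd/...
EOF
podman build -f Containerfile.cached -t build-cache-demo .

The same pattern applies to the package manager cache, e.g. a cache mount on /var/cache/dnf in the stage that installs nmstate.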
Done When:
During 4.15, the OCP team is working on allowing booting from iSCSI. Today that is disabled by the assisted installer. The goal is to enable it for OCP versions >= 4.15.
iSCSI boot is enabled for OCP versions >= 4.15 in both the UI and the backend.
When booting from iSCSI, we need to make sure to add the `rd.iscsi.firmware=1` karg during install to enable iSCSI booting.
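Purely illustrative of where that karg lands when writing an image manually with coreos-installer (the target device is a placeholder); in the assisted flow the service is expected to inject the same argument automatically:

coreos-installer install /dev/sda --append-karg rd.iscsi.firmware=1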
yes
Lift validation here to allow anyone to use iSCSI boot volumes:
https://github.com/openshift/assisted-service/pull/5728/files#diff-4c707383b7985c70d26467ad3b9a6cb21aeebf6208d3fbdd4997da55c5d6b30bR161
Add an annotation (with one sentence) that explains what the Kube resource is used for; sometimes we cannot or don't want to browse the code to know what the resource is used for, and most of the time the name is not self-explanatory.
I don't think maintaining this will be a big deal as resources' purposes don't change a lot.
There is a standard annotation for this: kubernetes.io/description (https://kubernetes.io/docs/reference/labels-annotations-taints/#description).
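A small example of what this could look like (the resource name and wording are illustrative):

oc -n openshift-machine-config-operator annotate configmap <configmap-name> \
  kubernetes.io/description='One-sentence summary of what this resource is used for.' \
  --overwrite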
See https://issues.redhat.com/browse/MON-1634?focusedId=22170770&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-22170770 for more context.
We don't clearly document what the supported "3rd-party" monitoring APIs are. We should compile an exhaustive list of the API services that we officially support:
See RHDEVDOCS-4830 for the context.
Migrate:
See https://docs.google.com/document/d/1fm2SZs8HroexPQnqI0Ua85Y31-lars8gCsdwSJC3ngo/edit?usp=sharing for more details.
This epic is to track stories that are not completed in MON-3378
After we have replaced all oauth-proxy occurrences in the monitoring stack, we need to make sure that all references to oauth-proxy are removed from the cluster monitoring operator. Examples:
There are a few places in CMO where we need to remove code after the release-4.16 branch is cut.
To find them, look for the "TODO" comments.
Implement an initial PowerVS provider for CAPI.
Generate the machine manifests, following how it's done for AWS [0] and the PoC code [1].
[0] https://github.com/openshift/installer/blob/master/pkg/asset/machines/aws/awsmachines.go
Endpoint overrides will be used by cluster operators for the disconnected scenario.
When publishStrategy is Internal, we need to create the api and api-int records against an IBM DNS service instead of CIS.
We need to point the API record to the private load balancer in the private scenario.
Epic Goal
Through this epic, we will update our CI to use an agent-based workflow instead of the libvirt openshift-installer, allowing us to eliminate the use of Terraform in our deployments.
Why is this important?
There is an active initiative in OpenShift to remove Terraform from the OpenShift installer.
Acceptance Criteria
Done Checklist
As a CI job author, I would like to be able to reference a yaml/json parsing tool that works across architectures and doesn't need to be downloaded for each unique step.
Rafael pointed out that Alessandro added multi-arch containers for yq for the UPI installer:
https://github.com/openshift/release/pull/49036#discussion_r1499870554
yq should have the ability to parse json.
We should evaluate if this can be added to the libvirt-installer image as well, and then used by all of our libvirt CI steps.
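A rough sketch of how CI steps could use it once it is in the image (the syntax shown is the Go yq v4 flavor; the exact flags depend on which yq build ends up in the image):

yq '.metadata.name' manifest.yaml            # read a field from a YAML file
yq -o=json '.' manifest.yaml                 # convert YAML to JSON
yq '.items[0].metadata.name' resources.json  # JSON parses as YAML, so the same binary covers both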
Filing a ticket based on this conversation here: https://github.com/openshift/enhancements/pull/1014#discussion_r798674314
Basically the tl;dr here is that we need a way to ensure that MachineSets are properly advertising the architecture that the nodes will eventually have. This is needed so the autoscaler can predict the correct pool to scale up/down. This could be accomplished through user-driven means, like adding node arch labels to MachineSets; if we have to do this automatically, we need to do more research and figure out an approach.
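One possible user-driven shape of this, purely illustrative (whether a node label on the MachineSet template is sufficient for the autoscaler is exactly what still needs research):

oc -n openshift-machine-api patch machineset <arm64-machineset> --type merge \
  -p '{"spec":{"template":{"spec":{"metadata":{"labels":{"kubernetes.io/arch":"arm64"}}}}}}'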
As part of the migration to external cloud providers, the CCMO and KCMO used a CloudControllerOwner condition to show which component owned the cloud controllers.
This is no longer required and can be removed.
The kubelet no longer needs the cloud-config flag as it is no longer running in-tree code.
It is currently handled in the templates by this function which will need to be removed, along with any instances in the templates that call the function.
This should cause the flag to be omitted from future kubelet configuration.
Code in library-go currently uses feature gates to determine if Azure and GCP clusters should be external or not. They have been promoted for at least one release and we do not see ourselves going back.
In 4.17 the code is expected to be deleted completely.
We should remove the reliance on the feature gate from this part of the code and clean up references to feature gate access at the call sites.
When a new OCP release branch is cut, there are a number of things that need to be updated manually to point to the new release.
Update community operators index image tags:
These catalogs need to be created first before we do this work.
Update Red Hat and Certified Operators index image tags:
These catalogs need to be created first before we do this work.
This is a response to https://issues.redhat.com/browse/OCPBUGS-12893, which is more of a feature request than a bug. The ask is that we put the ingress VIP in a fault state when there is no ingress controller present on the node, so the node won't take the VIP even if no other node takes it either.
I don't believe this is a bug because it's related to an unsupported configuration, but I still think it's worth doing because it will simplify our remote worker process. If we put the VIP in a fault state, it won't be necessary to disable keepalived on remote workers; the ingress service just needs to be placed correctly and keepalived will do the right thing.
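A sketch of the kind of check a keepalived vrrp_script could run so the VIP goes into FAULT on nodes without a local ingress controller (the port/endpoint shown is the router's usual health port and is an assumption, not the agreed implementation):

#!/usr/bin/env bash
# Exit non-zero when no router is answering locally; keepalived then refuses
# to claim the ingress VIP on this node.
curl -sf -o /dev/null http://localhost:1936/healthz || exit 1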
User Story
As an SRE-P Engineer, I want to ensure the TRT nightly jobs for the ROSA test suite are free of failures.
Acceptance Criteria
References
Failing conformance tests:
OpenShift conformance tests are flagging some alerts added by managed services as non-compliant.
See comments for failure messages
Goal:
Description of problem:
Failed to install OCP on the LZ/WLZ zones below. The common point across these regions is that each offers only one type of zone, either Local Zones (LZ) or Wavelength Zones (WLZ); e.g. in af-south-1 only an LZ is available (no WLZ), and in ap-northeast-2 only a WLZ is available (no LZ). Failed regions/zones:
af-south-1 ['af-south-1-los-1a']
failed to fetch Master Machines: failed to load asset "Install Config": failed to create install config: compute[1].platform.aws: Internal error: getting Local Zones: unable to retrieve Wavelength Zone names: no zones with type wavelength-zone in af-south-1
ap-south-1 ['ap-south-1-ccu-1a', 'ap-south-1-del-1a']
level=error msg=failed to fetch Master Machines: failed to load asset "Install Config": failed to create install config: compute[1].platform.aws: Internal error: getting Local Zones: unable to retrieve Wavelength Zone names: no zones with type wavelength-zone in ap-south-1
ap-southeast-1 ['ap-southeast-1-bkk-1a', 'ap-southeast-1-mnl-1a']
level=error msg=failed to fetch Master Machines: failed to load asset "Install Config": failed to create install config: compute[1].platform.aws: Internal error: getting Local Zones: unable to retrieve Wavelength Zone names: no zones with type wavelength-zone in ap-southeast-1
me-south-1 ['me-south-1-mct-1a']
level=error msg=failed to fetch Master Machines: failed to load asset "Install Config": failed to create install config: compute[1].platform.aws: Internal error: getting Local Zones: unable to retrieve Wavelength Zone names: no zones with type wavelength-zone in me-south-1
ap-southeast-2 ['ap-southeast-2-akl-1a', 'ap-southeast-2-per-1a']
level=error msg=failed to fetch Master Machines: failed to load asset "Install Config": failed to create install config: compute[1].platform.aws: Internal error: getting Local Zones: unable to retrieve Wavelength Zone names: no zones with type wavelength-zone in ap-southeast-2
eu-north-1 ['eu-north-1-cph-1a', 'eu-north-1-hel-1a']
level=error msg=failed to fetch Master Machines: failed to load asset "Install Config": failed to create install config: compute[1].platform.aws: Internal error: getting Local Zones: unable to retrieve Wavelength Zone names: no zones with type wavelength-zone in eu-north-1
ap-northeast-2 ['ap-northeast-2-wl1-cjj-wlz-1', 'ap-northeast-2-wl1-sel-wlz-1']
level=error msg=failed to fetch Master Machines: failed to load asset "Install Config": failed to create install config: compute[1].platform.aws: Internal error: getting Local Zones: unable to retrieve Local Zone names: no zones with type local-zone in ap-northeast-2
ca-central-1 ['ca-central-1-wl1-yto-wlz-1']
level=error msg=failed to fetch Master Machines: failed to load asset "Install Config": failed to create install config: compute[1].platform.aws: Internal error: getting Local Zones: unable to retrieve Local Zone names: no zones with type local-zone in ca-central-1
eu-west-2 ['eu-west-2-wl1-lon-wlz-1', 'eu-west-2-wl1-man-wlz-1', 'eu-west-2-wl2-man-wlz-1']
level=error msg=failed to fetch Master Machines: failed to load asset "Install Config": failed to create install config: compute[1].platform.aws: Internal error: getting Local Zones: unable to retrieve Local Zone names: no zones with type local-zone in eu-west-2
Version-Release number of selected component (if applicable):
4.15.0-rc.3-x86_64
How reproducible:
Steps to Reproduce:
1) install OCP on above regions/zones
Actual results:
See description.
Expected results:
Don't check LZ availability while installing OCP in a WLZ.
Don't check WLZ availability while installing OCP in an LZ.
Additional info:
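For reference, an illustrative AWS CLI query that shows which zone types actually exist in a region (and could inform a smarter installer validation):

aws ec2 describe-availability-zones --region af-south-1 --all-availability-zones \
  --filters Name=zone-type,Values=local-zone,wavelength-zone \
  --query 'AvailabilityZones[].[ZoneName,ZoneType,OptInStatus]' --output table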
Goal:
Description:
Follow up suggestion: https://github.com/openshift/origin/pull/28486#issuecomment-1884966128
Previous work (PR [1]) removed only the skips for failing e2e tests that were added in past releases.
Acceptance criteria:
[1] https://github.com/openshift/origin/pull/28486
The installer is using a very old reference to the machine-config project for APIs.
Those APIs have been moved to openshift/api. Update the imports and unit tests.
This will make an import error go away and make it easier to update in the future.
Today we manually catch some regressions by eyeballing disruption graphs.
There are two focuses for this Epic, first are updates to the existing disruption logic to fix and tune the existing logic and second is to consider new methods for collecting and analyzing disruption.
For the second part considerations are:
Design some automation to detect these kinds of regressions and alert TRT.
Would this be in sippy or something new? (too openshift specific?)
Bear in mind we'll soon need it for component memory and cpu usage as well.
Alerts should eventually be targeting the SLO we discussed in Infra arch call on Feb 7: https://docs.google.com/document/d/1QOXh7Me0w-4ad-c8HaTuQPvpG5cddUyA2b1j00H-MXQ/edit?usp=sharing
Make sure we gain testing coverage for metal and vSphere, which typically do not have the minimum 100 runs; how can we test these more broadly?
Today we fail the job if you're over the P99 for the last 3 weeks, as determined by a weekly PR to origin. The mechanism for creating that PR, reading its data, and running these tests has broken repeatedly without anyone realizing, and often does things we don't expect.
Disruption ebbs and flows constantly, especially at the 99th percentile, and the test being run is not technically the same week to week.
We do still want to at least attempt to fail a job run if disruption was significant.
Examples we would not want to fail:
P99 2s, job got 4s. We don't care about a few seconds of disruption at this level. This happens all the time, and it's not a regression. Stop failing the test.
P99 60s, job got 65s. There's already huge disruption possible, a few seconds over is largely irrelevant week to week. Stop failing the test.
Layer 3 disruption monitoring is now our main focus point for catching more subtle regressions, this is just a first line of defence, best attempt at telling a PR author that your change may have caused a drastic disruption problem.
Changes proposed (see details in comments below for the data as to why):
Certain repositories (ovn/sdn, router, sometimes the MCO) are prone to causing disruption regressions. We need to give engineers on these teams better visibility into when they're about to merge something that might break payloads.
We could request /payload on all PRs but this is expensive and a manual task that could still easily be forgotten.
In our epic TRT-787 we will likely soon have a little data on disruption in sippy, enough to know what we expect is normal, and how we're doing the last few days.
This card proposes a similar approach to risk analysis, or actually plugging right into risk analysis. A spyglass panel should be visible with data on disruption:
We could then enhance the PR commenter to drop this information right in front of the user if anything looks out of the ordinary.
The intervals charts displayed at the top of all prow job runs have become a critical tool for TRT and OpenShift engineering in general, allowing us to determine what happened when, and in relation to other events. The tooling however is falling short in a number of areas we'd like to improve.
Goals:
Stretch goals:
Drop locator and message.
Move tempStructuredLocator and tempStructuredMessage to locator/message.
Ground work for this was already laid by removing all use of the legacy fields.
1. from https://issues.redhat.com/browse/RFE-1576 add output compression
2. from earlier discussion about escalation add --since to limit the size of data (https://issues.redhat.com/browse/RFE-309)
3. from workloads architecture: must-gather fallback to oc adm inspect clusteroperators when it can't run a pod (https://github.com/openshift/oc/pull/749); see the sketch after this list
4. ability to get kubelet logs, refer to https://bugzilla.redhat.com/show_bug.cgi?id=1925328 (no longer relevant, feel free to reopen/re-report if needed)
5. fix error output from sync, see https://bugzilla.redhat.com/show_bug.cgi?id=1917850
6. Create link between nodes and pods from their respective nodes, so it's easier to navigate. (no longer applies)
7. Gather limited (x past hours/days) of logs see https://bugzilla.redhat.com/show_bug.cgi?id=1928468.
8. Use rsync for copying data off of pod see https://github.com/openshift/must-gather/pull/263#issuecomment-967024141 (moved under https://issues.redhat.com/browse/WRKLDS-1191)
[bug] Must-gather logs (not logged into the archive) - this also applies to inspect
[bug] timeouts don’t seem to happen at 10 min
[RFE] logging the collection of inspect in the timestamp file - the timestamp file should have more detailed information, probably similar to the first bug above (moved to https://issues.redhat.com/browse/WRKLDS-1190)
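A minimal sketch of the fallback path mentioned in item 3 above (the output directory and compression step are arbitrary examples):

oc adm inspect clusteroperators --dest-dir=./inspect.local
tar czf inspect.local.tar.gz ./inspect.local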
Pulling from https://issues.redhat.com/browse/WRKLDS-259. Currently, `oc adm must-gather` prints logs that are not timestamped when fallback is triggered (running `oc adm inspect` directly). E.g.:
$ ./oc adm must-gather [must-gather ] OUT Using must-gather plug-in image: registry.ci.openshift.org/ocp/4.16-2024-04-19-040249@sha256:addc9b013a4cbbe1067d652d032048e7b5f0c867174671d51cbf55f765ff35fc When opening a support case, bugzilla, or issue please include the following summary data along with any other requested information: ClusterID: 1fa407e3-2ea7-4c34-8fba-8c4d5c3a0bda ClientVersion: v4.2.0-alpha.0-2261-g17c015a ClusterVersion: Stable at "4.16.0-0.ci-2024-04-19-040249" ClusterOperators: All healthy and stable [must-gather ] OUT namespace/openshift-must-gather-9qrqc created [must-gather ] OUT clusterrolebinding.rbac.authorization.k8s.io/must-gather-8c5gz ... Gathering data for ns/openshift-config... Warning: apps.openshift.io/v1 DeploymentConfig is deprecated in v4.14+, unavailable in v4.10000+ Gathering data for ns/openshift-config-managed... Gathering data for ns/openshift-authentication... Gathering data for ns/openshift-authentication-operator... Gathering data for ns/openshift-ingress... Gathering data for ns/openshift-oauth-apiserver... Gathering data for ns/openshift-machine-api... Gathering data for ns/openshift-cloud-controller-manager-operator... Gathering data for ns/openshift-cloud-controller-manager... ...
The missing timestamps make it impossible to measure how much time each "Gathering data" step takes to complete. The completion time can help identify which steps take most of the collection time.
*Testing part*: the only test case here is to validate that every "significant" log line (i.e. each "Gathering data" line) has a timestamp, in both cases: the normal one, when a must-gather-??? pod/container is running, and the fallback one, when the must-gather image cannot be pulled (e.g. it does not exist) and must-gather falls back to running the `oc adm inspect` code directly (which takes significantly longer to run).
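A rough sketch of that check (the timestamp pattern is an assumption about what the fixed output will look like):

oc adm must-gather 2>&1 | tee mg.log
if grep "Gathering data" mg.log | grep -Ev '^[A-Z][0-9]{4} [0-9:.]+|^[0-9]{4}-[0-9]{2}-[0-9]{2}T'; then
  echo "found 'Gathering data' lines without a leading timestamp" >&2
  exit 1
fi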
*Documenting part*: no-doc
This section includes Jira cards that are not linked to either an Epic or a Feature. These tickets were completed when this image was assembled
This is a clone of issue OCPBUGS-33954. The following is the description of the original issue:
—
Description of problem:
The infra machine goes into Failed status:
2024-05-18 07:26:49.815 | NAMESPACE NAME PHASE TYPE REGION ZONE AGE 2024-05-18 07:26:49.822 | openshift-machine-api ostest-wgdc2-infra-0-4sqdh Running master regionOne nova 31m 2024-05-18 07:26:49.826 | openshift-machine-api ostest-wgdc2-infra-0-ssx8j Failed 31m 2024-05-18 07:26:49.831 | openshift-machine-api ostest-wgdc2-infra-0-tfkf5 Running master regionOne nova 31m 2024-05-18 07:26:49.841 | openshift-machine-api ostest-wgdc2-master-0 Running master regionOne nova 38m 2024-05-18 07:26:49.847 | openshift-machine-api ostest-wgdc2-master-1 Running master regionOne nova 38m 2024-05-18 07:26:49.852 | openshift-machine-api ostest-wgdc2-master-2 Running master regionOne nova 38m 2024-05-18 07:26:49.858 | openshift-machine-api ostest-wgdc2-worker-0-d5cdp Running worker regionOne nova 31m 2024-05-18 07:26:49.868 | openshift-machine-api ostest-wgdc2-worker-0-jcxml Running worker regionOne nova 31m 2024-05-18 07:26:49.873 | openshift-machine-api ostest-wgdc2-worker-0-t29fz Running worker regionOne nova 31m
Logs from machine-controller shows below error:
2024-05-18T06:59:11.159013162Z I0518 06:59:11.158938 1 controller.go:156] ostest-wgdc2-infra-0-ssx8j: reconciling Machine 2024-05-18T06:59:11.159589148Z I0518 06:59:11.159529 1 recorder.go:104] events "msg"="Reconciled machine ostest-wgdc2-worker-0-jcxml" "object"={"kind":"Machine","namespace":"openshift-machine-api","name":"ostest-wgdc2-worker-0-jcxml","uid":"245bac8e-c110-4bef-ac11-3d3751a93353","apiVersion":"machine.openshift.io/v1beta1","resourceVersion":"18617"} "reason"="Reconciled" "type"="Normal" 2024-05-18T06:59:12.749966746Z I0518 06:59:12.749845 1 controller.go:349] ostest-wgdc2-infra-0-ssx8j: reconciling machine triggers idempotent create 2024-05-18T07:00:00.487702632Z E0518 07:00:00.486365 1 leaderelection.go:332] error retrieving resource lock openshift-machine-api/cluster-api-provider-openstack-leader: Get "https://172.30.0.1:443/apis/coordination.k8s.io/v1/namespaces/openshift-machine-api/leases/cluster-api-provider-openstack-leader": http2: client connection lost 2024-05-18T07:00:00.487702632Z W0518 07:00:00.486497 1 controller.go:351] ostest-wgdc2-infra-0-ssx8j: failed to create machine: error creating bootstrap for ostest-wgdc2-infra-0-ssx8j: Get "https://172.30.0.1:443/api/v1/namespaces/openshift-machine-api/secrets/worker-user-data": http2: client connection lost 2024-05-18T07:00:00.487702632Z I0518 07:00:00.486534 1 controller.go:391] Actuator returned invalid configuration error: error creating bootstrap for ostest-wgdc2-infra-0-ssx8j: Get "https://172.30.0.1:443/api/v1/namespaces/openshift-machine-api/secrets/worker-user-data": http2: client connection lost 2024-05-18T07:00:00.487702632Z I0518 07:00:00.486548 1 controller.go:404] ostest-wgdc2-infra-0-ssx8j: going into phase "Failed"
The openstack VM is not even created:
2024-05-18 07:26:50.911 | +--------------------------------------+-----------------------------+--------+---------------------------------------------------------------------------------------------------------------------+--------------------+--------+ 2024-05-18 07:26:50.917 | | ID | Name | Status | Networks | Image | Flavor | 2024-05-18 07:26:50.924 | +--------------------------------------+-----------------------------+--------+---------------------------------------------------------------------------------------------------------------------+--------------------+--------+ 2024-05-18 07:26:50.929 | | 3a1b9af6-d284-4da5-8ebe-434d3aa95131 | ostest-wgdc2-worker-0-jcxml | ACTIVE | StorageNFS=172.17.5.187; network-dualstack=192.168.192.185, fd2e:6f44:5dd8:c956:f816:3eff:fe3e:4e7c | ostest-wgdc2-rhcos | worker | 2024-05-18 07:26:50.935 | | 5c34b78a-d876-49fb-a307-874d3c197c44 | ostest-wgdc2-infra-0-tfkf5 | ACTIVE | network-dualstack=192.168.192.133, fd2e:6f44:5dd8:c956:f816:3eff:fee6:4410, fd2e:6f44:5dd8:c956:f816:3eff:fef2:930a | ostest-wgdc2-rhcos | master | 2024-05-18 07:26:50.941 | | d2025444-8e11-409d-8a87-3f1082814af1 | ostest-wgdc2-infra-0-4sqdh | ACTIVE | network-dualstack=192.168.192.156, fd2e:6f44:5dd8:c956:f816:3eff:fe82:ae56, fd2e:6f44:5dd8:c956:f816:3eff:fe86:b6d1 | ostest-wgdc2-rhcos | master | 2024-05-18 07:26:50.947 | | dcbde9ac-da5a-44c8-b64f-049f10b6b50c | ostest-wgdc2-worker-0-t29fz | ACTIVE | StorageNFS=172.17.5.233; network-dualstack=192.168.192.13, fd2e:6f44:5dd8:c956:f816:3eff:fe94:a2d2 | ostest-wgdc2-rhcos | worker | 2024-05-18 07:26:50.951 | | 8ad98adf-147c-4268-920f-9eb5c43ab611 | ostest-wgdc2-worker-0-d5cdp | ACTIVE | StorageNFS=172.17.5.217; network-dualstack=192.168.192.173, fd2e:6f44:5dd8:c956:f816:3eff:fe22:5cff | ostest-wgdc2-rhcos | worker | 2024-05-18 07:26:50.957 | | f01d6740-2954-485d-865f-402b88789354 | ostest-wgdc2-master-2 | ACTIVE | StorageNFS=172.17.5.177; network-dualstack=192.168.192.198, fd2e:6f44:5dd8:c956:f816:3eff:fe1f:3c64 | ostest-wgdc2-rhcos | master | 2024-05-18 07:26:50.963 | | d215a70f-760d-41fb-8e30-9f3106dbaabe | ostest-wgdc2-master-1 | ACTIVE | StorageNFS=172.17.5.163; network-dualstack=192.168.192.152, fd2e:6f44:5dd8:c956:f816:3eff:fe4e:67b6 | ostest-wgdc2-rhcos | master | 2024-05-18 07:26:50.968 | | 53fe495b-f617-412d-9608-47cd355bc2e5 | ostest-wgdc2-master-0 | ACTIVE | StorageNFS=172.17.5.170; network-dualstack=192.168.192.193, fd2e:6f44:5dd8:c956:f816:3eff:febd:a836 | ostest-wgdc2-rhcos | master | 2024-05-18 07:26:50.975 | +--------------------------------------+-----------------------------+--------+---------------------------------------------------------------------------------------------------------------------+--------------------+--------+
Version-Release number of selected component (if applicable):
RHOS-17.1-RHEL-9-20240123.n.1 4.15.0-0.nightly-2024-05-16-091947
Additional info:
Must-gather link provided on private comment.
This is a clone of issue OCPBUGS-33955. The following is the description of the original issue:
—
Description of problem:
In 4.16.0-0.nightly-2024-05-14-095225, the message "logtostderr is removed in the k8s upstream and has no effect any more." is logged in the kube-rbac-proxy-main/kube-rbac-proxy-self/kube-rbac-proxy-thanos containers.
$ oc -n openshift-monitoring logs -c kube-rbac-proxy-main openshift-state-metrics-7f78c76cc6-nfbl4 W0514 23:19:50.052015 1 deprecated.go:66] ==== Removed Flag Warning ======================logtostderr is removed in the k8s upstream and has no effect any more.=============================================== ... $ oc -n openshift-monitoring logs -c kube-rbac-proxy-self openshift-state-metrics-7f78c76cc6-nfbl4 ... W0514 23:19:50.177692 1 deprecated.go:66] ==== Removed Flag Warning ======================logtostderr is removed in the k8s upstream and has no effect any more.=============================================== ... $ oc -n openshift-monitoring get pod openshift-state-metrics-7f78c76cc6-nfbl4 -oyaml | grep logtostderr -C3 spec: containers: - args: - --logtostderr - --secure-listen-address=:8443 - --tls-cipher-suites=TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305_SHA256,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305_SHA256 - --upstream=http://127.0.0.1:8081/ -- name: kube-api-access-v9hzd readOnly: true - args: - --logtostderr - --secure-listen-address=:9443 - --tls-cipher-suites=TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305_SHA256,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305_SHA256 - --upstream=http://127.0.0.1:8082/ $ oc -n openshift-monitoring logs -c kube-rbac-proxy-thanos prometheus-k8s-0 W0515 02:55:54.209496 1 deprecated.go:66] ==== Removed Flag Warning ======================logtostderr is removed in the k8s upstream and has no effect any more.=============================================== ... $ oc -n openshift-monitoring get pod prometheus-k8s-0 -oyaml | grep logtostderr -C3 - --config-file=/etc/kube-rbac-proxy/config.yaml - --tls-cipher-suites=TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305_SHA256,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305_SHA256 - --allow-paths=/metrics - --logtostderr=true - --tls-min-version=VersionTLS12 env: - name: POD_IP
Version-Release number of selected component (if applicable):
4.16.0-0.nightly-2024-05-14-095225
How reproducible:
always
Steps to Reproduce:
1. see the description
Actual results:
logtostderr is removed in the k8s upstream and has no effect any more
Expected results:
no such info
Additional info:
Description of problem:
When OCB is enabled and a new MC is created, nodes are drained twice when the resulting osImage build is applied.
Version-Release number of selected component (if applicable):
4.16
How reproducible:
Always
Steps to Reproduce:
1. Enable OCB in the worker pool oc create -f - << EOF apiVersion: machineconfiguration.openshift.io/v1alpha1 kind: MachineOSConfig metadata: name: worker spec: machineConfigPool: name: worker buildInputs: imageBuilder: imageBuilderType: PodImageBuilder baseImagePullSecret: name: $(oc get secret -n openshift-config pull-secret -o json | jq "del(.metadata.namespace, .metadata.creationTimestamp, .metadata.resourceVersion, .metadata.uid, .metadata.name)" | jq '.metadata.name="pull-copy"' | oc -n openshift-machine-config-operator create -f - &> /dev/null; echo -n "pull-copy") renderedImagePushSecret: name: $(oc get -n openshift-machine-config-operator sa builder -ojsonpath='{.secrets[0].name}') renderedImagePushspec: "image-registry.openshift-image-registry.svc:5000/openshift-machine-config-operator/ocb-image:latest" EOF 2. Wait for the image to be built 3. When the opt-in image has been finished and applied create a new MC apiVersion: machineconfiguration.openshift.io/v1 kind: MachineConfig metadata: labels: machineconfiguration.openshift.io/role: worker name: test-machine-config-1 spec: config: ignition: version: 3.1.0 storage: files: - contents: source: data:text/plain;charset=utf-8;base64,dGVzdA== filesystem: root mode: 420 path: /etc/test-file-1.test 4. Wait for the image to be built
Actual results:
Once the image is built it is applied to the worker nodes. If we have a look at the drain operation, we can see that every worker node was drained twice instead of once: oc -n openshift-machine-config-operator logs $(oc -n openshift-machine-config-operator get pods -l k8s-app=machine-config-controller -o jsonpath='{.items[0].metadata.name}') -c machine-config-controller | grep "initiating drain" I0430 13:28:48.740300 1 drain_controller.go:182] node ip-10-0-70-208.us-east-2.compute.internal: initiating drain I0430 13:30:08.330051 1 drain_controller.go:182] node ip-10-0-70-208.us-east-2.compute.internal: initiating drain I0430 13:32:32.431789 1 drain_controller.go:182] node ip-10-0-69-154.us-east-2.compute.internal: initiating drain I0430 13:33:50.643544 1 drain_controller.go:182] node ip-10-0-69-154.us-east-2.compute.internal: initiating drain I0430 13:48:08.183488 1 drain_controller.go:182] node ip-10-0-70-208.us-east-2.compute.internal: initiating drain I0430 13:49:01.379416 1 drain_controller.go:182] node ip-10-0-70-208.us-east-2.compute.internal: initiating drain I0430 13:50:52.933337 1 drain_controller.go:182] node ip-10-0-69-154.us-east-2.compute.internal: initiating drain I0430 13:52:12.191203 1 drain_controller.go:182] node ip-10-0-69-154.us-east-2.compute.internal: initiating drain
Expected results:
Nodes should be drained only once when applying a new MC.
Additional info:
The monitoring operator may be down or disabled, and the components it manages may be unavailable or degraded.
Upon a quick check I noticed an error:
oc get co -o json | jq -r '.items[].status | select (.conditions) '.conditions | jq -r '.[] | select( (.type == "Degraded") and (.status == "True") )' { "lastTransitionTime": "2023-12-19T10:25:24Z", "message": "syncing Thanos Querier trusted CA bundle ConfigMap failed: reconciling trusted CA bundle ConfigMap failed: updating ConfigMap object failed: Timeout: request did not complete within requested timeout - context deadline exceeded, syncing Thanos Querier trusted CA bundle ConfigMap failed: deleting old trusted CA bundle configmaps failed: error listing configmaps in namespace openshift-monitoring with label selector monitoring.openshift.io/name=alertmanager,monitoring.openshift.io/hash!=2ua4n9ob5qr8o: the server was unable to return a response in the time allotted, but may still be processing the request (get configmaps), syncing Prometheus trusted CA bundle ConfigMap failed: deleting old trusted CA bundle configmaps failed: error listing configmaps in namespace openshift-monitoring with label selector monitoring.openshift.io/name=prometheus,monitoring.openshift.io/hash!=2ua4n9ob5qr8o: the server was unable to return a response in the time allotted, but may still be processing the request (get configmaps)", "reason": "MultipleTasksFailed", "status": "True", "type": "Degraded" }
i.e. updating ConfigMap object failed: Timeout: request did not complete within requested timeout - context deadline exceeded
I ran oc get co again and everything looked fine; it seems this timeout condition could be handled better to avoid alerting SRE.
operator degraded
operator retries operation
Description of problem:
Given this nmstate inside the agent-config
- name: bond0.10
  type: vlan
  state: up
  vlan:
    base-iface: bond0
    id: 10
  ipv4:
    address:
    - ip: 10.10.10.116
      prefix-length: 24
    dhcp: false
    enabled: true
  ipv6:
    enabled: true
    autoconf: true
    dhcp: true
    auto-dns: false
    auto-gateway: true
    auto-routes: true
The installation fails due to the assisted-service validation
"message": "No connectivity to the majority of hosts in the cluster"
It appears to be missing the L2 connectivity for the IPv6 part (unconfirmed).
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
when use targetCatalog, mirror failed with error: error: error rebuilding catalog images from file-based catalogs: error copying image docker://registry.redhat.io/abc/redhat-operator-index:v4.13 to docker://localhost:5000/abc/redhat-operator-index:v4.13: initializing source docker://registry.redhat.io/abc/redhat-operator-index:v4.13: (Mirrors also failed: [localhost:5000/abc/redhat-operator-index:v4.13: pinging container registry localhost:5000: Get "https://localhost:5000/v2/": http: server gave HTTP response to HTTPS client]): registry.redhat.io/abc/redhat-operator-index:v4.13: reading manifest v4.13 in registry.redhat.io/abc/redhat-operator-index: unauthorized: access to the requested resource is not authorized
Version-Release number of selected component (if applicable):
oc-mirror 4.16
How reproducible:
always
Steps to Reproduce:
1) Use the following ISC to do mirror-to-mirror for v1:

kind: ImageSetConfiguration
apiVersion: mirror.openshift.io/v1alpha2
storageConfig:
  local:
    path: /tmp/case60597
mirror:
  operators:
  - catalog: registry.redhat.io/redhat/redhat-operator-index:v4.13
    targetCatalog: abc/redhat-operator-index
    packages:
    - name: servicemeshoperator

`oc-mirror --config config.yaml docker://localhost:5000 --dest-use-http`
Actual results:
1) mirror failed with error:
info: Mirroring completed in 420ms (0B/s)
error: error rebuilding catalog images from file-based catalogs: error copying image docker://registry.redhat.io/abc/redhat-operator-index:v4.13 to docker://localhost:5000/abc/redhat-operator-index:v4.13: initializing source docker://registry.redhat.io/abc/redhat-operator-index:v4.13: (Mirrors also failed: [localhost:5000/abc/redhat-operator-index:v4.13: pinging container registry localhost:5000: Get "https://localhost:5000/v2/": http: server gave HTTP response to HTTPS client]): registry.redhat.io/abc/redhat-operator-index:v4.13: reading manifest v4.13 in registry.redhat.io/abc/redhat-operator-index: unauthorized: access to the requested resource is not authorized
Expected results:
1) no error.
Additional information:
Compared with oc-mirror 4.15.9, this issue cannot be reproduced.
Please review the following PR: https://github.com/openshift/images/pull/159
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
The Pipelines operator has been removed from OperatorHub, so CI has been failing; see https://search.ci.openshift.org/?search=Entire+pipeline+flow+from+Builder+page+%22before+all%22+hook+for+%22Background+Steps%22&maxAge=336h&context=1&type=junit&name=pull-ci-openshift-console-master-e2e-gcp-console&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
This is a clone of issue OCPBUGS-34243. The following is the description of the original issue:
—
Description of problem:
aws capi installs, particularly when running under heavy load in ci, can sometimes fail with:
level=info msg=Creating private Hosted Zone
level=error msg=failed to fetch Cluster: failed to generate asset "Cluster": failed to create cluster: failed provisioning resources after infrastructure ready: failed to create private hosted zone: error creating private hosted zone: HostedZoneAlreadyExists: A hosted zone has already been created with the specified caller reference.
level=error msg= status code: 409, request id: f173760d-ab43-41b8-a8a0-568cf387bf5e
Version-Release number of selected component (if applicable):
How reproducible:
not reproducible - needs to be discovered in ci
Steps to Reproduce:
1. 2. 3.
Actual results:
install fails due to existing hosted zone
Expected results:
HostedZoneAlreadyExists error should not cause install to fail
Additional info:
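One way the installer could tolerate the 409 is to treat HostedZoneAlreadyExists as success and look up the zone already created with the same caller reference. A sketch against the AWS SDK for Go v1, not the installer's actual fix:

package example

import (
	"github.com/aws/aws-sdk-go/aws/awserr"
	"github.com/aws/aws-sdk-go/service/route53"
)

// createPrivateZone treats HostedZoneAlreadyExists as success: the zone from the
// earlier attempt (same caller reference) is looked up and reused instead of failing.
func createPrivateZone(client *route53.Route53, input *route53.CreateHostedZoneInput) (*route53.HostedZone, error) {
	out, err := client.CreateHostedZone(input)
	if err == nil {
		return out.HostedZone, nil
	}
	if aerr, ok := err.(awserr.Error); ok && aerr.Code() == route53.ErrCodeHostedZoneAlreadyExists {
		existing, lerr := client.ListHostedZonesByName(&route53.ListHostedZonesByNameInput{DNSName: input.Name})
		if lerr == nil && len(existing.HostedZones) > 0 {
			return existing.HostedZones[0], nil
		}
	}
	return nil, err
}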
This is a clone of issue OCPBUGS-37753. The following is the description of the original issue:
—
discoverOpenIDURLs and checkOIDCPasswordGrantFlow fail if endpoints are private to the data plane.
https://issues.redhat.com/browse/HOSTEDCP-421 enabled the OAuth server traffic to flow through the data plane so that private endpoints (e.g. LDAP) can be reached.
https://issues.redhat.com/browse/OCPBUGS-8073 enabled fallback to the management cluster network so that, for public endpoints (e.g. GitHub), we are not blocked on the data plane being available.
This issue is to enable the CPO oidc checks to flow through the data plane and fallback to the management side to satisfy both cases above.
This would cover https://issues.redhat.com/browse/RFE-5638
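A rough sketch of the dial strategy described above: try the data-plane path first, then fall back to the management cluster network. The names are illustrative, not the CPO's actual API:

package example

import (
	"context"
	"net"
)

type dialFunc func(ctx context.Context, network, addr string) (net.Conn, error)

// dataPlaneFirst returns a dialer that reaches OIDC/LDAP endpoints through the
// data plane (for endpoints private to the guest cluster) and falls back to a
// direct dial from the management cluster for publicly reachable endpoints.
func dataPlaneFirst(dataPlane, management dialFunc) dialFunc {
	return func(ctx context.Context, network, addr string) (net.Conn, error) {
		if conn, err := dataPlane(ctx, network, addr); err == nil {
			return conn, nil
		}
		return management(ctx, network, addr)
	}
}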
Description of problem:
When upgrading clusters to any 4.13 version (from either 4.12 or 4.13), clusters with Hybrid Networking enabled appear to have a few DNS pods (not all) falling into CrashLoopBackOff status. Notably, the pods are failing readiness probes, and when deleted, they work without incident. Nearly all other aspects of the upgrade continue as expected.
Version-Release number of selected component (if applicable):
4.13.x
How reproducible:
Always for systems in use
Steps to Reproduce:
1. Configure initial cluster installation with OVNKubernetes and enable Hybrid Networking on 4.12 or 4.13, e.g. 4.13.13
2. Upgrade cluster to 4.13.z, e.g. 4.13.14
Actual results:
dns-default pods in CrashLoopBackOff status, failing Readiness probes
Expected results:
dns-default pods are rolled out without incident
Additional info:
Appears strongly related to OCPBUGS-13172. CU has kept an affected cluster with the DNS pod issue ongoing for additional investigating, if needed.
Description of problem:
During the control plane upgrade e2e test, it seems that the openshift apiserver becomes unavailable during the upgrade process. The test is run on an HA control plane, and this should not happen.
Version-Release number of selected component (if applicable):
4.15
How reproducible:
Often
Steps to Reproduce:
1. Create a hosted cluster with HA control plane and wait for it to become available
2. Upgrade the hosted cluster to a newer release
3. While upgrading, monitor whether the openshift apiserver is available by either querying APIService resources or resources served by the openshift apiserver.
Actual results:
The openshift apiserver is unavailable at some point during the upgrade
Expected results:
The openshift apiserver is available throughout the upgrade
Additional info:
Seen in 4.14 to 4.15 update CI:
: [bz-OLM] clusteroperator/operator-lifecycle-manager-packageserver should not change condition/Available expand_less Run #0: Failed expand_less 1h34m55s { 1 unexpected clusteroperator state transitions during e2e test run Nov 22 21:48:41.624 - 56ms E clusteroperator/operator-lifecycle-manager-packageserver condition/Available reason/ClusterServiceVersionNotSucceeded status/False ClusterServiceVersion openshift-operator-lifecycle-manager/packageserver observed in phase Failed with reason: APIServiceInstallFailed, message: APIService install failed: forbidden: User "system:anonymous" cannot get path "/apis/packages.operators.coreos.com/v1"}
While a brief auth failure isn't fantastic, an issue that only persists for 56ms is not long enough to warrant immediate admin intervention. Teaching the operator to stay Available=True for this kind of brief hiccup, while still going Available=False for issues where at least part of the component is non-functional and the condition requires immediate administrator intervention, would make it easier for admins and SREs operating clusters to identify when intervention was required. It's also possible that this is an incoming-RBAC vs. outgoing-RBAC race of some sort, and that shifting manifest filenames around could avoid the hiccup entirely.
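A sketch of the kind of debounce that would keep a 56ms blip from flipping the ClusterOperator condition; the 30-second grace period and the setAvailable callback are assumptions for illustration, not OLM's actual implementation:

package example

import "time"

// availabilityReporter only reports Available=False once a failure has lasted
// longer than a grace period, so sub-second hiccups never page an admin.
type availabilityReporter struct {
	failingSince time.Time
	grace        time.Duration // e.g. 30 * time.Second (assumed value)
}

func (r *availabilityReporter) observe(healthy bool, now time.Time, setAvailable func(bool)) {
	if healthy {
		r.failingSince = time.Time{}
		setAvailable(true)
		return
	}
	if r.failingSince.IsZero() {
		r.failingSince = now
	}
	// Only go Available=False once the outage has lasted longer than the grace period.
	setAvailable(now.Sub(r.failingSince) < r.grace)
}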
$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=48h&type=junit&search=clusteroperator/operator-lifecycle-manager-packageserver+should+not+change+condition/Available' | grep '^periodic-.*4[.]15.*failures match' | sort periodic-ci-openshift-cluster-etcd-operator-release-4.15-periodics-e2e-aws-etcd-recovery (all) - 2 runs, 100% failed, 50% of failures match = 50% impact periodic-ci-openshift-multiarch-master-nightly-4.15-ocp-e2e-aws-ovn-heterogeneous-upgrade (all) - 5 runs, 40% failed, 50% of failures match = 20% impact periodic-ci-openshift-multiarch-master-nightly-4.15-ocp-e2e-upgrade-aws-ovn-arm64 (all) - 8 runs, 38% failed, 33% of failures match = 13% impact periodic-ci-openshift-multiarch-master-nightly-4.15-ocp-e2e-upgrade-azure-ovn-arm64 (all) - 5 runs, 20% failed, 400% of failures match = 80% impact periodic-ci-openshift-multiarch-master-nightly-4.15-upgrade-from-nightly-4.14-ocp-e2e-upgrade-azure-ovn-arm64 (all) - 1 runs, 100% failed, 100% of failures match = 100% impact periodic-ci-openshift-multiarch-master-nightly-4.15-upgrade-from-nightly-4.14-ocp-ovn-remote-libvirt-ppc64le (all) - 6 runs, 67% failed, 75% of failures match = 50% impact periodic-ci-openshift-multiarch-master-nightly-4.15-upgrade-from-nightly-4.14-ocp-ovn-remote-libvirt-s390x (all) - 6 runs, 100% failed, 33% of failures match = 33% impact periodic-ci-openshift-multiarch-master-nightly-4.15-upgrade-from-stable-4.14-ocp-e2e-aws-ovn-heterogeneous-upgrade (all) - 5 runs, 40% failed, 50% of failures match = 20% impact periodic-ci-openshift-multiarch-master-nightly-4.15-upgrade-from-stable-4.14-ocp-e2e-aws-sdn-arm64 (all) - 5 runs, 20% failed, 300% of failures match = 60% impact periodic-ci-openshift-multiarch-master-nightly-4.15-upgrade-from-stable-4.14-ocp-e2e-upgrade-azure-ovn-arm64 (all) - 5 runs, 40% failed, 100% of failures match = 40% impact periodic-ci-openshift-release-master-ci-4.15-e2e-aws-ovn-upgrade (all) - 5 runs, 20% failed, 100% of failures match = 20% impact periodic-ci-openshift-release-master-ci-4.15-e2e-aws-upgrade-ovn-single-node (all) - 1 runs, 100% failed, 100% of failures match = 100% impact periodic-ci-openshift-release-master-ci-4.15-e2e-azure-ovn-upgrade (all) - 43 runs, 51% failed, 36% of failures match = 19% impact periodic-ci-openshift-release-master-ci-4.15-e2e-azure-sdn-upgrade (all) - 5 runs, 20% failed, 300% of failures match = 60% impact periodic-ci-openshift-release-master-ci-4.15-e2e-gcp-ovn-upgrade (all) - 80 runs, 44% failed, 17% of failures match = 8% impact periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-aws-ovn-upgrade (all) - 80 runs, 30% failed, 63% of failures match = 19% impact periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-aws-ovn-uwm (all) - 1 runs, 100% failed, 100% of failures match = 100% impact periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-aws-sdn-upgrade (all) - 8 runs, 25% failed, 200% of failures match = 50% impact periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-azure-ovn-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-azure-sdn-upgrade (all) - 80 runs, 43% failed, 50% of failures match = 21% impact periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-gcp-ovn-rt-upgrade (all) - 50 runs, 16% failed, 50% of failures match = 8% impact periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-vsphere-ovn-upgrade 
(all) - 1 runs, 100% failed, 100% of failures match = 100% impact periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-from-stable-4.13-e2e-aws-sdn-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact periodic-ci-openshift-release-master-nightly-4.15-e2e-aws-ovn-single-node-serial (all) - 5 runs, 100% failed, 80% of failures match = 80% impact periodic-ci-openshift-release-master-nightly-4.15-e2e-aws-ovn-upgrade-rollback-oldest-supported (all) - 4 runs, 25% failed, 100% of failures match = 25% impact periodic-ci-openshift-release-master-nightly-4.15-e2e-aws-sdn-upgrade (all) - 50 runs, 18% failed, 178% of failures match = 32% impact periodic-ci-openshift-release-master-nightly-4.15-e2e-metal-ipi-sdn-bm-upgrade (all) - 6 runs, 83% failed, 20% of failures match = 17% impact periodic-ci-openshift-release-master-nightly-4.15-e2e-metal-ipi-upgrade-ovn-ipv6 (all) - 6 runs, 83% failed, 60% of failures match = 50% impact periodic-ci-openshift-release-master-nightly-4.15-upgrade-from-stable-4.13-e2e-aws-ovn-upgrade-paused (all) - 1 runs, 100% failed, 100% of failures match = 100% impact periodic-ci-openshift-release-master-nightly-4.15-upgrade-from-stable-4.14-e2e-aws-sdn-upgrade (all) - 6 runs, 17% failed, 100% of failures match = 17% impact periodic-ci-openshift-release-master-nightly-4.15-upgrade-from-stable-4.14-e2e-metal-ipi-sdn-bm-upgrade (all) - 5 runs, 100% failed, 40% of failures match = 40% impact periodic-ci-openshift-release-master-nightly-4.15-upgrade-from-stable-4.14-e2e-metal-ipi-upgrade-ovn-ipv6 (all) - 6 runs, 100% failed, 50% of failures match = 50% impact periodic-ci-openshift-release-master-okd-4.15-e2e-aws-ovn-upgrade (all) - 19 runs, 63% failed, 33% of failures match = 21% impact periodic-ci-openshift-release-master-okd-scos-4.15-e2e-aws-ovn-upgrade (all) - 15 runs, 47% failed, 57% of failures match = 27% impact
I'm not sure if all of those are from this system:anonymous issue, or if some of them are other mechanisms. Ideally we fix all of the Available=False noise, while, again, still going Available=False when it is worth summoning an admin immediately. Checking for different reason and message strings in recent 4.15-touching update runs:
$ curl -s 'https://search.ci.openshift.org/search?maxAge=48h&type=junit&name=4.15.*upgrade&context=0&search=clusteroperator/operator-lifecycle-manager-packageserver.*condition/Available.*status/False' | jq -r 'to_entries[].value | to_entries[].value[].context[]' | sed 's|.*clusteroperator/\([^ ]*\) condition/Available reason/\([^ ]*\) status/False.*message: \(.*\)|\1 \2 \3|' | sort | uniq -c | sort -n 3 operator-lifecycle-manager-packageserver ClusterServiceVersionNotSucceeded APIService install failed: Unauthorized 3 operator-lifecycle-manager-packageserver ClusterServiceVersionNotSucceeded install timeout 4 operator-lifecycle-manager-packageserver ClusterServiceVersionNotSucceeded install strategy failed: Operation cannot be fulfilled on apiservices.apiregistration.k8s.io "v1.packages.operators.coreos.com": the object has been modified; please apply your changes to the latest version and try again 9 operator-lifecycle-manager-packageserver ClusterServiceVersionNotSucceeded apiServices not installed 23 operator-lifecycle-manager-packageserver ClusterServiceVersionNotSucceeded install strategy failed: could not create service packageserver-service: services "packageserver-service" already exists 82 operator-lifecycle-manager-packageserver ClusterServiceVersionNotSucceeded APIService install failed: forbidden: User "system:anonymous" cannot get path "/apis/packages.operators.coreos.com/v1"
Lots of hits in the above CI search. Running one of the 100% impact flavors has a good chance at reproducing.
1. Install 4.14
2. Update to 4.15
3. Keep an eye on operator-lifecycle-manager-packageserver's ClusterOperator Available.
Available=False blips.
Available=True the whole time, or any Available=False looks like a serious issue where summoning an admin would have been appropriate.
This also causes the following test cases to fail (mentioning them here so Sippy links relevant component readiness failures back to this issue):
Please review the following PR: https://github.com/openshift/csi-livenessprobe/pull/55
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/machine-api-provider-aws/pull/94
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
pod cannot be ready due to incompatible CNI versions, see: Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_testpod1gjvg8_z1_1cdbb285-d4f4-4fbb-8e30-16933315aa65_0(8a7067a7914fbf21f0f083a97be5ac48aa562ecc21472780d4cc2af3e5b7784e): error adding pod z1_testpod1gjvg8 to CNI network "multus-cni-network": plugin type="multus-shim" name="multus-cni-network" failed (add): CmdAdd (shim): CNI request failed with status 400: 'ContainerID:"8a7067a7914fbf21f0f083a97be5ac48aa562ecc21472780d4cc2af3e5b7784e" Netns:"/var/run/netns/fd68f325-4141-49be-8840-a48e51c5b76d" IfName:"eth0" Args:"IgnoreUnknown=1;K8S_POD_NAMESPACE=z1;K8S_POD_NAME=testpod1gjvg8;K8S_POD_INFRA_CONTAINER_ID=8a7067a7914fbf21f0f083a97be5ac48aa562ecc21472780d4cc2af3e5b7784e;K8S_POD_UID=1cdbb285-d4f4-4fbb-8e30-16933315aa65" Path:"" ERRORED: error configuring pod [z1/testpod1gjvg8] networking: [z1/testpod1gjvg8/1cdbb285-d4f4-4fbb-8e30-16933315aa65:static-sriovnetwork]: error adding container to network "static-sriovnetwork": failed to set up IPAM plugin type "whereabouts" from the device "ens2f0": incompatible CNI versions; config is "1.0.0", plugin supports ["0.1.0" "0.2.0" "0.3.0" "0.3.1" "0.4.0"]
Version-Release number of selected component (if applicable):
sriov-network-operator.v4.16.0-202405110441
How reproducible:
always
Steps to Reproduce:
1. Create cluster with sriov operator
2. Create VF by snnp
3. Create NAD by sriovnetwork as below

# cat sriovnetwork-whereabouts
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetwork
metadata:
  name: static-sriovnetwork
  namespace: openshift-sriov-network-operator
spec:
  ipam: |
    {
      "type": "whereabouts",
      "range":"10.31.0.0/30"
    }
  capabilities: |
    {
      "mac": true,
      "ips": true
    }
  spoofChk: "off"
  trust: "on"
  resourceName: e810
  networkNamespace: z1

4. the NAD generated cni version is 1.0.0

# oc get network-attachment-definitions.k8s.cni.cncf.io -n z1 -o yaml
apiVersion: v1
items:
- apiVersion: k8s.cni.cncf.io/v1
  kind: NetworkAttachmentDefinition
  metadata:
    annotations:
      k8s.v1.cni.cncf.io/resourceName: openshift.io/e810
    creationTimestamp: "2024-05-13T02:24:39Z"
    generation: 1
    name: static-sriovnetwork
    namespace: z1
    resourceVersion: "380833"
    uid: 62d05e42-24c0-4427-bcd7-77d02fce31fb
  spec:
    config: |-
      {
        "cniVersion": "1.0.0",
        "name": "static-sriovnetwork",
        "type": "sriov",
        "vlan": 0,
        "spoofchk": "off",

5. Create test pod with above NAD network
Actual results:
Warning FailedCreatePodSandBox 25s kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_testpod1gjvg8_z1_1cdbb285-d4f4-4fbb-8e30-16933315aa65_0(8a7067a7914fbf21f0f083a97be5ac48aa562ecc21472780d4cc2af3e5b7784e): error adding pod z1_testpod1gjvg8 to CNI network "multus-cni-network": plugin type="multus-shim" name="multus-cni-network" failed (add): CmdAdd (shim): CNI request failed with status 400: 'ContainerID:"8a7067a7914fbf21f0f083a97be5ac48aa562ecc21472780d4cc2af3e5b7784e" Netns:"/var/run/netns/fd68f325-4141-49be-8840-a48e51c5b76d" IfName:"eth0" Args:"IgnoreUnknown=1;K8S_POD_NAMESPACE=z1;K8S_POD_NAME=testpod1gjvg8;K8S_POD_INFRA_CONTAINER_ID=8a7067a7914fbf21f0f083a97be5ac48aa562ecc21472780d4cc2af3e5b7784e;K8S_POD_UID=1cdbb285-d4f4-4fbb-8e30-16933315aa65" Path:"" ERRORED: error configuring pod [z1/testpod1gjvg8] networking: [z1/testpod1gjvg8/1cdbb285-d4f4-4fbb-8e30-16933315aa65:static-sriovnetwork]: error adding container to network "static-sriovnetwork": failed to set up IPAM plugin type "whereabouts" from the device "ens2f0": incompatible CNI versions; config is "1.0.0", plugin supports ["0.1.0" "0.2.0" "0.3.0" "0.3.1" "0.4.0"]
Expected results:
Additional info:
This is likely caused by this change: https://github.com/openshift/sriov-network-operator/commit/ace40a0f6b8d32c34a05fc680130c1a358d90fbd
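To confirm the mismatch on an affected cluster, the cniVersion the operator wrote into the generated NAD can be inspected directly (a sketch using the namespace and NAD name from the steps above):

oc get net-attach-def static-sriovnetwork -n z1 -o jsonpath='{.spec.config}' | jq -r .cniVersion
# prints "1.0.0" on an affected cluster, while the whereabouts error above only lists 0.1.0 through 0.4.0 as supported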
Please review the following PR: https://github.com/openshift/oauth-apiserver/pull/101
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
The OpenShift Assisted Installer reports the Dell PowerEdge C6615 node's four 960GB SATA solid state disks as removable and subsequently refuses to continue installing OpenShift onto any of them. The Linux kernel reports:

sd 4:0:0:0 [sdb] Attached SCSI removable disk
sd 5:0:0:0 [sdc] Attached SCSI removable disk
sd 6:0:0:0 [sdd] Attached SCSI removable disk
sd 3:0:0:0 [sda] Attached SCSI removable disk

Each removable disk is clean, with 894.3GiB of free space, no partitions, etc. However the host is reported as Insufficient: "This host does not meet the minimum hardware or networking requirements and will not be included in the cluster. Hardware: Failed" with the warning alert "Insufficient Minimum disks of required size: No eligible disks were found, please check specific disks to see why they are not eligible."
Version-Release number of selected component (if applicable):
4.15.z
How reproducible:
100 %
Steps to Reproduce:
1. Install with assisted Installer
2. Generate ISO using option over console.
3. Boot the ISO on dell HW mentioned in description
4. Observe journal logs for disk validations
Actual results:
Installation fails at disk validation
Expected results:
Installation should complete
Additional info:
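To see why the agent treats these SSDs as removable, it can help to check the kernel's own removable flag on the affected host from the discovery ISO shell (a sketch; device names are examples):

lsblk -o NAME,RM,SIZE,MODEL,TRAN      # RM=1 means the kernel reports the disk as removable
cat /sys/block/sd{a,b,c,d}/removable  # same flag, read directly from sysfs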
Description of problem:
When bootstrap logs are collected (e.g. as part of a CI run when bootstrapping fails), it no longer contains most of the Ironic services. They used to be run in standalone pods, but after a recent refactoring, they are systemd services.
Description of problem:
When using oc-mirror with the v2 format, the .oc-mirror directory is saved to the user's home directory by default, and the data is very large. There is currently no flag to specify the path for the logs; they should be saved to the working directory instead.
Version-Release number of selected component (if applicable):
oc-mirror version WARNING: This version information is deprecated and will be replaced with the output from --short. Use --output=yaml|json to get the full version. Client Version: version.Info{Major:"", Minor:"", GitVersion:"4.15.0-202312011230.p0.ge4022d0.assembly.stream-e4022d0", GitCommit:"e4022d08586406f3a0f92bab1d3ea6cb8856b4fa", GitTreeState:"clean", BuildDate:"2023-12-01T12:48:12Z", GoVersion:"go1.20.10 X:strictfipsruntime", Compiler:"gc", Platform:"linux/amd64"}
How reproducible:
always
Steps to Reproduce:
run command : oc-mirror --from file://out docker://localhost:5000/ocptest --v2 --config config.yaml
Actual results:
The logs are saved to the user's home directory.
Expected results:
It would be better to have a flag to specify where to save the logs, or to use the working directory.
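To see how much data the current behaviour leaves in the home directory, something like this can be run after a mirror operation (a sketch; the path is taken from the description above):

du -sh ~/.oc-mirror   # total size of the data left behind
ls -la ~/.oc-mirror   # logs and working data written outside the working directory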
Description of problem:
In LGW (local gateway) mode, when a pod is selected by an EIP that's hosted by an interface that isn't the default interface, connections to the node IP fail.
Version-Release number of selected component (if applicable):
4.14
How reproducible:
always
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
This is a clone of issue OCPBUGS-33745. The following is the description of the original issue:
—
Description of problem:
Follow up the step described in https://github.com/openshift/installer/pull/8350 to destroy bootstrap server manually, failed with error `FATAL error destroying bootstrap resources failed to delete bootstrap machine: machines.cluster.x-k8s.io "jimatest-5sjqx-bootstrap" not found` # ./openshift-install version ./openshift-install 4.16.0-0.nightly-2024-05-15-001800 built from commit 494b79cf906dc192b8d1a6d98e56ce1036ea932f release image registry.ci.openshift.org/ocp/release@sha256:d055d117027aa9afff8af91da4a265b7c595dc3ded73a2bca71c3161b28d9d5d release architecture amd64 On AWS: # ./openshift-install create cluster --dir ipi-aws INFO Credentials loaded from the "default" profile in file "/root/.aws/credentials" WARNING failed to find default instance type: no instance type found for the zone constraint WARNING failed to find default instance type for worker pool: no instance type found for the zone constraint INFO Consuming Install Config from target directory WARNING failed to find default instance type: no instance type found for the zone constraint WARNING FeatureSet "TechPreviewNoUpgrade" is enabled. This FeatureSet does not allow upgrades and may affect the supportability of the cluster. INFO Creating infrastructure resources... INFO Creating IAM roles for control-plane and compute nodes INFO Started local control plane with envtest INFO Stored kubeconfig for envtest in: /tmp/jima/ipi-aws/auth/envtest.kubeconfig INFO Running process: Cluster API with args [-v=2 --diagnostics-address=0 --health-addr=127.0.0.1:44379 --webhook-port=44331 --webhook-cert-dir=/tmp/envtest-serving-certs-1391600832] INFO Running process: aws infrastructure provider with args [-v=4 --diagnostics-address=0 --health-addr=127.0.0.1:42725 --webhook-port=45711 --webhook-cert-dir=/tmp/envtest-serving-certs-1758849099 --feature-gates=BootstrapFormatIgnition=true,ExternalResourceGC=true] INFO Created manifest *v1.Namespace, namespace= name=openshift-cluster-api-guests INFO Created manifest *v1beta2.AWSClusterControllerIdentity, namespace= name=default INFO Created manifest *v1beta1.Cluster, namespace=openshift-cluster-api-guests name=jima16a-2xszh INFO Created manifest *v1beta2.AWSCluster, namespace=openshift-cluster-api-guests name=jima16a-2xszh INFO Waiting up to 15m0s (until 11:01PM EDT) for network infrastructure to become ready... 
INFO Network infrastructure is ready INFO Creating private Hosted Zone INFO Creating Route53 records for control plane load balancer INFO Created manifest *v1beta2.AWSMachine, namespace=openshift-cluster-api-guests name=jima16a-2xszh-bootstrap INFO Created manifest *v1beta2.AWSMachine, namespace=openshift-cluster-api-guests name=jima16a-2xszh-master-0 INFO Created manifest *v1beta2.AWSMachine, namespace=openshift-cluster-api-guests name=jima16a-2xszh-master-1 INFO Created manifest *v1beta2.AWSMachine, namespace=openshift-cluster-api-guests name=jima16a-2xszh-master-2 INFO Created manifest *v1beta1.Machine, namespace=openshift-cluster-api-guests name=jima16a-2xszh-bootstrap INFO Created manifest *v1beta1.Machine, namespace=openshift-cluster-api-guests name=jima16a-2xszh-master-0 INFO Created manifest *v1beta1.Machine, namespace=openshift-cluster-api-guests name=jima16a-2xszh-master-1 INFO Created manifest *v1beta1.Machine, namespace=openshift-cluster-api-guests name=jima16a-2xszh-master-2 INFO Created manifest *v1.Secret, namespace=openshift-cluster-api-guests name=jima16a-2xszh-bootstrap INFO Created manifest *v1.Secret, namespace=openshift-cluster-api-guests name=jima16a-2xszh-master INFO Waiting up to 15m0s (until 11:07PM EDT) for machines to provision... INFO Control-plane machines are ready INFO Cluster API resources have been created. Waiting for cluster to become ready... INFO Waiting up to 20m0s (until 11:12PM EDT) for the Kubernetes API at https://api.jima16a.qe.devcluster.openshift.com:6443... INFO API v1.29.4+4a87b53 up INFO Waiting up to 30m0s (until 11:25PM EDT) for bootstrapping to complete... ^CWARNING Received interrupt signal INFO Shutting down local Cluster API control plane... INFO Stopped controller: Cluster API INFO Stopped controller: aws infrastructure provider INFO Local Cluster API system has completed operations # ./openshift-install destroy bootstrap --dir ipi-aws INFO Started local control plane with envtest INFO Stored kubeconfig for envtest in: /tmp/jima/ipi-aws/auth/envtest.kubeconfig INFO Running process: Cluster API with args [-v=2 --diagnostics-address=0 --health-addr=127.0.0.1:45869 --webhook-port=43141 --webhook-cert-dir=/tmp/envtest-serving-certs-3670728979] INFO Running process: aws infrastructure provider with args [-v=4 --diagnostics-address=0 --health-addr=127.0.0.1:46111 --webhook-port=35061 --webhook-cert-dir=/tmp/envtest-serving-certs-3674093147 --feature-gates=BootstrapFormatIgnition=true,ExternalResourceGC=true] FATAL error destroying bootstrap resources failed to delete bootstrap machine: machines.cluster.x-k8s.io "jima16a-2xszh-bootstrap" not found INFO Shutting down local Cluster API control plane... INFO Stopped controller: Cluster API INFO Stopped controller: aws infrastructure provider INFO Local Cluster API system has completed operations Same issue on vSphere: # ./openshift-install create cluster --dir ipi-vsphere/ INFO Consuming Install Config from target directory WARNING FeatureSet "CustomNoUpgrade" is enabled. This FeatureSet does not allow upgrades and may affect the supportability of the cluster. INFO Creating infrastructure resources... 
INFO Started local control plane with envtest INFO Stored kubeconfig for envtest in: /tmp/jima/ipi-vsphere/auth/envtest.kubeconfig INFO Running process: Cluster API with args [-v=2 --diagnostics-address=0 --health-addr=127.0.0.1:39945 --webhook-port=36529 --webhook-cert-dir=/tmp/envtest-serving-certs-3244100953] INFO Running process: vsphere infrastructure provider with args [-v=2 --diagnostics-address=0 --health-addr=127.0.0.1:45417 --webhook-port=37503 --webhook-cert-dir=/tmp/envtest-serving-certs-3224060135 --leader-elect=false] INFO Created manifest *v1.Namespace, namespace= name=openshift-cluster-api-guests INFO Created manifest *v1beta1.Cluster, namespace=openshift-cluster-api-guests name=jimatest-5sjqx INFO Created manifest *v1beta1.VSphereCluster, namespace=openshift-cluster-api-guests name=jimatest-5sjqx INFO Created manifest *v1.Secret, namespace=openshift-cluster-api-guests name=vsphere-creds INFO Waiting up to 15m0s (until 10:47PM EDT) for network infrastructure to become ready... INFO Network infrastructure is ready INFO Created manifest *v1beta1.VSphereMachine, namespace=openshift-cluster-api-guests name=jimatest-5sjqx-bootstrap INFO Created manifest *v1beta1.VSphereMachine, namespace=openshift-cluster-api-guests name=jimatest-5sjqx-master-0 INFO Created manifest *v1beta1.VSphereMachine, namespace=openshift-cluster-api-guests name=jimatest-5sjqx-master-1 INFO Created manifest *v1beta1.VSphereMachine, namespace=openshift-cluster-api-guests name=jimatest-5sjqx-master-2 INFO Created manifest *v1beta1.Machine, namespace=openshift-cluster-api-guests name=jimatest-5sjqx-bootstrap INFO Created manifest *v1beta1.Machine, namespace=openshift-cluster-api-guests name=jimatest-5sjqx-master-0 INFO Created manifest *v1beta1.Machine, namespace=openshift-cluster-api-guests name=jimatest-5sjqx-master-1 INFO Created manifest *v1beta1.Machine, namespace=openshift-cluster-api-guests name=jimatest-5sjqx-master-2 INFO Created manifest *v1.Secret, namespace=openshift-cluster-api-guests name=jimatest-5sjqx-bootstrap INFO Created manifest *v1.Secret, namespace=openshift-cluster-api-guests name=jimatest-5sjqx-master INFO Waiting up to 15m0s (until 10:47PM EDT) for machines to provision... INFO Control-plane machines are ready INFO Cluster API resources have been created. Waiting for cluster to become ready... INFO Waiting up to 20m0s (until 10:57PM EDT) for the Kubernetes API at https://api.jimatest.qe.devcluster.openshift.com:6443... INFO API v1.29.4+4a87b53 up INFO Waiting up to 1h0m0s (until 11:37PM EDT) for bootstrapping to complete... ^CWARNING Received interrupt signal INFO Shutting down local Cluster API control plane... 
INFO Stopped controller: Cluster API INFO Stopped controller: vsphere infrastructure provider INFO Local Cluster API system has completed operations # ./openshift-install destroy bootstrap --dir ipi-vsphere/ INFO Started local control plane with envtest INFO Stored kubeconfig for envtest in: /tmp/jima/ipi-vsphere/auth/envtest.kubeconfig INFO Running process: Cluster API with args [-v=2 --diagnostics-address=0 --health-addr=127.0.0.1:34957 --webhook-port=34511 --webhook-cert-dir=/tmp/envtest-serving-certs-94748118] INFO Running process: vsphere infrastructure provider with args [-v=2 --diagnostics-address=0 --health-addr=127.0.0.1:42073 --webhook-port=46721 --webhook-cert-dir=/tmp/envtest-serving-certs-4091171333 --leader-elect=false] FATAL error destroying bootstrap resources failed to delete bootstrap machine: machines.cluster.x-k8s.io "jimatest-5sjqx-bootstrap" not found INFO Shutting down local Cluster API control plane... INFO Stopped controller: Cluster API INFO Stopped controller: vsphere infrastructure provider INFO Local Cluster API system has completed operations
Version-Release number of selected component (if applicable):
4.16.0-0.nightly-2024-05-15-001800
How reproducible:
Always
Steps to Reproduce:
1. Create cluster
2. Interrupt the installation when waiting for bootstrap completed
3. Run command "openshift-install destroy bootstrap --dir <dir>" to destroy bootstrap manually
Actual results:
Failed to destroy bootstrap through command 'openshift-install destroy bootstrap --dir <dir>'
Expected results:
Bootstrap host is destroyed successfully
Additional info:
Please review the following PR: https://github.com/openshift/cluster-api-operator/pull/32
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/cloud-provider-powervs/pull/62
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/cluster-ingress-operator/pull/1006
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/cluster-autoscaler-operator/pull/303
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Seen in a 4.17 nightly-to-nightly CI update:
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.17-e2e-aws-ovn-upgrade/1809154554084724736/artifacts/e2e-aws-ovn-upgrade/gather-extra/artifacts/events.json | jq -r '.items[] | select(.metadata.namespace == "openshift-machine-config-operator") | .reason' | sort | uniq -c | sort -n | tail -n3 82 Pulled 82 Started 2116 ValidatingAdmissionPolicyUpdated $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.17-e2e-aws-ovn-upgrade/1809154554084724736/artifacts/e2e-aws-ovn-upgrade/gather-extra/artifacts/events.json | jq -r '.items[] | select(.metadata.namespace == "openshift-machine-config-operator" and .reason == "ValidatingAdmissionPolicyUpdated").message' | sort | uniq -c 705 Updated ValidatingAdmissionPolicy.admissionregistration.k8s.io/machine-configuration-guards because it changed 705 Updated ValidatingAdmissionPolicy.admissionregistration.k8s.io/managed-bootimages-platform-check because it changed 706 Updated ValidatingAdmissionPolicy.admissionregistration.k8s.io/mcn-guards because it changed
I'm not sure what those are about (which may be a bug on its own? It would be nice to know what changed), but it smells like a hot loop to me.
Seen in 4.17. It's not clear yet how to audit for exposure frequency or versions, short of teaching the origin test suite to fail if it sees too many of these kinds of events. Maybe an openshift-* namespaces version of the current "events should not repeat pathologically in e2e namespaces" test case? We may already have one, but it's not tripping.
Besides the initial update, also seen in this 4.17.0-0.nightly-2024-07-05-091056 serial run:
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.17-e2e-aws-ovn-serial/1809154615350923264/artifacts/e2e-aws-ovn-serial/gather-extra/artifacts/events.json | jq -r '.items[] | select(.metadata.namespace == "openshift-machine-config-operator" and .reason == "ValidatingAdmissionPolicyUpdated").message' | sort | uniq -c 1006 Updated ValidatingAdmissionPolicy.admissionregistration.k8s.io/machine-configuration-guards because it changed 1006 Updated ValidatingAdmissionPolicy.admissionregistration.k8s.io/managed-bootimages-platform-check because it changed 1007 Updated ValidatingAdmissionPolicy.admissionregistration.k8s.io/mcn-guards because it changed
So possibly every time, in all 4.17 clusters?
1. Unclear. Possibly just install 4.17.
2. Run oc -n openshift-machine-config-operator get -o json events | jq -r '.items[] | select(.reason == "ValidatingAdmissionPolicyUpdated")'.
Thousands of hits.
Zero to few hits.
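To confirm whether the MCO really is rewriting these policies in a tight loop (rather than just emitting noisy events), watching the objects' resourceVersion is a quick check; a sketch:

# resourceVersion should be stable on a healthy cluster; rapid, continuous changes indicate a hot loop
oc get validatingadmissionpolicies -o custom-columns=NAME:.metadata.name,RV:.metadata.resourceVersion -w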
This is a clone of issue OCPBUGS-39458. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-37819. The following is the description of the original issue:
—
Description of problem:
When we added the new bundle metadata encoding `olm.csv.metadata` in https://github.com/operator-framework/operator-registry/pull/1094 (downstreamed for 4.15+) we created situations where:
- Konflux-onboarded operators, encouraged to use upstream:latest to generate FBC from templates, can generate the new format; and
- IIB-generated catalog images which used earlier opm versions to serve content are not able to serve it.
One only has to `opm render` an SQLite catalog image, or expand a catalog template.
Version-Release number of selected component (if applicable):
How reproducible:
every time
Steps to Reproduce:
1. opm render an SQLite catalog image 2. 3.
Actual results:
uses `olm.csv.metadata` in the output
Expected results:
only using `olm.bundle.object` in the output
Additional info:
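A quick way to check which encoding a rendered catalog actually contains (a sketch; requires opm and jq locally, and the image reference is a placeholder):

opm render registry.example.com/my/sqlite-catalog-index:v4.13 \
  | jq -r 'select(.schema == "olm.bundle") | .properties[]?.type' \
  | sort | uniq -c
# olm.csv.metadata in the output reproduces the problem; only olm.bundle.object is the expected result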
Please review the following PR: https://github.com/openshift/router/pull/546
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
This is a clone of issue OCPBUGS-34791. The following is the description of the original issue:
—
add-flow-ci.feature test is flaking sporadically for both console and console-operator repositories.
Running: add-flow-ci.feature (1 of 1) [23798:0602/212526.775826:ERROR:zygote_host_impl_linux.cc(273)] Failed to adjust OOM score of renderer with pid 24169: Permission denied (13) Couldn't determine Mocha version Logging in as test Create the different workloads from Add page redirect to home ensure perspective switcher is set to Developer ✓ Getting started resources on Developer perspective (16906ms) redirect to home ensure perspective switcher is set to Developer Select Template category CI/CD You are on Topology page - Graph view ✓ Deploy Application using Catalog Template "CI/CD": A-01-TC02 (example #1) (27858ms) redirect to home ensure perspective switcher is set to Developer Select Template category Databases You are on Topology page - Graph view ✓ Deploy Application using Catalog Template "Databases": A-01-TC02 (example #2) (29800ms) redirect to home ensure perspective switcher is set to Developer Select Template category Languages You are on Topology page - Graph view ✓ Deploy Application using Catalog Template "Languages": A-01-TC02 (example #3) (38286ms) redirect to home ensure perspective switcher is set to Developer Select Template category Middleware You are on Topology page - Graph view ✓ Deploy Application using Catalog Template "Middleware": A-01-TC02 (example #4) (30501ms) redirect to home ensure perspective switcher is set to Developer Select Template category Other You are on Topology page - Graph view ✓ Deploy Application using Catalog Template "Other": A-01-TC02 (example #5) (35567ms) redirect to home ensure perspective switcher is set to Developer Application Name "sample-app" is created Resource type "deployment" is selected You are on Topology page - Graph view ✓ Deploy secure image with Runtime icon from external registry: A-02-TC02 (example #1) (28896ms) redirect to home ensure perspective switcher is set to Developer Application Name "sample-app" is selected Resource type "deployment" is selected You are on Topology page - Graph view ✓ Deploy image with Runtime icon from internal registry: A-02-TC03 (example #1) (23555ms) redirect to home ensure perspective switcher is set to Developer Resource type "deployment" is selected You are on Topology page - Graph view You are on Topology page - Graph view You are on Topology page - Graph view ✓ Edit Runtime Icon while Editing Image: A-02-TC05 (47438ms) redirect to home ensure perspective switcher is set to Developer You are on Topology page - Graph view ✓ Create the Database from Add page: A-03-TC01 (19645ms) redirect to home ensure perspective switcher is set to Developer redirect to home ensure perspective switcher is set to Developer 1) Deploy git workload with devfile from topology page: A-04-TC01 redirect to home ensure perspective switcher is set to Developer Resource type "Deployment" is selected You are on Topology page - Graph view ✓ Create a workload from Docker file with "Deployment" as resource type: A-05-TC02 (example #1) (43434ms) redirect to home ensure perspective switcher is set to Developer You are on Topology page - Graph view ✓ Create a workload from YAML file: A-07-TC01 (31905ms) redirect to home ensure perspective switcher is set to Developer ✓ Upload Jar file page details: A-10-TC01 (24692ms) redirect to home ensure perspective switcher is set to Developer You are on Topology page - Graph view ✓ Create Sample Application from Add page: GS-03-TC05 (example #1) (40882ms) redirect to home ensure perspective switcher is set to Developer You are on Topology page - Graph view ✓ Create Sample Application 
from Add page: GS-03-TC05 (example #2) (52287ms) redirect to home ensure perspective switcher is set to Developer ✓ Quick Starts page when no Quick Start has started: QS-03-TC02 (23439ms) redirect to home ensure perspective switcher is set to Developer quick start is complete ✓ Quick Starts page when Quick Start has completed: QS-03-TC03 (28139ms) 17 passing (10m) 1 failing 1) Create the different workloads from Add page Deploy git workload with devfile from topology page: A-04-TC01: CypressError: `cy.focus()` can only be called on a single element. Your subject contained 14 elements. https://on.cypress.io/focus at Context.focus (https://console-openshift-console.apps.ci-op-lm9pvf4l-be832.origin-ci-int-aws.dev.rhcloud.com/__cypress/runner/cypress_runner.js:112944:70) at wrapped (https://console-openshift-console.apps.ci-op-lm9pvf4l-be832.origin-ci-int-aws.dev.rhcloud.com/__cypress/runner/cypress_runner.js:138021:19) From Your Spec Code: at Context.eval (webpack:///./support/step-definitions/addFlow/create-from-devfile.ts:10:59) at Context.resolveAndRunStepDefinition (webpack:////go/src/github.com/openshift/console/frontend/node_modules/cypress-cucumber-preprocessor/lib/resolveStepDefinition.js:217:0) at Context.eval (webpack:////go/src/github.com/openshift/console/frontend/node_modules/cypress-cucumber-preprocessor/lib/createTestFromScenario.js:26:0) [mochawesome] Report JSON saved to /go/src/github.com/openshift/console/frontend/gui_test_screenshots/cypress_report_devconsole.json (Results) ┌────────────────────────────────────────────────────────────────────────────────────────────────┐ │ Tests: 18 │ │ Passing: 17 │ │ Failing: 1 │ │ Pending: 0 │ │ Skipped: 0 │ │ Screenshots: 2 │ │ Video: false │ │ Duration: 10 minutes, 0 seconds │ │ Spec Ran: add-flow-ci.feature │ └────────────────────────────────────────────────────────────────────────────────────────────────┘ (Screenshots) - /go/src/github.com/openshift/console/frontend/gui_test_screenshots/cypress/scree (1280x720) nshots/add-flow-ci.feature/Create the different workloads from Add page -- Deplo y git workload with devfile from topology page A-04-TC01 (failed).png - /go/src/github.com/openshift/console/frontend/gui_test_screenshots/cypress/scree (1280x720) nshots/add-flow-ci.feature/Create the different workloads from Add page -- Deplo y git workload with devfile from topology page A-04-TC01 (failed) (attempt 2).pn g ==================================================================================================== (Run Finished) Spec Tests Passing Failing Pending Skipped ┌────────────────────────────────────────────────────────────────────────────────────────────────┐ │ ✖ add-flow-ci.feature 10:00 18 17 1 - - │ └────────────────────────────────────────────────────────────────────────────────────────────────┘ ✖ 1 of 1 failed (100%) 10:00 18 17 1 - -
Please review the following PR: https://github.com/openshift/apiserver-network-proxy/pull/46
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
The code requires the `s3:HeadBucket` permission (https://github.com/openshift/cloud-credential-operator/blob/master/pkg/aws/utils.go#L57) but no such IAM action exists. The AWS docs say the permission needed for HeadBucket is `s3:ListBucket`: https://docs.aws.amazon.com/AmazonS3/latest/API/API_HeadBucket.html
Version-Release number of selected component (if applicable):
4.16
How reproducible:
always
Steps to Reproduce:
1. Try to install cluster with minimal permissions without s3:HeadBucket 2. 3.
Actual results:
level=warning msg=Action not allowed with tested creds action=iam:DeleteUserPolicy level=warning msg=Tested creds not able to perform all requested actions level=warning msg=Action not allowed with tested creds action=s3:HeadBucket level=warning msg=Tested creds not able to perform all requested actions level=fatal msg=failed to fetch Cluster: failed to fetch dependency of "Cluster": failed to generate asset "Platform Permissions Check": validate AWS credentials: AWS credentials cannot be used to either create new creds or use as-is Installer exit with code 1
Expected results:
Only `s3:ListBucket` should be checked.
Additional info:
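For reference, a minimal IAM statement that satisfies the HeadBucket API call looks like this (a sketch; the bucket ARN is a placeholder and should be scoped appropriately):

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:ListBucket"],
      "Resource": "arn:aws:s3:::example-bucket"
    }
  ]
}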
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Please review the following PR: https://github.com/openshift/cluster-api-provider-libvirt/pull/273
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
The egress-router implementations under https://github.com/openshift/images/tree/master/egress have unit tests alongside the implementations, within the same repository, but the repository does not have a CI job to run those unit tests. We do not have any tests for egress-router in https://github.com/openshift/origin. This means that we are effectively lacking CI test coverage for egress-router.
All versions.
100%.
1. Open a PR in https://github.com/openshift/images and check which CI jobs are run on it.
2. Check the job definitions in https://github.com/openshift/release/blob/master/ci-operator/jobs/openshift/images/openshift-images-master-presubmits.yaml.
There are "ci/prow/e2e-aws", "ci/prow/e2e-aws-upgrade", and "ci/prow/images" jobs defined, but no "ci/prow/unit" job.
There should be a "ci/prow/unit" job, and this job should run the unit tests that are defined in the repository.
The lack of a CI job came up on https://github.com/openshift/images/pull/162.
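A sketch of what the missing job could look like in the repository's ci-operator configuration under openshift/release; the exact test command is an assumption about how the repo's unit tests are invoked:

tests:
- as: unit
  commands: make test   # assumed entrypoint for the egress-router unit tests
  container:
    from: src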
Please review the following PR: https://github.com/openshift/machine-config-operator/pull/4070
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
ART is moving the container images to be built by Golang 1.21. We should do the same to keep our build config in sync with ART.
Version-Release number of selected component (if applicable):
4.16/master
How reproducible:
always
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
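Keeping the build config in sync typically means bumping the builder image tag in the Dockerfile; a sketch, with the exact tag being an assumption about what ART publishes for 4.16:

FROM registry.ci.openshift.org/ocp/builder:rhel-9-golang-1.21-openshift-4.16 AS builder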
Description of problem:
When deploying a cluster on Power VS, you need to wait for a short period after the workspace is created to facilitate the network configuration. This period is ignored by the DHCP service.
Version-Release number of selected component (if applicable):
How reproducible:
Easily
Steps to Reproduce:
1. Deploy a cluster on Power VS with an installer provisioned workspace
2. Observe that the terraform logs ignore the waiting period
Actual results:
Expected results:
Additional info:
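A sketch of the kind of wait being asked for before the DHCP service is created; the polling interval, timeout, and isReady hook are assumptions, not the installer's actual values:

package example

import (
	"context"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// waitForWorkspaceReady gives the Power VS workspace time to finish its network
// configuration before the DHCP service is created.
func waitForWorkspaceReady(ctx context.Context, isReady func(ctx context.Context) (bool, error)) error {
	return wait.PollUntilContextTimeout(ctx, 10*time.Second, 10*time.Minute, true, isReady)
}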
Description of problem:
Follow-on bug for the story "Add option to enable/disable tailing to Pod log viewer". Issues:
1. The position property for the PF5 Dropdown component doesn't work. To reproduce: add the `position="right"` property to the `Dropdown` component; the position doesn't change in `"@patternfly/react-core": "5.1.0"`.
2. Clicking the `Checkbox` label wrapped with `DropdownItem` doesn't trigger the `onChange` on mobile screens.
3. The Expand button color is not blue on mobile due to replacing Button with DropdownItem.
4. The kebab toggle jumps to the screen top if already opened when resizing.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
Multipart upload issues with Cloudflare R2 using the S3 API. Some S3-compatible object storage systems like R2 require that all multipart chunks are the same size. This was mostly true before, except that the final chunk could be larger than the requested chunk size, which causes uploads to fail.
Version-Release number of selected component (if applicable):
How reproducible:
Problem shows itself on OpenShift CI clusters intermittently.
Steps to Reproduce:
This behavior has been causing 504 Gateway Timeout issues in the image registry instances in OpenShift CI clusters. It is connected to uploading big images (e.g. 35GB), but we do not currently have the exact steps that reproduce it. 1. 2. 3.
Actual results:
Expected results:
Additional info:
https://github.com/distribution/distribution/issues/3873 https://github.com/distribution/distribution/issues/3873#issuecomment-2258926705 https://developers.cloudflare.com/r2/api/workers/workers-api-reference/#r2multipartupload-definition (look for "uniform in size")
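A sketch of the constraint in code: every part except the last must be exactly the configured chunk size, and the last part must never exceed it (simplified; not the distribution driver's actual implementation):

package main

import "fmt"

// splitIntoParts returns the sizes of the multipart chunks for an object of
// totalSize bytes, keeping every part at exactly chunkSize except the final
// part, which may only be smaller - never larger.
func splitIntoParts(totalSize, chunkSize int64) []int64 {
	var parts []int64
	for remaining := totalSize; remaining > 0; remaining -= chunkSize {
		if remaining < chunkSize {
			parts = append(parts, remaining)
			break
		}
		parts = append(parts, chunkSize)
	}
	return parts
}

func main() {
	// A 35 GiB image uploaded in 100 MiB parts: 358 full parts plus one smaller tail part.
	fmt.Println(len(splitIntoParts(35<<30, 100<<20)))
}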
Description of problem:
4.15 nightly payloads have been affected by this test multiple times: : [sig-arch] events should not repeat pathologically for ns/openshift-kube-scheduler expand_less0s{ 1 events happened too frequently event happened 21 times, something is wrong: namespace/openshift-kube-scheduler node/ci-op-2gywzc86-aa265-5skmk-master-1 pod/openshift-kube-scheduler-guard-ci-op-2gywzc86-aa265-5skmk-master-1 hmsg/2652c73da5 - reason/ProbeError Readiness probe error: Get "https://10.0.0.7:10259/healthz": dial tcp 10.0.0.7:10259: connect: connection refused result=reject body: From: 08:41:08Z To: 08:41:09Z} In each of the 10 jobs aggregated, 2 to 3 jobs failed with this test. Historically this test passed 100%. But with the past two days test data, the passing rate has dropped to 97% and aggregator started allowing this in the latest payload: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/aggregated-azure-ovn-upgrade-4.15-micro-release-openshift-release-analysis-aggregator/1732295947339173888 The first payload this started appearing is https://amd64.ocp.releases.ci.openshift.org/releasestream/4.15.0-0.nightly/release/4.15.0-0.nightly-2023-12-05-071627. All the events happened during cluster-operator/kube-scheduler progressing. For comparison, here is a passed job: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.15-e2e-azure-ovn-upgrade/1731936539870498816 Here is a failed one: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.15-e2e-azure-ovn-upgrade/1731936538192777216 They both have the same set of probe error events. For the passing jobs, the frequency is lower than 20, while for the failed job, one of those events repeated more than 20 times and therefore results in the test failure.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
The Repositories list page breaks with a TypeError: cannot read properties of undefined (reading `pipelinesascode.tekton.dev/repository`).
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
https://drive.google.com/file/d/1TpH_PTyBxNX0b9SPZ2yS8b-q-tbvp6Ok/view?usp=sharing
Description of problem:
When using canary rollout, paused MCPs begin updating when the user triggers the cluster update.
Version-Release number of selected component (if applicable):
How reproducible:
Approximately 3 out of 10 times, based on what I have witnessed.
Steps to Reproduce:
1. Install cluster 2. Follow canary rollout strategy: https://docs.openshift.com/container-platform/4.11/updating/update-using-custom-machine-config-pools.html 3. Start cluster update
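For context, the canary strategy in the linked doc keeps custom worker pools paused with a patch like the one below before starting the update (the pool name is illustrative); per this bug, nodes in such pools started updating anyway:

```
$ oc patch mcp/workerpool-canary --type merge --patch '{"spec":{"paused":true}}'
$ oc get mcp workerpool-canary -o jsonpath='{.spec.paused}{"\n"}'
```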
Actual results:
Worker nodes in paused MCPs begin updating.
Expected results:
Worker nodes in paused MCPs do not begin updating until the cluster admin unpauses the MCPs.
Additional info:
This has occurred with my customer in their Azure self-managed cluster and their on-prem cluster in vSphere, as well as my lab cluster in vSphere.
Description of problem:
The problem was that the namespace handler, on initial sync, would delete all ports (because the logical port cache it got LSP UUIDs from wasn't populated yet) and all ACLs (they were simply set to nil). Even though both ports and ACLs are re-added by the corresponding handlers, this may cause disruption.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. create a namespace with at least 1 pod and egress firewall in it
2. Pick any ovnkube-node pod and find the namespace port group UUID in nbdb by external_ids["name"]=<namespace name>, e.g. for the "test" namespace (an example command is shown after these steps):
_uuid        : 6142932d-4084-4bc3-bdcb-1990fc71891b
acls         : [ab2be619-1266-41c2-bb1d-1052cb4e1e97, b90a4b4a-ceee-41ee-a801-08c37a9bf3e7, d314fa8d-7b5a-40a5-b3d4-31091d7b9eae]
external_ids : {name=test}
name         : a18007334074686647077
ports        : [55b700e4-8176-42e7-97a6-8b32a82fefe5, cb71739c-ad6c-4436-8fd6-0643a5417c7d, d8644bf1-6bed-4db7-abf8-7aaab0625324]
3. restart chosen ovn-k pod
4. Check the logs after the restart for an update that sets the chosen port group to zero ports and zero acls:
Update operations generated as: [{Op:update Table:Port_Group Row:map[acls:{GoSet:[]} external_ids:{GoMap:map[name:test]} ports:{GoSet:[]}] Rows:[] Columns:[] Mutations:[] Timeout:<nil> Where:[where column _uuid == {6142932d-4084-4bc3-bdcb-1990fc71891b}] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUID: UUIDName:}]
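For step 2 above, one way to look up the port group record is with ovn-nbctl's generic find command against the Port_Group table, run inside the nbdb container of the chosen pod (pod and container names vary by release, so treat this as a sketch):

```
$ oc -n openshift-ovn-kubernetes exec <ovnkube pod> -c nbdb -- \
    ovn-nbctl find Port_Group external_ids:name=test
```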
Actual results:
Expected results:
On restart port group stays the same, no extra update with empty ports and acls is generated
Additional info:
Please fill in the following template while reporting a bug and provide as much relevant information as possible. Doing so will give us the best chance to find a prompt resolution.
Affected Platforms:
Is it an
If it is an internal RedHat testing failure:
If it is a CI failure:
If it is a customer / SD issue:
This is a clone of issue OCPBUGS-35197. The following is the description of the original issue:
—
Description of problem:
The issue is found when QE is testing the minimal firewall list required by an AWS installation (https://docs.openshift.com/container-platform/4.15/installing/install_config/configuring-firewall.html) for 4.16. The way we're verifying this is by setting all the URLs listed in the doc into the whitelist of a proxy server[1] and adding the proxy to install-config.yaml, so addresses outside of the doc will be rejected by the proxy server during cluster installation.

[1] https://steps.ci.openshift.org/chain/proxy-whitelist-aws

We're seeing such errors from the masters' console:

```
[  344.982244] ignition[782]: GET https://api-int.ci-op-b2hcg02h-ce587.qe.devcluster.openshift.com:22623/config/master: attempt #73
[  344.985074] ignition[782]: GET error: Get "https://api-int.ci-op-b2hcg02h-ce587.qe.devcluster.openshift.com:22623/config/master": Forbidden
```

And the deny log from the proxy server:

```
1717653185.468 0 10.0.85.91 TCP_DENIED/403 2252 CONNECT api-int.ci-op-b2hcg02h-ce587.qe.devcluster.openshift.com:22623 - HIER_NONE/- text/html
```

So it looks like the master is using the proxy to reach the MCS address, and the internal API domain (api-int.ci-op-b2hcg02h-ce587.qe.devcluster.openshift.com) is not in the proxy whitelist, so the request is denied by the proxy. But such an internal API address should already be in the noProxy list, so the master shouldn't use the proxy for the internal request.

This is proxy info collected from another cluster; api-int.<cluster_domain> is added to the no-proxy list by default:

```
[root@ip-10-0-11-89 ~]# cat /etc/profile.d/proxy.sh
export HTTP_PROXY="http://ec2-3-16-83-95.us-east-2.compute.amazonaws.com:3128"
export HTTPS_PROXY="http://ec2-3-16-83-95.us-east-2.compute.amazonaws.com:3128"
export NO_PROXY=".cluster.local,.svc,.us-east-2.compute.internal,10.0.0.0/16,10.128.0.0/14,127.0.0.1,169.254.169.254,172.30.0.0/16,api-int.gpei-dis3.qe.devcluster.openshift.com,localhost,test.no-proxy.com"
```
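For reference, the proxy stanza in install-config.yaml used for this kind of test looks roughly like the sketch below (host names are illustrative). The installer is expected to add cluster-internal names such as api-int.<cluster_domain> to the effective no-proxy set automatically, which is why the ignition fetch above should not have gone through the proxy.

```
# install-config.yaml excerpt (illustrative values)
proxy:
  httpProxy: http://proxy.example.com:3128
  httpsProxy: http://proxy.example.com:3128
  noProxy: test.no-proxy.com
```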
Version-Release number of selected component (if applicable):
registry.ci.openshift.org/ocp/release:4.16.0-0.nightly-2024-06-02-202327
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
This is a clone of issue OCPBUGS-30949. The following is the description of the original issue:
—
Description of problem: After changing the value of enable_topology in the openshift-config/cloud-provider-config config map, the CSI controller pods should restart to pick up the new value. This is not happening.
It seems like our understanding in https://github.com/openshift/openstack-cinder-csi-driver-operator/pull/127#issuecomment-1780967488 was wrong.
Please review the following PR: https://github.com/openshift/cluster-openshift-controller-manager-operator/pull/321
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
The gstreamer1 package (and its plugins) includes certain video/audio codecs, which create licensing concerns for our Partners, who embed our solutions (OCP) and deliver them to their end customers. The ose-network-tools container image (seemingly applicable to all OCP releases) includes a dependency on the gstreamer1 rpm (and its plugin rpms, like gstreamer1-plugins-bad-free). The request is to reconsider this dependency and, if possible, remove it entirely. It is a blocking issue which prevents our partners from delivering their solution in the field. It is an indirect dependency: ose-network-tools includes wireshark, wireshark has a dependency on qt5-multimedia, which in turn depends on gstreamer1-plugins-bad-free. First question: is wireshark really needed for network-tools? Wireshark is a GUI tool, so the dependency is not clear. Second question: would wireshark-cli be sufficient for the needed purposes instead? The CLI version does not carry the qt5 dependency chain.
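A quick way to confirm the dependency chain described above from inside the ose-network-tools image is with rpm's reverse-requires queries (the exact Qt multimedia package name can vary by RHEL version, so treat these as a sketch):

```
# run inside the ose-network-tools container image
rpm -q --whatrequires gstreamer1-plugins-bad-free   # expect the Qt multimedia package
rpm -qR wireshark | grep -i qt                      # show wireshark's Qt requirements
rpm -qR wireshark-cli | grep -i qt                  # compare with the CLI-only package
```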
Version-Release number of selected component (if applicable):
Seems applicable to all active OCP releases.
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of the problem:
Until the latest release we had a test which tried to set the base DNS name to 11.11.11 and expected the BE to throw an exception; it was succeeding (the BE was throwing the exception).
Since the last release this is no longer the case and the DNS name is being accepted.
How reproducible:
Steps to reproduce:
1.create a cluster , set base dns name to 11.11.11
2.
3.
Actual results:
No exception is thrown; the DNS name 11.11.11 is accepted.
Expected results:
According to the discussion in the thread, the DNS name must start with a letter, so in this case we expect the BE to throw an exception.
https://prow.ci.openshift.org/job-history/gs/test-platform-results/pr-logs/directory/pull-ci-openshift-telemeter-master-integration shows that the job fails a lot while there was no recent change which could explain this.
It blocks merges to the telemeter repository for no valid reason.
This is a clone of issue OCPBUGS-35879. The following is the description of the original issue:
—
Description of problem:
Customer reports that in the OpenShift Container Platform for a single namespace they are seeing a "TypeError: Cannot read properties of null (reading 'metadata')" error when navigating to the Topology view (Developer Console):
TypeError: Cannot read properties of null (reading 'metadata') at s (https://console.apps.example.com/static/main-chunk-876b3080b765b87baa51.min.js:1:1220454) at s (https://console.apps.example.com/static/dev-console/code-refs/topology-chunk-e4ae65442e61628a832f.min.js:1:424007) at t.a (https://console.apps.example.com/static/dev-console/code-refs/topology-chunk-e4ae65442e61628a832f.min.js:1:330465) at na (https://console.apps.example.com/static/vendors~main-chunk-4b6445a3b3fc17bf0831.min.js:263:58879) at Hs (https://console.apps.example.com/static/vendors~main-chunk-4b6445a3b3fc17bf0831.min.js:263:111315) at xl (https://console.apps.example.com/static/vendors~main-chunk-4b6445a3b3fc17bf0831.min.js:263:98327) at Cl (https://console.apps.example.com/static/vendors~main-chunk-4b6445a3b3fc17bf0831.min.js:263:98255) at _l (https://console.apps.example.com/static/vendors~main-chunk-4b6445a3b3fc17bf0831.min.js:263:98118) at pl (https://console.apps.example.com/static/vendors~main-chunk-4b6445a3b3fc17bf0831.min.js:263:95105) at https://console.apps.example.com/static/vendors~main-chunk-4b6445a3b3fc17bf0831.min.js:263:44774
Screenshot is available in the linked Support Case. The following Stack Trace is shown:
at t.a (https://console.apps.example.com/static/dev-console/code-refs/topology-chunk-e4ae65442e61628a832f.min.js:1:330387) at g at f (https://console.apps.example.com/static/vendors~app/code-refs/actions~delete-revision~dev-console/code-refs/actions~dev-console/code-refs/ad~01887c45-chunk-0fc9a9eb8a528a7c580c.min.js:26:22249) at f (https://console.apps.example.com/static/vendors~app/code-refs/actions~delete-revision~dev-console/code-refs/actions~dev-console/code-refs/ad~01887c45-chunk-0fc9a9eb8a528a7c580c.min.js:26:22249) at f (https://console.apps.example.com/static/vendors~app/code-refs/actions~delete-revision~dev-console/code-refs/actions~dev-console/code-refs/ad~01887c45-chunk-0fc9a9eb8a528a7c580c.min.js:26:22249) at f (https://console.apps.example.com/static/vendors~app/code-refs/actions~delete-revision~dev-console/code-refs/actions~dev-console/code-refs/ad~01887c45-chunk-0fc9a9eb8a528a7c580c.min.js:26:22249) at f (https://console.apps.example.com/static/vendors~app/code-refs/actions~delete-revision~dev-console/code-refs/actions~dev-console/code-refs/ad~01887c45-chunk-0fc9a9eb8a528a7c580c.min.js:26:22249) at f (https://console.apps.example.com/static/vendors~app/code-refs/actions~delete-revision~dev-console/code-refs/actions~dev-console/code-refs/ad~01887c45-chunk-0fc9a9eb8a528a7c580c.min.js:26:22249) at f (https://console.apps.example.com/static/vendors~app/code-refs/actions~delete-revision~dev-console/code-refs/actions~dev-console/code-refs/ad~01887c45-chunk-0fc9a9eb8a528a7c580c.min.js:26:22249) at f (https://console.apps.example.com/static/vendors~app/code-refs/actions~delete-revision~dev-console/code-refs/actions~dev-console/code-refs/ad~01887c45-chunk-0fc9a9eb8a528a7c580c.min.js:26:22249) at g at f (https://console.apps.example.com/static/vendors~app/code-refs/actions~delete-revision~dev-console/code-refs/actions~dev-console/code-refs/ad~01887c45-chunk-0fc9a9eb8a528a7c580c.min.js:26:22249) at f (https://console.apps.example.com/static/vendors~app/code-refs/actions~delete-revision~dev-console/code-refs/actions~dev-console/code-refs/ad~01887c45-chunk-0fc9a9eb8a528a7c580c.min.js:26:22249) at g at a (https://console.apps.example.com/static/vendors~app/code-refs/actions~delete-revision~dev-console/code-refs/actions~dev-console/code-refs/ad~01887c45-chunk-0fc9a9eb8a528a7c580c.min.js:26:245070) at f (https://console.apps.example.com/static/vendors~app/code-refs/actions~delete-revision~dev-console/code-refs/actions~dev-console/code-refs/ad~01887c45-chunk-0fc9a9eb8a528a7c580c.min.js:26:22249) at g at f (https://console.apps.example.com/static/vendors~app/code-refs/actions~delete-revision~dev-console/code-refs/actions~dev-console/code-refs/ad~01887c45-chunk-0fc9a9eb8a528a7c580c.min.js:26:22249) at g at t.a (https://console.apps.example.com/static/dev-console/code-refs/topology-chunk-e4ae65442e61628a832f.min.js:1:426770) at f (https://console.apps.example.com/static/vendors~app/code-refs/actions~delete-revision~dev-console/code-refs/actions~dev-console/code-refs/ad~01887c45-chunk-0fc9a9eb8a528a7c580c.min.js:26:22249) at f (https://console.apps.example.com/static/vendors~app/code-refs/actions~delete-revision~dev-console/code-refs/actions~dev-console/code-refs/ad~01887c45-chunk-0fc9a9eb8a528a7c580c.min.js:26:22249) at f (https://console.apps.example.com/static/vendors~app/code-refs/actions~delete-revision~dev-console/code-refs/actions~dev-console/code-refs/ad~01887c45-chunk-0fc9a9eb8a528a7c580c.min.js:26:22249) at f 
(https://console.apps.example.com/static/vendors~app/code-refs/actions~delete-revision~dev-console/code-refs/actions~dev-console/code-refs/ad~01887c45-chunk-0fc9a9eb8a528a7c580c.min.js:26:22249) at f (https://console.apps.example.com/static/vendors~app/code-refs/actions~delete-revision~dev-console/code-refs/actions~dev-console/code-refs/ad~01887c45-chunk-0fc9a9eb8a528a7c580c.min.js:26:22249) at g at f (https://console.apps.example.com/static/vendors~app/code-refs/actions~delete-revision~dev-console/code-refs/actions~dev-console/code-refs/ad~01887c45-chunk-0fc9a9eb8a528a7c580c.min.js:26:22249) at a (https://console.apps.example.com/static/vendors~app/code-refs/actions~delete-revision~dev-console/code-refs/actions~dev-console/code-refs/ad~01887c45-chunk-0fc9a9eb8a528a7c580c.min.js:26:242507) at svg at div at https://console.apps.example.com/static/vendors~app/code-refs/actions~delete-revision~dev-console/code-refs/actions~dev-console/code-refs/ad~01887c45-chunk-0fc9a9eb8a528a7c580c.min.js:26:603940 at u (https://console.apps.example.com/static/vendors~app/code-refs/actions~delete-revision~dev-console/code-refs/actions~dev-console/code-refs/ad~01887c45-chunk-0fc9a9eb8a528a7c580c.min.js:26:602181) at f (https://console.apps.example.com/static/vendors~app/code-refs/actions~delete-revision~dev-console/code-refs/actions~dev-console/code-refs/ad~01887c45-chunk-0fc9a9eb8a528a7c580c.min.js:26:22249) at e.a (https://console.apps.example.com/static/vendors~app/code-refs/actions~delete-revision~dev-console/code-refs/actions~dev-console/code-refs/ad~01887c45-chunk-0fc9a9eb8a528a7c580c.min.js:26:398426) at div at https://console.apps.example.com/static/dev-console/code-refs/topology-chunk-e4ae65442e61628a832f.min.js:1:353461 at https://console.apps.example.com/static/dev-console/code-refs/topology-chunk-e4ae65442e61628a832f.min.js:1:354168 at s (https://console.apps.example.com/static/main-chunk-876b3080b765b87baa51.min.js:1:1405970) at S (https://console.apps.example.com/static/vendors~main-chunk-4b6445a3b3fc17bf0831.min.js:98:86864) at i (https://console.apps.example.com/static/main-chunk-876b3080b765b87baa51.min.js:1:452052) at withFallback(Connect(withUserSettingsCompatibility(undefined))) at div at div at c (https://console.apps.example.com/static/vendor-patternfly-core-chunk-cdcfdc55890623d5fc26.min.js:1:62178) at div at div at c (https://console.apps.example.com/static/vendor-patternfly-core-chunk-cdcfdc55890623d5fc26.min.js:1:545565) at d (https://console.apps.example.com/static/vendor-patternfly-core-chunk-cdcfdc55890623d5fc26.min.js:1:775077) at div at d (https://console.apps.example.com/static/vendor-patternfly-core-chunk-cdcfdc55890623d5fc26.min.js:1:458280) at div at div at c (https://console.apps.example.com/static/vendor-patternfly-core-chunk-cdcfdc55890623d5fc26.min.js:1:719437) at div at c (https://console.apps.example.com/static/vendor-patternfly-core-chunk-cdcfdc55890623d5fc26.min.js:1:9899) at div at https://console.apps.example.com/static/dev-console/code-refs/topology-chunk-e4ae65442e61628a832f.min.js:1:512628 at S (https://console.apps.example.com/static/vendors~main-chunk-4b6445a3b3fc17bf0831.min.js:98:86864) at t (https://console.apps.example.com/static/vendors~main-chunk-4b6445a3b3fc17bf0831.min.js:123:75018) at https://console.apps.example.com/static/dev-console/code-refs/topology-chunk-e4ae65442e61628a832f.min.js:1:511867 at https://console.apps.example.com/static/vendors~main-chunk-4b6445a3b3fc17bf0831.min.js:150:220157 at 
https://console.apps.example.com/static/main-chunk-876b3080b765b87baa51.min.js:1:375316 at div at R (https://console.apps.example.com/static/main-chunk-876b3080b765b87baa51.min.js:1:183146) at N (https://console.apps.example.com/static/main-chunk-876b3080b765b87baa51.min.js:1:183594) at f (https://console.apps.example.com/static/vendors~app/code-refs/actions~delete-revision~dev-console/code-refs/actions~dev-console/code-refs/ad~01887c45-chunk-0fc9a9eb8a528a7c580c.min.js:26:22249) at https://console.apps.example.com/static/dev-console/code-refs/topology-chunk-e4ae65442e61628a832f.min.js:1:509351 at https://console.apps.example.com/static/main-chunk-876b3080b765b87baa51.min.js:1:548866 at S (https://console.apps.example.com/static/vendors~main-chunk-4b6445a3b3fc17bf0831.min.js:98:86864) at div at div at t.b (https://console.apps.example.com/static/dev-console/code-refs/common-chunk-5e4f38c02bde64a97ae5.min.js:1:113711) at t.a (https://console.apps.example.com/static/dev-console/code-refs/common-chunk-5e4f38c02bde64a97ae5.min.js:1:116541) at u (https://console.apps.example.com/static/dev-console/code-refs/topology-chunk-e4ae65442e61628a832f.min.js:1:305613) at https://console.apps.example.com/static/dev-console/code-refs/topology-chunk-e4ae65442e61628a832f.min.js:1:509656 at i (https://console.apps.example.com/static/main-chunk-876b3080b765b87baa51.min.js:1:452052) at withFallback() at t.a (https://console.apps.example.com/static/dev-console/code-refs/topology-chunk-e4ae65442e61628a832f.min.js:1:553554) at t (https://console.apps.example.com/static/vendors~main-chunk-4b6445a3b3fc17bf0831.min.js:21:67625) at I (https://console.apps.example.com/static/main-chunk-876b3080b765b87baa51.min.js:1:1533554) at t (https://console.apps.example.com/static/vendors~main-chunk-4b6445a3b3fc17bf0831.min.js:21:69670) at Suspense at i (https://console.apps.example.com/static/main-chunk-876b3080b765b87baa51.min.js:1:452052) at section at m (https://console.apps.example.com/static/vendor-patternfly-core-chunk-cdcfdc55890623d5fc26.min.js:1:720427) at div at div at t.a (https://console.apps.example.com/static/main-chunk-876b3080b765b87baa51.min.js:1:1533801) at div at div at c (https://console.apps.example.com/static/vendor-patternfly-core-chunk-cdcfdc55890623d5fc26.min.js:1:545565) at d (https://console.apps.example.com/static/vendor-patternfly-core-chunk-cdcfdc55890623d5fc26.min.js:1:775077) at div at d (https://console.apps.example.com/static/vendor-patternfly-core-chunk-cdcfdc55890623d5fc26.min.js:1:458280) at l (https://console.apps.example.com/static/main-chunk-876b3080b765b87baa51.min.js:1:1175827) at https://console.apps.example.com/static/main-chunk-876b3080b765b87baa51.min.js:1:458912 at S (https://console.apps.example.com/static/vendors~main-chunk-4b6445a3b3fc17bf0831.min.js:98:86864) at main at div at v (https://console.apps.example.com/static/vendor-patternfly-core-chunk-cdcfdc55890623d5fc26.min.js:1:264220) at div at div at c (https://console.apps.example.com/static/vendor-patternfly-core-chunk-cdcfdc55890623d5fc26.min.js:1:62178) at div at div at c (https://console.apps.example.com/static/vendor-patternfly-core-chunk-cdcfdc55890623d5fc26.min.js:1:545565) at d (https://console.apps.example.com/static/vendor-patternfly-core-chunk-cdcfdc55890623d5fc26.min.js:1:775077) at div at d (https://console.apps.example.com/static/vendor-patternfly-core-chunk-cdcfdc55890623d5fc26.min.js:1:458280) at Un (https://console.apps.example.com/static/vendors~main-chunk-4b6445a3b3fc17bf0831.min.js:36:183620) at t.default 
(https://console.apps.example.com/static/main-chunk-876b3080b765b87baa51.min.js:1:880042) at e.default (https://console.apps.example.com/static/quick-start-chunk-794085a235e14913bdf3.min.js:1:3540) at s (https://console.apps.example.com/static/main-chunk-876b3080b765b87baa51.min.js:1:239711) at t.a (https://console.apps.example.com/static/main-chunk-876b3080b765b87baa51.min.js:1:1610459) at ee (https://console.apps.example.com/static/main-chunk-876b3080b765b87baa51.min.js:1:1628636) at _t (https://console.apps.example.com/static/vendors~main-chunk-4b6445a3b3fc17bf0831.min.js:36:142374) at ee (https://console.apps.example.com/static/main-chunk-876b3080b765b87baa51.min.js:1:1628636) at ee (https://console.apps.example.com/static/main-chunk-876b3080b765b87baa51.min.js:1:1628636) at ee (https://console.apps.example.com/static/main-chunk-876b3080b765b87baa51.min.js:1:1628636) at i (https://console.apps.example.com/static/main-chunk-876b3080b765b87baa51.min.js:1:830807) at t.a (https://console.apps.example.com/static/main-chunk-876b3080b765b87baa51.min.js:1:1604651) at t.a (https://console.apps.example.com/static/main-chunk-876b3080b765b87baa51.min.js:1:1604840) at t.a (https://console.apps.example.com/static/main-chunk-876b3080b765b87baa51.min.js:1:1602256) at te (https://console.apps.example.com/static/main-chunk-876b3080b765b87baa51.min.js:1:1628767) at https://console.apps.example.com/static/main-chunk-876b3080b765b87baa51.min.js:1:1631899 at r (https://console.apps.example.com/static/vendors~main-chunk-4b6445a3b3fc17bf0831.min.js:36:121910) at t (https://console.apps.example.com/static/vendors~main-chunk-4b6445a3b3fc17bf0831.min.js:21:67625) at t (https://console.apps.example.com/static/vendors~main-chunk-4b6445a3b3fc17bf0831.min.js:21:69670) at t (https://console.apps.example.com/static/vendors~main-chunk-4b6445a3b3fc17bf0831.min.js:21:64230) at re (https://console.apps.example.com/static/main-chunk-876b3080b765b87baa51.min.js:1:1632210) at t.a (https://console.apps.example.com/static/main-chunk-876b3080b765b87baa51.min.js:1:804787) at t.a (https://console.apps.example.com/static/main-chunk-876b3080b765b87baa51.min.js:1:1079398) at s (https://console.apps.example.com/static/main-chunk-876b3080b765b87baa51.min.js:1:654118) at t.a (https://console.apps.example.com/static/vendors~main-chunk-4b6445a3b3fc17bf0831.min.js:150:195887) at Suspense
Version-Release number of selected component (if applicable):
OpenShift Container Platform 4.13.38 Developer Console
How reproducible:
Only on customer side, in a single namespace on a single cluster
Steps to Reproduce:
1. On a particular cluster, enter the Developer Console 2. Navigate to "Topology"
Actual results:
Loading the page fails with the error "TypeError: Cannot read properties of null (reading 'metadata')"
Expected results:
No error is shown. The Topology view is shown
Additional info:
- Screenshot available in linked Support Case - HAR file is available in linked Support Case
Description of problem:
Since the singular variant of APIVIP/IngressVIP has been removed as part of https://github.com/openshift/installer/pull/7574, the appliance disk image e2e job is now failing: https://prow.ci.openshift.org/job-history/gs/origin-ci-test/pr-logs/directory/pull-ci-openshift-appliance-master-e2e-compact-ipv4-static The job fails because the appliance supports only 4.14, which still requires the singular variant of the VIP properties.
Version-Release number of selected component (if applicable):
4.16
How reproducible:
Always
Steps to Reproduce:
1. Invoke appliance e2e job on master: https://prow.ci.openshift.org/job-history/gs/origin-ci-test/pr-logs/directory/pull-ci-openshift-appliance-master-e2e-compact-ipv4-static
Actual results:
The job fails with the following validation error: "the Machine Network CIDR can be defined by setting either the API or Ingress virtual IPs", due to the missing apiVIP and ingressVIP in AgentClusterInstall.
Expected results:
AgentClusterInstall should include also the singular 'apiVIP' and 'ingressVIP', and the e2e job should successfully complete
Additional info:
This is a clone of issue OCPBUGS-29664. The following is the description of the original issue:
—
Description of problem:
Created a net-attach-def with 2 IPs in range. After that, created a deployment with 2 replicas using that net-attach-def. The Whereabouts daemonset is created and the cronjob is enabled, reconciling every one minute. When I power off the node on which one of the pods is deployed, gracefully (poweroff) or ungracefully (poweroff --force), a new pod is created on a healthy node and gets stuck in the ContainerCreating state.
Version-Release number of selected component (if applicable):
4.14.11
How reproducible:
- Create the whereabouts reconciler daemon set with the help of the documentation (https://docs.openshift.com/container-platform/4.14/networking/multiple_networks/configuring-additional-network.html#nw-multus-creating-whereabouts-reconciler-daemon-set_configuring-additional-network)
- Update the reconciler_cron_expression to: "*/1 * * * *"
- Create a net-attach-def with 2 IPs in range
- Create a deployment with 2 replicas
- Power off the node on which one of the pods is running
- The new pod spawned on the healthy node is stuck in ContainerCreating status.
Steps to Reproduce:
1. On a fresh cluster with version 4.14.11.
2. Create the whereabouts daemon set with the help of the documentation.
3. Update the reconciler_cron_expression to "*/1 * * * *":
   $ oc create configmap whereabouts-config -n openshift-multus --from-literal=reconciler_cron_expression="*/1 * * * *"
4. Create a new project:
   $ oc new-project nadtesting
5. Apply the nad.yaml below:
   $ cat nad.yaml
   apiVersion: "k8s.cni.cncf.io/v1"
   kind: NetworkAttachmentDefinition
   metadata:
     name: macvlan-net-attach1
   spec:
     config: '{ "cniVersion": "0.3.1", "type": "macvlan", "master": "br-ex", "mode": "bridge", "ipam": { "type": "whereabouts", "datastore": "kubernetes", "range": "172.17.20.0/24", "range_start": "172.17.20.11", "range_end": "172.17.20.12" } }'
6. Create a deployment using the net-attach-def with two replicas:
   $ cat naddeployment.yaml
   apiVersion: apps/v1
   kind: Deployment
   metadata:
     name: deployment1
     labels:
       app: macvlan1
   spec:
     replicas: 2
     selector:
       matchLabels:
         app: macvlan1
     template:
       metadata:
         annotations:
           k8s.v1.cni.cncf.io/networks: macvlan-net-attach1
         labels:
           app: macvlan1
       spec:
         containers:
         - name: google
           image: gcr.io/google-samples/kubernetes-bootcamp:v1
           ports:
           - containerPort: 8080
7. Two pods will be created:
   $ oc get pods -o wide
   NAME                          READY   STATUS    RESTARTS   AGE   IP            NODE                                       NOMINATED NODE   READINESS GATES
   deployment1-fbfdf5cbc-d6sgr   1/1     Running   0          15m   10.129.2.9    ci-ln-xvfy762-c1627-h7xzk-worker-0-qvzq2   <none>           <none>
   deployment1-fbfdf5cbc-njkpz   1/1     Running   0          15m   10.128.2.16   ci-ln-xvfy762-c1627-h7xzk-worker-0-8bdfh   <none>           <none>
8. Power off the node using debug:
   $ oc debug node/ci-ln-xvfy762-c1627-h7xzk-worker-0-8bdfh
   # chroot /host
   # shutdown
9. Wait for some time; the new pod created on the healthy node is stuck in ContainerCreating:
   $ oc get pod -o wide
   NAME                          READY   STATUS              RESTARTS   AGE     IP            NODE                                       NOMINATED NODE   READINESS GATES
   deployment1-fbfdf5cbc-6cb8d   0/1     ContainerCreating   0          9m53s   <none>        ci-ln-xvfy762-c1627-h7xzk-worker-0-blzlk   <none>           <none>
   deployment1-fbfdf5cbc-d6sgr   1/1     Running             0          28m     10.129.2.9    ci-ln-xvfy762-c1627-h7xzk-worker-0-qvzq2   <none>           <none>
   deployment1-fbfdf5cbc-njkpz   1/1     Terminating         0          28m     10.128.2.16   ci-ln-xvfy762-c1627-h7xzk-worker-0-8bdfh   <none>           <none>
10. Node status, just for reference:
   $ oc get nodes
   NAME                                       STATUS     ROLES                  AGE   VERSION
   ci-ln-xvfy762-c1627-h7xzk-master-0         Ready      control-plane,master   59m   v1.27.10+28ed2d7
   ci-ln-xvfy762-c1627-h7xzk-master-1         Ready      control-plane,master   59m   v1.27.10+28ed2d7
   ci-ln-xvfy762-c1627-h7xzk-master-2         Ready      control-plane,master   58m   v1.27.10+28ed2d7
   ci-ln-xvfy762-c1627-h7xzk-worker-0-8bdfh   NotReady   worker                 43m   v1.27.10+28ed2d7
   ci-ln-xvfy762-c1627-h7xzk-worker-0-blzlk   Ready      worker                 43m   v1.27.10+28ed2d7
   ci-ln-xvfy762-c1627-h7xzk-worker-0-qvzq2   Ready      worker                 43m   v1.27.10+28ed2d
Actual results:
The shut-down node's pod is stuck in the Terminating state and does not release its IP. The new pod is stuck in ContainerCreating status.
Expected results:
The new pod should start smoothly on the new node.
Additional info:
- Just for information: if I follow the manual approach the issue resolves; for that I need to follow these steps:
1. Remove the terminating pod's IP from the overlapping range reservations: $ oc delete overlappingrangeipreservations.whereabouts.cni.cncf.io <IP>
2. Remove the terminating pod's IP from ippools.whereabouts.cni.cncf.io: $ oc edit ippools.whereabouts.cni.cncf.io <IP Pool> and remove the stale IP from the list.
Also, the whereabouts-reconciler logs on the Terminating pod's node report:
2024-02-19T10:48:00Z [debug] Added IP 172.17.20.12 for pod nadtesting/deployment1-fbfdf5cbc-njkpz
2024-02-19T10:48:00Z [debug] the IP reservation: IP: 172.17.20.12 is reserved for pod: nadtesting/deployment1-fbfdf5cbc-njkpz
2024-02-19T10:48:00Z [debug] pod reference nadtesting/deployment1-fbfdf5cbc-njkpz matches allocation; Allocation IP: 172.17.20.12; PodIPs: map[172.17.20.12:{}]
2024-02-19T10:48:00Z [debug] no IP addresses to cleanup
2024-02-19T10:48:00Z [verbose] reconciler success
i.e. it fails to recognize the need to remove the allocation.
Slack thread: https://redhat-internal.slack.com/archives/C01CQA76KMX/p1705425516419799
A revision controller is spinning up too many revisions.
Goal: update the revision controller code to temporarily log config changes to validate that the newly created revisions are valid. Or, prove that some of the new revisions are unnecessary.
Description of problem:
Configuration files applied via the API do not have any effect on the configuration of the bare metal host.
Version-Release number of selected component (if applicable):
OpenShift 4.12.42 with ACM 2.8
How reproducible:
Reproducible.
Steps to Reproduce:
1. Applied nmstateconfig via 'oc apply', realized after it booted that the subnet prefix was incorrect.
2. Deleted bmh and nmstateconfig.
3. Applied the correct config via 'oc apply'; the machine still boots with the 1st config.
4. Deleted bmh and nmstateconfig.
5. Created the host via the BMC form in the GUI with the correct config. The machine boots with the correct config.
6. Tested deleting bmh and nmstateconfig, and creating a new machine by just applying the bmh file with zero network config; the machine boots again with the networking from step 5.
Actual results:
Bare metal host does not get config via 'oc apply'.
Expected results:
'oc apply -f nmstateconfig.yaml' should work to apply networking configuration.
Additional info:
HP Synergy 480 Gen10 (871942-B21), UEFI boot with Redfish virtual media, static IP with bonding.
As part of the PatternFly update from 5.1.0 to 5.1.1 it was required to disable some dev console e2e tests.
See https://github.com/openshift/console/pull/13380
We need to re-enable and adapt at least this tests:
In frontend/packages/dev-console/integration-tests/features/addFlow/create-from-devfile.feature and in frontend/packages/dev-console/integration-tests/features/e2e/add-flow-ci.feature
In frontend/packages/helm-plugin/integration-tests/features/helm-release.feature and frontend/packages/helm-plugin/integration-tests/features/helm/actions-on-helm-release.feature:
Can we please also check why we have these broken tests in two feature files? 🤷
The HyperShift ignition endpoint needlessly supports ALPN HTTP/2. In light of CVE-2023-39325, there is no reason to support HTTP/2 if it is not being used.
Please review the following PR: https://github.com/openshift/csi-driver-shared-resource/pull/158
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of the problem:
LVMS multi-node
requires an additional disk for the operator.
However, I was able to create a 4.15 multi-node cluster, select LVMS and, without adding the additional disk, I see that the LVM requirement passes and I am able to continue and start the installation.
How reproducible:
Steps to reproduce:
1. create a cluster 4.15 multi node
2. select lvms operator
3. do not attach additional disk
Actual results:
It is possible to continue to the installation page and start the installation.
The LVM requirement is set as success.
Expected results:
The LVM requirement should show as failed.
It should not be possible to proceed to installation before attaching the disk.
Description of problem:
The users are experiencing an issue with NodePort traffic forwarding, where TCP traffic continues to be directed to pods that are in the Terminating state, so the connection cannot be created successfully. As the customer mentioned, this issue is causing connection disruptions in their business transactions.
Version-Release number of selected component (if applicable):
On the OpenShift 4.12.13 with RHEL8.6 workers and OVN environment.
How reproducible:
Here is the code found:
https://github.com/openshift/ovn-kubernetes/blob/dd3c7ed8c1f41873168d3df26084ecbfd3d9a36b/go-controller/pkg/util/kube.go#L360;
—
func IsEndpointServing(endpoint discovery.Endpoint) bool {
	if endpoint.Conditions.Serving != nil {
		return *endpoint.Conditions.Serving
	} else {
		return IsEndpointReady(endpoint)
	}
}
// IsEndpointValid takes as input an endpoint from an endpoint slice and a boolean that indicates whether to include
// all terminating endpoints, as per the PublishNotReadyAddresses feature in kubernetes service spec. It always returns true
// if includeTerminating is true and falls back to IsEndpointServing otherwise.
func IsEndpointValid(endpoint discovery.Endpoint, includeTerminating bool) bool {
	return includeTerminating || IsEndpointServing(endpoint)
}
—
It looks like the 'IsEndpointValid' function returns endpoints with serving=true; it is not checking for ready=true endpoints.
I see the code has recently been changed in this section (the lookup on Ready=true was changed to Serving=true)?
[Check the "Serving" field for endpoints]
https://github.com/openshift/ovn-kubernetes/commit/aceef010daf0697fe81dba91a39ed0fdb6563dea#diff-daf9de695e0ff81f9173caf83cb88efa138e92a9b35439bd7044aa012ff931c0
https://github.com/openshift/ovn-kubernetes/blob/release-4.12/go-controller/pkg/util/kube.go#L326-L386
—
out.Port = *port.Port
for _, endpoint := range slice.Endpoints {
	// Skip endpoint if it's not valid
	if !IsEndpointValid(endpoint, includeTerminating) {
		continue
	}
	for _, ip := range endpoint.Addresses {
		klog.V(4).Infof("Adding slice %s endpoint: %v, port: %d", slice.Name, endpoint.Addresses, *port.Port)
		ipStr := utilnet.ParseIPSloppy(ip).String()
		switch slice.AddressType {
		// ... (IPv4/IPv6 handling elided in the original excerpt)
		}
	}
}
—
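To make the reporter's expectation concrete, a minimal sketch (an assumption about what a Ready-based check would look like, not the upstream code or the eventual fix) of selecting endpoints by the Ready condition instead of Serving:

```go
package util

import discovery "k8s.io/api/discovery/v1"

// isEndpointReadyForLB is a hypothetical helper illustrating the behavior the
// customer expects: a terminating pod (Ready=false, Serving possibly still true)
// would be excluded from the service LB backends.
func isEndpointReadyForLB(ep discovery.Endpoint, includeTerminating bool) bool {
	if includeTerminating {
		return true
	}
	if ep.Conditions.Ready != nil {
		return *ep.Conditions.Ready
	}
	// Per the EndpointSlice API, a nil Ready condition should be interpreted
	// as "ready" for compatibility.
	return true
}
```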
Steps to Reproduce:
Here are the customer's sample pods for reference:
mbgateway-st-8576f6f6f8-5jc75 1/1 Running 0 104m 172.30.195.124 appn01-100.app.paas.example.com <none> <none>
mbgateway-st-8576f6f6f8-q8j6k 1/1 Running 0 5m51s 172.31.2.97 appn01-202.app.paas.example.com <none> <none>
pod yaml:
livenessProbe:
  failureThreshold: 3
  initialDelaySeconds: 40
  periodSeconds: 10
  successThreshold: 1
  tcpSocket:
    port: 9190
  timeoutSeconds: 5
name: mbgateway-st
ports:
- containerPort: 9190
  protocol: TCP
readinessProbe:
  failureThreshold: 3
  initialDelaySeconds: 40
  periodSeconds: 10
  successThreshold: 1
  tcpSocket:
    port: 9190
  timeoutSeconds: 5
resources:
  limits:
    cpu: "2"
    ephemeral-storage: 10Gi
    memory: 2G
  requests:
    cpu: 50m
    ephemeral-storage: 100Mi
    memory: 1111M
When deleting the pod mbgateway-st-8576f6f6f8-5jc75, check the EndpointSlice status:
addressType: IPv4
apiVersion: discovery.k8s.io/v1
endpoints:
Wait a moment and check the OVN service LB; the endpoint information has not been updated to the latest:
9349d703-1f28-41fe-b505-282e8abf4c40 Service_lb59-10- tcp 172.35.0.185:31693 172.30.195.124:9190,172.31.2.97:9190
dca65745-fac4-4e73-b412-2c7530cf4a91 Service_lb59-10- tcp 172.35.0.170:31693 172.30.195.124:9190,172.31.2.97:9190
a5a65766-b0f2-4ac6-8f7c-cdebeea303e3 Service_lb59-10- tcp 172.35.0.89:31693 172.30.195.124:9190,172.31.2.97:9190
a36517c5-ecaa-4a41-b686-37c202478b98 Service_lb59-10- tcp 172.35.0.213:31693 172.30.195.124:9190,172.31.2.97:9190
16d997d1-27f0-41a3-8a9f-c63c8872d7b8 Service_lb59-10- tcp 172.35.0.92:31693 172.30.195.124:9190,172.31.2.97:9190
Wait a little longer:
addressType: IPv4
apiVersion: discovery.k8s.io/v1
endpoints:
Check the OVN service LB; the deleted pod's endpoint information is still there:
fceeaf8f-e747-4290-864c-ba93fb565a8a Service_lb59-10- tcp 172.35.0.56:31693 172.30.132.78:9190,172.30.195.124:9190,172.31.2.97:9190
bef42efd-26db-4df3-b99d-370791988053 Service_lb59-10- tcp 172.35.1.26:31693 172.30.132.78:9190,172.30.195.124:9190,172.31.2.97:9190
84172e2c-081c-496a-afec-25ebcb83cc60 Service_lb59-10- tcp 172.35.0.118:31693 172.30.132.78:9190,172.30.195.124:9190,172.31.2.97:9190
34412ddd-ab5c-4b6b-95a3-6e718dd20a4f Service_lb59-10- tcp 172.35.1.14:31693 172.30.132.78:9190,172.30.195.124:9190,172.31.2.97:9190
Actual results:
The service LB endpoints are determined by the POD.status.condition[type=Serving] status.
Expected results:
The service LB endpoints should be determined by the POD.status.condition[type=Ready] status.
Additional info:
The ovn-controller determines whether an endpoint should be added to the service load balancer (service LB) based on condition.serving. The current issue is that when a pod is in the terminating state, condition.serving remains true; its state is based on the Pod's status.condition[type=Ready] status being true.
However, when a pod is deleted, the EndpointSlice condition.serving state remains unchanged, and the backend pool of the service LB still includes the IP information of the deleted pod. Why doesn't ovn-controller use the condition.ready status to decide whether the pod's IP should be added to the service LB backend pool?
Could the OpenShift networking experts confirm whether this is an OpenShift OVN service LB bug or not?
This is a clone of issue OCPBUGS-36260. The following is the description of the original issue:
—
The tooltip for a Pipeline when expression is not shown in the Pipeline visualization.
The when expression tooltip is not shown on hover.
The when expression tooltip should be shown on hover.
Slack discussion here: https://redhat-internal.slack.com/archives/C02F1J9UJJD/p1702394712492839
Repo: openshift/kubernetes/pkg/controller/podautoscaler MISSING: jkyros, joelsmith
Repo: openshift/kubernetes-autoscaler/vertical-pod-autoscaler MISSING: jkyros
Repo: openshift/vertical-pod-autoscaler-operator/ MISSING: jkyros
The openshift/kubernetes one was the only really weird one where there might not be a precedent. Looking at the kubernetes repo, it looks like the convention is to add a DOWNSTREAM_APPROVERS file as a carry patch for downstream approvers?
Description of the problem:
When creating a cluster with base version 4.16 from the candidate channel,
all returned versions share the same X.Y.Z in the 4.16.0-ec.Z format.
The returned version should be the latest, but we return the first hit because we do not compare the part after -ec.Z.
This means that, for example,
4.16.0-ec.1 4.16.0-ec.2 4.16.0-ec.3
won't be ordered correctly.
How reproducible:
Always; a different result is returned each time.
Steps to reproduce:
1. Create a cluster with Latest release from test-infra
export OPENSHIFT_VERSION=4.16
2. Once the cluster is created, check the picked version
(Pdb++) cluster.get_details().openshift_version
2024-05-07 06:55:26,409 root INFO - 140479183103808 - Refreshing API key (/home/benny/assisted-test-infra/src/service_client/assisted_service_api.py:78)->refresh_api_key
'4.16.0-ec.2'
--> When this order 4.16.0-ec.5 chosen github.com/openshift/assisted-service/models.ReleaseImages len: 7, cap: 20, [ *{ CPUArchitecture: *"x86_64", CPUArchitectures: github.com/lib/pq.StringArray len: 1, cap: 1, ["x86_64"], Default: false, OpenshiftVersion: *"4.16", SupportLevel: "beta", URL: *"quay.io/openshift-release-dev/ocp-release:4.16.0-ec.5-x86_64", Version: *"4.16.0-ec.5",}, *{ CPUArchitecture: *"x86_64", CPUArchitectures: github.com/lib/pq.StringArray len: 1, cap: 1, ["x86_64"], Default: false, OpenshiftVersion: *"4.16", SupportLevel: "beta", URL: *"quay.io/openshift-release-dev/ocp-release:4.16.0-ec.6-x86_64", Version: *"4.16.0-ec.6",}, *{ CPUArchitecture: *"x86_64", CPUArchitectures: github.com/lib/pq.StringArray len: 1, cap: 1, ["x86_64"], Default: false, OpenshiftVersion: *"4.16", SupportLevel: "beta", URL: *"quay.io/openshift-release-dev/ocp-release:4.16.0-ec.4-x86_64", Version: *"4.16.0-ec.4",}, *{ CPUArchitecture: *"x86_64", CPUArchitectures: github.com/lib/pq.StringArray len: 1, cap: 1, ["x86_64"], Default: false, OpenshiftVersion: *"4.16", SupportLevel: "beta", URL: *"quay.io/openshift-release-dev/ocp-release:4.16.0-ec.0-x86_64", Version: *"4.16.0-ec.0",}, *{ CPUArchitecture: *"x86_64", CPUArchitectures: github.com/lib/pq.StringArray len: 1, cap: 1, ["x86_64"], Default: false, OpenshiftVersion: *"4.16", SupportLevel: "beta", URL: *"quay.io/openshift-release-dev/ocp-release:4.16.0-ec.1-x86_64", Version: *"4.16.0-ec.1",}, *{ CPUArchitecture: *"x86_64", CPUArchitectures: github.com/lib/pq.StringArray len: 1, cap: 1, ["x86_64"], Default: false, OpenshiftVersion: *"4.16", SupportLevel: "beta", URL: *"quay.io/openshift-release-dev/ocp-release:4.16.0-ec.3-x86_64", Version: *"4.16.0-ec.3",}, *{ CPUArchitecture: *"x86_64", CPUArchitectures: github.com/lib/pq.StringArray len: 1, cap: 1, ["x86_64"], Default: false, OpenshiftVersion: *"4.16", SupportLevel: "beta", URL: *"quay.io/openshift-release-dev/ocp-release:4.16.0-ec.2-x86_64", Version: *"4.16.0-ec.2",}, ]
Expected results:
We expect to always get the latest version from the channel.
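A minimal sketch of selecting the latest pre-release correctly, here using golang.org/x/mod/semver rather than whatever comparison assisted-service actually uses (treat the library choice as an assumption); the point is that the -ec.N suffix must participate in the comparison:

```go
package main

import (
	"fmt"

	"golang.org/x/mod/semver"
)

// latest returns the highest version in the list, comparing the full semver
// string including the pre-release part (e.g. "-ec.2" vs "-ec.6").
func latest(versions []string) string {
	best := ""
	for _, v := range versions {
		canon := "v" + v // x/mod/semver requires a leading "v"
		if best == "" || semver.Compare(canon, "v"+best) > 0 {
			best = v
		}
	}
	return best
}

func main() {
	images := []string{"4.16.0-ec.5", "4.16.0-ec.6", "4.16.0-ec.4", "4.16.0-ec.0", "4.16.0-ec.1", "4.16.0-ec.3", "4.16.0-ec.2"}
	fmt.Println(latest(images)) // 4.16.0-ec.6
}
```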
This is a clone of issue OCPBUGS-37988. The following is the description of the original issue:
—
Description of problem:
In the Administrator view under Cluster Settings -> Update Status Pane, the text for the versions is black instead of white when Dark mode is selected on Firefox (128.0.3 Mac). Also happens if you choose System default theme and the system is set to Dark mode.
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. Open /settings/cluster using Firefox with Dark mode selected 2. 3.
Actual results:
The version numbers under Update status are black
Expected results:
The version numbers under Update status are white
Additional info:
This is a clone of issue OCPBUGS-35528. The following is the description of the original issue:
—
The cluster-update keys include some old Red Hat keys which are self-signed with SHA-1. The keys that we use have recently been re-signed with SHA-256. We don't rely on the self-signing to establish trust in the keys (that trust is established by baking a ConfigMap manifest into release images, where it can be read by the cluster-version operator), but we do need to avoid spooking the key-loading library. Currently, Go-1.22-built CVOs in FIPS mode fail to bootstrap,
like this aws-ovn-fips run > Artifacts > install artifacts:
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.17-e2e-aws-ovn-fips/1800906552731766784/artifacts/e2e-aws-ovn-fips/ipi-install-install/artifacts/log-bundle-20240612161314.tar | tar -tvz | grep 'cluster-version.*log'
-rw-r--r-- core/core 54653 2024-06-12 09:13 log-bundle-20240612161314/bootstrap/containers/cluster-version-operator-bd9f61984afa844dcd284f68006ffc9548377c045eff840096c74bcdcbe5cca3.log
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.17-e2e-aws-ovn-fips/1800906552731766784/artifacts/e2e-aws-ovn-fips/ipi-install-install/artifacts/log-bundle-20240612161314.tar | tar -xOz log-bundle-20240612161314/bootstrap/containers/cluster-version-operator-bd9f61984afa844dcd284f68006ffc9548377c045eff840096c74bcdcbe5cca3.log | grep GPG
I0612 16:06:15.952567 1 start.go:256] Failed to initialize from payload; shutting down: the config map openshift-config-managed/release-verification has an invalid key "verifier-public-key-redhat" that must be a GPG public key: openpgp: invalid data: tag byte does not have MSB set: openpgp: invalid data: tag byte does not have MSB set
E0612 16:06:15.952600 1 start.go:309] Collected payload initialization goroutine: the config map openshift-config-managed/release-verification has an invalid key "verifier-public-key-redhat" that must be a GPG public key: openpgp: invalid data: tag byte does not have MSB set: openpgp: invalid data: tag byte does not have MSB set
That's this code attempting to call ReadArmoredKeyRing (which fails with a currently-unlogged openpgp: invalid data: user ID self-signature invalid: openpgp: invalid signature: RSA verification failure, complaining about the SHA-1 signature), and then falling back to ReadKeyRing, which fails with the reported openpgp: invalid data: tag byte does not have MSB set.
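For illustration, a minimal sketch (an assumption, not the CVO's actual code) of the load-with-fallback behaviour described above, using the openpgp function names mentioned in this report:

```go
package main

import (
	"fmt"
	"os"
	"strings"

	"golang.org/x/crypto/openpgp" // deprecated upstream; named here only because the report references its functions
)

// loadKeyRing tries the armored reader first and falls back to the binary
// keyring reader, mirroring the behaviour described in this bug.
func loadKeyRing(data string) (openpgp.EntityList, error) {
	if keyring, err := openpgp.ReadArmoredKeyRing(strings.NewReader(data)); err == nil {
		return keyring, nil
	}
	keyring, err := openpgp.ReadKeyRing(strings.NewReader(data))
	if err != nil {
		return nil, fmt.Errorf("value is not a GPG public key: %w", err)
	}
	return keyring, nil
}

func main() {
	raw, err := os.ReadFile("verifier-public-key-redhat.asc") // hypothetical local copy of the ConfigMap key
	if err != nil {
		panic(err)
	}
	if _, err := loadKeyRing(string(raw)); err != nil {
		panic(err)
	}
	fmt.Println("keyring loaded")
}
```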
To avoid these failures, we should:
Only 4.17 will use Go 1.22, so that's the only release that needs patching. But the changes would be fine to backport if we wanted.
100%.
1. Build the CVO with Go 1.22
2. Launch a FIPS cluster.
Fails to bootstrap, with the bootstrap CVO complaining, as shown in the Description of problem section.
Successful install
Description of problem:
We upgraded our OpenShift cluster from 4.4.16 to 4.15.3 and multiple operators are now in "Failed" status, with CSV conditions such as: - NeedsReinstall installing: deployment changed old hash=5f6b8fc6f7, new hash=5hFv6Gemy1Zri3J9ulXfjG9qOzoFL8FMsLNcLR - InstallComponentFailed install strategy failed: rolebindings.rbac.authorization.k8s.io "openshift-gitops-operator-controller-manager-service-auth-reader" already exists All other failures refer to a similar "auth-reader" rolebinding that already exists.
Version-Release number of selected component (if applicable):
OpenShift 4.15.3
How reproducible:
Happened on several installed operators but on the only cluster we upgraded (our staging cluster)
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
All operators should be up-to-date
Additional info:
This may be related to https://github.com/operator-framework/operator-lifecycle-manager/pull/3159
This is a clone of issue OCPBUGS-33570. The following is the description of the original issue:
—
Description of problem:
Install OCP with CAPI; when setting bootType: "UEFI", we got an unsupported value error. Installing with Terraform did not hit this issue.
platform:
nutanix:
bootType: "UEFI"
# ./openshift-install create cluster --dir cluster --log-level debug
...
ERROR failed to fetch Cluster: failed to generate asset "Cluster": failed to create cluster: failed to create control-plane manifest: NutanixMachine.infrastructure.cluster.x-k8s.io "sgao-nutanix-zonal-jwp6d-bootstrap" is invalid: spec.bootType: Unsupported value: "UEFI": supported values: "legacy", "uefi"
Setting bootType: "uefi" also won't work:
# ./openshift-install create manifests --dir cluster
...
FATAL failed to fetch Master Machines: failed to generate asset "Master Machines": failed to create master machine objects: platform.nutanix.bootType: Invalid value: "uefi": valid bootType: "", "Legacy", "UEFI", "SecureBoot".
Version-Release number of selected component (if applicable):
4.16.0-0.nightly-2024-05-08-222442
How reproducible:
always
Steps to Reproduce:
1. Create the install config with bootType: "UEFI" and enable CAPI by setting:
   featureSet: CustomNoUpgrade
   featureGates:
   - ClusterAPIInstall=true
2. Install the cluster
Actual results:
Install failed
Expected results:
Install passed
Additional info:
I was seeing the following error running `build.sh` with go v1.19.5 until I upgraded to v1.22.4:
```
❯ ./build.sh
pkg/auth/sessions/server_session.go:7:2: cannot find package "." in:
/Users/rhamilto/Git/console/vendor/slices
```
Description of problem:
The nodeip-configuration service does not log to the serial console, which makes it difficult to debug problems when networking is not available and there is no access to the node.
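One conventional way to get a unit's output onto the serial console is a systemd drop-in that adds the console to StandardOutput; a minimal sketch (the drop-in path and whether this matches the actual fix are assumptions):

```
# /etc/systemd/system/nodeip-configuration.service.d/10-serial-console.conf (hypothetical drop-in)
[Service]
StandardOutput=journal+console
StandardError=journal+console
```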
Version-Release number of selected component (if applicable):
Reported against 4.13, but present in all releases
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
This is a clone of issue OCPBUGS-33815. The following is the description of the original issue:
—
In our hypershift test, we see the openshift-controller-manager undoing the work of our controllers to set an imagePullSecrets entry on our ServiceAccounts. The result is a rapid updating of ServiceAccounts as the controllers fight.
This started happening after https://github.com/openshift/openshift-controller-manager/pull/305
Description of problem: MCN lister fires in the operator pod before the CRD exists. This causes API issues and could impact upgrades.
Version-Release number of selected component (if applicable):
How reproducible: always
Steps to Reproduce:
1. upgrade to 4.15 from any version 2. 3.
Actual results:
I1211 18:44:40.972098 1 operator.go:347] Starting MachineConfigOperator I1211 18:44:40.982079 1 event.go:298] Event(v1.ObjectReference{Kind:"", Namespace:"openshift-machine-config-operator", Name:"machine-config", UID:"68bc5e8f-b7f5-4506-a870-2eecaa5afd35", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'OperatorVersionChanged' clusteroperator/machine-config-operator started a version change from [{operator 4.14.6}] to [{operator 4.15.0-0.nightly-2023-12-11-033133}] W1211 18:44:41.255502 1 reflector.go:535] github.com/openshift/client-go/machineconfiguration/informers/externalversions/factory.go:116: failed to list *v1alpha1.MachineConfigNode: the server could not find the requested resource (get machineconfignodes.machineconfiguration.openshift.io) E1211 18:44:41.255587 1 reflector.go:147] github.com/openshift/client-go/machineconfiguration/informers/externalversions/factory.go:116: Failed to watch *v1alpha1.MachineConfigNode: failed to list *v1alpha1.MachineConfigNode: the server could not find the requested resource (get machineconfignodes.machineconfiguration.openshift.io) E1211 18:58:04.915119 1 reflector.go:147] github.com/openshift/client-go/machineconfiguration/informers/externalversions/factory.go:116: Failed to watch *v1alpha1.MachineConfigNode: the server could not find the requested resource (get machineconfignodes.machineconfiguration.openshift.io) W1211 18:58:06.425952 1 reflector.go:535] github.com/openshift/client-go/machineconfiguration/informers/externalversions/factory.go:116: failed to list *v1alpha1.MachineConfigNode: the server could not find the requested resource (get machineconfignodes.machineconfiguration.openshift.io) E1211 18:58:06.426037 1 reflector.go:147] github.com/openshift/client-go/machineconfiguration/informers/externalversions/factory.go:116: Failed to watch *v1alpha1.MachineConfigNode: failed to list *v1alpha1.MachineConfigNode: the server could not find the requested resource (get machineconfignodes.machineconfiguration.openshift.io) W1211 18:58:09.396004 1 reflector.go:535] github.com/openshift/client-go/machineconfiguration/informers/externalversions/factory.go:116: failed to list *v1alpha1.MachineConfigNode: the server could not find the requested resource (get machineconfignodes.machineconfiguration.openshift.io) E1211 18:58:09.396068 1 reflector.go:147] github.com/openshift/client-go/machineconfiguration/informers/externalversions/factory.go:116: Failed to watch *v1alpha1.MachineConfigNode: failed to list *v1alpha1.MachineConfigNode: the server could not find the requested resource (get machineconfignodes.machineconfiguration.openshift.io) W1211 18:58:14.540488 1 reflector.go:535] github.com/openshift/client-go/machineconfiguration/informers/externalversions/factory.go:116: failed to list *v1alpha1.MachineConfigNode: the server could not find the requested resource (get machineconfignodes.machineconfiguration.openshift.io) E1211 18:58:14.540560 1 reflector.go:147] github.com/openshift/client-go/machineconfiguration/informers/externalversions/factory.go:116: Failed to watch *v1alpha1.MachineConfigNode: failed to list *v1alpha1.MachineConfigNode: the server could not find the requested resource (get machineconfignodes.machineconfiguration.openshift.io) W1211 18:58:25.293029 1 reflector.go:535] github.com/openshift/client-go/machineconfiguration/informers/externalversions/factory.go:116: failed to list *v1alpha1.MachineConfigNode: the server could not find the requested resource (get 
machineconfignodes.machineconfiguration.openshift.io) E1211 18:58:25.293095 1 reflector.go:147] github.com/openshift/client-go/machineconfiguration/informers/externalversions/factory.go:116: Failed to watch *v1alpha1.MachineConfigNode: failed to list *v1alpha1.MachineConfigNode: the server could not find the requested resource (get machineconfignodes.machineconfiguration.openshift.io) W1211 18:58:50.166866 1 reflector.go:535] github.com/openshift/client-go/machineconfiguration/informers/externalversions/factory.go:116: failed to list *v1alpha1.MachineConfigNode: the server could not find the requested resource (get machineconfignodes.machineconfiguration.openshift.io) E1211 18:58:50.166903 1 reflector.go:147] github.com/openshift/client-go/machineconfiguration/informers/externalversions/factory.go:116: Failed to watch *v1alpha1.MachineConfigNode: failed to list *v1alpha1.MachineConfigNode: the server could not find the requested resource (get machineconfignodes.machineconfiguration.openshift.io) W1211 18:59:39.950454 1 reflector.go:535] github.com/openshift/client-go/machineconfiguration/informers/externalversions/factory.go:116: failed to list *v1alpha1.MachineConfigNode: the server could not find the requested resource (get machineconfignodes.machineconfiguration.openshift.io) E1211 18:59:39.950523 1 reflector.go:147] github.com/openshift/client-go/machineconfiguration/informers/externalversions/factory.go:116: Failed to watch *v1alpha1.MachineConfigNode: failed to list *v1alpha1.MachineConfigNode: the server could not find the requested resource (get machineconfignodes.machineconfiguration.openshift.io) W1211 19:00:23.432005 1 reflector.go:535] github.com/openshift/client-go/machineconfiguration/informers/externalversions/factory.go:116: failed to list *v1alpha1.MachineConfigNode: the server could not find the requested resource (get machineconfignodes.machineconfiguration.openshift.io) E1211 19:00:23.432038 1 reflector.go:147] github.com/openshift/client-go/machineconfiguration/informers/externalversions/factory.go:116: Failed to watch *v1alpha1.MachineConfigNode: failed to list *v1alpha1.MachineConfigNode: the server could not find the requested resource (get machineconfignodes.machineconfiguration.openshift.io) W1211 19:01:13.237298 1 reflector.go:535] github.com/openshift/client-go/machineconfiguration/informers/externalversions/factory.go:116: failed to list *v1alpha1.MachineConfigNode: the server could not find the requested resource (get machineconfignodes.machineconfiguration.openshift.io) E1211 19:01:13.237382 1 reflector.go:147] github.com/openshift/client-go/machineconfiguration/informers/externalversions/factory.go:116: Failed to watch *v1alpha1.MachineConfigNode: failed to list *v1alpha1.MachineConfigNode: the server could not find the requested resource (get machineconfignodes.machineconfiguration.openshift.io) W1211 19:02:02.035555 1 reflector.go:535] github.com/openshift/client-go/machineconfiguration/informers/externalversions/factory.go:116: failed to list *v1alpha1.MachineConfigNode: the server could not find the requested resource (get machineconfignodes.machineconfiguration.openshift.io) E1211 19:02:02.035628 1 reflector.go:147] github.com/openshift/client-go/machineconfiguration/informers/externalversions/factory.go:116: Failed to watch *v1alpha1.MachineConfigNode: failed to list *v1alpha1.MachineConfigNode: the server could not find the requested resource (get machineconfignodes.machineconfiguration.openshift.io) W1211 19:02:52.111260 1 reflector.go:535] 
github.com/openshift/client-go/machineconfiguration/informers/externalversions/factory.go:116: failed to list *v1alpha1.MachineConfigNode: the server could not find the requested resource (get machineconfignodes.machineconfiguration.openshift.io) E1211 19:02:52.111332 1 reflector.go:147] github.com/openshift/client-go/machineconfiguration/informers/externalversions/factory.go:116: Failed to watch *v1alpha1.MachineConfigNode: failed to list *v1alpha1.MachineConfigNode: the server could not find the requested resource (get machineconfignodes.machineconfiguration.openshift.io) W1211 19:03:38.243461 1 reflector.go:535] github.com/openshift/client-go/machineconfiguration/informers/externalversions/factory.go:116: failed to list *v1alpha1.MachineConfigNode: the server could not find the requested resource (get machineconfignodes.machineconfiguration.openshift.io) E1211 19:03:38.243499 1 reflector.go:147] github.com/openshift/client-go/machineconfiguration/informers/externalversions/factory.go:116: Failed to watch *v1alpha1.MachineConfigNode: failed to list *v1alpha1.MachineConfigNode: the server could not find the requested resource (get machineconfignodes.machineconfiguration.openshift.io) W1211 19:04:27.848493 1 reflector.go:535] github.com/openshift/client-go/machineconfiguration/informers/externalversions/factory.go:116: failed to list *v1alpha1.MachineConfigNode: the server could not find the requested resource (get machineconfignodes.machineconfiguration.openshift.io) E1211 19:04:27.848585 1 reflector.go:147] github.com/openshift/client-go/machineconfiguration/informers/externalversions/factory.go:116: Failed to watch *v1alpha1.MachineConfigNode: failed to list *v1alpha1.MachineConfigNode: the server could not find the requested resource (get machineconfignodes.machineconfiguration.openshift.io) E1211 19:05:37.064033 1 sync.go:1250] Error syncing Required MachineConfigPools: "error MachineConfigPool master is not ready, retrying. Status: (pool degraded: true total: 3, ready 0, updated: 0, unavailable: 1)" E1211 19:05:38.057685 1 sync.go:1250] Error syncing Required MachineConfigPools: "error MachineConfigPool master is not ready, retrying. Status: (pool degraded: true total: 3, ready 0, updated: 0, unavailable: 1)" E1211 19:05:39.036638 1 sync.go:1250] Error syncing Required MachineConfigPools: "error MachineConfigPool master is not ready, retrying. Status: (pool degraded: true total: 3, ready 0, updated: 0, unavailable: 1)" E1211 19:05:40.039736 1 sync.go:1250] Error syncing Required MachineConfigPools: "error MachineConfigPool master is not ready, retrying. Status: (pool degraded: true total: 3, ready 0, updated: 0, unavailable: 1)" E1211 19:05:41.039696 1 sync.go:1250] Error syncing Required MachineConfigPools: "error MachineConfigPool master is not ready, retrying. Status: (pool degraded: true total: 3, ready 0, updated: 0, unavailable: 1)" E1211 19:05:42.034840 1 sync.go:1250] Error syncing Required MachineConfigPools: "error MachineConfigPool master is not ready, retrying. Status: (pool degraded: true total: 3, ready 0, updated: 0, unavailable: 1)" E1211 19:05:43.044901 1 sync.go:1250] Error syncing Required MachineConfigPools: "error MachineConfigPool master is not ready, retrying. Status: (pool degraded: true total: 3, ready 0, updated: 0, unavailable: 1)" E1211 19:05:44.033229 1 sync.go:1250] Error syncing Required MachineConfigPools: "error MachineConfigPool master is not ready, retrying. 
Status: (pool degraded: true total: 3, ready 0, updated: 0, unavailable: 1)" E1211 19:05:45.034792 1 sync.go:1250] Error syncing Required MachineConfigPools: "error MachineConfigPool master is not ready, retrying. Status: (pool degraded: true total: 3, ready 0, updated: 0, unavailable: 1)" E1211 19:05:46.052866 1 sync.go:1250] Error syncing Required MachineConfigPools: "error MachineConfigPool master is not ready, retrying. Status: (pool degraded: true total: 3, ready 0, updated: 0, unavailable: 1)"
Expected results:
Additional info:
Description of problem:
On the OperatorHub page, a long catalogsource display name overflows the operator item tile.
Version-Release number of selected component (if applicable):
4.15.0-0.nightly-2023-12-19-033450
How reproducible:
Always
Steps to Reproduce:
1. Create a catalogsource with a long display name. 2. Check operator items supplied by the created catalogsource on the OperatorHub page.
Actual results:
2. The catalogsource display name overflows from the item tile
Expected results:
2. Show the catalogsource display name in the item tile dynamically, without overflow.
Additional info:
screenshot: https://drive.google.com/file/d/1GOHJOxoBmtZX3QWDsIvc2RT5a2inkpzM/view?usp=sharing
Jose added a few other rendering tests that utilize different inputs and outputs. make render-sync should be able to prepare them.
Please review the following PR: https://github.com/openshift/gcp-pd-csi-driver/pull/53
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Due to structural changes in openshift/api, our generate make target fails after an API update:
make: *** No rule to make target 'vendor/github.com/openshift/api/monitoring/v1/0000_50_monitoring_01_alertingrules.crd.yaml', needed by 'jsonnet/crds/alertingrules-custom-resource-definition.json'. Stop.
This is a clone of issue OCPBUGS-32812. The following is the description of the original issue:
—
Description of problem:
When the image from a build is rolling out on the nodes, the update progress on the node is not displaying correctly.
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. Enable OCL functionality.
2. Opt the pool in via a MachineOSConfig.
3. Wait for the image to build and roll out.
4. Track the MCP update status with oc get mcp.
Actual results:
The MCP starts with 0 ready machines. While 1-2 machines have already been updated, the ready count still remains 0; it only jumps to 3 when all the machines are ready.
Expected results:
The update progress should be reflected in the mcp status correctly.
Additional info:
Description of problem:
When templating in-cluster resources, the cluster-network-operator in hypershift does not use the node-local address of the client-side haproxy load balancer that runs on all nodes. This bypasses a level of health checks for the redundant backend apiserver addresses that is performed by the local kube-apiserver-proxy pods running on every node in a hypershift environment. In environments where the backend API servers are not fronted by an additional cloud load balancer, this leads to a percentage of request failures from the in-cluster components occurring when a control plane endpoint goes down, even if other endpoints are available.
Version-Release number of selected component (if applicable):
4.16 4.15 4.14
How reproducible:
100%
Steps to Reproduce:
1. Set up a hypershift cluster in a baremetal/non-cloud environment where there are redundant API servers behind a DNS name that points directly to the node IPs.
2. Power down one of the control plane nodes.
3. Schedule workload into the cluster that depends on kube-proxy and/or multus to set up networking configuration.
4. You will see errors like the following:
```
add): Multus: [openshiftai/moe-8b-cmisale-master-0/9c1fd369-94f5-481c-a0de-ba81a3ee3583]: error getting pod: Get "https://[p9d81ad32fcdb92dbb598-6b64a6ccc9c596bf59a86625d8fa2202-c000.us-east.satellite.appdomain.cloud]:30026/api/v1/namespaces/openshiftai/pods/moe-8b-cmisale-master-0?timeout=1m0s": dial tcp 192.168.98.203:30026: connect: timeout
```
Actual results:
When a control plane node fails intermittent timeouts occur when kube-proxy/multus resolve the dns and a failed control plane node ip is returned
Expected results:
No requests fail (which will be the case if all traffic is routed through the node-local load balancer instance).
Additional info:
Additionally, control plane components in the management cluster that live next to the apiserver add unneeded dependencies by using an external DNS entry to talk to the kube-apiserver, when they could use the local kube-apiserver address so that all traffic goes over cluster-local networking.
This is a clone of issue OCPBUGS-36176. The following is the description of the original issue:
—
Description of problem:
The PowerVS CI uses the installer image to do some necessary setup. The openssl binary was recently removed from that image. So we need to switch to the upi-installer image.
Version-Release number of selected component (if applicable):
4.17
How reproducible:
Always
Steps to Reproduce:
1. Look at CI runs
This is a continuation of OCPBUGS-23342; now the vmware-vsphere-csi-driver-operator cannot connect to vCenter at all. Tested using invalid credentials.
The operator ends up with no Progressing condition during upgrade from 4.11 to 4.12, and cluster-storage-operator interprets it as Progressing=true.
Please review the following PR: https://github.com/openshift/containernetworking-plugins/pull/142
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
In the self-managed HCP use case, if the on-premise baremetal management cluster does not have nodes labeled with the "topology.kubernetes.io/zone" key, then all HCP pods for a High Available cluster are scheduled to a single mgmt cluster node. This is a result of the way the affinity rules are constructed. Take the pod affinity/antiAffinity example below, which is generated for a HA HCP cluster. If the "topology.kubernetes.io/zone" label does not exist on the mgmt cluster nodes, then the pod will still get scheduled but that antiAffinity rule is effectively ignored. That seems odd due to the usage of the "requiredDuringSchedulingIgnoredDuringExecution" value, but I have tested this and the rule truly is ignored if the topologyKey is not present.
podAffinity:
  preferredDuringSchedulingIgnoredDuringExecution:
  - podAffinityTerm:
      labelSelector:
        matchLabels:
          hypershift.openshift.io/hosted-control-plane: clusters-vossel1
      topologyKey: kubernetes.io/hostname
    weight: 100
podAntiAffinity:
  requiredDuringSchedulingIgnoredDuringExecution:
  - labelSelector:
      matchLabels:
        app: kube-apiserver
        hypershift.openshift.io/control-plane-component: kube-apiserver
    topologyKey: topology.kubernetes.io/zone
In the event that no "zones" are configured for the baremetal mgmt cluster, then the only other pod affinity rule is one that actually colocates the pods together. This results in a HA HCP having all the etcd, apiservers, etc... scheduled to a single node.
Version-Release number of selected component (if applicable):
4.14
How reproducible:
100%
Steps to Reproduce:
1. Create a self-managed HA HCP cluster on a mgmt cluster with nodes that lack the "topology.kubernetes.io/zone" label
Actual results:
all HCP pods are scheduled to a single node.
Expected results:
HCP pods should always be spread across multiple nodes.
Additional info:
A way to address this is to add another anti-affinity rule which prevents every component from being scheduled on the same node as its replicas
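As an illustration only (not the HyperShift implementation), below is a minimal Go sketch of such a rule using the upstream core/v1 types; the "app" label key and the function name are assumptions for the example. kubernetes.io/hostname is present on every node, so, unlike a missing zone label, this rule cannot be silently ignored:
```
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// hostnameAntiAffinity builds a required anti-affinity term that keeps pods
// carrying the given "app" label off any node that already runs one of them.
func hostnameAntiAffinity(app string) *corev1.PodAntiAffinity {
	return &corev1.PodAntiAffinity{
		RequiredDuringSchedulingIgnoredDuringExecution: []corev1.PodAffinityTerm{
			{
				LabelSelector: &metav1.LabelSelector{
					MatchLabels: map[string]string{"app": app},
				},
				// Every node carries this label, unlike topology.kubernetes.io/zone.
				TopologyKey: "kubernetes.io/hostname",
			},
		},
	}
}

func main() {
	fmt.Printf("%+v\n", hostnameAntiAffinity("kube-apiserver"))
}
```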
Current description of HighOverallControlPlaneCPU is wrong for SNO cases and can mislead users. We need to add information regarding SNO clusters to the description of the alert
Document URL:
[1] https://docs.openshift.com/container-platform/4.15/installing/installing_aws/installing-aws-account.html#installation-aws-permissions_installing-aws-account
Section Number and Name:
* Required EC2 permissions for installation
Description of problem:
The permission ec2:DisassociateAddress is required for OCP 4.16+ installs, but it is missing from the official doc [1] - we would like to understand why/if this permission is necessary.
level=info msg=Destroying the bootstrap resources... ... level=error msg=Error: disassociating EC2 EIP (eipassoc-01e8cc3f06f2c2499): UnauthorizedOperation: You are not authorized to perform this operation. User: arn:aws:iam::301721915996:user/ci-op-0xjvtwb0-4e979-minimal-perm is not authorized to perform: ec2:DisassociateAddress on resource: arn:aws:ec2:us-east-1:301721915996:elastic-ip/eipalloc-0274201623d8569af because no identity-based policy allows the ec2:DisassociateAddress action.
Version-Release number of selected component (if applicable):
4.16.0-0.nightly-2024-03-13-061822
How reproducible:
Always
Steps to Reproduce:
1. Create an OCP cluster with the permissions listed in the official doc.
Actual results:
See description.
Expected results:
Cluster is created successfully.
Suggestions for improvement:
Add ec2:DisassociateAddress to `Required EC2 permissions for installation` in [1]
Additional info:
This impacts the permission list in ROSA Installer-Role as well.
This is a clone of issue OCPBUGS-35801. The following is the description of the original issue:
—
Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
at:
github.com/openshift/cluster-openshift-controller-manager-operator/pkg/operator/internalimageregistry/cleanup_controller.go:146 +0xd65
Description of the problem:
Impossible to create an extra partition on the main disk at installation time with OCP 4.15. It works perfectly with 4.14 and under
I supply a custom machineconfig manifest to do so, and the behavior is that during installation, after reboot, the screen is blank and the host has no networking (no route to host).
A slack thread explaining the issue with further debugging can be consulted in https://redhat-internal.slack.com/archives/C999USB0D/p1707991107757299
The bug seems to be introduced in https://github.com/openshift/assisted-installer/pull/713 , which allows for one less reboot at installation time; to do that, it implements part of the post-reboot code. This code runs BEFORE the extra partition is created, which creates the problem.
How reproducible:
Always
Steps to reproduce:
1. Create a 4.15 cluster with an extra manifest that creates an extra partition at the end of the main disk
Example of machineconfig (change device to match installation disk):
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: master
  name: 98-extra-partition
spec:
  config:
    ignition:
      version: 3.2.0
    storage:
      disks:
      - device: /dev/vda
        partitions:
        - label: javi
          startMiB: 110000 # space left for the CoreOS partition.
          sizeMiB: 0 # Use all available space
2. Proceed with the installation
Actual results:
After reboot, node never comes back up
Expected results:
Cluster installs without problem
Please review the following PR: https://github.com/openshift/vmware-vsphere-csi-driver/pull/102
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
Builds from a buildconfig are failing on OCP 4.12.48. The developers are impacted since large files can't be cloned anymore within a BuildConfig.
Version-Release number of selected component (if applicable):
4.12.48
How reproducible:
Always
Steps to Reproduce:
The issue was fixed in version 4.12.45 as per https://issues.redhat.com/browse/OCPBUGS-23419, but the issue still persists in 4.12.48.
Actual results:
The build is failing.
Expected results:
The build should work without any issues.
Additional info:
Build fails with error: ``` Adding cluster TLS certificate authority to trust store Cloning "https://<path>.git" ... error: Downloading <github-repo>/projects/<path>.mp4 (70 MB) Error downloading object: <github-repo>/projects/<path>.mp4 (a11ce74): Smudge error: Error downloading <github-repo>/projects/<path>.mp4 (a11ce745c147aa031dd96915716d792828ae6dd17c60115b675aba75342bb95a): batch request: missing protocol: "origin.git/info/lfs" Errors logged to /tmp/build/inputs/.git/lfs/logs/20240430T112712.167008327.log Use `git lfs logs last` to view the log. error: external filter 'git-lfs filter-process' failed fatal: <github-repo>/projects/<path>.mp4: smudge filter lfs failed warning: Clone succeeded, but checkout failed. You can inspect what was checked out with 'git status' and retry with 'git restore --source=HEAD :/' ```
In 4.15, when the agent installer is run using the openshift-baremetal-installer binary with an install-config containing platform data, it attempts to contact libvirt to validate the provisioning network interfaces for the bootstrap VM. This should never happen, as the agent installer doesn't use the bootstrap VM.
It is possible that users in the process of converting from baremetal IPI to the agent installer might run into this issue, since they would already be using the openshift-baremetal-installer binary.
This is a clone of issue OCPBUGS-35430. The following is the description of the original issue:
—
Description of problem:
Query the CAPI provider for the timeouts needed during provisioning. This is optional to support. The current default of 15 minutes is sufficient for normal CAPI installations. However, given how the current PowerVS CAPI provider waits for some resources to be created before creating the load balancers, it is possible that the LBs will not create before the 15 minute timeout. An issue was created to track this [1]. [1] kubernetes-sigs/cluster-api-provider-ibmcloud#1837
Description of problem:
Invalid volume size when restoring as a new PVC from a VolumeSnapshot. The size unit is undefined and appears as TiB. Please check the attachment.
[Env]
dell isilon CSI volumes
Please review the following PR: https://github.com/openshift/installer/pull/7818
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
This is a clone of issue OCPBUGS-35467. The following is the description of the original issue:
—
Description of problem:
openshift-install is creating user-defined tags (platform.aws.userTags) on subnets of a BYO VPC (unmanaged VPC) deployment on AWS when using CAPA. The documentation [1] for userTags states: > A map of keys and values that the installation program adds as tags to all resources that it creates. So when the network (VPC and subnets) is managed by the user (BYO VPC), the installer should not create the additional tags provided in install-config.yaml. Investigating the CAPA codebase, the feature gate TagUnmanagedNetworkResources is enabled, and the subnet is propagating the userTags in the reconciliation loop [2].
[1] https://docs.openshift.com/container-platform/4.15/installing/installing_aws/installation-config-parameters-aws.html
[2] https://github.com/kubernetes-sigs/cluster-api-provider-aws/blob/main/pkg/cloud/services/network/subnets.go#L618
Version-Release number of selected component (if applicable):
4.16.0-ec.6-x86_64
How reproducible:
always
Steps to Reproduce:
1. Create the VPC and subnets using CloudFormation. Example template: https://github.com/openshift/installer/blob/master/upi/aws/cloudformation/01_vpc.yaml
2. Create an install-config with user tags and subnet IDs to install the cluster.
3. Create the cluster with the feature gate for CAPI:
```
featureSet: CustomNoUpgrade
featureGates:
- ClusterAPIInstall=true
metadata:
  name: "${CLUSTER_NAME}"
platform:
  aws:
    region: us-east-1
    subnets:
    - subnet-0165c70573a45651c
    - subnet-08540527fffeae3e9
    userTags:
      x-red-hat-clustertype: installer
      x-red-hat-managed: "true"
```
Actual results:
installer/CAPA is setting the user-defined tags in unmanaged subnets
Expected results:
- installer/CAPA does not create userTags on unmanaged subnets
- userTags is applied for the regular/standard workflow (managed VPC) with CAPA
Additional info:
- Impacting on SD/ROSA: https://redhat-internal.slack.com/archives/CCPBZPX7U/p1717588837289489
Seen in this 4.15 to 4.16 CI run:
: [sig-cluster-lifecycle] pathological event should not see excessive Back-off restarting failed containers 0s { event [namespace/openshift-machine-api node/ip-10-0-62-147.us-west-2.compute.internal pod/cluster-baremetal-operator-574577fbcb-z8nd4 hmsg/bf39bb17ae - Back-off restarting failed container cluster-baremetal-operator in pod cluster-baremetal-operator-574577fbcb-z8nd4_openshift-machine-api(441969c1-b430-412c-b67f-4ae2f7797f4f)] happened 26 times event [namespace/openshift-machine-api node/ip-10-0-62-147.us-west-2.compute.internal pod/cluster-baremetal-operator-574577fbcb-z8nd4 hmsg/bf39bb17ae - Back-off restarting failed container cluster-baremetal-operator in pod cluster-baremetal-operator-574577fbcb-z8nd4_openshift-machine-api(441969c1-b430-412c-b67f-4ae2f7797f4f)] happened 51 times}
The operator recovered, and the update completed, but it's still probably worth cleaning up whatever's happening to avoid alarming anyone.
Seems like all recent CI runs that match this string touch 4.15, 4.16, or development branches:
$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=24h&type=junit&search=Back-off+restarting+failed+container+cluster-baremetal-operator+in+pod+cluster-baremetal-operator' | grep 'failures match' pull-ci-openshift-ovn-kubernetes-master-e2e-aws-ovn-upgrade-local-gateway (all) - 11 runs, 36% failed, 25% of failures match = 9% impact periodic-ci-openshift-multiarch-master-nightly-4.16-ocp-e2e-upgrade-azure-ovn-heterogeneous (all) - 15 runs, 20% failed, 33% of failures match = 7% impact pull-ci-openshift-kubernetes-master-e2e-aws-ovn-downgrade (all) - 3 runs, 67% failed, 50% of failures match = 33% impact periodic-ci-openshift-multiarch-master-nightly-4.16-ocp-e2e-aws-ovn-heterogeneous-upgrade (all) - 15 runs, 27% failed, 25% of failures match = 7% impact periodic-ci-openshift-release-master-ci-4.16-upgrade-from-stable-4.15-e2e-azure-sdn-upgrade (all) - 32 runs, 91% failed, 7% of failures match = 6% impact periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-aws-ovn-upgrade (all) - 40 runs, 25% failed, 20% of failures match = 5% impact periodic-ci-openshift-release-master-nightly-4.15-upgrade-from-stable-4.14-e2e-aws-sdn-upgrade (all) - 3 runs, 33% failed, 100% of failures match = 33% impact pull-ci-openshift-cluster-version-operator-master-e2e-agnostic-ovn-upgrade-out-of-change (all) - 4 runs, 25% failed, 100% of failures match = 25% impact periodic-ci-openshift-release-master-ci-4.16-upgrade-from-stable-4.15-e2e-aws-ovn-upgrade (all) - 40 runs, 8% failed, 33% of failures match = 3% impact pull-ci-openshift-azure-file-csi-driver-operator-main-e2e-azure-ovn-upgrade (all) - 2 runs, 100% failed, 50% of failures match = 50% impact periodic-ci-openshift-release-master-okd-4.15-e2e-aws-ovn-upgrade (all) - 7 runs, 43% failed, 33% of failures match = 14% impact pull-ci-openshift-origin-master-e2e-aws-ovn-upgrade (all) - 10 runs, 30% failed, 33% of failures match = 10% impact periodic-ci-openshift-multiarch-master-nightly-4.15-upgrade-from-stable-4.14-ocp-e2e-upgrade-gcp-ovn-arm64 (all) - 6 runs, 33% failed, 50% of failures match = 17% impact periodic-ci-openshift-multiarch-master-nightly-4.15-upgrade-from-stable-4.14-ocp-e2e-aws-ovn-heterogeneous-upgrade (all) - 11 runs, 18% failed, 50% of failures match = 9% impact
Looks like ~8% impact.
Steps to Reproduce:
1. Run ~20 exposed job types.
2. Check for "[sig-cluster-lifecycle] pathological event should not see excessive Back-off restarting failed containers" failures with "Back-off restarting failed container cluster-baremetal-operator" messages.
Actual results:
~8% impact.
Expected results:
~0% impact.
Additional info:
Dropping into Loki for the run I'd picked:
{invoker="openshift-internal-ci/periodic-ci-openshift-release-master-ci-4.16-upgrade-from-stable-4.15-e2e-aws-ovn-upgrade/1737335551998038016"} | unpack | pod="cluster-baremetal-operator-574577fbcb-z8nd4" container="cluster-baremetal-operator" |~ "220 06:0"
includes:
E1220 06:04:18.794548 1 main.go:131] "unable to create controller" err="unable to put \"baremetal\" ClusterOperator in Available state: Operation cannot be fulfilled on clusteroperators.config.openshift.io \"baremetal\": the object has been modified; please apply your changes to the latest version and try again" controller="Provisioning" I1220 06:05:40.753364 1 listener.go:44] controller-runtime/metrics "msg"="Metrics server is starting to listen" "addr"=":8080" I1220 06:05:40.766200 1 webhook.go:104] WebhookDependenciesReady: everything ready for webhooks I1220 06:05:40.780426 1 clusteroperator.go:217] "new CO status" reason="WaitingForProvisioningCR" processMessage="" message="Waiting for Provisioning CR on BareMetal Platform" E1220 06:05:40.795555 1 main.go:131] "unable to create controller" err="unable to put \"baremetal\" ClusterOperator in Available state: Operation cannot be fulfilled on clusteroperators.config.openshift.io \"baremetal\": the object has been modified; please apply your changes to the latest version and try again" controller="Provisioning" I1220 06:08:21.730591 1 listener.go:44] controller-runtime/metrics "msg"="Metrics server is starting to listen" "addr"=":8080" I1220 06:08:21.747466 1 webhook.go:104] WebhookDependenciesReady: everything ready for webhooks I1220 06:08:21.768138 1 clusteroperator.go:217] "new CO status" reason="WaitingForProvisioningCR" processMessage="" message="Waiting for Provisioning CR on BareMetal Platform" E1220 06:08:21.781058 1 main.go:131] "unable to create controller" err="unable to put \"baremetal\" ClusterOperator in Available state: Operation cannot be fulfilled on clusteroperators.config.openshift.io \"baremetal\": the object has been modified; please apply your changes to the latest version and try again" controller="Provisioning"
So some kind of ClusterOperator-modification race?
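For illustration, the usual remedy for "the object has been modified" conflicts is to re-read the object and retry the write with client-go's retry helper. The sketch below uses hypothetical getLatest/updateStatus stand-ins and a plain map instead of the real ClusterOperator client; it is not the operator's actual code:
```
package main

import (
	"fmt"

	"k8s.io/client-go/util/retry"
)

// setAvailable retries the status write on conflict. RetryOnConflict only
// retries when the callback returns a conflict error from the API server.
func setAvailable(getLatest func() (map[string]string, error), updateStatus func(map[string]string) error) error {
	return retry.RetryOnConflict(retry.DefaultRetry, func() error {
		co, err := getLatest() // always mutate the freshest copy
		if err != nil {
			return err
		}
		co["Available"] = "True"
		return updateStatus(co)
	})
}

func main() {
	state := map[string]string{}
	err := setAvailable(
		func() (map[string]string, error) { return state, nil },
		func(m map[string]string) error { return nil },
	)
	fmt.Println(err, state)
}
```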
Description of problem:
CredentialsRequest for Azure AD workload identity contains unnecessary permissions under `virtualMachines/extensions`. Specifically write and delete.
Version-Release number of selected component (if applicable):
4.14.0+
How reproducible:
Every time
Steps to Reproduce:
1. Create a cluster without the CredentialsRequest permissions mentioned 2. Scale machineset 3. See no permission errors
Actual results:
We have unnecessary permissions, but still no errors
Expected results:
Still no permission errors after removal.
Additional info:
RHCOS doesn't leverage virtual machine extensions. It appears as though the code path is dead.
Description of problem:
Ran into a problem with our testing this morning on a newly created ROKS cluster:
```
Error running /usr/bin/oc --namespace=e2e-test-oc-service-p4fz2 --kubeconfig=/tmp/configfile2694323048 create service nodeport mynodeport --tcp=8080:7777 --node-port=30000:
StdOut> error: failed to create NodePort service: Service "mynodeport" is invalid: spec.ports[0].nodePort: Invalid value: 30000: provided port is already allocated
StdErr> error: failed to create NodePort service: Service "mynodeport" is invalid: spec.ports[0].nodePort: Invalid value: 30000: provided port is already allocated
exit status 1
```
The port was already used by a different service. We would like to make a feature request for the test to choose the port number dynamically, so that if that port is taken, an available one is picked instead.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
Actual results:
Expected results:
Additional info:
Please review the following PR: https://github.com/openshift/machine-api-provider-powervs/pull/70
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
The default value of the --parallelism option cannot be parsed as an int.
Version-Release number of selected component (if applicable):
4.16.0-0.nightly-2024-02-02-002725
How reproducible:
reproduce with cmd copy-to-node
Steps to Reproduce:
Cmd: "oc --namespace=e2e-test-mco-4zb88 --kubeconfig=/tmp/kubeconfig-3071675436 adm copy-to-node node/ip-10-0-17-85.ec2.internal --copy=/tmp/fetch-w637bgyv=/etc/mco-compressed-test-file", StdErr: "error: --parallelism must be either N or N%: strconv.ParseInt: parsing \"10%%\": invalid syntax",
Actual results:
default value of --parallelism cannot be parsed
Expected results:
no error
Additional info:
There is hack code that appends % to the default value, ref:
https://github.com/openshift/oc/blob/79dc671bdaeafa74b92f14ad9f6d84e344608034/pkg/cli/admin/pernodepod/command.go#L75-L79
https://github.com/openshift/oc/blob/79dc671bdaeafa74b92f14ad9f6d84e344608034/pkg/cli/admin/pernodepod/command.go#L94
The err var percentParseErr should be used.
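For illustration only (not the oc implementation), a minimal sketch of parsing a value that may be either N or N%, which is what the appended-% default requires:
```
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseParallelism accepts either an absolute count ("5") or a percentage
// ("10%") and reports which form was given.
func parseParallelism(v string) (n int64, isPercent bool, err error) {
	if strings.HasSuffix(v, "%") {
		n, err = strconv.ParseInt(strings.TrimSuffix(v, "%"), 10, 32)
		return n, true, err
	}
	n, err = strconv.ParseInt(v, 10, 32)
	return n, false, err
}

func main() {
	fmt.Println(parseParallelism("10%")) // 10 true <nil>
	fmt.Println(parseParallelism("5"))   // 5 false <nil>
}
```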
This has been reported by lance5890 upstream: https://github.com/openshift/cluster-etcd-operator/issues/1237
Description of problem:
During master node removal (out of 3), the etcd cert signer controller might still rollout a revision even though quorum is obviously going to be broken with that. Important events: 08:06:26.674067 1 event.go:285] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-etcd-operator", Name:"etcd-operator", UID:"253380d4-7d65-496f-8214-ab89f7878550", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Warning' reason: 'MasterNodeRemoved' Observed removal of master node node3 08:06:26.909780 1 base_controller.go:272] EtcdEndpointsController reconciliation failed: EtcdEndpointsController can't evaluate whether quorum is safe: CheckSafeToScaleCluster 3 nodes are required, but only 2 are available 08:06:27.005308 1 event.go:285] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-etcd-operator", Name:"etcd-operator", UID:"253380d4-7d65-496f-8214-ab89f7878550", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'SecretUpdated' Updated Secret/etcd-all-certs -n openshift-etcd because it changed 08:06:27.149860 1 base_controller.go:272] EtcdCertSignerController reconciliation failed: EtcdCertSignerController can't evaluate whether quorum is safe: CheckSafeToScaleCluster 3 nodes are required, but only 2 are available
Version-Release number of selected component (if applicable):
All versions since the quorum guard was introduced (> 4.12 currently applicable).
How reproducible:
Depends on the timing of the removal and the controller runs, but it is somewhat frequent.
Steps to Reproduce:
1. remove a master node 2. wait for quorum loss / downtime due to revision rollout
Actual results:
quorum is lost and there is brief api downtime during the revision is rolled out
Expected results:
the revisioned secret should not be updated when quorum is about to be lost
Additional info:
Please review the following PR: https://github.com/openshift/operator-framework-operator-controller/pull/51
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
Hosted control plane kube scheduler pods crashloop on clusters created with Kube 1.29 rebase
Version-Release number of selected component (if applicable):
4.16
How reproducible:
always
Steps to Reproduce:
1. Create a hosted cluster using 4.16 kube rebase code base 2. Wait for the cluster to come up
Actual results:
Cluster never comes up because kube scheduler pod crashloops
Expected results:
Cluster comes up
Additional info:
The kube scheduler configuration generated by the control plane operator is using the v1beta3 version of the configuration. That version is no longer included in Kubernetes v1.29
This is a clone of issue OCPBUGS-33973. The following is the description of the original issue:
—
Description of problem:
The network resource provisioning playbook for 4.15 dualstack UPI contains a task for adding an IPv6 subnet to the existing external router [1]. This task fails with:
- ansible-2.9.27-1.el8ae.noarch & ansible-collections-openstack-1.8.0-2.20220513065417.5bb8312.el8ost.noarch in an OSP 16 env (RHEL 8.5), or
- openstack-ansible-core-2.14.2-4.1.el9ost.x86_64 & ansible-collections-openstack-1.9.1-17.1.20230621074746.0e9a6f2.el9ost.noarch in an OSP 17 env (RHEL 9.2)
Besides that, we need a way to identify the resources of a particular deployment, as it may interfere with an existing one.
Version-Release number of selected component (if applicable):
4.15.0-0.nightly-2024-01-22-160236
How reproducible:
Always
Steps to Reproduce:
1. Set the os_subnet6 in the inventory file for setting dualstack 2. Run the 4.15 network.yaml playbook
Actual results:
Playbook fails: TASK [Add IPv6 subnet to the external router] ********************************** fatal: [localhost]: FAILED! => {"changed": false, "extra_data": {"data": null, "details": "Invalid input for external_gateway_info. Reason: Validation of dictionary's keys failed. Expected keys: {'network_id'} Provided keys: {'external_fixed_ips'}.", "response": "{\"NeutronError\": {\"type\": \"HTTPBadRequest\", \"message\": \"Invalid input for external_gateway_info. Reason: Validation of dictionary's keys failed. Expected keys: {'network_id'} Provided keys: {'external_fixed_ips'}.\", \"detail\": \"\"}}"}, "msg": "Error updating router 8352c9c0-dc39-46ed-94ed-c038f6987cad: Client Error for url: https://10.46.43.81:13696/v2.0/routers/8352c9c0-dc39-46ed-94ed-c038f6987cad, Invalid input for external_gateway_info. Reason: Validation of dictionary's keys failed. Expected keys: {'network_id'} Provided keys: {'external_fixed_ips'}."}
Expected results:
Successful playbook execution
Additional info:
The router can be created in two different tasks; the playbook [2] worked for me.
[1] https://github.com/openshift/installer/blob/1349161e2bb8606574696bf1e3bc20ae054e60f8/upi/openstack/network.yaml#L43
[2] https://file.rdu.redhat.com/juriarte/upi/network.yaml
Hello Team,
After a hard reboot of all nodes due to a power outage, a failed pull of the NTO image prevents "ocp-tuned-one-shot.service" from starting, resulting in a dependency failure for the kubelet and crio services:
------------
journalctl_--no-pager
Aug 26 17:07:46 ocp05 systemd[1]: Reached target The firstboot OS update has completed.
Aug 26 17:07:46 ocp05 resolv-prepender.sh[3577]: NM resolv-prepender: Starting download of baremetal runtime cfg image
Aug 26 17:07:46 ocp05 systemd[1]: Starting Writes IP address configuration so that kubelet and crio services select a valid node IP...
Aug 26 17:07:46 ocp05 systemd[1]: Starting TuneD service from NTO image...
Aug 26 17:07:46 ocp05 nm-dispatcher[3687]: NM resolv-prepender triggered by lo up.
Aug 26 17:07:46 ocp05 resolv-prepender.sh[3644]: Trying to pull quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:cf4faeb258c222ba4e04806fd3a7373d3bc1f43a66e141d4b7ece0307f597c72...
Aug 26 17:07:46 ocp05 nm-dispatcher[3720]: + [[ OVNKubernetes == \O\V\N\K\u\b\e\r\n\e\t\e\s ]]
Aug 26 17:07:46 ocp05 nm-dispatcher[3720]: + [[ lo == \W\i\r\e\d\ \C\o\n\n\e\c\t\i\o\n ]]
Aug 26 17:07:46 ocp05 nm-dispatcher[3720]: + '[' -z ']'
Aug 26 17:07:46 ocp05 nm-dispatcher[3720]: + echo 'Not a DHCP4 address. Ignoring.'
Aug 26 17:07:46 ocp05 nm-dispatcher[3720]: Not a DHCP4 address. Ignoring.
Aug 26 17:07:46 ocp05 nm-dispatcher[3720]: + exit 0
Aug 26 17:07:46 ocp05 nm-dispatcher[3722]: + '[' -z '' ']'
Aug 26 17:07:46 ocp05 nm-dispatcher[3722]: + echo 'Not a DHCP6 address. Ignoring.'
Aug 26 17:07:46 ocp05 nm-dispatcher[3722]: Not a DHCP6 address. Ignoring.
Aug 26 17:07:46 ocp05 nm-dispatcher[3722]: + exit 0
Aug 26 17:07:46 ocp05 bash[3655]: Trying to pull quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:cf4faeb258c222ba4e04806fd3a7373d3bc1f43a66e141d4b7ece0307f597c72...
Aug 26 17:07:46 ocp05 podman[3661]: Trying to pull quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:4b6ace44ba73bc0cef451bcf755c7fcddabe66b79df649058dc4b263e052ae26...
Aug 26 17:07:46 ocp05 podman[3661]: Error: initializing source docker://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:4b6ace44ba73bc0cef451bcf755c7fcddabe66b79df649058dc4b263e052ae26: pinging container registry quay.io: Get "https://quay.io/v2/": dial tcp: lookup quay.io on 10.112.227.10:53: server misbehaving
Aug 26 17:07:46 ocp05 systemd[1]: ocp-tuned-one-shot.service: Main process exited, code=exited, status=125/n/a
Aug 26 17:07:46 ocp05 nm-dispatcher[3793]: NM resolv-prepender triggered by brtrunk up.
Aug 26 17:07:46 ocp05 systemd[1]: ocp-tuned-one-shot.service: Failed with result 'exit-code'.
Aug 26 17:07:46 ocp05 nm-dispatcher[3803]: + [[ OVNKubernetes == \O\V\N\K\u\b\e\r\n\e\t\e\s ]]
Aug 26 17:07:46 ocp05 nm-dispatcher[3803]: + [[ brtrunk == \W\i\r\e\d\ \C\o\n\n\e\c\t\i\o\n ]]
Aug 26 17:07:46 ocp05 nm-dispatcher[3803]: + '[' -z ']'
Aug 26 17:07:46 ocp05 nm-dispatcher[3803]: + echo 'Not a DHCP4 address. Ignoring.'
Aug 26 17:07:46 ocp05 nm-dispatcher[3803]: Not a DHCP4 address. Ignoring.
Aug 26 17:07:46 ocp05 nm-dispatcher[3803]: + exit 0
Aug 26 17:07:46 ocp05 systemd[1]: Failed to start TuneD service from NTO image.
Aug 26 17:07:46 ocp05 systemd[1]: Dependency failed for Dependencies necessary to run kubelet.
Aug 26 17:07:46 ocp05 systemd[1]: Dependency failed for Kubernetes Kubelet.
Aug 26 17:07:46 ocp05 systemd[1]: kubelet.service: Job kubelet.service/start failed with result 'dependency'.
Aug 26 17:07:46 ocp05 systemd[1]: Dependency failed for Container Runtime Interface for OCI (CRI-O).
Aug 26 17:07:46 ocp05 systemd[1]: crio.service: Job crio.service/start failed with result 'dependency'.
Aug 26 17:07:46 ocp05 systemd[1]: kubelet-dependencies.target: Job kubelet-dependencies.target/start failed with result 'dependency'.
Aug 26 17:07:46 ocp05 nm-dispatcher[3804]: + '[' -z '' ']'
Aug 26 17:07:46 ocp05 nm-dispatcher[3804]: + echo 'Not a DHCP6 address. Ignoring.'
Aug 26 17:07:46 ocp05 nm-dispatcher[3804]: Not a DHCP6 address. Ignoring.
Aug 26 17:07:46 ocp05 nm-dispatcher[3804]: + exit 0
-----------
-----------
$ oc get proxy config cluster -oyaml
status:
httpProxy: http://proxy_ip:8080
httpsProxy: http://proxy_ip:8080
$ cat /etc/mco/proxy.env
HTTP_PROXY=http://proxy_ip:8080
HTTPS_PROXY=http://proxy_ip:8080
-----------
-----------
× ocp-tuned-one-shot.service - TuneD service from NTO image
Loaded: loaded (/etc/systemd/system/ocp-tuned-one-shot.service; enabled; preset: disabled)
Active: failed (Result: exit-code) since Mon 2024-08-26 17:07:46 UTC; 2h 30min ago
Main PID: 3661 (code=exited, status=125)
Aug 26 17:07:46 ocp05 podman[3661]: Error: initializing source docker://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:4b6ace44ba73bc0cef451bcf755c7fcddabe66b79df649058dc4b263e052ae26: pinging container registry quay.io: Get "https://quay.io/v2/": dial tcp: lookup quay.io on 10.112.227.10:53: server misbehaving
-----------
This is a clone of issue OCPBUGS-23332. The following is the description of the original issue:
—
Description of problem:
Navigate to the Node overview and check the utilization of CPU and memory; it shows something like "6.53 GiB available of 300 MiB total limit", which looks very confusing.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. Navigate to the Node overview. 2. Check the utilization of CPU and memory.
Actual results:
Expected results:
Additional info:
Description of the problem:
API tests, running from test-infra, set OPENSHIFT_VERSION=4.15.
We expect the service to return the latest stable version (x.y.z).
The returned version is 4.15.8-multi, which is not from the stable stream but from candidate, and should not be chosen.
This behaviour breaks the API tests because we expect the latest stable release to be picked when sending Major.Minor.
How reproducible:
Steps to reproduce:
Actual results:
Expected results:
Please review the following PR: https://github.com/openshift/ovn-kubernetes/pull/1980
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
: [bz-Routing] clusteroperator/ingress should not change
Has been failing for over a month in the e2e-metal-ipi-sdn-bm-upgrade jobs
I think this is because there are only two worker nodes in the BM environment, and some HA services lose redundancy when one of the workers is rebooted.
In the medium term I hope to add another node to each cluster, but in the short term we should skip the test.
Description of problem:
When using the OpenShift Assisted Installer with a pull-secret password containing the `:` (colon) character, the install fails.
Version-Release number of selected component (if applicable):
OpenShift 4.15
How reproducible:
Everytime
Steps to Reproduce:
1. Attempt to install using the Agent-based installer with a pull-secret that includes a colon character. The following snippet of code appears to be hit when there is a colon within the user/password section of the pull-secret: https://github.com/openshift/assisted-service/blob/d3dd2897d1f6fe108353c9241234a724b30262c2/internal/cluster/validations/validations.go#L132-L135
Actual results:
Install fails
Expected results:
Install succeeds
Additional info:
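For illustration only (not the assisted-service code), a minimal sketch of why splitting the decoded auth value with a limit of 2 keeps passwords that contain colons intact; the function name is an assumption for the example:
```
package main

import (
	"encoding/base64"
	"fmt"
	"strings"
)

// splitAuth decodes a docker-config "auth" value and splits it into user and
// password. SplitN with a limit of 2 leaves any ':' inside the password alone;
// splitting on every ':' is what breaks such pull secrets.
func splitAuth(auth string) (user, pass string, err error) {
	raw, err := base64.StdEncoding.DecodeString(auth)
	if err != nil {
		return "", "", err
	}
	parts := strings.SplitN(string(raw), ":", 2)
	if len(parts) != 2 {
		return "", "", fmt.Errorf("auth is not in user:password format")
	}
	return parts[0], parts[1], nil
}

func main() {
	enc := base64.StdEncoding.EncodeToString([]byte("user:pa:ss:word"))
	fmt.Println(splitAuth(enc)) // user pa:ss:word <nil>
}
```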
Description of problem:
Observing the following test case failure in 4.14 to 4.15 and 4.15 to 4.16 upgrade CI runs continuously. [bz-Image Registry] clusteroperator/image-registry should not change condition/Available
4.14 Image: registry.ci.openshift.org/ocp-ppc64le/release-ppc64le:4.14.0-0.nightly-ppc64le-2024-01-15-085349
4.15 Image: registry.ci.openshift.org/ocp-ppc64le/release-ppc64le:4.15.0-0.nightly-ppc64le-2024-01-15-042536
Recent periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-gcp-ovn-rt-upgrade failure caused by
: [sig-cluster-lifecycle] pathological event should not see excessive Back-off restarting failed containers expand_less 0s { event [namespace/openshift-machine-api node/ci-op-j666c60n-23cd9-nb7wr-master-1 pod/cluster-baremetal-operator-79b78c4548-n5vrt hmsg/b7cb271b13 - Back-off restarting failed container cluster-baremetal-operator in pod cluster-baremetal-operator-79b78c4548-n5vrt_openshift-machine-api(32835332-fc25-4ddf-84ce-d3aa447d3ce0)] happened 25 times}
Shows in Component Readiness as unknown component
/ ovn upgrade-minor amd64 gcp rt > Unknown> [sig-cluster-lifecycle] pathological event should not see excessive Back-off restarting failed containers
We should update testBackoffStartingFailedContainer to check for / use known namespaces, creating a junit for each known namespace indicating pass, fail or flake.
The code already handles testBackoffStartingFailedContainerForE2ENamespaces
It looks as though we only check for known or e2e namespaces; we need to double check that, if that is the case, we are OK with events from potentially unknown namespaces getting through.
We should also review Sippy pathological tests for results that don't contain `for ns/namespace` format and review if they need to be broken out as well.
For each test that we break out we need to map the new namespace specific test to the correct component in the test mapping repository
Please review the following PR: https://github.com/openshift/csi-node-driver-registrar/pull/59
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
If CCM is disabled in a cloud such as AWS, installation will continue until it fails with ingress in LoadBalancerPending.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. Build an image with PRs openshift/cluster-cloud-controller-manager-operator#284, openshift/installer#7546, openshift/cluster-version-operator#979, openshift/machine-config-operator#3999.
2. Install a cluster on AWS with "baselineCapabilitySet: v4.14".
Actual results:
Installation failed, ingress LoadBalancerPending. $ oc get node NAME STATUS ROLES AGE VERSION ip-10-0-25-230.us-east-2.compute.internal Ready control-plane,master 86m v1.28.3+20a5764 ip-10-0-3-101.us-east-2.compute.internal Ready worker 78m v1.28.3+20a5764 ip-10-0-46-198.us-east-2.compute.internal Ready control-plane,master 87m v1.28.3+20a5764 ip-10-0-48-220.us-east-2.compute.internal Ready worker 80m v1.28.3+20a5764 ip-10-0-79-203.us-east-2.compute.internal Ready control-plane,master 86m v1.28.3+20a5764 ip-10-0-95-83.us-east-2.compute.internal Ready worker 78m v1.28.3+20a5764 $ oc get co NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE authentication 4.15.0-0.ci.test-2023-11-22-202439-ci-ln-d46q8qt-latest False False True 85m OAuthServerRouteEndpointAccessibleControllerAvailable: Get "https://oauth-openshift.apps.zhsun-aws1.qe.devcluster.openshift.com/healthz": dial tcp: lookup oauth-openshift.apps.zhsun-aws1.qe.devcluster.openshift.com on 172.30.0.10:53: no such host (this is likely result of malfunctioning DNS server) baremetal 4.15.0-0.ci.test-2023-11-22-202439-ci-ln-d46q8qt-latest True False False 84m cloud-credential 4.15.0-0.ci.test-2023-11-22-202439-ci-ln-d46q8qt-latest True False False 86m cluster-autoscaler 4.15.0-0.ci.test-2023-11-22-202439-ci-ln-d46q8qt-latest True False False 84m config-operator 4.15.0-0.ci.test-2023-11-22-202439-ci-ln-d46q8qt-latest True False False 85m console 4.15.0-0.ci.test-2023-11-22-202439-ci-ln-d46q8qt-latest False True False 79m DeploymentAvailable: 0 replicas available for console deployment... control-plane-machine-set 4.15.0-0.ci.test-2023-11-22-202439-ci-ln-d46q8qt-latest True False False 81m csi-snapshot-controller 4.15.0-0.ci.test-2023-11-22-202439-ci-ln-d46q8qt-latest True False False 85m dns 4.15.0-0.ci.test-2023-11-22-202439-ci-ln-d46q8qt-latest True False False 84m etcd 4.15.0-0.ci.test-2023-11-22-202439-ci-ln-d46q8qt-latest True False False 83m image-registry 4.15.0-0.ci.test-2023-11-22-202439-ci-ln-d46q8qt-latest True False False 78m ingress False True True 78m The "default" ingress controller reports Available=False: IngressControllerUnavailable: One or more status conditions indicate unavailable: LoadBalancerReady=False (LoadBalancerPending: The LoadBalancer service is pending) insights 4.15.0-0.ci.test-2023-11-22-202439-ci-ln-d46q8qt-latest True False False 78m kube-apiserver 4.15.0-0.ci.test-2023-11-22-202439-ci-ln-d46q8qt-latest True False False 71m kube-controller-manager 4.15.0-0.ci.test-2023-11-22-202439-ci-ln-d46q8qt-latest True False False 82m kube-scheduler 4.15.0-0.ci.test-2023-11-22-202439-ci-ln-d46q8qt-latest True False False 82m kube-storage-version-migrator 4.15.0-0.ci.test-2023-11-22-202439-ci-ln-d46q8qt-latest True False False 85m machine-api 4.15.0-0.ci.test-2023-11-22-202439-ci-ln-d46q8qt-latest True False False 77m machine-approver 4.15.0-0.ci.test-2023-11-22-202439-ci-ln-d46q8qt-latest True False False 84m machine-config 4.15.0-0.ci.test-2023-11-22-202439-ci-ln-d46q8qt-latest True False False 84m marketplace 4.15.0-0.ci.test-2023-11-22-202439-ci-ln-d46q8qt-latest True False False 84m monitoring 4.15.0-0.ci.test-2023-11-22-202439-ci-ln-d46q8qt-latest True False False 73m network 4.15.0-0.ci.test-2023-11-22-202439-ci-ln-d46q8qt-latest True False False 86m node-tuning 4.15.0-0.ci.test-2023-11-22-202439-ci-ln-d46q8qt-latest True False False 78m openshift-apiserver 4.15.0-0.ci.test-2023-11-22-202439-ci-ln-d46q8qt-latest True False False 71m openshift-controller-manager 
4.15.0-0.ci.test-2023-11-22-202439-ci-ln-d46q8qt-latest True False False 75m openshift-samples 4.15.0-0.ci.test-2023-11-22-202439-ci-ln-d46q8qt-latest True False False 78m service-ca 4.15.0-0.ci.test-2023-11-22-202439-ci-ln-d46q8qt-latest True False False 85m storage 4.15.0-0.ci.test-2023-11-22-202439-ci-ln-d46q8qt-latest True False
Expected results:
Tell users not to turn CCM off for cloud.
Additional info:
Description of problem:
The installer supports pre-rendering of the PerformanceProfile related manifests. However the MCO render is executed after the PerfProfile render and so the master and worker MachineConfigPools are created too late. This causes the installation process to fail with: Oct 18 18:05:25 localhost.localdomain bootkube.sh[537963]: I1018 18:05:25.968719 1 render.go:73] Rendering files into: /assets/node-tuning-bootstrap Oct 18 18:05:26 localhost.localdomain bootkube.sh[537963]: I1018 18:05:26.008421 1 render.go:133] skipping "/assets/manifests/99_feature-gate.yaml" [1] manifest because of unhandled *v1.FeatureGate Oct 18 18:05:26 localhost.localdomain bootkube.sh[537963]: I1018 18:05:26.013043 1 render.go:133] skipping "/assets/manifests/cluster-dns-02-config.yml" [1] manifest because of unhandled *v1.DNS Oct 18 18:05:26 localhost.localdomain bootkube.sh[537963]: I1018 18:05:26.021978 1 render.go:133] skipping "/assets/manifests/cluster-ingress-02-config.yml" [1] manifest because of unhandled *v1.Ingress Oct 18 18:05:26 localhost.localdomain bootkube.sh[537963]: I1018 18:05:26.023016 1 render.go:133] skipping "/assets/manifests/cluster-network-02-config.yml" [1] manifest because of unhandled *v1.Network Oct 18 18:05:26 localhost.localdomain bootkube.sh[537963]: I1018 18:05:26.023160 1 render.go:133] skipping "/assets/manifests/cluster-proxy-01-config.yaml" [1] manifest because of unhandled *v1.Proxy Oct 18 18:05:26 localhost.localdomain bootkube.sh[537963]: I1018 18:05:26.023445 1 render.go:133] skipping "/assets/manifests/cluster-scheduler-02-config.yml" [1] manifest because of unhandled *v1.Scheduler Oct 18 18:05:26 localhost.localdomain bootkube.sh[537963]: I1018 18:05:26.024475 1 render.go:133] skipping "/assets/manifests/cvo-overrides.yaml" [1] manifest because of unhandled *v1.ClusterVersion Oct 18 18:05:26 localhost.localdomain bootkube.sh[537963]: F1018 18:05:26.037467 1 cmd.go:53] no MCP found that matches performance profile node selector "node-role.kubernetes.io/master="
Version-Release number of selected component (if applicable):
4.14.0-rc.6
How reproducible:
Always
Steps to Reproduce:
1. Add an SNO PerformanceProfile to the installer's extra manifests. The node selector should be: "node-role.kubernetes.io/master="
Actual results:
no MCP found that matches performance profile node selector "node-role.kubernetes.io/master="
Expected results:
Installation completes
Additional info:
apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
  name: openshift-node-workload-partitioning-sno
spec:
  cpu:
    isolated: 4-X # <- must match the topology of the node
    reserved: 0-3
  nodeSelector:
    node-role.kubernetes.io/master: ""
Please review the following PR: https://github.com/openshift/cluster-ingress-operator/pull/1020
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
The funky job renaming done in these is breaking risk analysis. The disruption checking actually ran if you look in the logs, but we don't get far enough to generate the html template, I suspect because of the error returned by risk analysis.
Would be nice if both would work but I'd be happy just to get the disruption portion populated for now. I don't think the overall risk analysis will be easy.
Currently, when creating an Azure cluster, only the first node of the nodePool becomes ready and joins the cluster; all other Azure machines are stuck in the `Creating` state.
Description of problem:
The DaemonSet code causes any taints to be ignored; therefore the operator executes on IBM Cloud Bare Metal nodes.
Version-Release number of selected component (if applicable):
IBM Cloud Infrastructure Services (formerly known as VPC Infrastructure Environment), using IBM Cloud Bare Metal profiles with either Gen2 (Intel Cascade Lake) or Gen3 (Intel Sapphire Rapids) hardware. Special note - this refers to IBM Cloud Bare Metal, and NOT applicable to IBM Cloud Bare Metal (Classic) in the legacy Classic Infrastructure environment (AKA. Softlayer).
How reproducible:
Reproducible
Steps to Reproduce:
The IBM LAB team found a bug that causes errors on the bare metal worker nodes and is requesting a patch to ibm-vpc-block-csi-driver. The proposed solution: enforce the Namespace to select nodes where instance-type does NOT CONTAIN the substring 'metal'. This will stop the Namespace's DaemonSet from scheduling the operator on IBM Cloud Bare Metal nodes: https://github.com/openshift/ibm-vpc-block-csi-driver-operator/blob/master/manifests/01_namespace.yaml
```
kind: Namespace
apiVersion: v1
metadata:
  annotations:
    openshift.io/node-selector: 'node.openshift.io/instance-type notin (metal)'
```
Actual results:
Expected results:
Enforce the Namespace to select nodes where instance-type does NOT CONTAIN the substring 'metal'. This will stop the Namespace's DaemonSet from scheduling the operator on IBM Cloud Bare Metal nodes: https://github.com/openshift/ibm-vpc-block-csi-driver-operator/blob/master/manifests/01_namespace.yaml
Additional info:
03802506
Description of problem:
1) The customer tagged an image with a tag name that includes # (hashtag):
uk302-img-app-j:v0.6.12-build0000#000
2) When the customer used OADP to back up images, they got the error below:
error excuting custom action(groupResource=imagestream.image.openshift.io namespace=dbp-p0010001, name=uk302-image-app-j): rpc error: code= Unknown= Invalid destination name udistribution-s3-c9814a92-67a4-4251-bd0d-142dfc4d3c80://dbp-p0010001/uk302-image-app-j:v0.6.12-build0000#00: invalid reference format
3) When checking the source code below, we found that there is a check on the tag name; it seems # (hashtag) is not allowed by the regexp check:
func copyImage(log logr.Logger, src, dest string, copyOptions *copy.Options) ([]byte, error) {
	policyContext, err := getPolicyContext()
	if err != nil {
		return []byte{}, fmt.Errorf("Error loading trust policy: %v", err)
	}
	defer policyContext.Destroy()
	srcRef, err := alltransports.ParseImageName(src)
	if err != nil {
		return []byte{}, fmt.Errorf("Invalid source name %s: %v", src, err)
	}
	destRef, err := alltransports.ParseImageName(dest)
	if err != nil {
		return []byte{}, fmt.Errorf("Invalid destination name %s: %v", dest, err)
	}
https://github.com/containers/image/blob/main/docker/reference/regexp.go#L111
const (
    // alphaNumeric defines the alpha numeric atom, typically a
    // component of names. This only allows lower case characters and digits.
    alphaNumeric = `[a-z0-9]+`
    // separator defines the separators allowed to be embedded in name
    // components. This allow one period, one or two underscore and multiple
    // dashes. Repeated dashes and underscores are intentionally treated
    // differently. In order to support valid hostnames as name components,
    // supporting repeated dash was added. Additionally double underscore is
    // now allowed as a separator to loosen the restriction for previously
    // supported names.
    separator = `(?:[._]|__|[-]*)`
    // repository name to start with a component as defined by DomainRegexp
    // and followed by an optional port.
    domainComponent = `(?:[a-zA-Z0-9]|[a-zA-Z0-9][a-zA-Z0-9-]*[a-zA-Z0-9])`
    // The string counterpart for TagRegexp.
    tag = `[\w][\w.-]{0,127}`
    // The string counterpart for DigestRegexp.
    digestPat = `[A-Za-z][A-Za-z0-9]*(?:[-_+.][A-Za-z][A-Za-z0-9]*)*[:][[:xdigit:]]{32,}`
    // The string counterpart for IdentifierRegexp.
    identifier = `([a-f0-9]{64})`
    // The string counterpart for ShortIdentifierRegexp.
    shortIdentifier = `([a-f0-9]{6,64})`
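For illustration, a small shell check (my own approximation of the tag pattern above using POSIX character classes in place of `\w`; not part of the original report) shows that a tag containing '#' is rejected while the same tag without it passes:
```
# Approximate the tag pattern `[\w][\w.-]{0,127}` with POSIX-safe classes.
TAG_RE='^[A-Za-z0-9_][A-Za-z0-9_.-]{0,127}$'

echo 'v0.6.12-build0000#000' | grep -Eq "$TAG_RE" && echo valid || echo invalid   # prints: invalid
echo 'v0.6.12-build0000-000' | grep -Eq "$TAG_RE" && echo valid || echo invalid   # prints: valid
```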
Expected results: The customer wants to know whether this should be treated as a bug: when doing `oc tag`, there should be some validation of the tag name to prevent # (hashtag) or other disallowed characters from being set in the image tag, since that causes unexpected issues in tools such as OADP.
please have a check , thank you!
Regards
Jacob
This is a clone of issue OCPBUGS-43112. The following is the description of the original issue:
—
HCP fails to deploy with SR CSI driver failing to pull its image
Description of problem:
found typo in 4.14/4.15 branch when review PR: https://github.com/openshift/cluster-monitoring-operator/pull/2073
example typo in 4.14 branch
1. systemd unit pattern valiation error
valiation should be validation
2. enable systemd collector with invalid units parttern
parttern should be pattern
3. t.Fatalf("invalid secret namepace, got %s, want %s", s.Namespace, "openshift-user-workload-monitoring")
namepace should be namespace
4. spread contraints
should be spread constraints
Version-Release number of selected component (if applicable):
4.14/4.15
How reproducible:
always
I remember we've added golang-lint to the repo; it should find these errors.
Please review the following PR: https://github.com/openshift/csi-external-resizer/pull/154
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
AWS EBS, Azure Disk and Azure File operators are now built from cmd/ and pkg/, there is no code used from legacy/ dir and we should remove it.
There are test manifests in the legacy/ directory that are still used! They need to be moved somewhere else, and Dockerfile.*.test and the CI steps must be updated.
Technically, this is a copy of STOR-1797, but we need a bug to be able to backport aws-ebs changes to 4.15 and not use legacy/ directory there too.
[sig-arch][Late] collect certificate data [Suite:openshift/conformance/parallel]
Test is currently making the Unknown component red, but this test should be aligned to the kube-apiserver component. Looks like two others in the same file should be as well.
Assisted installer agent's api_vip check doesn't accept multiple headers (src). This poses an issue when there are different ignition servers (e.g. hypershift) that expect different headers.
Latest use case: HyperShift's ignition server expects these headers: NodePool name and targetconfigversionhash.
Description of problem:
When using OpenShift 4.15 on ROSA Hosted Control Planes, after disabling the ImageRegistry, the default secrets and service accounts are still being created. This functionality should not be occurring once the registry is removed: https://docs.openshift.com/rosa/nodes/pods/nodes-pods-secrets.html#auto-generated-sa-token-secrets_nodes-pods-secrets
Version-Release number of selected component (if applicable):
4.15
How reproducible:
Always
Steps to Reproduce:
1. Deploy ROSA 4.15 HCP Cluster 2. Set spec.managementState = "Removed" on the cluster.config.imageregistry.operator.openshift.io. The image registry will be removed 3. Create a new OpenShift Project 4. Observe the builder, default and deployer ServiceAccounts and their associated Secrets are still created
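A minimal sketch of the reproduction from the CLI (assuming cluster-admin access to the hosted cluster; the project name is only illustrative):
```
# Remove the integrated image registry.
oc patch configs.imageregistry.operator.openshift.io/cluster \
  --type merge -p '{"spec":{"managementState":"Removed"}}'

# Create a new project and check whether the builder/default/deployer
# service accounts and their secrets are still generated.
oc new-project sa-check
oc get serviceaccounts -n sa-check
oc get secrets -n sa-check
```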
Actual results:
Expected results:
Additional info:
Please review the following PR: https://github.com/openshift/openshift-apiserver/pull/409
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
Install a private cluster whose base domain in install-config.yaml is the same as an existing CIS domain name. After destroying the private cluster, the DNS resource records remain.
Version-Release number of selected component (if applicable):
4.15
How reproducible:
Always
Steps to Reproduce:
1. Create a DNS service instance, setting its domain to "ibmcloud.qe.devcluster.openshift.com". Note: this domain name is also being used by another existing CIS domain. 2. Install a private IBM Cloud cluster with the base domain in install-config set to "ibmcloud.qe.devcluster.openshift.com" 3. Destroy the cluster 4. Check the remaining DNS records
Actual results:
$ ibmcloud dns resource-records 5f8a0c4d-46c2-4daa-9157-97cb9ad9033a -i preserved-openshift-qe-private | grep ci-op-17qygd06-23ac4 api-int.ci-op-17qygd06-23ac4.ibmcloud.qe.devcluster.openshift.com *.apps.ci-op-17qygd06-23ac4.ibmcloud.qe.devcluster.openshift.com api.ci-op-17qygd06-23ac4.ibmcloud.qe.devcluster.openshift.com
Expected results:
No DNS records for the cluster should remain
Additional info:
$ ibmcloud dns zones -i preserved-openshift-qe-private | awk '{print $2}'
Name
private-ibmcloud.qe.devcluster.openshift.com
private-ibmcloud-1.qe.devcluster.openshift.com
ibmcloud.qe.devcluster.openshift.com
$ ibmcloud cis domains
Name
ibmcloud.qe.devcluster.openshift.com
When using private-ibmcloud.qe.devcluster.openshift.com or private-ibmcloud-1.qe.devcluster.openshift.com as the domain there is no such issue; when using ibmcloud.qe.devcluster.openshift.com as the domain, the DNS records remain.
Description of problem:
There is one pod of the metal3 operator in a constant failure state. The cluster was acting as a Hub cluster with ACM + GitOps for SNO installation. It was working well for a few days, until this moment when no other sites could be deployed.
oc get pods -A | grep metal3
openshift-machine-api metal3-64cf86fb8b-fg5b9 3/4 CrashLoopBackOff 35 (108s ago) 155m
openshift-machine-api metal3-baremetal-operator-84875f859d-6kj9s 1/1 Running 0 155m
openshift-machine-api metal3-image-customization-57f8d4fcd4-996hd 1/1 Running 0 5h
Version-Release number of selected component (if applicable):
OCP version: 4.16.ec5
How reproducible:
Once it starts to fail, it does not recover.
Steps to Reproduce:
1. Unclear. Install Hub cluster with ACM+GitOps 2. (Perhaps: Update AgentServiceConfig
Actual results:
Pod crashing and installation of spoke cluster fails
Expected results:
Pod running and installation of the spoke cluster succeeds.
Additional info:
Logs of metal3-ironic-inspector:
[kni@infra608-1 ~]$ oc logs pods/metal3-64cf86fb8b-fg5b9 -c metal3-ironic-inspector
+ CONFIG=/etc/ironic-inspector/ironic-inspector.conf
+ export IRONIC_INSPECTOR_ENABLE_DISCOVERY=false
+ IRONIC_INSPECTOR_ENABLE_DISCOVERY=false
+ export INSPECTOR_REVERSE_PROXY_SETUP=true
+ INSPECTOR_REVERSE_PROXY_SETUP=true
+ . /bin/tls-common.sh
++ export IRONIC_CERT_FILE=/certs/ironic/tls.crt
++ IRONIC_CERT_FILE=/certs/ironic/tls.crt
++ export IRONIC_KEY_FILE=/certs/ironic/tls.key
++ IRONIC_KEY_FILE=/certs/ironic/tls.key
++ export IRONIC_CACERT_FILE=/certs/ca/ironic/tls.crt
++ IRONIC_CACERT_FILE=/certs/ca/ironic/tls.crt
++ export IRONIC_INSECURE=true
++ IRONIC_INSECURE=true
++ export 'IRONIC_SSL_PROTOCOL=-ALL +TLSv1.2 +TLSv1.3'
++ IRONIC_SSL_PROTOCOL='-ALL +TLSv1.2 +TLSv1.3'
++ export 'IPXE_SSL_PROTOCOL=-ALL +TLSv1.2 +TLSv1.3'
++ IPXE_SSL_PROTOCOL='-ALL +TLSv1.2 +TLSv1.3'
++ export IRONIC_VMEDIA_SSL_PROTOCOL=ALL
++ IRONIC_VMEDIA_SSL_PROTOCOL=ALL
++ export IRONIC_INSPECTOR_CERT_FILE=/certs/ironic-inspector/tls.crt
++ IRONIC_INSPECTOR_CERT_FILE=/certs/ironic-inspector/tls.crt
++ export IRONIC_INSPECTOR_KEY_FILE=/certs/ironic-inspector/tls.key
++ IRONIC_INSPECTOR_KEY_FILE=/certs/ironic-inspector/tls.key
++ export IRONIC_INSPECTOR_CACERT_FILE=/certs/ca/ironic-inspector/tls.crt
++ IRONIC_INSPECTOR_CACERT_FILE=/certs/ca/ironic-inspector/tls.crt
++ export IRONIC_INSPECTOR_INSECURE=true
++ IRONIC_INSPECTOR_INSECURE=true
++ export IRONIC_VMEDIA_CERT_FILE=/certs/vmedia/tls.crt
++ IRONIC_VMEDIA_CERT_FILE=/certs/vmedia/tls.crt
++ export IRONIC_VMEDIA_KEY_FILE=/certs/vmedia/tls.key
++ IRONIC_VMEDIA_KEY_FILE=/certs/vmedia/tls.key
++ export IPXE_CERT_FILE=/certs/ipxe/tls.crt
++ IPXE_CERT_FILE=/certs/ipxe/tls.crt
++ export IPXE_KEY_FILE=/certs/ipxe/tls.key
++ IPXE_KEY_FILE=/certs/ipxe/tls.key
++ export RESTART_CONTAINER_CERTIFICATE_UPDATED=false
++ RESTART_CONTAINER_CERTIFICATE_UPDATED=false
++ export MARIADB_CACERT_FILE=/certs/ca/mariadb/tls.crt
++ MARIADB_CACERT_FILE=/certs/ca/mariadb/tls.crt
++ export IPXE_TLS_PORT=8084
++ IPXE_TLS_PORT=8084
++ mkdir -p /certs/ironic
++ mkdir -p /certs/ironic-inspector
++ mkdir -p /certs/ca/ironic
mkdir: cannot create directory '/certs/ca/ironic': Permission denied
Description of problem:
Agent based installation is stuck on the booting screen for the arm64 SNO cluster.
The installer should validate the architecture set by the user in install-config.yaml against the payload image being used.
Version-Release number of selected component (if applicable):
4.13
How reproducible:
100%
Steps to Reproduce:
[Fixed original version] 1. Create an agent ISO with the amd64 payload 2. Boot the created ISO on an arm64 server 3. Monitor the booting screen for errors [Generalized] 1. Set install-config.yaml controlPlane.architecture to arm64 2. Try to install with an
Actual results:
The installation is currently stuck on the initial booting screen.
Expected results:
The SNO cluster should be installed without any issues.
Additional info:
Compact cluster installation was successful, here is the prow ci link: https://gcsweb-qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/qe-private-deck/logs/periodic-ci-openshift-openshift-tests-private-release-4.13-arm64-nightly-baremetal-compact-agent-ipv4-static-connected-p1-f7/1665833590451081216/artifacts/baremetal-compact-agent-ipv4-static-connected-p1-f7/baremetal-lab-agent-install/build-log.txt
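One way to catch the mismatch before booting (a rough sketch; the release image variable and the jq path into the payload's image config are assumptions, not part of this report) is to compare the payload architecture with the architecture requested in install-config.yaml:
```
# Architecture the release payload was built for (jq path is an assumption).
oc adm release info "$RELEASE_IMAGE" -o json | jq -r '.config.architecture'

# Architecture requested for the control plane in the install config.
grep -A3 '^controlPlane:' install-config.yaml | grep architecture
```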
This is a clone of issue OCPBUGS-34699. The following is the description of the original issue:
—
Description of problem:
In the RBAC which is set up for networkTypes other than OVNKubernetes, the cluster-network-operator role allows access to a configmap named "openshift-service-ca.crt", but the configmap which is actually used is named "root-ca".
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
If two clusters share a single OpenStack project, cloud-provider-openstack won't distinguish type=LoadBalancer Services between them if they have the same namespace name and service name. https://github.com/kubernetes/cloud-provider-openstack/issues/2241
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. Deploy 2 clusters. 2. Create LoadBalancer Services of the same name in default namespaces of both clusters.
Actual results:
cloud-provider-openstack fights over ownership of the LB.
Expected results:
LBs are distinguished.
Additional info:
The cluster-ingress-operator repository vendors controller-runtime v0.16.3, which uses Kubernetes 1.28 packages. OpenShift 4.16 is based on Kubernetes 1.29.
4.16.
Always.
Check https://github.com/openshift/cluster-ingress-operator/blob/release-4.16/go.mod.
The sigs.k8s.io/controller-runtime package is at v0.16.3.
The sigs.k8s.io/controller-runtime package is at v0.17.0 or newer.
https://github.com/openshift/cluster-ingress-operator/pull/1016 already bumped the k8s.io/* packages to v0.29.0, but ideally the controller-runtime package should be bumped too. The controller-runtime v0.17 release includes some breaking changes, such as the removal of apiutil.NewDiscoveryRESTMapper; see the release notes at https://github.com/kubernetes-sigs/controller-runtime/releases/tag/v0.17.0.
Description of problem:
After running ./openshift-install destroy cluster, the TagCategory still exists.
# ./openshift-install destroy cluster --dir cluster --log-level debug
DEBUG OpenShift Installer 4.15.0-0.nightly-2023-12-18-220750
DEBUG Built from commit 2b894776f1653ab818e368fa625019a6de82a8c7
DEBUG Power Off Virtual Machines
DEBUG Powered off VirtualMachine=sgao-devqe-spn2w-master-2
DEBUG Powered off VirtualMachine=sgao-devqe-spn2w-master-1
DEBUG Powered off VirtualMachine=sgao-devqe-spn2w-master-0
DEBUG Powered off VirtualMachine=sgao-devqe-spn2w-worker-0-kpg46
DEBUG Powered off VirtualMachine=sgao-devqe-spn2w-worker-0-w5rrn
DEBUG Delete Virtual Machines
INFO Destroyed VirtualMachine=sgao-devqe-spn2w-rhcos-generated-region-generated-zone
INFO Destroyed VirtualMachine=sgao-devqe-spn2w-master-2
INFO Destroyed VirtualMachine=sgao-devqe-spn2w-master-1
INFO Destroyed VirtualMachine=sgao-devqe-spn2w-master-0
INFO Destroyed VirtualMachine=sgao-devqe-spn2w-worker-0-kpg46
INFO Destroyed VirtualMachine=sgao-devqe-spn2w-worker-0-w5rrn
DEBUG Delete Folder
INFO Destroyed Folder=sgao-devqe-spn2w
DEBUG Delete StoragePolicy=openshift-storage-policy-sgao-devqe-spn2w
INFO Destroyed StoragePolicy=openshift-storage-policy-sgao-devqe-spn2w
DEBUG Delete Tag=sgao-devqe-spn2w
INFO Deleted Tag=sgao-devqe-spn2w
DEBUG Delete TagCategory=openshift-sgao-devqe-spn2w
INFO Deleted TagCategory=openshift-sgao-devqe-spn2w
DEBUG Purging asset "Metadata" from disk
DEBUG Purging asset "Master Ignition Customization Check" from disk
DEBUG Purging asset "Worker Ignition Customization Check" from disk
DEBUG Purging asset "Terraform Variables" from disk
DEBUG Purging asset "Kubeconfig Admin Client" from disk
DEBUG Purging asset "Kubeadmin Password" from disk
DEBUG Purging asset "Certificate (journal-gatewayd)" from disk
DEBUG Purging asset "Cluster" from disk
INFO Time elapsed: 29s
INFO Uninstallation complete!
# govc tags.category.ls | grep sgao
openshift-sgao-devqe-spn2w
Version-Release number of selected component (if applicable):
4.15.0-0.nightly-2023-12-18-220750
How reproducible:
always
Steps to Reproduce:
1. IPI install OCP on vSphere 2. Destroy cluster installed, check TagCategory
Actual results:
TagCategory still exist
Expected results:
TagCategory should be deleted
Additional info:
Also reproduced in openshift-install-linux-4.14.0-0.nightly-2023-12-20-184526 and 4.13.0-0.nightly-2023-12-21-194724, while 4.12.0-0.nightly-2023-12-21-162946 does not have this issue.
Description of problem:
Missing Source column header in PVC > VolumeSnapshots tab
Version-Release number of selected component (if applicable):
Cluster 4.10, 4.14, 4.16
How reproducible:
Yes
Steps to Reproduce:
1. Create a PVC, e.g. "my-pvc" 2. Create a Pod and bind it to "my-pvc" 3. Create a VolumeSnapshot and associate it with "my-pvc" 4. Go to the PVC details > VolumeSnapshots tab
Actual results:
The Source column header is not displayed
Expected results:
The Source column header should be displayed
Additional info:
Please review the following PR: https://github.com/openshift/azure-disk-csi-driver/pull/65
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
When a namespace has a Resource Quota applied to it, the Workload Graphs in the Observe view do not render properly.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. Create a new project/namespace 2. Apply the following Resource Quota (just a sample) to it:
```
kind: ResourceQuota
apiVersion: v1
metadata:
  name: staging-workshop-quota
spec:
  hard:
    limits.cpu: '3'
    limits.memory: 3Gi
    pods: '10'
```
3. From the Developer Console, access the Observe view 4. From the Dashboard list, select the `Kubernetes / Compute Resources / Namespaces (Workloads)` option
Actual results:
The Graph is not rendered (see attached screenshot)
Expected results:
The Graph should render even with no data point
Additional info:
When you have a Resource Quota applied to the namespace you can try this query to see the `NaN` value returned:
```
curl -G -s -k -H "Authorization: Bearer $(oc whoami -t)" 'https://thanos-querier-openshift-monitoring.apps.cluster-your cluster domain here/api/v1/query' --data-urlencode 'query=scalar(kube_resourcequota{cluster="", namespace="user9-staging", type="hard",resource="requests.memory"})'
```
Sample response:
```
{"status":"success","data":{"resultType":"scalar","result":[1682600794.396,"NaN"]}}
```
Description of the problem:
Installation of a cluster using OCP image 4.15.0-rc.0 and an HTTP proxy configuration failed on:
3/3 control plane nodes failed to install. Please check the installation logs for more information."
and
"error Host master-0-1: updated status from installing-in-progress to error (Host failed to install because its installation stage Waiting for control plane took longer than expected 1h0m0s)"
After looking at the master-0-1 node journalctl log, we found the error:
"Dec 21 00:54:29 master-0-1 kubelet.sh[5111]: E1221 00:54:29.290568 5111 pod_workers.go:1300] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"etcd\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=etcd pod=etcd-bootstrap-member-master-0-1_openshift-etcd(97cad44a9feb70b1091eaa3fb1e565ca)\"" pod="openshift-etcd/etcd-bootstrap-member-master-0-1" podUID="97cad44a9feb70b1091eaa3fb1e565ca"
"
HTTP Proxy configuration works fine with OCP images 4.14.5 and 4.13.26
Steps to reproduce:
1. Setup HTTP Proxy server on hypervisor using quay.io/sameersbn/squid:3.5.27-2
2. Create cluster and got Host Discovery
3. Press Add Host. Fill out SSH public key
4. Select Show proxy settings and fill out
HTTP proxy URL and No proxy domains
5. Generate the ISO image, download it, and boot 3 master and 2 worker nodes.
6. Continue regular cluster installation
Actual results:
Installation failed at 69%
Expected results:
Installation passed
Please review the following PR: https://github.com/openshift/csi-driver-manila-operator/pull/227
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
This is a clone of issue OCPBUGS-33913. The following is the description of the original issue:
—
CI is occasionally bumping into failures like:
: [sig-arch][Feature:ClusterUpgrade] Cluster should remain functional during upgrade [Disruptive] [Serial] 53m22s
{ fail [github.com/openshift/origin/test/e2e/upgrade/upgrade.go:186]: during upgrade to registry.build05.ci.openshift.org/ci-op-kj8vc4dt/release@sha256:74bc38fc3a1d5b5ac8e84566d54d827c8aa88019dbdbf3b02bef77715b93c210: the "master" pool should be updated before the CVO reports available at the new version
Ginkgo exit error 1: exit with code 1}
where the machine-config operator is rolling the control-plane MachineConfigPool after the ClusterVersion update completes:
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.16-e2e-azure-ovn-upgrade/1791442450049404928/artifacts/e2e-azure-ovn-upgrade/gather-extra/artifacts/machineconfigpools.json | jq -r '.items[] | select(.metadata.name == "master").status | [.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + .reason + ": " + .message] | sort[]'
2024-05-17T12:57:04Z RenderDegraded=False :
2024-05-17T12:58:35Z Degraded=False :
2024-05-17T12:58:35Z NodeDegraded=False :
2024-05-17T15:13:22Z Updated=True : All nodes are updated with MachineConfig rendered-master-4fcadad80c9941813b00ca7e3eef8e69
2024-05-17T15:13:22Z Updating=False :
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.16-e2e-azure-ovn-upgrade/1791442450049404928/artifacts/e2e-azure-ovn-upgrade/gather-extra/artifacts/clusterversion.json | jq -r '.items[].status.history[0].completionTime'
2024-05-17T14:15:22Z
Because of changes to registry pull secrets:
$ dump() {
>   curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.16-e2e-azure-ovn-upgrade/1791442450049404928/artifacts/e2e-azure-ovn-upgrade/gather-extra/artifacts/machineconfigs.json | jq -r ".items[] | select(.metadata.name == \"$1\").spec.config.storage.files[] | select(.path == \"/etc/mco/internal-registry-pull-secret.json\").contents.source" | python3 -c 'import urllib.parse, sys; print(urllib.parse.unquote(sys.stdin.read()).split(",", 1)[-1])' | jq -c '.auths | to_entries[]'
> }
$ diff -u0 <(dump rendered-master-d6a8cd53ae132250832cc8267e070af6) <(dump rendered-master-4fcadad80c9941813b00ca7e3eef8e69) | sed 's/"value":.*/.../'
--- /dev/fd/63 2024-05-17 12:28:37.882351026 -0700
+++ /dev/fd/62 2024-05-17 12:28:37.883351026 -0700
@@ -1 +1 @@
-{"key":"172.30.124.169:5000",...
+{"key":"172.30.124.169:5000",...
@@ -3,3 +3,3 @@
-{"key":"default-route-openshift-image-registry.apps.ci-op-kj8vc4dt-6c39f.ci2.azure.devcluster.openshift.com",...
-{"key":"image-registry.openshift-image-registry.svc.cluster.local:5000",...
-{"key":"image-registry.openshift-image-registry.svc:5000",...
+{"key":"default-route-openshift-image-registry.apps.ci-op-kj8vc4dt-6c39f.ci2.azure.devcluster.openshift.com",...
+{"key":"image-registry.openshift-image-registry.svc.cluster.local:5000",...
+{"key":"image-registry.openshift-image-registry.svc:5000",...
Seen in 4.16-to-4.16 Azure update CI. Unclear what the wider scope is.
Sippy reports Success Rate: 94.27% post regression, so a rare race.
But using CI search to pick jobs with 10 or more runs over the past 2 days:
$ w3m -dump -cols 200 'https://search.dptools.openshift.org/?maxAge=48h&type=junit&search=master.*pool+should+be+updated+before+the+CVO+reports+available' | grep '[0-9][0-9] runs.*failures match' | sort
periodic-ci-openshift-release-master-ci-4.16-e2e-azure-ovn-upgrade (all) - 52 runs, 50% failed, 12% of failures match = 6% impact
periodic-ci-openshift-release-master-ci-4.16-e2e-gcp-ovn-upgrade (all) - 80 runs, 20% failed, 25% of failures match = 5% impact
periodic-ci-openshift-release-master-ci-4.17-e2e-gcp-ovn-upgrade (all) - 82 runs, 21% failed, 59% of failures match = 12% impact
periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-azure-ovn-upgrade (all) - 80 runs, 53% failed, 14% of failures match = 8% impact
periodic-ci-openshift-release-master-nightly-4.16-e2e-aws-sdn-upgrade (all) - 50 runs, 12% failed, 50% of failures match = 6% impact
pull-ci-openshift-cluster-monitoring-operator-master-e2e-aws-ovn-upgrade (all) - 14 runs, 21% failed, 33% of failures match = 7% impact
pull-ci-openshift-cluster-network-operator-master-e2e-azure-ovn-upgrade (all) - 11 runs, 36% failed, 75% of failures match = 27% impact
pull-ci-openshift-cluster-version-operator-master-e2e-agnostic-ovn-upgrade-out-of-change (all) - 11 runs, 18% failed, 100% of failures match = 18% impact
pull-ci-openshift-machine-config-operator-master-e2e-aws-ovn-upgrade (all) - 19 runs, 21% failed, 25% of failures match = 5% impact
pull-ci-openshift-machine-config-operator-master-e2e-azure-ovn-upgrade-out-of-change (all) - 21 runs, 48% failed, 50% of failures match = 24% impact
pull-ci-openshift-origin-master-e2e-aws-ovn-single-node-upgrade (all) - 16 runs, 81% failed, 15% of failures match = 13% impact
pull-ci-openshift-origin-master-e2e-aws-ovn-upgrade (all) - 16 runs, 25% failed, 75% of failures match = 19% impact
pull-ci-openshift-origin-master-e2e-gcp-ovn-upgrade (all) - 26 runs, 35% failed, 67% of failures match = 23% impact
shows some flavors like pull-ci-openshift-cluster-network-operator-master-e2e-azure-ovn-upgrade up at a 27% hit rates.
Unclear.
Pull secret changes after the ClusterVersion update cause an unexpected master MachineConfigPool roll.
No MachineConfigPool roll after the ClusterVersion update completes.
Description of problem:
"create serverless function" functionality in the Openshift UI. When you add a (random) repository it shows a warning saying "func.yaml is not present and builder strategy is not s2i" but without any further link or information. That's not a very good UX imo. Could we add a link to explain to the user what that entails?
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
https://redhat-internal.slack.com/archives/CJYKV1YAH/p1706639383940559
Please review the following PR: https://github.com/openshift/route-override-cni/pull/54
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/machine-api-provider-ibmcloud/pull/32
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
If OLMPlacement is set to management and the cluster is up with disableAllDefaultSources set to true, removing it from the HostedCluster CR has no effect: in the guest cluster, disableAllDefaultSources isn't removed and is still set to true.
Version-Release number of selected component (if applicable):
How reproducible:
always
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
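A hedged sketch of how this could be exercised from the CLI (the HostedCluster field path and namespace are assumptions based on the description, not confirmed by this report):
```
# On the management cluster: drop the OperatorHub override from the HostedCluster
# (the JSON-patch path is an assumption; adjust to the actual CR layout).
oc patch hostedcluster/"$HC_NAME" -n clusters --type=json \
  -p '[{"op":"remove","path":"/spec/configuration/operatorHub"}]'

# On the guest cluster: check whether the flag was actually cleared.
oc get operatorhub cluster -o jsonpath='{.spec.disableAllDefaultSources}{"\n"}'
```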
Description of problem:
'Oh no! Something went wrong' is shown on the Image Manifest Vulnerability page after creating an IMV via the CLI
Version-Release number of selected component (if applicable):
4.15.0-0.nightly-2024-02-03-192446
How reproducible:
Always
Steps to Reproduce:
1. Install the 'Red Hat Quay Container Security Operator' 2. Use the command line to create the IMV:
$ oc create -f imv.yaml
imagemanifestvuln.secscan.quay.redhat.com/example created
$ cat imv.yaml
apiVersion: secscan.quay.redhat.com/v1alpha1
kind: ImageManifestVuln
metadata:
  name: example
  namespace: openshift-operators
spec: {}
3. Navigate to page /k8s/ns/openshift-operators/operators.coreos.com~v1alpha1~ClusterServiceVersion/container-security-operator.v3.10.3/secscan.quay.redhat.com~v1alpha1~ImageManifestVuln
Actual results:
Oh no! Something went wrong. will be shown Description:Cannot read properties of undefined (reading 'replace')Component trace:Copy to clipboardat T (https://console-openshift-console.apps.qe-uidaily-0205.qe.devcluster.openshift.com/static/container-security-chunk-c75b48f176a6a5981ee2.min.js:1:3465) at https://console-openshift-console.apps.qe-uidaily-0205.qe.devcluster.openshift.com/static/main-chunk-d635f27942d1b5fdaad6.min.js:1:631947 at tr at x (https://console-openshift-console.apps.qe-uidaily-0205.qe.devcluster.openshift.com/static/main-chunk-d635f27942d1b5fdaad6.min.js:1:630876) at t (https://console-openshift-console.apps.qe-uidaily-0205.qe.devcluster.openshift.com/static/vendors~main-chunk-e839d85039c974dbb9bb.min.js:82:73479) at tbody at table at g (https://console-openshift-console.apps.qe-uidaily-0205.qe.devcluster.openshift.com/static/vendor-patternfly-5~main-chunk-69dd6dd4312fdf07fedf.min.js:6:199268) at l (https://console-openshift-console.apps.qe-uidaily-0205.qe.devcluster.openshift.com/static/vendor-patternfly-5~main-chunk-69dd6dd4312fdf07fedf.min.js:10:88631) at D (https://console-openshift-console.apps.qe-uidaily-0205.qe.devcluster.openshift.com/static/main-chunk-d635f27942d1b5fdaad6.min.js:1:632038) at div at div at t (https://console-openshift-console.apps.qe-uidaily-0205.qe.devcluster.openshift.com/static/vendors~main-chunk-e839d85039c974dbb9bb.min.js:50:39294) at t (https://console-openshift-console.apps.qe-uidaily-0205.qe.devcluster.openshift.com/static/vendors~main-chunk-e839d85039c974dbb9bb.min.js:49:16122) at o (https://console-openshift-console.apps.qe-uidaily-0205.qe.devcluster.openshift.com/static/main-chunk-d635f27942d1b5fdaad6.min.js:1:642088) at div at M (https://console-openshift-console.apps.qe-uidaily-0205.qe.devcluster.openshift.com/static/main-chunk-d635f27942d1b5fdaad6.min.js:1:631697) at div
Expected results:
No error should be shown
Additional info:
Description of problem:
When adding another IP address to br-ex, geneve traffic sent from this node may be sent with the new IP address rather than the one configured for this tunnel. This will cause traffic to be dropped by the destination with the error:
[root@ovn-control-plane openvswitch]# cat ovs-vswitchd.log | grep fc00:f853:ccd:e793::4
2024-04-17T16:47:02.146Z|00012|tunnel(revalidator10)|WARN|receive tunnel port not found (tcp6,tun_id=0xff0003,tun_src=0.0.0.0,tun_dst=0.0.0.0,tun_ipv6_src=fc00:f853:ccd:e793:ffff::1,tun_ipv6_dst=fc00:f853:ccd:e793::3,tun_gbp_id=0,tun_gbp_flags=0,tun_tos=0,tun_ttl=64,tun_erspan_ver=0,gtpu_flags=0,gtpu_msgtype=0,tun_flags=csum|key,in_port=5,vlan_tci=0x0000,dl_src=0a:58:2b:22:eb:86,dl_dst=0a:58:92:3f:71:e5,ipv6_src=fc00:f853:ccd:e793::4,ipv6_dst=fd00:10:244:1::7,ipv6_label=0x630b1,nw_tos=0,nw_ecn=0,nw_ttl=63,nw_frag=no,tp_src=8080,tp_dst=59130,tcp_flags=syn|ack)
This is more likely to occur on ipv6 than ipv4, due to IP address ordering on the NIC and linux rules used to determine source IP to use when sending host originated traffic.
Version-Release number of selected component (if applicable):
All versions
How reproducible:
Always
To work around this with IPv6, set preferred_lft 0 on the address, which will cause it to become deprecated so Linux will choose an alternative. Alternatively, set external_ids:ovn-set-local-ip="true" in Open vSwitch on each node, which will force OVN to use the configured geneve-encap-ip. Related OVN issue: https://issues.redhat.com/browse/FDP-570
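A rough sketch of the two workarounds from a node shell (the address and device name are placeholders taken from the example log above; run on each affected node):
```
# Option 1 (IPv6): deprecate the extra address so it is not picked as a source.
ip addr change fc00:f853:ccd:e793::4/64 dev br-ex preferred_lft 0

# Option 2: tell OVN to always use the configured encap IP as the local tunnel source.
ovs-vsctl set Open_vSwitch . external_ids:ovn-set-local-ip="true"
```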
Description of the problem:
This issue was discovered while trying to verify a bug opened by Javipolo:
MGMT-16966
It seems that the extra-reboot avoidance is not working properly for a 4.15 cluster with a partition on the installation disk.
Here are 3 clusters:
1) test-infra-cluster-7a4cb4cc OCP 4.15 with partition on installation disk
2) test-infra-cluster-066749e2 OCP 4.14 with partition on installation disk
3) test-infra-cluster-f9051a36 OCP 4.15 without partition on installation disk
I see these indications:
1) test-infra-cluster-7a4cb4cc
2/19/2024, 9:19:53 PM Node test-infra-cluster-7a4cb4cc-worker-0 has been rebooted 1 times before completing installation
2/19/2024, 9:19:51 PM Node test-infra-cluster-7a4cb4cc-master-2 has been rebooted 2 times before completing installation
2/19/2024, 9:19:08 PM Node test-infra-cluster-7a4cb4cc-worker-1 has been rebooted 1 times before completing installation
2/19/2024, 8:49:59 PM Node test-infra-cluster-7a4cb4cc-master-1 has been rebooted 2 times before completing installation
2/19/2024, 8:49:55 PM Node test-infra-cluster-7a4cb4cc-master-0 has been rebooted 2 times before completing installation
2) test-infra-cluster-066749e2
2/19/2024, 8:32:36 PM Node test-infra-cluster-066749e2-master-2 has been rebooted 2 times before completing installation
2/19/2024, 8:32:35 PM Node test-infra-cluster-066749e2-worker-1 has been rebooted 2 times before completing installation
2/19/2024, 8:32:31 PM Node test-infra-cluster-066749e2-worker-0 has been rebooted 2 times before completing installation
2/19/2024, 8:05:26 PM Node test-infra-cluster-066749e2-master-0 has been rebooted 2 times before completing installation
2/19/2024, 8:05:25 PM Node test-infra-cluster-066749e2-master-1 has been rebooted 2 times before completing installation
3) test-infra-cluster-f9051a36
2/18/2024, 5:13:49 PM Node test-infra-cluster-f9051a36-worker-1 has been rebooted 1 times before completing installation
2/18/2024, 5:10:10 PM Node test-infra-cluster-f9051a36-worker-0 has been rebooted 1 times before completing installation
2/18/2024, 5:08:46 PM Node test-infra-cluster-f9051a36-worker-2 has been rebooted 1 times before completing installation
2/18/2024, 5:03:12 PM Node test-infra-cluster-f9051a36-master-1 has been rebooted 1 times before completing installation
2/18/2024, 4:33:39 PM Node test-infra-cluster-f9051a36-master-2 has been rebooted 1 times before completing installation
2/18/2024, 4:33:38 PM Node test-infra-cluster-f9051a36-master-0 has been rebooted 1 times before completing installation
According to Ori's analysis, it seems the MCO reboot skip did not happen for the masters. The ignition was not accessible:
Feb 19 18:38:19 test-infra-cluster-7a4cb4cc-master-0 installer[3403]: time="2024-02-19T18:38:19Z" level=warning msg="failed getting encapsulated machine config. Continuing installation without skipping MCO reboot" error="failed after 240 attempts, last error: unexpected end of JSON input"
How reproducible:
Steps to reproduce:
1. Create a cluster with 4.15
2. Add a custom manifest which modifies the ignition to create a partition on the disk
3. Start the installation
Actual results:
It seems that the reboot avoidance did not work properly.
Expected results:
Description of problem:
Adding image configuration for a HyperShift hosted cluster is not working as expected.
Version-Release number of selected component (if applicable):
# oc get clusterversions.config.openshift.io NAME VERSION AVAILABLE PROGRESSING SINCE STATUS version 4.13.0-rc.8 True False 6h46m Cluster version is 4.13.0-rc.8
How reproducible:
Always
Steps to Reproduce:
1. Get hypershift hosted cluster detail from management cluster. # hostedcluster=$( oc get -n clusters hostedclusters -o json | jq -r '.items[].metadata.name') 2. Apply image setting for hypershift hosted cluster. # oc patch hc/$hostedcluster -p '{"spec":{"configuration":{"image":{"registrySources":{"allowedRegistries":["quay.io","registry.redhat.io","image-registry.openshift-image-registry.svc:5000","insecure.com"],"insecureRegistries":["insecure.com"]}}}}}' --type=merge -n clusters hostedcluster.hypershift.openshift.io/85ea85757a5a14355124 patched # oc get HostedCluster $hostedcluster -n clusters -ojson | jq .spec.configuration.image { "registrySources": { "allowedRegistries": [ "quay.io", "registry.redhat.io", "image-registry.openshift-image-registry.svc:5000", "insecure.com" ], "insecureRegistries": [ "insecure.com" ] } } 3. Check Pod or operator restart to apply configuration changes. # oc get pods -l app=kube-apiserver -n clusters-${hostedcluster} NAME READY STATUS RESTARTS AGE kube-apiserver-67b6d4556b-9nk8s 5/5 Running 0 49m kube-apiserver-67b6d4556b-v4fnj 5/5 Running 0 47m kube-apiserver-67b6d4556b-zldpr 5/5 Running 0 51m #oc get pods -l app=kube-apiserver -n clusters-${hostedcluster} -l app=openshift-apiserver NAME READY STATUS RESTARTS AGE openshift-apiserver-7c69d68f45-4xj8c 3/3 Running 0 136m openshift-apiserver-7c69d68f45-dfmk9 3/3 Running 0 135m openshift-apiserver-7c69d68f45-r7dqn 3/3 Running 0 136m 4. Check image.config in hosted cluster. # oc get image.config -o yaml ... spec: allowedRegistriesForImport: [] status: externalRegistryHostnames: - default-route-openshift-image-registry.apps.hypershift-ci-32506.qe.devcluster.openshift.com internalRegistryHostname: image-registry.openshift-image-registry.svc:5000 #oc get node NAME STATUS ROLES AGE VERSION ip-10-0-128-61.us-east-2.compute.internal Ready worker 6h42m v1.26.3+b404935 ip-10-0-130-68.us-east-2.compute.internal Ready worker 6h42m v1.26.3+b404935 ip-10-0-134-89.us-east-2.compute.internal Ready worker 6h42m v1.26.3+b404935 ip-10-0-138-169.us-east-2.compute.internal Ready worker 6h42m v1.26.3+b404935 # oc debug node/ip-10-0-128-61.us-east-2.compute.internal Temporary namespace openshift-debug-mtfcw is created for debugging node... Starting pod/ip-10-0-128-61us-east-2computeinternal-debug-mctvr ... To use host binaries, run `chroot /host` Pod IP: 10.0.128.61 If you don't see a command prompt, try pressing enter. sh-4.4# chroot /host sh-5.1# cat /etc/containers/registries.conf unqualified-search-registries = ["registry.access.redhat.com", "docker.io"] short-name-mode = ""[[registry]] prefix = "" location = "registry-proxy.engineering.redhat.com" [[registry.mirror]] location = "brew.registry.redhat.io" pull-from-mirror = "digest-only"[[registry]] prefix = "" location = "registry.redhat.io" [[registry.mirror]] location = "brew.registry.redhat.io" pull-from-mirror = "digest-only"[[registry]] prefix = "" location = "registry.stage.redhat.io" [[registry.mirror]] location = "brew.registry.redhat.io" pull-from-mirror = "digest-only"
Actual results:
Config changes are not applied in the backend. Neither the operators nor the pods restart.
Expected results:
Configuration should be applied, and the pods and operators should restart after config changes.
Additional info:
Description of problem:
While doing the migration of the Pipeline details page, it expects customData from the Details page - https://github.com/openshift/console/blob/master/frontend/packages/pipelines-plugin/src/components/pipelines/pipeline-metrics/PipelineMetrics.tsx - but the horizontalnav component exposed to dynamic plugins does not have a customData prop. https://github.com/openshift/console/blob/master/frontend/packages/console-dynamic-plugin-sdk/docs/api.md#horizontalnav
Version-Release number of selected component (if applicable):
4.16
How reproducible:
Always
Steps to Reproduce:
1. The Pipeline details page PR needs to be up for testing this [WIP] [Story - https://issues.redhat.com/browse/ODC-7525] 2. Install the Pipelines Operator and don't install Tekton Results 3. Enable the Pipeline details page in the dynamic plugin 4. Create a pipeline and go to the Metrics tab on the details page
Actual results:
Expected results:
Additional info:
Description of problem:
With cloud-credential-operator moving to rhel9 by default, we added rhel8 binaries. However, users currently have no way of downloading them using `oc`
Version-Release number of selected component (if applicable):
4.16
How reproducible:
When attempting to extract ccoctl.rhel8
Steps to Reproduce:
1. oc adm release extract --tools 2. 3.
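A quick sketch of the extraction step (the release image variable and output directory are placeholders, not taken from this report); today the output only contains the single ccoctl tarball:
```
# Extract the CLI tools shipped with a release payload.
oc adm release extract --tools "$RELEASE_IMAGE" --to ./tools

# Expected to eventually list ccoctl.rhel8 and ccoctl.rhel9 tarballs as well.
ls ./tools | grep -i ccoctl
```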
Actual results:
Only contains ccoctl tarball
Expected results:
Should include ccoctl.rhel8 and ccoctl.rhel9 tarballs
Additional info:
ccoctl.rhel8 and ccoctl.rhel9 binaries added in https://issues.redhat.com//browse/OCPBUGS-31290
Description of problem:
With the introduction of Pod Security Admission, the recommended best practice is to enforce the `restricted` admission policy. However, if the user creates a CatalogSource in a namespace running with the `restricted` policy, the CatalogSource Pod fails to be created. This is because when `.spec.grpcPodConfig.securityContextConfig` is NOT SET in the CatalogSource, OLM defaults the value to "legacy", which means the Catalog Pod does NOT set the `restricted` securityContext and therefore fails to run.
Version-Release number of selected component (if applicable):
4.15
How reproducible:
100%
Steps to Reproduce:
1. On a OCP 4.15 cluster, create a custom CatalogSource object without `.spec.grpcPodConfig.securityContextConfig` being specified 2. See if the CatalogSource Pod started successfully without errors.
Actual results:
1. The CatalogSource Pod fails to be created with an error like:
status: message: >- couldn't ensure registry server - error ensuring pod: : error creating new pod: foobar-: pods "foobar-6ttkb" is forbidden: violates PodSecurity "restricted:v1.24": allowPrivilegeEscalation != false (container "registry-server" must set securityContext.allowPrivilegeEscalation=false), unrestricted capabilities (container "registry-server" must set securityContext.capabilities.drop=["ALL"]), runAsNonRoot != true (pod or container "registry-server" must set securityContext.runAsNonRoot=true), seccompProfile (pod or container "registry-server" must set securityContext.seccompProfile.type to "RuntimeDefault" or "Localhost") reason: RegistryServerError
Expected results:
The CatalogSource Pod started successfully by default without specifying `.spec.grpcPodConfig.securityContextConfig` as `restricted`
Additional info:
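For reference, a minimal CatalogSource that explicitly opts into the restricted security context (the catalog name, namespace, and index image below are placeholders, not taken from this report):
```
oc apply -f - <<'EOF'
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: example-catalog          # placeholder name
  namespace: my-restricted-ns    # a namespace enforcing the restricted PSA policy
spec:
  sourceType: grpc
  image: quay.io/example/catalog-index:latest   # placeholder index image
  grpcPodConfig:
    securityContextConfig: restricted
EOF
```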
Description of problem:
The archive tar file size should respect the archiveSize setting when mirroring with the V2 format
Version-Release number of selected component (if applicable):
oc-mirror version WARNING: This version information is deprecated and will be replaced with the output from --short. Use --output=yaml|json to get the full version. Client Version: version.Info{Major:"", Minor:"", GitVersion:"4.16.0-202403070215.p0.gc4f8295.assembly.stream.el9-c4f8295", GitCommit:"c4f829512107f7d0f52a057cd429de2030b9b3b3", GitTreeState:"clean", BuildDate:"2024-03-07T03:46:24Z", GoVersion:"go1.21.7 (Red Hat 1.21.7-1.el9) X:strictfipsruntime", Compiler:"gc", Platform:"linux/amd64"}
How reproducible:
always
Steps to Reproduce:
1) With the following ImageSetConfiguration:
cat config.yaml
kind: ImageSetConfiguration
apiVersion: mirror.openshift.io/v1alpha2
archiveSize: 8
storageConfig:
  local:
    path: /app1/ocmirror/offline
mirror:
  platform:
    channels:
    - name: stable-4.12
      type: ocp
      minVersion: '4.12.46'
      maxVersion: '4.12.46'
      shortestPath: true
    graph: true
  operators:
  - catalog: registry.redhat.io/redhat/redhat-operator-index:v4.15
    packages:
    - name: advanced-cluster-management
      channels:
      - name: release-2.9
    - name: compliance-operator
      channels:
      - name: stable
    - name: multicluster-engine
      channels:
      - name: stable-2.4
      - name: stable-2.5
  additionalImages:
  - name: registry.redhat.io/ubi8/ubi:latest
  - name: registry.redhat.io/rhel8/support-tools:latest
  - name: registry.access.redhat.com/ubi8/nginx-120:latest
  - name: registry.k8s.io/sig-storage/csi-node-driver-registrar:v2.8.0
  - name: registry.k8s.io/sig-storage/csi-resizer:v1.8.0
  - name: quay.io/openshifttest/hello-openshift@sha256:4200f438cf2e9446f6bcff9d67ceea1f69ed07a2f83363b7fb52529f7ddd8a83
  - name: quay.io/openshifttest/hello-openshift@sha256:61b8f5e1a3b5dbd9e2c35fd448dc5106337d7a299873dd3a6f0cd8d4891ecc27
2) Run `oc-mirror --config config.yaml file://out --v2`
Actual results:
2) The archive size is still 49G, not following the setting in the ImageSetConfiguration.
ll out/ -h
total 49G
-rw-r--r--. 1 root root 49G Mar 20 09:03 mirror_000001.tar
drwxr-xr-x. 11 root root 4.0K Mar 20 08:54 working-dir
Expected results:
Multiple tar files should be generated, each respecting the 8G archiveSize limit
This is a clone of issue OCPBUGS-37345. The following is the description of the original issue:
—
OTA-941 landed a rollback guard in 4.14 that blocked all rollbacks. OCPBUGS-24535 drilled a hole in that guard to allow limited rollbacks to the previous release the cluster had been aiming at, as long as that previous release was part of the same 4.y z stream. We decided to block that hole back up in OCPBUGS-35994. And now folks want the hole re-opened in this bug. We also want to bring back the oc adm upgrade rollback ... subcommand. Hopefully this new plan sticks.
Folks want the guard-hole and rollback subcommand restored for 4.16 and 4.17.
Every time.
Try to perform the rollbacks that OCPBUGS-24535 allowed.
They stop working, with reasonable ClusterVersion conditions explaining that even those rollback requests will not be accepted.
They work, as verified in OCPBUGS-24535.
This is a clone of issue OCPBUGS-41328. The following is the description of the original issue:
—
Description of problem:
Rotating the root certificates (root CA) requires multiple certificates during the rotation process to prevent downtime as the server and client certificates are updated in the control and data planes. Currently, the HostedClusterConfigOperator uses the cluster-signer-ca from the control plane to create a kubelet-serving-ca on the data plane. The cluster-signer-ca contains only a single certificate that is used for signing certificates for the kube-controller-manager. During a rotation, the kubelet-serving-ca is updated with the new CA, which triggers the metrics-server pod to restart and use the new CA. This leads to an error in the metrics-server where it cannot scrape metrics because the kubelet has not yet picked up the new certificate.
E0808 16:57:09.829746 1 scraper.go:149] "Failed to scrape node" err="Get \"https://10.240.0.29:10250/metrics/resource\": tls: failed to verify certificate: x509: certificate signed by unknown authority" node="pres-cqogb7a10b7up68kvlvg-rkcpsms0805-default-00000130"
rkc@rmac ~> kubectl get pods -n openshift-monitoring
NAME READY STATUS RESTARTS AGE
metrics-server-594cd99645-g8bj7 0/1 Running 0 2d20h
metrics-server-594cd99645-jmjhj 1/1 Running 0 46h
The HostedClusterConfigOperator should likely be using the KubeletClientCABundle from the control plane for the kubelet-serving-ca in the data plane. This CA bundle contains both the new and old CA so that all data plane components can remain up during the rotation process.
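A rough way to observe the single-certificate problem from the data plane (a diagnostic sketch; the configmap name, namespace, and data key are assumptions based on the description above):
```
# Count certificates in the kubelet-serving CA bundle the guest cluster exposes;
# during a rotation this would need to contain both the old and the new CA.
oc get configmap kubelet-serving-ca -n openshift-config-managed \
  -o jsonpath='{.data.ca-bundle\.crt}' | grep -c 'BEGIN CERTIFICATE'
```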
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
This is a clone of issue OCPBUGS-20061. The following is the description of the original issue:
—
Possibly reviving OCPBUGS-10771, the control-plane-machine-set ClusterOperator occasionally goes Available=False with reason=UnavailableReplicas. For example, this run includes:
: [bz-Cloud Compute] clusteroperator/control-plane-machine-set should not change condition/Available 1h34m30s
{ 3 unexpected clusteroperator state transitions during e2e test run. These did not match any known exceptions, so they cause this test-case to fail:
Oct 03 22:03:29.822 - 106s E clusteroperator/control-plane-machine-set condition/Available reason/UnavailableReplicas status/False Missing 1 available replica(s)
Oct 03 22:08:34.162 - 98s E clusteroperator/control-plane-machine-set condition/Available reason/UnavailableReplicas status/False Missing 1 available replica(s)
Oct 03 22:13:01.645 - 118s E clusteroperator/control-plane-machine-set condition/Available reason/UnavailableReplicas status/False Missing 1 available replica(s)}
But those are just the nodes rebooting into newer RHCOS and do not warrant immediate admin intervention. Teaching the CPMS operator to stay Available=True for this kind of brief hiccup, while still going Available=False for issues where at least part of the component is non-functional and the condition requires immediate administrator intervention, would make it easier for admins and SREs operating clusters to identify when intervention is required.
4.15. Possibly all supported versions of the CPMS operator have this exposure.
Looks like many (all?) 4.15 update jobs have near 100% reproducibility for some kind of issue with CPMS going Available=False; see Actual results below. These are likely for reasons that do not require admin intervention, although figuring that out is tricky today; feel free to push back if you feel that some of these do warrant immediate admin intervention.
w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=48h&type=junit&search=clusteroperator/control-plane-machine-set+should+not+change+condition/Available' | grep '^periodic-.*4[.]15.*failures match' | sort
periodic-ci-openshift-cluster-etcd-operator-release-4.15-periodics-e2e-aws-etcd-recovery (all) - 2 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-multiarch-master-nightly-4.15-ocp-e2e-aws-ovn-heterogeneous-upgrade (all) - 19 runs, 42% failed, 225% of failures match = 95% impact
periodic-ci-openshift-multiarch-master-nightly-4.15-ocp-e2e-upgrade-aws-ovn-arm64 (all) - 18 runs, 61% failed, 127% of failures match = 78% impact
periodic-ci-openshift-multiarch-master-nightly-4.15-upgrade-from-nightly-4.14-ocp-e2e-aws-sdn-arm64 (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-multiarch-master-nightly-4.15-upgrade-from-stable-4.14-ocp-e2e-aws-ovn-heterogeneous-upgrade (all) - 19 runs, 47% failed, 200% of failures match = 95% impact
periodic-ci-openshift-multiarch-master-nightly-4.15-upgrade-from-stable-4.14-ocp-e2e-aws-sdn-arm64 (all) - 9 runs, 78% failed, 114% of failures match = 89% impact
periodic-ci-openshift-release-master-ci-4.15-e2e-aws-ovn-upgrade (all) - 11 runs, 64% failed, 143% of failures match = 91% impact
periodic-ci-openshift-release-master-ci-4.15-e2e-azure-ovn-upgrade (all) - 70 runs, 41% failed, 207% of failures match = 86% impact
periodic-ci-openshift-release-master-ci-4.15-e2e-azure-sdn-upgrade (all) - 7 runs, 43% failed, 200% of failures match = 86% impact
periodic-ci-openshift-release-master-ci-4.15-e2e-gcp-ovn (all) - 6 runs, 50% failed, 33% of failures match = 17% impact
periodic-ci-openshift-release-master-ci-4.15-e2e-gcp-ovn-upgrade (all) - 71 runs, 24% failed, 382% of failures match = 92% impact
periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-aws-ovn-upgrade (all) - 70 runs, 30% failed, 281% of failures match = 84% impact
periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-aws-sdn-upgrade (all) - 8 runs, 50% failed, 175% of failures match = 88% impact
periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-azure-sdn-upgrade (all) - 71 runs, 38% failed, 233% of failures match = 89% impact
periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-gcp-ovn-rt-upgrade (all) - 69 runs, 49% failed, 171% of failures match = 84% impact
periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-gcp-ovn-upgrade (all) - 7 runs, 57% failed, 175% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-gcp-sdn-upgrade (all) - 6 runs, 33% failed, 250% of failures match = 83% impact
periodic-ci-openshift-release-master-nightly-4.15-e2e-aws-sdn-upgrade (all) - 63 runs, 37% failed, 222% of failures match = 81% impact
periodic-ci-openshift-release-master-nightly-4.15-e2e-gcp-sdn-upgrade (all) - 6 runs, 33% failed, 250% of failures match = 83% impact
periodic-ci-openshift-release-master-nightly-4.15-upgrade-from-stable-4.14-e2e-aws-sdn-upgrade (all) - 7 runs, 43% failed, 233% of failures match = 100% impact
periodic-ci-openshift-release-master-okd-4.15-e2e-aws-ovn-upgrade (all) - 13 runs, 54% failed, 100% of failures match = 54% impact
periodic-ci-openshift-release-master-okd-scos-4.15-e2e-aws-ovn-upgrade (all) - 16 runs, 63% failed, 90% of failures match = 56% impact
CPMS goes Available=False if and only if immediate admin intervention is appropriate.
Tracker issue for bootimage bump in 4.16. This issue should block issues which need a bootimage bump to fix.
The previous bump was OCPBUGS-29384.
Description of problem:
PublicAndPrivate and Private clusters fail to provision due to missing IngressController RBAC in control plane operator. This RBAC was recently removed from the HyperShift operator.
Version-Release number of selected component (if applicable):
4.14.z
How reproducible:
Always
Steps to Reproduce:
1. Install hypershift operator from main 2. Create an AWS PublicAndPrivate cluster using a 4.14.z release
Actual results:
The cluster never provisions because the cpo is stuck
Expected results:
The cluster provisions successfully
Additional info:
With the change from PatternFly's `PageHeader` to `Masthead`, there is no longer a max-height of 60px restricting the size of the masthead logo. As a result, logos that are larger than 60px high display at their native size and cause the masthead to get taller (see https://drive.google.com/file/d/11enMtMU1cfzXQqRfd0eTdsKFkBVPWoFc/view?usp=sharing). This went unnoticed in the change because the OpenShift and OKD logos are sized appropriately for the masthead and do not need the restriction. Further, the docs state a custom logo "is constrained to a max-width of 200px and a max-height of 68px.", which is a separate bug that needs to be addressed (it should read "is constrained to a max-height of 60px").
Description of problem:
On February 27th endpoints were turned off that were being queried for account details. The check is not vital so we are fine with removing it, however it is currently blocking all Power VS installs.
Version-Release number of selected component (if applicable):
4.13.0 - 4.16.0
How reproducible:
Easily
Steps to Reproduce:
1. Try to deploy with Power VS 2. Fail at the platform credentials check
Actual results:
Check fails
Expected results:
Check should succeed
Additional info:
This is a clone of issue OCPBUGS-35347. The following is the description of the original issue:
—
Description of problem:
OCP/RHCOS system daemons like ovs-vswitchd (revalidator process) use the same vCPUs (from the isolated vCPU pool) that are already reserved by CPU Manager for CNF workloads, causing intermittent performance issues for CNF workloads (and also vCPU-level overload). Note: NCP 23.11 uses CPU Manager with static policy and Topology Manager set to "single-numa-node". Also, specific isolated and reserved vCPU pools have been defined.
Version-Release number of selected component (if applicable):
4.14.22
How reproducible:
Intermittent at customer environment.
Steps to Reproduce:
1. 2. 3.
Actual results:
ovs-vswitchd is using isolated CPUs
Expected results:
ovs-vswitchd to use only reserved CPUs
Additional info:
We want to understand if the customer is hitting the bug: https://issues.redhat.com/browse/OCPBUGS-32407 This bug was fixed in 4.14.25. The customer cluster is 4.14.22. The customer is also asking if it is possible to get a private fix since they cannot update at the moment. All case files have been yanked at both US and EU instances of Supportshell. In case case updates or attachments are not accessible, please let me know.
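A simple diagnostic sketch to observe the behavior on an affected node (run from a debug shell on the node; this is not part of the customer's data, just a suggested check):
```
# CPU affinity mask of the ovs-vswitchd process; compare it against the
# reserved CPU set from the performance profile.
taskset -cp "$(pidof ovs-vswitchd)"

# Per-thread view (including revalidator threads) with the CPU each thread last ran on.
ps -T -o spid,psr,comm -p "$(pidof ovs-vswitchd)"
```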
Description of problem:
The sha256 sum for "https://mirror.openshift.com/pub/openshift-v4/clients/ocp/4.14.9/openshift-install-mac-arm64-4.14.9.tar.gz" does not match what it should be:
# sha256sum openshift-install-mac-arm64-4.14.9.tar.gz
61cccc282f39456b7db730a0625d0a04cd6c1c2ac0f945c4c15724e4e522a073 openshift-install-mac-arm64-4.14.9.tar.gz
This does not match what is posted here: https://mirror.openshift.com/pub/openshift-v4/clients/ocp/4.14.9/sha256sum.txt
It should be:
c765c90a32b8a43bc62f2ba8bd59dc8e620b972bcc2a2e217c36ce139d517e29 openshift-install-mac-arm64-4.14.9.tar.gz
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
This comes from this bug https://issues.redhat.com/browse/OCPBUGS-29940
After applying the workaround suggested in [1][2] with "oc adm must-gather --node-name", we found another issue where must-gather creates the debug pod on all master nodes and gets stuck for a while because of the loop in the gather_network_logs_basics script. Filtering out the NotReady nodes would allow us to apply the workaround.
The script gather_network_logs_basics gets the master nodes by label (node-role.kubernetes.io/master) and saves them in the CLUSTER_NODES variable. It then passes this as a parameter to the function gather_multus_logs $CLUSTER_NODES, where it loops through the list of master nodes and performs debugging for each node.
collection-scripts/gather_network_logs_basics
...
CLUSTER_NODES="${@:-$(oc get node -l node-role.kubernetes.io/master -oname)}"
/usr/bin/gather_multus_logs $CLUSTER_NODES
...
collection-scripts/gather_multus_logs
...
function gather_multus_logs {
  for NODE in "$@"; do
    nodefilename=$(echo "$NODE" | sed -e 's|node/||')
    out=$(oc debug "${NODE}" -- \
      /bin/bash -c "cat $INPUT_LOG_PATH" 2>/dev/null) && echo "$out" 1> "${OUTPUT_LOG_PATH}/multus-log-$nodefilename.log"
  done
}
This could be resolved with something similar to this:
CLUSTER_NODES="${@:-$(oc get node -l node-role.kubernetes.io/master -o json | jq -r '.items[] | select(.status.conditions[] | select(.type=="Ready" and .status=="True")).metadata.name')}"
/usr/bin/gather_multus_logs $CLUSTER_NODES
[1] - https://access.redhat.com/solutions/6962230
[2] - https://issues.redhat.com/browse/OCPBUGS-29940
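Until the script is fixed, a hedged sketch of how to confirm which master nodes are Ready before applying the --node-name workaround (the node name in the last command is a placeholder):
# List master nodes together with their Ready condition
oc get nodes -l node-role.kubernetes.io/master \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="Ready")].status}{"\n"}{end}'
# Then target a Ready node explicitly
oc adm must-gather --node-name=<ready-master-node>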
In looking at a component readiness test page we see some failures that take a long time to load:
https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.15-e2e-metal-ipi-ovn-dualstack/1758641985364692992 (I noticed that this one resulted in messages asking me to restart chrome)
https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.16-e2e-metal-ipi-ovn-dualstack/1767279909555671040
https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.16-e2e-metal-ipi-ovn-dualstack/1766663255406678016
https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.16-e2e-metal-ipi-ovn-dualstack/1765279833048223744
We'd like to understand why it takes a long time to load these jobs and possibly take some action to remediate as much of that slowness as possible.
Slow-loading prow jobs will make our TRT tools seem unusable and might make it difficult for managers to inspect Component Readiness failures, which would slow down getting them resolved.
Some idea of what to look at:
Enable websockets (https://github.com/kubernetes/enhancements/issues/4006) in 4.16 so https://issues.redhat.com/browse/OCPBUGS-20515 can test whether websockets allow idle connections to time out.
Description of problem:
The ingress cluster capability has been introduced in OCP 4.16 (https://github.com/openshift/enhancements/pull/1415). It includes the cluster ingress operator and all its controllers. If the ingress capability is disabled all the routes of the cluster become unavailable (no router to back them up). The console operator heavily depends on the working (admitted/active) routes to do the health checks, configure the authentication flows, client downloads, etc. The console operator goes degraded if the routes are not served by a router. The console operator needs to be able to tolerate the absence of the ingress capability.
Version-Release number of selected component (if applicable):
4.16
How reproducible:
Always
Steps to Reproduce:
1. Create ROSA HCP cluster. 2. Scale the default ingresscontroller to 0: oc -n openshift-ingress-operator patch ingresscontroller default --type='json' -p='[{"op": "replace", "path": "/spec/replicas", "value":0}]' 3. Check the status of console cluster operator: oc get co console
Actual results:
$ oc get co console NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE console 4.16.0 False False False 53s RouteHealthAvailable: failed to GET route (https://console-openshift-console.apps.49e4812b7122bc833b72.hypershift.aws-2.ci.openshift.org): Get "https://console-openshift-console.apps.49e4812b7122bc833b72.hypershift.aws-2.ci.openshift.org": EOF
Expected results:
$ oc get co console NAME VERSION AVAILABLE PROGRESSING DEGRADED console 4.16.0 True False False
Additional info:
The ingress capability cannot be disabled on standalone OpenShift (when the payload is managed by the ClusterVersionOperator). Only clusters managed by HyperShift with a HostedControlPlane are impacted.
Description of problem:
When the cluster is in upgrade, scroll sidebar to bottom on cluster settings page, there is blank space at the bottom.
Version-Release number of selected component (if applicable):
upgrade 4.14.0-0.nightly-2023-09-09-164123 to 4.14.0-0.nightly-2023-09-10-184037
How reproducible:
Always
Steps to Reproduce:
1. Launch a 4.14 cluster and trigger an upgrade. 2. Go to the "Cluster Settings" -> "Details" page during the upgrade and scroll the right sidebar down to the bottom. 3.
Actual results:
2. It's blank at the bottom
Expected results:
2. Should not show blank.
Additional info:
screenshot: https://drive.google.com/drive/folders/1DenrQTX7K0chbs9hG9ZbSZyY-viRRy1k?ths=true
https://drive.google.com/drive/folders/10dgToTxZf7gOfmL2Mp5gAMVQnM06XvAf?usp=sharing
Please review the following PR: https://github.com/openshift/cluster-cloud-controller-manager-operator/pull/308
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/prometheus-operator/pull/270
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
There is no kubernetes service associated with the kube-scheduler, so it does not require a readiness probe.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
# In the control plane:
kubectl get services | grep scheduler
kubectl get deploy kube-scheduler | grep readiness
Actual results:
Probe exists, but no service
Expected results:
No probe or service
Additional info:
Please review the following PR: https://github.com/openshift/ovn-kubernetes/pull/1979
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/cloud-provider-azure/pull/99
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/csi-external-snapshotter/pull/136
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
This is a clone of issue OCPBUGS-35300. The following is the description of the original issue:
—
Description of problem:
ARO cluster fails to install with disconnected networking. We see master nodes bootup hang on the service machine-config-daemon-pull.service. Logs from the service indicate it cannot reach the public IP of the image registry. In ARO, image registries need to go via a proxy. Dnsmasq is used to inject proxy DNS answers, but machine-config-daemon-pull is starting before ARO's dnsmasq.service starts.
Version-Release number of selected component (if applicable):
4.14.16
How reproducible:
Always
Steps to Reproduce:
For Fresh Install: 1. Create the required ARO vnet and subnets 2. Attach a route table to the subnets with a blackhole route 0.0.0.0/0 3. Create 4.14 ARO cluster with --apiserver-visibility=Private --ingress-visibility=Private --outbound-type=UserDefinedRouting [OR] Post Upgrade to 4.14: 1. Create a ARO 4.13 UDR. 2. ClusterUpgrade the cluster 4.13-> 4.14 , upgrade was successful 3. Create a new node (scale up), we run into the same issue.
Actual results:
For Fresh Install of 4.14: ERROR: (InternalServerError) Deployment failed. [OR] Post Upgrade to 4.14: Node doesn't come into a Ready State and Machine is stuck in Provisioned status.
Expected results:
Succeeded
Additional info:
We see in the node logs that machine-config-daemon-pull.service is unable to reach the image registry. ARO's dnsmasq was not yet started.
Previously, systemd ordering was set for ovs-configuration.service to start after (ARO's) dnsmasq.service. Perhaps that should have gone on machine-config-daemon-pull.service.
See https://issues.redhat.com/browse/OCPBUGS-25406.
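A hedged sketch of the kind of systemd drop-in ordering the description is pointing at (the drop-in file name is an assumption, and on ARO this would normally be delivered through its own configuration mechanism rather than created by hand):
# Make machine-config-daemon-pull.service wait for ARO's dnsmasq.service
mkdir -p /etc/systemd/system/machine-config-daemon-pull.service.d
cat > /etc/systemd/system/machine-config-daemon-pull.service.d/10-after-dnsmasq.conf <<'EOF'
[Unit]
After=dnsmasq.service
Wants=dnsmasq.service
EOF
systemctl daemon-reload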
Please review the following PR: https://github.com/openshift/kubernetes-kube-storage-version-migrator/pull/202
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/image-customization-controller/pull/122
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
So that the OKD build has the correct config by default.
Looks like we're accidentally passing the JavaScript `window.alert()` method instead of the Prometheus alert object.
Rukpak – an alpha tech preview API – has pushed a breaking change upstream. This bug tracks the need for us to disable and then reenable the cluster-olm-operator and platform-operators components which both depend on rukpak in order to push the breaking API change. This bug can be closed once those components are all updated and available on the cluster again.
Images to include in the payload:
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1.
2.
3.
Actual results:
Expected results:
Additional info:
Please fill in the following template while reporting a bug and provide as much relevant information as possible. Doing so will give us the best chance to find a prompt resolution.
Affected Platforms:
Is it an
If it is an internal RedHat testing failure:
If it is a CI failure:
If it is a customer / SD issue:
This is a clone of issue OCPBUGS-39467. The following is the description of the original issue:
—
Description of problem:
Hi team, the customer is performing RHOCP IPI testing for H/W certification and is referencing this document: https://docs.openshift.com/container-platform/4.15/installing/installing_bare_metal_ipi/ipi-install-installation-workflow.html#bmc-addressing_ipi-install-installation-workflow The issue occurred during the IPI installation. The Redfish ISO mount attempt shows that only HTTP is used, as per the error message below, while the official document states that both HTTP and HTTPS are supported parameter types for TransferProtocolTypes.
~~~
VirtualMedia.InsertMedia BEF (/redfish/v1/Managers/Self/VirtualMedia/CD1/Actions/VirtualMedia.InsertMedia)
===============================================
{"error":"@Message.ExtendedInfo": "@odata.type":"#Message.v1_0_8.Message", "Message":"The value 'HTTP' for the property TransferProtocolType is not in the list of acceptable values.", "MessageArgs": ["HTTP", "TransferProtocolType"], "MessageId":"Base.1.12.PropertyValueNotInList", "RelatedProperties":["/TransferProtocolType"], "Resolution": "Choose a value from the enumeration list that the implementation can support and resubmit the request if the operation failed.", "Severity": "Warning"},"code":"Base.1.12.PropertyValueNotInList", "message":"The value 'HTTP' for the property TransferProtocolType is not in the list of acceptable values."}
~~~
Could you please confirm if we currently support HTTPS?
***Business impact: We have business visibility on the current telco project, and the customer needs to pass the IPI testing for H/W certification. The problem is that the customer's BMC currently only supports HTTPS mounting, per the AMI code-base requirement. ACM/ZTP-based installation is planned on a fleet of 1000s of these servers, so support for HTTPS would be great. Please help check the HTTPS support plan. Any recommendation would be appreciated!
Version-Release number of selected component (if applicable):
How reproducible:
Follow the document steps. https://docs.openshift.com/container-platform/4.15/installing/installing_bare_metal_ipi/ipi-install-overview.html
Steps to Reproduce:
The installation steps we follow are baed on Overview - Deploying installer-provisioned clusters on bare metal | Installing | OpenShift Container Platform 4.14 The DNS and DHCP are setup for provisioning but not disconnected registry. We failed at the following command:./openshift-baremetal-install --dir ~/clusterconfigs --log-level debug create cluster The console log : ~~~ ERROR Error: could not inspect: inspect failed , last error was 'Failed to inspect hardware. Reason: unable to start inspection: ('All virtual media mount attempts failed. Most recent error: ', ('Inserting virtual media into %(boot_device)s failed for node %(node)s, moving to next virtual media device, if available', {'node': '861f2cf6-3638-43c3-aa51-f1a2dee43c93', 'boot_device': <VirtualMediaType.CD: 'CD'>}))' ERROR ERROR with ironic_node_v1.openshift-master-host[0], ERROR on main.tf line 13, in resource "ironic_node_v1" "openshift-master-host": ERROR 13: resource "ironic_node_v1" "openshift-master-host" { ERROR ERROR failed to fetch Cluster: failed to generate asset "Cluster": failure applying terraform for "masters" stage: failed to create cluster: failed to apply Terraform: exit status 1 ~~~ Part of the ironic.service log in bootstrap node: Jun 28 03:33:44 localhost.localdomain ironic[7091]: 2024-06-28 03:33:44.019 1 DEBUG sushy.exceptions [None req-922eeb65-bb47-44e5-9aed-0a86779b77b6 - - - - - -] HTTP response for POST https://10.102.13.230:443/redfish/v1/Managers/Self/VirtualMedia/CD1/Actions/VirtualMedia.InsertMedia: status code: 400, error: Base.1.5.PropertyValueNotInList: The value HTTP for the property TransferProtocolType is not in the list of acceptable values., extended: [{'@odata.type': '#Message.v1_0_8.Message', 'Message': 'The value HTTP for the property TransferProtocolType is not in the list of acceptable values.', 'MessageArgs': ['HTTP', 'TransferProtocolType'], 'MessageId': 'Base.1.5.PropertyValueNotInList', 'RelatedProperties': ['#/TransferProtocolType'], 'Resolution': 'Choose a value from the enumeration list that the implementation can support and resubmit the request if the operation failed.', 'Severity': 'Warning'}] __init__ /usr/lib/python3.9/site-packages/sushy/exceptions.py:122 Jun 28 03:33:44 localhost.localdomain ironic[7091]: 2024-06-28 03:33:44.020 1 WARNING ironic.drivers.modules.redfish.boot [None req-922eeb65-bb47-44e5-9aed-0a86779b77b6 - - - - - -] ('Inserting virtual media into %(boot_device)s failed for node %(node)s, moving to next virtual media device, if available', {'node': '861f2cf6-3638-43c3-aa51-f1a2dee43c93', 'boot_device': <VirtualMediaType.CD: 'CD'>}): sushy.exceptions.BadRequestError: HTTP POST https://10.102.13.230:443/redfish/v1/Managers/Self/VirtualMedia/CD1/Actions/VirtualMedia.InsertMedia returned code 400. Base.1.5.PropertyValueNotInList: The value HTTP for the property TransferProtocolType is not in the list of acceptable values. 
Extended information: [{'@odata.type': '#Message.v1_0_8.Message', 'Message': 'The value HTTP for the property TransferProtocolType is not in the list of acceptable values.', 'MessageArgs': ['HTTP', 'TransferProtocolType'], 'MessageId': 'Base.1.5.PropertyValueNotInList', 'RelatedProperties': ['#/TransferProtocolType'], 'Resolution': 'Choose a value from the enumeration list that the implementation can support and resubmit the request if the operation failed.', 'Severity': 'Warning'}] Jun 28 03:33:44 localhost.localdomain ironic[7091]: 2024-06-28 03:33:44.024 1 ERROR ironic.drivers.modules.inspector.interface [None req-922eeb65-bb47-44e5-9aed-0a86779b77b6 - - - - - -] Unable to start managed inspection for node 861f2cf6-3638-43c3-aa51-f1a2dee43c93: ('All virtual media mount attempts failed. Most recent error: ', ('Inserting virtual media into %(boot_device)s failed for node %(node)s, moving to next virtual media device, if available', {'node': '861f2cf6-3638-43c3-aa51-f1a2dee43c93', 'boot_device': <VirtualMediaType.CD: 'CD'>})): ironic.common.exception.InvalidParameterValue: ('All virtual media mount attempts failed. Most recent error: ', ('Inserting virtual media into %(boot_device)s failed for node %(node)s, moving to next virtual media device, if available', {'node': '861f2cf6-3638-43c3-aa51-f1a2dee43c93', 'boot_device': <VirtualMediaType.CD: 'CD'>}))
Actual results:
HTTPS is not supported.
Expected results:
Per the doc mentioned, HTTPS should be supported.
Additional info:
Also raised the bug ticket for document check: https://issues.redhat.com/browse/OCPBUGS-36280
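A hedged way to probe which TransferProtocolType values a given BMC accepts for virtual media, independent of the installer (the BMC address, credentials, and image URL are placeholders; the Redfish paths follow the logs above and may differ per vendor):
# Attempt an HTTPS-based virtual media insert directly against the BMC
curl -k -u <user>:<password> -X POST \
  https://<bmc-address>/redfish/v1/Managers/Self/VirtualMedia/CD1/Actions/VirtualMedia.InsertMedia \
  -H 'Content-Type: application/json' \
  -d '{"Image": "https://<web-server>/discovery.iso", "TransferProtocolType": "HTTPS"}'
# Repeat with "HTTP" to compare; a PropertyValueNotInList error indicates the value is rejected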
Use the new CRD when available, as the current one is being deprecated
Description of the problem:
BMH and Machine resources not created for ZTP day-2 control-plane nodes
How reproducible:
100%
Steps to reproduce:
1. Use ZTP to add control-plane nodes to an existing baremetal spoke cluster that was installed using ZTP
Actual results:
CSRs are not being approved automatically because Machine and BMH resources are not being created due to this condition, which excludes control-plane nodes. This condition appears to be outdated and no longer relevant, as it was written before adding day-2 control-plane nodes was supported.
Expected results:
Machine and BMH resources are being created and as a result CSRs are being approved automatically
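As a hedged interim workaround while the Machine/BMH resources are missing, the pending CSRs can be approved by hand (review each CSR before approving; this is not a substitute for the automatic flow):
# List pending CSRs for the joining control-plane node
oc get csr | grep Pending
# Approve any CSRs that have not yet been handled
oc get csr -o go-template='{{range .items}}{{if not .status}}{{.metadata.name}}{{"\n"}}{{end}}{{end}}' \
  | xargs --no-run-if-empty oc adm certificate approve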
Description of problem:
New deployment of BM IPI using a provisioning network with IPv6 is showing an "http://XXXX:XXXX:XXXX:XXXX::X:6180/images/ironic-python-agent.kernel.... connection timed out (http://ipxe.org/4c0a6092)" error
Version-Release number of selected component (if applicable):
Openshift 4.12.32 Also seen in Openshift 4.14.0-rc.5 when adding new nodes
How reproducible:
Very frequent
Steps to Reproduce:
1. Deploy cluster using BM with provided config 2. 3.
Actual results:
Consistent failures depending of the version of OCP used to deploy
Expected results:
No error, successful deployment
Additional info:
Things checked while the bootstrap host is active and the installation information is still valid (and failing): - tried downloading the "ironic-python-agent.kernel" file from different places (bootstrap, bastion hosts, another provisioned host) and in all cases it worked: [core@control-1-ru2 ~]$ curl -6 -v -o ironic-python-agent.kernel http://[XXXX:XXXX:XXXX:XXXX::X]:80/images/ironic-python-agent.kernel \* Trying XXXX:XXXX:XXXX:XXXX::X... \* TCP_NODELAY set % Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed 0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0* Connected to XXXX:XXXX:XXXX:XXXX::X (xxxx:xxxx:xxxx:xxxx::x) port 80 #0) > GET /images/ironic-python-agent.kernel HTTP/1.1 > Host: [xxxx:xxxx:xxxx:xxxx::x] > User-Agent: curl/7.61.1 > Accept: */* > < HTTP/1.1 200 OK < Date: Fri, 27 Oct 2023 08:28:09 GMT < Server: Apache < Last-Modified: Thu, 26 Oct 2023 08:42:16 GMT < ETag: "a29d70-6089a8c91c494" < Accept-Ranges: bytes < Content-Length: 10657136 < { [14084 bytes data] 100 10.1M 100 10.1M 0 0 597M 0 --:--:-- --:--:-- --:--:-- 597M \* Connection #0 to host xxxx:xxxx:xxxx:xxxx::x left intact This verifies some of the components like the network setup and the httpd service running on ironic pods. - Also gathered listing of the contents of the ironic pod running in podman, specially in the shared directory. The contents of /shared/html/inspector.ipxe seems correct compared to a working installation, also all files look in place. - Logs from the ironic container shows the errors coming from the node being deployed, we also show here the curl log to compare: xxxx:xxxx:xxxx:xxxx:xxxx:xxxx:xxxx:xxxx - - [27/Oct/2023:08:19:55 +0000] "GET /images/ironic-python-agent.kernel HTTP/1.1" 400 226 "-" "iPXE/1.0.0+ (4bd064de)" xxxx:xxxx:xxxx:xxxx:xxxx:xxxx:xxxx:xxxx - - [27/Oct/2023:08:19:55 +0000] "GET /images/ironic-python-agent.kernel HTTP/1.1" 400 226 "-" "iPXE/1.0.0+ (4bd064de)" xxxx:xxxx:xxxx:xxxx::x - - [27/Oct/2023:08:20:23 +0000] "GET /images/ironic-python-agent.kernel HTTP/1.1" 200 10657136 "-" "curl/7.61.1" cxxxx:xxxx:xxxx:xxxx:xxxx:xxxx:xxxx:xxxx - - [27/Oct/2023:08:20:23 +0000] "GET /images/ironic-python-agent.kernel HTTP/1.1" 400 226 "-" "iPXE/1.0.0+ (4bd064de)" xxxx:xxxx:xxxx:xxxx:xxxx:xxxx:xxxx:xxxx - - [27/Oct/2023:08:20:23 +0000] "GET /images/ironic-python-agent.kernel HTTP/1.1" 400 226 "-" "iPXE/1.0.0+ (4bd064de)" Seems like an issue with iPXE and IPV6
OCPBUGS-11856 added a patch to change termination log permissions manually. Since 4.15.z this is no longer necessary, as it is fixed by the lumberjack dependency bump.
This bug tracks the carry revert.
A hostedcluster/hostedcontrolplane was stuck uninstalling. Inspecting the CPO logs showed: "error": "failed to delete AWS default security group: failed to delete security group sg-04abe599e5567b025: DependencyViolation: resource sg-04abe599e5567b025 has a dependent object\n\tstatus code: 400, request id: f776a43f-8750-4f04-95ce-457659f59095" Unfortunately, I do not have enough access to the AWS account to inspect this security group, though I know it is the default worker security group because it's recorded in the hostedcluster .status.platform.aws.defaultWorkerSecurityGroupID
Version-Release number of selected component (if applicable):
4.14.1
How reproducible:
I haven't tried to reproduce it yet, but can do so and update this ticket when I do. My theory is:
Steps to Reproduce:
1. Create an AWS HostedCluster, wait for it to create/populate defaultWorkerSecurityGroupID 2. Attach the defaultWorkerSecurityGroupID to anything else in the AWS account unrelated to the HCP cluster 3. Attempt to delete the HostedCluster
Actual results:
CPO logs: "error": "failed to delete AWS default security group: failed to delete security group sg-04abe599e5567b025: DependencyViolation: resource sg-04abe599e5567b025 has a dependent object\n\tstatus code: 400, request id: f776a43f-8750-4f04-95ce-457659f59095"
HostedCluster Status Condition - lastTransitionTime: "2023-11-09T22:18:09Z" message: "" observedGeneration: 3 reason: StatusUnknown status: Unknown type: CloudResourcesDestroyed
Expected results:
I would expect that the CloudResourcesDestroyed status condition on the hostedcluster would reflect this security group as holding up the deletion instead of having to parse through logs.
Additional info:
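A hedged sketch of how to find what still references the default worker security group when deletion is blocked (AWS CLI; the group ID is taken from the error above and the region is a placeholder):
# Network interfaces still attached to the security group
aws ec2 describe-network-interfaces --region <region> \
  --filters Name=group-id,Values=sg-04abe599e5567b025 \
  --query 'NetworkInterfaces[].{Id:NetworkInterfaceId,Desc:Description,Status:Status}'
# Other security groups whose rules reference it
aws ec2 describe-security-groups --region <region> \
  --filters Name=ip-permission.group-id,Values=sg-04abe599e5567b025 \
  --query 'SecurityGroups[].GroupId'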
Description of problem:
This is an issue that IBM Cloud found and it likely affects Power VS. See https://issues.redhat.com/browse/OCPBUGS-28870 Install a private cluster whose base domain in install-config.yaml is the same as another existing CIS domain name. After destroying the private cluster, the DNS resource records remain.
Version-Release number of selected component (if applicable):
4.15
How reproducible:
Always
Steps to Reproduce:
1. Create a DNS service instance, setting its domain to "ibmcloud.qe.devcluster.openshift.com". Note that this domain name is also used by another existing CIS domain. 2. Install a private IBM Cloud cluster with the base domain in install-config set to "ibmcloud.qe.devcluster.openshift.com" 3. Destroy the cluster 4. Check the remaining DNS records
Actual results:
$ ibmcloud dns resource-records 5f8a0c4d-46c2-4daa-9157-97cb9ad9033a -i preserved-openshift-qe-private | grep ci-op-17qygd06-23ac4 api-int.ci-op-17qygd06-23ac4.ibmcloud.qe.devcluster.openshift.com *.apps.ci-op-17qygd06-23ac4.ibmcloud.qe.devcluster.openshift.com api.ci-op-17qygd06-23ac4.ibmcloud.qe.devcluster.openshift.com
Expected results:
No more dns records about the cluster
Additional info:
$ ibmcloud dns zones -i preserved-openshift-qe-private | awk '{print $2}'
Name
private-ibmcloud.qe.devcluster.openshift.com
private-ibmcloud-1.qe.devcluster.openshift.com
ibmcloud.qe.devcluster.openshift.com
$ ibmcloud cis domains
Name
ibmcloud.qe.devcluster.openshift.com
When private-ibmcloud.qe.devcluster.openshift.com or private-ibmcloud-1.qe.devcluster.openshift.com is used as the domain, there is no such issue; when ibmcloud.qe.devcluster.openshift.com is used as the domain, the DNS records remain.
Please review the following PR: https://github.com/openshift/ibm-vpc-block-csi-driver/pull/63
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
When creating an alerting silence from the RHOCP UI without specifying the "Creator" field, the error "createdBy in body is required" is thrown, even though the "Creator" field is not marked as mandatory.
Version-Release number of selected component (if applicable):
4.15.5
How reproducible:
100%
Steps to Reproduce:
1. Login to webconsole (Admin view) 2. Observe > Alerting 3. Select the alert to silence 4. Click Create Silence. 5. in Info section, update the "Comment" field and skip the "Creator" field. Now, click on Create button. 6. It will throw an error "createdBy in body is required".
Actual results:
The UI allows submitting an alerting silence without the "Creator" field (it is not marked as mandatory), but the request then fails with "createdBy in body is required".
Expected results:
The user should not be able to create silences without specifying the "Creator" field, as it should be a mandatory field.
Additional info:
The steps work well on versions prior to RHOCP 4.15 (tested on 4.14).
This is a clone of issue OCPBUGS-35971. The following is the description of the original issue:
—
Description of problem:
Since 4.16.0 pods with memory limits tend to OOM very frequently when writing files larger than memory limit to PVC
Version-Release number of selected component (if applicable):
4.16.0-rc.4
How reproducible:
100% on certain types of storage (AWS FSx, certain LVMS setups, see additional info)
Steps to Reproduce:
1. Create pod/pvc that writes a file larger than the container memory limit (attached example) 2. 3.
Actual results:
OOMKilled
Expected results:
Success
Additional info:
For simplicity, I will focus on BM setup that produces this with LVM storage. This is also reproducible on AWS clusters with NFS backed NetApp ONTAP FSx. Further reduced to exclude the OpenShift layer, LVM on a separate (non root) disk: Prepare disk lvcreate -T vg1/thin-pool-1 -V 10G -n oom-lv mkfs.ext4 /dev/vg1/oom-lv mkdir /mnt/oom-lv mount /dev/vg1/oom-lv /mnt/oom-lv Run container podman run -m 600m --mount type=bind,source=/mnt/oom-lv,target=/disk --rm -it quay.io/centos/centos:stream9 bash [root@2ebe895371d2 /]# curl https://cloud.centos.org/centos/9-stream/x86_64/images/CentOS-Stream-GenericCloud-x86_64-9-20240527.0.x86_64.qcow2 -o /disk/temp % Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed 47 1157M 47 550M 0 0 111M 0 0:00:10 0:00:04 0:00:06 111MKilled (Notice the process gets killed, I don't think podman ever whacks the whole container over this though) The same process on the same hardware on a 4.15 node (9.2) does not produce an OOM (vs 4.16 which is RHEL 9.4) For completeness, I will provide some details about the setup behind the LVM pool, though I believe it should not impact the decision about whether this is an issue: sh-5.1# pvdisplay --- Physical volume --- PV Name /dev/sdb VG Name vg1 PV Size 446.62 GiB / not usable 4.00 MiB Allocatable yes PE Size 4.00 MiB Total PE 114335 Free PE 11434 Allocated PE 102901 PV UUID <UUID> Hardware: SSD (INTEL SSDSC2KG480G8R) behind a RAID 0 of a PERC H330 Mini controller At the very least, this seems like a change in behavior but tbh I am leaning towards an outright bug.
It's been independently verified that setting /sys/kernel/mm/lru_gen/enabled = 0 avoids the oomkills. So verifying that nodes get this value applied is the main testing concern at this point: new installs, upgrades, and new nodes scaled after an upgrade.
If we want to go so far as to verify that the oomkills don't happen, the kernel QE team has a simplified reproducer here, which involves mounting an NFS volume and using podman to create a container with a memory limit and writing data to that NFS volume.
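A hedged sketch of how to check that the workaround value is applied on every node (node iteration via oc debug; the sysfs path is the one named above):
# The value should read as 0 (all MGLRU features disabled) where the workaround is in effect
for node in $(oc get nodes -o name); do
  echo -n "$node: "
  oc debug "$node" -- chroot /host cat /sys/kernel/mm/lru_gen/enabled 2>/dev/null
done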
Please review the following PR: https://github.com/openshift/cluster-api-provider-gcp/pull/215
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
Adding a test case: when the openshift.io/image-tags quota is exceeded, creating new image references in the project is blocked.
Version-Release number of selected component (if applicable):
4.16
pr - https://github.com/openshift/origin/pull/28464
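A hedged sketch of the kind of quota the new test exercises (the project name and limit value are arbitrary examples):
# Limit the number of unique image tags per project, then tag/import images until the quota is hit
oc create -n <test-project> -f - <<'EOF'
apiVersion: v1
kind: ResourceQuota
metadata:
  name: image-tags-quota
spec:
  hard:
    openshift.io/image-tags: "3"
EOF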
Description of problem:
Similar to https://bugzilla.redhat.com/show_bug.cgi?id=1996624, when the AWS root credential (which must possess the "iam:SimulatePrincipalPolicy" permission) exists on a BM cluster, the CCO Pod crashes when running the secretannotator controller.
Steps to Reproduce:
1. Install a BM cluster fxie-mac:cloud-credential-operator fxie$ oc get infrastructures.config.openshift.io cluster -o yaml apiVersion: config.openshift.io/v1 kind: Infrastructure metadata: creationTimestamp: "2024-01-28T19:50:05Z" generation: 1 name: cluster resourceVersion: "510" uid: 45bc2a29-032b-4c74-8967-83c73b0141c4 spec: cloudConfig: name: "" platformSpec: type: None status: apiServerInternalURI: https://api-int.fxie-bm1.qe.devcluster.openshift.com:6443 apiServerURL: https://api.fxie-bm1.qe.devcluster.openshift.com:6443 controlPlaneTopology: SingleReplica cpuPartitioning: None etcdDiscoveryDomain: "" infrastructureName: fxie-bm1-x74wn infrastructureTopology: SingleReplica platform: None platformStatus: type: None 2. Create an AWS user with IAMReadOnlyAccess permissions: { "Version": "2012-10-17", "Statement": [ { "Effect": "Allow", "Action": [ "iam:GenerateCredentialReport", "iam:GenerateServiceLastAccessedDetails", "iam:Get*", "iam:List*", "iam:SimulateCustomPolicy", "iam:SimulatePrincipalPolicy" ], "Resource": "*" } ] } 3. Create AWS root credentials with a set of access keys of the user above 4. Trigger a reconcile of the secretannotator controller, e.g. via editting cloudcredential/cluster
Logs:
time="2024-01-29T04:47:27Z" level=warning msg="Action not allowed with tested creds" action="iam:CreateAccessKey" controller=secretannotator
time="2024-01-29T04:47:27Z" level=warning msg="Action not allowed with tested creds" action="iam:CreateUser" controller=secretannotator
time="2024-01-29T04:47:27Z" level=warning msg="Action not allowed with tested creds" action="iam:DeleteAccessKey" controller=secretannotator
time="2024-01-29T04:47:27Z" level=warning msg="Action not allowed with tested creds" action="iam:DeleteUser" controller=secretannotator
time="2024-01-29T04:47:27Z" level=warning msg="Action not allowed with tested creds" action="iam:DeleteUserPolicy" controller=secretannotator
time="2024-01-29T04:47:27Z" level=warning msg="Action not allowed with tested creds" action="iam:PutUserPolicy" controller=secretannotator
time="2024-01-29T04:47:27Z" level=warning msg="Action not allowed with tested creds" action="iam:TagUser" controller=secretannotator
time="2024-01-29T04:47:27Z" level=warning msg="Tested creds not able to perform all requested actions" controller=secretannotator
I0129 04:47:27.988535 1 reflector.go:289] Starting reflector *v1.Infrastructure (10h37m20.569091933s) from sigs.k8s.io/controller-runtime/pkg/cache/internal/informers.go:233
I0129 04:47:27.988546 1 reflector.go:325] Listing and watching *v1.Infrastructure from sigs.k8s.io/controller-runtime/pkg/cache/internal/informers.go:233
I0129 04:47:27.989503 1 reflector.go:351] Caches populated for *v1.Infrastructure from sigs.k8s.io/controller-runtime/pkg/cache/internal/informers.go:233
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x1a964a0]
goroutine 341 [running]:
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile.func1()
/go/src/github.com/openshift/cloud-credential-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:115 +0x1e5
panic({0x3fe72a0?, 0x809b9e0?})
/usr/lib/golang/src/runtime/panic.go:914 +0x21f
github.com/openshift/cloud-credential-operator/pkg/operator/utils/aws.LoadInfrastructureRegion({0x562e1c0?, 0xc002c99a70?}, {0x5639ef0, 0xc0001b6690})
/go/src/github.com/openshift/cloud-credential-operator/pkg/operator/utils/aws/utils.go:72 +0x40
github.com/openshift/cloud-credential-operator/pkg/operator/secretannotator/aws.(*ReconcileCloudCredSecret).validateCloudCredsSecret(0xc0008c2000, 0xc002586000)
/go/src/github.com/openshift/cloud-credential-operator/pkg/operator/secretannotator/aws/reconciler.go:206 +0x1a5
github.com/openshift/cloud-credential-operator/pkg/operator/secretannotator/aws.(*ReconcileCloudCredSecret).Reconcile(0xc0008c2000, {0x30?, 0xc000680c00?}, {0x4f38a3d?, 0x0?}, {0x4f33a20?, 0x416325?})
/go/src/github.com/openshift/cloud-credential-operator/pkg/operator/secretannotator/aws/reconciler.go:166 +0x605
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile(0x561ff20?, {0x561ff20?, 0xc002ff3b00?}, {0x4f38a3d?, 0x3b180c0?}, {0x4f33a20?, 0x55eea08?})
/go/src/github.com/openshift/cloud-credential-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:118 +0xb7
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc000189360, {0x561ff58, 0xc0007e5040}, {0x4589f00?, 0xc000570b40?})
/go/src/github.com/openshift/cloud-credential-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:314 +0x365
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc000189360, {0x561ff58, 0xc0007e5040})
/go/src/github.com/openshift/cloud-credential-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:265 +0x1c9
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2()
/go/src/github.com/openshift/cloud-credential-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:226 +0x79
created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2 in goroutine 183
/go/src/github.com/openshift/cloud-credential-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:222 +0x565
Actual results:
CCO Pod crashes and restarts in a loop: fxie-mac:cloud-credential-operator fxie$ oc get po -n openshift-cloud-credential-operator -w NAME READY STATUS RESTARTS AGE cloud-credential-operator-657bdffdff-9wzrs 2/2 Running 3 (2m35s ago) 8h
This is a clone of issue OCPBUGS-35519. The following is the description of the original issue:
—
Description of problem:
In an attempt to fix https://issues.redhat.com/browse/OCPBUGS-35300, we introduced an Azure-specific dependency on dnsmasq, which created a dependency loop. This bug aims to revert that dependency chain.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Please review the following PR: https://github.com/openshift/vmware-vsphere-csi-driver-operator/pull/222
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
This is a clone of issue OCPBUGS-35316. The following is the description of the original issue:
—
Description of problem:
Live migration gets stuck when the ConfigMap "mtu" is absent. The ConfigMap should be created by the mtu-prober job at installation time since 4.11, but if the cluster was upgraded from a very early release, such as 4.4.4, the ConfigMap may be absent.
Version-Release number of selected component (if applicable):
4.16.rc2
How reproducible:
Steps to Reproduce:
1. build a 4.16 cluster with OpenShiftSDN 2. remove the configmap mtu from the namespace cluster-network-operator. 3. start live migration.
Actual results:
Live migration gets stuck with error NetworkTypeMigrationFailed Failed to process SDN live migration (configmaps "mtu" not found)
Expected results:
Live migration finished successfully.
Additional info:
A workaround is to create the configmap mtu manually before starting live migration.
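A hedged sketch of that manual workaround (the namespace is assumed to be openshift-network-operator and the data key is assumed to be "mtu"; confirm both, and the cluster's actual MTU value, against a healthy cluster before applying):
# Recreate the missing ConfigMap before starting the live migration
oc -n openshift-network-operator create configmap mtu --from-literal=mtu=<cluster-mtu>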
This is a clone of issue OCPBUGS-30986. The following is the description of the original issue:
—
Description of problem:
After we applied the old tlsSecurityProfile to the Hypershift hosted cluster, the apiserver ran into a CrashLoopBackOff failure; this blocked our test.
Version-Release number of selected component (if applicable):
$ oc get clusterversion NAME VERSION AVAILABLE PROGRESSING SINCE STATUS version 4.16.0-0.nightly-2024-03-13-061822 True False 129m Cluster version is 4.16.0-0.nightly-2024-03-13-061822
How reproducible:
always
Steps to Reproduce:
1. Specify KUBECONFIG with kubeconfig of the Hypershift management cluster 2. hostedcluster=$( oc get -n clusters hostedclusters -o json | jq -r .items[].metadata.name) 3. oc patch hostedcluster $hostedcluster -n clusters --type=merge -p '{"spec": {"configuration": {"apiServer": {"tlsSecurityProfile":{"old":{},"type":"Old"}}}}}' hostedcluster.hypershift.openshift.io/hypershift-ci-270930 patched 4. Checked the tlsSecurityProfile, $ oc get HostedCluster $hostedcluster -n clusters -ojson | jq .spec.configuration.apiServer { "audit": { "profile": "Default" }, "tlsSecurityProfile": { "old": {}, "type": "Old" } }
Actual results:
One of the kube-apiserver of Hosted cluster ran into CrashLoopBackOff, stuck in this status, unable to complete the old tlsSecurityProfile configuration. $ oc get pods -l app=kube-apiserver -n clusters-${hostedcluster} NAME READY STATUS RESTARTS AGE kube-apiserver-5b6fc94b64-c575p 5/5 Running 0 70m kube-apiserver-5b6fc94b64-tvwtl 5/5 Running 0 70m kube-apiserver-84c7c8dd9d-pnvvk 4/5 CrashLoopBackOff 6 (20s ago) 7m38s
Expected results:
Applying the old tlsSecurityProfile should be successful.
Additional info:
This also can be reproduced on 4.14, 4.15. We have the last passed log of the test case as below: passed API_Server 2024-02-19 13:34:25(UTC) aws 4.14.0-0.nightly-2024-02-18-123855 hypershift passed API_Server 2024-02-08 02:24:15(UTC) aws 4.15.0-0.nightly-2024-02-07-062935 hypershift passed API_Server 2024-02-17 08:33:37(UTC) aws 4.16.0-0.nightly-2024-02-08-073857 hypershift From the history of the test, it seems that some code changes were introduced in February that caused the bug.
Description of problem:
Based on the Azure doc [1], NCv2-series Azure virtual machines (VMs) were retired on September 6, 2023. VMs can no longer be provisioned on those instance types, so remove standardNCSv2Family from the tested_instance_types_x86_64 list in the Azure docs for 4.13+. [1] https://learn.microsoft.com/en-us/azure/virtual-machines/ncv2-series
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. cluster is installed failed on NCv2 series instance type 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
The installer - in some cases - will not report an error when it fails to detect the API. On Azure, the bootstrap node is under-provisioned for IOPS. The detection logic with check_url is checking against an endpoint that returns 403 on HEAD requests.
Version-Release number of selected component (if applicable):
How reproducible:
All the time
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
This is a clone of issue OCPBUGS-41824. The following is the description of the original issue:
—
Description of problem:
The kubeconfigs for the DNS Operator and the Ingress Operator are managed by Hypershift, but they should only be managed by the cloud service provider. This can lead to the kubeconfig/certificate being invalid in cases where the cloud service provider further manages the kubeconfig (for example, CA rotation).
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
Documentation for using Red Hat subscriptions in builds is missing a few important steps, especially for customers that have not turned on the tech preview feature for the Shared Resource CSI driver. These are the following: 1. Customer needs Simple Content Access import enabled in the Insights Operator: https://docs.openshift.com/container-platform/4.12/support/remote_health_monitoring/insights-operator-simple-access.html 2. Customer needs to copy the secret data from openshift-config-managed/etc-pki-entitlement to the workspace the build is running in. We should provide oc commands that a cluster admin/platform team can execute. For builds that are running in a network-restricted environment and access RHEL content through Satellite, the documentation must also provide instructions on how to obtain an `rhsm.conf` file for the Satellite instance and mount it into the build container.
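A hedged sketch of the kind of oc commands the documentation could provide for copying the entitlement secret into the namespace where the build runs (the target namespace is a placeholder; the source secret name and namespace are the ones named above):
# Copy the entitlement keys from the managed secret into the build's namespace
oc get secret etc-pki-entitlement -n openshift-config-managed -o json \
  | jq 'del(.metadata.namespace, .metadata.uid, .metadata.resourceVersion, .metadata.creationTimestamp, .metadata.ownerReferences)' \
  | oc apply -n <build-namespace> -f -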
Version-Release number of selected component (if applicable):
4.12
How reproducible:
Always
Steps to Reproduce:
Read the documentation for https://docs.openshift.com/container-platform/4.12/cicd/builds/running-entitled-builds.html#builds-create-imagestreamtag_running-entitled-builds and execute the commands as is.
Actual results:
Build probably won't run because the required secret is not created.
Expected results:
Customers should be able to run a build that requires RHEL entitlements following the exact steps as described in the doc.
Additional info:
https://docs.openshift.com/container-platform/4.12/cicd/builds/running-entitled-builds.html
This is a clone of issue OCPBUGS-35483. The following is the description of the original issue:
—
Description of problem:
Version-Release number of selected component (if applicable):
4.16.0-0.nightly-2024-06-13-084629
How reproducible:
100%
Steps to Reproduce:
1.apply configmap ***** apiVersion: v1 kind: ConfigMap metadata: name: cluster-monitoring-config namespace: openshift-monitoring data: config.yaml: | prometheusK8s: remoteWrite: - url: "http://invalid-remote-storage.example.com:9090/api/v1/write" queue_config: max_retries: 1 ***** 2. check logs % oc logs -c prometheus prometheus-k8s-0 -n openshift-monitoring ... ts=2024-06-14T01:28:01.804Z caller=dedupe.go:112 component=remote level=warn remote_name=5ca657 url=http://invalid-remote-storage.example.com:9090/api/v1/write msg="Failed to send batch, retrying" err="Post \"http://invalid-remote-storage.example.com:9090/api/v1/write\": dial tcp: lookup invalid-remote-storage.example.com on 172.30.0.10:53: no such host" 3.query after 15mins % oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://thanos-querier.openshift-monitoring.svc:9091/api/v1/query?' --data-urlencode 'query=ALERTS{alertname="PrometheusRemoteStorageFailures"}' | jq % Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed 100 145 100 78 100 67 928 797 --:--:-- --:--:-- --:--:-- 1726 { "status": "success", "data": { "resultType": "vector", "result": [], "analysis": {} } } % oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://thanos-querier.openshift-monitoring.svc:9091/api/v1/query?' --data-urlencode 'query=prometheus_remote_storage_failures_total' | jq % Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed 100 124 100 78 100 46 1040 613 --:--:-- --:--:-- --:--:-- 1653 { "status": "success", "data": { "resultType": "vector", "result": [], "analysis": {} } }
Actual results:
alert did not triggeted
Expected results:
alert triggered, able to see the alert and metrics
Additional info:
below metrics show as `No datapoints found.` prometheus_remote_storage_failures_total prometheus_remote_storage_samples_dropped_total prometheus_remote_storage_retries_total
`prometheus_remote_storage_samples_failed_total` value is 0
Description of problem:
For special operators, there is warning info on the operator detail modal and on the installation page when it is an Azure WI/FI cluster. The warning info titles are not consistent across these two pages.
Version-Release number of selected component (if applicable):
4.15.0-0.nightly-2023-12-21-155123
How reproducible:
Always
Steps to Reproduce:
1. Prepare a special operator which would show warning info on an Azure WI/FI cluster. 2. Log in to the console of the Azure WI/FI cluster and check the warning info title on the operator detail item modal and the installation page. 3.
Actual results:
2. On the operator detail item modal, the warning title is "Cluster in Azure Workload Identity / Federated Identity Mode", while on the installation page the warning title is "Cluster in Workload Identity / Federated Identity Mode". The word "Azure" is missing on the second page.
Expected results:
2. The warning title should be consistent on both pages.
Additional info:
screenshot: https://drive.google.com/drive/folders/1alFBEtO1gN4q5_mAtHCNzuLTOe5zXp0K?usp=drive_link
Description of problem:
The node-network-identity deployment should be configured to support a controlled rollout of the microservice pods. The general goal is to have a microservice pod only report to Kubernetes as being Ready when it has completed initialization and is stable enough to complete tasks.
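A hedged sketch of the kind of readiness gating being described, using oc set probe (the namespace, probe path, and port are assumptions, not the component's actual endpoint):
# Only report the pod as Ready once its health endpoint responds
oc -n <component-namespace> set probe deployment/node-network-identity \
  --readiness --get-url=http://:8080/readyz --initial-delay-seconds=5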
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Please review the following PR: https://github.com/openshift/coredns/pull/107
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
As an ODC Helm backend developer, I would like to bump the version of Helm to 3.13 to stay in sync with the version we will ship with OCP 4.15.
This is normal activity we do every time a new OCP version is released, to stay current.
NA
NA
Bump the version of Helm to 3.13; run, build, and unit test, and make sure everything is working as expected. Last time we had a conflict with the DevFile backend.
There might be dependencies on the DevFile team to move some dependencies forward.
NA
Console Helm dependency is moved to 3.13
Dependencies identified
Blockers noted and expected delivery timelines set
Design is implementable
Acceptance criteria agreed upon
Story estimated
Unknown
Verified
Unsatisfied
Description of problem:
Logs like the following are constantly emitted by the clustersizing controller: {"level":"error","ts":"2024-05-13T11:30:43Z","msg":"Reconciler error","controller":"hostedclustersizing","controllerGroup":"hypershift.openshift.io","controllerKind":"HostedCluster","HostedCluster":{"name":"dry-4","namespace":"ocm-staging-2b611n1002jcb8ikrn3to0vbds64qup9"},"namespace":"ocm-staging-2b611n1002jcb8ikrn3to0vbds64qup9","name":"dry-4","reconcileID":"7e4c2fa1-a2cb-40e7-bed3-38d2d498e59d","error":"could not get hosted cluster ocm-staging-2b611n1002jcb8ikrn3to0vbds64qup9/dry-4: HostedCluster.hypershift.openshift.io \"dry-4\" not found","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/opt/app-root/src/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:329\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/opt/app-root/src/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/opt/app-root/src/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:227"}
Version-Release number of selected component (if applicable):
4.16.0
How reproducible:
sometimes
Steps to Reproduce:
1. Set up a management cluster with size tagging 2. Create a hosted cluster 3. Delete the hosted cluster
Actual results:
The hypershift operator continues logging about not being able to find the deleted hosted cluster.
Expected results:
No additional logging happens.
Additional info:
The clustersizing controller returns an error when it can't find a hostedcluster instead of returning nil. This causes that hostedcluster to be requeued indefinitely.
This is a clone of issue OCPBUGS-38457. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-29240. The following is the description of the original issue:
—
Manila drivers and node-registrar should be configured to use healthchecks.
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1.
2.
3.
Actual results:
Expected results:
Additional info:
Please fill in the following template while reporting a bug and provide as much relevant information as possible. Doing so will give us the best chance to find a prompt resolution.
Affected Platforms:
Is it an
If it is an internal RedHat testing failure:
If it is a CI failure:
If it is a customer / SD issue:
The upstream project reorganized the config directory and we need to adapt it for downstream. Until then, upstream->downstream syncing is blocked.
Please review the following PR: https://github.com/openshift/machine-api-provider-gcp/pull/75
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
apbexternalroute and egressfirewall statuses show as empty on a HyperShift hosted cluster
Version-Release number of selected component (if applicable):
4.15.0-0.nightly-2023-12-17-173511
How reproducible:
always
Steps to Reproduce:
1. setup hypershift, login hosted cluster % oc get node NAME STATUS ROLES AGE VERSION ip-10-0-128-55.us-east-2.compute.internal Ready worker 125m v1.28.4+7aa0a74 ip-10-0-129-197.us-east-2.compute.internal Ready worker 125m v1.28.4+7aa0a74 ip-10-0-135-106.us-east-2.compute.internal Ready worker 125m v1.28.4+7aa0a74 ip-10-0-140-89.us-east-2.compute.internal Ready worker 125m v1.28.4+7aa0a74 2. create new project test % oc new-project test 3. create apbexternalroute and egressfirewall on hosted cluster apbexternalroute yaml file: --- apiVersion: k8s.ovn.org/v1 kind: AdminPolicyBasedExternalRoute metadata: name: apbex-route-policy spec: from: namespaceSelector: matchLabels: kubernetes.io/metadata.name: test nextHops: static: - ip: "172.18.0.8" - ip: "172.18.0.9" % oc apply -f apbexroute.yaml adminpolicybasedexternalroute.k8s.ovn.org/apbex-route-policy created egressfirewall yaml file: --- apiVersion: k8s.ovn.org/v1 kind: EgressFirewall metadata: name: default spec: egress: - type: Allow to: cidrSelector: 0.0.0.0/0 % oc apply -f egressfw.yaml egressfirewall.k8s.ovn.org/default created 3. oc get apbexternalroute and oc get egressfirewall
Actual results:
The status show empty: % oc get apbexternalroute NAME LAST UPDATE STATUS apbex-route-policy 49s <--- status is empty % oc describe apbexternalroute apbex-route-policy | tail -n 8 Status: Last Transition Time: 2023-12-19T06:54:17Z Messages: ip-10-0-135-106.us-east-2.compute.internal: configured external gateway IPs: 172.18.0.8,172.18.0.9 ip-10-0-129-197.us-east-2.compute.internal: configured external gateway IPs: 172.18.0.8,172.18.0.9 ip-10-0-128-55.us-east-2.compute.internal: configured external gateway IPs: 172.18.0.8,172.18.0.9 ip-10-0-140-89.us-east-2.compute.internal: configured external gateway IPs: 172.18.0.8,172.18.0.9 Events: <none> % oc get egressfirewall NAME EGRESSFIREWALL STATUS default <--- status is empty % oc describe egressfirewall default | tail -n 8 Type: Allow Status: Messages: ip-10-0-129-197.us-east-2.compute.internal: EgressFirewall Rules applied ip-10-0-128-55.us-east-2.compute.internal: EgressFirewall Rules applied ip-10-0-140-89.us-east-2.compute.internal: EgressFirewall Rules applied ip-10-0-135-106.us-east-2.compute.internal: EgressFirewall Rules applied Events: <none>
Expected results:
the status can be shown correctly
Additional info:
Please fill in the following template while reporting a bug and provide as much relevant information as possible. Doing so will give us the best chance to find a prompt resolution.
Affected Platforms:
Is it an
If it is an internal RedHat testing failure:
If it is a CI failure:
If it is a customer / SD issue:
Please review the following PR: https://github.com/openshift/images/pull/158
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
In the `DoHTTPProbe` function located at `github.com/openshift/router/pkg/router/metrics/probehttp/probehttp.go`, logging of the HTTP response object at verbosity level 4 results in a serialisation error due to non-serialisable fields within the `http.Response` object. The error logged is `<<error: json: unsupported type: func() (io.ReadCloser, error)>>`, pointing towards an inability to serialise the `Body` field, which is of type `io.ReadCloser`.
This function is designed to check if a GET request to the specified URL succeeds, logging detailed response information at higher verbosity levels for diagnostic purposes.
Steps to Reproduce:
1. Increase the logging level to 4.
2. Perform an operation that triggers the `DoHTTPProbe` function.
3. Review the logging output for the error message.
Expected Behaviour:
The logger should gracefully handle or exclude non-serialisable fields like `Body`, ensuring clean and informative logging output that aids in diagnostics without encountering serialisation errors.
Actual Behaviour:
Non-serialisable fields in the `http.Response` object lead to the error `<<error: json: unsupported type: func() (io.ReadCloser, error)>>` being logged. This diminishes the utility of logs for debugging at higher verbosity levels.
Impact:
The issue is considered of low severity since it only appears at logging level 4, which is beyond standard operational levels (level 2) used in production. Nonetheless, it could hinder effective diagnostics and clutter logs with errors when high verbosity levels are employed for troubleshooting.
Suggested Fix:
Modify the logging functionality within `DoHTTPProbe` to either filter out non-serialisable fields from the `http.Response` object or implement a custom serialisation approach that allows these fields to be logged in a more controlled and error-free manner.
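A quick way to confirm the symptom, as a sketch that assumes the default router deployment in the openshift-ingress namespace and that verbosity has already been raised to 4 per the reproduction steps, is to grep the router logs for the error string quoted above:

$ oc -n openshift-ingress logs deploy/router-default --all-containers | grep -c 'unsupported type: func() (io.ReadCloser, error)'

A non-zero count means the probe logging is still hitting the serialisation error.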
Issue customer is experiencing:
Despite manually removing the alternate service (old) and saving the configuration from the UI, the alternate service did not get removed from the route, and the changes did not take effect.
From the UI, when using the Form view, selecting Remove Alternate Service and clicking Save has no effect: refreshing the route information still shows the route configuration with the alternate service defined.
If they use the YAML view, and remove the entry from there and save it's gone properly.
If they use the CLI and edit the route, and remove the alternate service section, it also works properly.
Tests:
I have tested this scenario in my test cluster with OCP v4.13
This fix contains the following changes coming from updated version of kubernetes up to v1.29.7:
Changelog:
v1.29.7: https://github.com/kubernetes/kubernetes/blob/release-1.29/CHANGELOG/CHANGELOG-1.29.md#changelog-since-v1296
Please review the following PR: https://github.com/openshift/cluster-api-provider-baremetal/pull/206
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
The e2e-aws-ovn-shared-to-local-gateway-mode-migration and e2e-aws-ovn-local-to-shared-gateway-mode-migration jobs fail about 50% of the time with + oc patch Network.operator.openshift.io cluster --type=merge --patch '{"spec":{"defaultNetwork":{"ovnKubernetesConfig":{"gatewayConfig":{"routingViaHost":false}}}}}' network.operator.openshift.io/cluster patched + oc wait co network --for=condition=PROGRESSING=True --timeout=60s error: timed out waiting for the condition on clusteroperators/network
This is a clone of issue OCPBUGS-38616. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-38599. The following is the description of the original issue:
—
Description of problem:
If folder is undefined and the datacenter exists in a datacenter-based folder
the installer will create the entire path of folders from the root of vcenter - which is incorrect
This does not occur if folder is defined.
An upstream bug was identified when debugging this:
Description of problem:
During a pod deletion, the whereabouts reconciler correctly detects the pod deletion but it errors out claiming that the IPPool is not found. However, when checking the audit logs, we can see no deletion, no re-creation, and we can even see successful "patch" and "get" requests to the same IPPool. This means that the IPPool was never deleted and was properly accessible at the time of the issue, so the error suggests the reconciler made some mistake while retrieving the IPPool.
Version-Release number of selected component (if applicable):
4.12.22
How reproducible:
Sometimes
Steps to Reproduce:
1.Delete pod 2. 3.
Actual results:
Error in the whereabouts reconciler. New pods using additional networks with the whereabouts IPAM plugin cannot have IPs allocated due to the wrong cleanup.
Expected results:
Additional info:
Description of problem:
oc-mirror with v2 will create the idms file as output , but the source is like : apiVersion: config.openshift.io/v1 kind: ImageDigestMirrorSet metadata: creationTimestamp: null name: idms-2024-01-08t04-19-04z spec: imageDigestMirrors: - mirrors: - ec2-3-144-29-184.us-east-2.compute.amazonaws.com:5000/ocp2/openshift source: localhost:55000/openshift - mirrors: - ec2-3-144-29-184.us-east-2.compute.amazonaws.com:5000/ocp2/openshift-release-dev source: quay.io/openshift-release-dev status: {} The source should always be the origin registry like :quay.io/openshift-release-dev
Version-Release number of selected component (if applicable):
How reproducible:
always
Steps to Reproduce:
1. run the command with v2 : apiVersion: mirror.openshift.io/v1alpha2 kind: ImageSetConfiguration mirror: platform: channels: - name: stable-4.14 minVersion: 4.14.3 maxVersion: 4.14.3 graph: true `oc-mirror --config config.yaml file://out --v2` `oc-mirror --config config.yaml --from file://out --v2 docker://xxxx:5000/ocp2` 2. check the idms file
Actual results:
2. cat idms-2024-01-08t04-19-04z.yaml apiVersion: config.openshift.io/v1 kind: ImageDigestMirrorSet metadata: creationTimestamp: null name: idms-2024-01-08t04-19-04z spec: imageDigestMirrors: - mirrors: - xxxx.com:5000/ocp2/openshift source: localhost:55000/openshift - mirrors: - xxxx.com:5000/ocp2/openshift-release-dev source: quay.io/openshift-release-dev
Expected results:
The source should not be localhost:55000, should be like the origin registry.
Additional info:
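A simple check, using the file name from the output above, that the generated IDMS no longer carries a localhost source once this is fixed:

$ grep -n 'source:' idms-2024-01-08t04-19-04z.yaml
$ grep -c 'source: localhost' idms-2024-01-08t04-19-04z.yaml   # should print 0 after the fix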
hypershift is not creating this alert in HostedClusters
https://github.com/openshift/cluster-kube-apiserver-operator/blob/master/bindata/assets/alerts/podsecurity-violations.yaml
In standalone OCP, it is done by the KASO.
Description of problem:
NodeLogQuery e2e tests are failing with Kubernetes 1.28 bump. Example: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_kubernetes/1646/pull-ci-openshift-kubernetes-master-k8s-e2e-gcp-ovn/1683472309211369472
Version-Release number of selected component (if applicable):
4.15
How reproducible:
Always
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
There is no response when user clicks on quickstart items.
Version-Release number of selected component (if applicable):
4.16.0-0.nightly-2024-02-26-013420 browser 122.0.6261.69 (Official Build) (64-bit)
How reproducible:
always
Steps to Reproduce:
1.Go to quick starts page by clicking "View all quick starts" on Home -> Overview page. 2. Click on any quickstart item to check its steps. 3.
Actual results:
2. There is no response.
Expected results:
2. Should open quickstart sidepage for installation instructions.
Additional info:
The issue doesn't exist on firefox 123.0 (64-bit)
This is a follow-up to https://issues.redhat.com/browse/OCPBUGS-14829 and https://issues.redhat.com/browse/OCPBUGS-21821: let's add an alert when vSphere users configure usernames without a domain, since that makes storage not work and should be surfaced to the administrator.
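As a rough sketch of the condition the alert should catch, assuming the standard vsphere-creds secret in kube-system with "<vcenter>.username" keys (these names are assumptions, not the proposed alert implementation):

$ oc -n kube-system get secret vsphere-creds -o json \
    | jq -r '.data | to_entries[] | select(.key | endswith(".username")) | .value | @base64d'

If the printed username has no domain part (no "@" suffix or "DOMAIN\" prefix), the proposed alert should fire.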
Please review the following PR: https://github.com/openshift/ibm-powervs-block-csi-driver-operator/pull/53
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
OCP 4.15 nightly deployment on a Bare-metal servers without using the provisioning network is stuck during deployment. Job history: https://prow.ci.openshift.org/job-history/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.15-e2e-telco5g Deployment stuck similiar to this: Upstream job logs: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.15-e2e-telco5g/1732520780954079232/artifacts/e2e-telco5g/telco5g-cluster-setup/artifacts/cloud-init-output.log ~~~ level=debug msg=ironic_node_v1.openshift-master-host[2]: Creating...level=debug msg=ironic_node_v1.openshift-master-host[0]: Creating...level=debug msg=ironic_node_v1.openshift-master-host[1]: Creating...level=debug msg=ironic_node_v1.openshift-master-host[0]: Still creating... [10s elapsed]..level=debug msg=ironic_node_v1.openshift-master-host[0]: Still creating... [2h28m51s elapsed]level=debug msg=ironic_node_v1.openshift-master-host[1]: Still creating... [2h28m51s elapsed] ~~~ Ironic logs from bootstrap node: ~~~ Dec 07 13:10:13 localhost.localdomain start-provisioning-nic.sh[3942]: Error: failed to modify ipv4.addresses: invalid IP address: Invalid IPv4 address ''. Dec 07 13:10:13 localhost.localdomain systemd[1]: provisioning-interface.service: Main process exited, code=exited, status=2/INVALIDARGUMENT Dec 07 13:10:13 localhost.localdomain systemd[1]: provisioning-interface.service: Failed with result 'exit-code'. Dec 07 13:10:13 localhost.localdomain systemd[1]: Failed to start Provisioning interface. Dec 07 13:10:13 localhost.localdomain systemd[1]: Dependency failed for DHCP Service for Provisioning Network. Dec 07 13:10:13 localhost.localdomain systemd[1]: ironic-dnsmasq.service: Job ironic-dnsmasq.service/start failed with result 'dependency' ~~~
Version-Release number of selected component (if applicable):
4.15
How reproducible:
Everytime
Steps to Reproduce:
1.Deploy OCP More information about our setup: In our environment, We have 3 virtual master node, 1 virtual worker and 1 baremetal worker. We use KCLI tool for creation of the virtual environment and for running the deployment workflow using IPI, In our setup we don't use provisioning network. (Same setup is used for other OCP version till 4.14 and are working fine.) We have attached our install-config.yaml (for RH employees) and logs from bootstrap node.
Actual results:
Deployment is failing Dec 07 13:10:13 localhost.localdomain start-provisioning-nic.sh[3942]: Error: failed to modify ipv4.addresses: invalid IP address: Invalid IPv4 address ''.
Expected results:
Deployment should pass
Additional info:
Description of problem:
In https://github.com/openshift/installer/pull/8248, the bootstrap node metadata was overridden and the proxy information was no longer used.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
This is a clone of issue OCPBUGS-35416. The following is the description of the original issue:
—
Description of problem:
The presubmit test that expects an inactive CPMS to be regenerated resets the state at the end of the test. In doing so, it causes the CPMS generator to re-generate back to the original state. Part of regeneration involves deleting and recreating the CPMS. If the regeneration is not quick enough, the next part of the test can fail, as it is expecting the CPMS to exist. We should change this to an Eventually to avoid the race between the generator and the test. See https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_cluster-control-plane-machine-set-operator/304/pull-ci-openshift-cluster-control-plane-machine-set-operator-release-4.13-e2e-aws-operator/1801195115868327936 as an example failure
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Please review the following PR: https://github.com/openshift/csi-operator/pull/81
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
During OpenShift cluster installation - the 4.16 OpenShift installer, which uses the terraform module, is unable to create tags for the security groups associated with master / worker nodes since the tag is in key=value format. Error log for reference: level=error msg="failed to fetch Cluster: failed to generate asset \"Cluster\": failed to create cluster: failed during pre-provisioning: failed to create security groups: failed to tag the Control plane security group: Resource not found: [PUT https://example.cloud:443/v2.0/security-groups/sg-id/tags/openshiftClusterID=ocpclientprod2-vwgsc]
Version-Release number of selected component (if applicable):
4.16.0
How reproducible:
100%
Steps to Reproduce:
1. Create install-config 2. run the 4.16 installer 3. Observe the installation logs
Actual results:
installation fails to tag the security group
Expected results:
installation to be successful
Additional info:
Please review the following PR: https://github.com/openshift/cluster-samples-operator/pull/527
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Changes made for faster risk cache-warming (the OCPBUGS-19512 series) introduced an unfortunate cycle:
1. Cincinnati serves vulnerable PromQL, like graph-data#4524.
2. Clusters pick up that broken PromQL, try to evaluate, and fail. Re-eval-and-fail loop continues.
3. Cincinnati PromQL fixed, like graph-data#4528.
4. Cases:
The regression went back via:
Updates from those releases (and later in their 4.y, until this bug lands a fix) to later releases are exposed.
Likely very reproducible for exposed releases, but only when clusters are served PromQL risks that will consistently fail evaluation.
1. Launch a cluster.
2. Point it at dummy Cincinnati data, as described in OTA-520. Initially declare a risk with broken PromQL in that data, like cluster_operator_conditions.
3. Wait until the cluster is reporting Recommended=Unknown for those risks (oc adm upgrade --include-not-recommended).
4. Update the risk to working PromQL, like group(cluster_operator_conditions). Alternatively, update anything about the update-service data (e.g. adding a new update target with a path from the cluster's version).
5. Wait 10 minutes for the CVO to have plenty of time to pull that new Cincinnati data.
6. oc get -o json clusterversion version | jq '.status.conditionalUpdates[].risks[].matchingRules[].promql.promql' | sort | uniq | jq -r .
Exposed releases will still have the broken PromQL in their output (or will lack the new update target you added, or whatever the Cincinnati data change was).
Fixed releases will have picked up the fixed PromQL in their output (or will have the new update target you added, or whatever the Cincinnati data change was).
To detect exposure in collected Insights, look for EvaluationFailed conditionalUpdates like:
$ oc get -o json clusterversion version | jq -r '.status.conditionalUpdates[].conditions[] | select(.type == "Recommended" and .status == "Unknown" and .reason == "EvaluationFailed" and (.message | contains("invalid PromQL")))' { "lastTransitionTime": "2023-12-15T22:00:45Z", "message": "Exposure to AROBrokenDNSMasq is unknown due to an evaluation failure: invalid PromQL result length must be one, but is 34\nAdding a new worker node will fail for clusters running on ARO. https://issues.redhat.com/browse/MCO-958", "reason": "EvaluationFailed", "status": "Unknown", "type": "Recommended" }
To confirm in-cluster vs. other EvaluationFailed invalid PromQL issues, you can look for Cincinnati retrieval attempts in CVO logs. Example from a healthy cluster:
$ oc -n openshift-cluster-version logs -l k8s-app=cluster-version-operator --tail -1 --since 30m | grep 'request updates from\|PromQL' | tail I1221 20:36:39.783530 1 cincinnati.go:114] Using a root CA pool with 0 root CA subjects to request updates from https://api.openshift.com/api/upgrades_info/v1/graph?... I1221 20:36:39.831358 1 promql.go:118] evaluate PromQL cluster condition: "(\n group(cluster_operator_conditions{name=\"aro\"})\n or\n 0 * group(cluster_operator_conditions)\n)\n" I1221 20:40:19.674925 1 cincinnati.go:114] Using a root CA pool with 0 root CA subjects to request updates from https://api.openshift.com/api/upgrades_info/v1/graph?... I1221 20:40:19.727998 1 promql.go:118] evaluate PromQL cluster condition: "(\n group(cluster_operator_conditions{name=\"aro\"})\n or\n 0 * group(cluster_operator_conditions)\n)\n" I1221 20:43:59.567369 1 cincinnati.go:114] Using a root CA pool with 0 root CA subjects to request updates from https://api.openshift.com/api/upgrades_info/v1/graph?... I1221 20:43:59.620315 1 promql.go:118] evaluate PromQL cluster condition: "(\n group(cluster_operator_conditions{name=\"aro\"})\n or\n 0 * group(cluster_operator_conditions)\n)\n" I1221 20:47:39.457582 1 cincinnati.go:114] Using a root CA pool with 0 root CA subjects to request updates from https://api.openshift.com/api/upgrades_info/v1/graph?... I1221 20:47:39.509505 1 promql.go:118] evaluate PromQL cluster condition: "(\n group(cluster_operator_conditions{name=\"aro\"})\n or\n 0 * group(cluster_operator_conditions)\n)\n" I1221 20:51:19.348286 1 cincinnati.go:114] Using a root CA pool with 0 root CA subjects to request updates from https://api.openshift.com/api/upgrades_info/v1/graph?... I1221 20:51:19.401496 1 promql.go:118] evaluate PromQL cluster condition: "(\n group(cluster_operator_conditions{name=\"aro\"})\n or\n 0 * group(cluster_operator_conditions)\n)\n"
showing fetch lines every few minutes. And from an exposed cluster, only showing PromQL eval lines:
$ oc -n openshift-cluster-version logs -l k8s-app=cluster-version-operator --tail -1 --since 30m | grep 'request updates from\|PromQL' | tail I1221 20:50:10.165101 1 availableupdates.go:123] Requeue available-update evaluation, because "4.13.25" is Recommended=Unknown: EvaluationFailed: Exposure to AROBrokenDNSMasq is unknown due to an evaluation failure: invalid PromQL result length must be one, but is 34 I1221 20:50:11.166170 1 availableupdates.go:123] Requeue available-update evaluation, because "4.13.25" is Recommended=Unknown: EvaluationFailed: Exposure to AROBrokenDNSMasq is unknown due to an evaluation failure: invalid PromQL result length must be one, but is 34 I1221 20:50:12.166314 1 availableupdates.go:123] Requeue available-update evaluation, because "4.13.25" is Recommended=Unknown: EvaluationFailed: Exposure to AROBrokenDNSMasq is unknown due to an evaluation failure: invalid PromQL result length must be one, but is 34 I1221 20:50:13.166517 1 availableupdates.go:123] Requeue available-update evaluation, because "4.13.25" is Recommended=Unknown: EvaluationFailed: Exposure to AROBrokenDNSMasq is unknown due to an evaluation failure: invalid PromQL result length must be one, but is 34 I1221 20:50:14.166847 1 availableupdates.go:123] Requeue available-update evaluation, because "4.13.25" is Recommended=Unknown: EvaluationFailed: Exposure to AROBrokenDNSMasq is unknown due to an evaluation failure: invalid PromQL result length must be one, but is 34 I1221 20:50:15.167737 1 availableupdates.go:123] Requeue available-update evaluation, because "4.13.25" is Recommended=Unknown: EvaluationFailed: Exposure to AROBrokenDNSMasq is unknown due to an evaluation failure: invalid PromQL result length must be one, but is 34 I1221 20:50:16.168486 1 availableupdates.go:123] Requeue available-update evaluation, because "4.13.25" is Recommended=Unknown: EvaluationFailed: Exposure to AROBrokenDNSMasq is unknown due to an evaluation failure: invalid PromQL result length must be one, but is 34 I1221 20:50:17.169417 1 availableupdates.go:123] Requeue available-update evaluation, because "4.13.25" is Recommended=Unknown: EvaluationFailed: Exposure to AROBrokenDNSMasq is unknown due to an evaluation failure: invalid PromQL result length must be one, but is 34 I1221 20:50:18.169576 1 availableupdates.go:123] Requeue available-update evaluation, because "4.13.25" is Recommended=Unknown: EvaluationFailed: Exposure to AROBrokenDNSMasq is unknown due to an evaluation failure: invalid PromQL result length must be one, but is 34 I1221 20:50:19.170544 1 availableupdates.go:123] Requeue available-update evaluation, because "4.13.25" is Recommended=Unknown: EvaluationFailed: Exposure to AROBrokenDNSMasq is unknown due to an evaluation failure: invalid PromQL result length must be one, but is 34 $ oc -n openshift-cluster-version logs -l k8s-app=cluster-version-operator --tail -1 --since 30m | grep 'request updates from' | tail ...no hits...
If bitten, the remediation is to address the invalid PromQL. For example, we fixed that AROBrokenDNSMasq expression in graph-data#4528. And after that the local cluster administrator should restart their CVO, such as with:
$ oc -n openshift-cluster-version delete -l k8s-app=cluster-version-operator pods
Description of problem:
Test case: [sig-instrumentation] Prometheus [apigroup:image.openshift.io] when installed on the cluster should start and expose a secured proxy and unsecured metrics [apigroup:config.openshift.io] [Skipped:Disconnected] [Suite:openshift/conformance/parallel] Example Z Job Link: https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-multiarch-master-nightly-4.16-ocp-e2e-ovn-remote-libvirt-s390x/1754481543524388864 Z must-gather Link: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-multiarch-master-nightly-4.16-ocp-e2e-ovn-remote-libvirt-s390x/1754481543524388864/artifacts/ocp-e2e-ovn-remote-libvirt-s390x/gather-libvirt/artifacts/must-gather.tar Example P Job Link: https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-multiarch-master-nightly-4.16-ocp-e2e-ovn-remote-libvirt-ppc64le/1754481543436308480 P must-gather Link: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-multiarch-master-nightly-4.16-ocp-e2e-ovn-remote-libvirt-ppc64le/1754481543436308480/artifacts/ocp-e2e-ovn-remote-libvirt-ppc64le/gather-libvirt/artifacts/must-gather.tar JSON body of error: { fail [github.com/openshift/origin/test/extended/prometheus/prometheus.go:383]: Unexpected error: <*fmt.wrapError | 0xc001d9c000>: https://thanos-querier.openshift-monitoring.svc:9091: request failed: Get "https://thanos-querier.openshift-monitoring.svc:9091": dial tcp: lookup thanos-querier.openshift-monitoring.svc on 172.30.38.188:53: no such host { msg: "https://thanos-querier.openshift-monitoring.svc:9091: request failed: Get \"https://thanos-querier.openshift-monitoring.svc:9091\": dial tcp: lookup thanos-querier.openshift-monitoring.svc on 172.30.38.188:53: no such host", err: <*url.Error | 0xc0020e02d0>{ Op: "Get", URL: "https://thanos-querier.openshift-monitoring.svc:9091", Err: <*net.OpError | 0xc000b8f770>{ Op: "dial", Net: "tcp", Source: nil, Addr: nil, Err: <*net.DNSError | 0xc0020df700>{ Err: "no such host", Name: "thanos-querier.openshift-monitoring.svc", Server: "172.30.38.188:53", IsTimeout: false, IsTemporary: false, IsNotFound: true, }, }, }, }
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. Observe Nightlies on P and/or Z
Actual results:
Test failing
Expected results:
Test passing
Additional info:
Please review the following PR: https://github.com/openshift/cluster-api-provider-libvirt/pull/274
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
This is a clone of issue OCPBUGS-39226. The following is the description of the original issue:
—
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible: Always
Repro Steps:
Add: "bridge=br0:enpf0,enpf2 ip=br0:dhcp" to dracut cmdline. Make sure either enpf0/enpf2 is the primary network of the cluster subnet.
The linux bridge can be configured to add a virtual switch between one or many ports. This can be done by a simple machine config that adds:
"bridge=br0:enpf0,enpf2 ip=br0:dhcp"
to the kernel command line options, which will be processed by dracut.
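A minimal sketch of such a machine config, assuming the worker pool and an arbitrary object name; the kernel arguments themselves are the ones quoted above:

$ cat <<'EOF' | oc apply -f -
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: 99-worker-bridge-kargs
  labels:
    machineconfiguration.openshift.io/role: worker
spec:
  kernelArguments:
  - bridge=br0:enpf0,enpf2
  - ip=br0:dhcp
EOF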
The use case of adding such a virtual bridge for simple IEEE802.1 switching is to support PCIe devices that act as co-processors in a baremetal server. For example:
 ----------          ---------------------
| Host     |  PCIe  | Co-processor        |
|     eth0 | <----> | enpf0               | <---> network
 ----------          ---------------------
This co-processor could be a "DPU" network interface card. Thus the co-processor can be part of the same underlay network as the cluster and pods can be scheduled on the Host and the Co-processor. This allows for pods to be offloaded to the co-processor for scaling workloads.
Actual results:
ovs-configuration service fails.
Expected results:
ovs-configuration service passes with the bridge interface added to the ovs bridge.
Description of problem:
Backport volumegroupsnapshot fixes to OCP 4.16; below are the PRs that need to be backported to external-snapshotter for OCP 4.16:
https://github.com/kubernetes-csi/external-snapshotter/pull/1014
https://github.com/kubernetes-csi/external-snapshotter/pull/1015
https://github.com/kubernetes-csi/external-snapshotter/pull/1011
https://github.com/kubernetes-csi/external-snapshotter/pull/1034
The flowcontrol manifests in the following operators (kas, oas, etcd, openshift controller manager, auth, and network) should use v1.
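A rough way to spot manifests still pinned to a beta API, assuming a local checkout of one of the affected operator repos (the directory names are only illustrative):

$ grep -rln 'flowcontrol.apiserver.k8s.io/v1beta' bindata/ manifests/ 2>/dev/null

Any hits should be moved to flowcontrol.apiserver.k8s.io/v1.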
Description of problem:
The YAML sidebar is occupying too much space on some pages
Version-Release number of selected component (if applicable):
4.15.0-0.nightly-2024-01-03-140457
How reproducible:
Always
Steps to Reproduce:
1. Go to Deployment/DeploymentConfig creation page 2. Choose 'YAML view' 3. (for comparison) Go to other resources YAML page, open the sidebar
Actual results:
We can see the sidebar is occupying too much screen compared with other resources YAML page
Expected results:
We should reduce the space sidebar occupies
Additional info:
Description of problem:
The SAST scans keep coming up with bogus positive results from test and vendor files. This bug is just a placeholder to allow us to backport the change to ignore those files.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Today these are in isPodLog in the JavaScript; we'd like them in their own section, preferably charted very close to the node update section.
Please review the following PR: https://github.com/openshift/cluster-api-provider-ibmcloud/pull/70
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
This is a clone of issue OCPBUGS-37222. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-35054. The following is the description of the original issue:
—
Description of problem:
Create VPC and subnets with following configs [refer to attached CF template]: Subnets (subnets-pair-default) in CIDR 10.0.0.0/16 Subnets (subnets-pair-134) in CIDR 10.134.0.0/16 Subnets (subnets-pair-190) in CIDR 10.190.0.0/16 Create cluster into subnets-pair-134, the bootstrap process fails [see attached log-bundle logs]: level=debug msg=I0605 09:52:49.548166 937 loadbalancer.go:1262] "adding attributes to load balancer" controller="awscluster" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AWSCluster" AWSCluster="openshift-cluster-api-guests/yunjiang29781a-86-rvqd9" namespace="openshift-cluster-api-guests" name="yunjiang29781a-86-rvqd9" reconcileID="a9310bd5-acc7-4b01-8a84-e47139fc0d1d" cluster="openshift-cluster-api-guests/yunjiang29781a-86-rvqd9" attrs=[{"Key":"load_balancing.cross_zone.enabled","Value":"true"}] level=debug msg=I0605 09:52:49.909861 937 awscluster_controller.go:291] "Looking up IP address for DNS" controller="awscluster" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AWSCluster" AWSCluster="openshift-cluster-api-guests/yunjiang29781a-86-rvqd9" namespace="openshift-cluster-api-guests" name="yunjiang29781a-86-rvqd9" reconcileID="a9310bd5-acc7-4b01-8a84-e47139fc0d1d" cluster="openshift-cluster-api-guests/yunjiang29781a-86-rvqd9" dns="yunjiang29781a-86-rvqd9-int-19a9485653bf29a1.elb.us-east-2.amazonaws.com" level=debug msg=I0605 09:52:53.483058 937 reflector.go:377] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers.go:105: forcing resync level=debug msg=Fetching Bootstrap SSH Key Pair... Checking security groups: <infraid>-lb allows 10.0.0.0/16:6443 and 10.0.0.0/16:22623 <infraid>-apiserver-lb allows 10.0.0.0/16:6443 and 10.134.0.0/16:22623 (and 0.0.0.0/0:6443) are these settings correct?
Version-Release number of selected component (if applicable):
4.16.0-0.nightly-2024-06-03-060250
How reproducible:
Always
Steps to Reproduce:
1. Create subnets using attached CG template 2. Create cluster into subnets which CIDR is 10.134.0.0/16 3.
Actual results:
Bootstrap process fails.
Expected results:
Bootstrap succeeds.
Additional info:
No issues if creating cluster into subnets-pair-default (10.0.0.0/16) No issues if only one CIDR in VPC, e.g. set VpcCidr to 10.134.0.0/16 in https://github.com/openshift/installer/blob/master/upi/aws/cloudformation/01_vpc.yaml
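To compare the ingress rules of the two security groups quoted in the description, something like the following can be used; this is only a sketch, with the infra ID and region taken from the log above as placeholders:

$ INFRA_ID=yunjiang29781a-86-rvqd9
$ aws ec2 describe-security-groups --region us-east-2 \
    --filters "Name=group-name,Values=${INFRA_ID}-lb,${INFRA_ID}-apiserver-lb" \
    --query 'SecurityGroups[].{Name:GroupName,Rules:IpPermissions[].{Port:FromPort,Cidrs:IpRanges[].CidrIp}}'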
Since https://github.com/openshift/installer/pull/8093 merged, CI jobs for the agent appliance have been broken. It appears that the agent-register-cluster.service is no longer getting enabled.
Description of problem:
Hosted control plane clusters of OCP 4.16 are using default catalog sources (redhat-operators, certified-operators, community-operators and redhat-marketplace) pointing to 4.14; thus 4.16 operators are not available, and this can't be updated from within the guest.
Version-Release number of selected component (if applicable):
4.16.0
How reproducible:
Always
Steps to Reproduce:
1. check the .spec.image of the default catalog sources in openshift-marketplace namespace.
Actual results:
the default catalogs are pointing to :v4.14
Expected results:
they should point to :v4.16 instead
Additional info:
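To check which images the default catalog sources currently point at from inside the guest (for inspection only; in hosted control planes these sources are reconciled from the control plane side, so patching them in the guest is not a fix):

$ oc -n openshift-marketplace get catalogsource \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.image}{"\n"}{end}'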
Please review the following PR: https://github.com/openshift/azure-file-csi-driver/pull/45
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Looking at recent CI metal-ipi CI jobs
Some of the bootstrap failures seem to be because of master nodes failing to come up
Search https://search.dptools.openshift.org/?search=Got+0+worker+nodes%2C+%5B12%5D+master+nodes%2C&maxAge=336h&context=-1&type=build-log&name=metal-ipi&excludeName=&maxMatches=1&maxBytes=20971520&groupBy=none
43 results over the last 14 days
level=error msg=ReadyIngressNodesAvailable: Authentication requires functional ingress which requires at least one schedulable and ready node. Got 0 worker nodes, 1 master nodes, 0 custom target nodes (none are schedulable or ready for ingress pods).
Background:
CCO was made optional in OCP 4.15, see https://issues.redhat.com/browse/OCPEDGE-69. CloudCredential was introduced as a new capability to openshift/api. We need to bump api at oc to include the CloudCredential capability so oc adm release extract works correctly.
Description of problem:
Some relevant CredentialsRequests are not extracted by the following command: oc adm release extract --credentials-requests --included --install-config=install-config.yaml ... where install-config.yaml looks like the following: ... capabilities: baselineCapabilitySet: None additionalEnabledCapabilities: - MachineAPI - CloudCredential platform: aws: ...
Logs:
... I1209 19:57:25.968783 79037 extract.go:418] Found manifest 0000_50_cloud-credential-operator_05-iam-ro-credentialsrequest.yaml I1209 19:57:25.968902 79037 extract.go:429] Excluding Group: "cloudcredential.openshift.io" Kind: "CredentialsRequest" Namespace: "openshift-cloud-credential-operator" Name: "cloud-credential-operator-iam-ro": unrecognized capability names: CloudCredential ...
Please review the following PR: https://github.com/openshift/machine-api-operator/pull/1190
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
BuildRun logs cannot be displayed in the console, and the following error is shown:
The buildrun is created and started using the shp cli (similar behavior is observed when the build is created & started via console/yaml too):
shp build create goapp-buildah \ --strategy-name="buildah" \ --source-url="https://github.com/shipwright-io/sample-go" \ --source-context-dir="docker-build" \ --output-image="image-registry.openshift-image-registry.svc:5000/demo/go-app"
The issue occurs on OCP 4.14.6. Investigation showed that this works correctly on OCP 4.14.5.
Description of problem:
It is noticed that ovs-monitor-ipsec fails to import cert into nss db with following error. 2024-04-17T19:57:21.140989157Z 2024-04-17T19:57:21Z | 6 | reconnect | INFO | unix:/var/run/openvswitch/db.sock: connecting... 2024-04-17T19:57:21.142234972Z 2024-04-17T19:57:21Z | 9 | reconnect | INFO | unix:/var/run/openvswitch/db.sock: connected 2024-04-17T19:57:21.170709468Z 2024-04-17T19:57:21Z | 14 | ovs-monitor-ipsec | INFO | Tunnel ovn-69b991-0 appeared in OVSDB 2024-04-17T19:57:21.171379359Z 2024-04-17T19:57:21Z | 16 | ovs-monitor-ipsec | INFO | Tunnel ovn-52bc87-0 appeared in OVSDB 2024-04-17T19:57:21.171826906Z 2024-04-17T19:57:21Z | 18 | ovs-monitor-ipsec | INFO | Tunnel ovn-3e78bb-0 appeared in OVSDB 2024-04-17T19:57:21.172300675Z 2024-04-17T19:57:21Z | 20 | ovs-monitor-ipsec | INFO | Tunnel ovn-12fb32-0 appeared in OVSDB 2024-04-17T19:57:21.172726970Z 2024-04-17T19:57:21Z | 22 | ovs-monitor-ipsec | INFO | Tunnel ovn-8a4d01-0 appeared in OVSDB 2024-04-17T19:57:21.178644919Z 2024-04-17T19:57:21Z | 24 | ovs-monitor-ipsec | ERR | Import cert and key failed. 2024-04-17T19:57:21.178644919Z b"No cert in -in file '/etc/openvswitch/keys/ipsec-cert.pem' matches private key\n80FBF36CDE7F0000:error:05800074:x509 certificate routines:X509_check_private_key:key values mismatch:crypto/x509/x509_cmp.c:405:\n" 2024-04-17T19:57:21.179581526Z 2024-04-17T19:57:21Z | 25 | ovs-monitor-ipsec | ERR | traceback 2024-04-17T19:57:21.179581526Z Traceback (most recent call last): 2024-04-17T19:57:21.179581526Z File "/usr/share/openvswitch/scripts/ovs-monitor-ipsec", line 1382, in <module> 2024-04-17T19:57:21.179581526Z main() 2024-04-17T19:57:21.179581526Z File "/usr/share/openvswitch/scripts/ovs-monitor-ipsec", line 1369, in main 2024-04-17T19:57:21.179581526Z monitor.run() 2024-04-17T19:57:21.179581526Z File "/usr/share/openvswitch/scripts/ovs-monitor-ipsec", line 1176, in run 2024-04-17T19:57:21.179581526Z if self.ike_helper.config_global(self): 2024-04-17T19:57:21.179581526Z File "/usr/share/openvswitch/scripts/ovs-monitor-ipsec", line 521, in config_global 2024-04-17T19:57:21.179581526Z self._nss_import_cert_and_key(cert, key, name) 2024-04-17T19:57:21.179581526Z File "/usr/share/openvswitch/scripts/ovs-monitor-ipsec", line 809, in _nss_import_cert_and_key 2024-04-17T19:57:21.179581526Z os.remove(path) 2024-04-17T19:57:21.179581526Z FileNotFoundError: [Errno 2] No such file or directory: '/tmp/ovs_certkey_ef9cf1a5-bfb2-4876-8fb3-69c6b22561a2.p12'
Version-Release number of selected component (if applicable):
4.16.0
How reproducible:
Hit on the CI: https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_release/50690/rehearse-50690-pull-ci-openshift-cluster-network-operator-master-e2e-ovn-ipsec-step-registry/1780660589492703232
Steps to Reproduce:
1. 2. 3.
Actual results:
openshift-install failed with error: time="2024-04-17T19:34:47Z" level=error msg="Cluster initialization failed because one or more operators are not functioning properly.\nThe cluster should be accessible for troubleshooting as detailed in the documentation linked below,\nhttps://docs.openshift.com/container-platform/latest/support/troubleshooting/troubleshooting-installations.html\nThe 'wait-for install-complete' subcommand can then be used to continue the installation" time="2024-04-17T19:34:47Z" level=error msg="failed to initialize the cluster: Multiple errors are preventing progress:\n* Cluster operator authentication is degraded\n* Cluster operators monitoring, openshift-apiserver are not available" https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/openshift_release/50690/rehearse-50690-pull-ci-openshift-cluster-network-operator-master-e2e-ovn-ipsec-step-registry/1780660589492703232/artifacts/e2e-ovn-ipsec-step-registry/ipi-install-install/artifacts/.openshift_install-1713382487.log
Expected results:
Cluster must come up COs running with IPsec enabled for EW traffic.
Additional info:
It seems like the ovn-ipsec-host pod's ovn-keys init container writes empty content into /etc/openvswitch/keys/ipsec-cert.pem even though the corresponding csr request contains a certificate in its status.
Description of the problem:
When installing a spoke cluster earlier than 4.14 with a mirror registry config, assisted does not create the required ImageContentSourcePolicy needed to pull images from a custom registry.
How reproducible:
4/4
Steps to reproduce:
1. Install 4.12 spoke cluster with ACM 2.10 using a mirror registry config
Actual results:
Spoke installation fails because master can not pull images needed to run assisted-installer-controller
Expected results:
ICSP created and installation finishes successfully
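A minimal sketch of the kind of ImageContentSourcePolicy that is expected to be created, with placeholder mirror registry and repository names; the real mirrors come from the mirror registry config supplied to the agent:

$ cat <<'EOF' | oc apply -f -
apiVersion: operator.openshift.io/v1alpha1
kind: ImageContentSourcePolicy
metadata:
  name: mirror-registry
spec:
  repositoryDigestMirrors:
  - mirrors:
    - mirror.registry.example.com:5000/ocp4/openshift4
    source: quay.io/openshift-release-dev/ocp-v4.0-art-dev
EOF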
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
We need to do a downstream sync with the upstream multus to support exposing the MTU in the network-status annotation.
The PR was merged u/s https://github.com/k8snetworkplumbingwg/multus-cni/pull/1250
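Once the sync lands, the MTU should appear per interface in the standard network-status annotation. A hedged way to inspect it on a pod with a secondary network, with pod and namespace as placeholders:

$ oc -n <namespace> get pod <pod-name> \
    -o jsonpath='{.metadata.annotations.k8s\.v1\.cni\.cncf\.io/network-status}' | jq .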
Description of problem:
Recently we bumped the hyperkube image [1] to use both RHEL 9 builder and base images. In order to keep things consistent, we tried to do the same with the "tests" image [2], however, that was not possible because there is currently no "tools" image on RHEL 9. The "tests" image uses "tools" as the base image. As a result, we decided to keep builder & base images for "tests" in RHEL 8, as this work was not required for the kube 1.28 bump nor the FIPS issue we were addressing. However, for the sake of consistency, eventually it'd be good to bump the "tests" builder image to RHEL 9. This would also require us to create a "tools" image based on RHEL 9. [1] https://github.com/openshift/kubernetes/blob/6ab54b8d9a0ea02856efd3835b6f9df5da9ce115/openshift-hack/images/hyperkube/Dockerfile.rhel#L1 [2] https://github.com/openshift/kubernetes/blob/master/openshift-hack/images/tests/Dockerfile.rhel#L1
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
"tests" image is build and based on a RHEL 8 image.
Expected results:
"tests" image is build and based on a RHEL 9 image.
Additional info:
This is a clone of issue OCPBUGS-36339. The following is the description of the original issue:
—
Description of problem:
The option "Auto deploy when new image is available" becomes unchecked when editing a deployment from web console
Version-Release number of selected component (if applicable):
4.15.17
How reproducible:
100%
Steps to Reproduce:
1. Goto Workloads --> Deployments --> Edit Deployment --> Under Images section --> Tick the option "Auto deploy when new Image is available" and now save deployment. 2. Now again edit the deployment and observe that the option "Auto deploy when new Image is available" is unchecked. 3. Same test work fine in 4.14 cluster.
Actual results:
Option "Auto deploy when new Image is available" is in unchecked state.
Expected results:
Option "Auto deploy when new Image is available" remains in checked state.
Additional info:
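To see whether the underlying setting actually survived the edit, the image trigger annotation on the Deployment (which is presumably what this checkbox controls) can be inspected before and after saving; namespace and name are placeholders:

$ oc -n <namespace> get deployment <name> \
    -o jsonpath='{.metadata.annotations.image\.openshift\.io/triggers}'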
Please review the following PR: https://github.com/openshift/machine-api-provider-nutanix/pull/58
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
This is a clone of issue OCPBUGS-33636. The following is the description of the original issue:
—
Updating the secrets using the Form editor displays an unknown warning message. This is caused by an incorrect request object being sent to the server in the edit Secret form.
Description of problem:
Version-Release number of selected component (if applicable):
4.16
How reproducible:
Always
Steps to Reproduce:
1. Go to the Edit Secret form editor 2. Click Save. The warning notification is triggered because of the incorrect request object
Actual results:
Expected results:
Additional info:
This is a clone of issue OCPBUGS-37271. The following is the description of the original issue:
—
Description of problem:
In debugging recent cyclictest issues on OCP 4.16 (5.14.0-427.22.1.el9_4.x86_64+rt kernel), we have discovered that the "psi=1" kernel cmdline argument, which is now added by default due to cgroupsv2 being enabled, is causing latency issues (both cyclictest and timerlat are failing to meet the latency KPIs we commit to for Telco RAN DU deployments). See RHEL-42737 for reference.
Version-Release number of selected component (if applicable):
OCP 4.16
How reproducible:
Cyclictest and timerlat consistently fail on long duration runs (e.g. 12 hours).
Steps to Reproduce:
1. Install OCP 4.16 and configure with the Telco RAN DU reference configuration. 2. Run a long duration cyclictest or timerlat test
Actual results:
Maximum latencies are detected above 20us.
Expected results:
All latencies are below 20us.
Additional info:
See RHEL-42737 for test results and debugging information. This was originally suspected to be an RHEL issue, but it turns out that PSI is being enabled by OpenShift code (which adds psi=1 to the kernel cmdline).
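To confirm whether PSI is requested on a node's kernel command line (a sketch; <node-name> is a placeholder):

$ oc debug node/<node-name> -- cat /proc/cmdline | tr ' ' '\n' | grep '^psi='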
Description of problem:
Install cluster with azure workload identity against 4.16 nightly build, failed as some co are degraded. $ oc get co | grep -v "True False False" NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE authentication 4.16.0-0.nightly-2024-02-07-200316 False False True 153m OAuthServerRouteEndpointAccessibleControllerAvailable: Get "https://oauth-openshift.apps.jima416a1.qe.azure.devcluster.openshift.com/healthz": dial tcp: lookup oauth-openshift.apps.jima416a1.qe.azure.devcluster.openshift.com on 172.30.0.10:53: no such host (this is likely result of malfunctioning DNS server) console 4.16.0-0.nightly-2024-02-07-200316 False True True 141m DeploymentAvailable: 0 replicas available for console deployment... ingress False True True 137m The "default" ingress controller reports Available=False: IngressControllerUnavailable: One or more status conditions indicate unavailable: LoadBalancerReady=False (LoadBalancerPending: The LoadBalancer service is pending) Ingress LB public IP is pending to be created $ oc get svc -n openshift-ingress NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE router-default LoadBalancer 172.30.199.169 <pending> 80:32007/TCP,443:30229/TCP 154m router-internal-default ClusterIP 172.30.112.167 <none> 80/TCP,443/TCP,1936/TCP 154m Detected that CCM pod is CrashLoopBackOff with error $ oc get pod -n openshift-cloud-controller-manager NAME READY STATUS RESTARTS AGE azure-cloud-controller-manager-555cf5579f-hz6gl 0/1 CrashLoopBackOff 21 (2m55s ago) 160m azure-cloud-controller-manager-555cf5579f-xv2rn 0/1 CrashLoopBackOff 21 (15s ago) 160m error in ccm pod: I0208 04:40:57.141145 1 azure.go:931] Azure cloudprovider using try backoff: retries=6, exponent=1.500000, duration=6, jitter=1.000000 I0208 04:40:57.141193 1 azure_auth.go:86] azure: using workload identity extension to retrieve access token I0208 04:40:57.141290 1 azure_diskclient.go:68] Azure DisksClient using API version: 2022-07-02 I0208 04:40:57.141380 1 azure_blobclient.go:73] Azure BlobClient using API version: 2021-09-01 F0208 04:40:57.141471 1 controllermanager.go:314] Cloud provider azure could not be initialized: could not init cloud provider azure: no token file specified. Check pod configuration or set TokenFilePath in the options
Version-Release number of selected component (if applicable):
4.16 nightly build
How reproducible:
Always
Steps to Reproduce:
1. Install cluster with azure workload identity 2. 3.
Actual results:
Installation failed due to some operators are degraded
Expected results:
Installation is successful.
Additional info:
Changes to oc idle needed to support the elimination of this carry - https://github.com/openshift/kubernetes/commit/bd2d0db195d?w=1
More details here - https://redhat-internal.slack.com/archives/C065R4NCLGM/p1701252429658919
Description of problem:
See https://github.com/openshift/console/pull/14030/files/0eba7f7db6c35bbf7bca5e0b8eebd578e47b15cc#r1707020700
Description of problem:
In OCPBUGS-237 we discussed disabling memory-trim-on-compaction once it was enabled by default.
On OCP with 4.15.0-0.nightly-2023-12-19-033450 we are at
Red Hat Enterprise Linux CoreOS 415.92.202312132107-0 (Plow) openvswitch3.1-3.1.0-59.el9fdp.x86_64
This should have memory-trim-on-compaction enabled by default
v3.0.0 - 15 Aug 2022
--------------------
* Returning unused memory to the OS after the database compaction is now enabled by default. Use 'ovsdb-server/memory-trim-on-compaction off' unixctl command to disable.
https://github.com/openvswitch/ovs/commit/e773140ec3f6d296e4a3877d709fb26fb51bc6ee
If we are enabled by default, we should remove the enable loop.
# set trim-on-compaction
if ! retry 60 "trim-on-compaction" "ovn-appctl -t ${nbdb_ctl} --timeout=5 ovsdb-server/memory-trim-on-compaction on"; then
  exit 1
fi
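To verify on a running cluster that trimming is already on without the explicit call, the nbdb log line quoted in the actual results below can be checked; the pod label and container name assume the OVN interconnect layout:

$ POD=$(oc -n openshift-ovn-kubernetes get pods -l app=ovnkube-node -o name | head -n1)
$ oc -n openshift-ovn-kubernetes logs "$POD" -c nbdb | grep 'memory trimming after compaction'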
Version-Release number of selected component (if applicable):
4.15.0-0.nightly-2023-12-19-033450
How reproducible:
Always
Steps to Reproduce:
1. check if memory-trim-on-compaction is enabled by default in OVS
2. check nbdb log files for the memory trimming message
Actual results:
2023-12-20T18:12:47.053444489Z 2023-12-20T18:12:47.053Z|00002|ovsdb_server|INFO|ovsdb-server (Open vSwitch) 3.1.2 2023-12-20T18:12:49.001580092Z 2023-12-20T18:12:49.001Z|00003|ovsdb_server|INFO|memory trimming after compaction enabled.
Expected results:
memory-trim-on-compaction should be enabled by default, we don't need to re-enable it.
Affected Platforms:
All
Description of problem:
Pod capi-ibmcloud-controller-manager stuck in ContainerCreating on IBM cloud
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. Built a cluster on ibm cloud and enable TechPreviewNoUpgrade 2. 3.
Actual results:
4.16 cluster $ oc get po NAME READY STATUS RESTARTS AGE capi-controller-manager-6bccdc844-jsm4s 1/1 Running 9 (24m ago) 175m capi-ibmcloud-controller-manager-75d55bfd7d-6qfxh 0/2 ContainerCreating 0 175m cluster-capi-operator-768c6bd965-5tjl5 1/1 Running 0 3h Warning FailedMount 5m15s (x87 over 166m) kubelet MountVolume.SetUp failed for volume "credentials" : secret "capi-ibmcloud-manager-bootstrap-credentials" not found $ oc get clusterversion NAME VERSION AVAILABLE PROGRESSING SINCE STATUS version 4.16.0-0.nightly-2024-01-21-154905 True False 156m Cluster version is 4.16.0-0.nightly-2024-01-21-154905 4.15 cluster $ oc get po NAME READY STATUS RESTARTS AGE capi-controller-manager-6b67f7cff4-vxtpg 1/1 Running 6 (9m51s ago) 35m capi-ibmcloud-controller-manager-54887589c6-6plt2 0/2 ContainerCreating 0 35m cluster-capi-operator-7b7f48d898-9r6nn 1/1 Running 1 (17m ago) 39m $ oc get clusterversion NAME VERSION AVAILABLE PROGRESSING SINCE STATUS version 4.15.0-0.nightly-2024-01-22-160236 True False 11m Cluster version is 4.15.0-0.nightly-2024-01-22-160236
Expected results:
No pod is in ContainerCreating status
Additional info:
must-gather: https://drive.google.com/file/d/1F5xUVtW-vGizAYgeys0V5MMjp03zkSEH/view?usp=sharing
Description of problem:
InstallPlan fails with "updated validation is too restrictive" when:
* Previous CRs and CRDs exist, and
* Multiple CRD versions are served (ex. v1alpha1 and v1alpha2)
Version-Release number of selected component (if applicable):
This is reproducible on the OpenShift 4.15.3 rosa cluster, and not reproducible on 4.14.15 or 4.13.
How reproducible:
Always
Steps to Reproduce:
1.Create the following catalogsource and subscription apiVersion: operators.coreos.com/v1alpha1 kind: CatalogSource metadata: name: devworkspace-operator-catalog namespace: openshift-marketplace spec: sourceType: grpc image: quay.io/devfile/devworkspace-operator-index:release publisher: Red Hat displayName: DevWorkspace Operator Catalog updateStrategy: registryPoll: interval: 5m --- apiVersion: operators.coreos.com/v1alpha1 kind: Subscription metadata: namespace: openshift-operators name: devworkspace-operator spec: channel: fast installPlanApproval: Manual name: devworkspace-operator source: devworkspace-operator-catalog sourceNamespace: openshift-marketplace 2. Approve the installplan 3. Create a CR instance (DevWorkspace CR): $ curl https://raw.githubusercontent.com/devfile/devworkspace-operator/main/samples/empty.yaml | kubectl apply -f - 4. Delete the subscription and csv $ oc project openshift-operators $ oc delete sub devworkspace-operator $ oc get csv $ oc delete csv devworkspace-operator.v0.26.0 5. Create the subscription from step 1 again, and approve the installplan 6. View the "updated validation is too restrictive" error in the installplan's status.conditions: --- error validating existing CRs against new CRD's schema for "devworkspaces.workspace.devfile.io": error validating workspace.devfile.io/v1alpha1, Kind=DevWorkspace "openshift-operators/empty-devworkspace": updated validation is too restrictive: [].status.workspaceId: Required value ---
Actual results:
InstallPlan fails and the operator is not installed
Expected results:
InstallPlan succeeds
Additional info:
For this specific scenario, a workaround is to temporarily un-serve the v1alpha1 version before approving the installplan: $ oc patch crd devworkspacetemplates.workspace.devfile.io --type='json' -p='[{"op": "replace", "path": "/spec/versions/0/served", "value": false}]' $ oc patch crd devworkspaces.workspace.devfile.io --type='json' -p='[{"op": "replace", "path": "/spec/versions/0/served", "value": false}]' Another workaround is to delete the existing CR before approving the new installplan.
This is a clone of issue OCPBUGS-25929. The following is the description of the original issue:
—
In the Quick Start guided tour, the user needs to click the "next" button twice to move to the next step. If the user skips the alert (Yes/No input) and clicks the "next" button, nothing happens.
The next button does not respond to the first click.
The next button should navigate to the next step whether or not the user has answered the alert message.
Please review the following PR: https://github.com/openshift/etcd/pull/236
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
The number of control plane replicas defined in install-config.yaml (or agent-cluster-install.yaml) should be validated to check that it is set to 3, or 1 in the case of SNO. If it is set to another value, the "create image" command should fail.
We recently had a case where the number of replicas was set to 2 and the installation failed. It would be good to catch this misconfiguration prior to the install.
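For illustration, a minimal sketch of the install-config.yaml fragment that should be validated (the base domain, cluster name and other values are placeholders; replicas: 2 is the misconfiguration that should be rejected):
apiVersion: v1
baseDomain: example.com
metadata:
  name: mycluster
controlPlane:
  name: master
  replicas: 2   # invalid: only 3 (or 1 for SNO) should be accepted by "create image"
compute:
- name: worker
  replicas: 2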
Description of the problem:
Per the latest decision, Red Hat is not going to support installing OCP clusters on Nutanix with nested virtualization. Thus, the "Install OpenShift Virtualization" checkbox on the "Operators" page should be disabled when the "Nutanix" platform is selected on the "Cluster Details" page.
Slack discussion thread
https://redhat-internal.slack.com/archives/C0211848DBN/p1706640683120159
Nutanix
https://portal.nutanix.com/page/documents/kbs/details?targetId=kA00e000000XeiHCAS
Tracker issue for bootimage bump in 4.16. This issue should block issues which need a bootimage bump to fix.
The previous bump was OCPBUGS-33531.
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
When using SecureBoot tuned reports the following error as debugfs access is restricted:
tuned.utils.commands: Writing to file '/sys/kernel/debug/sched/migration_cost_ns' error: '[Errno 1] Operation not permitted: '/sys/kernel/debug/sched/migration_cost_ns''
tuned.plugins.plugin_scheduler: Error writing value '5000000' to 'migration_cost_ns'
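For reference, a quick way to confirm that Secure Boot is what is restricting debugfs on an affected node (a sketch; the node and tuned pod names are placeholders, and mokutil is assumed to be available in the host image):
$ oc debug node/<node-name> -- chroot /host mokutil --sb-state
$ oc -n openshift-cluster-node-tuning-operator logs <tuned-pod-on-that-node> | grep migration_cost_ns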
This issue has been reported with the following tickets:
As this is a confirmed limitation of the NTO due to the TuneD component, we should document this as a limitation in the OpenShift Docs:
https://docs.openshift.com/container-platform/4.16/nodes/nodes/nodes-node-tuning-operator.html
Expected Outcome:
Description of problem:
If a cluster is installed using a proxy and the username used for connecting to the proxy contains the characters "%40" (encoding a "@" when a domain is provided), the installation fails. The failure is because the proxy variables implemented in the file "/etc/systemd/system.conf.d/10-default-env.conf" on the bootstrap node are ignored by systemd. This issue seems to have already been fixed in MCO (BZ 1882674 - fixed in RHOCP 4.7), but it looks like it is affecting the bootstrap process in 4.13 and 4.14, causing the installation to not start at all.
Version-Release number of selected component (if applicable):
4.14, 4.13
How reproducible:
100% always
Steps to Reproduce:
1. Create an install-config.yaml file with "%40" in the middle of the username used for the proxy.
2. Start the cluster installation.
3. Bootstrap will fail because the proxy variables are not used.
Actual results:
Installation fails because systemd fails to load the proxy variables if "%" is present in the username.
Expected results:
Installation to succeed using a username with "%40" for the proxy.
Additional info:
File "/etc/systemd/system.conf.d/10-default-env.conf" for the bootstrap should be generated in a way accepted by systemd.
Description of problem:
Following https://issues.redhat.com/browse/CNV-28040: in CNV, when a virtual machine with secondary interfaces connected with the bridge CNI is live migrated, we observe disruption of the VM's inbound traffic. The root cause is that the migration target's bridge interface advertises before the migration is completed. When the migration destination pod is created, an IPv6 NS (Neighbor Solicitation) and NA (Neighbor Advertisement) are sent automatically by the kernel. The tables of the switches at the endpoints (e.g. the migration destination node) get updated and the traffic is forwarded to the migration destination before the migration is completed [1]. The solution is to have the bridge CNI create the pod interface in "link-down" state [2], so the IPv6 NS/NA packets are avoided; CNV, in turn, sets the pod interface to "link-up" [3]. CNV depends on a bridge CNI with the [2] bits, which is deployed by the cluster-network-operator.
[1] https://bugzilla.redhat.com/show_bug.cgi?id=2186372#c6
[2] https://github.com/kubevirt/kubevirt/pull/11069
[3] https://github.com/containernetworking/plugins/pull/997
Version-Release number of selected component (if applicable):
4.16.0
How reproducible:
100%
Steps to Reproduce:
1. 2. 3.
Actual results:
CNO deploys CNI bridge w/o an option to set the bridge interface down.
Expected results:
CNO to deploy bridge CNI with [1] changes, from release-4.16 branch. [1] https://github.com/containernetworking/plugins/pull/997
Additional info:
More https://issues.redhat.com/browse/CNV-28040
Description of problem:
Files like ovs-if-br-ex.nmconnection.J1K8B2 break ovs-configuration.service. Deleting the file fixes the issue.
Version-Release number of selected component (if applicable):
4.12
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
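A sketch of the manual cleanup described above (illustrative only; the node name is a placeholder, and the stray file is assumed to live under /etc/NetworkManager/system-connections/ - it may be under /run/NetworkManager/system-connections/ instead):
$ oc debug node/<node-name>
sh-5.1# chroot /host
sh-5.1# ls /etc/NetworkManager/system-connections/ovs-if-br-ex.nmconnection*
sh-5.1# rm /etc/NetworkManager/system-connections/ovs-if-br-ex.nmconnection.J1K8B2
sh-5.1# systemctl restart ovs-configuration.service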
Description of problem:
During live OVN migration, the network operator shows the error message: Not applying unsafe configuration change: invalid configuration: [cannot change default network type when not doing migration]. Use 'oc edit network.operator.openshift.io cluster' to undo the change.
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. Create a 4.15 nightly SDN ROSA cluster
2. oc delete validatingwebhookconfigurations.admissionregistration.k8s.io/sre-techpreviewnoupgrade-validation
3. oc edit featuregate cluster to enable featuregates
4. Wait for all nodes to reboot and come back to normal
5. oc patch Network.config.openshift.io cluster --type='merge' --patch '{"metadata":{"annotations":{"network.openshift.io/network-type-migration":""}},"spec":{"networkType":"OVNKubernetes"}}'
Actual results:
[weliang@weliang ~]$ oc delete validatingwebhookconfigurations.admissionregistration.k8s.io/sre-techpreviewnoupgrade-validation
[weliang@weliang ~]$ oc edit featuregate cluster
[weliang@weliang ~]$ oc patch Network.config.openshift.io cluster --type='merge' --patch '{"metadata":{"annotations":{"network.openshift.io/network-type-migration":""}},"spec":{"networkType":"OVNKubernetes"}}'
network.config.openshift.io/cluster patched
[weliang@weliang ~]$
[weliang@weliang ~]$ oc get co network
NAME      VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
network   4.15.0-0.nightly-2023-12-18-220750   True        False         True       105m    Not applying unsafe configuration change: invalid configuration: [cannot change default network type when not doing migration]. Use 'oc edit network.operator.openshift.io cluster' to undo the change.
[weliang@weliang ~]$ oc describe Network.config.openshift.io cluster
Name:         cluster
Namespace:
Labels:       <none>
Annotations:  network.openshift.io/network-type-migration:
API Version:  config.openshift.io/v1
Kind:         Network
Metadata:
  Creation Timestamp:  2023-12-20T15:13:39Z
  Generation:          3
  Resource Version:    119899
  UID:                 6a621b88-ac4f-4918-a7f6-98dba7df222c
Spec:
  Cluster Network:
    Cidr:         10.128.0.0/14
    Host Prefix:  23
  External IP:
    Policy:
  Network Type:     OVNKubernetes
  Service Network:
    172.30.0.0/16
Status:
  Cluster Network:
    Cidr:         10.128.0.0/14
    Host Prefix:  23
  Cluster Network MTU:  8951
  Network Type:         OpenShiftSDN
  Service Network:
    172.30.0.0/16
Events:  <none>
[weliang@weliang ~]$ oc describe Network.operator.openshift.io cluster
Name:         cluster
Namespace:
Labels:       <none>
Annotations:  <none>
API Version:  operator.openshift.io/v1
Kind:         Network
Metadata:
  Creation Timestamp:  2023-12-20T15:15:37Z
  Generation:          275
  Resource Version:    120026
  UID:                 278bd491-ac88-4038-887f-d1defc450740
Spec:
  Cluster Network:
    Cidr:         10.128.0.0/14
    Host Prefix:  23
  Default Network:
    Openshift SDN Config:
      Enable Unidling:  true
      Mode:             NetworkPolicy
      Mtu:              8951
      Vxlan Port:       4789
    Type:               OVNKubernetes
  Deploy Kube Proxy:            false
  Disable Multi Network:        false
  Disable Network Diagnostics:  false
  Kube Proxy Config:
    Bind Address:      0.0.0.0
  Log Level:           Normal
  Management State:    Managed
  Observed Config:     <nil>
  Operator Log Level:  Normal
  Service Network:
    172.30.0.0/16
  Unsupported Config Overrides:  <nil>
  Use Multi Network Policy:      false
Status:
  Conditions:
    Last Transition Time:  2023-12-20T15:15:37Z
    Status:                False
    Type:                  ManagementStateDegraded
    Last Transition Time:  2023-12-20T16:58:58Z
    Message:               Not applying unsafe configuration change: invalid configuration: [cannot change default network type when not doing migration]. Use 'oc edit network.operator.openshift.io cluster' to undo the change.
    Reason:                InvalidOperatorConfig
    Status:                True
    Type:                  Degraded
    Last Transition Time:  2023-12-20T15:15:37Z
    Status:                True
    Type:                  Upgradeable
    Last Transition Time:  2023-12-20T16:52:11Z
    Status:                False
    Type:                  Progressing
    Last Transition Time:  2023-12-20T15:15:45Z
    Status:                True
    Type:                  Available
  Ready Replicas:          0
  Version:                 4.15.0-0.nightly-2023-12-18-220750
Events:  <none>
[weliang@weliang ~]$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.15.0-0.nightly-2023-12-18-220750   True        False         84m     Error while reconciling 4.15.0-0.nightly-2023-12-18-220750: the cluster operator network is degraded
[weliang@weliang ~]$
Expected results:
Migration success
Additional info:
Got the same error message on ROSA and GCP clusters.
CI is flaky because the TestHostNetworkPort test fails:
=== NAME TestAll/serial/TestHostNetworkPortBinding operator_test.go:1034: Expected conditions: map[Admitted:True Available:True DNSManaged:False DeploymentReplicasAllAvailable:True LoadBalancerManaged:False] Current conditions: map[Admitted:True Available:True DNSManaged:False Degraded:False DeploymentAvailable:True DeploymentReplicasAllAvailable:False DeploymentReplicasMinAvailable:True DeploymentRollingOut:True EvaluationConditionsDetected:False LoadBalancerManaged:False LoadBalancerProgressing:False Progressing:True Upgradeable:True] operator_test.go:1034: Ingress Controller openshift-ingress-operator/samehost status: { "availableReplicas": 0, "selector": "ingresscontroller.operator.openshift.io/deployment-ingresscontroller=samehost", "domain": "samehost.ci-op-xlwngvym-43abb.origin-ci-int-aws.dev.rhcloud.com", "endpointPublishingStrategy": { "type": "HostNetwork", "hostNetwork": { "protocol": "TCP", "httpPort": 9080, "httpsPort": 9443, "statsPort": 9936 } }, "conditions": [ { "type": "Admitted", "status": "True", "lastTransitionTime": "2024-02-26T17:25:39Z", "reason": "Valid" }, { "type": "DeploymentAvailable", "status": "True", "lastTransitionTime": "2024-02-26T17:25:39Z", "reason": "DeploymentAvailable", "message": "The deployment has Available status condition set to True" }, { "type": "DeploymentReplicasMinAvailable", "status": "True", "lastTransitionTime": "2024-02-26T17:25:39Z", "reason": "DeploymentMinimumReplicasMet", "message": "Minimum replicas requirement is met" }, { "type": "DeploymentReplicasAllAvailable", "status": "False", "lastTransitionTime": "2024-02-26T17:25:39Z", "reason": "DeploymentReplicasNotAvailable", "message": "0/1 of replicas are available" }, { "type": "DeploymentRollingOut", "status": "True", "lastTransitionTime": "2024-02-26T17:25:39Z", "reason": "DeploymentRollingOut", "message": "Waiting for router deployment rollout to finish: 0 of 1 updated replica(s) are available...\n" }, { "type": "LoadBalancerManaged", "status": "False", "lastTransitionTime": "2024-02-26T17:25:39Z", "reason": "EndpointPublishingStrategyExcludesManagedLoadBalancer", "message": "The configured endpoint publishing strategy does not include a managed load balancer" }, { "type": "LoadBalancerProgressing", "status": "False", "lastTransitionTime": "2024-02-26T17:25:39Z", "reason": "LoadBalancerNotProgressing", "message": "LoadBalancer is not progressing" }, { "type": "DNSManaged", "status": "False", "lastTransitionTime": "2024-02-26T17:25:39Z", "reason": "UnsupportedEndpointPublishingStrategy", "message": "The endpoint publishing strategy doesn't support DNS management." }, { "type": "Available", "status": "True", "lastTransitionTime": "2024-02-26T17:25:39Z" }, { "type": "Progressing", "status": "True", "lastTransitionTime": "2024-02-26T17:25:39Z", "reason": "IngressControllerProgressing", "message": "One or more status conditions indicate progressing: DeploymentRollingOut=True (DeploymentRollingOut: Waiting for router deployment rollout to finish: 0 of 1 updated replica(s) are available...\n)" }, { "type": "Degraded", "status": "False", "lastTransitionTime": "2024-02-26T17:25:39Z" }, { "type": "Upgradeable", "status": "True", "lastTransitionTime": "2024-02-26T17:25:39Z", "reason": "Upgradeable", "message": "IngressController is upgradeable." }, { "type": "EvaluationConditionsDetected", "status": "False", "lastTransitionTime": "2024-02-26T17:25:39Z", "reason": "NoEvaluationCondition", "message": "No evaluation condition is detected." 
} ], "tlsProfile": { "ciphers": [ "ECDHE-ECDSA-AES128-GCM-SHA256", "ECDHE-RSA-AES128-GCM-SHA256", "ECDHE-ECDSA-AES256-GCM-SHA384", "ECDHE-RSA-AES256-GCM-SHA384", "ECDHE-ECDSA-CHACHA20-POLY1305", "ECDHE-RSA-CHACHA20-POLY1305", "DHE-RSA-AES128-GCM-SHA256", "DHE-RSA-AES256-GCM-SHA384", "TLS_AES_128_GCM_SHA256", "TLS_AES_256_GCM_SHA384", "TLS_CHACHA20_POLY1305_SHA256" ], "minTLSVersion": "VersionTLS12" }, "observedGeneration": 1 } operator_test.go:1036: failed to observe expected conditions for the second ingresscontroller: timed out waiting for the condition operator_test.go:1059: deleted ingresscontroller samehost operator_test.go:1059: deleted ingresscontroller hostnetworkportbinding
This particular failure comes from https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_cluster-ingress-operator/1017/pull-ci-openshift-cluster-ingress-operator-master-e2e-aws-operator/1762147882179235840. Search.ci shows another failure: https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_release/48873/rehearse-48873-pull-ci-openshift-cluster-ingress-operator-master-e2e-aws-gatewayapi/1762576595890999296. The test has failed sporadically in the past, beyond what search.ci is able to search.
TestHostNetworkPort is marked as a serial test in TestAll and marked with t.Parallel() in the test itself. It is not clear whether this is what is causing the new failure seen in this test, but something is incorrect.
The test failures have been observed recently on 4.16 as well as on 4.12 (https://github.com/openshift/cluster-ingress-operator/pull/828#issuecomment-1292888086) and 4.11 (https://github.com/openshift/cluster-ingress-operator/pull/914#issuecomment-1526808286). The logic error was introduced in 4.11 (https://github.com/openshift/cluster-ingress-operator/pull/756/commits/a22322b25569059c61e1973f37f0a4b49e9407bc).
The logic error is self-evident. The test failure is very rare. The failure has been observed sporadically over the past couple years. Presently, search.ci shows two failures, with the following impact, for the past 14 days:
rehearse-48873-pull-ci-openshift-cluster-ingress-operator-master-e2e-aws-gatewayapi (all) - 3 runs, 33% failed, 100% of failures match = 33% impact
pull-ci-openshift-cluster-ingress-operator-master-e2e-aws-operator (all) - 16 runs, 25% failed, 25% of failures match = 6% impact
N/A.
The TestHostNetworkPort test fails. The test is marked as both serial and parallel.
Test should be marked as either serial or parallel, and it should pass consistently.
When TestAll was introduced, TestHostNetworkPortBinding was initially marked parallel in https://github.com/openshift/cluster-ingress-operator/pull/756/commits/a22322b25569059c61e1973f37f0a4b49e9407bc. After some discussion, it was moved to the serial list in https://github.com/openshift/cluster-ingress-operator/pull/756/commits/a449e497e35fafeecbee9ea656e0631393182f70, but the commit to remove t.Parallel() evidently got inadvertently dropped.
Description of problem:
OCPBUGS-29424 revealed that setting the node status update frequency in kubelet (introduced with OCPBUGS-15583) causes a lot of control plane CPU usage. The reason is that the increased frequency of kubelet node status updates triggers second-order effects in all control plane operators that usually trigger on node changes (API server, etcd, PDB guard pod controllers, or any other static pod based machinery). Reverting the code in OCPBUGS-15583, or manually setting the report/status frequency to 0s, causes the CPU usage to drop immediately.
Version-Release number of selected component (if applicable):
Versions where OCPBUGS-15583 was backported. This includes 4.16, 4.15.0, 4.14.8, 4.13.33, and the next 4.12.z, likely 4.12.51.
How reproducible:
always
Steps to Reproduce:
1. Create a cluster that contains a fix for OCPBUGS-15583.
2. Observe the apiserver metrics (e.g. rate(apiserver_request_total[5m])); those should show abnormal values for pod/configmap GET. Alternatively, the rate of node updates is increased (rate(apiserver_request_total{resource="nodes", subresource="status", verb="PATCH"}[1m])).
Actual results:
the node status updates every 10s, which causes high CPU usage on control plane operators and apiserver
Expected results:
the node status should not update that frequently, meaning the control plane CPU usage should go down again
Additional info:
slack thread with the node team: https://redhat-internal.slack.com/archives/C02CZNQHGN8/p1708429189987849
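For illustration, the manual mitigation mentioned above (setting the report/status frequency to 0s) could be applied to the worker pool with a KubeletConfig; a minimal sketch, assuming the standard worker pool label and that the upstream nodeStatusReportFrequency field is the one being tuned:
apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: node-status-report-frequency
spec:
  machineConfigPoolSelector:
    matchLabels:
      pools.operator.machineconfiguration.openshift.io/worker: ""
  kubeletConfig:
    nodeStatusReportFrequency: "0s"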
Description of problem:
Check scripts for the on-premise keepalived static pods only check the haproxy, which only directs to kube-apiserver pod. They do not take into consideration whether the control plane node has a healthy machine-config-server. This may be a problem because, in a failure scenario, it may be required to rebuild nodes and machine-config-server is required for that (so that ignitions are provided). One example is the etcd restore procedure (https://docs.openshift.com/container-platform/4.12/backup_and_restore/control_plane_backup_and_restore/disaster_recovery/scenario-2-restoring-cluster-state.html).
In our case, the following happened (I'd suggest reading the recovery procedure before this sequence of events):
- Machine config server was healthy in the recovery control plane node but not in the other hosts.
- At this point, we can only guarantee the health of the recovery control plane node because the non-recovery ones are to be replaced and must be removed first from the cluster (node objects deleted) so that OVN-Kubernetes control plane can work properly.
- The keepalived check scripts were succeeding in the non-recovery control plane nodes because their haproxy pods were up and running. That is fine from kube-apiserver point of view, actually, but does not take machine config server into consideration.
- As the machine-config-server was not reachable, provision of the new masters required by the procedure was impossible.
In parallel to this bug, I'll be raising another bug to improve the restore procedure. Basically, asking to stop the keepalived static pods on the non-recovery control plane nodes. This would prevent the exact situation above. However, there are other situations where machine-config-server pods may be unhealthy and we should not just be manually stopping keepalived. In such cases, keepalived should take machine-config-server into consideration.
Version-Release number of selected component (if applicable):
4.12
How reproducible:
Under some failure scenarios, where machine-config-server is not healthy in one control plane node.
Steps to Reproduce:
1. Try to provision new machine for recovery. 2. 3.
Actual results:
Machine-config-server not serving because keepalived assigned the VIP to one node that doesn't have a working machine-config-server pod.
Expected results:
Keepalived to take machine-config-server health into consideration while doing failover.
Additional info:
Possible ideas to fix:
- Create a check script for the machine-config-server check. It may have less weight than the kube-apiserver ones.
- Include the machine-config-server endpoint in the haproxy of the kube-apiservers.
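A minimal sketch of what the check script from the first idea could look like (purely illustrative; it assumes the machine-config-server listens locally on port 22623 and treats any HTTPS response as healthy, and the actual weight and endpoint would need to be agreed on):
#!/bin/bash
# Illustrative only: fail when the local machine-config-server does not answer at all.
curl --silent --insecure --max-time 5 --output /dev/null https://localhost:22623/ || exit 1
exit 0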
Please review the following PR: https://github.com/openshift/csi-external-provisioner/pull/84
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
This is a clone of issue OCPBUGS-43051. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-42783. The following is the description of the original issue:
—
Context
Some ROSA HCP users host their own container registries (e.g., self-hosted Quay servers) that are only accessible from inside of their VPCs. This is often achieved through the use of private DNS zones that resolve non-public domains like quay.mycompany.intranet to non-public IP addresses. The private registries at those addresses then present self-signed SSL certificates to the client that can be validated against the HCP's additional CA trust bundle.
Problem Description
A user of a ROSA HCP cluster with a configuration like the one described above is encountering errors when attempting to import a container image from their private registry into their HCP's internal registry via oc import-image. Originally, these errors showed up in openshift-apiserver logs as DNS resolution errors, i.e., OCPBUGS-36944. After the user upgraded their cluster to 4.14.37 (which fixes OCPBUGS-36944), openshift-apiserver was able to properly resolve the domain name but complains of HTTP 502 Bad Gateway errors. We suspect these 502 Bad Gateway errors are coming from the Konnectivity-agent while it proxies traffic between the control and data planes.
We've confirmed that the private registry is accessible from the HCP data plane (worker nodes) and that the certificate presented by the registry can be validated against the cluster's additional trust bundle. IOW, curl-ing the private registry from a worker node returns a HTTP 200 OK, but doing the same from a control plane node returns a HTTP 502. Notably, this cluster is not configured with a cluster-wide proxy, nor does the user's VPC feature a transparent proxy.
Version-Release number of selected component
OCP v4.14.37
How reproducible
Can be reliably reproduced, although the network config (see Context above) is quite specific
Steps to Reproduce
oc import-image imagegroup/imagename:v1.2.3 --from=quay.mycompany.intranet/imagegroup/imagename:v1.2.3 --confirm
Actual Results
error: tag v1.2.3 failed: Internal error occurred: quay.mycompany.intranet/imagegroup/imagename:v1.2.3: Get "https://quay.mycompany.intranet/v2/": Bad Gateway imagestream.image.openshift.io/imagename imported with errors Name: imagename Namespace: mynamespace Created: Less than a second ago Labels: <none> Annotations: openshift.io/image.dockerRepositoryCheck=2024-10-01T12:46:02Z Image Repository: default-route-openshift-image-registry.apps.rosa.clustername.abcd.p1.openshiftapps.com/mynamespace/imagename Image Lookup: local=false Unique Images: 0 Tags: 1 v1.2.3 tagged from quay.mycompany.intranet/imagegroup/imagename:v1.2.3 ! error: Import failed (InternalError): Internal error occurred: quay.mycompany.intranet/imagegroup/imagename:v1.2.3: Get "https://quay.mycompany.intranet/v2/": Bad Gateway Less than a second ago error: imported completed with errors
Expected Results
Desired container image is imported from private external image registry into cluster's internal image registry without error
Description of problem:
No detail failure on signature verification while failing to validate signature of the target release payload during upgrade. It's unclear for user to know which action could be taken for the failure. For example, checking if any wrong configmap set, or default store is not available or any issue on custom store? # ./oc adm upgrade Cluster version is 4.15.0-0.nightly-2023-12-08-202155 Upgradeable=False Reason: FeatureGates_RestrictedFeatureGates_TechPreviewNoUpgrade Message: Cluster operator config-operator should not be upgraded between minor versions: FeatureGatesUpgradeable: "TechPreviewNoUpgrade" does not allow updates ReleaseAccepted=False Reason: RetrievePayload Message: Retrieving payload failed version="4.15.0-0.nightly-2023-12-09-012410" image="registry.ci.openshift.org/ocp/release@sha256:0bc9978f420a152a171429086853e80f033e012e694f9a762eee777f5a7fb4f7" failure=The update cannot be verified: unable to verify sha256:0bc9978f420a152a171429086853e80f033e012e694f9a762eee777f5a7fb4f7 against keyrings: verifier-public-key-redhat Upstream: https://amd64.ocp.releases.ci.openshift.org/graph Channel: stable-4.15 Recommended updates: VERSION IMAGE 4.15.0-0.nightly-2023-12-09-012410 registry.ci.openshift.org/ocp/release@sha256:0bc9978f420a152a171429086853e80f033e012e694f9a762eee777f5a7fb4f7 # ./oc -n openshift-cluster-version logs cluster-version-operator-6b7b5ff598-vxjrq|grep "verified"|tail -n4 I1211 09:28:22.755834 1 sync_worker.go:434] loadUpdatedPayload syncPayload err=The update cannot be verified: unable to verify sha256:0bc9978f420a152a171429086853e80f033e012e694f9a762eee777f5a7fb4f7 against keyrings: verifier-public-key-redhat I1211 09:28:22.755974 1 event.go:298] Event(v1.ObjectReference{Kind:"ClusterVersion", Namespace:"openshift-cluster-version", Name:"version", UID:"", APIVersion:"config.openshift.io/v1", ResourceVersion:"", FieldPath:""}): type: 'Warning' reason: 'RetrievePayloadFailed' Retrieving payload failed version="4.15.0-0.nightly-2023-12-09-012410" image="registry.ci.openshift.org/ocp/release@sha256:0bc9978f420a152a171429086853e80f033e012e694f9a762eee777f5a7fb4f7" failure=The update cannot be verified: unable to verify sha256:0bc9978f420a152a171429086853e80f033e012e694f9a762eee777f5a7fb4f7 against keyrings: verifier-public-key-redhat I1211 09:28:37.817102 1 sync_worker.go:434] loadUpdatedPayload syncPayload err=The update cannot be verified: unable to verify sha256:0bc9978f420a152a171429086853e80f033e012e694f9a762eee777f5a7fb4f7 against keyrings: verifier-public-key-redhat I1211 09:28:37.817488 1 event.go:298] Event(v1.ObjectReference{Kind:"ClusterVersion", Namespace:"openshift-cluster-version", Name:"version", UID:"", APIVersion:"config.openshift.io/v1", ResourceVersion:"", FieldPath:""}): type: 'Warning' reason: 'RetrievePayloadFailed' Retrieving payload failed version="4.15.0-0.nightly-2023-12-09-012410" image="registry.ci.openshift.org/ocp/release@sha256:0bc9978f420a152a171429086853e80f033e012e694f9a762eee777f5a7fb4f7" failure=The update cannot be verified: unable to verify sha256:0bc9978f420a152a171429086853e80f033e012e694f9a762eee777f5a7fb4f7 against keyrings: verifier-public-key-redhat
Version-Release number of selected component (if applicable):
4.15.0-0.nightly-2023-12-08-202155
How reproducible:
always
Steps to Reproduce:
1. trigger an fresh installation with tp enabled(no spec.signaturestores property set by default) 2.trigger an upgrade against a nightly build(no signature available in default signature store) 3.
Actual results:
no detail log on signature verification failure
Expected results:
include detail failure on signature verification in the cvo log
Additional info:
https://github.com/openshift/cluster-version-operator/pull/1003
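For context, under the TechPreview API mentioned above a custom signature store can be configured on the ClusterVersion resource; a sketch only (the URL is a placeholder, and the exact field shape should be verified against the spec.signatureStores schema of the installed CRD):
$ oc patch clusterversion version --type=merge -p '{"spec":{"signatureStores":[{"url":"https://mirror.example.com/signatures"}]}}'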
Description of the problem:
When the function ops.GetEncapsulatedMC takes too long, the host-stage 'Writing Image to Disk' may time out
This may be caused by timed out connection to API VIP to get the ignition
How reproducible:
When connection to bootstrap API VIP times out
Steps to reproduce:
Only artificially
Actual results:
When the problem happens, the 'Writing image to disk' host stage times out.
Expected results:
If such a problem happens, the host stage shouldn't time out.
This is a clone of issue OCPBUGS-42605. The following is the description of the original issue:
—
Description of problem:
We are in a live migration scenario.
If a project has a networkpolicy that allows traffic from the host network (more concretely, that allows traffic from the ingress controllers, and the ingress controllers are on the host network), traffic doesn't work during the live migration between any ingress controller node (either migrated or not yet migrated) and an already migrated application node.
I'll expand later in the description and internal comments, but the TL;DR is that the IPs of the tun0 of not-yet-migrated source nodes and the IPs of the ovn-k8s-mp0 of migrated source nodes are not added to the address sets related to the networkpolicy ACL in the target OVN-Kubernetes node, so the traffic is not allowed.
Version-Release number of selected component (if applicable):
4.16.13
How reproducible:
Always
Steps to Reproduce:
1. Before the migration: have a project with a networkpolicy that allows from the ingress controller and the ingress controller in the host network. Everything must work properly at this point.
2. Start the migration
3. During the migration, check connectivity from the host network of either a migrated node or a non-migrated node. Both will fail (checking from the same node doesn't fail)
Actual results:
The pod on the worker node is not reachable from the host network of the ingress controller node (unless the pod is on the same node as the ingress controller), which causes the ingress controller routes to throw 503 errors.
Expected results:
Pod on the worker node to be reachable from the ingress controller node, even when the ingress controller node has not migrated yet and the application node has.
Additional info:
This is not a duplicate of OCPBUGS-42578. This bug refers to the host-to-pod communication path while the other one doesn't.
This is a customer issue. More details to be included in private comments for privacy.
Workaround: Creating a networkpolicy that explicitly allows traffic from tun0 and ovn-k8s-mp0 interfaces. However, note that the workaround can be problematic for clusters with hundreds or thousands of projects. Another possible workaround is to temporarily delete all the networkpolicies of the projects. But again, this may be problematic (and a security risk).
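A sketch of the per-namespace workaround mentioned above (illustrative only; the ipBlock CIDR is a placeholder that would have to cover the actual tun0 / ovn-k8s-mp0 addresses of the cluster's nodes, e.g. the cluster network range):
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-from-node-gateway-ips
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  ingress:
  - from:
    - ipBlock:
        cidr: 10.128.0.0/14   # placeholder: range containing the nodes' tun0/ovn-k8s-mp0 IPs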
Description of problem:
The node-network-identity deployment should conform to hypershift control plane expectations that all applicable containers should have a liveness probe, and a readiness probe if it is an endpoint for a service.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
No liveness or readiness probes
Expected results:
Additional info:
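For illustration, a minimal sketch of the kind of probes expected on the container (the container name, path, port and timings are placeholders; the actual endpoint exposed by node-network-identity would need to be used):
containers:
- name: webhook               # placeholder container name
  livenessProbe:
    httpGet:
      path: /healthz          # placeholder path
      port: 8443              # placeholder port
      scheme: HTTPS
    initialDelaySeconds: 10
    periodSeconds: 10
  readinessProbe:
    httpGet:
      path: /healthz          # placeholder path
      port: 8443              # placeholder port
      scheme: HTTPS
    periodSeconds: 10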
Please review the following PR: https://github.com/openshift/ironic-rhcos-downloader/pull/95
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Component Readiness has found a potential regression in [sig-arch][Late] operators should not create watch channels very often [apigroup:apiserver.openshift.io] [Suite:openshift/conformance/parallel].
Probability of significant regression: 98.46%
Sample (being evaluated) Release: 4.15
Start Time: 2023-12-29T00:00:00Z
End Time: 2024-01-04T23:59:59Z
Success Rate: 83.33%
Successes: 15
Failures: 3
Flakes: 0
Base (historical) Release: 4.14
Start Time: 2023-10-04T00:00:00Z
End Time: 2023-10-31T23:59:59Z
Success Rate: 98.36%
Successes: 120
Failures: 2
Flakes: 0
Description of problem:
[inner hamzy@li-3d08e84c-2e1c-11b2-a85c-e2db7bb078fc hamzy-release]$ oc get co/image-registry
NAME             VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
image-registry             False       True          True       50m     Available: The deployment does not exist...
[inner hamzy@li-3d08e84c-2e1c-11b2-a85c-e2db7bb078fc hamzy-release]$ oc describe co/image-registry
...
Message: Progressing: Unable to apply resources: unable to sync storage configuration: cos region corresponding to a powervs region wdc not found
...
Version-Release number of selected component (if applicable):
4.15.0-0.nightly-ppc64le-2024-01-10-083055
How reproducible:
Always
Steps to Reproduce:
1. Deploy a PowerVS cluster in wdc06 zone
Actual results:
See above error message
Expected results:
Cluster deploys
Description of the problem:
The event showing how often a host has been rebooted is now shown only for some of the nodes.
11/22/2023, 12:15:10 AM Node test-infra-cluster-57eb6989-master-1 has been rebooted 2 times before completing installation
11/22/2023, 12:00:01 AM Host: test-infra-cluster-57eb6989-master-1, reached installation stage Rebooting
11/21/2023, 11:53:14 PM Host: test-infra-cluster-57eb6989-worker-0, reached installation stage Rebooting
11/21/2023, 11:53:13 PM Host: test-infra-cluster-57eb6989-worker-1, reached installation stage Rebooting
11/21/2023, 11:34:56 PM Host: test-infra-cluster-57eb6989-master-0, reached installation stage Rebooting
11/21/2023, 11:34:26 PM Host: test-infra-cluster-57eb6989-master-2, reached installation stage Rebooting
in this cluster 4 events are missing
11/21/2023, 3:49:34 PM Node test-infra-cluster-164a0f73-master-0 has been rebooted 2 times before completing installation
11/21/2023, 3:49:32 PM Node test-infra-cluster-164a0f73-worker-0 has been rebooted 2 times before completing installation
11/21/2023, 3:49:32 PM Node test-infra-cluster-164a0f73-worker-1 has been rebooted 2 times before completing installation
11/21/2023, 3:37:15 PM Host: test-infra-cluster-164a0f73-master-0, reached installation stage Rebooting
11/21/2023, 3:27:34 PM Host: test-infra-cluster-164a0f73-worker-0, reached installation stage Rebooting
11/21/2023, 3:27:30 PM Host: test-infra-cluster-164a0f73-worker-1, reached installation stage Rebooting
11/21/2023, 3:09:40 PM Host: test-infra-cluster-164a0f73-master-2, reached installation stage Rebooting
11/21/2023, 3:09:35 PM Host: test-infra-cluster-164a0f73-master-1, reached installation stage Rebooting
in this cluster 2 events are missing
How reproducible:
Steps to reproduce:
1. create cluster
2. start installation
3.
Actual results:
some events indicating how often a host has been rebooted are missing
Expected results:
for each host there should be an indication event
Description of problem:
- As per the official doc, this feature is configurable.
https://docs.openshift.com/container-platform/4.16/storage/container_storage_interface/persistent-storage-csi-vsphere.html#vsphere-change-max-snapshot_persistent-storage-csi-vsphere
Before Patch: $ ./oc -n openshift-cluster-csi-drivers get cm/vsphere-csi-config -o yaml apiVersion: v1 data: cloud.conf: | # Labels with topology values are added dynamically via operator [Global] cluster-id = ci-ln-1pd7szb-c1627-cbtd8 [VirtualCenter "vcenter-1.ci.ibmc.devcluster.openshift.com"] insecure-flag = true datacenters = cidatacenter-2 migration-datastore-url = ds:///vmfs/volumes/vsan:52eb63e99ce26f5b-b5ba4b2484169430/ kind: ConfigMap metadata: creationTimestamp: "2024-07-12T04:54:31Z" name: vsphere-csi-config namespace: openshift-cluster-csi-drivers resourceVersion: "8172" uid: b1a4cf21-8416-4dc2-a3b5-2abe887dbe4f Patch command: $ ./oc patch clustercsidriver/csi.vsphere.vmware.com --type=merge -p '{"spec":{"driverConfig":{"vSphere":{"globalMaxSnapshotsPerBlockVolume": 10}}}}' Warning: unknown field "spec.driverConfig.vSphere.globalMaxSnapshotsPerBlockVolume" clustercsidriver.operator.openshift.io/csi.vsphere.vmware.com patched $ ./oc -n openshift-cluster-csi-drivers get cm/vsphere-csi-config -o yaml apiVersion: v1 data: cloud.conf: | # Labels with topology values are added dynamically via operator [Global] cluster-id = ci-ln-1pd7szb-c1627-cbtd8 [VirtualCenter "vcenter-1.ci.ibmc.devcluster.openshift.com"] insecure-flag = true datacenters = cidatacenter-2 migration-datastore-url = ds:///vmfs/volumes/vsan:52eb63e99ce26f5b-b5ba4b2484169430/ kind: ConfigMap metadata: creationTimestamp: "2024-07-12T04:54:31Z" name: vsphere-csi-config namespace: openshift-cluster-csi-drivers resourceVersion: "8172" uid: b1a4cf21-8416-4dc2-a3b5-2abe887dbe4f $ ./oc get clusterversion NAME VERSION AVAILABLE PROGRESSING SINCE STATUS version 4.16.2 True False 11m Cluster version is 4.16.2
Also, can you confirm whether this is a TP or GA feature?
As per the https://github.com/openshift/enhancements/blob/master/enhancements/storage/vsphere-driver-configuration.md?plain=1#L154
I see it is still in the enhancement and marked as TP:
~~~
----------- | -------------- |
4.16 | Tech Preview |
4.17 | GA |
~~~
https://redhat-internal.slack.com/archives/CBQHQFU0N/p1720768379160209
This is a clone of issue OCPBUGS-35905. The following is the description of the original issue:
—
Description of problem:
The builds installed in hosted clusters have issues git-cloning repositories from external URLs whose CAs are configured in the ca-bundle.crt from the trustedCA section:
spec:
  configuration:
    apiServer:
    [...]
    proxy:
      trustedCA:
        name: user-ca-bundle <---
In traditional OCP implementations, the *-global-ca configmap is installed in the same namespace as the build and the ca-bundle.crt is injected into this configmap. In hosted clusters the configmap is being created empty:
$ oc get cm -n <app-namespace> <build-name>-global-ca -oyaml
apiVersion: v1
data:
  ca-bundle.crt: ""
As mentioned, the user-ca-bundle has the certificates configured:
$ oc get cm -n openshift-config user-ca-bundle -oyaml
apiVersion: v1
data:
  ca-bundle.crt: |
    -----BEGIN CERTIFICATE----- <---
Version-Release number of selected component (if applicable):
How reproducible:
Easily
Steps to Reproduce:
1. Install hosted cluster with trustedCA configmap 2. Run a build in the hosted cluster 3. Check the global-ca configmap
Actual results:
global-ca is empty
Expected results:
global-ca injects the ca-bundle.crt properly
Additional info:
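A possible manual workaround sketch while the bug is present (not a fix, and it may need to be reapplied because the *-global-ca configmap is recreated per build): copy the CA bundle from user-ca-bundle into the build's empty global-ca configmap, reusing the namespaces and names from the examples above.
$ oc -n openshift-config extract configmap/user-ca-bundle --keys=ca-bundle.crt --to=/tmp
$ oc -n <app-namespace> create configmap <build-name>-global-ca --from-file=ca-bundle.crt=/tmp/ca-bundle.crt --dry-run=client -o yaml | oc replace -f -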
Description of problem:
1. [sig-network][Feature:EgressFirewall] egressFirewall should have no impact outside its namespace [Suite:openshift/conformance/parallel]
2. [sig-network][Feature:EgressFirewall] when using openshift ovn-kubernetes should ensure egressfirewall is created [Suite:openshift/conformance/parallel]
The issue arises during the execution of the above tests and appears to be related to the image in use, specifically the image located at https://quay.io/repository/redhat-developer/nfs-server?tab=tags&tag=1.1 (quay.io/redhat-developer/nfs-server:1.1). This image does not include the 'ping' executable for the s390x architecture, leading to the following error in the prow job logs:
... msg: "Error running /usr/bin/oc --namespace=e2e-test-no-egress-firewall-e2e-6mg9v --kubeconfig=/tmp/configfile3768380277 exec dummy -- ping -c 1 8.8.8.8:\nStdOut>\ntime=\"2023-10-11T19:04:52Z\" level=error msg=\"exec failed: unable to start container process: exec: \\\"ping\\\": executable file not found in $PATH\"\ncommand terminated with exit code 255\nStdErr>\ntime=\"2023-10-11T19:04:52Z\" level=error msg=\"exec failed: unable to start container process: exec: \\\"ping\\\": executable file not found in $PATH\"\ncommand terminated with exit code 255\nexit status 255\n" ...
Our suggested fix: build a new s390x image that contains the ping binary.
Version-Release number of selected component (if applicable):
How reproducible:
The issue is reproducible when the test container (quay.io/redhat-developer/nfs-server:1.1) is scheduled on an s390x node, leading to test failures.
Steps to Reproduce:
1. Have a multi-arch cluster (x86 + s390x day2 worker node attached)
2. Execute the two tests
3. Try a few times to get the pod assigned to the s390x node
Actual results from prow job:
Run #0: Failed expand_less30s{ fail [github.com/openshift/origin/test/extended/networking/egress_firewall.go:70]: Unexpected error: <*fmt.wrapError | 0xc005924300>: Error running /usr/bin/oc --namespace=e2e-test-no-egress-firewall-e2e-6r9zh --kubeconfig=/tmp/configfile3961753222 exec dummy -- ping -c 1 8.8.8.8: StdOut> time="2023-10-12T07:17:02Z" level=error msg="exec failed: unable to start container process: exec: \"ping\": executable file not found in $PATH" command terminated with exit code 255 StdErr> time="2023-10-12T07:17:02Z" level=error msg="exec failed: unable to start container process: exec: \"ping\": executable file not found in $PATH" command terminated with exit code 255 exit status 255 { msg: "Error running /usr/bin/oc --namespace=e2e-test-no-egress-firewall-e2e-6r9zh --kubeconfig=/tmp/configfile3961753222 exec dummy -- ping -c 1 8.8.8.8:\nStdOut>\ntime=\"2023-10-12T07:17:02Z\" level=error msg=\"exec failed: unable to start container process: exec: \\\"ping\\\": executable file not found in $PATH\"\ncommand terminated with exit code 255\nStdErr>\ntime=\"2023-10-12T07:17:02Z\" level=error msg=\"exec failed: unable to start container process: exec: \\\"ping\\\": executable file not found in $PATH\"\ncommand terminated with exit code 255\nexit status 255\n", err: <*exec.ExitError | 0xc0059242e0>{ ProcessState: { pid: 78611, status: 65280, rusage: { Utime: {Sec: 0, Usec: 168910}, Stime: {Sec: 0, Usec: 60897}, Maxrss: 206428, Ixrss: 0, Idrss: 0, Isrss: 0, Minflt: 4199, Majflt: 0, Nswap: 0, Inblock: 0, Oublock: 0, Msgsnd: 0, Msgrcv: 0, Nsignals: 0, Nvcsw: 753, Nivcsw: 149, }, }, Stderr: nil, }, } occurred Ginkgo exit error 1: exit with code 1}
Expected results:
Passed
Additional info:
This issue pertains to a specific bug on the s390x architecture and additionally impacts the libvirt-s390x prow job.
Please review the following PR: https://github.com/openshift/images/pull/157
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Important: ART has recorded in their product data that bugs for
this component should be opened against Jira project "OCPBUGS" and
component "Networking / Router". This project or component does not exist. Jira
should either be updated to include this component or @release-artists should be
notified of the proper mapping in the #forum-ocp-art Slack channel.
Component name: openshift-enterprise-egress-router-container .
Jira mapping: https://github.com/openshift-eng/ocp-build-data/blob/main/product.yml
Please review the following PR: https://github.com/openshift/ovirt-csi-driver/pull/132
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
CI is flaky and causing issues for the in-cluster team.
Integration tests need a clean-up and to be made more robust.
Bump Golang to v1.21 in main and hack/tools go.mod's.
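A sketch of the bump (assuming go.mod files at the repository root and under hack/tools):
$ go mod edit -go=1.21 && go mod tidy
$ (cd hack/tools && go mod edit -go=1.21 && go mod tidy)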
Please review the following PR: https://github.com/openshift/cloud-provider-azure/pull/101
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
This is a clone of issue OCPBUGS-33897. The following is the description of the original issue:
—
=Control Plane Upgrade= ... Completion: 45% (Est Time Remaining: 35m) ^^^^^^^^^^^^^^^^^^^^^^^^^
Do not worry too much about the precision, we can make this more precise in the future. I am thinking of
1. Assigning a fixed amount of time per CO remaining for COs that do not have daemonsets
2. Assign an amount of time proportional to # of workers to each remaining CO that has daemonsets (network, dns)
3. Assign a special amount of time proportional to # of workers to MCO
We can probably take into account the "how long are we upgrading this operator right now" exposed by CVO in OTA-1160
In a 4.16.0-ec.1 cluster, scaling up a MachineSet with publicIP:true fails with:
$ oc -n openshift-machine-api get -o json machines.machine.openshift.io | jq -r '.items[] | select(.status.phase == "Failed") | .status.providerStatus.conditions[].message' | sort | uniq -c 1 googleapi: Error 403: Required 'compute.subnetworks.useExternalIp' permission for 'projects/openshift-gce-devel-ci-2/regions/us-central1/subnetworks/ci-ln-q4d8y8t-72292-msmgw-worker-subnet', forbidden
Seen in 4.16.0-ec.1. Not noticed in 4.15.0-ec.3. Fix likely needs a backport to 4.15 to catch up with OCPBUGS-26406.
Seen in the wild in a cluster after updating from 4.15.0-ec.3 to 4.16.0-ec.1. Reproduced in Cluster Bot on the first attempt, so likely very reproducible.
launch 4.16.0-ec.1 gcp Cluster Bot cluster (logs).
$ oc adm upgrade
Cluster version is 4.16.0-ec.1
Upstream: https://api.integration.openshift.com/api/upgrades_info/graph
Channel: candidate-4.16 (available channels: candidate-4.16)
No updates available. You may still upgrade to a specific release image with --to-image or wait for new updates to be available.
$ oc -n openshift-machine-api get machinesets
NAME                                 DESIRED   CURRENT   READY   AVAILABLE   AGE
ci-ln-q4d8y8t-72292-msmgw-worker-a   1         1         1       1           60m
ci-ln-q4d8y8t-72292-msmgw-worker-b   1         1         1       1           60m
ci-ln-q4d8y8t-72292-msmgw-worker-c   1         1         1       1           60m
ci-ln-q4d8y8t-72292-msmgw-worker-f   0         0                             60m
$ oc -n openshift-machine-api get -o json machinesets | jq -c '.items[].spec.template.spec.providerSpec.value.networkInterfaces' | sort | uniq -c
      4 [{"network":"ci-ln-q4d8y8t-72292-msmgw-network","subnetwork":"ci-ln-q4d8y8t-72292-msmgw-worker-subnet"}]
$ oc -n openshift-machine-api edit machineset ci-ln-q4d8y8t-72292-msmgw-worker-f # add publicIP
$ oc -n openshift-machine-api get -o json machineset ci-ln-q4d8y8t-72292-msmgw-worker-f | jq -c '.spec.template.spec.providerSpec.value.networkInterfaces'
[{"network":"ci-ln-q4d8y8t-72292-msmgw-network","publicIP":true,"subnetwork":"ci-ln-q4d8y8t-72292-msmgw-worker-subnet"}]
$ oc -n openshift-machine-api scale --replicas 1 machineset ci-ln-q4d8y8t-72292-msmgw-worker-f
$ sleep 300
$ oc -n openshift-machine-api get -o json machines.machine.openshift.io | jq -r '.items[] | select(.status.phase == "Failed") | .status.providerStatus.conditions[].message' | sort | uniq -c
1 googleapi: Error 403: Required 'compute.subnetworks.useExternalIp' permission for 'projects/openshift-gce-devel-ci-2/regions/us-central1/subnetworks/ci-ln-q4d8y8t-72292-msmgw-worker-subnet', forbidden
Successfully created machines.
I would expect the CredentialsRequest to ask for this permission, but it doesn't seem to. The old roles/compute.admin includes it, and it probably just needs to be added explicitly. Not clear how many other permissions might also need explicit listing.
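For illustration only, explicitly listing the missing permission in the machine-api GCP CredentialsRequest would presumably look roughly like this fragment (field names per the GCPProviderSpec; the surrounding entry is a placeholder and the full required set is exactly what this bug needs to determine):
spec:
  providerSpec:
    apiVersion: cloudcredential.openshift.io/v1
    kind: GCPProviderSpec
    permissions:
    - compute.instances.create            # placeholder for already-granted entries
    - compute.subnetworks.useExternalIp   # the permission reported as missing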
Description of problem:
When you delete a cluster, or just a BMH, before the installation starts (before Assisted Service takes control), the metal3 operator tries to generate a PreprovisioningImage.
In previous versions, a fix was added so that, during some early installation phases, the creation of the PreprovisioningImage was not invoked:
it was based on the "StateDeleting" status.
Recently, a new status, "StatePoweringOffBeforeDelete", was added:
but this status is not covered by the previous fix, and the image should not be created during this new phase either.
The problem with trying to create the PreprovisioningImage when it should not be created is that it causes problems in ZTP, where the BMH and all the other objects are deleted at the same time, and the operator cannot create the image because the namespace is being deleted.
Version-Release number of selected component (if applicable):
4.14
How reproducible:
Steps to Reproduce:
1. Create a cluster
2. Wait until the provisioning phase
3. Delete the cluster
4. The metal3-operator wrongly tries to create the PreprovisioningImage.
Actual results:
Expected results:
Additional info:
This is a clone of issue OCPBUGS-39286. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-39285. The following is the description of the original issue:
—
Description of problem: https://github.com/openshift/installer/pull/7727 changed the order of some playbooks, and we're now expected to run the network.yaml playbook before the metadata.json file is created. This isn't a problem with newer versions of ansible, which will happily ignore missing var_files; however, it is a problem with older ansible versions, which fail with:
[cloud-user@installer-host ~]$ ansible-playbook -i "/home/cloud-user/ostest/inventory.yaml" "/home/cloud-user/ostest/network.yaml"
PLAY [localhost] *****************************************************************************************************************************************************************************************************************************
ERROR! vars file metadata.json was not found
Could not find file on the Ansible Controller. If you are using a module and expect the file to exist on the remote, see the remote_src option
Description of problem:
We should be checking whether `currentVersion` and `desiredVersion` are empty.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
On an IPv6-primary dualstack cluster, creating an IPv6 egressIP following this procedure:
is not working. ovnkube-cluster-manager shows the error below:
2024-01-16T14:48:18.156140746Z I0116 14:48:18.156053 1 obj_retry.go:358] Adding new object: *v1.EgressIP egress-dualstack-ipv6 2024-01-16T14:48:18.161367817Z I0116 14:48:18.161269 1 obj_retry.go:370] Retry add failed for *v1.EgressIP egress-dualstack-ipv6, will try again later: cloud add request failed for CloudPrivateIPConfig: fd2e.6f44.5dd8.c956.f816.3eff.fef0.3333, err: CloudPrivateIPConfig.cloud.network.openshift.io "fd2e.6f44.5dd8.c956.f816.3eff.fef0.3333" is invalid: [<nil>: Invalid value: "": "metadata.name" must validate at least one schema (anyOf), metadata.name: Invalid value: "fd2e.6f44.5dd8.c956.f816.3eff.fef0.3333": metadata.name in body must be of type ipv4: "fd2e.6f44.5dd8.c956.f816.3eff.fef0.3333"] 2024-01-16T14:48:18.161416023Z I0116 14:48:18.161357 1 event.go:298] Event(v1.ObjectReference{Kind:"EgressIP", Namespace:"", Name:"egress-dualstack-ipv6", UID:"", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Warning' reason: 'CloudAssignmentFailed' egress IP: fd2e:6f44:5dd8:c956:f816:3eff:fef0:3333 for object EgressIP: egress-dualstack-ipv6 could not be created, err: CloudPrivateIPConfig.cloud.network.openshift.io "fd2e.6f44.5dd8.c956.f816.3eff.fef0.3333" is invalid: [<nil>: Invalid value: "": "metadata.name" must validate at least one schema (anyOf), metadata.name: Invalid value: "fd2e.6f44.5dd8.c956.f816.3eff.fef0.3333": metadata.name in body must be of type ipv4: "fd2e.6f44.5dd8.c956.f816.3eff.fef0.3333"] 2024-01-16T14:49:37.714410622Z I0116 14:49:37.714342 1 reflector.go:790] k8s.io/client-go/informers/factory.go:159: Watch close - *v1.Service total 8 items received 2024-01-16T14:49:48.155826915Z I0116 14:49:48.155330 1 obj_retry.go:296] Retry object setup: *v1.EgressIP egress-dualstack-ipv6 2024-01-16T14:49:48.156172766Z I0116 14:49:48.155899 1 obj_retry.go:358] Adding new object: *v1.EgressIP egress-dualstack-ipv6 2024-01-16T14:49:48.168795734Z I0116 14:49:48.168520 1 obj_retry.go:370] Retry add failed for *v1.EgressIP egress-dualstack-ipv6, will try again later: cloud add request failed for CloudPrivateIPConfig: fd2e.6f44.5dd8.c956.f816.3eff.fef0.3333, err: CloudPrivateIPConfig.cloud.network.openshift.io "fd2e.6f44.5dd8.c956.f816.3eff.fef0.3333" is invalid: [<nil>: Invalid value: "": "metadata.name" must validate at least one schema (anyOf), metadata.name: Invalid value: "fd2e.6f44.5dd8.c956.f816.3eff.fef0.3333": metadata.name in body must be of type ipv4: "fd2e.6f44.5dd8.c956.f816.3eff.fef0.3333"] 2024-01-16T14:49:48.169400971Z I0116 14:49:48.168937 1 event.go:298] Event(v1.ObjectReference{Kind:"EgressIP", Namespace:"", Name:"egress-dualstack-ipv6", UID:"", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Warning' reason: 'CloudAssignmentFailed' egress IP: fd2e:6f44:5dd8:c956:f816:3eff:fef0:3333 for object EgressIP: egress-dualstack-ipv6 could not be created, err: CloudPrivateIPConfig.cloud.network.openshift.io "fd2e.6f44.5dd8.c956.f816.3eff.fef0.3333" is invalid: [<nil>: Invalid value: "": "metadata.name" must validate at least one schema (anyOf), metadata.name: Invalid value: "fd2e.6f44.5dd8.c956.f816.3eff.fef0.3333": metadata.name in body must be of type ipv4: "fd2e.6f44.5dd8.c956.f816.3eff.fef0.3333"]
Same is observed with ipv6 subnet on slaac mode.
Version-Release number of selected component (if applicable):
How reproducible: Always.
Steps to Reproduce:
Applying below:
$ oc label node/ostest-8zrlf-worker-0-4h78l k8s.ovn.org/egress-assignable=""
$ cat egressip_ipv4.yaml && cat egressip_ipv6.yaml
apiVersion: k8s.ovn.org/v1
kind: EgressIP
metadata:
  name: egress-dualstack-ipv4
spec:
  egressIPs:
  - 192.168.192.111
  namespaceSelector:
    matchLabels:
      app: egress
apiVersion: k8s.ovn.org/v1
kind: EgressIP
metadata:
  name: egress-dualstack-ipv6
spec:
  egressIPs:
  - fd2e:6f44:5dd8:c956:f816:3eff:fef0:3333
  namespaceSelector:
    matchLabels:
      app: egress
$ oc apply -f egressip_ipv4.yaml
$ oc apply -f egressip_ipv6.yaml
But it only shows info about the IPv4 egressIP. The IPv6 port is not even created in OpenStack:
oc logs -n openshift-cloud-network-config-controller cloud-network-config-controller-67cbc4bc84-786jm
I0116 13:15:48.914323 1 controller.go:182] Assigning key: 192.168.192.111 to cloud-private-ip-config workqueue
I0116 13:15:48.928927 1 cloudprivateipconfig_controller.go:357] CloudPrivateIPConfig: "192.168.192.111" will be added to node: "ostest-8zrlf-worker-0-4h78l"
I0116 13:15:48.942260 1 cloudprivateipconfig_controller.go:381] Adding finalizer to CloudPrivateIPConfig: "192.168.192.111"
I0116 13:15:48.943718 1 controller.go:182] Assigning key: 192.168.192.111 to cloud-private-ip-config workqueue
I0116 13:15:49.758484 1 openstack.go:760] Getting port lock for portID 8854b2e9-3139-49d2-82dd-ee576b0a0cce and IP 192.168.192.111
I0116 13:15:50.547268 1 cloudprivateipconfig_controller.go:439] Added IP address to node: "ostest-8zrlf-worker-0-4h78l" for CloudPrivateIPConfig: "192.168.192.111"
I0116 13:15:50.602277 1 controller.go:160] Dropping key '192.168.192.111' from the cloud-private-ip-config workqueue
I0116 13:15:50.614413 1 controller.go:160] Dropping key '192.168.192.111' from the cloud-private-ip-config workqueue
$ openstack port list --network network-dualstack | grep -e 192.168.192.111 -e 6f44:5dd8:c956:f816:3eff:fef0:3333
| 30fe8d9a-c1c6-46c3-a873-9a02e1943cb7 | egressip-192.168.192.111 | fa:16:3e:3c:23:2a | ip_address='192.168.192.111', subnet_id='ae8a4c1f-d3e4-4ea2-bc14-ef1f6f5d0bbe' | DOWN |
Actual results: ipv6 egressIP object is ignored.
Expected results: ipv6 egressIP is created and can be attached to a pod.
Additional info: must-gather linked in private comment.
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
This is a clone of issue OCPBUGS-31446. The following is the description of the original issue:
—
Description of problem:
imagesStreams on hosted-clusters pointing to image on private registries are failing due to tls verification although the registry is correctly trusted. example: $ oc create namespace e2e-test $ oc --namespace=e2e-test tag virthost.ostest.test.metalkube.org:5000/localimages/local-test-image:e2e-7-registry-k8s-io-e2e-test-images-busybox-1-29-4-4zE9mRvED4RQoUxQ busybox:latest $ oc --namespace=e2e-test set image-lookup busybox stirabos@t14s:~$ oc get imagestream -n e2e-test NAME IMAGE REPOSITORY TAGS UPDATED busybox image-registry.openshift-image-registry.svc:5000/e2e-test/busybox latest stirabos@t14s:~$ oc get imagestream -n e2e-test busybox -o yaml apiVersion: image.openshift.io/v1 kind: ImageStream metadata: annotations: openshift.io/image.dockerRepositoryCheck: "2024-03-27T12:43:56Z" creationTimestamp: "2024-03-27T12:43:56Z" generation: 3 name: busybox namespace: e2e-test resourceVersion: "49021" uid: 847281e7-e307-4057-ab57-ccb7bfc49327 spec: lookupPolicy: local: true tags: - annotations: null from: kind: DockerImage name: virthost.ostest.test.metalkube.org:5000/localimages/local-test-image:e2e-7-registry-k8s-io-e2e-test-images-busybox-1-29-4-4zE9mRvED4RQoUxQ generation: 2 importPolicy: importMode: Legacy name: latest referencePolicy: type: Source status: dockerImageRepository: image-registry.openshift-image-registry.svc:5000/e2e-test/busybox tags: - conditions: - generation: 2 lastTransitionTime: "2024-03-27T12:43:56Z" message: 'Internal error occurred: virthost.ostest.test.metalkube.org:5000/localimages/local-test-image:e2e-7-registry-k8s-io-e2e-test-images-busybox-1-29-4-4zE9mRvED4RQoUxQ: Get "https://virthost.ostest.test.metalkube.org:5000/v2/": tls: failed to verify certificate: x509: certificate signed by unknown authority' reason: InternalError status: "False" type: ImportSuccess items: null tag: latest While image virthost.ostest.test.metalkube.org:5000/localimages/local-test-image:e2e-7-registry-k8s-io-e2e-test-images-busybox-1-29-4-4zE9mRvED4RQoUxQ can be properly consumed if directly used for a container on a pod on the same cluster. 
user-ca-bundle config map is properly propagated from hypershift: $ oc get configmap -n openshift-config user-ca-bundle NAME DATA AGE user-ca-bundle 1 3h32m $ openssl x509 -text -noout -in <(oc get cm -n openshift-config user-ca-bundle -o json | jq -r '.data["ca-bundle.crt"]') Certificate: Data: Version: 3 (0x2) Serial Number: 11:3f:15:23:97:ac:c2:d5:f6:54:06:1a:9a:22:f2:b5:bf:0c:5a:00 Signature Algorithm: sha256WithRSAEncryption Issuer: C = US, ST = NC, L = Raleigh, O = Test Company, OU = Testing, CN = test.metalkube.org Validity Not Before: Mar 27 08:28:07 2024 GMT Not After : Mar 27 08:28:07 2025 GMT Subject: C = US, ST = NC, L = Raleigh, O = Test Company, OU = Testing, CN = test.metalkube.org Subject Public Key Info: Public Key Algorithm: rsaEncryption Public-Key: (2048 bit) Modulus: 00:c1:49:1f:18:d2:12:49:da:76:05:36:3e:6b:1a: 82:a7:22:0d:be:f5:66:dc:97:44:c7:ca:31:4d:f3: 7f:0a:d3:de:df:f2:b6:23:f9:09:b1:7a:3f:19:cc: 22:c9:70:90:30:a7:eb:49:28:b6:d1:e0:5a:14:42: 02:93:c4:ac:cc:da:b1:5a:8f:9c:af:60:19:1a:e3: b1:34:c2:b6:2f:78:ec:9f:fe:38:75:91:0f:a6:09: 78:28:36:9e:ab:1c:0d:22:74:d5:52:fe:0a:fc:db: 5a:7c:30:9d:84:7d:f7:6a:46:fe:c5:6f:50:86:98: cc:35:1f:6c:b0:e6:21:fc:a5:87:da:81:2c:7b:e4: 4e:20:bb:35:cc:6c:81:db:b3:95:51:cf:ff:9f:ed: 00:78:28:1d:cd:41:1d:03:45:26:45:d4:36:98:bd: bf:5c:78:0f:c7:23:5c:44:5d:a6:ae:85:2b:99:25: ae:c0:73:b1:d2:87:64:3e:15:31:8e:63:dc:be:5c: ed:e3:fe:97:29:10:fb:5c:43:2f:3a:c2:e4:1a:af: 80:18:55:bc:40:0f:12:26:6b:f9:41:da:e2:a4:6b: fd:66:ae:bc:9c:e8:2a:5a:3b:e7:2b:fc:a6:f6:e2: 73:9b:79:ee:0c:86:97:ab:2e:cc:47:e7:1b:e5:be: 0c:9f Exponent: 65537 (0x10001) X509v3 extensions: X509v3 Basic Constraints: CA:TRUE, pathlen:0 X509v3 Subject Alternative Name: DNS:virthost.ostest.test.metalkube.org Signature Algorithm: sha256WithRSAEncryption Signature Value: 58:d2:da:f9:2a:c0:2d:7a:d9:9f:1f:97:e1:fd:36:a7:32:d3: ab:3f:15:cd:68:8e:be:7c:11:ec:5e:45:50:c4:ec:d8:d3:c5: 22:3c:79:5a:01:63:9e:5a:bd:02:0c:87:69:c6:ff:a2:38:05: 21:e4:96:78:40:db:52:c8:08:44:9a:96:6a:70:1e:1e:ae:74: e2:2d:fa:76:86:4d:06:b1:cf:d5:5c:94:40:17:5d:9f:84:2c: 8b:65:ca:48:2b:2d:00:3b:42:b9:3c:08:1b:c5:5d:d2:9c:e9: bc:df:9a:7c:db:30:07:be:33:2a:bb:2d:69:72:b8:dc:f4:0e: 62:08:49:93:d5:0f:db:35:98:18:df:e6:87:11:ce:65:5b:dc: 6f:f7:f0:1c:b0:23:40:1e:e3:45:17:04:1a:bc:d1:57:d7:0d: c8:26:6d:99:fe:28:52:fe:ba:6a:a1:b8:d1:d1:50:a9:fa:03: bb:b7:ad:0e:82:d2:e8:34:91:fa:b4:f9:81:d1:9b:6d:0f:a3: 8c:9d:c4:4a:1e:08:26:71:b9:1a:e8:49:96:0f:db:5c:76:db: ae:c7:6b:2e:ea:89:5d:7f:a3:ba:ea:7e:12:97:12:bc:1e:7f: 49:09:d4:08:a6:4a:34:73:51:9e:a2:9a:ec:2a:f7:fc:b5:5c: f8:20:95:ad This is probably a side effect of https://issues.redhat.com/browse/RFE-3093 - imagestream to trust CA added during the installation, that is also affecting imagestreams that requires a CA cert injected by hypershift during hosted-cluster creation in the disconnected use case.
Version-Release number of selected component (if applicable):
v4.14, v4.15, v4.16
How reproducible:
100%
Steps to Reproduce:
once connected to a disconnected hosted cluster, create an image stream pointing to an image on the internal mirror registry: 1. $ oc --namespace=e2e-test tag virthost.ostest.test.metalkube.org:5000/localimages/local-test-image:e2e-7-registry-k8s-io-e2e-test-images-busybox-1-29-4-4zE9mRvED4RQoUxQ busybox:latest 2. $ oc --namespace=e2e-test set image-lookup busybox 3. then check the image stream
Actual results:
status: dockerImageRepository: image-registry.openshift-image-registry.svc:5000/e2e-test/busybox tags: - conditions: - generation: 2 lastTransitionTime: "2024-03-27T12:43:56Z" message: 'Internal error occurred: virthost.ostest.test.metalkube.org:5000/localimages/local-test-image:e2e-7-registry-k8s-io-e2e-test-images-busybox-1-29-4-4zE9mRvED4RQoUxQ: Get "https://virthost.ostest.test.metalkube.org:5000/v2/": tls: failed to verify certificate: x509: certificate signed by unknown authority' although the same image can be directly consumed by a pod on the same cluster
Expected results:
status: dockerImageRepository: image-registry.openshift-image-registry.svc:5000/e2e-test/busybox tags: - conditions: - generation: 8 lastTransitionTime: "2024-03-27T13:30:46Z" message: dockerimage.image.openshift.io "virthost.ostest.test.metalkube.org:5000/localimages/local-test-image:e2e-7-registry-k8s-io-e2e-test-images-busybox-1-29-4-4zE9mRvED4RQoUxQ" not found reason: NotFound status: "False" type: ImportSuccess
Additional info:
This is probably a side effect of https://issues.redhat.com/browse/RFE-3093 Marking the imagestream as: importPolicy: importMode: Legacy insecure: true is enough to workaround this.
This is a clone of issue OCPBUGS-36424. The following is the description of the original issue:
—
Description of problem:
The DeploymentConfigs deprecation info alert is shown on the Edit deployment form. It should be shown only on DeploymentConfigs pages.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. Create a deployment 2. Open Edit deployment form from the actions menu 3.
Actual results:
DeploymentConfigs deprecation info alert present on the edit deployment form
Expected results:
DeploymentConfigs deprecation info alert should not be shown for the Deployment
Additional info:
Description of problem:
A node fails to join the cluster because its CSR contains an incorrect hostname.
oc describe csr csr-7hftm Name: csr-7hftm Labels: <none> Annotations: <none> CreationTimestamp: Tue, 24 Oct 2023 10:22:39 -0400 Requesting User: system:serviceaccount:openshift-machine-config-operator:node-bootstrapper Signer: kubernetes.io/kube-apiserver-client-kubelet Status: Pending Subject: Common Name: system:node:openshift-worker-1 Serial Number: Organization: system:nodes Events: <none>
oc get csr csr-7hftm -o yaml apiVersion: certificates.k8s.io/v1 kind: CertificateSigningRequest metadata: creationTimestamp: "2023-10-24T14:22:39Z" generateName: csr- name: csr-7hftm resourceVersion: "96957" uid: 84b94213-0c0c-40e4-8f90-d6612fbdab58 spec: groups: - system:serviceaccounts - system:serviceaccounts:openshift-machine-config-operator - system:authenticated request: LS0tLS1CRUdJTiBDRVJUSUZJQ0FURSBSRVFVRVNULS0tLS0KTUlIN01JR2lBZ0VBTUVBeEZUQVRCZ05WQkFvVERITjVjM1JsYlRwdWIyUmxjekVuTUNVR0ExVUVBeE1lYzNsegpkR1Z0T201dlpHVTZiM0JsYm5Ob2FXWjBMWGR2Y210bGNpMHhNRmt3RXdZSEtvWkl6ajBDQVFZSUtvWkl6ajBECkFRY0RRZ0FFMjRabE1JWGE1RXRKSGgwdWg2b3RVYTc3T091MC9qN0xuSnFqNDJKY0dkU01YeTJVb3pIRTFycmYKOTFPZ3pOSzZ5Z1R0Qm16NkFOdldEQTZ0dUszMlY2QUFNQW9HQ0NxR1NNNDlCQU1DQTBnQU1FVUNJRFhHMlFVWQoxMnVlWXhxSTV3blArRFBQaE5oaXhiemJvaTBpQzhHci9kMXRBaUVBdEFDcVVwRHFLYlFUNWVFZXlLOGJPN0dlCjhqVEI1UHN1SVpZM1pLU1R2WG89Ci0tLS0tRU5EIENFUlRJRklDQVRFIFJFUVVFU1QtLS0tLQo= signerName: kubernetes.io/kube-apiserver-client-kubelet uid: c3adb2e0-6d60-4f56-a08d-6b01d3d3c065 usages: - digital signature - client auth username: system:serviceaccount:openshift-machine-config-operator:node-bootstrapper status: {}
Version-Release number of selected component (if applicable):
4.14.0-rc.6
How reproducible:
So far only on one setup
Steps to Reproduce:
1. Deploy dualstack baremetal cluster with day1 networking with static DHCP hostnames 2. 3.
Actual results:
A node fails to join the cluster
Expected results:
All nodes join the cluster
Continued work on the move to structured intervals requires us to replace all uses of the legacy format in origin so we can reclaim the "locator" (and "message") properties for the new structured interval, and stop duplicating a lot of text when we store, upload, and process these.
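As a rough illustration of the goal (a hypothetical Go sketch; these are not origin's actual monitorapi types), a structured interval keeps the locator as typed key/value pairs instead of a single legacy string that must be re-encoded and re-parsed everywhere:

package main

import (
    "encoding/json"
    "fmt"
)

// locator and interval are illustrative types only (not origin's actual
// definitions): the point is that a structured locator stores each
// attribute as a key/value pair instead of flattening them all into one
// legacy "locator" string per interval.
type locator struct {
    Type string            `json:"type"`
    Keys map[string]string `json:"keys"`
}

type interval struct {
    Locator locator `json:"locator"`
    Message string  `json:"message"`
    From    string  `json:"from"`
    To      string  `json:"to"`
}

func main() {
    // Legacy form: everything packed into one string that has to be
    // re-parsed (and is duplicated) everywhere it is stored or processed.
    legacy := "ns/openshift-kube-apiserver pod/installer-928 node/worker-0"

    // Structured form: the same information, queryable without string parsing.
    structured := interval{
        Locator: locator{
            Type: "pod",
            Keys: map[string]string{
                "namespace": "openshift-kube-apiserver",
                "pod":       "installer-928",
                "node":      "worker-0",
            },
        },
        Message: "pod sandbox creation failed",
        From:    "2024-03-18T08:22:05Z",
        To:      "2024-03-18T08:23:05Z",
    }

    fmt.Println("legacy:", legacy)
    out, _ := json.MarshalIndent(structured, "", "  ")
    fmt.Println(string(out))
}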
This is a clone of issue OCPBUGS-34040. The following is the description of the original issue:
—
Description of problem:
monitor-add-nodes.sh returns Error: open .addnodesparams: permission denied.
Version-Release number of selected component (if applicable):
4.16
How reproducible:
sometimes
Steps to Reproduce:
1. Monitor adding a day2 node using monitor-add-nodes.sh 2. 3.
Actual results:
Error: open .addnodesparams: permission denied.
Expected results:
monitor-add-nodes runs successfully
Additional info:
zhenying niu found an issue in node-joiner-monitor.sh
[core@ocp-edge49 installer]$ ./node-joiner-monitor.sh 192.168.122.6 namespace/openshift-node-joiner-mz8anfejbn created serviceaccount/node-joiner-monitor created clusterrole.rbac.authorization.k8s.io/node-joiner-monitor unchanged clusterrolebinding.rbac.authorization.k8s.io/node-joiner-monitor configured pod/node-joiner-monitor created Now using project "openshift-node-joiner-mz8anfejbn" on server "https://api.ostest.test.metalkube.org:6443". pod/node-joiner-monitor condition met time=2024-05-21T09:24:19Z level=info msg=Monitoring IPs: [192.168.122.6] Error: open .addnodesparams: permission denied Usage: node-joiner monitor-add-nodes [flags] Flags: -h, --help help for monitor-add-nodes Global Flags: --dir string assets directory (default ".") --kubeconfig string Path to the kubeconfig file. --log-level string log level (e.g. "debug | info | warn | error") (default "info") time=2024-05-21T09:24:19Z level=fatal msg=open .addnodesparams: permission denied Cleaning up Removing temporary file /tmp/nodejoiner-mZ8aNfEjbn
[~afasano@redhat.com] found the root cause: the working directory was not set, so the pwd folder /output is used, which is not writable. An easy fix would be to just use /tmp, i.e.: command: ["/bin/sh", "-c", "node-joiner monitor-add-nodes $ipAddresses --dir=/tmp --log-level=info; sleep 5"]
Last payload showed several occurrences of this problem seemingly surfacing the same way:
Example jobs:
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.16-e2e-gcp-sdn-techpreview-serial/1736680969496170496 (blocked the payload)
Looking at sippy, both tests took a dip in pass rate on the 16th, meaning a regression may have merged late on the 15th (Friday) or somewhere on the 16th (less likely).
The problem kills the install and thus we are getting no intervals charts to help debug.
Please review the following PR: https://github.com/openshift/kubernetes-autoscaler/pull/272
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
This is a clone of issue OCPBUGS-34054. The following is the description of the original issue:
—
The OCM-operator's imagePullSecretCleanupController attempts to prevent new pods from using an image pull secret that needs to be deleted, but this results in the OCM creating a new image pull secret in the meantime.
The overlap occurs when the OCM-operator detects that the registry has been removed, which simultaneously triggers the imagePullSecretCleanup controller to start deleting secrets and updates the OCM config to stop creating them; however, the OCM behavior change is delayed until its pods are restarted.
In 4.16 this churn is minimized due to the OCM naming the image pull secrets consistently, but the churn can occur during an upgrade given that the OCM-operator is updated first.
This is a clone of issue OCPBUGS-42232. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-35036. The following is the description of the original issue:
—
Description of problem:
The following logs are from namespaces/openshift-apiserver/pods/apiserver-6fcd57c747-57rkr/openshift-apiserver/openshift-apiserver/logs/current.log
2024-06-06T15:57:06.628216833Z E0606 15:57:06.628186 1 finisher.go:175] FinishRequest: post-timeout activity - time-elapsed: 139.823053ms, panicked: true, err: <nil>, panic-reason: runtime error: invalid memory address or nil pointer dereference 2024-06-06T15:57:06.628216833Z goroutine 192790 [running]: 2024-06-06T15:57:06.628216833Z k8s.io/apiserver/pkg/endpoints/handlers/finisher.finishRequest.func1.1() 2024-06-06T15:57:06.628216833Z k8s.io/apiserver@v0.29.2/pkg/endpoints/handlers/finisher/finisher.go:105 +0xa5 2024-06-06T15:57:06.628216833Z panic({0x498ac60?, 0x74a51c0?}) 2024-06-06T15:57:06.628216833Z runtime/panic.go:914 +0x21f 2024-06-06T15:57:06.628216833Z github.com/openshift/openshift-apiserver/pkg/image/apiserver/importer.(*ImageStreamImporter).importImages(0xc0c5bf0fc0, {0x5626bb0, 0xc0a50c7dd0}, 0xc07055f4a0, 0xc0a2487600) 2024-06-06T15:57:06.628216833Z github.com/openshift/openshift-apiserver/pkg/image/apiserver/importer/importer.go:263 +0x1cf5 2024-06-06T15:57:06.628216833Z github.com/openshift/openshift-apiserver/pkg/image/apiserver/importer.(*ImageStreamImporter).Import(0xc0c5bf0fc0, {0x5626bb0, 0xc0a50c7dd0}, 0x0?, 0x0?) 2024-06-06T15:57:06.628216833Z github.com/openshift/openshift-apiserver/pkg/image/apiserver/importer/importer.go:110 +0x139 2024-06-06T15:57:06.628216833Z github.com/openshift/openshift-apiserver/pkg/image/apiserver/registry/imagestreamimport.(*REST).Create(0xc0033b2240, {0x5626bb0, 0xc0a50c7dd0}, {0x5600058?, 0xc07055f4a0?}, 0xc08e0b9ec0, 0x56422e8?) 2024-06-06T15:57:06.628216833Z github.com/openshift/openshift-apiserver/pkg/image/apiserver/registry/imagestreamimport/rest.go:337 +0x1574 2024-06-06T15:57:06.628216833Z k8s.io/apiserver/pkg/endpoints/handlers.(*namedCreaterAdapter).Create(0x55f50e0?, {0x5626bb0?, 0xc0a50c7dd0?}, {0xc0b5704000?, 0x562a1a0?}, {0x5600058?, 0xc07055f4a0?}, 0x1?, 0x2331749?) 2024-06-06T15:57:06.628216833Z k8s.io/apiserver@v0.29.2/pkg/endpoints/handlers/create.go:254 +0x3b 2024-06-06T15:57:06.628216833Z k8s.io/apiserver/pkg/endpoints/handlers.CreateResource.createHandler.func1.1() 2024-06-06T15:57:06.628216833Z k8s.io/apiserver@v0.29.2/pkg/endpoints/handlers/create.go:184 +0xc6 2024-06-06T15:57:06.628216833Z k8s.io/apiserver/pkg/endpoints/handlers.CreateResource.createHandler.func1.2() 2024-06-06T15:57:06.628216833Z k8s.io/apiserver@v0.29.2/pkg/endpoints/handlers/create.go:209 +0x39e 2024-06-06T15:57:06.628216833Z k8s.io/apiserver/pkg/endpoints/handlers/finisher.finishRequest.func1() 2024-06-06T15:57:06.628216833Z k8s.io/apiserver@v0.29.2/pkg/endpoints/handlers/finisher/finisher.go:117 +0x84
Version-Release number of selected component (if applicable):
We applied into all clusters in CI and checked 3 of them and all 3 share the same errors.
oc --context build09 get clusterversion NAME VERSION AVAILABLE PROGRESSING SINCE STATUS version 4.16.0-rc.3 True False 3d9h Error while reconciling 4.16.0-rc.3: the cluster operator machine-config is degraded oc --context build02 get clusterversion NAME VERSION AVAILABLE PROGRESSING SINCE STATUS version 4.16.0-rc.2 True False 15d Error while reconciling 4.16.0-rc.2: the cluster operator machine-config is degraded oc --context build03 get clusterversion NAME VERSION AVAILABLE PROGRESSING SINCE STATUS version 4.15.16 True False 34h Error while reconciling 4.15.16: the cluster operator machine-config is degraded
How reproducible:
We applied this PR https://github.com/openshift/release/pull/52574/files to the clusters.
It breaks at least 3 of them.
"qci-pull-through-cache-us-east-1-ci.apps.ci.l2s4.p1.openshiftapps.com" is a registry cache server https://github.com/openshift/release/blob/master/clusters/app.ci/quayio-pull-through-cache/qci-pull-through-cache-us-east-1.yaml
Additional info:
There are lots of image imports in OpenShift CI jobs.
It feels like the registry cache server returns unexpected results to the openshift-apiserver:
2024-06-06T18:13:13.781520581Z E0606 18:13:13.781459 1 strategy.go:60] unable to parse manifest for "sha256:c5bcd0298deee99caaf3ec88de246f3af84f80225202df46527b6f2b4d0eb3c3": unexpected end of JSON input
Our theory is that the image import requests from all CI clusters crashed the cache server, and it sent some unexpected data which caused the apiserver to panic.
The expected behaviour is that if the image cannot be pulled from the first mirror in the ImageDigestMirrorSet, it fails over to the next one.
This is a clone of issue OCPBUGS-35542. The following is the description of the original issue:
—
Description of problem:
After destroying cluster, there are still some files leftover in <install-dir>/.clusterapi_output $ ls -ltra total 1516 drwxr-xr-x. 1 fedora fedora 596 Jun 17 03:46 .. drwxr-x---. 1 fedora fedora 88 Jun 17 06:09 .clusterapi_output -rw-r--r--. 1 fedora fedora 1552382 Jun 17 06:09 .openshift_install.log drwxr-xr-x. 1 fedora fedora 80 Jun 17 06:09 . $ ls -ltr .clusterapi_output/ total 40 -rw-r--r--. 1 fedora fedora 2335 Jun 17 05:58 envtest.kubeconfig -rw-r--r--. 1 fedora fedora 20542 Jun 17 06:03 kube-apiserver.log -rw-r--r--. 1 fedora fedora 10656 Jun 17 06:03 etcd.log Then continue installing new cluster within same install dir, installer exited with error as below: $ ./openshift-install create cluster --dir ipi-aws INFO Credentials loaded from the "default" profile in file "/home/fedora/.aws/credentials" INFO Consuming Install Config from target directory FATAL failed to fetch Cluster: failed to load asset "Cluster": local infrastructure provisioning artifacts already exist. There may already be a running cluster After removing .clusterapi_output/envtest.kubeconfig, and creating cluster again, installation is continued.
Version-Release number of selected component (if applicable):
4.16 nightly build
How reproducible:
always
Steps to Reproduce:
1. Launch capi-based installation 2. Destroy cluster 3. Launch new cluster within same install dir
Actual results:
Fail to launch new cluster within the same install dir, because .clusterapi_output/envtest.kubeconfig is still there.
Expected results:
Successfully create a new cluster within the same install dir
Additional info:
Following up from OCPBUGS-16357, we should enable health check of stale registration sockets in our operators.
We will need - https://github.com/kubernetes-csi/node-driver-registrar/pull/322 and we will have to enable healthcheck for registration sockets - https://github.com/kubernetes-csi/node-driver-registrar#example
Description of problem:
When using the registry-overrides flag to override registries for control plane components, it seems like the current implementation propagates the override to some data plane components. It seems that certain components like multus, dns, and ingress get values for their containers' images from env vars set in operators on the control plane (cno/dns operator/konnectivity), and hence also get the overridden registry propagated to them.
Version-Release number of selected component (if applicable):
How reproducible:
100%
Steps to Reproduce:
1.Input a registry override through the HyperShift Operator 2.Check registry fields for components on data plane 3.
Actual results:
Data plane components that get registry values from env vars set in dns-operator, ingress-operator, cluster-network-operator, and cluster-node-tuning-operator get overridden registries.
Expected results:
Overridden registries should not get propagated to the data plane
Additional info:
Description of problem:
When using the modal dialogs in a hook as part of the actions hook (i.e. useApplicationsActionsProvider), the console will throw an error since the console framework will pass null objects as part of the render cycle. According to Jon Jackson, the console should be safe from null objects, but it looks like the code for useDeleteModal and getGroupVersionKindForresource is not.
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. Use one of the modal APIs in an actions provider hook 2. 3.
Actual results:
Caught error in a child component: TypeError: Cannot read properties of undefined (reading 'split') at i (main-chunk-9fbeef79a…d3a097ed.min.js:1:1) at u (main-chunk-9fbeef79a…d3a097ed.min.js:1:1) at useApplicationActionsProvider (useApplicationActionsProvider.tsx:23:43) at ApplicationNavPage (ApplicationDetails.tsx:38:67) at na (vendors~main-chunk-8…87b.min.js:174297:1) at Hs (vendors~main-chunk-8…87b.min.js:174297:1) at Sc (vendors~main-chunk-8…87b.min.js:174297:1) at Cc (vendors~main-chunk-8…87b.min.js:174297:1) at _c (vendors~main-chunk-8…87b.min.js:174297:1) at pc (vendors~main-chunk-8…87b.min.js:174297:1)
Expected results:
Works with no error
Additional info:
Description of problem:
When trying to onboard a xFusion baremetal node using redfish-virtual media (no provisioning network), it fails after the node registration with this error: Normal InspectionError 60s metal3-baremetal-controller Failed to inspect hardware. Reason: unable to start inspection: The attribute Links/ManagedBy is missing from the resource /redfish/v1/Systems/1
Version-Release number of selected component (if applicable):
4.14.18
How reproducible:
Just add an xFusion baremetal node, specifying in the manifest Spec: Automated Cleaning Mode: metadata Bmc: Address: redfish-virtualmedia://w.z.x.y/redfish/v1/Systems/1 Credentials Name: hu28-tovb-bmc-secret Disable Certificate Verification: true Boot MAC Address: <MAC> Boot Mode: UEFI Online: false Preprovisioning Network Data Name: openstack-hu28-tovb-network-config-secret
Steps to Reproduce:
1. 2. 3.
Actual results:
Inspection fails with the aforementioned error; no preprovisioning image is mounted on the host virtual media.
Expected results:
Virtual media gets mounted and inspection starts.
Additional info:
Description of problem:
The shutdown-delay-duration argument for the openshift-oauth-apiserver is set to 3s in hypershift, but set to 15s in core openshift. Hypershift should update the value to match.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
When scaling from zero replicas, the cluster autoscaler can panic if there are taints on the machineset with no "value" field defined.
Version-Release number of selected component (if applicable):
4.16/master
How reproducible:
always
Steps to Reproduce:
1. create a machineset with a taint that has no value field and 0 replicas 2. enable the cluster autoscaler 3. force a workload to scale the tainted machineset
Actual results:
a panic like this is observed I0325 15:36:38.314276 1 clusterapi_provider.go:68] discovered node group: MachineSet/openshift-machine-api/k8hmbsmz-c2483-9dnddr4sjc (min: 0, max: 2, replicas: 0) panic: interface conversion: interface {} is nil, not string goroutine 79 [running]: k8s.io/autoscaler/cluster-autoscaler/cloudprovider/clusterapi.unstructuredToTaint(...) /go/src/k8s.io/autoscaler/cluster-autoscaler/cloudprovider/clusterapi/clusterapi_unstructured.go:246 k8s.io/autoscaler/cluster-autoscaler/cloudprovider/clusterapi.unstructuredScalableResource.Taints({0xc000103d40?, 0xc000121360?, 0xc002386f98?, 0x2?}) /go/src/k8s.io/autoscaler/cluster-autoscaler/cloudprovider/clusterapi/clusterapi_unstructured.go:214 +0x8a5 k8s.io/autoscaler/cluster-autoscaler/cloudprovider/clusterapi.(*nodegroup).TemplateNodeInfo(0xc002675930) /go/src/k8s.io/autoscaler/cluster-autoscaler/cloudprovider/clusterapi/clusterapi_nodegroup.go:266 +0x2ea k8s.io/autoscaler/cluster-autoscaler/core/utils.GetNodeInfoFromTemplate({0x276b230, 0xc002675930}, {0xc001bf2c00, 0x10, 0x10}, {0xc0023ffe60?, 0xc0023ffe90?}) /go/src/k8s.io/autoscaler/cluster-autoscaler/core/utils/utils.go:41 +0x9d k8s.io/autoscaler/cluster-autoscaler/processors/nodeinfosprovider.(*MixedTemplateNodeInfoProvider).Process(0xc00084f848, 0xc0023f7680, {0xc001dcdb00, 0x3, 0x0?}, {0xc001bf2c00, 0x10, 0x10}, {0xc0023ffe60, 0xc0023ffe90}, ...) /go/src/k8s.io/autoscaler/cluster-autoscaler/processors/nodeinfosprovider/mixed_nodeinfos_processor.go:155 +0x599 k8s.io/autoscaler/cluster-autoscaler/core.(*StaticAutoscaler).RunOnce(0xc000617550, {0x4?, 0x0?, 0x3a56f60?}) /go/src/k8s.io/autoscaler/cluster-autoscaler/core/static_autoscaler.go:352 +0xcaa main.run(0x0?, {0x2761b48, 0xc0004c04e0}) /go/src/k8s.io/autoscaler/cluster-autoscaler/main.go:529 +0x2cd main.main.func2({0x0?, 0x0?}) /go/src/k8s.io/autoscaler/cluster-autoscaler/main.go:617 +0x25 created by k8s.io/client-go/tools/leaderelection.(*LeaderElector).Run /go/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/leaderelection/leaderelection.go:213 +0x105
Expected results:
expect the machineset to scale up
Additional info:
I think the e2e test that exercises this only runs on periodic jobs, and as such we missed this error in OCPBUGS-27509.
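A minimal Go sketch of the failure mode (illustrative only, not the autoscaler's actual code): when the taint map decoded from the unstructured MachineSet has no "value" key, a direct type assertion panics exactly as in the log above, while the comma-ok form tolerates the missing field:

package main

import "fmt"

// toTaint sketches reading a taint entry decoded from an unstructured
// MachineSet. The map comes straight from YAML/JSON, so "value" may simply
// be absent when the taint was declared without one.
func toTaint(m map[string]interface{}) (key, value, effect string) {
    key, _ = m["key"].(string)
    effect, _ = m["effect"].(string)

    // Unsafe: m["value"] is nil when the field is omitted, and nil.(string)
    // panics with "interface conversion: interface {} is nil, not string",
    // the signature seen in the autoscaler log.
    //   value = m["value"].(string)

    // Safe: the comma-ok assertion leaves value empty instead of panicking.
    value, _ = m["value"].(string)
    return key, value, effect
}

func main() {
    // A taint with no "value" field, as produced by a MachineSet like:
    //   taints:
    //   - key: example.com/dedicated
    //     effect: NoSchedule
    taint := map[string]interface{}{
        "key":    "example.com/dedicated",
        "effect": "NoSchedule",
    }
    k, v, e := toTaint(taint)
    fmt.Printf("key=%q value=%q effect=%q\n", k, v, e)
}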
Description of problem:
When running agent-based installation with arm64 and multi payload, after booting the iso file, assisted-service raise the error, and the installation fail to start: Openshift version 4.16.0-0.nightly-arm64-2024-04-02-182838 for CPU architecture arm64 is not supported: no release image found for openshiftVersion: '4.16.0-0.nightly-arm64-2024-04-02-182838' and CPU architecture 'arm64'" go-id=419 pkg=Inventory request_id=5817b856-ca79-43c0-84f1-b38f733c192f The same error when running the installation with multi-arch build in assisted-service.log: Openshift version 4.16.0-0.nightly-multi-2024-04-01-135550 for CPU architecture multi is not supported: no release image found for openshiftVersion: '4.16.0-0.nightly-multi-2024-04-01-135550' and CPU architecture 'multi'" go-id=306 pkg=Inventory request_id=21a47a40-1de9-4ee3-9906-a2dd90b14ec8 Amd64 build works fine for now.
Version-Release number of selected component (if applicable):
How reproducible:
always
Steps to Reproduce:
1. Create agent iso file with openshift-install binary: openshift-install agent create image with arm64/multi payload 2. Booting the iso file 3. Track the "openshift-install agent wait-for bootstrap-complete" output and assisted-service log
Actual results:
The installation can't start with error
Expected results:
The installation is working fine
Additional info:
assisted-service log: https://docs.google.com/spreadsheets/d/1Jm-eZDrVz5so4BxsWpUOlr3l_90VmJ8FVEvqUwG8ltg/edit#gid=0 Job fail url: multi payload: https://qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gs/qe-private-deck/logs/periodic-ci-openshift-openshift-tests-private-release-4.16-multi-nightly-baremetal-compact-agent-ipv4-dhcp-day2-amd-mixarch-f14/1774134780246364160 arm64 payload: https://qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gs/qe-private-deck/logs/periodic-ci-openshift-openshift-tests-private-release-4.16-arm64-nightly-baremetal-pxe-ha-agent-ipv4-static-connected-f14/1773354788239446016
Please review the following PR: https://github.com/openshift/csi-livenessprobe/pull/59
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/operator-framework-rukpak/pull/81
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
Following signing-key deletion, there is a service CA rotation process which might temporarily disrupt cluster operators, but eventually all should regenerate. In recent 4.14 nightlies, however, this is no longer the case. Following a deletion of the signing-key using oc delete secret/signing-key -n openshift-service-ca, operators will progress for a while, but eventually console as well as monitoring will end up in available=false and degraded=true, which is only recoverable by manually deleting all the pods in the cluster.
console 4.14.0-0.nightly-2023-06-30-131338 False False True 159m RouteHealthAvailable: route not yet available, https://console-openshift-console.apps.evakhoni-0412.qe.gcp.devcluster.openshift.com returns '503 Service Unavailable'
monitoring 4.14.0-0.nightly-2023-06-30-131338 False True True 161m reconciling Console Plugin failed: retrieving ConsolePlugin object failed: conversion webhook for console.openshift.io/v1alpha1, Kind=ConsolePlugin failed: Post "https://webhook.openshift-console-operator.svc:9443/crdconvert?timeout=30s": tls: failed to verify certificate: x509: certificate signed by unknown authority
The same deletion in previous versions (4.14-ec.2 or earlier) doesn't have this issue, and the cluster is able to recover eventually without any manual pod deletion. I believe this to be a regression bug.
Version-Release number of selected component (if applicable):
4.14.0-0.nightly-2023-06-30-131338 and other recent 4.14 nightlies
How reproducible:
100%
Steps to Reproduce:
1.oc delete secret/signing-key -n openshift-service-ca 2. wait at least 30+ minutes 3. observe oc get co
Actual results:
console and monitoring degraded and not recovering
Expected results:
able to recover eventually as in previous versions
Additional info:
By manually deleting all pods, it is possible to recover the cluster from this state as follows: for I in $(oc get ns -o jsonpath='{range .items[*]} {.metadata.name}{"\n"} {end}'); \ do oc delete pods --all -n $I; \ sleep 1; \ done
must-gather:
https://drive.google.com/file/d/1Y3RrYZlz0EncG-Iqt8USFPsTd-br36Zt/view?usp=sharing
ERROR failed to fetch Cluster: failed to generate asset "Cluster": failed to create cluster: failed preparing ignition data: ignition failed to provision storage: failed to create storage: failed to create bucket: googleapi: Error 409: Your previous request to create the named bucket succeeded and you already own it., conflict
Description of problem:
Configure vm type as Standard_NP10s in install-config, which only supports Generation V1. -------------- compute: - architecture: amd64 hyperthreading: Enabled name: worker platform: azure: type: Standard_NP10s replicas: 3 controlPlane: architecture: amd64 hyperthreading: Enabled name: master platform: azure: type: Standard_NP10s replicas: 3 Continue installation, installer failed when provisioning bootstrap node. -------------- ERROR ERROR Error: creating Linux Virtual Machine: (Name "jima1211test-rqfhm-bootstrap" / Resource Group "jima1211test-rqfhm-rg"): compute.VirtualMachinesClient#CreateOrUpdate: Failure sending request: StatusCode=400 -- Original Error: Code="BadRequest" Message="The selected VM size 'Standard_NP10s' cannot boot Hypervisor Generation '2'. If this was a Create operation please check that the Hypervisor Generation of the Image matches the Hypervisor Generation of the selected VM Size. If this was an Update operation please select a Hypervisor Generation '2' VM Size. For more information, see https://aka.ms/azuregen2vm" ERROR ERROR with azurerm_linux_virtual_machine.bootstrap, ERROR on main.tf line 193, in resource "azurerm_linux_virtual_machine" "bootstrap": ERROR 193: resource "azurerm_linux_virtual_machine" "bootstrap" { ERROR ERROR failed to fetch Cluster: failed to generate asset "Cluster": failed to create cluster: failure applying terraform for "bootstrap" stage: error applying Terraform configs: failed to apply Terraform: exit status 1 ERROR ERROR Error: creating Linux Virtual Machine: (Name "jima1211test-rqfhm-bootstrap" / Resource Group "jima1211test-rqfhm-rg"): compute.VirtualMachinesClient#CreateOrUpdate: Failure sending request: StatusCode=400 -- Original Error: Code="BadRequest" Message="The selected VM size 'Standard_NP10s' cannot boot Hypervisor Generation '2'. If this was a Create operation please check that the Hypervisor Generation of the Image matches the Hypervisor Generation of the selected VM Size. If this was an Update operation please select a Hypervisor Generation '2' VM Size. For more information, see https://aka.ms/azuregen2vm" ERROR ERROR with azurerm_linux_virtual_machine.bootstrap, ERROR on main.tf line 193, in resource "azurerm_linux_virtual_machine" "bootstrap": ERROR 193: resource "azurerm_linux_virtual_machine" "bootstrap" { ERROR ERROR seems that issue is introduced by https://github.com/openshift/installer/pull/7642/
Version-Release number of selected component (if applicable):
4.15.0-0.nightly-2023-12-09-012410
How reproducible:
Always
Steps to Reproduce:
1. configure vm type to Standard_NP10s on control-plane in install-config.yaml 2. install cluster 3.
Actual results:
installer failed when provisioning bootstrap node
Expected results:
Installation succeeds
Additional info:
Description of problem:
"When" expressions using CEL are an alpha feature of Pipelines. They are not handled or supported in the console UI, so the UI breaks.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. Create a Pipeline with when expression using the CEL expression 2. Run the pipeline and navigate to the PipelineRun details page 3.
Actual results:
UI breaks
Expected results:
UI should not break
Additional info:
CEL expression doc https://github.com/tektoncd/pipeline/blob/main/docs/pipelines.md#use-cel-expression-in-whenexpression
Please review the following PR: https://github.com/openshift/cloud-network-config-controller/pull/135
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
In a cluster updating from 4.5.11 through many intermediate versions to 4.14.17 and on to 4.15.3 (initiated 2024-03-18T07:33:11Z), multus pods are sad about api-int X.509:
$ tar -xOz inspect.local.5020316083985214391/namespaces/openshift-kube-apiserver/core/events.yaml <hivei01ue1.inspect.local.5020316083985214391.gz | yaml2json | jq -r '[.items[] | select(.reason == "FailedCreatePodSandBox")][0].message' (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_installer-928-ip-10-164-221-242.ec2.internal_openshift-kube-apiserver_9e87f20b-471a-447e-9679-edce26b4ef78_0(8322d383c477c29fe0221fdca5eaf5ca5b2f57f8a7077c7dd7d2861be0f5288c): error adding pod openshift-kube-apiserver_installer-928-ip-10-164-221-242.ec2.internal to CNI network "multus-cni-network": plugin type="multus-shim" name="multus-cni-network" failed (add): CmdAdd (shim): CNI request failed with status 400: '&{ContainerID:8322d383c477c29fe0221fdca5eaf5ca5b2f57f8a7077c7dd7d2861be0f5288c Netns:/var/run/netns/6e2b0b10-5006-4bf9-bd74-17333e0cdceb IfName:eth0 Args:IgnoreUnknown=1;K8S_POD_NAMESPACE=openshift-kube-apiserver;K8S_POD_NAME=installer-928-ip-10-164-221-242.ec2.internal;K8S_POD_INFRA_CONTAINER_ID=8322d383c477c29fe0221fdca5eaf5ca5b2f57f8a7077c7dd7d2861be0f5288c;K8S_POD_UID=9e87f20b-471a-447e-9679-edce26b4ef78 Path: StdinData:[REDACTED]} ContainerID:"8322d383c477c29fe0221fdca5eaf5ca5b2f57f8a7077c7dd7d2861be0f5288c" Netns:"/var/run/netns/6e2b0b10-5006-4bf9-bd74-17333e0cdceb" IfName:"eth0" Args:"IgnoreUnknown=1;K8S_POD_NAMESPACE=openshift-kube-apiserver;K8S_POD_NAME=installer-928-ip-10-164-221-242.ec2.internal;K8S_POD_INFRA_CONTAINER_ID=8322d383c477c29fe0221fdca5eaf5ca5b2f57f8a7077c7dd7d2861be0f5288c;K8S_POD_UID=9e87f20b-471a-447e-9679-edce26b4ef78" Path:"" ERRORED: error configuring pod [openshift-kube-apiserver/installer-928-ip-10-164-221-242.ec2.internal] networking: Multus: [openshift-kube-apiserver/installer-928-ip-10-164-221-242.ec2.internal/9e87f20b-471a-447e-9679-edce26b4ef78]: error waiting for pod: Get "https://api-int.REDACTED:6443/api/v1/namespaces/openshift-kube-apiserver/pods/installer-928-ip-10-164-221-242.ec2.internal?timeout=1m0s": tls: failed to verify certificate: x509: certificate signed by unknown authority
4.15.3, so we have 4.15.2's OCPBUGS-30304 but not 4.15.5's OCPBUGS-30237.
Seen in two clusters after updating from 4.14 to 4.15.3.
Unclear.
Sad multus pods.
Happy cluster.
$ openssl s_client -showcerts -connect api-int.REDACTED:6443 < /dev/null ... Certificate chain 0 s:CN = api-int.REDACTED i:CN = openshift-kube-apiserver-operator_loadbalancer-serving-signer@1710747228 a:PKEY: rsaEncryption, 2048 (bit); sigalg: RSA-SHA256 v:NotBefore: Mar 25 19:35:55 2024 GMT; NotAfter: Apr 24 19:35:56 2024 GMT ... 1 s:CN = openshift-kube-apiserver-operator_loadbalancer-serving-signer@1710747228 i:CN = openshift-kube-apiserver-operator_loadbalancer-serving-signer@1710747228 a:PKEY: rsaEncryption, 2048 (bit); sigalg: RSA-SHA256 v:NotBefore: Mar 18 07:33:47 2024 GMT; NotAfter: Mar 16 07:33:48 2034 GMT ...
So that's created seconds after the update was initiated. We have inspect logs for some namespaces, but they don't go back quite that far, because the machine-config roll at the end of the update into 4.15.3 rolled all the pods:
$ tar -xOz inspect.local.5020316083985214391/namespaces/openshift-kube-apiserver-operator/pods/kube-apiserver-operator-6cbfdd467c-4ctq7/kube-apiserver-operator/kube-apiserver-operator/logs/current.log <hivei01ue1.inspect.local.5020316083985214391.gz | head -n2 2024-03-18T08:22:05.058253904Z I0318 08:22:05.056255 1 cmd.go:241] Using service-serving-cert provided certificates 2024-03-18T08:22:05.058253904Z I0318 08:22:05.056351 1 leaderelection.go:122] The leader election gives 4 retries and allows for 30s of clock skew. The kube-apiserver downtime tolerance is 78s. Worst non-graceful lease acquisition is 2m43s. Worst graceful lease acquisition is {26s}.
We were able to recover individual nodes via:
Description of problem:
The ovn-ipsec-host pods are crashlooping on a 24 node cluster.
Version-Release number of selected component (if applicable):
4.16.0, master
How reproducible:
https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_release/50690/rehearse-50690-pull-ci-openshift-qe-ocp-qe-perfscale-ci-main-azure-4.15-nightly-x86-control-plane-ipsec-24nodes/1780216294851743744
Steps to Reproduce:
Running rehearse test for the PR https://github.com/openshift/release/pull/50690
Actual results:
CI lane fails at control-plane-ipsec-24nodes-ipi-install-install step. Seeing following errors from ipsec pod: 2024-04-16T14:18:01.158407293Z + counter=0 2024-04-16T14:18:01.158407293Z + '[' -f /etc/cni/net.d/10-ovn-kubernetes.conf ']' 2024-04-16T14:18:01.158512920Z ovnkube-node has configured node. 2024-04-16T14:18:01.158519623Z + echo 'ovnkube-node has configured node.' 2024-04-16T14:18:01.158519623Z + pgrep pluto 2024-04-16T14:18:01.166444142Z pluto is not running, enable the service and/or check system logs 2024-04-16T14:18:01.166465551Z + echo 'pluto is not running, enable the service and/or check system logs' 2024-04-16T14:18:01.166465551Z + exit 2
Expected results:
The step must pass and CI lane should succeed eventually.
Additional info:
The mcp status for the worker pool contains the following:
status: certExpirys: - bundle: KubeAPIServerServingCAData expiry: "2034-04-14T12:58:49Z" subject: CN=admin-kubeconfig-signer,OU=openshift - bundle: KubeAPIServerServingCAData expiry: "2024-04-17T12:58:51Z" subject: CN=kube-csr-signer_@1713274017 - bundle: KubeAPIServerServingCAData expiry: "2024-04-17T12:58:51Z" subject: CN=kubelet-signer,OU=openshift - bundle: KubeAPIServerServingCAData expiry: "2025-04-16T12:58:51Z" subject: CN=kube-apiserver-to-kubelet-signer,OU=openshift - bundle: KubeAPIServerServingCAData expiry: "2025-04-16T12:58:51Z" subject: CN=kube-control-plane-signer,OU=openshift - bundle: KubeAPIServerServingCAData expiry: "2034-04-14T12:58:50Z" subject: CN=kubelet-bootstrap-kubeconfig-signer,OU=openshift - bundle: KubeAPIServerServingCAData expiry: "2025-04-16T13:26:54Z" subject: CN=openshift-kube-apiserver-operator_node-system-admin-signer@1713274014 conditions: - lastTransitionTime: "2024-04-16T13:28:53Z" message: "" reason: "" status: "False" type: RenderDegraded - lastTransitionTime: "2024-04-16T13:34:52Z" message: "" reason: "" status: "False" type: Updated - lastTransitionTime: "2024-04-16T13:35:08Z" message: "" reason: "" status: "False" type: NodeDegraded - lastTransitionTime: "2024-04-16T13:35:08Z" message: "" reason: "" status: "False" type: Degraded - lastTransitionTime: "2024-04-16T13:34:52Z" message: All nodes are updating to MachineConfig rendered-worker-226a284eb61d46506202285ee1cf4688 reason: "" status: "True" type: Updating configuration: name: rendered-worker-95c2861c75a83c0523dcba922c3b9982 source: - apiVersion: machineconfiguration.openshift.io/v1 kind: MachineConfig name: 98-worker-generated-kubelet - apiVersion: machineconfiguration.openshift.io/v1 kind: MachineConfig name: 97-worker-generated-kubelet - apiVersion: machineconfiguration.openshift.io/v1 kind: MachineConfig name: 99-worker-generated-registries - apiVersion: machineconfiguration.openshift.io/v1 kind: MachineConfig name: 01-worker-container-runtime - apiVersion: machineconfiguration.openshift.io/v1 kind: MachineConfig name: 01-worker-kubelet - apiVersion: machineconfiguration.openshift.io/v1 kind: MachineConfig name: 80-ipsec-worker-extensions - apiVersion: machineconfiguration.openshift.io/v1 kind: MachineConfig name: 99-worker-ssh - apiVersion: machineconfiguration.openshift.io/v1 kind: MachineConfig name: 00-worker degradedMachineCount: 0 machineCount: 24 observedGeneration: 140 readyMachineCount: 8 unavailableMachineCount: 1 updatedMachineCount: 8
Description of problem:
Looking at the code snippet at line 198, the wg.Add(1) should be moved closer to the function it is waiting for (line 226).
Having another function in between that could exit early could leave the controller waiting on a deferred wg.Done() that can never run, meaning that the controller will never terminate.
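A minimal sketch of the hazard and the fix, assuming the controller waits on the WaitGroup during shutdown (illustrative Go, not the actual controller code):

package main

import (
    "errors"
    "fmt"
    "sync"
    "time"
)

// startHazardous mirrors the pattern flagged above: wg.Add(1) is issued well
// before the goroutine that owns the matching deferred wg.Done(). Because
// wg.Wait() is deferred for graceful shutdown, an early return from the
// intermediate step leaves the counter at 1 and the function blocks forever
// on its own deferred Wait.
func startHazardous(setup func() error) error {
    var wg sync.WaitGroup
    defer wg.Wait() // graceful-shutdown wait, runs even on early return

    wg.Add(1) // too far from the goroutine it accounts for

    if err := setup(); err != nil {
        return err // deferred wg.Wait() now blocks forever: nothing calls wg.Done()
    }

    go func() {
        defer wg.Done()
        time.Sleep(10 * time.Millisecond) // stand-in for the worker loop
    }()
    return nil
}

// startFixed keeps wg.Add(1) immediately next to the goroutine it waits for,
// so an early exit from setup cannot strand the WaitGroup counter.
func startFixed(setup func() error) error {
    var wg sync.WaitGroup
    defer wg.Wait()

    if err := setup(); err != nil {
        return err
    }

    wg.Add(1)
    go func() {
        defer wg.Done()
        time.Sleep(10 * time.Millisecond)
    }()
    return nil
}

func main() {
    // startFixed returns promptly even when setup fails.
    fmt.Println(startFixed(func() error { return errors.New("setup failed") }, ))
    // startHazardous(func() error { return errors.New("setup failed") }) // would hang forever
    _ = startHazardous
}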
Version-Release number of selected component (if applicable):
Found on the master branch while cross-referencing errors/logs for a cluster.
How reproducible:
Not reproducible.
Additional info:
Not required: a resolution has already been found.
This is a clone of issue OCPBUGS-32257. The following is the description of the original issue:
—
Description of problem:
When using the registry-overrides flag to override registries for control plane components, it seems like the current implementation propagates the override to some data plane components. It seems that certain components like multus, dns, and ingress get values for their containers' images from env vars set in operators on the control plane (cno/dns operator/konnectivity), and hence also get the overridden registry propagated to them.
Version-Release number of selected component (if applicable):
How reproducible:
100%
Steps to Reproduce:
1.Input a registry override through the HyperShift Operator 2.Check registry fields for components on data plane 3.
Actual results:
Data plane components that get registry values from env vars set in dns-operator, ingress-operator, cluster-network-operator, and cluster-node-tuning-operator get overridden registries.
Expected results:
Overridden registries should not get propagated to the data plane
Additional info:
The node-exporter pods throw the following errors if `symbolic_name` is not present or not provided by the fibre channel vendor.
$ oc logs node-exporter-m6lbc -n openshift-monitoring -c node-exporter | tail -2
2023-09-27T12:13:39.403106561Z ts=2023-09-27T12:13:39.403Z caller=collector.go:169 level=error msg="collector failed" name=fibrechannel duration_seconds=0.000249813 err="error obtaining FibreChannel class info: failed to read file \"/host/sys/class/fc_host/host0/symbolic_name\": open /host/sys/class/fc_host/host0/symbolic_name: no such file or directory"
The ibmvfc kernel module does not supply `symbolic_name`.
https://github.com/torvalds/linux/blob/master/drivers/scsi/ibmvscsi/ibmvfc.c#L6308
sh-5.1# cd /sys/class/fc_host/host0
sh-5.1# ls -ltr
total 0
-r--r--r--. 1 root root 65536 Sep 28 19:43 speed
-r--r--r--. 1 root root 65536 Sep 28 19:43 port_type
-r--r--r--. 1 root root 65536 Sep 28 19:43 port_state
-r--r--r--. 1 root root 65536 Sep 28 19:43 port_name
-r--r--r--. 1 root root 65536 Sep 28 19:43 port_id
-r--r--r--. 1 root root 65536 Sep 28 19:43 node_name
-r--r--r--. 1 root root 65536 Sep 28 19:43 fabric_name
-rw-r--r--. 1 root root 65536 Sep 28 19:43 dev_loss_tmo
-rw-r--r--. 1 root root 65536 Oct 3 09:24 uevent
-rw-r--r--. 1 root root 65536 Oct 3 09:24 tgtid_bind_type
-r--r--r--. 1 root root 65536 Oct 3 09:24 supported_classes
lrwxrwxrwx. 1 root root 0 Oct 3 09:24 subsystem -> ../../../../../../class/fc_host
drwxr-xr-x. 2 root root 0 Oct 3 09:24 power
-r--r--r--. 1 root root 65536 Oct 3 09:24 maxframe_size
-w------. 1 root root 65536 Oct 3 09:24 issue_lip
lrwxrwxrwx. 1 root root 0 Oct 3 09:24 device -> ../../../host0
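A hedged sketch of how a collector could treat the attribute as optional (illustrative Go only; the function names are hypothetical, not node-exporter's actual code): a missing sysfs file is reported as "no data" instead of a collector error:

package main

import (
    "errors"
    "fmt"
    "io/fs"
    "os"
    "path/filepath"
    "strings"
)

// readOptionalAttr reads a sysfs attribute but treats a missing file as
// "no data" rather than an error, since drivers such as ibmvfc simply do
// not expose symbolic_name.
func readOptionalAttr(hostDir, name string) (string, bool, error) {
    b, err := os.ReadFile(filepath.Join(hostDir, name))
    if errors.Is(err, fs.ErrNotExist) {
        return "", false, nil // attribute not provided by this driver
    }
    if err != nil {
        return "", false, err
    }
    return strings.TrimSpace(string(b)), true, nil
}

func main() {
    // Illustrative path; on a real node this would be /host/sys/class/fc_host/host0.
    hostDir := "/host/sys/class/fc_host/host0"

    symbolic, ok, err := readOptionalAttr(hostDir, "symbolic_name")
    if err != nil {
        fmt.Println("collector error:", err)
        return
    }
    if !ok {
        fmt.Println("symbolic_name not exposed by driver; skipping label")
        return
    }
    fmt.Println("symbolic_name:", symbolic)
}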
This is a clone of issue OCPBUGS-34734. The following is the description of the original issue:
—
Description of problem:
For the fix of OCPBUGS-29494, only the hosted cluster was fixed, and changes to the node pool were ignored. The node pool encountered the following error:
- lastTransitionTime: "2024-05-31T09:11:40Z" message: 'failed to check if we manage haproxy ignition config: failed to look up image metadata for registry.ci.openshift.org/ocp/4.14-2024-05-29-171450@sha256:9b88c6e3f7802b06e5de7cd3300aaf768e85d785d0847a70b35857e6d1000d51: failed to obtain root manifest for registry.ci.openshift.org/ocp/4.14-2024-05-29-171450@sha256:9b88c6e3f7802b06e5de7cd3300aaf768e85d785d0847a70b35857e6d1000d51: unauthorized: authentication required' observedGeneration: 1 reason: ValidationFailed status: "False" type: ValidMachineConfig
Version-Release number of selected component (if applicable):
4.14, 4.15, 4.16, 4.17
How reproducible:
100%
Steps to Reproduce:
1. Try to deploy a hostedCluster in a disconnected environment without explicitly setting the hypershift.openshift.io/control-plane-operator-image annotation. 2. 3.
Expected results:
Without setting the hypershift.openshift.io/control-plane-operator-image annotation, the nodepool can become ready.
Please review the following PR: https://github.com/openshift/prometheus-operator/pull/265
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
This is a clone of issue OCPBUGS-33129. The following is the description of the original issue:
—
Description of problem:
Given that we create a new pool, enable OCB in it, remove the pool and the MachineOSConfig resource, and then create another new pool to enable OCB again, the controller pod panics.
Version-Release number of selected component (if applicable):
pre-merge https://github.com/openshift/machine-config-operator/pull/4327
How reproducible:
Always
Steps to Reproduce:
1. Create a new infra MCP apiVersion: machineconfiguration.openshift.io/v1 kind: MachineConfigPool metadata: name: infra spec: machineConfigSelector: matchExpressions: - {key: machineconfiguration.openshift.io/role, operator: In, values: [worker,infra]} nodeSelector: matchLabels: node-role.kubernetes.io/infra: "" 2. Create a MachineOSConfig for infra pool oc create -f - << EOF apiVersion: machineconfiguration.openshift.io/v1alpha1 kind: MachineOSConfig metadata: name: infra spec: machineConfigPool: name: infra buildInputs: imageBuilder: imageBuilderType: PodImageBuilder baseImagePullSecret: name: $(oc get secret -n openshift-config pull-secret -o json | jq "del(.metadata.namespace, .metadata.creationTimestamp, .metadata.resourceVersion, .metadata.uid, .metadata.name)" | jq '.metadata.name="pull-copy"' | oc -n openshift-machine-config-operator create -f - &> /dev/null; echo -n "pull-copy") renderedImagePushSecret: name: $(oc get -n openshift-machine-config-operator sa builder -ojsonpath='{.secrets[0].name}') renderedImagePushspec: "image-registry.openshift-image-registry.svc:5000/openshift-machine-config-operator/ocb-image:latest" EOF 3. When the build is finished, remove the MachineOSConfig and the pool oc delete machineosconfig infra oc delete mcp infra 4. Create a new infra1 pool apiVersion: machineconfiguration.openshift.io/v1 kind: MachineConfigPool metadata: name: infra1 spec: machineConfigSelector: matchExpressions: - {key: machineconfiguration.openshift.io/role, operator: In, values: [worker,infra1]} nodeSelector: matchLabels: node-role.kubernetes.io/infra1: "" 5. Create a new machineosconfig for infra1 pool oc create -f - << EOF apiVersion: machineconfiguration.openshift.io/v1alpha1 kind: MachineOSConfig metadata: name: infra1 spec: machineConfigPool: name: infra1 buildInputs: imageBuilder: imageBuilderType: PodImageBuilder baseImagePullSecret: name: $(oc get secret -n openshift-config pull-secret -o json | jq "del(.metadata.namespace, .metadata.creationTimestamp, .metadata.resourceVersion, .metadata.uid, .metadata.name)" | jq '.metadata.name="pull-copy"' | oc -n openshift-machine-config-operator create -f - &> /dev/null; echo -n "pull-copy") renderedImagePushSecret: name: $(oc get -n openshift-machine-config-operator sa builder -ojsonpath='{.secrets[0].name}') renderedImagePushspec: "image-registry.openshift-image-registry.svc:5000/openshift-machine-config-operator/ocb-image:latest" containerFile: - containerfileArch: noarch content: |- RUN echo 'test image' > /etc/test-image.file EOF
Actual results:
The MCO controller pod panics (in updateMachineOSBuild): E0430 11:21:03.779078 1 runtime.go:79] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference) goroutine 265 [running]: k8s.io/apimachinery/pkg/util/runtime.logPanic({0x3547bc0?, 0x53ebb20}) /go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:75 +0x85 k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0xc00035e000?}) /go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:49 +0x6b panic({0x3547bc0?, 0x53ebb20?}) /usr/lib/golang/src/runtime/panic.go:914 +0x21f github.com/openshift/api/machineconfiguration/v1.(*MachineConfigPool).GetNamespace(0x53f6200?) <autogenerated>:1 +0x9 k8s.io/client-go/tools/cache.MetaObjectToName({0x3e2a8f8, 0x0}) /go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/client-go/tools/cache/store.go:131 +0x25 k8s.io/client-go/tools/cache.ObjectToName({0x3902740?, 0x0?}) /go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/client-go/tools/cache/store.go:126 +0x74 k8s.io/client-go/tools/cache.MetaNamespaceKeyFunc({0x3902740?, 0x0?}) /go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/client-go/tools/cache/store.go:112 +0x3e k8s.io/client-go/tools/cache.DeletionHandlingMetaNamespaceKeyFunc({0x3902740?, 0x0?}) /go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/client-go/tools/cache/controller.go:336 +0x3b github.com/openshift/machine-config-operator/pkg/controller/node.(*Controller).enqueueAfter(0xc0007097a0, 0x0, 0x0?) /go/src/github.com/openshift/machine-config-operator/pkg/controller/node/node_controller.go:761 +0x33 github.com/openshift/machine-config-operator/pkg/controller/node.(*Controller).enqueueDefault(...) /go/src/github.com/openshift/machine-config-operator/pkg/controller/node/node_controller.go:772 github.com/openshift/machine-config-operator/pkg/controller/node.(*Controller).updateMachineOSBuild(0xc0007097a0, {0xc001c37800?, 0xc000029678?}, {0x3904000?, 0xc0028361a0}) /go/src/github.com/openshift/machine-config-operator/pkg/controller/node/node_controller.go:395 +0xd1 k8s.io/client-go/tools/cache.ResourceEventHandlerFuncs.OnUpdate(...) /go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/client-go/tools/cache/controller.go:246 k8s.io/client-go/tools/cache.(*processorListener).run.func1() /go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/client-go/tools/cache/shared_informer.go:970 +0xea k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0x30?) /go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:226 +0x33 k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc0005e5738?, {0x3de6020, 0xc0008fe780}, 0x1, 0xc0000ac720) /go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:227 +0xaf k8s.io/apimachinery/pkg/util/wait.JitterUntil(0x6974616761706f72?, 0x3b9aca00, 0x0, 0x69?, 0xc0005e5788?) /go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:204 +0x7f k8s.io/apimachinery/pkg/util/wait.Until(...) 
/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:161 k8s.io/client-go/tools/cache.(*processorListener).run(0xc000b97c20) /go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/client-go/tools/cache/shared_informer.go:966 +0x69 k8s.io/apimachinery/pkg/util/wait.(*Group).Start.func1() /go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:72 +0x4f created by k8s.io/apimachinery/pkg/util/wait.(*Group).Start in goroutine 248 /go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:70 +0x73 panic: runtime error: invalid memory address or nil pointer dereference [recovered] panic: runtime error: invalid memory address or nil pointer dereference [signal SIGSEGV: segmentation violation code=0x1 addr=0x40 pc=0x210a6e9] When the controller pod is restarted, it panics again, but in a different function (addMachineOSBuild): E0430 11:26:54.753689 1 runtime.go:79] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference) goroutine 97 [running]: k8s.io/apimachinery/pkg/util/runtime.logPanic({0x3547bc0?, 0x53ebb20}) /go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:75 +0x85 k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0x15555555aa?}) /go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:49 +0x6b panic({0x3547bc0?, 0x53ebb20?}) /usr/lib/golang/src/runtime/panic.go:914 +0x21f github.com/openshift/api/machineconfiguration/v1.(*MachineConfigPool).GetNamespace(0x53f6200?) <autogenerated>:1 +0x9 k8s.io/client-go/tools/cache.MetaObjectToName({0x3e2a8f8, 0x0}) /go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/client-go/tools/cache/store.go:131 +0x25 k8s.io/client-go/tools/cache.ObjectToName({0x3902740?, 0x0?}) /go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/client-go/tools/cache/store.go:126 +0x74 k8s.io/client-go/tools/cache.MetaNamespaceKeyFunc({0x3902740?, 0x0?}) /go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/client-go/tools/cache/store.go:112 +0x3e k8s.io/client-go/tools/cache.DeletionHandlingMetaNamespaceKeyFunc({0x3902740?, 0x0?}) /go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/client-go/tools/cache/controller.go:336 +0x3b github.com/openshift/machine-config-operator/pkg/controller/node.(*Controller).enqueueAfter(0xc000899560, 0x0, 0x0?) /go/src/github.com/openshift/machine-config-operator/pkg/controller/node/node_controller.go:761 +0x33 github.com/openshift/machine-config-operator/pkg/controller/node.(*Controller).enqueueDefault(...) /go/src/github.com/openshift/machine-config-operator/pkg/controller/node/node_controller.go:772 github.com/openshift/machine-config-operator/pkg/controller/node.(*Controller).addMachineOSBuild(0xc000899560, {0x3904000?, 0xc0006a8b60}) /go/src/github.com/openshift/machine-config-operator/pkg/controller/node/node_controller.go:386 +0xc5 k8s.io/client-go/tools/cache.ResourceEventHandlerFuncs.OnAdd(...) /go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/client-go/tools/cache/controller.go:239 k8s.io/client-go/tools/cache.(*processorListener).run.func1() /go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/client-go/tools/cache/shared_informer.go:972 +0x13e k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0x30?) 
/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:226 +0x33 k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc00066bf38?, {0x3de6020, 0xc0008f8b40}, 0x1, 0xc000c2ea20) /go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:227 +0xaf k8s.io/apimachinery/pkg/util/wait.JitterUntil(0x0?, 0x3b9aca00, 0x0, 0x0?, 0xc00066bf88?) /go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:204 +0x7f k8s.io/apimachinery/pkg/util/wait.Until(...) /go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:161 k8s.io/client-go/tools/cache.(*processorListener).run(0xc000ba6240) /go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/client-go/tools/cache/shared_informer.go:966 +0x69 k8s.io/apimachinery/pkg/util/wait.(*Group).Start.func1() /go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:72 +0x4f created by k8s.io/apimachinery/pkg/util/wait.(*Group).Start in goroutine 43 /go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:70 +0x73 panic: runtime error: invalid memory address or nil pointer dereference [recovered] panic: runtime error: invalid memory address or nil pointer dereference [signal SIGSEGV: segmentation violation code=0x1 addr=0x40 pc=0x210a6e9]
Expected results:
No panic should happen. Errors should be handled gracefully.
Additional info:
In order to recover from this panic, we need to manually delete the MachineOSBuild resources related to the pool that no longer exists.
Description of problem:
The status controller of CCO reconciles 500+ times/h on average on a resting 6-node mint-mode OCP cluster on AWS.
Steps to Reproduce:
1. Install a 6-node mint-mode OCP cluster on AWS 2. Do nothing with it and wait for a couple of hours 3. Plot the following metric in the metrics dashboard of OCP console: rate(controller_runtime_reconcile_total{controller="status"}[1h]) * 3600
Actual results:
500+ reconciles/h on a resting cluster
Expected results:
12-50 reconciles/h on a resting cluster. Note: the reconcile() function always requeues after 5 min, so the theoretical minimum is 12 reconciles/h.
Component Readiness has found a potential regression in [Jira:"Networking / router"] monitor test service-type-load-balancer-availability cleanup.
Probability of significant regression: 100.00%
Sample (being evaluated) Release: 4.16
Start Time: 2024-04-02T00:00:00Z
End Time: 2024-04-08T23:59:59Z
Success Rate: 94.67%
Successes: 213
Failures: 12
Flakes: 0
Base (historical) Release: 4.15
Start Time: 2024-02-01T00:00:00Z
End Time: 2024-02-28T23:59:59Z
Success Rate: 100.00%
Successes: 751
Failures: 0
Flakes: 0
The failure message that we're after here is
{ failed during cleanup
Get "https://api.ci-op-tgk1b3if-9d969.ci2.azure.devcluster.openshift.com:6443/api/v1/namespaces/e2e-service-lb-test-xqptd": http2: client connection lost}
Looking at the sample runs, the failure is in the monitortest e2e junit XML files, and it appears this one always happens after upgrade, but before conformance. Unfortunately that means we may not have reliable intervals during the time this occurs. It also means there's no excuse for a lost connection to the apiserver.
Example: this junit xml from this job run
The problem actually dates back to March 3; see the attachment for the full list of affected job runs. It is almost entirely Azure and entirely 4.16 (it never happened earlier, as far back as we can see).
It occurs in a poll loop that checks whether a namespace still exists after being deleted. The failure rate seems to be around 5% on this specific job.
This is a clone of issue OCPBUGS-35252. The following is the description of the original issue:
—
Clone of original bug to ensure the change is made in HyperShift
Description of problem:
We shouldn't enforce PSa in 4.16, neither by label sync nor by global cluster config.
Version-Release number of selected component (if applicable):
4.16
How reproducible:
100%
Steps to Reproduce:
As a cluster admin: 1. create two new namespaces/projects: pokus, openshift-pokus 2. as a cluster-admin, attempt to create a privileged pod in both the namespaces from 1.
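For reference, a minimal sketch of the steps above; the image, pod name, and sleep command are arbitrary choices for illustration:
$ oc create namespace pokus
$ oc create namespace openshift-pokus
$ cat <<EOF | oc -n pokus create -f -
apiVersion: v1
kind: Pod
metadata:
  name: priv-test
spec:
  containers:
  - name: priv
    image: registry.access.redhat.com/ubi9/ubi
    command: ["sleep", "infinity"]
    securityContext:
      privileged: true
EOF
Repeat the pod creation in openshift-pokus to compare both namespaces.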
Actual results:
pod creation is blocked by pod security admission
Expected results:
only a warning about pod violating the namespace pod security level should be emitted
Additional info:
Description of problem:
Rebase CAPO upstream for OCP 4.16
Version-Release number of selected component (if applicable):
4.16
This is a clone of issue OCPBUGS-38842. The following is the description of the original issue:
—
Component Readiness has found a potential regression in the following test:
[sig-cluster-lifecycle] pathological event should not see excessive Back-off restarting failed containers for ns/openshift-image-registry
Probability of significant regression: 98.02%
Sample (being evaluated) Release: 4.17
Start Time: 2024-08-15T00:00:00Z
End Time: 2024-08-22T23:59:59Z
Success Rate: 94.74%
Successes: 180
Failures: 10
Flakes: 0
Base (historical) Release: 4.16
Start Time: 2024-05-31T00:00:00Z
End Time: 2024-06-27T23:59:59Z
Success Rate: 100.00%
Successes: 89
Failures: 0
Flakes: 0
Also hitting 4.17, I've aligned this bug to 4.18 so the backport process is cleaner.
The problem appears to be a permissions error preventing the pods from starting:
2024-08-22T06:14:14.743856620Z ln: failed to create symbolic link '/etc/pki/ca-trust/extracted/pem/directory-hash/ca-certificates.crt': Permission denied
Originating from this code: https://github.com/openshift/cluster-image-registry-operator/blob/master/pkg/resource/podtemplatespec.go#L489
Both 4.17 and 4.18 nightlies bumped rhcos and in there is an upgrade like this:
container-selinux-3-2.231.0-1.rhaos4.16.el9-noarch container-selinux-3-2.231.0-2.rhaos4.17.el9-noarch
With slightly different versions in each stream, but both were on 3-2.231.
Hits other tests too:
operator conditions image-registry
Operator upgrade image-registry
[sig-cluster-lifecycle] Cluster completes upgrade
[sig-arch][Feature:ClusterUpgrade] Cluster should remain functional during upgrade [Disruptive] [Serial]
[sig-arch][Feature:ClusterUpgrade] Cluster should be upgradeable after finishing upgrade [Late][Suite:upgrade]
Please review the following PR: https://github.com/openshift/csi-external-snapshotter/pull/130
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
Support for apiVersion v1alpha1 has been removed, so it is better to upgrade the apiVersion to v1beta1.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
HyperShift Operator is scheduling the control plane on nodes that are being deleted
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
https://web-rca.devshift.net/incident/ITN-2024-00068
1. HO was trying to create an HCP on a node being deleted 2. HO couldn't find the paired node because it was already deleted. Forcing the removal of the pending node (blocked by PDB) solves the issue
Actual results:
Expected results:
Additional info:
Description of problem:
Oh no! Something went wrong" in Topology -> Observese Tab
Version-Release number of selected component (if applicable):
4.15.0-0.nightly-2023-12-14-115151
How reproducible:
Always
Steps to Reproduce:
1. Navigate to Topology -> click one deployment and go to the Observe tab 2. 3.
Actual results:
The page crashed. ErrorDescription: Component trace: at te (https://console-openshift-console.apps.qe-uidaily-1215.qe.devcluster.openshift.com/static/vendor-plugins-shared~main-chunk-b3bd2b20c770a4e73b50.min.js:31:9773) at j (https://console-openshift-console.apps.qe-uidaily-1215.qe.devcluster.openshift.com/static/vendor-plugins-shared~main-chunk-b3bd2b20c770a4e73b50.min.js:12:3324) at div at s (https://console-openshift-console.apps.qe-uidaily-1215.qe.devcluster.openshift.com/static/vendor-patternfly-5~main-chunk-c9c3c11a060d045a85da.min.js:60:70124) at div at g (https://console-openshift-console.apps.qe-uidaily-1215.qe.devcluster.openshift.com/static/vendor-patternfly-5~main-chunk-c9c3c11a060d045a85da.min.js:6:11163) at div at d (https://console-openshift-console.apps.qe-uidaily-1215.qe.devcluster.openshift.com/static/vendor-patternfly-5~main-chunk-c9c3c11a060d045a85da.min.js:1:174472) at t.a (https://console-openshift-console.apps.qe-uidaily-1215.qe.devcluster.openshift.com/static/dev-console/code-refs/topology-chunk-769d28af48dd4b29136f.min.js:1:487478) at t.a (https://console-openshift-console.apps.qe-uidaily-1215.qe.devcluster.openshift.com/static/dev-console/code-refs/topology-chunk-769d28af48dd4b29136f.min.js:1:486390) at div at l (https://console-openshift-console.apps.qe-uidaily-1215.qe.devcluster.openshift.com/static/vendor-patternfly-5~main-chunk-c9c3c11a060d045a85da.min.js:60:106304) at div
Expected results:
The page should not crash.
Additional info:
The last 4 IPv6 jobs are failing on the same error
https://prow.ci.openshift.org/job-history/gs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.16-e2e-metal-ipi-ovn-ipv6
master-bmh-update.log loses access to the API when trying to get/update the BMH details
May 01 03:32:23 localhost.localdomain master-bmh-update.sh[4663]: Waiting for 3 masters to become provisioned May 01 03:32:23 localhost.localdomain master-bmh-update.sh[24484]: E0501 03:32:23.531242 24484 memcache.go:265] couldn't get current server API group list: Get "https://api-int.ostest.test.metalkube.org:6443/api?timeout=32s": dial tcp [fd2e:6f44:5dd8:c956::5]:6443: connect: connection refused May 01 03:32:23 localhost.localdomain master-bmh-update.sh[24484]: E0501 03:32:23.531808 24484 memcache.go:265] couldn't get current server API group list: Get "https://api-int.ostest.test.metalkube.org:6443/api?timeout=32s": dial tcp [fd2e:6f44:5dd8:c956::5]:6443: connect: connection refused May 01 03:32:23 localhost.localdomain master-bmh-update.sh[24484]: E0501 03:32:23.533281 24484 memcache.go:265] couldn't get current server API group list: Get "https://api-int.ostest.test.metalkube.org:6443/api?timeout=32s": dial tcp [fd2e:6f44:5dd8:c956::5]:6443: connect: connection refused May 01 03:32:23 localhost.localdomain master-bmh-update.sh[24484]: E0501 03:32:23.533630 24484 memcache.go:265] couldn't get current server API group list: Get "https://api-int.ostest.test.metalkube.org:6443/api?timeout=32s": dial tcp [fd2e:6f44:5dd8:c956::5]:6443: connect: connection refused May 01 03:32:23 localhost.localdomain master-bmh-update.sh[24484]: E0501 03:32:23.535180 24484 memcache.go:265] couldn't get current server API group list: Get "https://api-int.ostest.test.metalkube.org:6443/api?timeout=32s": dial tcp [fd2e:6f44:5dd8:c956::5]:6443: connect: connection refused May 01 03:32:23 localhost.localdomain master-bmh-update.sh[24484]: The connection to the server api-int.ostest.test.metalkube.org:6443 was refused - did you specify the right host or port?
Please review the following PR: https://github.com/openshift/kubernetes-autoscaler/pull/279
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
The customer uses the Azure File CSI driver, and without this they cannot make use of the Azure Workload Identity work, which was one of the banner features of OCP 4.14. This feature is currently available in 4.16; however, it will take the customer 3-6 months to validate 4.16 and start its rollout, putting their plans to complete a large migration to Azure by the end of 2024 at risk. Could you please backport either the 1.29.3 feature for Azure Workload Identity or rebase our Azure File CSI driver in 4.14 and 4.15 to at least 1.29.3, which includes the desired feature.
Version-Release number of selected component (if applicable):
azure-file-csi-driver in 4.14 and 4.15
- In 4.14, azure-file-csi-driver is version 1.28.1
- In 4.15, azure-file-csi-driver is version 1.29.2
How reproducible:
Always
Steps to Reproduce:
1. Install OCP 4.14 with Azure Workload Managed Identity 2. Try to configure Managed Workload Identity with the Azure File CSI driver https://github.com/kubernetes-sigs/azurefile-csi-driver/blob/master/docs/workload-identity-static-pv-mount.md
Actual results:
The feature is not usable.
Expected results:
Azure Workload Identity should be manageable with Azure File CSI as part of the whole feature
Additional info:
This is a clone of issue OCPBUGS-37832. The following is the description of the original issue:
—
CCMs attempt direct connections when the mgmt cluster on which the HCP runs is proxied and does not allow direct outbound connections.
Example from the AWS CCM
I0731 21:46:33.948466 1 event.go:389] "Event occurred" object="openshift-ingress/router-default" fieldPath="" kind="Service" apiVersion="v1" type="Warning" reason="SyncLoadBalancerFailed" message="Error syncing load balancer: failed to ensure load balancer: error listing AWS instances: \"WebIdentityErr: failed to retrieve credentials\\ncaused by: RequestError: send request failed\\ncaused by: Post \\\"https://sts.us-east-1.amazonaws.com/\\\": dial tcp 72.21.206.96:443: i/o timeout\""
Please review the following PR: https://github.com/openshift/console/pull/13434
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/multus-cni/pull/212
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of the problem:
When trying to install a 4.15 cluster with the LVMS, CNV, and MCE operators,
on the operators page I am unable to continue with the installation,
since host discovery reports that the hosts require more resources (as if ODF had also been selected).
How reproducible:
Steps to reproduce:
1. Create a multi-node 4.15 cluster
2. Make sure to have enough resources for the LVMS, CNV, and MCE operators
3. Select the CNV, LVMS, and MCE operators
Actual results:
On the operators page it shows that CPU and RAM resources related to ODF (which is not selected) are also required, and the user is unable to start the installation
Expected results:
Should be able to start installation
Description of problem:
In OCP 4.14 the catalog pods in openshift-marketplace where defined as: $ oc get pods -n openshift-marketplace redhat-operators-4bnz4 -o yaml apiVersion: v1 kind: Pod metadata: ... labels: olm.catalogSource: redhat-operators olm.pod-spec-hash: 658b699dc name: redhat-operators-4bnz4 namespace: openshift-marketplace ... spec: containers: - image: registry.redhat.io/redhat/redhat-operator-index:v4.14 imagePullPolicy: Always Now on OCP 4.15 they are defined as: apiVersion: v1 kind: Pod metadata: ... name: redhat-operators-44wxs namespace: openshift-marketplace ownerReferences: - apiVersion: operators.coreos.com/v1alpha1 blockOwnerDeletion: false controller: true kind: CatalogSource name: redhat-operators uid: 3b41ac7b-7ad1-4d58-a62f-4a9e667ae356 resourceVersion: "877589" uid: 65ad927c-3764-4412-8d34-82fd856a4cbc spec: containers: - args: - serve - /extracted-catalog/catalog - --cache-dir=/extracted-catalog/cache command: - /bin/opm ... image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:7259b65d8ae04c89cf8c4211e4d9ddc054bb8aebc7f26fac6699b314dc40dbe3 imagePullPolicy: Always ... initContainers: ... - args: - --catalog.from=/configs - --catalog.to=/extracted-catalog/catalog - --cache.from=/tmp/cache - --cache.to=/extracted-catalog/cache command: - /utilities/copy-content image: registry.redhat.io/redhat/redhat-operator-index:v4.15 imagePullPolicy: IfNotPresent ... And due to `imagePullPolicy: IfNotPresent` on the initContainer used to extract the index image (referenced by tag) content, they are never really updated.
Version-Release number of selected component (if applicable):
OCP 4.15.0
How reproducible:
100%
Steps to Reproduce:
1. wait for the next version of a released operator on OCP 4.15 2. 3.
Actual results:
Operator catalogs are never really refreshed due to imagePullPolicy: IfNotPresent for the index image
Expected results:
Operator catalogs are periodically (every 10 minutes by default) refreshed
Additional info:
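A quick way to confirm the pull policy on the extract init container of a catalog pod (the pod name is whatever oc get pods -n openshift-marketplace returns); this is only a verification sketch, not the fix:
$ oc -n openshift-marketplace get pod <catalog-pod> -o jsonpath='{range .spec.initContainers[*]}{.name}{"\t"}{.image}{"\t"}{.imagePullPolicy}{"\n"}{end}'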
Please review the following PR: https://github.com/openshift/cluster-csi-snapshot-controller-operator/pull/178
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/configmap-reload/pull/58
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
This is a clone of issue OCPBUGS-34397. The following is the description of the original issue:
—
Description of problem:
Upstream machine-config-operator has renamed their CRDs https://github.com/openshift/machine-config-operator/tree/master/install. HyperShift must make similar changes now in 4.16.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
This was discovered during Contrail testing when a large number of additional manifests specific to contrail were added to the openshift/ dir. The additional manifests are here - https://github.com/Juniper/contrail-networking/tree/main/releases/23.1/ocp. When creating the agent image the following error occurred: failed to fetch Agent Installer ISO: failed to generate asset \"Agent Installer ISO\": failed to create overwrite reader for ignition: content length (802204) exceeds embed area size (262144)"]
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Please review the following PR: https://github.com/openshift/cluster-openshift-apiserver-operator/pull/561
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
iam:TagInstanceProfile is not listed in the official document [1]. IPI install fails if the iam:TagInstanceProfile permission is missing:
level=error msg=Error: creating IAM Instance Profile (ci-op-4hw2rz1v-49c30-zt9vx-worker-profile): AccessDenied: User: arn:aws:iam::301721915996:user/ci-op-4hw2rz1v-49c30-minimal-perm is not authorized to perform: iam:TagInstanceProfile on resource: arn:aws:iam::301721915996:instance-profile/ci-op-4hw2rz1v-49c30-zt9vx-worker-profile because no identity-based policy allows the iam:TagInstanceProfile action
level=error msg= status code: 403, request id: bb0641f5-d01c-4538-b333-261a804ddb59
[1] https://docs.openshift.com/container-platform/4.14/installing/installing_aws/installing-aws-account.html#installation-aws-permissions_installing-aws-account
Version-Release number of selected component (if applicable):
4.15.0-0.nightly-2023-12-14-115151
How reproducible:
Always
Steps to Reproduce:
1. install a common IPI cluster with minimal permission provided in official document 2. 3.
Actual results:
Install failed.
Expected results:
Additional info:
install does a precheck for iam:TagInstanceProfile
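A minimal sketch of granting the missing action to the installer's IAM user via an inline policy; the user name, policy name, and wildcard resource are placeholders for illustration, and the real fix is adding the action to the documented minimal-permissions list:
$ aws iam put-user-policy --user-name <installer-user> --policy-name allow-tag-instance-profile \
    --policy-document '{"Version":"2012-10-17","Statement":[{"Effect":"Allow","Action":"iam:TagInstanceProfile","Resource":"*"}]}'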
Please review the following PR: https://github.com/openshift/operator-framework-catalogd/pull/36
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/csi-external-snapshotter/pull/128
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
There are multiple dashboards, so this page title should be "Dashboards" rather than "Dashboard". The admin console version of this page is already titled "Dashboards".
Version-Release number of selected component (if applicable):
4.16
Steps to Reproduce:
1. Open "Developer" View > Observe
Actual results:
See the tab title is "Dashboard"
Expected results:
The tab title should be "Dashboards"
Description of problem:
When triggering a build from a webhook (HTTP POST request), it fails with 403 - FORBIDDEN if the request does not have an OpenShift authorization token.
Version-Release number of selected component (if applicable):
4.16
How reproducible:
Always
Steps to Reproduce:
1. Create a BuildConfig with a webhook trigger and configured secret 2. Make appropriate cURL call to trigger the build via webhook
Actual results:
Webhook call refused with 403 Forbidden: "message": "buildconfigs.build.openshift.io \"sample-build\" is forbidden: User \"system:anonymous\" cannot create resource \"buildconfigs/webhooks\" in API group \"build.openshift.io\" in the namespace \"e2e-test-cli-start-build-dxxkx\"",
Expected results:
Builds can be triggered via webhook
Additional info:
https://docs.openshift.com/container-platform/4.15/cicd/builds/triggering-builds-build-hooks.html#builds-webhook-triggers_triggering-builds-build-hooks
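For reference, a minimal sketch of the webhook call from step 2, assuming a generic webhook trigger; the API server host, namespace, and secret value are placeholders, and the exact URL can be read from oc describe bc sample-build:
$ curl -k -X POST -H "Content-Type: application/json" \
    "https://<api-server>:6443/apis/build.openshift.io/v1/namespaces/<namespace>/buildconfigs/sample-build/webhooks/<secret>/generic"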
This is a clone of issue OCPBUGS-33493. The following is the description of the original issue:
—
The provisioning CR is now created with a paused annotation (since https://github.com/openshift/installer/pull/8346)
On baremetal IPI installs, this annotation is removed at the conclusion of bootstrapping.
On assisted/ABI installs there is nothing to remove it, so cluster-baremetal-operator never deploys anything.
Description of problem:
A change to how Power VS Workspaces are queried is not compatible with the version of terraform-provider-ibm
Version-Release number of selected component (if applicable):
How reproducible:
Easily
Steps to Reproduce:
1. Try to deploy with Power VS 2. Fail with an error stating that [ERROR] Error retrieving service offering: ServiceDoesnotExist: Given service : "power-iaas" doesn't exist
Actual results:
Fail with [ERROR] Error retrieving service offering: ServiceDoesnotExist: Given service : "power-iaas" doesn't exist
Expected results:
Install should succeed.
Additional info:
Noticed in k8s 1.30 PR, here's the run where it happened:
https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_kubernetes/1953/pull-ci-openshift-kubernetes-master-e2e-aws-ovn-fips/1788800196772106240
E0510 05:58:26.315444 1 runtime.go:79] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference) goroutine 992 [running]: k8s.io/apimachinery/pkg/util/runtime.logPanic({0x26915e0?, 0x471dff0}) /go/src/github.com/openshift/console-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:75 +0x85 k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0x0?}) /go/src/github.com/openshift/console-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:49 +0x6b panic({0x26915e0?, 0x471dff0?}) /usr/lib/golang/src/runtime/panic.go:914 +0x21f github.com/openshift/console-operator/pkg/console/controllers/healthcheck.(*HealthCheckController).CheckRouteHealth.func2() /go/src/github.com/openshift/console-operator/pkg/console/controllers/healthcheck/controller.go:156 +0x62 k8s.io/client-go/util/retry.OnError.func1() /go/src/github.com/openshift/console-operator/vendor/k8s.io/client-go/util/retry/util.go:51 +0x30 k8s.io/apimachinery/pkg/util/wait.runConditionWithCrashProtection(0x2fdcde8?) /go/src/github.com/openshift/console-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:145 +0x3e k8s.io/apimachinery/pkg/util/wait.ExponentialBackoff({0x989680, 0x3ff0000000000000, 0x3fb999999999999a, 0x5, 0x0}, 0x2fdcde8?) /go/src/github.com/openshift/console-operator/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:461 +0x5a k8s.io/client-go/util/retry.OnError({0x989680, 0x3ff0000000000000, 0x3fb999999999999a, 0x5, 0x0}, 0x2667a00?, 0xc001b185d0?) /go/src/github.com/openshift/console-operator/vendor/k8s.io/client-go/util/retry/util.go:50 +0xa5 github.com/openshift/console-operator/pkg/console/controllers/healthcheck.(*HealthCheckController).CheckRouteHealth(0xc001b097e8?, {0x2fdce90?, 0xc00057c870?}, 0x16?, 0x2faf140?) /go/src/github.com/openshift/console-operator/pkg/console/controllers/healthcheck/controller.go:152 +0x9a github.com/openshift/console-operator/pkg/console/controllers/healthcheck.(*HealthCheckController).Sync(0xc000748ae0, {0x2fdce90, 0xc00057c870}, {0x7f84e80672b0?, 0x7f852f941108?}) /go/src/github.com/openshift/console-operator/pkg/console/controllers/healthcheck/controller.go:143 +0x8eb github.com/openshift/library-go/pkg/controller/factory.(*baseController).reconcile(0xc000b57950, {0x2fdce90, 0xc00057c870}, {0x2fd5350?, 0xc001b185a0?}) /go/src/github.com/openshift/console-operator/vendor/github.com/openshift/library-go/pkg/controller/factory/base_controller.go:201 +0x43 github.com/openshift/library-go/pkg/controller/factory.(*baseController).processNextWorkItem(0xc000b57950, {0x2fdce90, 0xc00057c870}) /go/src/github.com/openshift/console-operator/vendor/github.com/openshift/library-go/pkg/controller/factory/base_controller.go:260 +0x1b4 github.com/openshift/library-go/pkg/controller/factory.(*baseController).runWorker.func1({0x2fdce90, 0xc00057c870}) /go/src/github.com/openshift/console-operator/vendor/github.com/openshift/library-go/pkg/controller/factory/base_controller.go:192 +0x89 k8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext.func1() /go/src/github.com/openshift/console-operator/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:259 +0x22 k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0x30?) 
/go/src/github.com/openshift/console-operator/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:226 +0x33 k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc0014b7b60?, {0x2faf040, 0xc001b18570}, 0x1, 0xc0014b7b60) /go/src/github.com/openshift/console-operator/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:227 +0xaf k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc00057c870?, 0x3b9aca00, 0x0, 0x0?, 0x0?) /go/src/github.com/openshift/console-operator/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:204 +0x7f k8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext({0x2fdce90, 0xc00057c870}, 0xc00139c770, 0x0?, 0x0?, 0x0?) /go/src/github.com/openshift/console-operator/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:259 +0x93 k8s.io/apimachinery/pkg/util/wait.UntilWithContext(...) /go/src/github.com/openshift/console-operator/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:170 github.com/openshift/library-go/pkg/controller/factory.(*baseController).runWorker(0x0?, {0x2fdce90?, 0xc00057c870?}) /go/src/github.com/openshift/console-operator/vendor/github.com/openshift/library-go/pkg/controller/factory/base_controller.go:183 +0x4d github.com/openshift/library-go/pkg/controller/factory.(*baseController).Run.func2() /go/src/github.com/openshift/console-operator/vendor/github.com/openshift/library-go/pkg/controller/factory/base_controller.go:117 +0x65 created by github.com/openshift/library-go/pkg/controller/factory.(*baseController).Run in goroutine 749 /go/src/github.com/openshift/console-operator/vendor/github.com/openshift/library-go/pkg/controller/factory/base_controller.go:112 +0x2ba
Please review the following PR: https://github.com/openshift/cluster-control-plane-machine-set-operator/pull/268
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
"Oh no! Something went wrong." will shown on Pending pod details page
Version-Release number of selected component (if applicable):
4.16.0-0.nightly-2024-04-14-063437
How reproducible:
always
Steps to Reproduce:
1. Create a dummy pod with pending status, e.g.:
apiVersion: v1
kind: Pod
metadata:
  name: nginx
  labels:
    env: test
spec:
  containers:
  - name: nginx
    image: nginx
    imagePullPolicy: IfNotPresent
  nodeSelector:
    disktype: ssd
OR
apiVersion: v1
kind: Pod
metadata:
  name: dummy-pod
spec:
  containers:
  - name: dummy-pod
    image: ubuntu
  restartPolicy: Always
  nodeSelector:
    testtype: pending
2. Navigate to the Pod details page
3.
Actual results:
"Oh no! Something went wrong." is shown. TypeError Description: Cannot read properties of undefined (reading 'restartCount') Component trace: at fe (https://console-openshift-console.apps.qe-daily-416-0415.qe.azure.devcluster.openshift.com/static/main-chunk-7643d3f1edb399bb7d65.min.js:1:562500) at div at div at ve (https://console-openshift-console.apps.qe-daily-416-0415.qe.azure.devcluster.openshift.com/static/main-chunk-7643d3f1edb399bb7d65.min.js:1:563346) at div at ke (https://console-openshift-console.apps.qe-daily-416-0415.qe.azure.devcluster.openshift.com/static/main-chunk-7643d3f1edb399bb7d65.min.js:1:571308) at i (https://console-openshift-console.apps.qe-daily-416-0415.qe.azure.devcluster.openshift.com/static/main-chunk-7643d3f1edb399bb7d65.min.js:1:329180) at _ (https://console-openshift-console.apps.qe-daily-416-0415.qe.azure.devcluster.openshift.com/static/vendor-plugins-shared~main-chunk-4dc722526d0f0470939e.min.js:31:4920) at ne (https://console-openshift-console.apps.qe-daily-416-0415.qe.azure.devcluster.openshift.com/static/vendor-plugins-shared~main-chunk-4dc722526d0f0470939e.min.js:31:10364) at Suspense at div at k (https://console-openshift-console.apps.qe-daily-416-0415.qe.azure.devcluster.openshift.com/static/main-chunk-7643d3f1edb399bb7d65.min.js:1:118938)
Expected results:
no issue was found
Additional info:
Enable pod security labels when creating the pod via the UI:
$ oc label namespace <ns> security.openshift.io/scc.podSecurityLabelSync=false --overwrite
$ oc label namespace <ns> pod-security.kubernetes.io/enforce=privileged --overwrite
$ oc label namespace <ns> pod-security.kubernetes.io/audit=privileged --overwrite
$ oc label namespace <ns> fix
The convention is a format like node-role.kubernetes.io/role: "", not node-role.kubernetes.io: role; however, ROSA uses the latter format to indicate the infra role. This changes the node watch code to ignore it, as well as other potential variations like node-role.kubernetes.io/.
The current code panics when run against a ROSA cluster:
E0209 18:10:55.533265 78 runtime.go:79] Observed a panic: runtime.boundsError{x:24, y:23, signed:true, code:0x3} (runtime error: slice bounds out of range [24:23])
goroutine 233 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic({0x7a71840?, 0xc0018e2f48})
k8s.io/apimachinery@v0.27.2/pkg/util/runtime/runtime.go:75 +0x99
k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0x1000251f9fe?})
k8s.io/apimachinery@v0.27.2/pkg/util/runtime/runtime.go:49 +0x75
panic({0x7a71840, 0xc0018e2f48})
runtime/panic.go:884 +0x213
github.com/openshift/origin/pkg/monitortests/node/watchnodes.nodeRoles(0x7ecd7b3?)
github.com/openshift/origin/pkg/monitortests/node/watchnodes/node.go:187 +0x1e5
github.com/openshift/origin/pkg/monitortests/node/watchnodes.startNodeMonitoring.func1(0
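A quick way to list which role-label variants actually exist on the nodes (assumes jq is available locally); on ROSA this surfaces the bare node-role.kubernetes.io key that the slice logic above trips over:
$ oc get nodes -o json | jq -r '.items[].metadata.labels | keys[] | select(startswith("node-role.kubernetes.io"))' | sort -u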
Description of problem:
When a proxy.config.openshift.io is specified on a HyperShift cluster (in this case ROSA HCP), the network cluster operator is degraded:
❯ k get co network
NAME      VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
network   4.14.6    True        False         True       2d1h    The configuration is invalid for proxy 'cluster' (readinessEndpoint probe failed for endpoint 'https://api.openshift.com': endpoint probe failed for endpoint 'https://api.openshift.com' using proxy 'http://ip-172-17-1-38.ec2.internal:3128': Get "https://api.openshift.com": Service Unavailable). Use 'oc edit proxy.config.openshift.io cluster' to fix.
https://github.com/openshift/ovn-kubernetes/pull/2135
because the CNO pod runs on the management cluster and does not have connectivity to the customer's proxy which is accessible from the HyperShift worker nodes' network.
Version-Release number of selected component (if applicable):
4.14.6
How reproducible:
100%
Steps to Reproduce:
1. Create a proxy that's only accessible from a HyperShift cluster's workers network 2. Update the cluster's proxy.config.openshift.io cluster object accordingly 3. Observe that the network ClusterOperator is degraded
Actual results:
I'm not sure how important it is that the CNO has connectivity to api.openshift.com and leave it up for discussion. Maybe CNO should ignore the proxy configuration in HyperShift for its own health checks for example.
Expected results:
The network ClusterOperator is not degraded
Additional info:
Updating github.com/IBM/networking-go-sdk to 0.45.0 causes issues with PowerVS infra create as the API has changed.
Description of problem:
On a hybrid cluster with Windows nodes and coreOS nodes mixed, egressIP cannot be applied to coreOS anymore. QE testing profile: 53_IPI on AWS & OVN & WindowsContainer
Version-Release number of selected component (if applicable):
4.14.3
How reproducible:
Always
Steps to Reproduce:
1. Setup cluster with template aos-4_14/ipi-on-aws/versioned-installer-ovn-winc-ci 2. Label on coreOS node as egress node % oc describe node ip-10-0-59-132.us-east-2.compute.internal Name: ip-10-0-59-132.us-east-2.compute.internal Roles: worker Labels: beta.kubernetes.io/arch=amd64 beta.kubernetes.io/instance-type=m6i.xlarge beta.kubernetes.io/os=linux failure-domain.beta.kubernetes.io/region=us-east-2 failure-domain.beta.kubernetes.io/zone=us-east-2b k8s.ovn.org/egress-assignable= kubernetes.io/arch=amd64 kubernetes.io/hostname=ip-10-0-59-132.us-east-2.compute.internal kubernetes.io/os=linux node-role.kubernetes.io/worker= node.kubernetes.io/instance-type=m6i.xlarge node.openshift.io/os_id=rhcos topology.ebs.csi.aws.com/zone=us-east-2b topology.kubernetes.io/region=us-east-2 topology.kubernetes.io/zone=us-east-2b Annotations: cloud.network.openshift.io/egress-ipconfig: [{"interface":"eni-0c661bbdbb0dde54a","ifaddr":{"ipv4":"10.0.32.0/19"},"capacity":{"ipv4":14,"ipv6":15}}] csi.volume.kubernetes.io/nodeid: {"ebs.csi.aws.com":"i-0629862832fff4ae3"} k8s.ovn.org/host-cidrs: ["10.0.59.132/19"] k8s.ovn.org/hybrid-overlay-distributed-router-gateway-ip: 10.129.2.13 k8s.ovn.org/hybrid-overlay-distributed-router-gateway-mac: 0a:58:0a:81:02:0d k8s.ovn.org/l3-gateway-config: {"default":{"mode":"shared","interface-id":"br-ex_ip-10-0-59-132.us-east-2.compute.internal","mac-address":"06:06:e2:7b:9c:45","ip-address... k8s.ovn.org/network-ids: {"default":"0"} k8s.ovn.org/node-chassis-id: fa1ac464-5744-40e9-96ca-6cdc74ffa9be k8s.ovn.org/node-gateway-router-lrp-ifaddr: {"ipv4":"100.64.0.7/16"} k8s.ovn.org/node-id: 7 k8s.ovn.org/node-mgmt-port-mac-address: a6:25:4e:55:55:36 k8s.ovn.org/node-primary-ifaddr: {"ipv4":"10.0.59.132/19"} k8s.ovn.org/node-subnets: {"default":["10.129.2.0/23"]} k8s.ovn.org/node-transit-switch-port-ifaddr: {"ipv4":"100.88.0.7/16"} k8s.ovn.org/remote-zone-migrated: ip-10-0-59-132.us-east-2.compute.internal k8s.ovn.org/zone-name: ip-10-0-59-132.us-east-2.compute.internal machine.openshift.io/machine: openshift-machine-api/wduan-debug-1120-vtxkp-worker-us-east-2b-z6wlc machineconfiguration.openshift.io/controlPlaneTopology: HighlyAvailable machineconfiguration.openshift.io/currentConfig: rendered-worker-5a29871efb344f7e3a3dc51c42c21113 machineconfiguration.openshift.io/desiredConfig: rendered-worker-5a29871efb344f7e3a3dc51c42c21113 machineconfiguration.openshift.io/desiredDrain: uncordon-rendered-worker-5a29871efb344f7e3a3dc51c42c21113 machineconfiguration.openshift.io/lastAppliedDrain: uncordon-rendered-worker-5a29871efb344f7e3a3dc51c42c21113 machineconfiguration.openshift.io/lastSyncedControllerConfigResourceVersion: 22806 machineconfiguration.openshift.io/reason: machineconfiguration.openshift.io/state: Done volumes.kubernetes.io/controller-managed-attach-detach: true CreationTimestamp: Mon, 20 Nov 2023 09:46:53 +0800 Taints: <none> Unschedulable: false Lease: HolderIdentity: ip-10-0-59-132.us-east-2.compute.internal AcquireTime: <unset> RenewTime: Mon, 20 Nov 2023 14:01:05 +0800 Conditions: Type Status LastHeartbeatTime LastTransitionTime Reason Message ---- ------ ----------------- ------------------ ------ ------- MemoryPressure False Mon, 20 Nov 2023 13:57:33 +0800 Mon, 20 Nov 2023 09:46:53 +0800 KubeletHasSufficientMemory kubelet has sufficient memory available DiskPressure False Mon, 20 Nov 2023 13:57:33 +0800 Mon, 20 Nov 2023 09:46:53 +0800 KubeletHasNoDiskPressure kubelet has no disk pressure PIDPressure False Mon, 20 Nov 2023 13:57:33 +0800 Mon, 20 Nov 2023 09:46:53 +0800 
KubeletHasSufficientPID kubelet has sufficient PID available Ready True Mon, 20 Nov 2023 13:57:33 +0800 Mon, 20 Nov 2023 09:47:34 +0800 KubeletReady kubelet is posting ready status Addresses: InternalIP: 10.0.59.132 InternalDNS: ip-10-0-59-132.us-east-2.compute.internal Hostname: ip-10-0-59-132.us-east-2.compute.internal Capacity: cpu: 4 ephemeral-storage: 125238252Ki hugepages-1Gi: 0 hugepages-2Mi: 0 memory: 16092956Ki pods: 250 Allocatable: cpu: 3500m ephemeral-storage: 114345831029 hugepages-1Gi: 0 hugepages-2Mi: 0 memory: 14941980Ki pods: 250 System Info: Machine ID: ec21151a2a80230ce1e1926b4f8a902c System UUID: ec21151a-2a80-230c-e1e1-926b4f8a902c Boot ID: cf4b2e39-05ad-4aea-8e53-be669b212c4f Kernel Version: 5.14.0-284.41.1.el9_2.x86_64 OS Image: Red Hat Enterprise Linux CoreOS 414.92.202311150705-0 (Plow) Operating System: linux Architecture: amd64 Container Runtime Version: cri-o://1.27.1-13.1.rhaos4.14.git956c5f7.el9 Kubelet Version: v1.27.6+b49f9d1 Kube-Proxy Version: v1.27.6+b49f9d1 ProviderID: aws:///us-east-2b/i-0629862832fff4ae3 Non-terminated Pods: (21 in total) Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits Age --------- ---- ------------ ---------- --------------- ------------- --- openshift-cluster-csi-drivers aws-ebs-csi-driver-node-tlw5h 30m (0%) 0 (0%) 150Mi (1%) 0 (0%) 4h14m openshift-cluster-node-tuning-operator tuned-4fvgv 10m (0%) 0 (0%) 50Mi (0%) 0 (0%) 4h14m openshift-dns dns-default-z89zl 60m (1%) 0 (0%) 110Mi (0%) 0 (0%) 11m openshift-dns node-resolver-v9stn 5m (0%) 0 (0%) 21Mi (0%) 0 (0%) 4h14m openshift-image-registry image-registry-67b88dc677-76hfn 100m (2%) 0 (0%) 256Mi (1%) 0 (0%) 4h14m openshift-image-registry node-ca-hw62n 10m (0%) 0 (0%) 10Mi (0%) 0 (0%) 4h14m openshift-ingress-canary ingress-canary-9r9f8 10m (0%) 0 (0%) 20Mi (0%) 0 (0%) 4h13m openshift-ingress router-default-5957f4f4c6-tl9gs 100m (2%) 0 (0%) 256Mi (1%) 0 (0%) 4h18m openshift-machine-config-operator machine-config-daemon-h7fx4 40m (1%) 0 (0%) 100Mi (0%) 0 (0%) 4h14m openshift-monitoring alertmanager-main-1 9m (0%) 0 (0%) 120Mi (0%) 0 (0%) 4h12m openshift-monitoring monitoring-plugin-68995cb674-w2wr9 10m (0%) 0 (0%) 50Mi (0%) 0 (0%) 4h13m openshift-monitoring node-exporter-kbq8z 9m (0%) 0 (0%) 47Mi (0%) 0 (0%) 4h13m openshift-monitoring prometheus-adapter-54fc7b9c87-sg4vt 1m (0%) 0 (0%) 40Mi (0%) 0 (0%) 4h13m openshift-monitoring prometheus-k8s-1 75m (2%) 0 (0%) 1104Mi (7%) 0 (0%) 4h12m openshift-monitoring prometheus-operator-admission-webhook-84b7fffcdc-x8hsz 5m (0%) 0 (0%) 30Mi (0%) 0 (0%) 4h18m openshift-monitoring thanos-querier-59cbd86d58-cjkxt 15m (0%) 0 (0%) 92Mi (0%) 0 (0%) 4h13m openshift-multus multus-7gjnt 10m (0%) 0 (0%) 65Mi (0%) 0 (0%) 4h14m openshift-multus multus-additional-cni-plugins-gn7x9 10m (0%) 0 (0%) 10Mi (0%) 0 (0%) 4h14m openshift-multus network-metrics-daemon-88tf6 20m (0%) 0 (0%) 120Mi (0%) 0 (0%) 4h14m openshift-network-diagnostics network-check-target-kpv5v 10m (0%) 0 (0%) 15Mi (0%) 0 (0%) 4h14m openshift-ovn-kubernetes ovnkube-node-74nl9 80m (2%) 0 (0%) 1630Mi (11%) 0 (0%) 3h51m Allocated resources: (Total limits may be over 100 percent, i.e., overcommitted.) Resource Requests Limits -------- -------- ------ cpu 619m (17%) 0 (0%) memory 4296Mi (29%) 0 (0%) ephemeral-storage 0 (0%) 0 (0%) hugepages-1Gi 0 (0%) 0 (0%) hugepages-2Mi 0 (0%) 0 (0%) Events: <none> % oc get node -l k8s.ovn.org/egress-assignable= NAME STATUS ROLES AGE VERSION ip-10-0-59-132.us-east-2.compute.internal Ready worker 4h14m v1.27.6+b49f9d1 3. Create egressIP object
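For reference, a minimal sketch of the EgressIP object from step 3, matching the IP shown in the actual results; the namespace selector label is an assumption for illustration:
$ cat <<EOF | oc create -f -
apiVersion: k8s.ovn.org/v1
kind: EgressIP
metadata:
  name: egressip-1
spec:
  egressIPs:
  - 10.0.59.101
  namespaceSelector:
    matchLabels:
      env: qe
EOF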
Actual results:
% oc get egressip NAME EGRESSIPS ASSIGNED NODE ASSIGNED EGRESSIPS egressip-1 10.0.59.101 % oc get cloudprivateipconfig No resources found
Expected results:
The egressIP should be applied to egress node
Additional info:
Please review the following PR: https://github.com/openshift/egress-router-cni/pull/79
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/cluster-version-operator/pull/1004
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/ironic-agent-image/pull/97
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/operator-framework-rukpak/pull/68
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
This is a clone of issue OCPBUGS-33453. The following is the description of the original issue:
—
Description of problem:
Can't access the openshift namespace images without auth after granting public access to the openshift namespace
Version-Release number of selected component (if applicable):
4.16.0-0.nightly-2024-05-05-102537
How reproducible:
always
Steps to Reproduce:
1. $ oc patch configs.imageregistry.operator.openshift.io/cluster --patch '{"spec":{"defaultRoute":true}}' --type=merge $ HOST=$(oc get route default-route -n openshift-image-registry --template='{{ .spec.host }}') 2. $ oc adm policy add-role-to-group system:image-puller system:unauthenticated --namespace openshift Warning: Group 'system:unauthenticated' not found clusterrole.rbac.authorization.k8s.io/system:image-puller added: "system:unauthenticated" 3. Try to fetch image metadata: $ oc image info --insecure "${HOST}/openshift/cli:latest"
Actual results:
$ oc image info default-route-openshift-image-registry.apps.wxj-a41659.qe.azure.devcluster.openshift.com/openshift/cli:latest --insecure error: unable to read image default-route-openshift-image-registry.apps.wxj-a41659.qe.azure.devcluster.openshift.com/openshift/cli:latest: unauthorized: authentication required
Expected results:
Could get the public image info without auth
Additional info:
This is a regression in 4.16; this feature works on 4.15 and below.
We added a carry patch to change the healthcheck behaviour in the Azure CCM: https://github.com/openshift/cloud-provider-azure/pull/72 and whilst we opened an upstream twin PR for that https://github.com/kubernetes-sigs/cloud-provider-azure/pull/3887 it got closed in favour of a different approach https://github.com/kubernetes-sigs/cloud-provider-azure/pull/4891 .
As such in the next rebase we need to drop the commit introduced in 72, in favour of downstreaming through the rebase the change in 4891. While doing that we need to explicitly set the new probe behaviour, as the default is still the classic behaviour, which doesn't work with our cluster architecture setup on Azure.
For the steps on how to do this, we can follow this comment: https://github.com/openshift/cloud-provider-azure/pull/88#issuecomment-1803832076
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
manifests are duplicated with cluster-config-api image
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
Now the 4.16 OCP payload only contains oc.rhel9; oc.rhel8 can't be found. oc adm release extract --command='oc.rhel8' registry.ci.openshift.org/ocp/release:4.16.0-0.nightly-2024-01-24-133352 --command-os=linux/amd64 -a tmp/config.json --to octest/ error: image did not contain usr/share/openshift/linux_amd64/oc.rhel8
Version-Release number of selected component (if applicable):
4.16
How reproducible:
always
Steps to Reproduce:
1.oc adm release extract --command='oc.rhel8' registry.ci.openshift.org/ocp/release:4.16.0-0.nightly-2024-01-24-133352 --command-os=linux/amd64 -a tmp/config.json --to octest/ 2. 3.
Actual results:
failed to extract oc.rhel8
Expected results:
The OCP payload should contain oc.rhel8.
Additional info:
This is a clone of issue OCPBUGS-32348. The following is the description of the original issue:
—
After fixing https://issues.redhat.com/browse/OCPBUGS-29919 by merging https://github.com/openshift/baremetal-runtimecfg/pull/301, we have lost the ability to properly debug the logic for selecting the Node IP used in runtimecfg.
In order to preserve debugability of this component, it should be possible to selectively enable verbose logs.
Description of problem:
tested https://github.com/openshift/cluster-monitoring-operator/pull/2187 with PR
launch 4.15,openshift/cluster-monitoring-operator#2187 aws
don't find "scrape.timestamp-tolerance" setting in prometheus and prometheus pod, no result for below commands
$ oc -n openshift-monitoring get prometheus k8s -oyaml | grep -i "scrape.timestamp-tolerance" $ oc -n openshift-monitoring get pod prometheus-k8s-0 -oyaml | grep -i "scrape.timestamp-tolerance" $ oc -n openshift-monitoring get sts prometheus-k8s -oyaml | grep -i "scrape.timestamp-tolerance"
It is not in the Prometheus configuration file either
$ oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- cat /etc/prometheus/config_out/prometheus.env.yaml | head global: evaluation_interval: 30s scrape_interval: 30s external_labels: prometheus: openshift-monitoring/k8s prometheus_replica: prometheus-k8s-0 rule_files: - /etc/prometheus/rules/prometheus-k8s-rulefiles-0/*.yaml scrape_configs: - job_name: serviceMonitor/openshift-apiserver-operator/openshift-apiserver-operator/0
Please review the following PR: https://github.com/openshift/baremetal-runtimecfg/pull/300
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
Snapshot support is being delivered for kubevirt-csi in 4.16, but the cli used to configure snapshot support did not expose the argument that makes using snapshots possible. The cli arg [--infra-volumesnapshot-class-mapping] was added to the developer cli [hypershift] but never made it to the productized cli [hcp] that end users will use.
Version-Release number of selected component (if applicable):
4.16
How reproducible:
100%
Steps to Reproduce:
1. hcp create cluster kubevirt -h | grep infra-volumesnapshot-class-mapping 2. 3.
Actual results:
no value is found
Expected results:
the infra-volumesnapshot-class-mapping cli arg should be found
Additional info:
Please review the following PR: https://github.com/openshift/cluster-authentication-operator/pull/644
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
The following binaries need to be extracted from the release payload for both rhel8 and rhel9: oc, ccoctl, opm, openshift-install, oc-mirror. The images that contain these should produce artifacts of both kinds in some location, and probably make the artifact of their architecture available under a normal location in the path. Example:
/usr/share/<binary>.rhel8
/usr/share/<binary>.rhel9
/usr/bin/<binary>
This ticket is about getting "oc adm release extract" to do the right thing in a backwards-compatible way. If both binaries are available, get those. If not, get them from the old location.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
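A minimal sketch of what the backwards-compatible extraction is expected to look like once both variants ship, reusing the flags shown in the related bug above; the release pullspec and pull-secret path are placeholders:
$ oc adm release extract --command=oc.rhel8 --command-os=linux/amd64 -a pull-secret.json --to ./octest <release-image>
$ oc adm release extract --command=oc.rhel9 --command-os=linux/amd64 -a pull-secret.json --to ./octest <release-image>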
Description of problem:
The `oc adm inspect` command should not collect previous.log files that do not correspond to the --since/--since-time options
Version-Release number of selected component (if applicable):
4.16
How reproducible:
always
Steps to Reproduce:
1. `oc adm inspect --since-time="2024-01-25T01:35:27Z" ns/openshift-multus`
Actual results:
It also collects previous.log files that do not correspond to the specified time.
Expected results:
Only the logs after --since/--since-time should be collected.
Additional info:
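A minimal sketch of how to check what gets collected, assuming --dest-dir is used so the output location is known:
$ oc adm inspect --since-time="2024-01-25T01:35:27Z" ns/openshift-multus --dest-dir=./inspect-out
$ find ./inspect-out -name previous.log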
This is a clone of issue OCPBUGS-35099. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-4466. The following is the description of the original issue:
—
Description of problem:
Deploying a compact 3-node cluster on GCP, by setting mastersSchedulable to true and removing the worker machineset YAMLs, resulted in a panic
Version-Release number of selected component (if applicable):
$ openshift-install version openshift-install 4.13.0-0.nightly-2022-12-04-194803 built from commit cc689a21044a76020b82902056c55d2002e454bd release image registry.ci.openshift.org/ocp/release@sha256:9e61cdf7bd13b758343a3ba762cdea301f9b687737d77ef912c6788cbd6a67ea release architecture amd64
How reproducible:
Always
Steps to Reproduce:
1. create manifests 2. set 'spec.mastersSchedulable' as 'true', in <installation dir>/manifests/cluster-scheduler-02-config.yml 3. remove the worker machineset YAML file from <installation dir>/openshift directory 4. create cluster
Actual results:
Got "panic: runtime error: index out of range [0] with length 0".
Expected results:
The installation should succeed, or give clear error messages.
Additional info:
$ openshift-install version
openshift-install 4.13.0-0.nightly-2022-12-04-194803
built from commit cc689a21044a76020b82902056c55d2002e454bd
release image registry.ci.openshift.org/ocp/release@sha256:9e61cdf7bd13b758343a3ba762cdea301f9b687737d77ef912c6788cbd6a67ea
release architecture amd64
$
$ openshift-install create manifests --dir test1
? SSH Public Key /home/fedora/.ssh/openshift-qe.pub
? Platform gcp
INFO Credentials loaded from file "/home/fedora/.gcp/osServiceAccount.json"
? Project ID OpenShift QE (openshift-qe)
? Region us-central1
? Base Domain qe.gcp.devcluster.openshift.com
? Cluster Name jiwei-1205a
? Pull Secret [? for help] ******
INFO Manifests created in: test1/manifests and test1/openshift
$
$ vim test1/manifests/cluster-scheduler-02-config.yml
$ yq-3.3.0 r test1/manifests/cluster-scheduler-02-config.yml spec.mastersSchedulable
true
$
$ rm -f test1/openshift/99_openshift-cluster-api_worker-machineset-?.yaml
$
$ tree test1
test1
├── manifests
│   ├── cloud-controller-uid-config.yml
│   ├── cloud-provider-config.yaml
│   ├── cluster-config.yaml
│   ├── cluster-dns-02-config.yml
│   ├── cluster-infrastructure-02-config.yml
│   ├── cluster-ingress-02-config.yml
│   ├── cluster-network-01-crd.yml
│   ├── cluster-network-02-config.yml
│   ├── cluster-proxy-01-config.yaml
│   ├── cluster-scheduler-02-config.yml
│   ├── cvo-overrides.yaml
│   ├── kube-cloud-config.yaml
│   ├── kube-system-configmap-root-ca.yaml
│   ├── machine-config-server-tls-secret.yaml
│   └── openshift-config-secret-pull-secret.yaml
└── openshift
    ├── 99_cloud-creds-secret.yaml
    ├── 99_kubeadmin-password-secret.yaml
    ├── 99_openshift-cluster-api_master-machines-0.yaml
    ├── 99_openshift-cluster-api_master-machines-1.yaml
    ├── 99_openshift-cluster-api_master-machines-2.yaml
    ├── 99_openshift-cluster-api_master-user-data-secret.yaml
    ├── 99_openshift-cluster-api_worker-user-data-secret.yaml
    ├── 99_openshift-machineconfig_99-master-ssh.yaml
    ├── 99_openshift-machineconfig_99-worker-ssh.yaml
    ├── 99_role-cloud-creds-secret-reader.yaml
    └── openshift-install-manifests.yaml

2 directories, 26 files
$
$ openshift-install create cluster --dir test1
INFO Consuming Openshift Manifests from target directory
INFO Consuming Master Machines from target directory
INFO Consuming Worker Machines from target directory
INFO Consuming OpenShift Install (Manifests) from target directory
INFO Consuming Common Manifests from target directory
INFO Credentials loaded from file "/home/fedora/.gcp/osServiceAccount.json"
panic: runtime error: index out of range [0] with length 0

goroutine 1 [running]:
github.com/openshift/installer/pkg/tfvars/gcp.TFVars({{{0xc000cf6a40, 0xc}, {0x0, 0x0}, {0xc0011d4a80, 0x91d}}, 0x1, 0x1, {0xc0010abda0, 0x58}, ...})
	/go/src/github.com/openshift/installer/pkg/tfvars/gcp/gcp.go:70 +0x66f
github.com/openshift/installer/pkg/asset/cluster.(*TerraformVariables).Generate(0x1daff070, 0xc000cef530?)
	/go/src/github.com/openshift/installer/pkg/asset/cluster/tfvars.go:479 +0x6bf8
github.com/openshift/installer/pkg/asset/store.(*storeImpl).fetch(0xc000c78870, {0x1a777f40, 0x1daff070}, {0x0, 0x0})
	/go/src/github.com/openshift/installer/pkg/asset/store/store.go:226 +0x5fa
github.com/openshift/installer/pkg/asset/store.(*storeImpl).Fetch(0x7ffc4c21413b?, {0x1a777f40, 0x1daff070}, {0x1dadc7e0, 0x8, 0x8})
	/go/src/github.com/openshift/installer/pkg/asset/store/store.go:76 +0x48
main.runTargetCmd.func1({0x7ffc4c21413b, 0x5})
	/go/src/github.com/openshift/installer/cmd/openshift-install/create.go:259 +0x125
main.runTargetCmd.func2(0x1dae27a0?, {0xc000c702c0?, 0x2?, 0x2?})
	/go/src/github.com/openshift/installer/cmd/openshift-install/create.go:289 +0xe7
github.com/spf13/cobra.(*Command).execute(0x1dae27a0, {0xc000c70280, 0x2, 0x2})
	/go/src/github.com/openshift/installer/vendor/github.com/spf13/cobra/command.go:876 +0x67b
github.com/spf13/cobra.(*Command).ExecuteC(0xc000c3a500)
	/go/src/github.com/openshift/installer/vendor/github.com/spf13/cobra/command.go:990 +0x3bd
github.com/spf13/cobra.(*Command).Execute(...)
	/go/src/github.com/openshift/installer/vendor/github.com/spf13/cobra/command.go:918
main.installerMain()
	/go/src/github.com/openshift/installer/cmd/openshift-install/main.go:61 +0x2b0
main.main()
	/go/src/github.com/openshift/installer/cmd/openshift-install/main.go:38 +0xff
$
https://github.com/openshift/console-operator/pull/889 is causing failures in hypershift e2e https://testgrid.k8s.io/redhat-hypershift#4.16-aws-ovn
Some payloads are affected too. Retrying sometimes avoided the problem, but the change should still be reverted.
Please review the following PR: https://github.com/openshift/aws-pod-identity-webhook/pull/183
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
This is a clone of issue OCPBUGS-32632. The following is the description of the original issue:
—
Description of problem:
In PR https://github.com/openshift/console/pull/13676 we worked on improving the performance of the PipelineRun list page, and issue https://issues.redhat.com/browse/OCPBUGS-32631 was created to further improve the performance of the PLR list page. Once this is complete, we have to improve the performance of the Pipeline list page by considering the points below:
1. TaskRuns should not be fetched for all the PLRs.
2. Use pipelinerun.status.conditions.message to get the status of TaskRuns.
3. For any PLR, if the string pipelinerun.status.conditions.message has data about Task status, use that string instead of fetching TaskRuns.
Tired of scrolling through alerts and pod states that are seldom useful to get to things that we need every day.
This is a tracker bug for issues discovered when working on https://issues.redhat.com/browse/METAL-940. No QA verification will be possible until the feature is implemented much later.
Please review the following PR: https://github.com/openshift/ibm-vpc-block-csi-driver-operator/pull/102
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
This is a clone of issue OCPBUGS-35080. The following is the description of the original issue:
—
In a cluster with an external OIDC environment we need to replace the global refresh sync lock in the OIDC provider with a per-refresh-token one. The work should replace the sync lock that applies to all HTTP-serving spawned goroutines with a sync lock that is specific to each refresh token.
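For illustration, a minimal sketch of the per-refresh-token locking idea, assuming a simple keyed-mutex approach; the names tokenLocks and lockFor are hypothetical and not the provider's actual implementation:

package main

import (
	"fmt"
	"sync"
)

// tokenLocks replaces one global sync.Mutex with a lock per refresh token.
type tokenLocks struct {
	mu    sync.Mutex
	locks map[string]*sync.Mutex
}

func newTokenLocks() *tokenLocks {
	return &tokenLocks{locks: map[string]*sync.Mutex{}}
}

// lockFor returns the mutex guarding a single refresh token, creating it on
// first use. Refresh requests for different tokens no longer serialize behind
// one global lock; only concurrent refreshes of the same token are serialized.
func (t *tokenLocks) lockFor(refreshToken string) *sync.Mutex {
	t.mu.Lock()
	defer t.mu.Unlock()
	m, ok := t.locks[refreshToken]
	if !ok {
		m = &sync.Mutex{}
		t.locks[refreshToken] = m
	}
	return m
}

func main() {
	tl := newTokenLocks()
	l := tl.lockFor("refresh-token-a")
	l.Lock()
	// ... exchange the refresh token with the identity provider ...
	l.Unlock()
	fmt.Println("per-token lock acquired and released")
}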
Description of problem:
Version-Release number of selected component (if applicable):
Steps to Reproduce:
Actual results:
Expected results:
That reduces token refresh request handling time by about 30%.
Additional info:
Maxim Patlasov pointed this out in STOR-1453 but still somehow we missed it. I tested this on 4.15.0-0.ci-2023-11-29-021749.
It is possible to set a custom TLSSecurityProfile without minTLSVersion:
$ oc edit apiserver cluster
...
spec:
tlsSecurityProfile:
type: Custom
custom:
ciphers:
- ECDHE-ECDSA-CHACHA20-POLY1305
- ECDHE-ECDSA-AES128-GCM-SHA256
This causes the controller to crash loop:
$ oc get pods -n openshift-cluster-csi-drivers
NAME READY STATUS RESTARTS AGE
aws-ebs-csi-driver-controller-589c44468b-gjrs2 6/11 CrashLoopBackOff 10 (18s ago) 37s
...
because the `${TLS_MIN_VERSION}` placeholder is never replaced:
- --tls-min-version=${TLS_MIN_VERSION}
- --tls-min-version=${TLS_MIN_VERSION}
- --tls-min-version=${TLS_MIN_VERSION}
- --tls-min-version=${TLS_MIN_VERSION}
- --tls-min-version=${TLS_MIN_VERSION}
The observed config in the ClusterCSIDriver shows an empty string:
$ oc get clustercsidriver ebs.csi.aws.com -o json | jq .spec.observedConfig
{
"targetcsiconfig": {
"servingInfo":
}
}
which means minTLSVersion is empty when we get to this line, and the string replacement is not done:
So it seems we have a couple of options:
1) completely omit the --tls-min-version arg if minTLSVersion is empty, or
2) set --tls-min-version to the same default value we would use if TLSSecurityProfile is not present in the apiserver object
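For illustration, a minimal sketch of option 1, assuming the container arguments are assembled in Go; buildTLSArgs and its shape are assumptions, not the operator's real code:

package main

import (
	"fmt"
	"strings"
)

// buildTLSArgs sketches option 1: only emit --tls-min-version when a minimum
// version was actually observed, so an empty value never leaves an unreplaced
// ${TLS_MIN_VERSION} placeholder in the container arguments.
func buildTLSArgs(ciphers []string, minTLSVersion string) []string {
	args := []string{}
	if len(ciphers) > 0 {
		args = append(args, "--tls-cipher-suites="+strings.Join(ciphers, ","))
	}
	if minTLSVersion != "" {
		args = append(args, "--tls-min-version="+minTLSVersion)
	}
	return args
}

func main() {
	// Custom profile with ciphers but no minTLSVersion: the flag is simply omitted.
	fmt.Println(buildTLSArgs([]string{"ECDHE-ECDSA-CHACHA20-POLY1305", "ECDHE-ECDSA-AES128-GCM-SHA256"}, ""))
}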
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
On a nodepool with autoscaling enabled, the "oc scale nodepool" command disables autoscaling, but leaves an invalid configuration with autoscaling info that should have been cleared.
Version-Release number of selected component (if applicable):
(.venv) [kni@ocp-edge77 ocp-edge-auto_cluster]$ oc version
Client Version: 4.14.14
Kustomize Version: v5.0.1
Server Version: 4.14.14
Kubernetes Version: v1.27.10+28ed2d7
(.venv) [kni@ocp-edge77 ocp-edge-auto_cluster]$
How reproducible:
happens all the time
Steps to Reproduce:
1. deploy a hub cluster with 3 master nodes, and 0 workers, on it, a hostedcluster with 6 nodes(I've used this job to deploy: https://auto-jenkins-csb-kniqe.apps.ocp-c1.prod.psi.redhat.com/job/CI/job/job-runner/2431/) 2. (.venv) [kni@ocp-edge77 ocp-edge-auto_cluster]$ oc -n clusters patch nodepool hosted-0 --type=json -p '[{"op": "remove", "path": "/spec/replicas"},{"op":"add", "path": "/spec/autoScaling", "value": { "max": 6, "min": 6 }}]' nodepool.hypershift.openshift.io/hosted-0 patched (.venv) [kni@ocp-edge77 ocp-edge-auto_cluster]$ oc get nodepool -A NAMESPACE NAME CLUSTER DESIRED NODES CURRENT NODES AUTOSCALING AUTOREPAIR VERSION UPDATINGVERSION UPDATINGCONFIG MESSAGE clusters hosted-0 hosted-0 6 True False 4.14.14 3. scale to 2 nodes in the nodepool: (.venv) [kni@ocp-edge77 ocp-edge-auto_cluster]$ oc scale nodepool/hosted-0 --namespace clusters --kubeconfig ~/clusterconfigs/auth/hub-kubeconfig --replicas=2 nodepool.hypershift.openshift.io/hosted-0 scaled (.venv) [kni@ocp-edge77 ocp-edge-auto_cluster]$ oc get nodepool -A NAMESPACE NAME CLUSTER DESIRED NODES CURRENT NODES AUTOSCALING AUTOREPAIR VERSION UPDATINGVERSION UPDATINGCONFIG MESSAGE clusters hosted-0 hosted-0 2 6 False False 4.14.14 4. and after scaledown ends : (.venv) [kni@ocp-edge77 ocp-edge-auto_cluster]$ oc get nodepool -A NAMESPACE NAME CLUSTER DESIRED NODES CURRENT NODES AUTOSCALING AUTOREPAIR VERSION UPDATINGVERSION UPDATINGCONFIG MESSAGE clusters hosted-0 hosted-0 2 6 False False 4.14.14
Actual results:
(.venv) [kni@ocp-edge77 ocp-edge-auto_cluster]$ oc describe nodepool hosted-0 --namespace clusters --kubeconfig ~/clusterconfigs/auth/hub-kubeconfig Name: hosted-0 Namespace: clusters Labels: <none> Annotations: hypershift.openshift.io/nodePoolCurrentConfig: de17bd57 hypershift.openshift.io/nodePoolCurrentConfigVersion: 84116781 hypershift.openshift.io/nodePoolPlatformMachineTemplate: hosted-0-52df983b API Version: hypershift.openshift.io/v1beta1 Kind: NodePool Metadata: Creation Timestamp: 2024-03-13T22:39:57Z Finalizers: hypershift.openshift.io/finalizer Generation: 4 Owner References: API Version: hypershift.openshift.io/v1beta1 Kind: HostedCluster Name: hosted-0 UID: ec16c5a2-b8dc-4c54-abe8-297020df4442 Resource Version: 818918 UID: 671bdaf2-c8f9-4431-9493-476e9fe44d76 Spec: Arch: amd64 Auto Scaling: Max: 6 Min: 6 Cluster Name: hosted-0 Management: Auto Repair: false Replace: Rolling Update: Max Surge: 1 Max Unavailable: 0 Strategy: RollingUpdate Upgrade Type: InPlace Node Drain Timeout: 30s Platform: Agent: Type: Agent Release: Image: quay.io/openshift-release-dev/ocp-release:4.14.14-x86_64 Replicas: 2
Expected results:
No spec.autoScaling data, only spec.replicas: 2, as it was before enabling autoscaling; the leftover "Spec: Auto Scaling: Max: 6 Min: 6" shown in the actual results should have been cleared.
Additional info:
Description of problem:
[vSphere-CSI-Driver-Operator] does not update the VSphereCSIDriverOperatorCRAvailable status in a timely manner
Version-Release number of selected component (if applicable):
4.15.0-0.nightly-2023-12-04-162702
How reproducible:
Always
Steps to Reproduce:
1. Set up a vSphere cluster with a 4.15 nightly;
2. Back up the secret/vmware-vsphere-cloud-credentials to "vmware-cc.yaml";
3. Change the secret/vmware-vsphere-cloud-credentials password to an invalid value under ns/openshift-cluster-csi-drivers by oc edit;
4. Wait for the cluster storage operator to degrade and the driver controller pods to CrashLoopBackOff, then restore the backup secret "vmware-cc.yaml" by applying it;
5. Observe that the driver controller pods are back to Running and the cluster storage operator is back to healthy.
Actual results:
In Step 5: the driver controller pods come back to Running, but the cluster storage operator is stuck at Degraded: True status for almost 1 hour.
$ oc get po
NAME READY STATUS RESTARTS AGE
vmware-vsphere-csi-driver-controller-664db7d497-b98vt 13/13 Running 0 16s
vmware-vsphere-csi-driver-controller-664db7d497-rtj49 13/13 Running 0 23s
vmware-vsphere-csi-driver-node-2krg6 3/3 Running 1 (3h4m ago) 3h5m
vmware-vsphere-csi-driver-node-2t928 3/3 Running 2 (3h16m ago) 3h16m
vmware-vsphere-csi-driver-node-45kb8 3/3 Running 2 (3h16m ago) 3h16m
vmware-vsphere-csi-driver-node-8vhg9 3/3 Running 1 (3h16m ago) 3h16m
vmware-vsphere-csi-driver-node-9fh9l 3/3 Running 1 (3h4m ago) 3h5m
vmware-vsphere-csi-driver-operator-5954476ddc-rkpqq 1/1 Running 2 (3h10m ago) 3h17m
vmware-vsphere-csi-driver-webhook-7b6b5d99f6-rxdt8 1/1 Running 0 3h16m
vmware-vsphere-csi-driver-webhook-7b6b5d99f6-skcbd 1/1 Running 0 3h16m
$ oc get co/storage -w
NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE
storage 4.15.0-0.nightly-2023-12-04-162702 False False True 8m39s VSphereCSIDriverOperatorCRAvailable: VMwareVSphereControllerAvailable: error logging into vcenter: ServerFaultCode: Cannot complete login due to an incorrect user name or password.
storage 4.15.0-0.nightly-2023-12-04-162702 True False False 0s
$ oc get co/storage
NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE
storage 4.15.0-0.nightly-2023-12-04-162702 True False False 3m41s
Expected results:
In Step 5: after the driver controller pods are back to Running, the cluster storage operator should recover to a healthy status immediately.
Additional info:
Comparing with previous CI results, it seems this issue started happening after 4.15.0-0.nightly-2023-11-25-110147.
This is a clone of issue OCPBUGS-33508. The following is the description of the original issue:
—
Description of problem:
Configure a custom AMI for the cluster via platform.aws.defaultMachinePlatform.amiID, or installconfig.controlPlane.platform.aws.amiID and installconfig.compute.platform.aws.amiID. Master machines still use the default AMI instead of the custom one.
$ aws ec2 describe-instances --filters "Name=tag:kubernetes.io/cluster/yunjiang-cap6-qjc5t,Values=owned" "Name=tag:Name,Values=*worker*" --output json | jq '.Reservations[].Instances[].ImageId' | sort | uniq
"ami-0f71147cab4dbfb61"
$ aws ec2 describe-instances --filters "Name=tag:kubernetes.io/cluster/yunjiang-cap6-qjc5t,Values=owned" "Name=tag:Name,Values=*master*" --output json | jq '.Reservations[].Instances[].ImageId' | sort | uniq
"ami-0ae9b509738034a2c" <- default ami
Version-Release number of selected component (if applicable):
4.16.0-0.nightly-2024-05-08-222442
How reproducible:
Steps to Reproduce:
1.See description 2. 3.
Actual results:
See description
Expected results:
master machines use custom AMI
Additional info:
In all releases tested, and in particular 4.16.0-0.okd-scos-2024-08-21-155613, the Samples operator uses incorrect templates, resulting in the following alert:
Samples operator is detecting problems with imagestream image imports. You can look at the "openshift-samples" ClusterOperator object for details. Most likely there are issues with the external image registry hosting the images that needs to be investigated. Or you can consider marking samples operator Removed if you do not care about having sample imagestreams available. The list of ImageStreams for which samples operator is retrying imports: fuse7-eap-openshift fuse7-eap-openshift-java11 fuse7-java-openshift fuse7-java11-openshift fuse7-karaf-openshift-jdk11 golang httpd java jboss-datagrid73-openshift jboss-eap-xp3-openjdk11-openshift jboss-eap-xp3-openjdk11-runtime-openshift jboss-eap-xp4-openjdk11-openshift jboss-eap-xp4-openjdk11-runtime-openshift jboss-eap74-openjdk11-openshift jboss-eap74-openjdk11-runtime-openshift jboss-eap74-openjdk8-openshift jboss-eap74-openjdk8-runtime-openshift jboss-webserver57-openjdk8-tomcat9-openshift-ubi8 jenkins jenkins-agent-base mariadb mysql nginx nodejs perl php postgresql13-for-sso75-openshift-rhel8 postgresql13-for-sso76-openshift-rhel8 python redis ruby sso75-openshift-rhel8 sso76-openshift-rhel8 fuse7-karaf-openshift jboss-webserver57-openjdk11-tomcat9-openshift-ubi8 postgresql
For example, the sample image for Mysql 8.0 is being pulled from registry.redhat.io/rhscl/mysql-80-rhel7:latest (and cannot be found using the dummy pull secret).
Works correctly on OKD FCOS builds.
Description of problem:
CanaryChecksRepetitiveFailures: Canary route checks for the default ingress controller are failing
doesn't give much detail or suggest next-steps. Expanding it to include at least a more detailed error message would make it easier for the admin to figure out how to resolve the issue.
Version-Release number of selected component (if applicable):
It's in the dev branch, and probably dates back to whenever the canary system was added.
How reproducible:
100%
Steps to Reproduce:
1. Break ingress. FIXME: Maybe by deleting the cloud load balancer, or dropping a firewall in the way, or something.
2. See the canary pods start failing.
3. Ingress operator sets CanaryChecksRepetitiveFailures with a message.
Actual results:
Canary route checks for the default ingress controller are failing
Expected results:
Canary route checks for the default ingress controller are failing: ${ERROR_MESSAGE}. ${POSSIBLY_ALSO_MORE_TROUBLESHOOTING_IDEAS?}
Additional info:
Plumbing the error message through might be as straightforward as passing probeRouteEndpoint's err through to setCanaryFailingStatusCondition for formatting. Or maybe it's more complicated than that?
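For illustration, a minimal sketch of what plumbing the error through could look like, assuming the condition message is built from the probe error; the helper name canaryFailingMessage and the example hostname are hypothetical, not the ingress operator's actual API:

package main

import "fmt"

// canaryFailingMessage formats the underlying probe error into the condition
// message instead of emitting a fixed string.
func canaryFailingMessage(probeErr error) string {
	base := "Canary route checks for the default ingress controller are failing"
	if probeErr == nil {
		return base
	}
	return fmt.Sprintf("%s: %v", base, probeErr)
}

func main() {
	err := fmt.Errorf("error sending canary HTTP request to %q: EOF", "canary-openshift-ingress-canary.apps.example.com")
	fmt.Println(canaryFailingMessage(err))
}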
Description of problem:
The cluster operator "machine-config" degraded due to MachineConfigPool master is not ready, which tells error like "rendered-master-${hash} not found".
Version-Release number of selected component (if applicable):
4.15.0-0.nightly-2023-12-11-033133
How reproducible:
Always. We met the issue in 2 CI profiles, Flexy template "functionality-testing/aos-4_15/upi-on-gcp/versioned-installer-ovn-ipsec-ew-ns-ci", and PROW CI test "periodic-ci-openshift-openshift-tests-private-release-4.15-multi-ec-gcp-ipi-disc-priv-oidc-arm-mixarch-f14".
Steps to Reproduce:
The Flexy template brief steps:
1. "create install-config" and then "create manifests"
2. add a manifest file to configure the ovnkubernetes network for IPsec (please refer to https://gitlab.cee.redhat.com/aosqe/flexy-templates/-/blame/master/functionality-testing/aos-4_15/hosts/create_ign_files.sh#L517-530)
3. (optional) "create ignition-config"
4. UPI installation steps (see OCP doc https://docs.openshift.com/container-platform/4.14/installing/installing_gcp/installing-gcp-user-infra.html#installation-gcp-user-infra-exporting-common-variables)
Actual results:
Installation failed, with the machine-config operator degraded
Expected results:
Installation should succeed.
Additional info:
The must-gather is at https://drive.google.com/file/d/12xbjWUknDL_DRNSS8T_Z3u4d1KrNlJgT/view?usp=drive_link
Please review the following PR: https://github.com/openshift/router/pull/550
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
When specifying a control plane operator image for dev purposes, the control plane operator pod fails to come up with an InvalidImageRef status.
Version-Release number of selected component (if applicable):
Mgmt cluster is 4.15, HyperShift control plane operator is latest from main
How reproducible:
Always
Steps to Reproduce:
1. Create a hosted cluster with an annotation to override control plane operator image and point it to a non-digest image ref.
Actual results:
The cluster fails to come up with the CPO pod failing with InvalidImageRef
Expected results:
The cluster comes up fine.
Additional info:
Please review the following PR: https://github.com/openshift/thanos/pull/142
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
This is a clone of issue OCPBUGS-33428. The following is the description of the original issue:
—
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
The monitoring system for Single Node OpenShift (SNO) cluster is triggering an alert named "HighOverallControlPlaneCPU" related to excessive control plane CPU utilization. However, this alert is misleading as it assumes a multi-node setup with high availability (HA) considerations, which do not apply to SNO deployment.
The customer is receiving MNO alerts in the SNO cluster. Below are the details:
The vDU with 2xRINLINE card is installed on the SNO node with OCP 4.14.14.
Used hardware: Airframe OE22 2U server CPU Intel(R) Xeon Intel(R) Xeon(R) Gold 6428N SPR-SP S3, (32 cores 64 threads) with 128GB memory.
After all vDU pods became running, a few minutes later the following alert was triggered:
"labels":
,
"annotations": {
"description": "Given three control plane nodes, the overall CPU utilization may only be about 2/3 of all available capacity.
This is because if a single control plane node fails, the remaining two must handle the load of the cluster in order to be HA.
If the cluster is using more than 2/3 of all capacity, if one control plane node fails, the remaining two are likely to fail when they take the load.
To fix this, increase the CPU and memory on your control plane nodes.",
"runbook_url": https://github.com/openshift/runbooks/blob/master/alerts/cluster-kube-apiserver-operator/ExtremelyHighIndividualControlPlaneCPU.md,
"summary": "CPU utilization across all three control plane nodes is higher than two control plane nodes can sustain;
a single control plane node outage may cause a cascading failure; increase available CPU."
The alert description is misleading since this cluster is SNO, there is no HA in this cluster.
Increasing CPU capacity in SNO cluster is not an option.
Although the CPU usage is high, this alarm is not correct.
MNO and SNO clusters should have separate alert descriptions.
As a ROSA customer, I want to enforce that my workloads follow AWS best-practices by using AWS Regionalized STS Endpoints instead of the global one.
As Red Hat, I would like to follow AWS best-practices by using AWS Regionalized STS Endpoints instead of the global one.
Per AWS docs:
https://docs.aws.amazon.com/IAM/latest/UserGuide/id_credentials_temp_enable-regions.html
AWS recommends using Regional AWS STS endpoints instead of the global endpoint to reduce latency, build in redundancy, and increase session token validity.
https://docs.aws.amazon.com/sdkref/latest/guide/feature-sts-regionalized-endpoints.html
All new SDK major versions releasing after July 2022 will default to regional. New SDK major versions might remove this setting and use regional behavior. To reduce future impact regarding this change, we recommend you start using regional in your application when possible.
Areas where HyperShift creates STS credentials use regionalized STS endpoints, e.g. https://github.com/openshift/hypershift/blob/ae1caa00ff3a2c2bfc1129f0168efc1e786d1d12/control-plane-operator/hostedclusterconfigoperator/controllers/resources/resources.go#L1225-L1228
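For illustration, a minimal sketch using aws-sdk-go-v2 (an assumption; the actual call sites live in the HyperShift Go code linked above): once a region is configured, the SDK resolves the regional STS endpoint (sts.<region>.amazonaws.com) instead of the global one.

package main

import (
	"context"
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/sts"
)

func main() {
	// With a region set, aws-sdk-go-v2 defaults to the regional STS endpoint.
	cfg, err := config.LoadDefaultConfig(context.TODO(), config.WithRegion("us-east-2"))
	if err != nil {
		log.Fatal(err)
	}
	client := sts.NewFromConfig(cfg)
	out, err := client.GetCallerIdentity(context.TODO(), &sts.GetCallerIdentityInput{})
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(*out.Arn)
}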
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
This is a clone of issue OCPBUGS-38690. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-38622. The following is the description of the original issue:
—
Description of problem:
See https://github.com/prometheus/prometheus/issues/14503 for more details
Version-Release number of selected component (if applicable):
4.16
How reproducible:
Steps to Reproduce:
1. Make Prometheus scrape a target that exposes multiple samples of the same series with different explicit timestamps, for example:
# TYPE requests_per_second_requests gauge
# UNIT requests_per_second_requests requests
# HELP requests_per_second_requests test-description
requests_per_second_requests 16 1722466225604
requests_per_second_requests 14 1722466226604
requests_per_second_requests 40 1722466227604
requests_per_second_requests 15 1722466228604
# EOF
2. Not all the samples will be ingested
3. If Prometheus keeps scraping that target for a while, the PrometheusDuplicateTimestamps alert will fire.
Actual results:
Expected results: all the samples should be considered (of course, if the timestamps are too old or too far in the future, Prometheus may refuse them).
Additional info:
Regression introduced in Prometheus 2.52. Proposed upstream fixes: https://github.com/prometheus/prometheus/pull/14683 https://github.com/prometheus/prometheus/pull/14685
When using the monitoring plugin with the console dashboards plugin, if a custom datasource defined in a dashboard is not found, the default in-cluster Prometheus is used to fetch data. This gives the user the false impression that the custom dashboard is working when, in reality, it should fail.
How to reproduce:
Expected result
The dashboard should display an error as the custom datasource was not found
This is a clone of issue OCPBUGS-35097. The following is the description of the original issue:
—
Description of problem:
Some regions were migrated to PER in the 4.16 cycle. We want to enable them.
Version-Release number of selected component (if applicable):
How reproducible:
Easily
Steps to Reproduce:
1. Try to deploy to some PER enabled regions 2. Fail because the installer does not consider them valid. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
NHC failed to watch Metal3 remediation template
Version-Release number of selected component (if applicable):
OCP4.13 and higher
How reproducible:
100%
Steps to Reproduce:
1. Create Metal3RemediationTemplate 2. Install NHCv.0.7.0 3. Create NHC with Metal3RemediationTemplate
Actual results:
E0131 14:07:51.603803 1 reflector.go:147] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers.go:105: Failed to watch infrastructure.cluster.x-k8s.io/v1beta1, Kind=Metal3RemediationTemplate: failed to list infrastructure.cluster.x-k8s.io/v1beta1, Kind=Metal3RemediationTemplate: metal3remediationtemplates.infrastructure.cluster.x-k8s.io is forbidden: User "system:serviceaccount:openshift-workload-availability:node-healthcheck-controller-manager" cannot list resource "metal3remediationtemplates" in API group "infrastructure.cluster.x-k8s.io" at the cluster scope
E0131 14:07:59.912283 1 reflector.go:147] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers.go:105: Failed to watch infrastructure.cluster.x-k8s.io/v1beta1, Kind=Metal3Remediation: unknown
W0131 14:08:24.831958 1 reflector.go:539] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers.go:105: failed to list infrastructure.cluster.x-k8s.io/v1beta1, Kind=Metal3RemediationTemplate: metal3remediationtemplates.infrastructure.cluster.x-k8s.io is forbidden: User "system:serviceaccount:openshift-workload-availability:node-healthcheck-controller-manager" cannot list resource
Expected results:
No errors
Additional info:
This wasn't supposed to have a junit associated but it looks like it did and is now killing payloads. It started failing because the LB was reaped, and we do not yet have confirmation the new one will be preserved.
This should be pulled out until we've got confirmation that (a) there is no junit for the backend, and (b) the LB is on the preserve whitelist.
Please review the following PR: https://github.com/openshift/image-customization-controller/pull/113
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
Recycler pods are not starting on hostedcontrolplane in disconnected environments (ImagePullBackOff on quay.io/openshift/origin-tools:latest). The root cause is that the recycler-pod template (stored in the recycler-config ConfigMap) on hostedclusters is always pointing to `quay.io/openshift/origin-tools:latest`. The same ConfigMap for the management cluster is correctly pointing to an image which is part of the release payload:
$ oc get cm -n openshift-kube-controller-manager recycler-config -o json | jq -r '.data["recycler-pod.yaml"]' | grep "image"
  image: "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:e458f24c40d41c2c802f7396a61658a5effee823f274be103ac22c717c157308"
but on hosted clusters we have:
$ oc get cm -n clusters-guest414a recycler-config -o json | jq -r '.data["recycler-pod.yaml"]' | grep "image"
  image: quay.io/openshift/origin-tools:latest
This is likely due to: https://github.com/openshift/hypershift/blob/e1b75598a62a06534fab6385d60d0f9a808ccc52/control-plane-operator/controllers/hostedcontrolplane/kcm/config.go#L80
quay.io/openshift/origin-tools:latest is not part of any mirrored release payload and it's referenced by tag, so it will not be available on disconnected environments.
Version-Release number of selected component (if applicable):
v4.14, v4.15, v4.16
How reproducible:
100%
Steps to Reproduce:
1. create an hosted cluster 2. check the content of the recycler-config configmap in an hostedcontrolplane namespace 3.
Actual results:
image field for the recycler-pod template is always pointing to `quay.io/openshift/origin-tools:latest` which is not part of the release payload
Expected results:
image field for the recycler-pod template is pointing to the right image (which one???) as extracted from the release payload
Additional info:
see: https://github.com/openshift/cluster-kube-controller-manager-operator/blob/64b4c1ba/bindata/assets/kube-controller-manager/recycler-cm.yaml#L21 to compare with cluster-kube-controller-manager-operator on OCP
Please review the following PR: https://github.com/openshift/cluster-api-provider-baremetal/pull/208
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
The two tasks will always "UPDATE" the underlying "APIService" resource even when no changes are to be made. This behavior significantly elevates the likelihood of encountering conflicts, especially during upgrades, with other controllers that are concurrently monitoring the same resources (CA controller e.g.). Moreover, this consumes resources for unnecessary work. The tasks rely on CreateOrUpdateAPIService: https://github.com/openshift/cluster-monitoring-operator/blob/5e394dd9de305cb6927a23c31b3f9651aa806fb8/pkg/client/client.go#L1725-L1745 which always "UPDATE" the APIService resource.
Version-Release number of selected component (if applicable):
How reproducible:
Keep a cluster running, then take a look at the audit logs concerning the APIService v1beta1.metrics.k8s.io. You could use: oc adm must-gather -- /usr/bin/gather_audit_logs
Steps to Reproduce:
1. 2. 3.
Actual results:
You would see that every "get" is followed by an "update". You can also take a look at the code taking care of that: https://github.com/openshift/cluster-monitoring-operator/blob/5e394dd9de305cb6927a23c31b3f9651aa806fb8/pkg/client/client.go#L1725-L1745
Expected results:
"Updates" should be avoided if no changes are to be made.
Additional info:
it'd be even better if we could avoid the "get"s, but that would be another subject to discuss.
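For illustration, a minimal sketch of the get/compare/update pattern implied by the expected behavior, using simplified stand-in types rather than the real apiregistration client and CreateOrUpdateAPIService helper:

package main

import (
	"fmt"
	"reflect"
)

// APIServiceSpec stands in for the real apiregistration.k8s.io spec; the point
// of the sketch is the create-or-update flow, not the exact type.
type APIServiceSpec struct {
	Service              string
	Group                string
	Version              string
	GroupPriorityMinimum int32
	VersionPriority      int32
}

type fakeClient struct {
	stored  *APIServiceSpec
	updates int
}

func (c *fakeClient) get() (*APIServiceSpec, bool) { return c.stored, c.stored != nil }
func (c *fakeClient) create(s APIServiceSpec)      { v := s; c.stored = &v }
func (c *fakeClient) update(s APIServiceSpec)      { v := s; c.stored = &v; c.updates++ }

// createOrUpdateAPIService only issues an UPDATE when the desired spec differs
// from what is already stored, instead of updating unconditionally.
func createOrUpdateAPIService(c *fakeClient, desired APIServiceSpec) {
	existing, ok := c.get()
	if !ok {
		c.create(desired)
		return
	}
	if reflect.DeepEqual(*existing, desired) {
		return // nothing to do; avoid a no-op UPDATE and possible conflicts
	}
	c.update(desired)
}

func main() {
	c := &fakeClient{}
	desired := APIServiceSpec{Group: "metrics.k8s.io", Version: "v1beta1", GroupPriorityMinimum: 100, VersionPriority: 100, Service: "openshift-monitoring/metrics-server"}
	createOrUpdateAPIService(c, desired)
	createOrUpdateAPIService(c, desired) // second call is a no-op
	fmt.Println("updates issued:", c.updates)
}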
Description of problem:
[Multi-NIC] Egress traffic connections time out after removing another pod's label in the same namespace
Version-Release number of selected component (if applicable):
4.14.0-0.nightly-2023-10-08-024357
How reproducible:
Always
Steps to Reproduce:
1. Label one node as egress node 2. Create an egressIP object, egressIP was assigned to egress node secondary interface # oc get egressip -o yaml apiVersion: v1 items: - apiVersion: k8s.ovn.org/v1 kind: EgressIP metadata: annotations: kubectl.kubernetes.io/last-applied-configuration: | {"apiVersion":"k8s.ovn.org/v1","kind":"EgressIP","metadata":{"annotations":{},"name":"egressip-66293"},"spec":{"egressIPs":["172.22.0.190"],"namespaceSelector":{"matchLabels":{"org":"qe"}},"podSelector":{"matchLabels":{"color":"pink"}}}} creationTimestamp: "2023-10-08T07:28:04Z" generation: 2 name: egressip-66293 resourceVersion: "461590" uid: f1ca3483-63f1-4f31-99b0-e6a55161c285 spec: egressIPs: - 172.22.0.190 namespaceSelector: matchLabels: org: qe podSelector: matchLabels: color: pink status: items: - egressIP: 172.22.0.190 node: worker-0 kind: List metadata: resourceVersion: "" 3. Created a namespace and two pod under it. % oc get pods -n hrw -o wide NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES hello-pod 1/1 Running 0 6m46s 10.129.2.7 worker-1 <none> <none> hello-pod1 1/1 Running 0 6s 10.131.0.14 worker-0 <none> <none> 4. Add label org=qe to namespace hrw # oc get ns hrw --show-labels NAME STATUS AGE LABELS hrw Active 21m kubernetes.io/metadata.name=hrw,*org=qe,*pod-security.kubernetes.io/audit-version=v1.24,pod-security.kubernetes.io/audit=restricted,pod-security.kubernetes.io/warn-version=v1.24,pod-security.kubernetes.io/warn=restricted 5. At this time, from both pods to access external endpoint, succeeded. % oc rsh -n hrw hello-pod ~ $ curl 172.22.0.1 --connect-timeout 5 <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN"> <html><head> <title>404 Not Found</title> </head><body> <h1>Not Found</h1> <p>The requested URL was not found on this server.</p> </body></html> ~ $ exit % oc rsh -n hrw hello-pod1 ~ $ curl 172.22.0.1 --connect-timeout 5 <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN"> <html><head> <title>404 Not Found</title> </head><body> <h1>Not Found</h1> <p>The requested URL was not found on this server.</p> </body></html> 6. Add label color=pink to both pods % oc label pod hello-pod color=pink -n hrw pod/hello-pod labeled % oc label pod hello-pod1 color=pink -n hrw pod/hello-pod1 labeled 7. Both pods can access external endpoint. 8. Remove label color=pink from pod hello-pod % oc label pod hello-pod color- -n hrw pod/hello-pod unlabeled
Actual results:
Accessing the external endpoint from the pod that kept the label gets a connection timeout:
% oc rsh -n hrw hello-pod1
~ $ curl 172.22.0.1 --connect-timeout 5
curl: (28) Connection timeout after 5000 ms
~ $ curl 172.22.0.1 --connect-timeout 5
curl: (28) Connection timeout after 5000 ms
Note the label was removed from hello-pod, but the access attempt above is from the other pod, hello-pod1, which should still use the egressIP and be able to reach the external endpoint.
Expected results:
Should be able to access external endpoint
Additional info:
Description of problem:
Re-enable the e2e tests; the Red Hat OpenShift Pipelines operator is now available in OperatorHub.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
This bug is to pull in the fixes from another PR: https://github.com/openshift/cluster-etcd-operator/pull/1235. Namely, we were relying on a race condition in the fake library syncing the informer and client lister in order to generate the certificates. The fix entails a lister that goes directly via the client, avoiding the informer.
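For illustration, a rough sketch of a client-backed getter, assuming the intent is to read objects straight from the (possibly fake) client instead of an informer cache so tests don't depend on the informer having synced; the package and names are illustrative, not the operator's actual code:

package certhelpers

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// clientSecretGetter fetches secrets via the API client rather than a lister
// backed by an informer cache, so unit tests with a fake clientset see writes
// immediately without waiting for informer sync.
type clientSecretGetter struct {
	client kubernetes.Interface
}

func (g *clientSecretGetter) Get(ctx context.Context, namespace, name string) (*corev1.Secret, error) {
	return g.client.CoreV1().Secrets(namespace).Get(ctx, name, metav1.GetOptions{})
}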
Version-Release number of selected component (if applicable):
4.16.0
How reproducible:
always, when reordering the statements in the code
Steps to Reproduce:
Reordering the code blocks as in https://github.com/openshift/cluster-etcd-operator/pull/1235/files#diff-273071b77ba329777b70cb3c4d3fb2e33bc8abf45cb3da28cbee512d591ab9ee will immediately expose the race condition in unit tests.
Actual results:
Expected results:
Additional info:
Description of problem:
Increase MAX_NODES_LIMIT to 300 for 4.16 and 200 for 4.15 so that users don't see the alert "Loading is taking longer than expected" on the Topology page.
Version-Release number of selected component (if applicable):
4.15
How reproducible:
Always
Steps to Reproduce:
1. Create more than 100 nodes in a namespace
Additional info:
Please review the following PR: https://github.com/openshift/installer/pull/7817
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
Enabling the installer AWS SDK install path and creating a C2S cluster hits the following fatal error:
level=error msg=failed to fetch Cluster: failed to generate asset "Cluster": failed to create cluster: failed to create bootstrap resources: failed to create bootstrap instance profile: failed to create role (yunjiang-14c2a-t4wp7-bootstrap-role): RequestCanceled: request context canceled
Version-Release number of selected component (if applicable):
4.15.0-0.nightly-2024-01-03-140457 4.16.0-0.nightly-2024-01-03-193825
How reproducible:
Always
Steps to Reproduce:
1. Enable AWS SDK install and create a C2S cluster 2. 3.
Actual results:
failed to create bootstrap instance profile: failed to create role (yunjiang-14c2a-t4wp7-bootstrap-role), bootstrap process failed
Expected results:
bootstrap process can be finished successfully.
Additional info:
No issue with the Terraform-based installation path.
Please review the following PR: https://github.com/openshift/kubernetes-autoscaler/pull/274
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/kube-state-metrics/pull/109
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
When a user tries to perform mirror-to-mirror with all the catalogs in oc-mirror v2, it fails with an error because some catalogs contain images that have both a tag and a digest. oc-mirror v2 fails to mirror such operators and it does not generate IDMS and ITMS.
Version-Release number of selected component (if applicable):
WARNING: This version information is deprecated and will be replaced with the output from --short. Use --output=yaml|json to get the full version. Client Version: version.Info{Major:"", Minor:"", GitVersion:"4.16.0-202404231239.p0.ge7889a7.assembly.stream.el9-e7889a7", GitCommit:"e7889a7ec70dd66b0d6a7ba6dedc3e4b93ebf4de", GitTreeState:"clean", BuildDate:"2024-04-23T17:20:33Z", GoVersion:"go1.21.9 (Red Hat 1.21.9-1.el9_4) X:strictfipsruntime", Compiler:"gc", Platform:"linux/amd64"}
How reproducible:
Always
Steps to Reproduce:
1. Install latest oc-mirror 2. create imageSetconfig as below 3. Now run the command `oc-mirror --v2 -c /tmp/customer.yaml --workspace file:///app1/knarra/customertest docker://localhost:5000 --dest-tls-verify=false`
Actual results:
Verify that mirroring fails with errors as listed below:
2024/04/30 18:24:02 [ERROR] : [Worker] err: Invalid source name docker://quay.io/cilium/cilium-etcd-operator:v2.0.7@sha256:04b8327f7f992693c2cb483b999041ed8f92efc8e14f2a5f3ab95574a65ea2dc: Docker references with both a tag and digest are currently not supported
2024/04/30 18:24:02 [ERROR] : [Worker] err: Invalid source name docker://quay.io/cilium/startup-script:62093c5c233ea914bfa26a10ba41f8780d9b737f@sha256:a1454ca1f93b69ecd2c43482c8e13dc418ae15e28a46009f5934300a20afbdba: Docker references with both a tag and digest are currently not supported
2024/04/30 18:24:02 [ERROR] : [Worker] err: Invalid source name docker://quay.io/coreos/etcd:v3.5.4@sha256:a67fb152d4c53223e96e818420c37f11d05c2d92cf62c05ca5604066c37295e9: Docker references with both a tag and digest are currently not supported
Expected results:
Images with both tag and digest should be skipped while mirroring
Additional info:
https://redhat-internal.slack.com/archives/C02JW6VCYS1/p1714501713082839?thread_ts=1714458218.270709&cid=C02JW6VCYS1
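For illustration, a rough sketch of the expected skip behavior; hasTagAndDigest is a simplified string-based heuristic for references of the form repo:tag@sha256:<hex>, not oc-mirror's actual reference parsing:

package main

import (
	"fmt"
	"strings"
)

// hasTagAndDigest reports whether an image reference carries both a tag and a
// digest. A real implementation would use a proper image-reference parser.
func hasTagAndDigest(ref string) bool {
	name, digestPart, found := strings.Cut(ref, "@")
	if !found || !strings.HasPrefix(digestPart, "sha256:") {
		return false
	}
	// A tag is a ":" that appears after the last "/" of the repository name.
	lastSlash := strings.LastIndex(name, "/")
	return strings.Contains(name[lastSlash+1:], ":")
}

func main() {
	refs := []string{
		"quay.io/cilium/cilium-etcd-operator:v2.0.7@sha256:04b8327f7f992693c2cb483b999041ed8f92efc8e14f2a5f3ab95574a65ea2dc",
		"quay.io/coreos/etcd@sha256:a67fb152d4c53223e96e818420c37f11d05c2d92cf62c05ca5604066c37295e9",
	}
	for _, r := range refs {
		if hasTagAndDigest(r) {
			fmt.Println("skipping (tag+digest):", r)
			continue
		}
		fmt.Println("mirroring:", r)
	}
}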
Description of problem:
Tekton Results API endpoint failed to fetch data on airgapped cluster.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
There are several test cases in the conformance test suite that are failing due to the openshift-multus configuration.
We are running conformance testsuite as part of our Openshift on Openstack CI. We use that just to confirm correct functionality of the cluster. The command we are using to run the test suite is:
openshift-tests run --provider '{\"type\":\"openstack\"}' openshift/conformance/parallel
The name of the tests that failed are:
1. [sig-arch] Managed cluster should ensure platform components have system-* priority class associated [Suite:openshift/conformance/parallel]
Reason is:
6 pods found with invalid priority class (should be openshift-user-critical or begin with system-): openshift-multus/whereabouts-reconciler-6q6h7 (currently "") openshift-multus/whereabouts-reconciler-87dwn (currently "") openshift-multus/whereabouts-reconciler-fvhwv (currently "") openshift-multus/whereabouts-reconciler-h68h5 (currently "") openshift-multus/whereabouts-reconciler-nlz59 (currently "") openshift-multus/whereabouts-reconciler-xsch6 (currently "")
2. [sig-arch] Managed cluster should only include cluster daemonsets that have maxUnavailable or maxSurge update of 10 percent or maxUnavailable of 33 percent [Suite:openshift/conformance/parallel]
Reason is:
fail [github.com/openshift/origin/test/extended/operators/daemon_set.go:105]: Sep 23 16:12:15.283: Daemonsets found that do not meet platform requirements for update strategy: expected daemonset openshift-multus/whereabouts-reconciler to have maxUnavailable 10% or 33% (see comment) instead of 1, or maxSurge 10% instead of 0 Ginkgo exit error 1: exit with code 1
3.[sig-arch] Managed cluster should set requests but not limits [Suite:openshift/conformance/parallel]
Reason is:
fail [github.com/openshift/origin/test/extended/operators/resources.go:196]: Sep 23 16:12:17.489: Pods in platform namespaces are not following resource request/limit rules or do not have an exception granted: apps/v1/DaemonSet/openshift-multus/whereabouts-reconciler/container/whereabouts defines a limit on cpu of 50m which is not allowed (rule: "apps/v1/DaemonSet/openshift-multus/whereabouts-reconciler/container/whereabouts/limit[cpu]") apps/v1/DaemonSet/openshift-multus/whereabouts-reconciler/container/whereabouts defines a limit on memory of 100Mi which is not allowed (rule: "apps/v1/DaemonSet/openshift-multus/whereabouts-reconciler/container/whereabouts/limit[memory]") Ginkgo exit error 1: exit with code 1
4. [sig-node][apigroup:config.openshift.io] CPU Partitioning cluster platform workloads should be annotated correctly for DaemonSets [Suite:openshift/conformance/parallel]
Reason is:
fail [github.com/openshift/origin/test/extended/cpu_partitioning/pods.go:159]: Expected <[]error | len:1, cap:1>: [ <*errors.errorString | 0xc0010fa380>{ s: "daemonset (whereabouts-reconciler) in openshift namespace (openshift-multus) must have pod templates annotated with map[target.workload.openshift.io/management:{\"effect\": \"PreferredDuringScheduling\"}]", }, ] to be empty
How reproducible: Always
Steps to Reproduce: Run conformance testsuite:
https://github.com/openshift/origin/blob/master/test/extended/README.md
Actual results: Testcases failing
Expected results: Testcases passing
This is a clone of issue OCPBUGS-35038. The following is the description of the original issue:
—
Description of problem:
In case the interface changes, we might miss updating AWS and not realize it.
Version-Release number of selected component (if applicable):
4.16+
How reproducible:
always
Steps to Reproduce:
1. 2. 3.
Actual results:
No issue currently but could potentially break in the future.
Expected results:
Additional info:
Description of problem:
The hypershift operator ignores RegistryOverrides (from ICSP/IDMS) inspecting the control-plane-operator-image so on disconnected cluster the user should explicitly set hypershift.openshift.io/control-plane-operator-image annotation pointing to the mirrored image on the internal registry. Example: the correct match is in the IDMS: # oc get imagedigestmirrorset -oyaml | grep -B2 registry.ci.openshift.org/ocp/4.14-2024-02-14-135111 ... - mirrors: - virthost.ostest.test.metalkube.org:5000/localimages/local-release-image source: registry.ci.openshift.org/ocp/4.14-2024-02-14-135111 Creating an hosted cluster with: hcp create cluster kubevirt --image-content-sources /home/mgmt_iscp.yaml --additional-trust-bundle /etc/pki/ca-trust/source/anchors/registry.2.crt --name simone3 --node-pool-replicas 2 --memory 16Gi --cores 4 --root-volume-size 64 --namespace local-cluster --release-image virthost.ostest.test.metalkube.org:5000/localimages/local-release-image@sha256:66c6a46013cda0ad4e2291be3da432fdd03b4a47bf13067e0c7b91fb79eb4539 --pull-secret /tmp/.dockerconfigjson --generate-ssh on the hostedCluster object we see: status: conditions: - lastTransitionTime: "2024-02-14T22:01:30Z" message: 'failed to look up image metadata for registry.ci.openshift.org/ocp/4.14-2024-02-14-135111@sha256:84c74cc05250d0e51fe115274cc67ffcf0a4ac86c831b7fea97e484e646072a6: failed to obtain root manifest for registry.ci.openshift.org/ocp/4.14-2024-02-14-135111@sha256:84c74cc05250d0e51fe115274cc67ffcf0a4ac86c831b7fea97e484e646072a6: unauthorized: authentication required' observedGeneration: 3 reason: ReconciliationError status: "False" type: ReconciliationSucceeded and in the logs of the hypershift operator: {"level":"info","ts":"2024-02-14T22:18:11Z","msg":"registry override coincidence not found","controller":"hostedcluster","controllerGroup":"hypershift.openshift.io","controllerKind":"HostedCluster","HostedCluster":{"name":"simone3","namespace":"local-cluster"},"namespace":"local-cluster","name":"simone3","reconcileID":"6d6a2479-3d54-42e3-9204-8d0ab1013745","image":"4.14-2024-02-14-135111"} {"level":"error","ts":"2024-02-14T22:18:12Z","msg":"Reconciler error","controller":"hostedcluster","controllerGroup":"hypershift.openshift.io","controllerKind":"HostedCluster","HostedCluster":{"name":"simone3","namespace":"local-cluster"},"namespace":"local-cluster","name":"simone3","reconcileID":"6d6a2479-3d54-42e3-9204-8d0ab1013745","error":"failed to look up image metadata for registry.ci.openshift.org/ocp/4.14-2024-02-14-135111@sha256:84c74cc05250d0e51fe115274cc67ffcf0a4ac86c831b7fea97e484e646072a6: failed to obtain root manifest for registry.ci.openshift.org/ocp/4.14-2024-02-14-135111@sha256:84c74cc05250d0e51fe115274cc67ffcf0a4ac86c831b7fea97e484e646072a6: unauthorized: authentication required","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:326\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:273\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:234"} so the hypershift-operator is not using the RegistryOverrides mechanism to inspect the image from the internal registry 
(virthost.ostest.test.metalkube.org:5000/localimages/local-release-image in this example). Explicitly setting annotation: hypershift.openshift.io/control-plane-operator-image: virthost.ostest.test.metalkube.org:5000/localimages/local-release-image@sha256:84c74cc05250d0e51fe115274cc67ffcf0a4ac86c831b7fea97e484e646072a6 on the hosted-cluster directly pointing to the mirrored control-plane-operator image is required to proceed on disconnected environments.
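For illustration, a rough sketch of what honoring an ICSP/IDMS-style override when resolving the control-plane-operator image could look like; applyRegistryOverride is hypothetical and not the actual hypershift code:

package main

import (
	"fmt"
	"strings"
)

// applyRegistryOverride rewrites an image reference to its configured mirror
// when the reference starts with a known source prefix, before any image
// metadata lookup is attempted against the original registry.
func applyRegistryOverride(ref string, overrides map[string]string) string {
	for source, mirror := range overrides {
		if strings.HasPrefix(ref, source) {
			return mirror + strings.TrimPrefix(ref, source)
		}
	}
	return ref
}

func main() {
	overrides := map[string]string{
		"registry.ci.openshift.org/ocp/4.14-2024-02-14-135111": "virthost.ostest.test.metalkube.org:5000/localimages/local-release-image",
	}
	ref := "registry.ci.openshift.org/ocp/4.14-2024-02-14-135111@sha256:84c74cc05250d0e51fe115274cc67ffcf0a4ac86c831b7fea97e484e646072a6"
	fmt.Println(applyRegistryOverride(ref, overrides))
}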
Version-Release number of selected component (if applicable):
4.14, 4.15, 4.16
How reproducible:
100%
Steps to Reproduce:
1. try to deploy an hostedCluster on a disconnected environment without explicitly set hypershift.openshift.io/control-plane-operator-image annotation. 2. 3.
Actual results:
A reconciliation error reported on the hostedCluster object: status: conditions: - lastTransitionTime: "2024-02-14T22:01:30Z" message: 'failed to look up image metadata for registry.ci.openshift.org/ocp/4.14-2024-02-14-135111@sha256:84c74cc05250d0e51fe115274cc67ffcf0a4ac86c831b7fea97e484e646072a6: failed to obtain root manifest for registry.ci.openshift.org/ocp/4.14-2024-02-14-135111@sha256:84c74cc05250d0e51fe115274cc67ffcf0a4ac86c831b7fea97e484e646072a6: unauthorized: authentication required' observedGeneration: 3 reason: ReconciliationError status: "False" type: ReconciliationSucceeded The hostedCluster is not spawn.
Expected results:
The hypershift operator uses the RegistryOverrides mechanism also for the control-plane-operator image. Explicitly setting hypershift.openshift.io/control-plane-operator-image annotation is not required.
Additional info:
- Maybe related to OCPBUGS-29110 - Explicitly setting hypershift.openshift.io/control-plane-operator-image annotation pointing to the mirrored image on the internal registry is a valid workaround.
This is a clone of issue OCPBUGS-38832. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-38722. The following is the description of the original issue:
—
Description of problem:
We should add validation in the Installer when public-only subnets are enabled to make sure that (a sketch of these checks follows below):
1. A warning is printed if OPENSHIFT_INSTALL_AWS_PUBLIC_ONLY is set.
2. Since this flag is only applicable to public clusters, we could consider exiting earlier if publish: Internal.
3. Since this flag is only applicable to BYO-VPC configurations, we could consider exiting earlier if no subnets are provided in the install-config.
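For illustration, a minimal sketch of these checks; installConfig and validatePublicOnly are simplified stand-ins, not the installer's real types or validation code:

package main

import (
	"fmt"
	"os"
)

// installConfig is a simplified stand-in for the installer's install-config.
type installConfig struct {
	Publish string   // "External" or "Internal"
	Subnets []string // BYO-VPC subnet IDs; empty when the installer creates the VPC
}

func validatePublicOnly(cfg installConfig) error {
	if os.Getenv("OPENSHIFT_INSTALL_AWS_PUBLIC_ONLY") == "" {
		return nil
	}
	// Check 1: warn that an internal-only feature flag is in effect.
	fmt.Fprintln(os.Stderr, "WARNING: OPENSHIFT_INSTALL_AWS_PUBLIC_ONLY is set; this is an internal-only feature")
	// Check 2: public-only subnets only make sense for a public cluster.
	if cfg.Publish == "Internal" {
		return fmt.Errorf("public-only subnets are only applicable when publish is External")
	}
	// Check 3: public-only subnets require a BYO-VPC configuration.
	if len(cfg.Subnets) == 0 {
		return fmt.Errorf("public-only subnets require an existing VPC; provide subnets in the install-config")
	}
	return nil
}

func main() {
	os.Setenv("OPENSHIFT_INSTALL_AWS_PUBLIC_ONLY", "true")
	if err := validatePublicOnly(installConfig{Publish: "External"}); err != nil {
		fmt.Println("validation error:", err)
	}
}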
Version-Release number of selected component (if applicable):
all versions that support public-only subnets
How reproducible:
always
Steps to Reproduce:
1. Set OPENSHIFT_INSTALL_AWS_PUBLIC_ONLY 2. Do a cluster install without specifying a VPC. 3.
Actual results:
No warning about the invalid configuration.
Expected results:
Additional info:
This is an internal-only feature, so these validations shouldn't affect the normal path used by customers.
Description of problem:
The PodStartupStorageOperationsFailing alert is not getting raised when there are no (zero) successful mount/attach operations on the node.
Version-Release number of selected component (if applicable):
4.13.0-0.nightly-2023-10-25-185510
How reproducible:
Always
Steps to Reproduce:
1. Install a cluster on any platform.
2. Create sc, pvc, dep.
3. Check that the dep pod reaches ContainerCreating state and check for the alert.
Actual results:
The alert is not getting raised when there are 0 successful mount/attach operations.
Expected results:
The alert should get raised when there are no successful mount/attach operations.
Additional info:
Discussion: https://redhat-internal.slack.com/archives/GK0DA0JR5/p1697793500890839 When we use the same alerting expression from 4.12, we can observe the alert on the OCP web console page.
Installed some operators. After some time, ResolutionFailed shows up:
$ kubectl get subscription.operators -A -o custom-columns='NAMESPACE:.metadata.namespace,NAME:.metadata.name,ResolutionFailed:.status.conditions[?(@.type=="ResolutionFailed")].status,MSG:.status.conditions[?(@.type=="ResolutionFailed")].message' NAMESPACE NAME ResolutionFailed MSG infra-sso rhbk-operator True [failed to populate resolver cache from source redhat-marketplace/openshift-marketplace: failed to list bundles: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 172.30.67.215:50051: connect: connection refused", failed to populate resolver cache from source community-operators/openshift-marketplace: failed to list bundles: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 172.30.14.92:50051: connect: connection refused"] metallb-system metallb-operator-sub True [failed to populate resolver cache from source redhat-marketplace/openshift-marketplace: failed to list bundles: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 172.30.67.215:50051: connect: connection refused", failed to populate resolver cache from source community-operators/openshift-marketplace: failed to list bundles: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 172.30.14.92:50051: connect: connection refused"] multicluster-engine multicluster-engine True [failed to populate resolver cache from source redhat-marketplace/openshift-marketplace: failed to list bundles: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 172.30.67.215:50051: connect: connection refused", failed to populate resolver cache from source community-operators/openshift-marketplace: failed to list bundles: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 172.30.14.92:50051: connect: connection refused"] open-cluster-management acm-operator-subscription True [failed to populate resolver cache from source redhat-marketplace/openshift-marketplace: failed to list bundles: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 172.30.67.215:50051: connect: connection refused", failed to populate resolver cache from source community-operators/openshift-marketplace: failed to list bundles: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 172.30.14.92:50051: connect: connection refused"] openshift-cnv kubevirt-hyperconverged True [failed to populate resolver cache from source community-operators/openshift-marketplace: failed to list bundles: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 172.30.14.92:50051: connect: connection refused", failed to populate resolver cache from source certified-operators/openshift-marketplace: failed to list bundles: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 172.30.202.255:50051: connect: connection refused"] openshift-gitops-operator openshift-gitops-operator True [failed to populate resolver cache from source community-operators/openshift-marketplace: failed to list bundles: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 172.30.14.92:50051: connect: connection refused", failed to populate resolver cache from source 
certified-operators/openshift-marketplace: failed to list bundles: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 172.30.202.255:50051: connect: connection refused"] openshift-local-storage local-storage-operator True [failed to populate resolver cache from source community-operators/openshift-marketplace: failed to list bundles: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 172.30.14.92:50051: connect: connection refused", failed to populate resolver cache from source certified-operators/openshift-marketplace: failed to list bundles: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 172.30.202.255:50051: connect: connection refused"] openshift-nmstate kubernetes-nmstate-operator <none> <none> openshift-operators devworkspace-operator-fast-redhat-operators-openshift-marketplace <none> <none> openshift-operators external-secrets-operator <none> <none> openshift-operators web-terminal <none> <none> openshift-storage lvms <none> <none> openshift-storage mcg-operator-stable-4.14-redhat-operators-openshift-marketplace <none> <none> openshift-storage ocs-operator-stable-4.14-redhat-operators-openshift-marketplace <none> <none> openshift-storage odf-csi-addons-operator-stable-4.14-redhat-operators-openshift-marketplace <none> <none> openshift-storage odf-operator <none> <none>
In the package server logs you can see that at one point the catalog source is not available; after a while the catalog source becomes available again, but the error does not disappear from the subscription.
Package server logs:
time="2023-12-05T14:27:09Z" level=warning msg="error getting bundle stream" action="refresh cache" err="rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp 172.30.37.69:50051: connect: connection refused\"" source="{redhat-operators openshift-marketplace}" time="2023-12-05T14:27:09Z" level=info msg="updating PackageManifest based on CatalogSource changes: {community-operators openshift-marketplace}" action="sync catalogsource" address="community-operators.openshift-marketplace.svc:50051" name=community-operators namespace=openshift-marketplace time="2023-12-05T14:28:26Z" level=info msg="updating PackageManifest based on CatalogSource changes: {redhat-marketplace openshift-marketplace}" action="sync catalogsource" address="redhat-marketplace.openshift-marketplace.svc:50051" name=redhat-marketplace namespace=openshift-marketplace time="2023-12-05T14:30:23Z" level=info msg="updating PackageManifest based on CatalogSource changes: {certified-operators openshift-marketplace}" action="sync catalogsource" address="certified-operators.openshift-marketplace.svc:50051" name=certified-operators namespace=openshift-marketplace time="2023-12-05T14:35:56Z" level=info msg="updating PackageManifest based on CatalogSource changes: {certified-operators openshift-marketplace}" action="sync catalogsource" address="certified-operators.openshift-marketplace.svc:50051" name=certified-operators namespace=openshift-marketplace time="2023-12-05T14:37:28Z" level=info msg="updating PackageManifest based on CatalogSource changes: {community-operators openshift-marketplace}" action="sync catalogsource" address="community-operators.openshift-marketplace.svc:50051" name=community-operators namespace=openshift-marketplace time="2023-12-05T14:37:28Z" level=info msg="updating PackageManifest based on CatalogSource changes: {redhat-operators openshift-marketplace}" action="sync catalogsource" address="redhat-operators.openshift-marketplace.svc:50051" name=redhat-operators namespace=openshift-marketplace time="2023-12-05T14:39:40Z" level=info msg="updating PackageManifest based on CatalogSource changes: {redhat-marketplace openshift-marketplace}" action="sync catalogsource" address="redhat-marketplace.openshift-marketplace.svc:50051" name=redhat-marketplace namespace=openshift-marketplace time="2023-12-05T14:46:07Z" level=info msg="updating PackageManifest based on CatalogSource changes: {certified-operators openshift-marketplace}" action="sync catalogsource" address="certified-operators.openshift-marketplace.svc:50051" name=certified-operators namespace=openshift-marketplace time="2023-12-05T14:47:37Z" level=info msg="updating PackageManifest based on CatalogSource changes: {redhat-operators openshift-marketplace}" action="sync catalogsource" address="redhat-operators.openshift-marketplace.svc:50051" name=redhat-operators namespace=openshift-marketplace time="2023-12-05T14:48:21Z" level=info msg="updating PackageManifest based on CatalogSource changes: {community-operators openshift-marketplace}" action="sync catalogsource" address="community-operators.openshift-marketplace.svc:50051" name=community-operators namespace=openshift-marketplace time="2023-12-05T14:49:53Z" level=info msg="updating
4.14.3
1. Install an operator, for example metallb 2. Wait until the catalog pod becomes unavailable at least once. 3. ResolutionFailed doesn't disappear anymore
The ResolutionFailed condition does not disappear from the subscription.
The ResolutionFailed condition disappears from the subscription once the catalog source is available again.
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Component Readiness has found a potential regression in [sig-cluster-lifecycle] pathological event should not see excessive Back-off restarting failed containers.
Probability of significant regression: 100.00%
Sample (being evaluated) Release: 4.15
Start Time: 2024-02-08T00:00:00Z
End Time: 2024-02-14T23:59:59Z
Success Rate: 91.30%
Successes: 63
Failures: 6
Flakes: 0
Base (historical) Release: 4.14
Start Time: 2023-10-04T00:00:00Z
End Time: 2023-10-31T23:59:59Z
Success Rate: 100.00%
Successes: 735
Failures: 0
Flakes: 0
Note: When you look at the link above you will notice some of the failures mention the bare metal operator. That's being investigated as part of https://issues.redhat.com/browse/OCPBUGS-27760. There have been 3 cases in the last week where the console was in a fail loop. Here's an example:
We need help understanding why this is happening and what needs to be done to avoid it.
This is a clone of issue OCPBUGS-34387. The following is the description of the original issue:
—
Description of problem:
Using "accessTokenInactivityTimeoutSeconds: 900" for "OAuthClient" config. One inactive or idle tab causes session expiry for all other tabs. Following are the tests performed: Test 1 - a single window with a single tab no activity would time out after 15 minutes. Test 2 - a single window two tabs. No activity in the first tab, but was active in the second tab. Timeout occurred for both tabs after 15 minutes. Test 3 - a single window with a single tab and activity, does not time out after 15 minutes. Hence single idle tab causes the user logout from rest of the tabs.
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. Set OAuthClient.accessTokenInactivityTimeoutSeconds to 300 (or any value) 2. Log in to the OCP web console and open multiple tabs. 3. Keep one tab idle and work in the other open tabs. 4. After 5 minutes the session expires for all tabs.
Actual results:
One inactive or idle tab causes session expiry for all other tabs.
Expected results:
Session should not be expired if any tab is not idle.
Additional info:
This is a clone of issue OCPBUGS-37786. The following is the description of the original issue:
—
Description of problem:
In the use case when worker nodes require a proxy for outside access and the control plane is external (and only accessible via the internet), ovnkube-node pods never become available because the ovnkube-controller container cannot reach the Kube APIServer.
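For context, a hosted cluster in this scenario would carry a cluster-wide proxy configuration roughly like the sketch below; the names, hostnames, and ports are placeholders and the patch is illustrative, not taken from the report:
~~~
# Illustrative only: a HostedCluster with a cluster-wide proxy (placeholder values).
oc patch hostedcluster <hosted-cluster-name> -n <namespace> --type merge -p '
spec:
  configuration:
    proxy:
      httpProxy: http://proxy.example.com:3128
      httpsProxy: http://proxy.example.com:3128
      noProxy: .cluster.local,.svc,10.0.0.0/16
'
~~~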
Version-Release number of selected component (if applicable):
How reproducible: Always
Steps to Reproduce:
1. Create an AWS hosted cluster with Public access that requires a proxy to access the internet.
2. Wait for nodes to become active
Actual results:
Nodes join cluster, but never become active
Expected results:
Nodes join cluster and become active
Please review the following PR: https://github.com/openshift/vsphere-problem-detector/pull/144
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
The node selector for the console deployment requires deploying it on the master nodes, while the replica count is determined by the infrastructureTopology, which primarily tracks the workers' setup. When an OpenShift cluster is installed with a single master node and multiple workers, this leads the console deployment to request 2 replicas because infrastructureTopology is set to HighlyAvailable, while ControlPlaneTopology is set to SingleReplica as expected.
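A quick way to see the two topology fields being mixed up here, assuming access to the affected cluster (the command is illustrative; the values below are shown in the Infrastructure resource further down):
~~~
# Show both topology values from the cluster Infrastructure resource.
oc get infrastructure cluster \
  -o jsonpath='controlPlaneTopology={.status.controlPlaneTopology}{"\n"}infrastructureTopology={.status.infrastructureTopology}{"\n"}'
~~~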
Version-Release number of selected component (if applicable):
4.16
How reproducible:
Always
Steps to Reproduce:
1. Install an openshift cluster with 1 master and 2 workers
Actual results:
The installation fails as the replicas for the console deployment is set to 2. apiVersion: config.openshift.io/v1 kind: Infrastructure metadata: creationTimestamp: "2024-01-18T08:34:47Z" generation: 1 name: cluster resourceVersion: "517" uid: d89e60b4-2d9c-4867-a2f8-6e80207dc6b8 spec: cloudConfig: key: config name: cloud-provider-config platformSpec: aws: {} type: AWS status: apiServerInternalURI: https://api-int.adstefa-a12.qe.devcluster.openshift.com:6443 apiServerURL: https://api.adstefa-a12.qe.devcluster.openshift.com:6443 controlPlaneTopology: SingleReplica cpuPartitioning: None etcdDiscoveryDomain: "" infrastructureName: adstefa-a12-6wlvm infrastructureTopology: HighlyAvailable platform: AWS platformStatus: aws: region: us-east-2 type: AWS apiVersion: apps/v1 kind: Deployment metadata: annotations: .... creationTimestamp: "2024-01-18T08:54:23Z" generation: 3 labels: app: console component: ui name: console namespace: openshift-console spec: progressDeadlineSeconds: 600 replicas: 2
Expected results:
The replica count is set to 1, tracking the ControlPlaneTopology value instead of the infrastructureTopology.
Additional info:
Please review the following PR: https://github.com/openshift/operator-framework-catalogd/pull/45
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
The useUserSettings hook is required in dynamic plugins, so it would be nice to have it available in the dynamic plugin SDK instead of duplicating the code in each dynamic plugin.
Please review the following PR: https://github.com/openshift/cloud-provider-azure/pull/100
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
Not able to reproduce it manually, but it frequently happens when running automated scripts.
Version-Release number of selected component (if applicable):
4.14.0-0.nightly-2023-10-05-195247
How reproducible:
Steps to Reproduce:
1. Label the worker-0 node as an egress node and create an egressIP object; the egressIP was assigned to worker-0 successfully on the secondary NIC 2. Block port 9107 on the worker-0 node and label worker-1 as an egress node 3.
Actual results:
EgressIP was not moved to second node % oc get egressip NAME EGRESSIPS ASSIGNED NODE ASSIGNED EGRESSIPS egressip-66330 172.22.0.196 40m Warning EgressIPConflict egressip/egressip-66330 Egress IP egressip-66330 with IP 172.22.0.196 is conflicting with a host (worker-0) IP address and will not be assigned sh-4.4# ip a show enp1s0 2: enp1s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000 link/ether 00:1c:cf:40:5d:25 brd ff:ff:ff:ff:ff:ff inet 172.22.0.109/24 brd 172.22.0.255 scope global dynamic noprefixroute enp1s0 valid_lft 76sec preferred_lft 76sec inet6 fe80::21c:cfff:fe40:5d25/64 scope link noprefixroute valid_lft forever preferred_lft forever
Expected results:
EgressIP should move to second egress node
Additional info:
Workaround: deleting it and recreating it works
% oc get egressip
NAME EGRESSIPS ASSIGNED NODE ASSIGNED EGRESSIPS
egressip-66330 172.22.0.196
% oc delete egressip --all
egressip.k8s.ovn.org "egressip-66330" deleted
% oc create -f ../data/egressip/config1.yaml
egressip.k8s.ovn.org/egressip-3 created
% oc get egressip
NAME EGRESSIPS ASSIGNED NODE ASSIGNED EGRESSIPS
egressip-3 172.22.0.196 worker-1 172.22.0.196
aws single-node jobs are failing starting with https://amd64.ocp.releases.ci.openshift.org/releasestream/4.16.0-0.nightly/release/4.16.0-0.nightly-2024-03-27-123853
periodic-ci-openshift-release-master-nightly-4.16-e2e-aws-ovn-single-node
periodic-ci-openshift-release-master-nightly-4.16-e2e-aws-ovn-single-node-serial
A bunch of operators are degraded. I did notice this, but I am still investigating:
- lastTransitionTime: '2024-03-27T15:56:02Z' message: 'OAuthServerRouteEndpointAccessibleControllerAvailable: failed to retrieve route from cache: route.route.openshift.io "oauth-openshift" not found OAuthServerServiceEndpointAccessibleControllerAvailable: Get "https://172.30.201.206:443/healthz": dial tcp 172.30.201.206:443: connect: connection refused OAuthServerServiceEndpointsEndpointAccessibleControllerAvailable: endpoints "oauth-openshift" not found ReadyIngressNodesAvailable: Authentication requires functional ingress which requires at least one schedulable and ready node. Got 0 worker nodes, 1 master nodes, 0 custom target nodes (none are schedulable or ready for ingress pods). WellKnownAvailable: The well-known endpoint is not yet available: failed to get oauth metadata from openshift-config-managed/oauth-openshift ConfigMap: configmap "oauth-openshift" not found (check authentication operator, it is supposed to create this)'
Description of problem:
On SNO with the DU profile (RT kernel), the tuned profile is always degraded because the net.core.busy_read, net.core.busy_poll and kernel.numa_balancing sysctls do not exist in the RT kernel.
Version-Release number of selected component (if applicable):
4.14.1
How reproducible:
100%
Steps to Reproduce:
1. Deploy SNO with DU profile(RT kernel) 2. Check tuned profile
Actual results:
oc -n openshift-cluster-node-tuning-operator get profile -o yaml apiVersion: v1 items: - apiVersion: tuned.openshift.io/v1 kind: Profile metadata: creationTimestamp: "2023-11-09T18:26:34Z" generation: 2 name: sno.kni-qe-1.lab.eng.rdu2.redhat.com namespace: openshift-cluster-node-tuning-operator ownerReferences: - apiVersion: tuned.openshift.io/v1 blockOwnerDeletion: true controller: true kind: Tuned name: default uid: 4e7c05a2-537e-4212-9009-e2724938dec9 resourceVersion: "287891" uid: 5f4d5819-8f84-4b3b-9340-3d38c41501ff spec: config: debug: false tunedConfig: {} tunedProfile: performance-patch status: conditions: - lastTransitionTime: "2023-11-09T18:26:39Z" message: TuneD profile applied. reason: AsExpected status: "True" type: Applied - lastTransitionTime: "2023-11-09T18:26:39Z" message: 'TuneD daemon issued one or more error message(s) during profile application. TuneD stderr: net.core.rps_default_mask' reason: TunedError status: "True" type: Degraded tunedProfile: performance-patch kind: List metadata: resourceVersion: ""
Expected results:
Not degraded
Additional info:
Looking at the tuned log the following errors show up which are probably causing the profile to get into degraded state: 2023-11-09 18:30:49,287 ERROR tuned.plugins.plugin_sysctl: Failed to read sysctl parameter 'net.core.busy_read', the parameter does not exist 2023-11-09 18:30:49,287 ERROR tuned.plugins.plugin_sysctl: sysctl option net.core.busy_read will not be set, failed to read the original value. 2023-11-09 18:30:49,287 ERROR tuned.plugins.plugin_sysctl: Failed to read sysctl parameter 'net.core.busy_poll', the parameter does not exist 2023-11-09 18:30:49,287 ERROR tuned.plugins.plugin_sysctl: sysctl option net.core.busy_poll will not be set, failed to read the original value. 2023-11-09 18:30:49,287 ERROR tuned.plugins.plugin_sysctl: Failed to read sysctl parameter 'kernel.numa_balancing', the parameter does not exist 2023-11-09 18:30:49,287 ERROR tuned.plugins.plugin_sysctl: sysctl option kernel.numa_balancing will not be set, failed to read the original value. These sysctl parameters seem not to be available with RT kernel.
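A quick way to confirm the parameters really are absent on the RT kernel (illustrative sketch, run on the affected node; the parameter names come from the TuneD errors above):
~~~
# Check each sysctl TuneD complains about; a missing key confirms the RT kernel does not expose it.
for p in net.core.busy_read net.core.busy_poll kernel.numa_balancing; do
  sysctl -n "$p" 2>/dev/null || echo "$p: not present on this kernel"
done
~~~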
This is a clone of issue OCPBUGS-34907. The following is the description of the original issue:
—
Description of problem:
The TechPreviewNoUpgrade featureset could be disabled on a 4.16 cluster after enabling it. But according to the official doc `Enabling this feature set cannot be undone and prevents minor version updates`, it should not be disabled.
# ./oc get featuregate cluster -ojson|jq .spec
{ "featureSet": "TechPreviewNoUpgrade"}
# ./oc patch featuregate cluster --type=json -p '[{"op":"remove", "path":"/spec/featureSet"}]'
featuregate.config.openshift.io/cluster patched
# ./oc get featuregate cluster -ojson|jq .spec
{}
Version-Release number of selected component (if applicable):
4.16.0-0.nightly-2024-06-03-060250
How reproducible:
always
Steps to Reproduce:
1. enable the TechPreviewNoUpgrade fs on a 4.16 cluster 2. then remove it 3.
Actual results:
TechPreviewNoUpgrade featureset was disabled
Expected results:
Enabling this feature set cannot be undone
Additional info:
https://github.com/openshift/api/blob/master/config/v1/types_feature.go#L43-L44
Description of problem:
We want to update the trigger from auto to manual or vice versa. We can do it with the CLI: 'oc set triggers deployment/<name> --manual'. This normally changes the deployment annotation metadata.annotations.image.openshift.io/triggers to "paused: true" (or "paused: false" when set to auto). But when we enable or disable the auto trigger by editing the deployment from the web console, it overrides the annotation with "pause: false" or "pause: true", without the 'd'.
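A minimal sketch of the CLI path described above; the deployment name is a placeholder and the commands are illustrative:
~~~
# Switch the image trigger to manual and inspect the resulting annotation.
oc set triggers deployment/example --manual
oc get deployment example \
  -o jsonpath='{.metadata.annotations.image\.openshift\.io/triggers}{"\n"}'
# The JSON value is expected to contain "paused": "true" (or "false" for auto);
# the bug is that the console form writes the key as "pause" instead.
~~~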
Version-Release number of selected component (if applicable):
How reproducible:
Create a simple httpd application. Follow [1] to set the trigger using the CLI. Steps to set the trigger from the console: Web console -> Deployment -> Edit deployment -> Form view -> Images section -> Enable "Deploy image from an image stream tag" -> Enable "Auto deploy when new Image is available" and save the changes -> check annotations [1] https://docs.openshift.com/container-platform/4.12/openshift_images/triggering-updates-on-imagestream-changes.html
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
code: https://github.com/openshift/console/blob/master/frontend/packages/dev-console/src/utils/resource-label-utils.ts#L78
the okd build image job in ironic-agent-image is failing with the error message
Complete! % Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed 0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0 100 14 100 14 0 0 73 0 --:--:-- --:--:-- --:--:-- 73 File "<stdin>", line 1 404: Not Found ^ SyntaxError: illegal target for annotation INFO[2024-02-29T08:06:27Z] Ran for 4m3s ERRO[2024-02-29T08:06:27Z] Some steps failed: ERRO[2024-02-29T08:06:27Z] * could not run steps: step ironic-agent failed: error occurred handling build ironic-agent-amd64: the build ironic-agent-amd64 failed after 1m57s with reason DockerBuildFailed: Dockerfile build strategy has failed. INFO[2024-02-29T08:06:27Z] Reporting job state 'failed' with reason 'executing_graph:step_failed:building_project_image'
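The traceback suggests an HTTP error page ("404: Not Found") was piped straight into the Python interpreter. A hypothetical hardening of that step, with a placeholder URL, would be to make curl fail on HTTP errors instead of emitting the error body:
~~~
# Hypothetical fix sketch: --fail makes curl exit non-zero on 4xx/5xx responses,
# so the build step aborts instead of feeding "404: Not Found" to python.
curl -fsSL "https://example.com/path/to/script.py" -o /tmp/script.py
python3 /tmp/script.py
~~~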
Description of problem:
The [Jira:"Network / ovn-kubernetes"] monitor test pod-network-avalibility setup test is a frequent offender in the OpenStack CSI jobs. We're seeing it fail on 4.14 up to 4.16.
Example of failed job.
Example of successful job.
It seems like the 1 min timeout is too short and does not give enough time for the pods backing the service to come up.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1.
2.
3.
Actual results:
Expected results:
Additional info:
Please fill in the following template while reporting a bug and provide as much relevant information as possible. Doing so will give us the best chance to find a prompt resolution.
Affected Platforms:
Is it an
If it is an internal RedHat testing failure:
If it is a CI failure:
If it is a customer / SD issue:
Description of problem:
The following tests are failing:
- [sig-storage] In-tree Volumes [Driver: vsphere] [Testpattern: Inline-volume (ext4)] volumes should allow exec of files on the volume [Suite:openshift/conformance/parallel] [Suite:k8s]
- [sig-storage] In-tree Volumes [Driver: vsphere] [Testpattern: Pre-provisioned PV (block volmode)] volumes should store data [Suite:openshift/conformance/parallel] [Suite:k8s]
- [sig-storage] In-tree Volumes [Driver: vsphere] [Testpattern: Pre-provisioned PV (ext4)] volumes should store data [Suite:openshift/conformance/parallel] [Suite:k8s]
Job: https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.16-e2e-vsphere-ovn-upi-zones/1808605014239744000
Version-Release number of selected component (if applicable):
4.16 nightly
How reproducible:
consistently: https://prow.ci.openshift.org/job-history/gs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.16-e2e-vsphere-ovn-upi-zones
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
The story is to track i18n upload/download routine tasks which are perform every sprint.
A.C.
- Upload strings to Memosource at the start of the sprint and reach out to localization team
- Download translated strings from Memsource when it is ready
- Review the translated strings and open a pull request
- Open a followup story for next sprint
After updating the cluster to 4.12.42 (from 4.12.15), the customer noticed issues with scheduled pods starting on the node.
The initial thought was a multus issue, and then we realised that the script /usr/local/bin/configure-ovs.sh was modified and reverting the modification fixed the issue.
Modification:
> if nmcli connection show "$vlan_parent" &> /dev/null; then
>   # if the VLAN connection is configured with a connection UUID as parent, we need to find the underlying device
>   # and create the bridge against it, as the parent connection can be replaced by another bridge.
>   vlan_parent=$(nmcli --get-values GENERAL.DEVICES conn show ${vlan_parent})
> fi
Reference:
4.12.42
Should be reproducible by setting up inactive nmcli connections with the same names as the active ones
Not tested, but this should be something like
1. create inactive same nmcli connections
2. run the script
Script failing
Script should manage the connection using the UUID instead of using the Name.
Or maybe it's an underlying issue in how nmcli manages the relationship between objects.
The issue may be related to the way that nmcli is working, as it should use the UUID to match the `vlan.parent` as it does with the `connection.master`
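A rough sketch of what a UUID-based lookup could look like (illustrative only; this is not the actual configure-ovs.sh change):
~~~
# Resolve the parent connection name to a UUID first, then query by UUID,
# so inactive duplicate profiles with the same name cannot be matched by mistake.
vlan_parent_uuid=$(nmcli --get-values connection.uuid connection show "$vlan_parent" | head -n1)
vlan_parent=$(nmcli --get-values GENERAL.DEVICES connection show "$vlan_parent_uuid")
~~~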
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Hypershift's ignition server expects certain headers. Assisted needs to gather all of this data and pass it when fetching the ignition from hypershift's ignition server.
Work items:
Please review the following PR: https://github.com/openshift/coredns/pull/111
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Code to decide whether to update the Pull Secret replicates most of the functionality of the ApplySecret() func in library-go, which it then calls anyway.
This is hard to read, and misleading for anybody wanting to add similar functionality.
Description of problem:
In the PipelineRun list page, while fetching TaskRuns for a particular PipelineRun, show a loading indicator if the TaskRuns have not been fetched yet
Version-Release number of selected component (if applicable):
4.15.z
How reproducible:
Sometimes
Steps to Reproduce:
1.Create a failed pipelinerun 2.Check Task Status field 3.
Actual results:
Sometimes TaskRun Status value is -
Expected results:
Should show status bars
Additional info:
Description of the problem:
non-lowercase hostname in DHCP breaks assisted installation
How reproducible:
100%
Steps to reproduce:
Actual results:
bootkube fails
Expected results:
bootkube should succeed
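A possible mitigation sketch, not taken from the report: normalize the DHCP-provided hostname to lowercase before the installation components consume it.
~~~
# Illustrative workaround: force the hostname handed out by DHCP to lowercase.
hostnamectl set-hostname "$(hostname | tr '[:upper:]' '[:lower:]')"
~~~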
Please review the following PR: https://github.com/openshift/csi-external-attacher/pull/66
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
This is a clone of issue OCPBUGS-42714. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-36261. The following is the description of the original issue:
—
Description of problem:
In hostedcluster installations, when the following OAuthServer service is configured without any hostname parameter, the oauth route is created in the management cluster with the standard hostname, which follows the pattern of the ingresscontroller wildcard domain (oauth-<hosted-cluster-namespace>.<wildcard-default-ingress-controller-domain>):
~~~
$ oc get hostedcluster -n <namespace> <hosted-cluster-name> -oyaml
  - service: OAuthServer
    servicePublishingStrategy:
      type: Route
~~~
On the other hand, if a custom hostname parameter is configured, the oauth route is created in the management cluster with the following labels:
~~~
$ oc get hostedcluster -n <namespace> <hosted-cluster-name> -oyaml
  - service: OAuthServer
    servicePublishingStrategy:
      route:
        hostname: oauth.<custom-domain>
      type: Route

$ oc get routes -n hcp-ns --show-labels
NAME    HOST/PORT               LABELS
oauth   oauth.<custom-domain>   hypershift.openshift.io/hosted-control-plane=hcp-ns   <---
~~~
The configured label prevents the ingresscontroller from admitting the route, as the following configuration is added by the hypershift operator to the default ingresscontroller resource:
~~~
$ oc get ingresscontroller -n openshift-ingress-default default -oyaml
  routeSelector:
    matchExpressions:
    - key: hypershift.openshift.io/hosted-control-plane   <---
      operator: DoesNotExist                              <---
~~~
This configuration should be allowed, as there are use cases where the route should have a customized hostname. Currently the HCP platform does not allow this configuration and the oauth route does not work.
Version-Release number of selected component (if applicable):
4.15
How reproducible:
Easily
Steps to Reproduce:
1. Install HCP cluster 2. Configure OAuthServer with type Route 3. Add a custom hostname different than default wildcard ingress URL from management cluster
Actual results:
Oauth route is not admitted
Expected results:
Oauth route should be admitted by Ingresscontroller
Additional info:
This is a clone of issue OCPBUGS-33926. The following is the description of the original issue:
—
Description of problem:
During the creation of a 4.16 cluster using the nightly build (--channel-group nightly --version 4.16.0-0.nightly-2024-05-19-235324), cluster creation fails at the IAM role creation step when using the following command:
rosa create cluster --cluster-name $CLUSTER_NAME --sts --mode auto --machine-cidr 10.0.0.0/16 --compute-machine-type m6a.xlarge --region $REGION --oidc-config-id $OIDC_ID --channel-group nightly --version 4.16.0-0.nightly-2024-05-19-235324 --ec2-metadata-http-tokens optional --replicas 2 --service-cidr 172.30.0.0/16 --pod-cidr 10.128.0.0/14 --host-prefix 23 -y
How reproducible:
1. Run the command provided above to create a cluster. 2. Observe the error during the IAM role creation step.
Actual results:
time="2024-05-20T03:21:03Z" level=error msg="failed to fetch Cluster: failed to generate asset \"Cluster\": failed to create cluster: failed during pre-provisioning: failed to create IAM roles: failed to create inline policy for role master: AccessDenied: User: arn:aws:sts::890193308254:assumed-role/ManagedOpenShift-Installer-Role/1716175231092827911 is not authorized to perform: iam:PutRolePolicy on resource: role ManagedOpenShift-ControlPlane-Role because no identity-based policy allows the iam:PutRolePolicy action\n\tstatus code: 403, request id: 27f0f631-abdd-47e9-ba02-a2e71a7487dc" time="2024-05-20T03:21:04Z" level=error msg="error after waiting for command completion" error="exit status 4" installID=wx9l766h time="2024-05-20T03:21:04Z" level=error msg="error provisioning cluster" error="exit status 4" installID=wx9l766h time="2024-05-20T03:21:04Z" level=error msg="error running openshift-install, running deprovision to clean up" error="exit status 4" installID=wx9l766h time="2024-05-20T03:21:04Z" level=debug msg="OpenShift Installer v4.16.0
Expected results:
The cluster should be created successfully without IAM permission errors.
Additional info:
- The IAM role ManagedOpenShift-Installer-Role does not have the necessary permissions to perform iam:PutRolePolicy on the ManagedOpenShift-ControlPlane-Role. - This issue was observed with the nightly build 4.16.0-0.nightly-2024-05-19-235324.
More context: https://redhat-internal.slack.com/archives/C070BJ1NS1E/p1716182046041269
This is failing on hypershift:
[sig-operator] an end user can use OLM can subscribe to the operator [apigroup:config.openshift.io]
Please review the following PR: https://github.com/openshift/thanos/pull/134
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
This is a clone of issue OCPBUGS-39393. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-39111. The following is the description of the original issue:
—
Gather the nodenetworkconfigurationpolicy.nmstate.io/v1 and nodenetworkstate.nmstate.io/v1beta1 cluster scoped resources in the Insights data. These CRs are introduced by the NMState operator.
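For reference, a sketch of how the two cluster-scoped resources would be listed on a cluster where the NMState operator is installed (commands are illustrative):
~~~
# The two cluster-scoped NMState resources to be gathered by Insights.
oc get nodenetworkconfigurationpolicies.nmstate.io
oc get nodenetworkstates.nmstate.io
~~~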
Description of problem:
As a logged-in user I'm unable to log out from a cluster with an external OIDC provider.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. Login into cluster with external OIDC setup 2. 3.
Actual results:
Unable to logout
Expected results:
Logout successfully
Additional info:
Please review the following PR: https://github.com/openshift/api/pull/1700
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
W0109 17:47:02.340203 1 builder.go:109] graceful termination failed, controllers failed with error: failed to get infrastructure name: infrastructureName not set in infrastructure 'cluster'
This is a clone of issue OCPBUGS-42164. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-42143. The following is the description of the original issue:
—
Description of problem:
There is another panic that occurred in https://issues.redhat.com/browse/OCPBUGS-34877?focusedId=25580631&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-25580631 which should be fixed
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Seen in 4.15-related update CI:
$ curl -s 'https://search.ci.openshift.org/search?maxAge=48h&type=junit&name=4.15.*upgrade&context=0&search=clusteroperator/console.*condition/Available.*status/False' | jq -r 'to_entries[].value | to_entries[].value[].context[]' | sed 's|.*clusteroperator/\([^ ]*\) condition/Available reason/\([^ ]*\) status/False[^:]*: \(.*\)|\1 \2 \3|' | sed 's|[.]apps[.][^ /]*|.apps...|g' | sort | uniq -c | sort -n
      1 console RouteHealth_FailedGet failed to GET route (https://console-openshift-console.apps... Get "https://console-openshift-console.apps... dial tcp 52.158.160.194:443: connect: connection refused
      1 console RouteHealth_StatusError route not yet available, https://console-openshift-console.apps... returns '503 Service Unavailable'
      2 console RouteHealth_FailedGet failed to GET route (https://console-openshift-console.apps... Get "https://console-openshift-console.apps... dial tcp: lookup console-openshift-console.apps... on 172.30.0.10:53: no such host
      2 console RouteHealth_FailedGet failed to GET route (https://console-openshift-console.apps... Get "https://console-openshift-console.apps... EOF
      8 console RouteHealth_RouteNotAdmitted console route is not admitted
     16 console RouteHealth_FailedGet failed to GET route (https://console-openshift-console.apps... Get "https://console-openshift-console.apps... context deadline exceeded (Client.Timeout exceeded while awaiting headers)
For example this 4.14 to 4.15 run had:
: [bz-Management Console] clusteroperator/console should not change condition/Available Run #0: Failed 1h25m23s { 1 unexpected clusteroperator state transitions during e2e test run Nov 28 03:42:41.207 - 1s E clusteroperator/console condition/Available reason/RouteHealth_FailedGet status/False RouteHealthAvailable: failed to GET route (https://console-openshift-console.apps.ci-op-d2qsp1gp-2a31d.aws-2.ci.openshift.org): Get "https://console-openshift-console.apps.ci-op-d2qsp1gp-2a31d.aws-2.ci.openshift.org": context deadline exceeded (Client.Timeout exceeded while awaiting headers)}
While a timeout for the console Route isn't fantastic, an issue that only persists for 1s is not long enough to warrant immediate admin intervention. Teaching the console operator to stay Available=True for this kind of brief hiccup, while still going Available=False for issues where at least part of the component is non-functional and the condition requires immediate administrator intervention, would make it easier for admins and SREs operating clusters to identify when intervention was required.
At least 4.15. Possibly other versions; I haven't checked.
How reproducible:
$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=48h&type=junit&name=4.15.*upgrade&context=0&search=clusteroperator/console.*condition/Available.*status/False' | grep 'periodic.*failures match' | sort
periodic-ci-openshift-multiarch-master-nightly-4.15-ocp-e2e-upgrade-azure-ovn-heterogeneous (all) - 12 runs, 17% failed, 50% of failures match = 8% impact
periodic-ci-openshift-multiarch-master-nightly-4.15-upgrade-from-nightly-4.14-ocp-ovn-remote-libvirt-ppc64le (all) - 5 runs, 20% failed, 100% of failures match = 20% impact
periodic-ci-openshift-multiarch-master-nightly-4.15-upgrade-from-nightly-4.14-ocp-ovn-remote-libvirt-s390x (all) - 4 runs, 100% failed, 25% of failures match = 25% impact
periodic-ci-openshift-multiarch-master-nightly-4.15-upgrade-from-stable-4.14-ocp-e2e-aws-ovn-heterogeneous-upgrade (all) - 12 runs, 17% failed, 100% of failures match = 17% impact
periodic-ci-openshift-multiarch-master-nightly-4.15-upgrade-from-stable-4.14-ocp-e2e-upgrade-azure-ovn-arm64 (all) - 7 runs, 29% failed, 50% of failures match = 14% impact
periodic-ci-openshift-multiarch-master-nightly-4.15-upgrade-from-stable-4.14-ocp-e2e-upgrade-azure-ovn-heterogeneous (all) - 12 runs, 25% failed, 33% of failures match = 8% impact
periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-aws-ovn-upgrade (all) - 80 runs, 23% failed, 28% of failures match = 6% impact
periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-azure-sdn-upgrade (all) - 80 runs, 28% failed, 23% of failures match = 6% impact
periodic-ci-openshift-release-master-ci-4.16-upgrade-from-stable-4.15-e2e-aws-ovn-upgrade (all) - 63 runs, 38% failed, 8% of failures match = 3% impact
periodic-ci-openshift-release-master-ci-4.16-upgrade-from-stable-4.15-e2e-azure-sdn-upgrade (all) - 60 runs, 73% failed, 11% of failures match = 8% impact
periodic-ci-openshift-release-master-nightly-4.15-e2e-aws-sdn-upgrade (all) - 70 runs, 7% failed, 20% of failures match = 1% impact
Seems like it's primarily minor-version updates that trip this, and in jobs with high run counts, the impact percentage is single-digits.
There may be a way to reliably trigger these hiccups, but as a reproducer floor, running days of CI and checking to see whether impact percentages decrease would be a good way to test fixes post-merge.
Lots of console ClusterOperator going Available=False blips in 4.15 update CI.
Console goes Available=False if and only if immediate admin intervention is appropriate.
Please review the following PR: https://github.com/openshift/cluster-monitoring-operator/pull/2190
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
documentationBaseURL still points to 4.14
Version-Release number of selected component (if applicable):
4.16
How reproducible:
Always
Steps to Reproduce:
1.Check documentationBaseURL on 4.16 cluster: # oc get configmap console-config -n openshift-console -o yaml | grep documentationBaseURL documentationBaseURL: https://access.redhat.com/documentation/en-us/openshift_container_platform/4.14/ 2. 3.
Actual results:
1.documentationBaseURL is still pointing to 4.14
Expected results:
1.documentationBaseURL should point to 4.16
Additional info:
Description of problem:
We shouldn't enforce PSa in 4.16, neither by label sync nor by global cluster config.
Version-Release number of selected component (if applicable):
4.16
How reproducible:
100%
Steps to Reproduce:
As a cluster admin: 1. create two new namespaces/projects: pokus, openshift-pokus 2. as a cluster-admin, attempt to create a privileged pod in both the namespaces from 1.
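A minimal sketch of step 2, with a placeholder pod name and image, to see whether admission blocks the pod or only warns:
~~~
# Try to create a privileged pod in one of the test namespaces.
oc create -n pokus -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: privileged-test
spec:
  containers:
  - name: test
    image: registry.access.redhat.com/ubi9/ubi
    command: ["sleep", "infinity"]
    securityContext:
      privileged: true
EOF
~~~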
Actual results:
pod creation is blocked by pod security admission
Expected results:
only a warning about pod violating the namespace pod security level should be emitted
Additional info:
As a maintainer of the HyperShift repo, I would like to remove unused functions from the code base to reduce the code footprint of the repo.
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
This is a clone of issue OCPBUGS-33566. The following is the description of the original issue:
—
Description of problem:
When the cloud-credential operator is used in manual mode and awsSTSIAMRoleARN is not present in the secret, the operator pods throw aggressive errors every second. One of the customer concerns is the number of errors from the operator pods, two errors per second:
============================
time="2024-05-10T00:43:45Z" level=error msg="error syncing credentials: an empty awsSTSIAMRoleARN was found so no Secret was created" controller=credreq cr=openshift-cloud-credential-operator/aws-ebs-csi-driver-operator secret=openshift-cluster-csi-drivers/ebs-cloud-credentials
time="2024-05-10T00:43:46Z" level=error msg="errored with condition: CredentialsProvisionFailure" controller=credreq cr=openshift-cloud-credential-operator/aws-ebs-csi-driver-operator secret=openshift-cluster-csi-drivers/ebs-cloud-credentials
Version-Release number of selected component (if applicable):
4.15.3
How reproducible:
Always present in managed rosa clusters
Steps to Reproduce:
1.create a rosa cluster 2.check the errors of cloud credentials operator pods 3.
Actual results:
The CCO logs continually throw errors
Expected results:
The CCO logs should not be continually throwing these errors.
Additional info:
The focus of this bug is only to remove the error lines from the logs. The underlying issue, of continually attempting to reconcile the CRs will be handled by other bugs.
Please review the following PR: https://github.com/openshift/operator-framework-rukpak/pull/67
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/cloud-provider-gcp/pull/52
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/cluster-kube-storage-version-migrator-operator/pull/102
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
This is a clone of issue OCPBUGS-36263. The following is the description of the original issue:
—
The new test: [sig-node] kubelet metrics endpoints should always be reachable
Is picking up some upgrade job runs where we see the metrics endpoint go down for about 30 seconds, during the generic node update phase, and recover before we reboot the node. This is treated as a reason to flake the test because there was no overlap with reboot as initially written.
Example: https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.17-e2e-gcp-ovn-upgrade/1806142925785010176
Interval chart showing the problem: https://sippy.dptools.openshift.org/sippy-ng/job_runs/1806142925785010176/periodic-ci-openshift-release-master-ci-4.17-e2e-gcp-ovn-upgrade/intervals?filterText=master-1&intervalFile=e2e-timelines_spyglass_20240627-024633.json&overrideDisplayFlag=0&selectedSources=E2EFailed&selectedSources=MetricsEndpointDown&selectedSources=NodeState
The master outage at 3:30:59 is causing a flake when I'd rather it didn't, because it doesn't extend into the reboot.
I'd like to tighten this up to include any overlap with update.
Will be backported to 4.16 to tighten the signal there as well.
Please review the following PR: https://github.com/openshift/azure-disk-csi-driver/pull/70
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
This is a clone of issue OCPBUGS-39134. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-31738. The following is the description of the original issue:
—
Description of problem:
The [Jira:"Network / ovn-kubernetes"] monitor test pod-network-avalibility setup test frequently fails on OpenStack platform, which in turn also causes the [sig-network] can collect pod-to-service poller pod logs and [sig-network] can collect host-to-service poller pod logs tests to fail.
These failure happen frequently in vh-mecha, for example for all CSI jobs, such as 4.16-e2e-openstack-csi-cinder.
Recently lextudio dropped pyasn1, so we want to be explicit and show that we install pysnmp-lextudio but the normal pyasn1.
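A one-line sketch of that intent (package names as stated above; the exact install step is illustrative):
~~~
# Install the lextudio pysnmp fork but keep the standard pyasn1 package.
pip install pysnmp-lextudio pyasn1
~~~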
This is a clone of issue OCPBUGS-34714. The following is the description of the original issue:
—
Description of problem:
While creating an install configuration for PowerVS IPI, the default region is not set, leading to the survey getting stuck if nothing is entered at the command line.
Version-Release number of selected component (if applicable):
4.16
How reproducible:
Always
Steps to Reproduce:
1. openshift-install create install-config
Actual results:
No default region is selected and hence pressing enter on the first option without going up or down results in the error "Sorry, your reply was invalid: invalid region """
Expected results:
dal gets selected as the default
Additional info:
Description of problem:
Remove react-helmet from the list of shared modules in console/frontend/packages/console-dynamic-plugin-sdk/src/shared-modules.ts (noted as deprecated from last release)
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
console-operator is updating the OIDC status without checking the feature gate
Version-Release number of selected component (if applicable):
4.16
How reproducible:
Set up an OCP cluster without an external OIDC provider, using the default OAuth.
Steps to Reproduce:
1. 2. 3.
Actual results:
the OIDC related conditions are being surfaced in the console-operator's config conditions.
Expected results:
the OIDC related conditions should not be surfaced in the console-operator's config conditions.
Additional info:
Please review the following PR: https://github.com/openshift/cluster-api-provider-baremetal/pull/207
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
When an OpenShift Container Platform cluster is installed on Bare Metal with RHACM, the "metal3-plugin" for the OpenShift Console is installed automatically. The "Nodes" view (`<console>/k8s/cluster/core~v1~Node`) uses the `BareMetalNodesTable`, which has very limited columns. However, in the meantime OCP improved its Nodes table and added more features (like metrics) and we haven't done any work in metal3. Customers are missing information like metrics or Pods, which are present in the standard Node view.
Version-Release number of selected component (if applicable):
OpenShift Container Platform 4.12.33
How reproducible:
Always
Steps to Reproduce:
1. Install a cluster using RHACM on Bare Metal 2. Ensure the "metal3-plugin" is enabled 3. Navigate to the "Nodes" view in the OpenShift Container Platform Console (`<console>/k8s/cluster/core~v1~Node`)
Actual results:
Limited columns (Name, Status, Role, Machine, Management Address) is visible. Columns like Memory, CPU, Pods, Filesystem, Instance Type are missing
Expected results:
All the columns from the standard view are visible, plus the "Management Address" column
Additional info:
* Issue was discussed here: https://redhat-internal.slack.com/archives/C027TN14SGJ/p1702552957981989 * Screenshot of non-metal3 cluster: https://redhat-internal.slack.com/archives/C027TN14SGJ/p1702552980878029?thread_ts=1702552957.981989&cid=C027TN14SGJ * Screenshot of metal3 cluster: https://redhat-internal.slack.com/archives/C027TN14SGJ/p1702552995363389?thread_ts=1702552957.981989&cid=C027TN14SGJ
Description of problem:
Inspection is failing on hosts which special characters found in serial number of block devices: Jul 03 09:16:11 master3.xxxxxx.yyy ironic-agent[2272]: 2024-07-03 09:16:11.325 1 DEBUG ironic_python_agent.inspector [-] collected data: {'inventory'....'error': "The following errors were encountered:\n* collector logs failed: 'utf-8' codec can't decode byte 0xff in position 12: invalid start byte"} call_inspector /usr/lib/python3.9/site-packages/ironic_python_agent/inspector.py:128 Serial found: "serial": "2HC015KJ0000\udcff\udcff\udcff\udcff\udcff\udcff\udcff\udcff" Interesting stacktrace error: Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]: UnicodeEncodeError: 'utf-8' codec can't encode characters in position 1260-1267: surrogates not allowed Full stack trace: ~~~ Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]: 2024-07-03 09:16:11.628 1 DEBUG oslo_concurrency.processutils [-] CMD "lsblk -bia --json -oKNAME,MODEL,SIZE,ROTA,TYPE,UUID,PARTUUID,SERIAL" returned: 0 in 0.006s e xecute /usr/lib/python3.9/site-packages/oslo_concurrency/processutils.py:422 Jul 03 09:16:11 master3.xxxxxx.yyy ironic-agent[2272]: --- Logging error --- Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]: --- Logging error --- Jul 03 09:16:11 master3.xxxxxx.yyy ironic-agent[2272]: Traceback (most recent call last): Jul 03 09:16:11 master3.xxxxxx.yyy ironic-agent[2272]: File "/usr/lib64/python3.9/logging/__init__.py", line 1086, in emit Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]: Traceback (most recent call last): Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]: File "/usr/lib64/python3.9/logging/__init__.py", line 1086, in emit Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]: stream.write(msg + self.terminator) Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]: UnicodeEncodeError: 'utf-8' codec can't encode characters in position 1260-1267: surrogates not allowed Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]: Call stack: Jul 03 09:16:11 master3.xxxxxx.yyy ironic-agent[2272]: stream.write(msg + self.terminator) Jul 03 09:16:11 master3.xxxxxx.yyy ironic-agent[2272]: UnicodeEncodeError: 'utf-8' codec can't encode characters in position 1260-1267: surrogates not allowed Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]: File "/usr/bin/ironic-python-agent", line 10, in <module> Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]: sys.exit(run()) Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]: File "/usr/lib/python3.9/site-packages/ironic_python_agent/cmd/agent.py", line 50, in run Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]: agent.IronicPythonAgent(CONF.api_url, Jul 03 09:16:11 master3.xxxxxx.yyy ironic-agent[2272]: Call stack: Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]: File "/usr/lib/python3.9/site-packages/ironic_python_agent/agent.py", line 485, in run Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]: self.process_lookup_data(content) Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]: File "/usr/lib/python3.9/site-packages/ironic_python_agent/agent.py", line 400, in process_lookup_data Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]: hardware.cache_node(self.node) Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]: File "/usr/lib/python3.9/site-packages/ironic_python_agent/hardware.py", line 3179, in cache_node Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]: dispatch_to_managers('wait_for_disks') Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]: File "/usr/lib/python3.9/site-packages/ironic_python_agent/hardware.py", line 3124, in dispatch_to_managers Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]: return 
getattr(manager, method)(*args, **kwargs) Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]: File "/usr/lib/python3.9/site-packages/ironic_python_agent/hardware.py", line 997, in wait_for_disks Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]: self.get_os_install_device() Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]: File "/usr/lib/python3.9/site-packages/ironic_python_agent/hardware.py", line 1518, in get_os_install_device Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]: block_devices = self.list_block_devices_check_skip_list( Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]: File "/usr/lib/python3.9/site-packages/ironic_python_agent/hardware.py", line 1495, in list_block_devices_check_skip_list Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]: block_devices = self.list_block_devices( Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]: File "/usr/lib/python3.9/site-packages/ironic_python_agent/hardware.py", line 1460, in list_block_devices Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]: block_devices = list_all_block_devices() Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]: File "/usr/lib/python3.9/site-packages/ironic_python_agent/hardware.py", line 526, in list_all_block_devices Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]: report = il_utils.execute('lsblk', '-bia', '--json', Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]: File "/usr/lib/python3.9/site-packages/ironic_lib/utils.py", line 111, in execute Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]: _log(result[0], result[1]) Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]: File "/usr/lib/python3.9/site-packages/ironic_lib/utils.py", line 99, in _log Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]: LOG.debug('Command stdout is: "%s"', stdout) Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]: Message: 'Command stdout is: "%s"' Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]: Arguments: ('{\n "blockdevices": [\n {\n "kname": "loop0",\n "model": null,\n "size": 67467313152,\n "rota": false,\n "type": "loop",\n "uuid": "28f5ff52-7f5b-4e5a-bcf2-59813e5aef5a",\n "partuuid": null,\n "serial": null\n },{\n "kname": "loop1",\n "model": null,\n "size": 1027846144,\n "rota": false,\n "type": "loop",\n "uuid": null,\n "partuuid": null,\n "serial": null\n },{\n "kname": "sda",\n "model": "LITEON IT ECE-12",\n "size": 120034123776,\n "rota": false,\n "type": "disk",\n "uuid": null,\n "partuuid": null,\n "serial": "XXXXXXXXXXXXXXXXXX"\n },{\n "kname": "sdb",\n "model": "LITEON IT ECE-12",\n "size": 120034123776,\n "rota": false,\n "type": "disk",\n "uuid": null,\n "partuuid": null,\n "serial": "XXXXXXXXXXXXXXXXXXXX"\n },{\n "kname": "sdc",\n "model": "External",\n "size": 0,\n "rota": true,\n "type": "disk",\n "uuid": null,\n "partuuid": null,\n "serial": "2HC015KJ0000\udcff\udcff\udcff\udcff\udcff\udcff\udcff\udcff"\n }\n ]\n}\n',) ~~~
Version-Release number of selected component (if applicable):
OCP 4.14.28
How reproducible:
Always
Steps to Reproduce:
1. Add a BMH with a bad utf-8 characters in serial 2. 3.
Actual results:
Inspection fail
Expected results:
Inspection works
Additional info:
This is a clone of issue OCPBUGS-41685. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-37584. The following is the description of the original issue:
—
Description of problem:
Topology screen crashes and reports "Oh no! something went wrong" when a pod in completed state is selected.
Version-Release number of selected component (if applicable):
RHOCP 4.15.18
How reproducible:
100%
Steps to Reproduce:
1. Switch to developer mode
2. Select Topology
3. Select a project that has completed cron jobs like openshift-image-registry
4. Click the green CronJob Object
5. Observe Crash
Actual results:
The Topology screen crashes with error "Oh no! Something went wrong."
Expected results:
After clicking the completed pod / workload, the screen should display the information related to it.
Additional info:
Since many 4.y ago, before 4.11 and all the minor versions that are still supported, CRI-O has wiped images when it comes up after a node reboot and notices it has a new (minor?) version. This causes redundant pulls, as seen in this 4.11-to-4.12 update run:
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.12-upgrade-from-stable-4.11-e2e-azure-sdn-upgrade/1732741139229839360/artifacts/e2e-azure-sdn-upgrade/gather-extra/artifacts/nodes/ci-op-rzrplpjd-7f65d-vwrzs-worker-eastus21-lcgk4/journal | zgrep 'Starting update from rendered-\|crio-wipe\|Pulled image: registry.ci.openshift.org/ocp/4.12-2023-12-07-060628@sha256:3c3e67faf4b6e9e95bebb0462bd61c964170893cb991b5c4de47340a2f295dc2' Dec 07 13:05:42.474144 ci-op-rzrplpjd-7f65d-vwrzs-worker-eastus21-lcgk4 systemd[1]: crio-wipe.service: Succeeded. Dec 07 13:05:42.481470 ci-op-rzrplpjd-7f65d-vwrzs-worker-eastus21-lcgk4 systemd[1]: crio-wipe.service: Consumed 191ms CPU time Dec 07 13:59:51.000686 ci-op-rzrplpjd-7f65d-vwrzs-worker-eastus21-lcgk4 crio[1498]: time="2023-12-07 13:59:51.000591203Z" level=info msg="Pulled image: registry.ci.openshift.org/ocp/4.12-2023-12-07-060628@sha256:3c3e67faf4b6e9e95bebb0462bd61c964170893cb991b5c4de47340a2f295dc2" id=a62bc972-67d7-401a-9640-884430bd16f1 name=/runtime.v1.ImageService/PullImage Dec 07 14:00:55.745095 ci-op-rzrplpjd-7f65d-vwrzs-worker-eastus21-lcgk4 root[101294]: machine-config-daemon[99469]: Starting update from rendered-worker-ca36a33a83d49b43ed000fd422e09838 to rendered-worker-c0b3b4eadfe6cdfb595b97fa293a9204: &{osUpdate:true kargs:false fips:false passwd:false files:true units:true kernelType:false extensions:false} Dec 07 14:05:33.274241 ci-op-rzrplpjd-7f65d-vwrzs-worker-eastus21-lcgk4 systemd[1]: crio-wipe.service: Succeeded. Dec 07 14:05:33.289605 ci-op-rzrplpjd-7f65d-vwrzs-worker-eastus21-lcgk4 systemd[1]: crio-wipe.service: Consumed 216ms CPU time Dec 07 14:14:50.277011 ci-op-rzrplpjd-7f65d-vwrzs-worker-eastus21-lcgk4 crio[1573]: time="2023-12-07 14:14:50.276961087Z" level=info msg="Pulled image: registry.ci.openshift.org/ocp/4.12-2023-12-07-060628@sha256:3c3e67faf4b6e9e95bebb0462bd61c964170893cb991b5c4de47340a2f295dc2" id=1a092fbd-7ffa-475a-b0b7-0ab115dbe173 name=/runtime.v1.ImageService/PullImage
The redundant pulls cost network and disk traffic, and avoiding them should make those update-initiated reboots quicker and cheaper. The lack of update-initiated wipes is not expected to cost much, because the Kubelet's old-image garbage collection should be along to clear out any no-longer-used images if disk space gets tight.
At least 4.11. Possibly older 4.y; I haven't checked.
Every time.
1. Install a cluster.
2. Update to a release image with a different CRI-O (minor?) version.
3. Check logs on the nodes.
crio-wipe entries in the logs, with reports of target-release images being pulled before and after those wipes, as I quoted in the Description.
Target-release images pulled before the reboot, and found in the local cache if that image is needed again post-reboot.
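For context, the wipe decision is essentially a version comparison between what CRI-O recorded before the reboot and the binary that comes up afterwards. A rough sketch of that kind of check follows; it is illustrative only and not CRI-O's crio-wipe implementation, whose exact comparison may differ:

~~~
package main

import (
	"fmt"
	"strings"
)

// shouldWipe mimics the kind of check crio-wipe performs: compare the version
// recorded on disk from the previous boot with the running binary, and wipe
// stored images when the minor version changed.
func shouldWipe(storedVersion, runningVersion string) bool {
	minor := func(v string) string {
		parts := strings.SplitN(v, ".", 3)
		if len(parts) < 2 {
			return v
		}
		return parts[0] + "." + parts[1]
	}
	return minor(storedVersion) != minor(runningVersion)
}

func main() {
	fmt.Println(shouldWipe("1.24.3", "1.25.1")) // true  -> images wiped, redundant re-pulls after the update reboot
	fmt.Println(shouldWipe("1.25.0", "1.25.1")) // false -> image cache kept across the reboot
}
~~~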
Description of problem:
It seems something might be wrong with the logic for the new defaultChannel property. After initially syncing an operator to a tarball, subsequent runs complain the catalog is invalid, as if defaultChannel was never set.
Version-Release number of selected component (if applicable):
I tried oc-mirror v4.14.16 and v4.15.2
How reproducible:
100%
Steps to Reproduce:
1. Write this yaml config to an isc.yaml file in an empty dir. (It is worth noting that right now the default channel for this operator is of course something else – currently `latest`.)
kind: ImageSetConfiguration
apiVersion: mirror.openshift.io/v1alpha2
storageConfig:
  local:
    path: ./operator-images
mirror:
  operators:
    - catalog: registry.redhat.io/redhat/redhat-operator-index:v4.14
      packages:
        - name: openshift-pipelines-operator-rh
          defaultChannel: pipelines-1.11
          channels:
            - name: pipelines-1.11
              minVersion: 1.11.3
              maxVersion: 1.11.3
2. Using oc-mirror v4.14.16 or v4.15.2, run:
oc-mirror -c ./isc.yaml file://operator-images
3. Without the defaultChannel property and a recent version of oc-mirror, that would have failed. Assuming it succeeds, run the same command a second time (with or without the --dry-run option) and note that it now fails. It seems nothing can be done. oc-mirror says the catalog is invalid.
Actual results:
$ oc-mirror -c ./isc.yaml file://operator-images Creating directory: operator-images/oc-mirror-workspace/src/publish Creating directory: operator-images/oc-mirror-workspace/src/v2 Creating directory: operator-images/oc-mirror-workspace/src/charts Creating directory: operator-images/oc-mirror-workspace/src/release-signatures No metadata detected, creating new workspace wrote mirroring manifests to operator-images/oc-mirror-workspace/operators.1711523827/manifests-redhat-operator-indexTo upload local images to a registry, run: oc adm catalog mirror file://redhat/redhat-operator-index:v4.14 REGISTRY/REPOSITORY <dir> openshift-pipelines/pipelines-chains-controller-rhel8 blobs: registry.redhat.io/openshift-pipelines/pipelines-chains-controller-rhel8 sha256:b06cce9e748bd5e1687a8d2fb11e5e01dd8b901eeeaa1bece327305ccbd62907 11.51KiB registry.redhat.io/openshift-pipelines/pipelines-chains-controller-rhel8 sha256:e5897b8264878f1f63f6eceed870b939ff39993b05240ce8292f489e68c9bd19 11.52KiB ... stats: shared=12 unique=274 size=24.71GiB ratio=0.98 info: Mirroring completed in 9m45.86s (45.28MB/s) Creating archive operator-images/mirror_seq1_000000.tar $ oc-mirror -c ./isc.yaml file://operator-images Found: operator-images/oc-mirror-workspace/src/publish Found: operator-images/oc-mirror-workspace/src/v2 Found: operator-images/oc-mirror-workspace/src/charts Found: operator-images/oc-mirror-workspace/src/release-signatures The current default channel was not valid, so an attempt was made to automatically assign a new default channel, which has failed. The failure occurred because none of the remaining channels contain an "olm.channel" priority property, so it was not possible to establish a channel to use as the default channel. This can be resolved by one of the following changes: 1) assign an "olm.channel" property on the appropriate channels to establish a channel priority 2) modify the default channel manually in the catalog 3) by changing the ImageSetConfiguration to filter channels or packages in such a way that it will include a package version that exists in the current default channel The rendered catalog is invalid. Run "oc-mirror list operators --catalog CATALOG-NAME --package PACKAGE-NAME" for more information. error: error generating diff: the current default channel "latest" for package "openshift-pipelines-operator-rh" could not be determined... ensure that your ImageSetConfiguration filtering criteria results in a package version that exists in the current default channel or use the 'defaultChannel' field
Expected results:
It should NOT throw that error and instead should either update (if you've added more to the imagesetconfig) or gracefully print the "No new images" message.
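The failure message above effectively describes the fallback selection oc-mirror attempts. A rough sketch of that logic, illustrative only and not oc-mirror's actual implementation (the channel names come from the example above):

~~~
package main

import (
	"errors"
	"fmt"
)

// channel is a simplified stand-in for a filtered catalog channel; priority
// models an "olm.channel" priority property when one is present.
type channel struct {
	name     string
	priority *int
}

// pickDefault keeps the current default if it survived filtering, otherwise
// falls back to the highest olm.channel priority, and fails when no remaining
// channel carries a priority (the case hit above).
func pickDefault(current string, filtered []channel) (string, error) {
	var best *channel
	for i := range filtered {
		ch := &filtered[i]
		if ch.name == current {
			return current, nil
		}
		if ch.priority != nil && (best == nil || *ch.priority > *best.priority) {
			best = ch
		}
	}
	if best == nil {
		return "", errors.New("no remaining channel has an olm.channel priority; cannot pick a default")
	}
	return best.name, nil
}

func main() {
	filtered := []channel{{name: "pipelines-1.11"}} // filtered set from the ImageSetConfiguration
	_, err := pickDefault("latest", filtered)
	fmt.Println(err) // mirrors the "could not be determined" failure above
}
~~~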
Description of problem:
Recently, the passing rate for test "static pods should start after being created" has dropped significantly for some platforms: https://sippy.dptools.openshift.org/sippy-ng/tests/4.15/analysis?test=%5Bsig-node%5D%20static%20pods%20should%20start%20after%20being%20created&filters=%7B%22items%22%3A%5B%7B%22columnField%22%3A%22name%22%2C%22operatorValue%22%3A%22equals%22%2C%22value%22%3A%22%5Bsig-node%5D%20static%20pods%20should%20start%20after%20being%20created%22%7D%5D%2C%22linkOperator%22%3A%22and%22%7D

Take a look at this example: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.15-e2e-azure-sdn-techpreview/1712803313642115072

The test failed with the following message:
{ static pod lifecycle failure - static pod: "kube-controller-manager" in namespace: "openshift-kube-controller-manager" for revision: 6 on node: "ci-op-2z99zzqd-7f99c-rfp4q-master-0" didn't show up, waited: 3m0s}

Seemingly revision 6 was never reached. But if we look at the log from kube-controller-manager-operator, it jumps from revision 5 to revision 7: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.15-e2e-azure-sdn-techpreview/1712803313642115072/artifacts/e2e-azure-sdn-techpreview/gather-extra/artifacts/pods/openshift-kube-controller-manager-operator_kube-controller-manager-operator-7cd978d745-bcvkm_kube-controller-manager-operator.log

The log also indicates that there is a possibility of a race:
W1013 12:59:17.775274 1 staticpod.go:38] revision 7 is unexpectedly already the latest available revision. This is a possible race!

This might be a static controller issue, but I am starting with the kube-controller-manager component for the case. Feel free to reassign.

Here is a slack thread related to this: https://redhat-internal.slack.com/archives/C01CQA76KMX/p1697472297510279
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of the problem:
OCI external platform should be shown as Tech Preview when OCP 4.14 is selected.
https://redhat-internal.slack.com/archives/C04RBMZCBGW/p1711029226861489
How reproducible:
Steps to reproduce:
1.
2.
3.
Actual results:
Expected results:
This is a clone of issue OCPBUGS-41552. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-39420. The following is the description of the original issue:
—
Description of problem:
ROSA HCP allows customers to select hostedcluster and nodepool OCP z-stream versions, respecting version skew requirements. E.g.:
Version-Release number of selected component (if applicable):
Reproducible on 4.14-4.16.z, this bug report demonstrates it for a 4.15.28 hostedcluster with a 4.15.25 nodepool
How reproducible:
100%
Steps to Reproduce:
1. Create a ROSA HCP cluster, which comes with a 2-replica nodepool with the same z-stream version (4.15.28)
2. Create an additional nodepool at a different version (4.15.25)
Actual results:
Observe that while nodepool objects report the different version (4.15.25), the resulting kernel version of the node is that of the hostedcluster (4.15.28)

❯ k get nodepool -n ocm-staging-2didt6btjtl55vo3k9hckju8eeiffli8
NAME                     CLUSTER       DESIRED NODES   CURRENT NODES   AUTOSCALING   AUTOREPAIR   VERSION   UPDATINGVERSION   UPDATINGCONFIG   MESSAGE
mshen-hyper-np-4-15-25   mshen-hyper   1               1               False         True         4.15.25   False             False
mshen-hyper-workers      mshen-hyper   2               2               False         True         4.15.28   False             False

❯ k get no -owide
NAME                                         STATUS   ROLES    AGE   VERSION            INTERNAL-IP    EXTERNAL-IP   OS-IMAGE                                                        KERNEL-VERSION                  CONTAINER-RUNTIME
ip-10-0-129-139.us-west-2.compute.internal   Ready    worker   24m   v1.28.12+396c881   10.0.129.139   <none>        Red Hat Enterprise Linux CoreOS 415.92.202408100433-0 (Plow)   5.14.0-284.79.1.el9_2.aarch64   cri-o://1.28.9-5.rhaos4.15.git674ed4c.el9
ip-10-0-129-165.us-west-2.compute.internal   Ready    worker   98s   v1.28.12+396c881   10.0.129.165   <none>        Red Hat Enterprise Linux CoreOS 415.92.202408100433-0 (Plow)   5.14.0-284.79.1.el9_2.aarch64   cri-o://1.28.9-5.rhaos4.15.git674ed4c.el9
ip-10-0-132-50.us-west-2.compute.internal    Ready    worker   30m   v1.28.12+396c881   10.0.132.50    <none>        Red Hat Enterprise Linux CoreOS 415.92.202408100433-0 (Plow)   5.14.0-284.79.1.el9_2.aarch64   cri-o://1.28.9-5.rhaos4.15.git674ed4c.el9
Expected results:
Additional info:
Description of problem:
After applying a NetworkPolicy to the namespace and doing live migration, the pod cannot become ready after the route table MTU is updated and the node is rebooted.
cat <<EOF | oc create -f -
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: default-deny-ingress
namespace: z3
spec:
podSelector: {}
policyTypes:
- Ingress
---
kind: NetworkPolicy
apiVersion: networking.k8s.io/v1
metadata:
name: allow-all-ingress
namespace: z3
spec:
ingress:
- from:
- namespaceSelector:
matchLabels:
team: qe
podSelector:
matchLabels:
name: test
policyTypes:
- Ingress
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: allow-from-openshift-ingress
namespace: z3
spec:
ingress:
- from:
- namespaceSelector:
matchLabels:
policy-group.network.openshift.io/ingress: ""
podSelector: {}
policyTypes:
- Ingress
EOF
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedCreatePodSandBox 6m25s (x4579 over 19h) kubelet (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_hello-hdx8h_z3_8e3a0595-fabd-4953-a460-5c014290122d_0(383f4845fa3cc790f58c5d1a755fa46cc69c220a3669c65422a0423293c9863a): error adding pod z3_hello-hdx8h to CNI network "multus-cni-network": plugin type="multus-shim" name="multus-cni-network" failed (add): CmdAdd (shim): CNI request failed with status 400: 'ContainerID:"383f4845fa3cc790f58c5d1a755fa46cc69c220a3669c65422a0423293c9863a" Netns:"/var/run/netns/cbba5e98-ae28-4199-a573-ef1c24013442" IfName:"eth0" Args:"IgnoreUnknown=1;K8S_POD_NAMESPACE=z3;K8S_POD_NAME=hello-hdx8h;K8S_POD_INFRA_CONTAINER_ID=383f4845fa3cc790f58c5d1a755fa46cc69c220a3669c65422a0423293c9863a;K8S_POD_UID=8e3a0595-fabd-4953-a460-5c014290122d" Path:"" ERRORED: error configuring pod [z3/hello-hdx8h] networking: [z3/hello-hdx8h/8e3a0595-fabd-4953-a460-5c014290122d:openshift-sdn]: error adding container to network "openshift-sdn": failed to add route to 10.128.0.2/14 via SDN: invalid argument
': StdinData: {"binDir":"/var/lib/cni/bin","clusterNetwork":"/host/run/multus/cni/net.d/80-openshift-network.conf","cniVersion":"0.3.1","daemonSocketDir":"/run/multus/socket","globalNamespaces":"default,openshift-multus,openshift-sriov-network-operator","logLevel":"verbose","logToStderr":true,"name":"multus-cni-network","namespaceIsolation":true,"type":"multus-shim"}
Normal AddedInterface 2m4s multus Add eth0 [10.128.0.209/23] from openshift-sdn
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. setup 4.16 cluster
2. Create namespace and pods and then apply Networkpolicy
3. do live migration
Actual results:
After route table mtu is updated and reboot. the pods on that worker cannot be ready with error (see description)
Expected results:
Additional info:
Description of problem:
E2E test failing 1. Clone oc-mirror repository git clone https://github.com/openshift/oc-mirror.git && cd oc-mirror 2. Find the oc-mirror image in the release: https://mirror.openshift.com/pub/openshift-v4/multi/clients/ocp/4.14.0-rc.2/ppc64le/release.txt oc-mirror quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:fff150b00081ed565169de24cfc82481c5017de73986552d15d129530b62e531 3. Pull container podman pull quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:fff150b00081ed565169de24cfc82481c5017de73986552d15d129530b62e531 4. Extract binary mkdir bin container_id=$(podman create quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:fff150b00081ed565169de24cfc82481c5017de73986552d15d129530b62e531) podman cp ${container_id}:usr/bin/oc-mirror bin/oc-mirror 5. comfirm file [root@rdr-ani-014-bastion-0 oc-mirror]# file bin/oc-mirror bin/oc-mirror: ELF 64-bit LSB executable, 64-bit PowerPC or cisco 7500, version 1 (SYSV), dynamically linked, interpreter /lib64/ld64.so.2, for GNU/Linux 3.10.0, Go BuildID=HuBgap--bII0r0Nw0GxI/SOZCyTWk4pH5ciuQtUO8/ib6uaSW-eAJl24Zzk-G2/O4yxlKreHK_BaH9F4RU6, BuildID[sha1]=c018e70301e18c23f2c119ba451a32aff980d618, with debug_info, not stripped, too many notes (256) 6. Build go-toolset and run e2e test [root@rdr-ani-014-bastion-0 oc-mirror]# podman build -f Dockerfile -t local/go-toolset:latest Successfully tagged localhost/local/go-toolset:latest bf24f160059d7ae2ef99a77e6680cdac30e3ba942911b88c7e60dca88fd768f7 [root@rdr-ani-014-bastion-0 oc-mirror]# podman run -it -v $(pwd):/build:z --entrypoint /bin/bash local/go-toolset:latest ./test/e2e/e2e-simple.sh bin/oc-mirror | tee oc-mirror-e2e.log /build/test/e2e/operator-test.28124 /build % Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed 0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0 100 49.0M 100 49.0M 0 0 60.2M 0 --:--:-- --:--:-- --:--:-- 106M go: downloading github.com/google/go-containerregistry v0.16.1 go: downloading github.com/docker/cli v24.0.0+incompatible go: downloading github.com/opencontainers/image-spec v1.1.0-rc3 go: downloading github.com/spf13/cobra v1.7.0 go: downloading github.com/mitchellh/go-homedir v1.1.0 go: downloading golang.org/x/sync v0.2.0 go: downloading github.com/opencontainers/go-digest v1.0.0 go: downloading github.com/docker/distribution v2.8.2+incompatible go: downloading github.com/google/go-cmp v0.5.9 go: downloading github.com/containerd/stargz-snapshotter/estargz v0.14.3 go: downloading github.com/spf13/pflag v1.0.5 go: downloading github.com/klauspost/compress v1.16.5 go: downloading github.com/vbatts/tar-split v0.11.3 go: downloading github.com/pkg/errors v0.9.1 go: downloading github.com/docker/docker v24.0.0+incompatible go: downloading golang.org/x/sys v0.8.0 go: downloading github.com/docker/docker-credential-helpers v0.7.0 go: downloading github.com/sirupsen/logrus v1.9.1 bin/registry /build INFO: Running 22 test cases INFO: Running full_catalog . . . 
sha256:17de509b5c9e370d501951850ba07f6cbefa529f598f3011766767d1181726b3 localhost.localdomain:5001/skhoury/oc-mirror-dev:4138bec2 info: Mirroring completed in 40ms (119.4kB/s) worker 0 stopping worker 1 stopping worker 5 stopping worker 3 stopping worker 2 stopping worker 3 stopping worker 2 stopping worker 4 stopping work queue exiting No images specified for pruning Unpack release signatures worker 1 stopping work queue exiting worker 0 stopping Wrote release signatures to oc-mirror-workspace/results-1695964813 rebuilding catalog images Rendering catalog image "localhost.localdomain:5001/skhoury/oc-mirror-dev:test-catalog-latest" with file-based catalog error: error rebuilding catalog images from file-based catalogs: error regenerating the cache for localhost.localdomain:5001/skhoury/oc-mirror-dev:test-catalog-latest: fork/exec oc-mirror-workspace/images.1753960055/catalogs/localhost.localdomain:5000/skhoury/oc-mirror-dev/test-catalog-latest/bin/opm: exec format error
Version-Release number of selected component (if applicable):
4.14.0-rc.2
How reproducible:
Always
Steps to Reproduce:
Same as details provided in description
Actual results:
E2E test is getting terminated in between the execution
Expected results:
E2E testing should pass with no errors
Additional info:
E2E logs are provided here: oc-mirror-e2e.log - https://github.ibm.com/redstack-power/project-mgmt/issues/3284#issuecomment-63722862 re-oc-mirror-e2e.log - https://github.ibm.com/redstack-power/project-mgmt/issues/3284#issuecomment-63806863
Please review the following PR: https://github.com/openshift/cluster-policy-controller/pull/144
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Many jobs are failing because route53 is throttling us during cluster creation.
We need to make external-dns make fewer calls.
The theoretical minimum is:
list zones - 1 call
list zone records - (# of records / 100) calls
create 3 records per HC - 1-3 calls depending on how they are batched
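One way to approach that minimum is to batch all of a HostedCluster's records into a single ChangeResourceRecordSets call rather than one call per record. A sketch with aws-sdk-go follows; the zone ID, record names and target IP are invented for illustration, and this is not external-dns's actual code:

~~~
package main

import (
	"fmt"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/route53"
)

func main() {
	svc := route53.New(session.Must(session.NewSession()))

	// Batch the three per-HostedCluster records into one API call instead of three.
	names := []string{
		"api.example-hc.hypershift.local.",
		"api-int.example-hc.hypershift.local.",
		"*.apps.example-hc.hypershift.local.",
	}
	var changes []*route53.Change
	for _, name := range names {
		changes = append(changes, &route53.Change{
			Action: aws.String("UPSERT"),
			ResourceRecordSet: &route53.ResourceRecordSet{
				Name:            aws.String(name),
				Type:            aws.String("A"),
				TTL:             aws.Int64(60),
				ResourceRecords: []*route53.ResourceRecord{{Value: aws.String("10.0.0.10")}},
			},
		})
	}

	out, err := svc.ChangeResourceRecordSets(&route53.ChangeResourceRecordSetsInput{
		HostedZoneId: aws.String("Z0EXAMPLE"),
		ChangeBatch:  &route53.ChangeBatch{Changes: changes},
	})
	fmt.Println(out, err)
}
~~~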
Please review the following PR: https://github.com/openshift/machine-api-provider-ibmcloud/pull/31
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/operator-framework-olm/pull/633
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
Trying to run without the --node-upgrade-type param fails with "spec.management.upgradeType: Unsupported value: \"\": supported values: \"Replace\", \"InPlace\"", although in --help it is documented to have a default value of 'InPlace'.
Version-Release number of selected component (if applicable):
[kni@ocp-edge119 ~]$ ~/hypershift_working/hypershift/bin/hcp -v
hcp version openshift/hypershift: af9c0b3ce9c612ec738762a8df893c7598cbf157. Latest supported OCP: 4.15.0
How reproducible:
happens all the time
Steps to Reproduce:
1. On a hosted cluster setup run:
[kni@ocp-edge119 ~]$ ~/hypershift_working/hypershift/bin/hcp create nodepool agent --cluster-name hosted-0 --name nodepool-of-extra1 --node-count 2 --node-upgrade-type Replace --help
Creates basic functional NodePool resources for Agent platform
Usage:
  hcp create nodepool agent [flags]
Flags:
  -h, --help   help for agent
Global Flags:
      --cluster-name string             The name of the HostedCluster nodes in this pool will join. (default "example")
      --name string                     The name of the NodePool.
      --namespace string                The namespace in which to create the NodePool. (default "clusters")
      --node-count int32                The number of nodes to create in the NodePool. (default 2)
      --node-upgrade-type UpgradeType   The NodePool upgrade strategy for how nodes should behave when upgraded. Supported options: Replace, InPlace (default )
      --release-image string            The release image for nodes; if this is empty, defaults to the same release image as the HostedCluster.
      --render                          Render output as YAML to stdout instead of applying.
2. Try to run with the default value of --node-upgrade-type:
[kni@ocp-edge119 ~]$ ~/hypershift_working/hypershift/bin/hcp create nodepool agent --cluster-name hosted-0 --name nodepool-of-extra1 --node-count 2
Actual results:
[kni@ocp-edge119 ~]$ ~/hypershift_working/hypershift/bin/hcp create nodepool agent --cluster-name hosted-0 --name nodepool-of-extra1 --node-count 2 2024-02-06T19:57:03+02:00 ERROR Failed to create nodepool {"error": "NodePool.hypershift.openshift.io \"nodepool-of-extra1\" is invalid: spec.management.upgradeType: Unsupported value: \"\": supported values: \"Replace\", \"InPlace\""} github.com/openshift/hypershift/cmd/nodepool/core.(*CreateNodePoolOptions).CreateRunFunc.func1 /home/kni/hypershift_working/hypershift/cmd/nodepool/core/create.go:39 github.com/spf13/cobra.(*Command).execute /home/kni/hypershift_working/hypershift/vendor/github.com/spf13/cobra/command.go:983 github.com/spf13/cobra.(*Command).ExecuteC /home/kni/hypershift_working/hypershift/vendor/github.com/spf13/cobra/command.go:1115 github.com/spf13/cobra.(*Command).Execute /home/kni/hypershift_working/hypershift/vendor/github.com/spf13/cobra/command.go:1039 github.com/spf13/cobra.(*Command).ExecuteContext /home/kni/hypershift_working/hypershift/vendor/github.com/spf13/cobra/command.go:1032 main.main /home/kni/hypershift_working/hypershift/product-cli/main.go:60 runtime.main /home/kni/hypershift_working/go/src/runtime/proc.go:250 Error: NodePool.hypershift.openshift.io "nodepool-of-extra1" is invalid: spec.management.upgradeType: Unsupported value: "": supported values: "Replace", "InPlace" NodePool.hypershift.openshift.io "nodepool-of-extra1" is invalid: spec.management.upgradeType: Unsupported value: "": supported values: "Replace", "InPlace"
Expected results:
should pass as if your adding the param : [kni@ocp-edge119 ~]$ ~/hypershift_working/hypershift/bin/hcp create nodepool agent --cluster-name hosted-0 --name nodepool-of-extra1 --node-count 2 --node-upgrade-type InPlace NodePool nodepool-of-extra1 created [kni@ocp-edge119 ~]$
Additional info:
A related issue is that the --help output differs depending on whether --help is used together with other parameters or not:
[kni@ocp-edge119 ~]$ ~/hypershift_working/hypershift/bin/hcp create nodepool agent --cluster-name hosted-0 --name nodepool-of-extra1 --node-count 2 --node-upgrade-type Replace --help > long.help.out
[kni@ocp-edge119 ~]$ ~/hypershift_working/hypershift/bin/hcp create nodepool agent --help > short.help.out
[kni@ocp-edge119 ~]$ diff long.help.out short.help.out
14c14
<       --node-upgrade-type UpgradeType   The NodePool upgrade strategy for how nodes should behave when upgraded. Supported options: Replace, InPlace (default )
---
>       --node-upgrade-type UpgradeType   The NodePool upgrade strategy for how nodes should behave when upgraded. Supported options: Replace, InPlace
[kni@ocp-edge119 ~]$
Please review the following PR: https://github.com/openshift/kube-rbac-proxy/pull/91
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
When running a build (e.g. oc start-build) the build fails with reason CannotRetrieveServiceAccount and message Unable to look up the service account secrets for this build.
Description of problem:
Pre-test greenboot checks fail in during scenario run due to OVN-K pods reporting a "failed" status.
Version-Release number of selected component (if applicable):
I believe this is only affecting `periodic-ci-openshift-microshift-main-ocp-metal-nightly` jobs.
How reproducible:
Unsure. Has occurred 2 times in consecutive daily-periodic jobs.
Steps to Reproduce:
n/a
Actual results:
- https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-microshift-main-ocp-metal-nightly/1753221880392716288 - https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-microshift-main-ocp-metal-nightly/1753403067350388736
Expected results:
OVN-K Pods should deploy into a healthy state
Additional info:
Description of problem:
If a ROSA HCP customer uses the default worker security group that the CPO creates for some other purpose (e.g. creates their own VPC Endpoint or EC2 instance using this security group) and then starts an uninstallation, the uninstallation will hang indefinitely because the CPO is unable to delete the security group. https://github.com/openshift/hypershift/blob/9e6255e5e44c8464da0850f8c19dc085bdbaf8cb/control-plane-operator/controllers/hostedcontrolplane/hostedcontrolplane_controller.go#L317-L331
Version-Release number of selected component (if applicable):
4.14.8
How reproducible:
100%
Steps to Reproduce:
1. Create a ROSA HCP cluster
2. Attach the default worker security group to some other object unrelated to the cluster, like an EC2 instance or VPC Endpoint
3. Uninstall the ROSA HCP cluster
Actual results:
The uninstall hangs without much feedback to the customer
Expected results:
Either that the uninstall gives up and moves on eventually, or that clear feedback is provided to the customer, so that they know that the uninstall is held up because of an inability to delete a specific security group id. If this feedback mechanism is already in place, but not wired through to OCM, this may not be an OCPBUGS and could just be an OCM bug instead!
Additional info:
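For illustration, this is roughly how one could enumerate what still references the group and why the delete keeps failing; the security group ID is made up and this is not the CPO's code:

~~~
package main

import (
	"fmt"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ec2"
)

func main() {
	svc := ec2.New(session.Must(session.NewSession()))
	sgID := "sg-0123456789abcdef0" // hypothetical default worker security group ID

	// Anything still attached (customer EC2 instances, VPC endpoints, ...) shows
	// up as a network interface referencing the group, which is what keeps the
	// delete loop spinning.
	out, err := svc.DescribeNetworkInterfaces(&ec2.DescribeNetworkInterfacesInput{
		Filters: []*ec2.Filter{{Name: aws.String("group-id"), Values: []*string{aws.String(sgID)}}},
	})
	if err != nil {
		panic(err)
	}
	for _, eni := range out.NetworkInterfaces {
		fmt.Printf("still attached: %s (%s)\n",
			aws.StringValue(eni.NetworkInterfaceId), aws.StringValue(eni.Description))
	}

	// DeleteSecurityGroup returns a DependencyViolation error while any of the
	// above remain, so surfacing that list is one way to give clearer feedback.
	_, err = svc.DeleteSecurityGroup(&ec2.DeleteSecurityGroupInput{GroupId: aws.String(sgID)})
	fmt.Println(err)
}
~~~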
fatal error: concurrent map read and map write

goroutine 31 [running]:
k8s.io/apimachinery/pkg/runtime.(*Scheme).New(0xc0002e81c0, {{0x0, 0x0}, {0xc000bd17c6, 0x2}, {0xc000bd17c0, 0x6}})
	/go/src/github.com/openshift/cloud-network-config-controller/vendor/k8s.io/apimachinery/pkg/runtime/scheme.go:296 +0x67

goroutine 82 [runnable]:
k8s.io/apimachinery/pkg/runtime.(*Scheme).AddKnownTypeWithName(0xc0002e81c0, {{0x2ede1b9, 0x1a}, {0x2eb1bb8, 0x2}, {0x2ebb205, 0xa}}, {0x3388ab8?, 0xc00062e540})
	/go/src/github.com/openshift/cloud-network-config-controller/vendor/k8s.io/apimachinery/pkg/runtime/scheme.go:176 +0x2b2
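The trace shows Scheme.New reading the scheme's type maps while AddKnownTypeWithName is still writing to them. The general fix is either to finish all scheme registration before the scheme is used concurrently, or to guard the shared map with a lock. A generic sketch of the locked-map variant, illustrative only and not the apimachinery code:

~~~
package main

import (
	"fmt"
	"sync"
)

// typeRegistry sketches the safe pattern: registrations and lookups of a shared
// map are serialized behind a RWMutex instead of racing on a bare map.
type typeRegistry struct {
	mu    sync.RWMutex
	types map[string]string
}

func (r *typeRegistry) add(kind, goType string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.types[kind] = goType
}

func (r *typeRegistry) lookup(kind string) (string, bool) {
	r.mu.RLock()
	defer r.mu.RUnlock()
	t, ok := r.types[kind]
	return t, ok
}

func main() {
	reg := &typeRegistry{types: map[string]string{}}
	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func(i int) { // safe under -race, unlike concurrent access to a bare map
			defer wg.Done()
			reg.add(fmt.Sprintf("Kind%d", i), "pkg.Type")
			reg.lookup("Kind0")
		}(i)
	}
	wg.Wait()
	fmt.Println(len(reg.types))
}
~~~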
Description of problem:
Gateway API was added as a DevPreviewNoUpgrade feature before recent changes to the FeatureGate framework, and has not progressed to TechPreviewNoUpgrade. When the FeatureGate framework changed, Gateway API was mistakenly listed as a TechPreviewNoUpgrade feature in https://github.com/openshift/api/blob/master/features/features.go#L71
For 4.16 we are adding TechPreview testing to the cluster-ingress-operator for other features, and we do not want to test Gateway API as a tech preview feature. In fact, it has its own separate test.
Version-Release number of selected component (if applicable):
4.16
How reproducible:
N/A
Steps to Reproduce:
N/A
Actual results:
Gateway API is tested as a Tech Preview feature
Expected results:
Gateway API should only be tested as a Dev Preview feature.
Additional info:
This is a clone of issue OCPBUGS-33631. The following is the description of the original issue:
—
Description of problem:
Currently we show the debug container action for pods that are failing. We should be showing the action also for pods in 'Succeeded' phase
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. Log in to a cluster
2. Create an example Job resource
3. Check the job's pod and wait till it is in 'Succeeded' phase
Actual results:
Debug container action is not available, on the pod's Logs page
Expected results:
Debug container action is available, on the pod's Logs page
Additional info:
Since users are looking for this feature even for pods in any phase, we are treating this issue as bug. Related stories: RFE - https://issues.redhat.com/browse/RFE-1935 STORY - https://issues.redhat.com/browse/CONSOLE-4057 Code that needs to be removed - https://github.com/openshift/console/blob/ae115a9e8c72f930a67ee0c545d36f883cd6be34/frontend/public/components/utils/resource-log.tsx#L149-L151
This is a clone of issue OCPBUGS-35994. The following is the description of the original issue:
—
To reduce QE load, we've decided to block up the hole drilled in OCPBUGS-24535. We might not want a pure revert, if some of the changes are helpful (e.g. more helpful error messages).
We also want to drop the oc adm upgrade rollback subcommand which was the client-side tooling associated with the OCPBUGS-24535 hole.
Both 4.16 and 4.17 currently have the rollback subcommand and associated CVO-side hole.
Every time.
Try to perform the rollbacks that OCPBUGS-24535 allowed.
They work, as verified in OCPBUGS-24535.
They stop working, with reasonable ClusterVersion conditions explaining that even those rollback requests will not be accepted.
Please review the following PR: https://github.com/openshift/console-operator/pull/823
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
When the user clicks ‘Cancel’ on any Secret creation page, it doesn’t return to the Secrets list page.
Version-Release number of selected component (if applicable):
4.15.0-0.nightly-2024-01-06-062415
How reproducible:
Always
Steps to Reproduce:
1. Go to Create Key/value secret|Image pull secret|Source secret|Webhook secret|FromYaml page eg: /k8s/ns/default/secrets/~new/generic
2. Click Cancel button
3.
Actual results:
The page does not go back to Secrets list page eg: /k8s/ns/default/core~v1~Secret
Expected results:
The page should go back to the Secrets list page
Additional info:
Please review the following PR: https://github.com/openshift/azure-file-csi-driver/pull/60
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Seen in CI:
I0409 09:52:54.280834 1 builder.go:299] openshift-cluster-etcd-operator version v0.0.0-alpha.0-1430-g3d5483e-3d5483e1
...
E0409 10:08:08.921203 1 runtime.go:79] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
goroutine 1581 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic({0x28cd3c0?, 0x4b191e0})
	k8s.io/apimachinery@v0.29.0/pkg/util/runtime/runtime.go:75 +0x85
k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0xc0016eccd0, 0x1, 0x27036c0?})
	k8s.io/apimachinery@v0.29.0/pkg/util/runtime/runtime.go:49 +0x6b
panic({0x28cd3c0?, 0x4b191e0?})
	runtime/panic.go:914 +0x21f
github.com/openshift/cluster-etcd-operator/pkg/operator/etcdcertsigner.addCertSecretToMap(0x0?, 0x0)
	github.com/openshift/cluster-etcd-operator/pkg/operator/etcdcertsigner/etcdcertsignercontroller.go:341 +0x27
github.com/openshift/cluster-etcd-operator/pkg/operator/etcdcertsigner.(*EtcdCertSignerController).syncAllMasterCertificates(0xc000521ea0, {0x32731e8, 0xc0006fd1d0}, {0x3280cb0, 0xc000194ee0})
	github.com/openshift/cluster-etcd-operator/pkg/operator/etcdcertsigner/etcdcertsignercontroller.go:252 +0xa65
...
It looks like syncAllMasterCertificates needs to be skipping the addCertSecretToMap calls for certs where EnsureTargetCertKeyPair returned an error.
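A minimal sketch of that fix shape, using simplified stand-in types rather than the operator's real ones:

~~~
package main

import (
	"errors"
	"fmt"
)

type certSigner struct {
	name string
	fail bool
}

func (c certSigner) ensureTargetCertKeyPair() (map[string][]byte, error) {
	if c.fail {
		return nil, errors.New("signing failed for " + c.name)
	}
	return map[string][]byte{"tls.crt": {}, "tls.key": {}}, nil
}

func main() {
	signers := []certSigner{{name: "etcd-peer"}, {name: "etcd-serving", fail: true}}
	certMap := map[string]map[string][]byte{}
	var errs []error

	for _, s := range signers {
		secret, err := s.ensureTargetCertKeyPair()
		if err != nil {
			errs = append(errs, err)
			continue // skip the map update instead of passing a nil result along
		}
		certMap[s.name] = secret // the analogue of addCertSecretToMap
	}
	fmt.Println(len(certMap), errs)
}
~~~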
Description of problem:
Trying to execute https://github.com/openshift-metal3/dev-scripts to deploy an OCP 4.16 or 4.17 cluster (with the same configuration OCP 4.14 and 4.15 are instead working) with: MIRROR_IMAGES=true INSTALLER_PROXY=true the bootstrap process fails with: level=debug msg= baremetalhost resource not yet available, will retry level=debug msg= baremetalhost resource not yet available, will retry level=info msg= baremetalhost: ostest-master-0: uninitialized level=info msg= baremetalhost: ostest-master-0: registering level=info msg= baremetalhost: ostest-master-1: uninitialized level=info msg= baremetalhost: ostest-master-1: registering level=info msg= baremetalhost: ostest-master-2: uninitialized level=info msg= baremetalhost: ostest-master-2: registering level=info msg= baremetalhost: ostest-master-1: inspecting level=info msg= baremetalhost: ostest-master-2: inspecting level=info msg= baremetalhost: ostest-master-0: inspecting E0514 12:16:51.985417 89709 reflector.go:147] k8s.io/client-go/tools/watch/informerwatcher.go:146: Failed to watch *unstructured.Unstructured: Get "https://api.ostest.test.metalkube.org:6443/apis/metal3.io/v1alpha1/namespaces/openshift-machine-api/baremetalhosts?allowWatchBookmarks=true&resourceVersion=5466&timeoutSeconds=547&watch=true": Service Unavailable W0514 12:16:52.979254 89709 reflector.go:539] k8s.io/client-go/tools/watch/informerwatcher.go:146: failed to list *unstructured.Unstructured: Get "https://api.ostest.test.metalkube.org:6443/apis/metal3.io/v1alpha1/namespaces/openshift-machine-api/baremetalhosts?resourceVersion=5466": Service Unavailable E0514 12:16:52.979293 89709 reflector.go:147] k8s.io/client-go/tools/watch/informerwatcher.go:146: Failed to watch *unstructured.Unstructured: failed to list *unstructured.Unstructured: Get "https://api.ostest.test.metalkube.org:6443/apis/metal3.io/v1alpha1/namespaces/openshift-machine-api/baremetalhosts?resourceVersion=5466": Service Unavailable E0514 12:37:01.927140 89709 reflector.go:147] k8s.io/client-go/tools/watch/informerwatcher.go:146: Failed to watch *unstructured.Unstructured: Get "https://api.ostest.test.metalkube.org:6443/apis/metal3.io/v1alpha1/namespaces/openshift-machine-api/baremetalhosts?allowWatchBookmarks=true&resourceVersion=7800&timeoutSeconds=383&watch=true": Service Unavailable W0514 12:37:03.173425 89709 reflector.go:539] k8s.io/client-go/tools/watch/informerwatcher.go:146: failed to list *unstructured.Unstructured: Get "https://api.ostest.test.metalkube.org:6443/apis/metal3.io/v1alpha1/namespaces/openshift-machine-api/baremetalhosts?resourceVersion=7800": Service Unavailable E0514 12:37:03.173473 89709 reflector.go:147] k8s.io/client-go/tools/watch/informerwatcher.go:146: Failed to watch *unstructured.Unstructured: failed to list *unstructured.Unstructured: Get "https://api.ostest.test.metalkube.org:6443/apis/metal3.io/v1alpha1/namespaces/openshift-machine-api/baremetalhosts?resourceVersion=7800": Service Unavailable level=debug msg=Fetching Bootstrap SSH Key Pair... level=debug msg=Loading Bootstrap SSH Key Pair... it looks like up to a certain point https://api.ostest.test.metalkube.org:6443 was reachable but then for some reason it started failing because its not using the proxy or is and it shouldn't be (???) 
The 3 master nodes are reported as:
[root@ipi-ci-op-0qigcrln-b54ee-1790684582253694976 home]# oc get baremetalhosts -A
NAMESPACE               NAME              STATE        CONSUMER                ONLINE   ERROR              AGE
openshift-machine-api   ostest-master-0   inspecting   ostest-bbhxb-master-0   true     inspection error   24m
openshift-machine-api   ostest-master-1   inspecting   ostest-bbhxb-master-1   true     inspection error   24m
openshift-machine-api   ostest-master-2   inspecting   ostest-bbhxb-master-2   true     inspection error   24m
With something like the following on their status:
status:
  errorCount: 5
  errorMessage: 'Failed to inspect hardware. Reason: unable to start inspection: Validation of image href http://0.0.0.0:8084/34427934-f1a6-48d6-9666-66872eec9ba2 failed, reason: Got HTTP code 503 instead of 200 in response to HEAD request.'
  errorType: inspection error
Version-Release number of selected component (if applicable):
4.16, 4.17
How reproducible:
100%
Steps to Reproduce:
1. Try to create an OCP 4.16 cluster with dev-scripts with IP_STACK=v4, MIRROR_IMAGES=true and INSTALLER_PROXY=true
2.
3.
Actual results:
level=info msg= baremetalhost: ostest-master-0: inspecting E0514 12:16:51.985417 89709 reflector.go:147] k8s.io/client-go/tools/watch/informerwatcher.go:146: Failed to watch *unstructured.Unstructured: Get "https://api.ostest.test.metalkube.org:6443/apis/metal3.io/v1alpha1/namespaces/openshift-machine-api/baremetalhosts?allowWatchBookmarks=true&resourceVersion=5466&timeoutSeconds=547&watch=true": Service Unavailable
Expected results:
Successful deployment
Additional info:
I'm using IP_STACK=v4, MIRROR_IMAGES=true and INSTALLER_PROXY=true with the same configuration (MIRROR_IMAGES=true and INSTALLER_PROXY=true) OCP 4.14 and OCP 4.15 are working. When removing INSTALLER_PROXY=true, OCP 4.16 is also working. I'm going to attach bootstrap gather logs
Please review the following PR: https://github.com/openshift/machine-api-operator/pull/1187
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
External consumers of MachineSets(), such as hive, need to be able to customize the client that queries the OpenStack cloud for trunk support.
OSASINFRA-3420, eliminating what looked like tech debt, removed that enablement, which had been added via a revert of a previous similar removal.
Reinstate the customizability, and include a docstring explanation to hopefully prevent it being removed again.
Description of problem:
When the user clicks on the perspective switcher after a hard refresh, a flicker appears.
Version-Release number of selected component (if applicable):
4.15.0-0.nightly-2023-12-25-100326
How reproducible:
Always, after the user refreshes the console
Steps to Reproduce:
1. Log in to the OCP console
2. Refresh the whole console, then click the perspective switcher
3.
Actual results:
there is flicker when clicking on perspective switcher
Expected results:
no flickers
Additional info:
screen recording https://drive.google.com/file/d/1_2tPZ0DXNTapFP9sSz27vKbnwxxdWZSV/view?usp=drive_link
Please review the following PR: https://github.com/openshift/vmware-vsphere-csi-driver/pull/103
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
When upgrading cluster from 4.13.23 to 4.14.3, machine-config CO gets stuck due to a content mismatch error on all nodes. Node node-xxx-xxx is reporting: "unexpected on-disk state validating against rendered-master-734521b50f69a1602a3a657419ed4971: content mismatch for file \"/etc/pki/ca-trust/source/anchors/openshift-config-user-ca-bundle.crt\""
Version-Release number of selected component (if applicable):
How reproducible:
always
Steps to Reproduce:
1. Perform an upgrade from 4.13.x to 4.14.x
2.
3.
Actual results:
machine-config stalls during upgrade
Expected results:
the "content mismatch" shouldn't happen anymore according to the MCO engineering team
Additional info:
Description of problem:
Icons which were formerly blue are no longer blue.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
Actual results:
Expected results:
Additional info:
ccoctl consumes CredentialsRequests extracted from OpenShift releases and manages secrets associated with those requests for the cluster. Over time, ccoctl has grown a number of CredentialRequest filters, including deletion annotations in CCO-175 and tech-preview annotations in cco#444.
But with OTA-559, 4.14 and later oc adm release extract ... learned about an --included parameter, which allows oc to perform that "will the cluster need this credential?" filtering, and there is no longer a need for ccoctl to perform that filtering, or for ccoctl callers to have to think through "do I need to enable tech-preview CRs for this cluster or not?".
4.14 and later.
100%.
$ cat <<EOF >install-config.yaml
> apiVersion: v1
> platform:
>   gcp:
>     dummy: data
> featureSet: TechPreviewNoUpgrade
> EOF
$ oc adm release extract --included --credentials-requests --install-config install-config.yaml --to credentials-requests quay.io/openshift-release-dev/ocp-release:4.14.0-rc.2-x86_64
$ ccoctl gcp create-all --dry-run --name=test --region=test --project=test --credentials-requests-dir=credentials-requests
ccoctl doesn't dry-run create the TechPreviewNoUpgrade openshift-cluster-api-gcp CredentialsRequest unless you pass it --enable-tech-preview.
ccoctl does dry-run create the TechPreviewNoUpgrade openshift-cluster-api-gcp CredentialsRequest unless you pass it --enable-tech-preview=false.
Longer-term, we likely want to go through some phases of deprecating and maybe eventually removing --enable-tech-preview and the ccoctl-side filtering. But for now, I think we want to pivot to defaulting to true, so that anyone with existing flows that do not include the new --included extraction has an easy way to keep their workflow going (they can set --enable-tech-preview=false). And I think we should backport that to 4.14's ccoctl to simplify OSDOCS-4158's docs#62148. But we're close enough to 4.14's expected GA, that it's worth some consensus-building and alternative consideration, before trying to rush changes back to 4.14 branches.
Description of problem:
Deploying a compact 3-node cluster on GCP, by setting mastersSchedulable to true and removing the worker machineset YAMLs, results in a panic.
Version-Release number of selected component (if applicable):
$ openshift-install version openshift-install 4.13.0-0.nightly-2022-12-04-194803 built from commit cc689a21044a76020b82902056c55d2002e454bd release image registry.ci.openshift.org/ocp/release@sha256:9e61cdf7bd13b758343a3ba762cdea301f9b687737d77ef912c6788cbd6a67ea release architecture amd64
How reproducible:
Always
Steps to Reproduce:
1. create manifests
2. set 'spec.mastersSchedulable' as 'true', in <installation dir>/manifests/cluster-scheduler-02-config.yml
3. remove the worker machineset YAML file from <installation dir>/openshift directory
4. create cluster
Actual results:
Got "panic: runtime error: index out of range [0] with length 0".
Expected results:
The installation should succeed, or giving clear error messages.
Additional info:
$ openshift-install version openshift-install 4.13.0-0.nightly-2022-12-04-194803 built from commit cc689a21044a76020b82902056c55d2002e454bd release image registry.ci.openshift.org/ocp/release@sha256:9e61cdf7bd13b758343a3ba762cdea301f9b687737d77ef912c6788cbd6a67ea release architecture amd64 $ $ openshift-install create manifests --dir test1 ? SSH Public Key /home/fedora/.ssh/openshift-qe.pub ? Platform gcp INFO Credentials loaded from file "/home/fedora/.gcp/osServiceAccount.json" ? Project ID OpenShift QE (openshift-qe) ? Region us-central1 ? Base Domain qe.gcp.devcluster.openshift.com ? Cluster Name jiwei-1205a ? Pull Secret [? for help] ****** INFO Manifests created in: test1/manifests and test1/openshift $ $ vim test1/manifests/cluster-scheduler-02-config.yml $ yq-3.3.0 r test1/manifests/cluster-scheduler-02-config.yml spec.mastersSchedulable true $ $ rm -f test1/openshift/99_openshift-cluster-api_worker-machineset-?.yaml $ $ tree test1 test1 ├── manifests │ ├── cloud-controller-uid-config.yml │ ├── cloud-provider-config.yaml │ ├── cluster-config.yaml │ ├── cluster-dns-02-config.yml │ ├── cluster-infrastructure-02-config.yml │ ├── cluster-ingress-02-config.yml │ ├── cluster-network-01-crd.yml │ ├── cluster-network-02-config.yml │ ├── cluster-proxy-01-config.yaml │ ├── cluster-scheduler-02-config.yml │ ├── cvo-overrides.yaml │ ├── kube-cloud-config.yaml │ ├── kube-system-configmap-root-ca.yaml │ ├── machine-config-server-tls-secret.yaml │ └── openshift-config-secret-pull-secret.yaml └── openshift ├── 99_cloud-creds-secret.yaml ├── 99_kubeadmin-password-secret.yaml ├── 99_openshift-cluster-api_master-machines-0.yaml ├── 99_openshift-cluster-api_master-machines-1.yaml ├── 99_openshift-cluster-api_master-machines-2.yaml ├── 99_openshift-cluster-api_master-user-data-secret.yaml ├── 99_openshift-cluster-api_worker-user-data-secret.yaml ├── 99_openshift-machineconfig_99-master-ssh.yaml ├── 99_openshift-machineconfig_99-worker-ssh.yaml ├── 99_role-cloud-creds-secret-reader.yaml └── openshift-install-manifests.yaml2 directories, 26 files $ $ openshift-install create cluster --dir test1 INFO Consuming Openshift Manifests from target directory INFO Consuming Master Machines from target directory INFO Consuming Worker Machines from target directory INFO Consuming OpenShift Install (Manifests) from target directory INFO Consuming Common Manifests from target directory INFO Credentials loaded from file "/home/fedora/.gcp/osServiceAccount.json" panic: runtime error: index out of range [0] with length 0goroutine 1 [running]: github.com/openshift/installer/pkg/tfvars/gcp.TFVars({{{0xc000cf6a40, 0xc}, {0x0, 0x0}, {0xc0011d4a80, 0x91d}}, 0x1, 0x1, {0xc0010abda0, 0x58}, ...}) /go/src/github.com/openshift/installer/pkg/tfvars/gcp/gcp.go:70 +0x66f github.com/openshift/installer/pkg/asset/cluster.(*TerraformVariables).Generate(0x1daff070, 0xc000cef530?) 
/go/src/github.com/openshift/installer/pkg/asset/cluster/tfvars.go:479 +0x6bf8 github.com/openshift/installer/pkg/asset/store.(*storeImpl).fetch(0xc000c78870, {0x1a777f40, 0x1daff070}, {0x0, 0x0}) /go/src/github.com/openshift/installer/pkg/asset/store/store.go:226 +0x5fa github.com/openshift/installer/pkg/asset/store.(*storeImpl).Fetch(0x7ffc4c21413b?, {0x1a777f40, 0x1daff070}, {0x1dadc7e0, 0x8, 0x8}) /go/src/github.com/openshift/installer/pkg/asset/store/store.go:76 +0x48 main.runTargetCmd.func1({0x7ffc4c21413b, 0x5}) /go/src/github.com/openshift/installer/cmd/openshift-install/create.go:259 +0x125 main.runTargetCmd.func2(0x1dae27a0?, {0xc000c702c0?, 0x2?, 0x2?}) /go/src/github.com/openshift/installer/cmd/openshift-install/create.go:289 +0xe7 github.com/spf13/cobra.(*Command).execute(0x1dae27a0, {0xc000c70280, 0x2, 0x2}) /go/src/github.com/openshift/installer/vendor/github.com/spf13/cobra/command.go:876 +0x67b github.com/spf13/cobra.(*Command).ExecuteC(0xc000c3a500) /go/src/github.com/openshift/installer/vendor/github.com/spf13/cobra/command.go:990 +0x3bd github.com/spf13/cobra.(*Command).Execute(...) /go/src/github.com/openshift/installer/vendor/github.com/spf13/cobra/command.go:918 main.installerMain() /go/src/github.com/openshift/installer/cmd/openshift-install/main.go:61 +0x2b0 main.main() /go/src/github.com/openshift/installer/cmd/openshift-install/main.go:38 +0xff $
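The panic points at indexing the first worker machine config when the slice is empty. A minimal sketch of the kind of guard that avoids it, using a simplified stand-in type rather than the installer's real ones:

~~~
package main

import (
	"errors"
	"fmt"
)

// gcpWorkerConfig is a simplified stand-in for the worker machine provider
// config the installer derives Terraform variables from; illustrative only.
type gcpWorkerConfig struct {
	MachineType string
}

// workerMachineType sketches the missing guard: a compact (3-node) cluster has
// no worker MachineSets, so indexing [0] on an empty slice must be replaced by
// a sensible default or a clear error message.
func workerMachineType(workers []gcpWorkerConfig) (string, error) {
	if len(workers) == 0 {
		return "", errors.New("no worker machine configs found; fall back to defaults instead of indexing [0]")
	}
	return workers[0].MachineType, nil
}

func main() {
	_, err := workerMachineType(nil)
	fmt.Println(err) // a clear error instead of "index out of range [0] with length 0"
}
~~~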
Description of problem:
When migrating an OpenShift cluster to Azure AD Workload Identity, there are not sufficient permissions to apply the Azure Pod Identity webhook configuration.
Version-Release number of selected component (if applicable):
4.16
How reproducible:
Always
Steps to Reproduce:
1. Follow the steps provided in the documentation: https://github.com/openshift/cloud-credential-operator/blob/master/docs/azure_workload_identity.md#steps-to-in-place-migrate-an-openshift-cluster-to-azure-ad-workload-identity
2. At step 10, applying the azure pod identity webhook configuration fails.
Actual results:
For step10: [hmx@fedora CCO]$ oc replace -f ./CCO-456/output_dir/manifests/azure-ad-pod-identity-webhook-config.yaml Error from server (NotFound): error when replacing "./CCO-456/output_dir/manifests/azure-ad-pod-identity-webhook-config.yaml": secrets "azure-credentials" not found [hmx@fedora CCO]$ oc get po -n openshift-cloud-credential-operator NAME READY STATUS RESTARTS AGE cloud-credential-operator-594bf555b4-6srcq 2/2 Running 0 3h32m [hmx@fedora CCO]$ oc logs cloud-credential-operator-594bf555b4-6srcq -n openshift-cloud-credential-operator Defaulted container "kube-rbac-proxy" out of: kube-rbac-proxy, cloud-credential-operator Flag --logtostderr has been deprecated, will be removed in a future release, see https://github.com/kubernetes/enhancements/tree/master/keps/sig-instrumentation/2845-deprecate-klog-specific-flags-in-k8s-components I0410 06:41:25.490507 1 kube-rbac-proxy.go:285] Valid token audiences: I0410 06:41:25.490752 1 kube-rbac-proxy.go:399] Reading certificate files I0410 06:41:25.491607 1 kube-rbac-proxy.go:447] Starting TCP socket on 0.0.0.0:8443 I0410 06:41:25.492241 1 kube-rbac-proxy.go:454] Listening securely on 0.0.0.0:8443 E0410 06:41:52.996659 1 webhook.go:154] Failed to make webhook authenticator request: Unauthorized E0410 06:41:52.997568 1 auth.go:47] Unable to authenticate the request due to an error: Unauthorized E0410 06:42:15.871706 1 webhook.go:154] Failed to make webhook authenticator request: Unauthorized E0410 06:42:15.871754 1 auth.go:47] Unable to authenticate the request due to an error: Unauthorized
Expected results:
Apply the azure pod identity webhook configuration successfully.
Additional info:
This is a clone of issue OCPBUGS-33834. The following is the description of the original issue:
—
4.16.0-0.nightly-2024-05-16-165920 aws-sdn-upgrade failures in 1791152612112863232
Undiagnosed panic detected in pod
{ pods/openshift-controller-manager_controller-manager-8d46bf695-cvdc6_controller-manager.log.gz:E0516 17:36:26.515398 1 runtime.go:79] Observed a panic: &runtime.TypeAssertionError{_interface:(*abi.Type)(0x3ca66c0), concrete:(*abi.Type)(0x3e9f720), asserted:(*abi.Type)(0x41dd660), missingMethod:""} (interface conversion: interface {} is cache.DeletedFinalStateUnknown, not *v1.Secret)
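The panic is the classic missing tombstone handling in an informer delete handler. A minimal sketch of the standard pattern, illustrative only and not the openshift-controller-manager code:

~~~
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/tools/cache"
)

func onSecretDelete(obj interface{}) {
	secret, ok := obj.(*corev1.Secret)
	if !ok {
		// On missed deletes the informer hands us a tombstone, not the object itself.
		tombstone, ok := obj.(cache.DeletedFinalStateUnknown)
		if !ok {
			fmt.Printf("unexpected object type %T\n", obj)
			return
		}
		secret, ok = tombstone.Obj.(*corev1.Secret)
		if !ok {
			fmt.Printf("tombstone contained unexpected object %T\n", tombstone.Obj)
			return
		}
	}
	fmt.Println("handling deletion of secret", secret.Name)
}

func main() {
	// Simulate the case from the log: the handler receives a tombstone.
	onSecretDelete(cache.DeletedFinalStateUnknown{Key: "ns/name", Obj: &corev1.Secret{}})
}
~~~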
This is a clone of issue OCPBUGS-31878. The following is the description of the original issue:
—
The multus-admission-controller does not retain its container resource requests/limits if manually set. The cluster-network-operator overwrites any modifications on the next reconciliation. This resource preservation support has already been added to all other components in https://github.com/openshift/hypershift/pull/1082 and https://github.com/openshift/hypershift/pull/3120. Similar changes should be made for the multus-admission-controller so all hosted control plane components demonstrate the same resource preservation behavior.
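For reference, the preservation behavior in question is roughly the following shape, sketched with simplified names (not the actual hypershift helper):

~~~
package main

import (
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// preserveContainerResources keeps any requests/limits that were set on the
// live Deployment instead of overwriting them during reconciliation.
func preserveContainerResources(existing, desired *appsv1.Deployment) {
	for i := range desired.Spec.Template.Spec.Containers {
		dc := &desired.Spec.Template.Spec.Containers[i]
		for _, ec := range existing.Spec.Template.Spec.Containers {
			if ec.Name == dc.Name && (len(ec.Resources.Requests) > 0 || len(ec.Resources.Limits) > 0) {
				dc.Resources = ec.Resources // keep what was set on the live object
			}
		}
	}
}

func main() {
	existing := &appsv1.Deployment{}
	existing.Spec.Template.Spec.Containers = []corev1.Container{{
		Name: "multus-admission-controller",
		Resources: corev1.ResourceRequirements{
			Requests: corev1.ResourceList{corev1.ResourceMemory: resource.MustParse("100Mi")},
		},
	}}
	desired := &appsv1.Deployment{}
	desired.Spec.Template.Spec.Containers = []corev1.Container{{Name: "multus-admission-controller"}}

	preserveContainerResources(existing, desired)
	fmt.Println(desired.Spec.Template.Spec.Containers[0].Resources.Requests)
}
~~~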
Description of problem:
When trying to deploy with an Internal publish strategy, DNS will fail because the proxy VM cannot launch.
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. Set publishStrategy: Internal 2. Fail 3.
Actual results:
terraform fails
Expected results:
private cluster launches
Additional info:
Description of problem:
Following signing-key deletion, there is a service CA rotation process which might temporarily disrupt platform components, but eventually all should use the updated certificates.
Version-Release number of selected component (if applicable):
4.14.0-0.nightly-2023-06-30-131338 and other recent 4.14 nightlies
How reproducible:
100%
Steps to Reproduce:
1. oc delete secret/signing-key -n openshift-service-ca
2. Reload the management console
3.
Actual results:
The Observe tab disappears from the menu bar and the monitoring-plugin shows as unavailable.
Expected results:
No disruption
Additional info:
using manual deletion of the monitoring-plugin pods it is possible to recover the situation
Description of problem:
The ConsolePluginComponents CMO task depends on the availability of a conversion service that is part of the console-operator Pod. That Pod is not duplicated, so when it restarts (due to a cluster upgrade or otherwise) the conversion webhook becomes unavailable and all ConsolePlugin API queries from that CMO task fail.
Version-Release number of selected component (if applicable):
How reproducible:
Create a 4.14 cluster, make the console-operator unmanaged and bring it down, watch the ConsolePluginComponents tasks fail instantly after they're run.
Steps to Reproduce:
1. 2. 3.
Actual results:
The ConsolePluginComponents tasks fail instantly after they're run.
Expected results:
The tasks should be more resilient and retry. The long term solution is for that ConsolePlugin conversion service to be duplicated.
Additional info:
For OCP >=4.15, CMO v1 ConsolePlugin queries no longer rely on the conversion webhook because of https://github.com/openshift/api/pull/1477. But the retries will keep the task future-proof, and we'll be able to backport the fix.
This is a clone of issue OCPBUGS-35469. The following is the description of the original issue:
—
Description of problem:
As described in https://issues.redhat.com/browse/OCPQE-22479.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
Standalone OCP encrypts various resources at rest in etcd: https://docs.openshift.com/container-platform/4.14/security/encrypting-etcd.html HyperShift control planes encrypt only secrets. We should have parity with standalone.
Version-Release number of selected component (if applicable):
4.14
How reproducible:
Always
Steps to Reproduce:
1. Create HyperShift standalone control plane 2. Check that configmaps, routes, oauth access tokens or oauth authorize tokens are encrypted
Actual results:
Those resources are not encrypted
Expected results:
Those resources are encrypted
Additional info:
Resources to be encrypted are configured here: https://github.com/openshift/hypershift/blob/main/control-plane-operator/controllers/hostedcontrolplane/kas/kms/aws.go#L121-L126
Description of problem:
In https://github.com/openshift/cluster-monitoring-operator/blob/release-4.16/Makefile#L266 the line reads
@echo Installing tools from hack/tools.go
but since the tools file lives at https://github.com/openshift/cluster-monitoring-operator/blob/release-4.16/hack/tools/tools.go, it should read
@echo Installing tools from hack/tools/tools.go
Version-Release number of selected component (if applicable):
4.16
How reproducible:
always
Steps to Reproduce:
1. see the description
Actual results:
hack/tools.go path is wrong in Makefile
Expected results:
should be hack/tools/tools.go
This is a clone of issue OCPBUGS-34365. The following is the description of the original issue:
—
Description of problem:
The security-groups.yaml playbook runs the IPv6 security group rule creation tasks regardless of the os_subnet6 value. The when clause does not take os_subnet6 [1] into account, so the tasks are always executed.
It works with:
- name: 'Create security groups for IPv6'
  block:
  - name: 'Create master-sg IPv6 rule "OpenShift API"'
    [...]
  when: os_subnet6 is defined
Version-Release number of selected component (if applicable):
4.15.0-0.nightly-2023-12-11-033133
How reproducible:
Always
Steps to Reproduce:
1. Don't set the os_subnet6 in the inventory file [2] (so it's not dual-stack) 2. Deploy 4.15 UPI by running the UPI playbooks
Actual results:
IPv6 security group rules are created
Expected results:
IPv6 security group rules shouldn't be created
Additional info:
[1] https://github.com/openshift/installer/blob/46fd66272538c350327880e1ed261b70401b406e/upi/openstack/security-groups.yaml#L375
[2] https://github.com/openshift/installer/blob/46fd66272538c350327880e1ed261b70401b406e/upi/openstack/inventory.yaml#L77
Tracker issue for bootimage bump in 4.16. This issue should block issues which need a bootimage bump to fix.
The previous bump was OCPBUGS-29441.
Description of problem:
After changing some of the JS dependencies in the Console frontend, the check-patternfly-modules.sh script may report false positives because yarn install modifies the yarn.lock file in a way that the script does not expect.
We should fix the check-patternfly-modules.sh script to parse the output of the yarn why command instead.
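As a rough illustration of the proposed approach (the package name below is only an example, not necessarily what the script would check):
# inspect why a given PatternFly module version is installed instead of diffing yarn.lock
yarn why @patternfly/react-core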
4.16 etcd v1 FlowSchema is unrecognized by the outgoing 4.15 Kube API server, causing noise during 4.15 to 4.16 updates.
4.16.
Every time.
1. Install a 4.15 cluster.
2. Update to 4.16, while streaming CVO logs.
3. Search streamed CVO logs for Could not update flowschema "openshift-etcd-operator" ...the server does not recognize this resource, check extension API servers
A bunch of hits, until the Kube API server finishes updating to 4.16.
No distracting noise.
Evgeni noticed this in OTA-1246, and Petr root-caused it.
OCPBUGS-22969 moved a bunch of FlowSchema to v1, and for most of them, it won't be a problem. But etcd and Kube API server update in parallel in run-level 20, so the CVO pushing out the etcd manifest can race Kube API server being new enough to understand it. One possible solution would be sticking with v1beta3 for 4.16 and moving to v1 in 4.17 (like our customers will have to do). Another solution would be moving the 4.16 manifest out to run-level 31 (e.g. by renaming 0000_20_etcd-operator_10_flowschema.yaml to 0000_21_etcd-operator_10_flowschema.yaml) after branch-fork, since the need would be unique to the 4.16 branch.
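If the run-level rename option were chosen, the change on the 4.16 branch would look roughly like the following (repository and manifest path are assumptions for illustration):
# in the etcd operator repository (path assumed)
git mv manifests/0000_20_etcd-operator_10_flowschema.yaml manifests/0000_21_etcd-operator_10_flowschema.yaml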
This is a clone of issue OCPBUGS-34475. The following is the description of the original issue:
—
Description of problem:
When running a conformance suite against a hypershift cluster (for example, CNI conformance) the MonitorTests step fails because of missing files from the disruption monitor.
Version-Release number of selected component (if applicable):
4.15.13
How reproducible:
Consistent
Steps to Reproduce:
1. Create a hypershift cluster 2. Attempt to run an ose-tests suite. For example, the CNI conformance suite documented here: https://access.redhat.com/documentation/en-us/red_hat_software_certification/2024/html/red_hat_software_certification_workflow_guide/con_cni-certification_openshift-sw-cert-workflow-working-with-cloud-native-network-function#running-the-cni-tests_openshift-sw-cert-workflow-working-with-container-network-interface 3. Note errors in logs
Actual results:
found errors fetching in-cluster data: [failed to list files in disruption event folder on node ip-10-0-130-177.us-west-2.compute.internal: the server could not find the requested resource failed to list files in disruption event folder on node ip-10-0-152-10.us-west-2.compute.internal: the server could not find the requested resource] Failed to write events from in-cluster monitors, err: open /tmp/artifacts/junit/AdditionalEvents__in_cluster_disruption.json: no such file or directory
Expected results:
No errors
Additional info:
The first error can be avoided by creating the directory it's looking for on all nodes: for node in $(oc get nodes -oname); do oc debug -n default $node -- chroot /host mkdir -p /var/log/disruption-data/monitor-events; done However, I'm not sure whether this directory not being created means the disruption monitor is not working properly on hypershift, or whether this should be skipped on hypershift entirely. The second error is related to the ARTIFACT_DIR env var not being set locally, and can be avoided by creating a directory, setting that directory as the ARTIFACT_DIR, and then creating an empty "junit" dir inside of it. It looks like ARTIFACT_DIR defaults to a temporary directory if it's not set in the env, but the "junit" directory doesn't exist inside of it, so file creation in that non-existent directory fails.
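A minimal sketch of the second workaround described above:
# create a scratch artifact dir with the junit subdirectory the monitor expects
export ARTIFACT_DIR="$(mktemp -d)"
mkdir -p "${ARTIFACT_DIR}/junit"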
This is a clone of issue OCPBUGS-42256. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-41631. The following is the description of the original issue:
—
Description of problem:
Panic seen in below CI job when run the below command
$ w3m -dump -cols 200 'https://search.dptools.openshift.org/?name=^periodic&type=junit&search=machine-config-controller.*Observed+a+panic' | grep 'failures match' periodic-ci-openshift-insights-operator-stage-insights-operator-e2e-tests-periodic (all) - 1 runs, 100% failed, 100% of failures match = 100% impact periodic-ci-openshift-insights-operator-release-4.17-insights-operator-e2e-tests-periodic (all) - 2 runs, 100% failed, 50% of failures match = 50% impact
Panic observed:
E0910 09:00:04.283647 1 runtime.go:79] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference) goroutine 268 [running]: k8s.io/apimachinery/pkg/util/runtime.logPanic({0x36c8b40, 0x5660c90}) /go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:75 +0x85 k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0xc000ce8540?}) /go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:49 +0x6b panic({0x36c8b40?, 0x5660c90?}) /usr/lib/golang/src/runtime/panic.go:770 +0x132 github.com/openshift/machine-config-operator/pkg/controller/node.(*Controller).updateNode(0xc000d6e360, {0x3abd580?, 0xc00224a608}, {0x3abd580?, 0xc001bd2308}) /go/src/github.com/openshift/machine-config-operator/pkg/controller/node/node_controller.go:585 +0x1f3 k8s.io/client-go/tools/cache.ResourceEventHandlerFuncs.OnUpdate(...) /go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/client-go/tools/cache/controller.go:246 k8s.io/client-go/tools/cache.(*processorListener).run.func1() /go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/client-go/tools/cache/shared_informer.go:976 +0xea k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0x30?) /go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:226 +0x33 k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc001933f70, {0x3faaba0, 0xc000759710}, 0x1, 0xc00097bda0) /go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:227 +0xaf k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc000750f70, 0x3b9aca00, 0x0, 0x1, 0xc00097bda0) /go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:204 +0x7f k8s.io/apimachinery/pkg/util/wait.Until(...) /go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:161 k8s.io/client-go/tools/cache.(*processorListener).run(0xc000dc2630) /go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/client-go/tools/cache/shared_informer.go:972 +0x69 k8s.io/apimachinery/pkg/util/wait.(*Group).Start.func1() /go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:72 +0x52 created by k8s.io/apimachinery/pkg/util/wait.(*Group).Start in goroutine 261 /go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:70 +0x73 panic: runtime error: invalid memory address or nil pointer dereference [recovered] panic: runtime error: invalid memory address or nil pointer dereference [signal SIGSEGV: segmentation violation code=0x1 addr=0x28 pc=0x33204b3]
Version-Release number of selected component (if applicable):
How reproducible:
Seen in this CI run: https://prow.ci.openshift.org/job-history/test-platform-results/logs/periodic-ci-openshift-insights-operator-stage-insights-operator-e2e-tests-periodic
Steps to Reproduce:
$ w3m -dump -cols 200 'https://search.dptools.openshift.org/?name=^periodic&type=junit&search=machine-config-controller.*Observed+a+panic' | grep 'failures match'
Actual results:
Expected results:
No panic to observe
Additional info:
In the "request-serving" deployment model for HCPs, the request-serving nodes are being memory starved by the k8s API server. This has an observed impact on limiting the number of nodes a guest HCP cluster can provision, especially during upgrade events.
This is a spike card to investigate setting the API Priority and Fairness [1] configuration, and to determine exactly what configuration would be necessary to set.
[1] https://kubernetes.io/docs/concepts/cluster-administration/flow-control/
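As a starting point for the spike, the current APF configuration served by a control plane can be inspected with the standard flow-control resources (nothing HCP-specific assumed here):
# list the FlowSchemas and PriorityLevelConfigurations currently in effect
oc get flowschemas.flowcontrol.apiserver.k8s.io
oc get prioritylevelconfigurations.flowcontrol.apiserver.k8s.io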
Please review the following PR: https://github.com/openshift/cluster-storage-operator/pull/432
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
https://github.com/openshift/origin/pull/28484
Broken for 3 weeks, possibly for different reasons.
A unit test is failing now; it looks like 4.16 data is not getting picked up, and it probably should be.
This is a clone of issue OCPBUGS-34524. The following is the description of the original issue:
—
Description of problem:
Enabling the feature gate for egressfirewall post-install doesn't work
Version-Release number of selected component (if applicable):
4.16
How reproducible:
Always
Steps to Reproduce:
1. Setup 4.16 ovn cluster
2. Following doc to enable feature gate https://docs.openshift.com/container-platform/4.15/nodes/clusters/nodes-cluster-enabling-features.html#nodes-cluster-enabling-features-cli_nodes-cluster-enabling
3. Configure egressfirewall with dnsName
Actual results:
no dnsnameresolver under openshift-ovn-kubernetes
Expected results:
The feature is enabled and should have dnsnameresolver
Additional info:
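A quick verification sketch for this report, using the resource name from the actual/expected results above:
# expected to return DNSNameResolver objects once the feature gate takes effect
oc get dnsnameresolver -n openshift-ovn-kubernetes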
Please review the following PR: https://github.com/openshift/ibm-powervs-block-csi-driver/pull/68
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
Agent CI jobs (compact and HA) are currently experiencing failures because the control-plane-machine-set operator is degraded, despite the SNO cluster operating normally.
Version-Release number of selected component (if applicable):
4.16
How reproducible:
100%
Actual results:
level=info msg=Cluster operator control-plane-machine-set Available is False with UnavailableReplicas: Missing 3 available replica(s)124level=error msg=Cluster operator control-plane-machine-set Degraded is True with UnmanagedNodes: Found 3 unmanaged node(s)125level=info msg=Cluster operator csi-snapshot-controller EvaluationConditionsDetected is Unknown with NoData: 126level=info msg=Cluster operator etcd EvaluationConditionsDetected is Unknown with NoData: 127level=info msg=Cluster operator ingress EvaluationConditionsDetected is False with AsExpected: 128level=info msg=Cluster operator insights ClusterTransferAvailable is False with NoClusterTransfer: no available cluster transfer129level=info msg=Cluster operator insights Disabled is False with AsExpected: 130level=info msg=Cluster operator insights SCAAvailable is False with Forbidden: Failed to pull SCA certs from https://api.openshift.com/api/accounts_mgmt/v1/certificates: OCM API https://api.openshift.com/api/accounts_mgmt/v1/certificates returned HTTP 403: {"code":"ACCT-MGMT-11","href":"/api/accounts_mgmt/v1/errors/11","id":"11","kind":"Error","operation_id":"dc5b9421-248f-4ac4-9135-ac5bf6bcd2ce","reason":"Account with ID 2DUeKzzTD9ngfsQ6YgkzdJn1jA4 denied access to perform create on Certificate with HTTP call POST /api/accounts_mgmt/v1/certificates"}131level=info msg=Cluster operator kube-apiserver EvaluationConditionsDetected is False with AsExpected: All is well132level=info msg=Cluster operator kube-controller-manager EvaluationConditionsDetected is Unknown with NoData: 133level=info msg=Cluster operator kube-scheduler EvaluationConditionsDetected is Unknown with NoData: 134level=info msg=Cluster operator network ManagementStateDegraded is False with : 135level=info msg=Cluster operator openshift-controller-manager EvaluationConditionsDetected is Unknown with NoData: 136level=info msg=Cluster operator storage EvaluationConditionsDetected is Unknown with NoData: 137level=error msg=Cluster initialization failed because one or more operators are not functioning properly.138level=error msg= The cluster should be accessible for troubleshooting as detailed in the documentation linked below,139level=error msg= https://docs.openshift.com/container-platform/latest/support/troubleshooting/troubleshooting-installations.html140ERROR: Installation failed. Aborting execution.
Expected results:
Install should be successful.
Additional info:
HA must gather: https://gcsweb-qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/qe-private-deck/logs/periodic-ci-openshift-openshift-tests-private-release-4.16-amd64-nightly-vsphere-agent-ha-f14/1771068123387006976/artifacts/vsphere-agent-ha-f14/gather-must-gather/artifacts/must-gather.tar Compact must gather: https://gcsweb-qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/qe-private-deck/pr-logs/pull/openshift_release/50544/rehearse-50544-periodic-ci-openshift-openshift-tests-private-release-4.16-amd64-nightly-vsphere-agent-compact-fips-f14/1775524930515898368/artifacts/vsphere-agent-compact-fips-f14/gather-must-gather/artifacts/must-gather.tar
While implementing MON-3669, we realized that none of the recording rules running on the telemeter server side are tested. Given how complex these rules can be, it's important for us to be confident that future changes won't bring regressions.
Even though not perfect, it's possible to unit test Prometheus rules with the promtool binary (example in CMO: https://github.com/openshift/cluster-monitoring-operator/blob/2ca7067a4d1fc86b31f7a4816c85da6abc0c8abf/Makefile#L218-L221).
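A minimal sketch of what such a test invocation could look like; the file name is hypothetical, and promtool reads the rule_files referenced inside the test file:
# run unit tests against the telemeter-side recording rules
promtool test rules telemeter-recording-rules-test.yaml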
DoD
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Please review the following PR: https://github.com/openshift/csi-external-snapshotter/pull/128
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
The IngressController and DNSRecord CRDs were moved to dedicated packages following the introduction of a new method for generating CRDs in the OpenShift API repository (openshift/api#1803: https://github.com/openshift/api/pull/1803).
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. go mod edit -replace=github.com/openshift/api=github.com/openshift/api@ff84c2c732279b16baccf08c7dfc9ff8719c4807 2. go mod tidy 3. go mod vendor 4. make update
Actual results:
$ make update hack/update-generated-crd.sh --- vendor/github.com/openshift/api/operator/v1/0000_50_ingress-operator_00-ingresscontroller.crd.yaml 1970-01-01 01:00:00.000000000 +0100 +++ manifests/00-custom-resource-definition.yaml 2024-04-17 18:05:05.009605155 +0200 [LONG DIFF] cp: cannot stat 'vendor/github.com/openshift/api/operator/v1/0000_50_ingress-operator_00-ingresscontroller.crd.yaml': No such file or directory make: *** [Makefile:39: crd] Error 1
Expected results:
$ make update hack/update-generated-crd.sh hack/update-profile-manifests.sh
Additional info:
Add fr and es languages to i18n script for Memsource upload
Description of problem:
Clicking on any node status popover causes the popover dialog to always move to the first line.
Version-Release number of selected component (if applicable):
4.15.0-0.nightly-2024-01-14-100410
How reproducible:
Always
Steps to Reproduce:
1. Go to the Nodes list page, mark any worker as unschedulable 2. Click on the 'Ready/Scheduling disabled' text, a popover dialog is opened 3.
Actual results:
2. The popover dialog always jumps to the first line; it looks like the popover dialog is showing the wrong status/info, and it is also difficult for the user to perform node actions
Expected results:
2. The popover dialog should be shown right next to the correct node
Additional info:
Please review the following PR: https://github.com/openshift/csi-driver-shared-resource-operator/pull/101
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
When clicking the "OpenShift Lightspeed" link on the cluster overview page, the OperatorHub modal does not automatically open. That is the result of Lightspeed having been added to a different catalog than is being referenced in the link. `/operatorhub/all-namespaces?keyword=lightspeed&details-item=lightspeed-operator-lightspeed-operator-catalog-openshift-marketplace` should be `/operatorhub/all-namespaces?keyword=lightspeed&details-item=lightspeed-operator-redhat-operators-openshift-marketplace`. This was fixed in the master branch with https://github.com/openshift/console/pull/14030/files#diff-6abb294e295409309d88e80b2bf2added8a5ebc7cbc12893baddb0b3212c7d85R38-R39, but this change needs to be backported to 4.16.z.
Description of problem:
Backport of https://issues.redhat.com/browse/CONSOLE-4108
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Please review the following PR: https://github.com/openshift/csi-external-provisioner/pull/79
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Platform:
IPI on Baremetal
What happened?
In cases where no hostname is provided, hosts are automatically assigned the name "localhost" or "localhost.localdomain".
[kni@provisionhost-0-0 ~]$ oc get nodes
NAME STATUS ROLES AGE VERSION
localhost.localdomain Ready master 31m v1.22.1+6859754
master-0-1 Ready master 39m v1.22.1+6859754
master-0-2 Ready master 39m v1.22.1+6859754
worker-0-0 Ready worker 12m v1.22.1+6859754
worker-0-1 Ready worker 12m v1.22.1+6859754
What did you expect to happen?
Having all hosts come up as localhost is the worst possible user experience, because they'll fail to form a cluster but you won't know why.
However, since we know the BMH name in the image-customization-controller, it would be possible to configure the ignition to set a default hostname if we don't have one from DHCP/DNS.
If not, we should at least fail the installation with a specific error message for this situation.
----------
30/01/22 - adding how to reproduce
----------
How to Reproduce:
1) Prepare an installation with day-1 static IP.
Add to install-config under one of the nodes:
networkConfig:
routes:
config:
2)Ensure a DNS PTR for the address IS NOT configured.
3) Create manifests and cluster from install-config.yaml
The installation should either:
1) Fail as early as possible, and provide some sort of feedback that no hostname was provided.
2) Derive the hostname from the BMH or the ignition files
Description of problem:
Looks like we are facing a bug when trying to spin up a hosted control plane cluster while using proxy settings to connect to the internet. For example, on our worker node, the static pod kube-apiserver-proxy.yaml doesn't contain the noProxy settings which seem to cause the failure of deploying the hosted cluster. ~~~ [root@ocpugbo2cogswo03 manifests]# cat kube-apiserver-proxy.yaml_ apiVersion: v1 kind: Pod metadata: creationTimestamp: null labels: k8s-app: kube-apiserver-proxy name: kube-apiserver-proxy namespace: kube-system spec: containers: - command: - control-plane-operator - kubernetes-default-proxy - --listen-addr=<IP-Addr>:6443 - --proxy-addr=<Proxy-Addr>:<Proxy-port> - --apiserver-addr=<API-IP-Addr>:6443 image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:7ca95b9a71e41157c70378896758618b993ad90e6d80a23c46170da5c11f441f name: kubernetes-default-proxy resources: requests: cpu: 13m memory: 16Mi securityContext: runAsUser: 1001 hostNetwork: true priorityClassName: system-node-critical status: {} ~~~ Can you please check this issue.
Steps to Reproduce:
1. Install a cluster with ACM and HCP 2. Try to create a hosted cluster using proxy configuration 3. kube-apiserver-proxy is using proxy to reach API.
Actual results:
The kube-apiserver-proxy is using proxy to reach API. Worker nodes are unable to reach a Hosted Control Plane's API when a cluster-wide http proxy is configured.
Expected results:
kube-apiserver-proxy should not use proxy to reach API
Additional info:
Please review the following PR: https://github.com/openshift/image-customization-controller/pull/114
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
capi-based installer failing with missing openshift-cluster-api namespace
Version-Release number of selected component (if applicable):
How reproducible:
Always in CustomNoUpgrade
Steps to Reproduce:
1. 2. 3.
Actual results:
Install failure
Expected results:
The namespace is created and the install succeeds, or the install does not error on the missing namespace
Additional info:
Please review the following PR: https://github.com/openshift/cluster-olm-operator/pull/39
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/baremetal-runtimecfg/pull/291
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
This is a clone of issue OCPBUGS-35511. The following is the description of the original issue:
—
Description of problem:
If an infra-id (which is uniquely generated by the installer) is reused the installer will fail with: level=info msg=Creating private Hosted Zone level=error msg=failed to fetch Cluster: failed to generate asset "Cluster": failed to create cluster: failed provisioning resources after infrastructure ready: failed to create private hosted zone: error creating private hosted zone: HostedZoneAlreadyExists: A hosted zone has already been created with the specified caller reference. Users should not be reusing installer state in this manner, but we do it purposefully in our ipi-install-install step to mitigate infrastructure provisioning flakes: https://steps.ci.openshift.org/reference/ipi-install-install#line720 We can fix this by ensuring the caller ref is unique on each invocation.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
This is a clone of issue OCPBUGS-34975. The following is the description of the original issue:
—
Description of problem:
See https://issues.redhat.com//browse/CORS-3523 and https://issues.redhat.com//browse/CORS-3524 for the overall issue. Creating this bug for backporting purposes.
Version-Release number of selected component (if applicable):
all
How reproducible:
always in the terraform path
Steps to Reproduce:
1. 2. 3.
Actual results:
Spot instances are only supported for worker nodes.
Expected results:
Spot instances are used for all nodes.
Additional info:
Description of problem:
The PKI operator runs even when the annotation to turn off PKI is set on the hosted control plane
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
This bug just focuses on denoising WellKnown_NotReady. More generic Available=False denoising is tracked in https://issues.redhat.com/browse/OCPBUGS-20056.
Reviving bugzilla#2010539, the authentication ClusterOperator occasionally blips Available=False with reason=WellKnown_NotReady. For example, this run includes:
: [bz-apiserver-auth] clusteroperator/authentication should not change condition/Available expand_less 47m21s { 1 unexpected clusteroperator state transitions during e2e test run. These did not match any known exceptions, so they cause this test-case to fail: Oct 03 19:11:20.502 - 245ms E clusteroperator/authentication condition/Available reason/WellKnown_NotReady status/False WellKnownAvailable: The well-known endpoint is not yet available: failed to GET kube-apiserver oauth endpoint https://10.0.0.3:6443/.well-known/oauth-authorization-server: dial tcp 10.0.0.3:6443: i/o timeout
While a dial timeout for the Kube API server isn't fantastic, an issue that only persists for 245ms is not long enough to warrant immediate admin intervention. Teaching the authentication operator to stay Available=True for this kind of brief hiccup, while still going Available=False for issues where at least part of the component is non-functional and the condition requires immediate administrator intervention, would make it easier for admins and SREs operating clusters to identify when intervention was required.
4.8, 4.10, and 4.15. Likely all supported versions of the authentication operator have this exposure.
Looks like 10 to 50% of 4.15 runs have some kind of issue with authentication going Available=False, see Actual results below. These are likely for reasons that do not require admin intervention, although figuring that out is tricky today; feel free to push back if you feel that some of these do warrant immediate admin intervention.
$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=48h&type=junit&search=clusteroperator/authentication+should+not+change+condition/Available' | grep '^periodic-.*4[.]15.*failures match' | sort
periodic-ci-openshift-multiarch-master-nightly-4.15-ocp-e2e-aws-ovn-heterogeneous-upgrade (all) - 18 runs, 44% failed, 13% of failures match = 6% impact periodic-ci-openshift-multiarch-master-nightly-4.15-ocp-e2e-aws-sdn-arm64 (all) - 9 runs, 67% failed, 17% of failures match = 11% impact periodic-ci-openshift-multiarch-master-nightly-4.15-ocp-e2e-azure-ovn-arm64 (all) - 9 runs, 33% failed, 33% of failures match = 11% impact periodic-ci-openshift-multiarch-master-nightly-4.15-ocp-e2e-azure-ovn-heterogeneous (all) - 18 runs, 56% failed, 30% of failures match = 17% impact periodic-ci-openshift-multiarch-master-nightly-4.15-ocp-e2e-ovn-serial-aws-arm64 (all) - 9 runs, 33% failed, 33% of failures match = 11% impact periodic-ci-openshift-multiarch-master-nightly-4.15-ocp-e2e-serial-ovn-ppc64le-powervs (all) - 6 runs, 100% failed, 33% of failures match = 33% impact periodic-ci-openshift-multiarch-master-nightly-4.15-ocp-e2e-upgrade-aws-ovn-arm64 (all) - 18 runs, 67% failed, 25% of failures match = 17% impact periodic-ci-openshift-multiarch-master-nightly-4.15-upgrade-from-nightly-4.14-ocp-e2e-aws-sdn-arm64 (all) - 1 runs, 100% failed, 100% of failures match = 100% impact periodic-ci-openshift-multiarch-master-nightly-4.15-upgrade-from-stable-4.14-ocp-e2e-aws-ovn-heterogeneous-upgrade (all) - 18 runs, 50% failed, 33% of failures match = 17% impact periodic-ci-openshift-release-master-ci-4.15-e2e-azure-ovn-upgrade (all) - 70 runs, 41% failed, 86% of failures match = 36% impact periodic-ci-openshift-release-master-ci-4.15-e2e-gcp-ovn-upgrade (all) - 80 runs, 21% failed, 76% of failures match = 16% impact periodic-ci-openshift-release-master-ci-4.15-e2e-gcp-sdn-techpreview-serial (all) - 7 runs, 29% failed, 100% of failures match = 29% impact periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-aws-ovn-upgrade (all) - 80 runs, 28% failed, 36% of failures match = 10% impact periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-azure-sdn-upgrade (all) - 80 runs, 39% failed, 123% of failures match = 48% impact periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-gcp-ovn-rt-upgrade (all) - 71 runs, 49% failed, 80% of failures match = 39% impact periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-gcp-ovn-upgrade (all) - 7 runs, 57% failed, 75% of failures match = 43% impact periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-gcp-sdn-upgrade (all) - 7 runs, 29% failed, 50% of failures match = 14% impact periodic-ci-openshift-release-master-nightly-4.15-e2e-aws-ovn-single-node-serial (all) - 7 runs, 100% failed, 57% of failures match = 57% impact periodic-ci-openshift-release-master-nightly-4.15-e2e-aws-sdn-upgrade (all) - 70 runs, 34% failed, 4% of failures match = 1% impact periodic-ci-openshift-release-master-nightly-4.15-e2e-azure-sdn (all) - 7 runs, 29% failed, 50% of failures match = 14% impact periodic-ci-openshift-release-master-nightly-4.15-e2e-gcp-sdn-serial (all) - 7 runs, 43% failed, 67% of failures match = 29% impact periodic-ci-openshift-release-master-nightly-4.15-e2e-gcp-sdn-upgrade (all) - 7 runs, 29% failed, 50% of failures match = 14% impact periodic-ci-openshift-release-master-nightly-4.15-e2e-metal-ipi-serial-ovn-ipv6 (all) - 7 runs, 29% failed, 50% of failures match = 14% impact periodic-ci-openshift-release-master-nightly-4.15-e2e-metal-ipi-upgrade-ovn-ipv6 (all) - 7 runs, 71% failed, 20% of failures match = 14% impact 
periodic-ci-openshift-release-master-nightly-4.15-upgrade-from-stable-4.14-e2e-aws-sdn-upgrade (all) - 7 runs, 43% failed, 33% of failures match = 14% impact periodic-ci-openshift-release-master-nightly-4.15-upgrade-from-stable-4.14-e2e-metal-ipi-upgrade-ovn-ipv6 (all) - 7 runs, 100% failed, 57% of failures match = 57% impact periodic-ci-openshift-release-master-okd-4.15-e2e-aws-ovn-upgrade (all) - 12 runs, 58% failed, 14% of failures match = 8% impact
Digging into reason and message frequency in 4.15-releated update CI:
$ curl -s 'https://search.ci.openshift.org/search?maxAge=48h&type=junit&name=4.15.*upgrade&context=0&search=clusteroperator/authentication.*condition/Available.*status/False' | jq -r 'to_entries[].value | to_entries[].value[].context[]' | sed 's|.*clusteroperator/\([^ ]*\) condition/Available reason/\([^ ]*\) status/False[^:]*: \(.*\)|\1 \2 \3|' | sed 's/[0-9]*[.][0-9]*[.][0-9]*[.][0-9]*/x.x.x.x/g;s|[.]apps[.][^/]*|.apps.../|g' | sort | uniq -c | sort -n 1 authentication APIServices_Error apiservices.apiregistration.k8s.io/v1.oauth.openshift.io: not available: failing or missing response from https://x.x.x.x:8443/apis/oauth.openshift.io/v1: Get "https://x.x.x.x:8443/apis/oauth.openshift.io/v1": dial tcp x.x.x.x:8443: i/o timeout 1 authentication APIServices_Error apiservices.apiregistration.k8s.io/v1.oauth.openshift.io: not available: failing or missing response from https://x.x.x.x:8443/apis/oauth.openshift.io/v1: Get "https://x.x.x.x:8443/apis/oauth.openshift.io/v1": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers) 1 authentication APIServices_Error "oauth.openshift.io.v1" is not ready: an attempt failed with statusCode = 503, err = the server is currently unable to handle the request 1 authentication APIServices_Error rpc error: code = Unavailable desc = the connection is draining 1 authentication OAuthServerRouteEndpointAccessibleController_EndpointUnavailable Get "https://oauth-openshift.apps...//healthz": dial tcp: lookup oauth-openshift.apps.../ 1 authentication OAuthServerRouteEndpointAccessibleController_EndpointUnavailable Get "https://oauth-openshift.apps...//healthz": dial tcp x.x.x.x:443: connect: connection refused 1 authentication OAuthServerServiceEndpointAccessibleController_EndpointUnavailable Get "https://[fd02::410f]:443/healthz": context deadline exceeded (Client.Timeout exceeded while awaiting headers) 1 Nov 28 09:09:40.407 - 1s E clusteroperator/authentication condition/Available reason/APIServerDeployment_PreconditionNotFulfilled status/False 2 authentication APIServerDeployment_NoPod no .openshift-oauth-apiserver pods available on any node. 
2 authentication APIServices_Error apiservices.apiregistration.k8s.io/v1.user.openshift.io: not available: failing or missing response from https://x.x.x.x:8443/apis/user.openshift.io/v1: Get "https://x.x.x.x:8443/apis/user.openshift.io/v1": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers) 2 authentication APIServices_Error rpc error: code = Unknown desc = malformed header: missing HTTP content-type 4 authentication APIServices_Error apiservices.apiregistration.k8s.io/v1.oauth.openshift.io: not available: endpoints for service/api in "openshift-oauth-apiserver" have no addresses with port name "https" 4 authentication APIServices_Error apiservices.apiregistration.k8s.io/v1.user.openshift.io: not available: failing or missing response from https://x.x.x.x:8443/apis/user.openshift.io/v1: Get "https://x.x.x.x:8443/apis/user.openshift.io/v1": dial tcp x.x.x.x:8443: i/o timeout 6 authentication OAuthServerDeployment_NoDeployment deployment/openshift-authentication: could not be retrieved 7 authentication APIServices_Error apiservices.apiregistration.k8s.io/v1.oauth.openshift.io: not available: endpoints for service/api in "openshift-oauth-apiserver" have no addresses with port name "https"\nAPIServicesAvailable: apiservices.apiregistration.k8s.io/v1.user.openshift.io: not available: endpoints for service/api in "openshift-oauth-apiserver" have no addresses with port name "https" 7 authentication OAuthServerServiceEndpointAccessibleController_EndpointUnavailable Get "https://x.x.x.x:443/healthz": dial tcp x.x.x.x:443: i/o timeout (Client.Timeout exceeded while awaiting headers) 8 authentication APIServerDeployment_NoPod no apiserver.openshift-oauth-apiserver pods available on any node. 9 authentication APIServerDeployment_NoDeployment deployment/openshift-oauth-apiserver: could not be retrieved 9 authentication OAuthServerRouteEndpointAccessibleController_EndpointUnavailable Get "https://oauth-openshift.apps...//healthz": EOF 11 authentication WellKnown_NotReady The well-known endpoint is not yet available: failed to GET kube-apiserver oauth endpoint https://x.x.x.x:6443/.well-known/oauth-authorization-server: dial tcp x.x.x.x:6443: i/o timeout 23 authentication APIServices_Error "oauth.openshift.io.v1" is not ready: an attempt failed with statusCode = 503, err = the server is currently unable to handle the request\nAPIServicesAvailable: "user.openshift.io.v1" is not ready: an attempt failed with statusCode = 503, err = the server is currently unable to handle the request 26 authentication APIServices_Error "user.openshift.io.v1" is not ready: an attempt failed with statusCode = 503, err = the server is currently unable to handle the request 29 authentication APIServices_Error apiservices.apiregistration.k8s.io/v1.user.openshift.io: not available: endpoints for service/api in "openshift-oauth-apiserver" have no addresses with port name "https" 29 authentication OAuthServerRouteEndpointAccessibleController_EndpointUnavailable Get "https://oauth-openshift.apps...//healthz": context deadline exceeded (Client.Timeout exceeded while awaiting headers) 30 authentication OAuthServerServiceEndpointAccessibleController_EndpointUnavailable Get "https://x.x.x.x:443/healthz": dial tcp x.x.x.x:443: connect: connection refused 34 authentication OAuthServerServiceEndpointAccessibleController_EndpointUnavailable Get "https://x.x.x.x:443/healthz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
And simplifying by looking only at reason:
curl -s 'https://search.ci.openshift.org/search?maxAge=48h&type=junit&name=4.15.*upgrade&context=0&search=clusteroperator/authentication.*condition/Available.*status/False' | jq -r 'to_entries[].value | to_entries[].value[].context[]' | sed 's|.*clusteroperator/\([^ ]*\) condition/Available reason/\([^ ]*\) status/False.*|\1 \2|' | sort | uniq -c | sort -n 1 authentication APIServerDeployment_PreconditionNotFulfilled 6 authentication OAuthServerDeployment_NoDeployment 8 authentication APIServerDeployment_NoDeployment 10 authentication APIServerDeployment_NoPod 11 authentication WellKnown_NotReady 36 authentication OAuthServerRouteEndpointAccessibleController_EndpointUnavailable 43 authentication APIServices_PreconditionNotReady 66 authentication OAuthServerServiceEndpointAccessibleController_EndpointUnavailable 95 authentication APIServices_Error
Authentication goes Available=False on WellKnown_NotReady if and only if immediate admin intervention is appropriate.
The SDN live migration cannot work properly in a cluster with specific configurations. CNO shall refuse to proceed with the live migration in such a case. We need to add the pre-migration validation to CNO.
The live migration shall be blocked for clusters with the following configuration
In upstream we started using ValidatingAdmissionPolicy API to enforce package uniqueness for ClusterExtension.
These APIs are still not enabled by default in OCP 4.16 (K8s 1.29), but should be enabled with TechPreviewNoUpgrade. For some reason our E2E CI job fails despite the fact that we are running it with tech preview.
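A quick sketch for confirming whether the API is actually served on a TechPreviewNoUpgrade cluster:
# should list validatingadmissionpolicies and validatingadmissionpolicybindings when the feature is enabled
oc api-resources --api-group=admissionregistration.k8s.io | grep -i validatingadmissionpolic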
Undiagnosed panic detected in pod
{ pods/openshift-console-operator_console-operator-598b8fdbb-v2tnv_console-operator.log.gz:E0510 11:53:50.992299 1 runtime.go:79] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference) pods/openshift-console-operator_console-operator-598b8fdbb-v2tnv_console-operator.log.gz:E0510 11:53:51.008625 1 runtime.go:79] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference) pods/openshift-console-operator_console-operator-598b8fdbb-v2tnv_console-operator.log.gz:E0510 11:53:51.023108 1 runtime.go:79] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference) pods/openshift-console-operator_console-operator-598b8fdbb-v2tnv_console-operator.log.gz:E0510 11:53:51.033921 1 runtime.go:79] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference) pods/openshift-console-operator_console-operator-598b8fdbb-v2tnv_console-operator.log.gz:E0510 11:53:51.045080 1 runtime.go:79] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)}
Sippy shows this appears to be hitting a few places though it's early and hard to see. Serial looks affected.
Job runs with failure: note that, due to the nature of where this panic is detected, job runs can pass despite this crash. The failure will show in Prow, but the job can be considered successful if nothing else failed.
The CI search chart view seems to indicate it kicked off roughly 24 hours ago.
Possible revert candidate? https://github.com/openshift/console-operator/pull/895
This is a clone of issue OCPBUGS-34820. The following is the description of the original issue:
—
Description of problem:
Removing imageContentSources from HostedCluster does not update IDMS for the cluster.
Version-Release number of selected component (if applicable):
Tested with 4.15.14
How reproducible:
100%
Steps to Reproduce:
1. add imageContentSources to HostedCluster 2. verify it is applied to IDMS 3. remove imageContentSources from HostedCluster
Actual results:
IDMS is not updated to remove imageDigestMirrors contents
Expected results:
IDMS is updated to remove imageDigestMirrors contents
Additional info:
Workaround: set imageContentSources=[]
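A hedged sketch of applying that workaround; the hosted cluster name and namespace are placeholders:
# explicitly set imageContentSources to an empty list so the IDMS gets reconciled
oc patch hostedcluster <name> -n <namespace> --type=merge -p '{"spec":{"imageContentSources":[]}}'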
Please review the following PR: https://github.com/openshift/csi-external-resizer/pull/152
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
This is a clone of issue OCPBUGS-38677. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-38657. The following is the description of the original issue:
—
https://github.com/kubernetes-sigs/cluster-api-provider-vsphere/issues/2781
https://kubernetes.slack.com/archives/CKFGK3SSD/p1704729665056699
https://github.com/okd-project/okd/discussions/1993#discussioncomment-10385535
Description of problem:
INFO Waiting up to 15m0s (until 2:23PM UTC) for machines [vsphere-ipi-b8gwp-bootstrap vsphere-ipi-b8gwp-master-0 vsphere-ipi-b8gwp-master-1 vsphere-ipi-b8gwp-master-2] to provision... E0819 14:17:33.676051 2162 session.go:265] "Failed to keep alive govmomi client, Clearing the session now" err="Post \"https://vctest.ars.de/sdk\": context canceled" server="vctest.ars.de" datacenter="" username="administrator@vsphere.local" E0819 14:17:33.708233 2162 session.go:295] "Failed to keep alive REST client" err="Post \"https://vctest.ars.de/rest/com/vmware/cis/session?~action=get\": context canceled" server="vctest.ars.de" datacenter="" username="administrator@vsphere.local" I0819 14:17:33.708279 2162 session.go:298] "REST client session expired, clearing session" server="vctest.ars.de" datacenter="" username="administrator@vsphere.local"
Description of problem:
The test implementation in https://github.com/openshift/origin/commit/5487414d8f5652c301a00617ee18e5ca8f339cb4#L56 assumes there is just one kubelet service, or at least that it is always the first one in the MCP. That just changed in https://github.com/openshift/machine-config-operator/pull/4124, and the test is failing.
Version-Release number of selected component (if applicable):
master branch of 4.16
How reproducible:
always during test
Steps to Reproduce:
1. Test with https://github.com/openshift/machine-config-operator/pull/4124 applied
Actual results:
Test detects a wrong service and fails
Expected results:
Test finds the proper kubelet.service and passes
Additional info:
When clicking on the output image link on a Shipwright BuildRun details page, the link leads to the imagestream details page but shows a 404 error.
The image link is:
https://console-openshift-console.apps...openshiftapps.com/k8s/ns/buildah-example/imagestreams/sample-kotlin-spring%3A1.0-shipwright
The BuildRun spec
apiVersion: shipwright.io/v1beta1 kind: BuildRun metadata: generateName: sample-spring-kotlin-build- name: sample-spring-kotlin-build-xh2dq namespace: buildah-example labels: build.shipwright.io/generation: '2' build.shipwright.io/name: sample-spring-kotlin-build spec: build: name: sample-spring-kotlin-build status: buildSpec: output: image: 'image-registry.openshift-image-registry.svc:5000/buildah-example/sample-kotlin-spring:1.0-shipwright' paramValues: - name: run-image value: 'paketocommunity/run-ubi-base:latest' - name: cnb-builder-image value: 'paketobuildpacks/builder-jammy-tiny:0.0.176' - name: app-image value: 'image-registry.openshift-image-registry.svc:5000/buildah-example/sample-kotlin-spring:1.0-shipwright' source: git: url: 'https://github.com/piomin/sample-spring-kotlin-microservice.git' type: Git strategy: kind: ClusterBuildStrategy name: buildpacks completionTime: '2024-02-12T12:15:03Z' conditions: - lastTransitionTime: '2024-02-12T12:15:03Z' message: All Steps have completed executing reason: Succeeded status: 'True' type: Succeeded output: digest: 'sha256:dc3d44bd4d43445099ab92bbfafc43d37e19cfaf1cac48ae91dca2f4ec37534e' source: git: branchName: master commitAuthor: Piotr Mińkowski commitSha: aeb03d60a104161d6fd080267bf25c89c7067f61 startTime: '2024-02-12T12:13:21Z' taskRunName: sample-spring-kotlin-build-xh2dq-j47ql
Description of problem:
In-cluster clients should be able to talk directly to the node-local apiserver IP address and, as a best practice, should all be configured to use it. This load balancer provides the added benefit in cloud environments of healthchecking the path from the machine to the load balancer fronting the kube-apiserver. It becomes more crucial in baremetal/on-prem environments where there may not be a load balancer and instead just 3 unique endpoints directly to redundant kube-apiservers. In this case, if using just DNS, intermittent traffic failures will be experienced if a control plane instance goes down; using the node-local load balancer, there will be no traffic disruption.
Version-Release number of selected component (if applicable):
4.16
How reproducible:
100%
Steps to Reproduce:
1. Schedule a pod on any hypershift cluster node 2. In the pod run curl -v -k https://172.20.0.1:6443 3. The verbose output will show that the kube-apiserver cert does not have the node local client load balancer IP address in it's IPs section and therefore will not allow valid HTTPS requests on that address
Actual results:
Secure HTTPS requests cannot be made to the kube-apiserver
Expected results:
Secure HTTPS requests can be made to the kube-apiserver (no need to run -k when specifying proper CA bundle)
Additional info:
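One way to see the missing SAN directly, sketched with standard openssl tooling (the address is taken from the reproduction steps above):
# dump the serving certificate's Subject Alternative Names; 172.20.0.1 should appear once fixed
openssl s_client -connect 172.20.0.1:6443 </dev/null 2>/dev/null | openssl x509 -noout -text | grep -A1 'Subject Alternative Name'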
Description of problem:
Sometimes deleting the bootstrap SSH rule during bootstrap destroy can time out after 5 minutes, failing the installation.
Version-Release number of selected component (if applicable):
4.16+ with capi/aws
How reproducible:
Intermittent
Steps to Reproduce:
1. 2. 3.
Actual results:
level=info msg=Waiting up to 5m0s (until 2:31AM UTC) for bootstrap SSH rule to be destroyed... level=fatal msg=error destroying bootstrap resources failed during the destroy bootstrap hook: failed to remove bootstrap SSH rule: bootstrap ssh rule was not removed within 5m0s: timed out waiting for the condition
Expected results:
The rule is deleted successfully and in a timely manner.
Additional info:
This is probably happening because we are changing the AWSCluster object, thus causing capi/capa to trigger a big reconciliation of the resources. We should try to delete the rule via the AWS SDK instead.
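For illustration, the equivalent direct deletion with the AWS CLI would look roughly like this (the security group ID and CIDR are placeholders):
# remove the bootstrap SSH ingress rule without touching the AWSCluster object
aws ec2 revoke-security-group-ingress --group-id <bootstrap-sg-id> --protocol tcp --port 22 --cidr 0.0.0.0/0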
This is a clone of issue OCPBUGS-33331. The following is the description of the original issue:
—
Description of problem:
nmstate-configuration.service failed due to the wrong variable name $hostname_file in https://github.com/openshift/machine-config-operator/blob/5a6e8b81f13de2dbf606a497140ac6e9c2a00e6f/templates/common/baremetal/files/nmstate-configuration.yaml#L26
Version-Release number of selected component (if applicable):
4.16.0
How reproducible:
always
Steps to Reproduce:
1. install cluster via dev-script, with node-specific network configuration
Actual results:
nmstate-configuration failed: sh-5.1# journalctl -u nmstate-configuration May 07 02:19:54 worker-0 systemd[1]: Starting Applies per-node NMState network configuration... May 07 02:19:54 worker-0 nmstate-configuration.sh[1553]: + systemctl -q is-enabled mtu-migration May 07 02:19:54 worker-0 nmstate-configuration.sh[1553]: + echo 'Cleaning up left over mtu migration configuration' May 07 02:19:54 worker-0 nmstate-configuration.sh[1553]: Cleaning up left over mtu migration configuration May 07 02:19:54 worker-0 nmstate-configuration.sh[1553]: + rm -rf /etc/cno/mtu-migration May 07 02:19:54 worker-0 nmstate-configuration.sh[1553]: + '[' -e /etc/nmstate/openshift/applied ']' May 07 02:19:54 worker-0 nmstate-configuration.sh[1553]: + src_path=/etc/nmstate/openshift May 07 02:19:54 worker-0 nmstate-configuration.sh[1553]: + dst_path=/etc/nmstate May 07 02:19:54 worker-0 systemd[1]: nmstate-configuration.service: Main process exited, code=exited, status=1/FAILURE May 07 02:19:54 worker-0 nmstate-configuration.sh[1565]: ++ hostname -s May 07 02:19:54 worker-0 systemd[1]: nmstate-configuration.service: Failed with result 'exit-code'. May 07 02:19:54 worker-0 nmstate-configuration.sh[1553]: + hostname=worker-0 May 07 02:19:54 worker-0 nmstate-configuration.sh[1553]: + host_file=worker-0.yml May 07 02:19:54 worker-0 nmstate-configuration.sh[1553]: + cluster_file=cluster.yml May 07 02:19:54 worker-0 nmstate-configuration.sh[1553]: + config_file= May 07 02:19:54 worker-0 nmstate-configuration.sh[1553]: + '[' -s /etc/nmstate/openshift/worker-0.yml ']' May 07 02:19:54 worker-0 nmstate-configuration.sh[1553]: /usr/local/bin/nmstate-configuration.sh: line 22: hostname_file: unbound variable May 07 02:19:54 worker-0 systemd[1]: Failed to start Applies per-node NMState network configuration.
Expected results:
The cluster can be set up successfully with node-specific network configuration via the new mechanism
Additional info:
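Based on the journal output above, a minimal sketch of the fix is to reference the variable that is actually defined (host_file) rather than the undefined hostname_file; the surrounding lines are paraphrased, not the exact script:
# excerpt sketch of /usr/local/bin/nmstate-configuration.sh
src_path="/etc/nmstate/openshift"
hostname="$(hostname -s)"
host_file="${hostname}.yml"
cluster_file="cluster.yml"
config_file=""
if [ -s "${src_path}/${host_file}" ]; then    # previously referenced the undefined ${hostname_file}
  config_file="${host_file}"
elif [ -s "${src_path}/${cluster_file}" ]; then
  config_file="${cluster_file}"
fi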
This is a clone of issue OCPBUGS-36816. The following is the description of the original issue:
—
Description of problem:
Dynamic plugins using PatternFly 4 could be referring to PF4 variables that do not exist in OpenShift 4.15+. Currently this is causing contrast issues for ACM in dark mode for donut charts.
Version-Release number of selected component (if applicable):
4.15
How reproducible:
Always
Steps to Reproduce:
1. Install ACM on OpenShift 4.15 2. Switch to dark mode 3. Observe Home > Overview page
Actual results:
Some categories in the donut charts cannot be seen due to low contrast
Expected results:
Colors should match those seen in OpenShift 4.14 and earlier
Additional info:
Also posted about this on Slack: https://redhat-internal.slack.com/archives/C011BL0FEKZ/p1720467671332249 Variables like --pf-chart-color-gold-300 are no longer provided, although the PF5 equivalent, --pf-v5-chart-color-gold-300, is available. The stylesheet @patternfly/patternfly/patternfly-charts.scss is present, but not the V4 version. Hopefully it is possible to also include these styles since the names now include a version.
Please review the following PR: https://github.com/openshift/telemeter/pull/497
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/cluster-etcd-operator/pull/1175
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/kubernetes-kube-storage-version-migrator/pull/203
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
A validation recently introduced in the hypershift operator's webhook conflicts with the UI's ability to create HCP clusters. Previously the pull secret was not required to be posted before an HC or NP, but with this change the pull secret is required because it is used to validate the release image payload. This issue is isolated to 4.15.
Version-Release number of selected component (if applicable):
4.15
How reproducible:
100%. Post an HC before the pull secret is posted and the HC will be rejected. The expected outcome is that it should be possible to post the pull secret for an HC after the HC is posted, with the controller being eventually consistent with this change.
Description of problem:
Failed to upgrade from 4.15 to 4.16 with vSphere UPI due to:
03-19 05:58:11.372 network 4.16.0-0.nightly-2024-03-13-061822 True False True 9h Error while updating infrastructures.config.openshift.io/cluster: failed to apply / update (config.openshift.io/v1, Kind=Infrastructure) /cluster: Infrastructure.config.openshift.io "cluster" is invalid: [spec.platformSpec.vsphere.apiServerInternalIPs: Invalid value: "null": spec.platformSpec.vsphere.apiServerInternalIPs in body must be of type array: "null", spec.platformSpec.vsphere.ingressIPs: Invalid value: "null": spec.platformSpec.vsphere.ingressIPs in body must be of type array: "null", spec.platformSpec.vsphere.machineNetworks: Invalid value: "null": spec.platformSpec.vsphere.machineNetworks in body must be of type array: "null", <nil>: Invalid value: "null": some validation rules were not checked because the object was invalid; correct the existing errors to complete validation]
Version-Release number of selected component (if applicable):
upgrade chain 4.11.58-x86_64 - > 4.12.53-x86_64,4.13.37-x86_64,4.14.17-x86_64,4.15.3-x86_64,4.16.0-0.nightly-2024-03-13-061822
How reproducible:
always
Steps to Reproduce:
1. Upgrade cluster from 4.15 -> 4.16 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
The layout as described in https://docs.openshift.com/container-platform/4.14/networking/metallb/metallb-configure-return-traffic.html does not work if the service has external traffic policy = local
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. Follow https://docs.openshift.com/container-platform/4.14/networking/metallb/metallb-configure-return-traffic.html with etp = local 2. 3.
Actual results:
The service's reply does not make it to the client
Expected results:
Additional info:
This is because the routes leverage the ClusterIP CIDR, whereas with etp=local they leverage the special masquerade IP.
This is a clone of issue OCPBUGS-42974. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-42873. The following is the description of the original issue:
—
Description of problem:
openshift-apiserver is sending traffic intended for the local audit-webhook service through the konnectivity proxy. The audit-webhook service should be included in the NO_PROXY env var of the openshift-apiserver container.
Version-Release number of selected component (if applicable):
4.14.z, 4.15.z, 4.16.z
How reproducible:
Always
Steps to Reproduce:
1. Create a ROSA hosted cluster 2. Observe the logs of the konnectivity-proxy sidecar of openshift-apiserver 3.
Actual results:
Logs include requests to the audit-webhook local service
Expected results:
Logs do not include requests to audit-webhook
Additional info:
Description of problem:
ARO supplies a platform kubeletconfig to enable certain features; currently we use this to enable node sizing by setting autoSizingReserved. Customers want the ability to customize podPidsLimit and we have directed them to configure a second kubeletconfig.
When these kubeletconfigs are rendered into machineconfigs, the order of their application is nondeterministic: the MCs are suffixed by an increasing serial number based on the order the kubeletconfigs were created. This makes it impossible for the customer to ensure their PIDs limit is applied while still allowing ARO to maintain our platform defaults.
We need a way of supplying platform defaults while still allowing the customer to make supported modifications in a way that does not risk being reverted during upgrades or other maintenance.
This issue has manifested in two different ways:
During an upgrade from 4.11.31 to 4.12.40, a cluster had the order of kubeletconfig rendered machine configs reverse. We think that in older versions, the initial kubeletconfig did not get an mc-name-suffix annotation applied, but rendered to "99-worker-generated-kubelet" (no suffix). The customer-provided kubeletconfig rendered to the suffix "-1". During the upgrade, MCO saw this as a new kubeletconfig and assigned it the suffix "-2", effectively reversing their order. See the RCS document https://docs.google.com/document/d/19LuhieQhCGgKclerkeO1UOIdprOx367eCSuinIPaqXA
ARO wants to make updates to the platform defaults. We are changing from a kubeletconfig "aro-limits" to a kubeletconfig "dynamic-node". We want to be able to do this while still keeping it as defaults and if the customer has created their own kubeletconfig, the customer's should still take precedence. What we see is that the creation of a new kubeletconfig regardless of source overrides all other kubeletconfigs, causing the customer to lose their customization.
Version-Release number of selected component (if applicable):
4.12.40+
ARO's older kubeletconfig "aro-limits":
apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  labels:
    aro.openshift.io/limits: ""
  name: aro-limits
spec:
  kubeletConfig:
    evictionHard:
      imagefs.available: 15%
      memory.available: 500Mi
      nodefs.available: 10%
      nodefs.inodesFree: 5%
    systemReserved:
      memory: 2000Mi
  machineConfigPoolSelector:
    matchLabels:
      aro.openshift.io/limits: ""
ARO's newer kubeletconfig, "dynamic-node"
apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: dynamic-node
spec:
  autoSizingReserved: true
  machineConfigPoolSelector:
    matchExpressions:
    - key: machineconfiguration.openshift.io/mco-built-in
      operator: Exists
Customer's desired kubeletconfig:
apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  labels:
    arogcd.arogproj.io/instance: cluster-config
  name: default-pod-pids-limit
spec:
  kubeletConfig:
    podPidsLimit: 2000000
  machineConfigPoolSelector:
    matchExpressions:
    - key: pools.operator.machineconfiguration.io/worker
      operator: Exists
Description of problem:
If the installer using cluster api exits before bootstrap destroy, it may leak processes which continue to run in the background of the host system. These processes may continue to reconcile cloud resources, so the cluster resources would be created and recreated even when you are trying to delete them. This occurs because the installer runs kube-apiserver, etcd, and the capi provider binaries as subprocesses. If the installer exits without shutting down those subprocesses, due to an error or user interrupt, the processes will continue to run in the background. The processes can be identified with the ps command. pgrep and pkill are also useful. Brief discussion here of this occurring in PowerVS: https://redhat-internal.slack.com/archives/C05QFJN2BQW/p1712688922574429
Version-Release number of selected component (if applicable):
How reproducible:
Often
Steps to Reproduce:
1. Run a capi-based install (on any platform) by specifying the fields below in the install config [0]
2. Wait until the CAPI controllers begin to run. This is easy to identify because the terminal will fill with controller logs; in particular you should see [1]
3. Once the controllers are running, interrupt with CTRL + C
[0] Install config for capi install:
featureGates:
- ClusterAPIInstall=true
featureSet: CustomNoUpgrade
[1]
INFO Started local control plane with envtest
INFO Stored kubeconfig for envtest in: /c/auth/envtest.kubeconfig
INFO Running process: Cluster API with args [-v=2 --metrics-bind-addr=0 --
Actual results:
Controllers will leak and continue to run. They can be viewed with ps or pgrep. You may also see "INFO Shutting down local Cluster API control plane..."; that means the shutdown started but did not complete.
Expected results:
The installer should shut down gracefully and not leak processes, such as:
^CWARNING Received interrupt signal
INFO Shutting down local Cluster API control plane...
INFO Stopped controller: Cluster API
INFO Stopped controller: aws infrastructure provider
ERROR failed to fetch Cluster: failed to generate asset "Cluster": failed to create cluster: failed to create infrastructure manifest: Post "https://127.0.0.1:41441/apis/infrastructure.cluster.x-k8s.io/v1beta2/awsclustercontrolleridentities": unexpected EOF
INFO Local Cluster API system has completed operations
Additional info:
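As additional context, here is a minimal Go sketch of how subprocess lifetimes could be tied to an interrupt-aware context so a CTRL+C kills the kube-apiserver, etcd, and provider children instead of leaking them. This is not the installer's actual code; the runLocalControlPlane helper and the placeholder command are assumptions for illustration.

package main

import (
	"context"
	"log"
	"os"
	"os/exec"
	"os/signal"
	"sync"
	"syscall"
)

// runLocalControlPlane is a hypothetical helper: it starts each command as a
// child of a context that is cancelled on CTRL+C / SIGTERM, so an interrupt
// kills every child instead of leaving orphans reconciling cloud resources.
func runLocalControlPlane(ctx context.Context, commands [][]string) error {
	ctx, stop := signal.NotifyContext(ctx, os.Interrupt, syscall.SIGTERM)
	defer stop()

	var wg sync.WaitGroup
	for _, args := range commands {
		cmd := exec.CommandContext(ctx, args[0], args[1:]...)
		cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
		if err := cmd.Start(); err != nil {
			stop()    // cancels the context, tearing down anything already started
			wg.Wait() // wait for the children's Wait() goroutines to finish
			return err
		}
		wg.Add(1)
		go func(c *exec.Cmd) {
			defer wg.Done()
			// When ctx is cancelled, CommandContext kills the process, so
			// Wait returns instead of blocking forever.
			_ = c.Wait()
		}(cmd)
	}
	wg.Wait()
	return ctx.Err()
}

func main() {
	// Placeholder command; the real installer launches envtest binaries
	// (kube-apiserver, etcd) and the CAPI provider processes.
	if err := runLocalControlPlane(context.Background(), [][]string{{"sleep", "300"}}); err != nil {
		log.Printf("local control plane stopped: %v", err)
	}
}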
Description of problem:
Enable KMS v2 in the ibmcloud KMS provider
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
This is a clone of issue OCPBUGS-34987. The following is the description of the original issue:
—
Description of problem:
When creating a hostedcluster with --role-arn and --sts-creds, the creation fails.
Version-Release number of selected component (if applicable):
4.16 4.17
How reproducible:
100%
Steps to Reproduce:
1. hypershift-no-cgo create iam cli-role 2. aws sts get-session-token --output json 3. hcp create cluster aws --role-arn xxx --sts-creds xxx
Actual results:
2024-06-06T04:34:39Z ERROR Failed to create cluster {"error": "failed to create iam: AccessDenied: User: arn:aws:sts::301721915996:assumed-role/6cd90f28a6449141869b/cli-create-iam is not authorized to perform: iam:TagOpenIDConnectProvider on resource: arn:aws:iam::301721915996:oidc-provider/hypershift-ci-oidc.s3.us-east-1.amazonaws.com/6cd90f28a6449141869b because no identity-based policy allows the iam:TagOpenIDConnectProvider action\n\tstatus code: 403, request id: 20e16ec4-b9a1-4fa4-aa34-1344145d41fd"} github.com/openshift/hypershift/product-cli/cmd/cluster/aws.NewCreateCommand.func1 /remote-source/app/product-cli/cmd/cluster/aws/create.go:60 github.com/spf13/cobra.(*Command).execute /remote-source/app/vendor/github.com/spf13/cobra/command.go:983 github.com/spf13/cobra.(*Command).ExecuteC /remote-source/app/vendor/github.com/spf13/cobra/command.go:1115 github.com/spf13/cobra.(*Command).Execute /remote-source/app/vendor/github.com/spf13/cobra/command.go:1039 github.com/spf13/cobra.(*Command).ExecuteContext /remote-source/app/vendor/github.com/spf13/cobra/command.go:1032 main.main /remote-source/app/product-cli/main.go:60 runtime.main /usr/lib/golang/src/runtime/proc.go:271 Error: failed to create iam: AccessDenied: User: arn:aws:sts::301721915996:assumed-role/6cd90f28a6449141869b/cli-create-iam is not authorized to perform: iam:TagOpenIDConnectProvider on resource: arn:aws:iam::301721915996:oidc-provider/hypershift-ci-oidc.s3.us-east-1.amazonaws.com/6cd90f28a6449141869b because no identity-based policy allows the iam:TagOpenIDConnectProvider action status code: 403, request id: 20e16ec4-b9a1-4fa4-aa34-1344145d41fd failed to create iam: AccessDenied: User: arn:aws:sts::301721915996:assumed-role/6cd90f28a6449141869b/cli-create-iam is not authorized to perform: iam:TagOpenIDConnectProvider on resource: arn:aws:iam::301721915996:oidc-provider/hypershift-ci-oidc.s3.us-east-1.amazonaws.com/6cd90f28a6449141869b because no identity-based policy allows the iam:TagOpenIDConnectProvider action status code: 403, request id: 20e16ec4-b9a1-4fa4-aa34-1344145d41fd {"component":"entrypoint","error":"wrapped process failed: exit status 1","file":"sigs.k8s.io/prow/pkg/entrypoint/run.go:84","func":"sigs.k8s.io/prow/pkg/entrypoint.Options.internalRun","level":"error","msg":"Error executing test process","severity":"error","time":"2024-06-06T04:34:39Z"} error: failed to execute wrapped command: exit status 1
Expected results:
create hostedcluster successful
Additional info:
Full Logs: https://docs.google.com/document/d/1AnvAHXPfPYtP6KRcAKOebAx1wXjhWMOn3TW604XK09o/edit
The same command succeeds when run a second time.
Please review the following PR: https://github.com/openshift/azure-workload-identity/pull/11
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
Currently, the openshift-enterprise-tests image depends on the openstack repository for the x86_64 and ppc64le architectures. The package python-cinder gets installed to allow the openstack end-to-end tests (https://github.com/openshift/release/blob/60fed3474509bff9c5585a736554739e8ec4f017/ci-operator/step-registry/openstack/test/e2e/openstack-test-e2e-chain.yaml#L5, https://github.com/openshift/openstack-test/) to run. The python-cinder package is not made available for RHEL 9 on ppc64le. To move the tests image to RHEL 9, OCP should probably follow upstream's decision to not support ppc64le.
Please review the following PR: https://github.com/openshift/openshift-state-metrics/pull/112
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
This is a clone of issue OCPBUGS-33005. The following is the description of the original issue:
—
Description of problem:
Pod stuck in creating state when running performance benchmark The exact error when describing the pod - Events: Type Reason Age From Message ---- ------ ---- ---- ------- Warning FailedCreatePodSandBox 45s (x114 over 3h47m) kubelet (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_client-1-5c978b7665-n4tds_cluster-density-v2-35_f57d8281-5a79-4c91-9b83-bb3e4b553597_0(5a8d6897ca792d91f1c52054f5f8c596530fbf72d3abb07b19a20fd9c95cc564): error adding pod cluster-density-v2-35_client-1-5c978b7665-n4tds to CNI network "multus-cni-network": plugin type="multus-shim" name="multus-cni-network" failed (add): CmdAdd (shim): CNI request failed with status 400: '&\{ContainerID:5a8d6897ca792d91f1c52054f5f8c596530fbf72d3abb07b19a20fd9c95cc564 Netns:/var/run/netns/e06c9af7-c13d-426f-9a00-73c54441a20b IfName:eth0 Args:IgnoreUnknown=1;K8S_POD_NAMESPACE=cluster-density-v2-35;K8S_POD_NAME=client-1-5c978b7665-n4tds;K8S_POD_INFRA_CONTAINER_ID=5a8d6897ca792d91f1c52054f5f8c596530fbf72d3abb07b19a20fd9c95cc564;K8S_POD_UID=f57d8281-5a79-4c91-9b83-bb3e4b553597 Path: StdinData:[123 34 98 105 110 68 105 114 34 58 34 47 118 97 114 47 108 105 98 47 99 110 105 47 98 105 110 34 44 34 99 104 114 111 111 116 68 105 114 34 58 34 47 104 111 115 116 114 111 111 116 34 44 34 99 108 117 115 116 101 114 78 101 116 119 111 114 107 34 58 34 47 104 111 115 116 47 114 117 110 47 109 117 108 116 117 115 47 99 110 105 47 110 101 116 46 100 47 49 48 45 111 118 110 45 107 117 98 101 114 110 101 116 101 115 46 99 111 110 102 34 44 34 99 110 105 67 111 110 102 105 103 68 105 114 34 58 34 47 104 111 115 116 47 101 116 99 47 99 110 105 47 110 101 116 46 100 34 44 34 99 110 105 86 101 114 115 105 111 110 34 58 34 48 46 51 46 49 34 44 34 100 97 101 109 111 110 83 111 99 107 101 116 68 105 114 34 58 34 47 114 117 110 47 109 117 108 116 117 115 47 115 111 99 107 101 116 34 44 34 103 108 111 98 97 108 78 97 109 101 115 112 97 99 101 115 34 58 34 100 101 102 97 117 108 116 44 111 112 101 110 115 104 105 102 116 45 109 117 108 116 117 115 44 111 112 101 110 115 104 105 102 116 45 115 114 105 111 118 45 110 101 116 119 111 114 107 45 111 112 101 114 97 116 111 114 34 44 34 108 111 103 76 101 118 101 108 34 58 34 118 101 114 98 111 115 101 34 44 34 108 111 103 84 111 83 116 100 101 114 114 34 58 116 114 117 101 44 34 109 117 108 116 117 115 65 117 116 111 99 111 110 102 105 103 68 105 114 34 58 34 47 104 111 115 116 47 114 117 110 47 109 117 108 116 117 115 47 99 110 105 47 110 101 116 46 100 34 44 34 109 117 108 116 117 115 67 111 110 102 105 103 70 105 108 101 34 58 34 97 117 116 111 34 44 34 110 97 109 101 34 58 34 109 117 108 116 117 115 45 99 110 105 45 110 101 116 119 111 114 107 34 44 34 110 97 109 101 115 112 97 99 101 73 115 111 108 97 116 105 111 110 34 58 116 114 117 101 44 34 112 101 114 78 111 100 101 67 101 114 116 105 102 105 99 97 116 101 34 58 123 34 98 111 111 116 115 116 114 97 112 75 117 98 101 99 111 110 102 105 103 34 58 34 47 118 97 114 47 108 105 98 47 107 117 98 101 108 101 116 47 107 117 98 101 99 111 110 102 105 103 34 44 34 99 101 114 116 68 105 114 34 58 34 47 101 116 99 47 99 110 105 47 109 117 108 116 117 115 47 99 101 114 116 115 34 44 34 99 101 114 116 68 117 114 97 116 105 111 110 34 58 34 50 52 104 34 44 34 101 110 97 98 108 101 100 34 58 116 114 117 101 125 44 34 115 111 99 107 101 116 68 105 114 34 58 34 47 104 111 115 116 47 114 117 110 47 109 117 108 116 117 115 47 115 111 99 107 101 116 34 44 34 116 121 112 101 34 58 34 109 117 
108 116 117 115 45 115 104 105 109 34 125]} ContainerID:"5a8d6897ca792d91f1c52054f5f8c596530fbf72d3abb07b19a20fd9c95cc564" Netns:"/var/run/netns/e06c9af7-c13d-426f-9a00-73c54441a20b" IfName:"eth0" Args:"IgnoreUnknown=1;K8S_POD_NAMESPACE=cluster-density-v2-35;K8S_POD_NAME=client-1-5c978b7665-n4tds;K8S_POD_INFRA_CONTAINER_ID=5a8d6897ca792d91f1c52054f5f8c596530fbf72d3abb07b19a20fd9c95cc564;K8S_POD_UID=f57d8281-5a79-4c91-9b83-bb3e4b553597" Path:"" ERRORED: error configuring pod [cluster-density-v2-35/client-1-5c978b7665-n4tds] networking: [cluster-density-v2-35/client-1-5c978b7665-n4tds/f57d8281-5a79-4c91-9b83-bb3e4b553597:ovn-kubernetes]: error adding container to network "ovn-kubernetes": CNI request failed with status 400: '[cluster-density-v2-35/client-1-5c978b7665-n4tds 5a8d6897ca792d91f1c52054f5f8c596530fbf72d3abb07b19a20fd9c95cc564 network default NAD default] [cluster-density-v2-35/client-1-5c978b7665-n4tds 5a8d6897ca792d91f1c52054f5f8c596530fbf72d3abb07b19a20fd9c95cc564 network default NAD default] failed to configure pod interface: timed out waiting for OVS port binding (ovn-installed) for 0a:58:0a:83:03:f6 [10.131.3.246/23] ' ': StdinData: \{"binDir":"/var/lib/cni/bin","clusterNetwork":"/host/run/multus/cni/net.d/10-ovn-kubernetes.conf","cniVersion":"0.3.1","daemonSocketDir":"/run/multus/socket","globalNamespaces":"default,openshift-multus,openshift-sriov-network-operator","logLevel":"verbose","logToStderr":true,"name":"multus-cni-network","namespaceIsolation":true,"type":"multus-shim"}
Version-Release number of selected component (if applicable):
4.16.0-ec.5
How reproducible:
50-60%. It seems to be related to the number of times I have run our test on a single cluster. Many of our performance tests are on ephemeral clusters - so we build the cluster, run the test, tear down. Currently I have a long-lived cluster (1 week old), and I have been running many performance tests against this cluster serially. After each test, the previous resources are cleaned up.
Steps to Reproduce:
1. Use the following cmdline as an example. 2. ./bin/amd64/kube-burner-ocp cluster-density-v2 --iterations 90 3. Repeat until the issue arises (usually after 3-4 attempts).
Actual results:
client-1-5c978b7665-n4tds 0/1 ContainerCreating 0 4h14m
Expected results:
The benchmark should not get stuck waiting for this pod.
Additional info:
Looking at the ovnkube-controller pod logs, grepping for the pod which was stuck:
oc logs -n openshift-ovn-kubernetes ovnkube-node-qpkws -c ovnkube-controller | grep client-1-5c978b7665-n4tds
W0425 13:12:09.302395 6996 base_network_controller_policy.go:545] Failed to get get LSP for pod cluster-density-v2-35/client-1-5c978b7665-n4tds NAD default for networkPolicy allow-from-openshift-ingress, err: logical port cluster-density-v2-35/client-1-5c978b7665-n4tds for pod cluster-density-v2-35_client-1-5c978b7665-n4tds not found in cache
I0425 13:12:09.302412 6996 obj_retry.go:370] Retry add failed for *factory.localPodSelector cluster-density-v2-35/client-1-5c978b7665-n4tds, will try again later: unable to get port info for pod cluster-density-v2-35/client-1-5c978b7665-n4tds NAD default
W0425 13:12:09.908446 6996 helper_linux.go:481] [cluster-density-v2-35/client-1-5c978b7665-n4tds 7f80514901cbc57517d263f1a5aa143d2c82f470132c01f8ba813c18f3160ee4] pod uid f57d8281-5a79-4c91-9b83-bb3e4b553597: timed out waiting for OVS port binding (ovn-installed) for 0a:58:0a:83:03:f6 [10.131.3.246/23]
I0425 13:12:09.963651 6996 cni.go:279] [cluster-density-v2-35/client-1-5c978b7665-n4tds 7f80514901cbc57517d263f1a5aa143d2c82f470132c01f8ba813c18f3160ee4 network default NAD default] ADD finished CNI request [cluster-density-v2-35/client-1-5c978b7665-n4tds 7f80514901cbc57517d263f1a5aa143d2c82f470132c01f8ba813c18f3160ee4 network default NAD default], result "", err failed to configure pod interface: timed out waiting for OVS port binding (ovn-installed) for 0a:58:0a:83:03:f6 [10.131.3.246/23]
I0425 13:12:09.988397 6996 cni.go:258] [cluster-density-v2-35/client-1-5c978b7665-n4tds 7f80514901cbc57517d263f1a5aa143d2c82f470132c01f8ba813c18f3160ee4 network default NAD default] DEL starting CNI request [cluster-density-v2-35/client-1-5c978b7665-n4tds 7f80514901cbc57517d263f1a5aa143d2c82f470132c01f8ba813c18f3160ee4 network default NAD default]
W0425 13:12:09.996899 6996 helper_linux.go:697] Failed to delete pod "cluster-density-v2-35/client-1-5c978b7665-n4tds" interface 7f80514901cbc57: failed to lookup link 7f80514901cbc57: Link not found
I0425 13:12:10.009234 6996 cni.go:279] [cluster-density-v2-35/client-1-5c978b7665-n4tds 7f80514901cbc57517d263f1a5aa143d2c82f470132c01f8ba813c18f3160ee4 network default NAD default] DEL finished CNI request [cluster-density-v2-35/client-1-5c978b7665-n4tds 7f80514901cbc57517d263f1a5aa143d2c82f470132c01f8ba813c18f3160ee4 network default NAD default], result "{\"dns\":{}}", err <nil>
I0425 13:12:10.059917 6996 cni.go:258] [cluster-density-v2-35/client-1-5c978b7665-n4tds 7f80514901cbc57517d263f1a5aa143d2c82f470132c01f8ba813c18f3160ee4 network default NAD default] DEL starting CNI request [cluster-density-v2-35/client-1-5c978b7665-n4tds 7f80514901cbc57517d263f1a5aa143d2c82f470132c01f8ba813c18f3160ee4 network default NAD default]
Description of problem:
A customer is deploying SNO with lvms-operator being installed during cluster installation using assisted-service. One of the deployment failed with catalog-operator pod crashlooping. NAME READY STATUS RESTARTS AGE catalog-operator-db9dff494-pqb68 0/1 CrashLoopBackOff 56 4h The pod logs show a panic. $ oc logs catalog-operator-db9dff494-pqb68 -n openshift-operator-lifecycle-manager2024-05-16T13:24:46.709156999Z time="2024-05-16T13:24:46Z" level=info msg="log level info"2024-05-16T13:24:46.709232085Z time="2024-05-16T13:24:46Z" level=info msg="TLS keys set, using https for metrics"2024-05-16T13:24:46.709736948Z W0516 13:24:46.709618 1 client_config.go:618] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.2024-05-16T13:24:46.709855179Z time="2024-05-16T13:24:46Z" level=info msg="Using in-cluster kube client config"2024-05-16T13:24:46.710165923Z time="2024-05-16T13:24:46Z" level=info msg="Using in-cluster kube client config"2024-05-16T13:24:46.710274657Z W0516 13:24:46.710268 1 client_config.go:618] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.2024-05-16T13:24:46.711960302Z W0516 13:24:46.711831 1 client_config.go:618] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.2024-05-16T13:24:46.720943025Z time="2024-05-16T13:24:46Z" level=info msg="connection established. cluster-version: v1.27.12+7bee54d"2024-05-16T13:24:46.720943025Z time="2024-05-16T13:24:46Z" level=info msg="operator ready"2024-05-16T13:24:46.720943025Z time="2024-05-16T13:24:46Z" level=info msg="starting informers..."2024-05-16T13:24:46.720943025Z time="2024-05-16T13:24:46Z" level=info msg="informers started"2024-05-16T13:24:46.720943025Z time="2024-05-16T13:24:46Z" level=info msg="waiting for caches to sync..."2024-05-16T13:24:46.921220918Z time="2024-05-16T13:24:46Z" level=info msg="starting workers..."2024-05-16T13:24:46.921869716Z time="2024-05-16T13:24:46Z" level=info msg="connection established. 
cluster-version: v1.27.12+7bee54d"2024-05-16T13:24:46.921869716Z time="2024-05-16T13:24:46Z" level=info msg="operator ready"2024-05-16T13:24:46.921869716Z time="2024-05-16T13:24:46Z" level=info msg="starting informers..."2024-05-16T13:24:46.921869716Z time="2024-05-16T13:24:46Z" level=info msg="informers started"2024-05-16T13:24:46.921869716Z time="2024-05-16T13:24:46Z" level=info msg="waiting for caches to sync..."2024-05-16T13:24:46.922300604Z time="2024-05-16T13:24:46Z" level=info msg=syncing event=update reconciling="*v1alpha1.Subscription" selflink=2024-05-16T13:24:47.022696884Z time="2024-05-16T13:24:47Z" level=info msg="starting workers..."2024-05-16T13:24:59.544398366Z panic: runtime error: invalid memory address or nil pointer dereference2024-05-16T13:24:59.544398366Z [signal SIGSEGV: segmentation violation code=0x1 addr=0x38 pc=0x1d761e6]2024-05-16T13:24:59.544398366Z 2024-05-16T13:24:59.544398366Z goroutine 469 [running]:2024-05-16T13:24:59.544398366Z github.com/operator-framework/operator-lifecycle-manager/pkg/controller/bundle.sortUnpackJobs.func1(0xc002bdca20?, 0x0?)2024-05-16T13:24:59.544398366Z /build/vendor/github.com/operator-framework/operator-lifecycle-manager/pkg/controller/bundle/bundle_unpacker.go:844 +0xc62024-05-16T13:24:59.544398366Z sort.insertionSort_func({0xc002b7cfb0?, 0xc0029fffe0?}, 0x0, 0x3)2024-05-16T13:24:59.544398366Z /usr/lib/golang/src/sort/zsortfunc.go:12 +0xb12024-05-16T13:24:59.544398366Z sort.pdqsort_func({0xc002b7cfb0?, 0xc0029fffe0?}, 0x7f07987eab38?, 0x18?, 0xc001e80000?)2024-05-16T13:24:59.544398366Z /usr/lib/golang/src/sort/zsortfunc.go:73 +0x2dd
Version-Release number of selected component (if applicable):
4.14.22
How reproducible:
Only sometimes
Steps to Reproduce:
1. SNO cluster deployment using assisted service 2. Provide lvms-operator sub, operatorgroup and namespace yamls during installation 3. The pod crashed once the node booted after ignition
Actual results:
Pod crashed with panic
Expected results:
The pod should be running
Additional info:
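For illustration only, here is a hedged Go sketch of the failure mode and a possible guard; it is not the actual OLM sortUnpackJobs code, and the condition and unpackJob types are invented stand-ins. A sort comparator that dereferences a pointer which is nil for some jobs panics exactly as in the trace above; checking for nil first avoids the crash.

package main

import (
	"fmt"
	"sort"
	"time"
)

// condition mimics the "failed" condition a comparator might look up on an
// unpack job; it can legitimately be missing (nil) for some jobs.
type condition struct {
	LastTransitionTime time.Time
}

type unpackJob struct {
	Name   string
	Failed *condition // nil when the job never reported a failure
}

// sortJobsNewestFailureFirst orders jobs by failure time, treating jobs with
// no failure condition as "oldest" instead of dereferencing a nil pointer.
func sortJobsNewestFailureFirst(jobs []unpackJob) {
	sort.SliceStable(jobs, func(i, j int) bool {
		ci, cj := jobs[i].Failed, jobs[j].Failed
		switch {
		case ci == nil && cj == nil:
			return false
		case ci == nil:
			return false // i has no failure, sort it after j
		case cj == nil:
			return true
		default:
			return ci.LastTransitionTime.After(cj.LastTransitionTime)
		}
	})
}

func main() {
	jobs := []unpackJob{
		{Name: "a", Failed: &condition{LastTransitionTime: time.Now()}},
		{Name: "b"}, // would panic in a comparator that blindly dereferences Failed
		{Name: "c", Failed: &condition{LastTransitionTime: time.Now().Add(-time.Hour)}},
	}
	sortJobsNewestFailureFirst(jobs)
	fmt.Println(jobs[0].Name)
}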
This is a clone of issue OCPBUGS-34911. The following is the description of the original issue:
—
We need to add more people to the OWNERS file of the multus repo.
Description of problem:
Multicast packets are 100% dropped.
Version-Release number of selected component (if applicable):
4.16.0-0.nightly-2024-06-02-202327
How reproducible:
Always
Steps to Reproduce:
1. Create a test namespace and enable multicast
oc describe ns test
Name: test
Labels: kubernetes.io/metadata.name=test
pod-security.kubernetes.io/audit=restricted
pod-security.kubernetes.io/audit-version=v1.24
pod-security.kubernetes.io/enforce=restricted
pod-security.kubernetes.io/enforce-version=v1.24
pod-security.kubernetes.io/warn=restricted
pod-security.kubernetes.io/warn-version=v1.24
Annotations: k8s.ovn.org/multicast-enabled: true
openshift.io/sa.scc.mcs: s0:c28,c27
openshift.io/sa.scc.supplemental-groups: 1000810000/10000
openshift.io/sa.scc.uid-range: 1000810000/10000
Status: Active
No resource quota.
No LimitRange resource.
2. Created multicast pods
% oc get pods -n test -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
mcast-rc-67897 1/1 Running 0 10s 10.129.2.42 ip-10-0-86-58.us-east-2.compute.internal <none> <none>
mcast-rc-ftsq8 1/1 Running 0 10s 10.128.2.61 ip-10-0-33-247.us-east-2.compute.internal <none> <none>
mcast-rc-q48db 1/1 Running 0 10s 10.131.0.27 ip-10-0-1-176.us-east-2.compute.internal <none> <none>
3. Test multicast traffic with omping from two pods
% oc rsh -n test mcast-rc-67897 ~ $ ~ $ omping -c10 10.129.2.42 10.128.2.61 10.128.2.61 : waiting for response msg 10.128.2.61 : joined (S,G) = (*, 232.43.211.234), pinging 10.128.2.61 : unicast, seq=1, size=69 bytes, dist=2, time=0.506ms 10.128.2.61 : unicast, seq=2, size=69 bytes, dist=2, time=0.595ms 10.128.2.61 : unicast, seq=3, size=69 bytes, dist=2, time=0.555ms 10.128.2.61 : unicast, seq=4, size=69 bytes, dist=2, time=0.572ms 10.128.2.61 : unicast, seq=5, size=69 bytes, dist=2, time=0.614ms 10.128.2.61 : unicast, seq=6, size=69 bytes, dist=2, time=0.653ms 10.128.2.61 : unicast, seq=7, size=69 bytes, dist=2, time=0.611ms 10.128.2.61 : unicast, seq=8, size=69 bytes, dist=2, time=0.594ms 10.128.2.61 : unicast, seq=9, size=69 bytes, dist=2, time=0.603ms 10.128.2.61 : unicast, seq=10, size=69 bytes, dist=2, time=0.687ms 10.128.2.61 : given amount of query messages was sent 10.128.2.61 : unicast, xmt/rcv/%loss = 10/10/0%, min/avg/max/std-dev = 0.506/0.599/0.687/0.050 10.128.2.61 : multicast, xmt/rcv/%loss = 10/0/100%, min/avg/max/std-dev = 0.000/0.000/0.000/0.000 % oc rsh -n test mcast-rc-ftsq8 ~ $ omping -c10 10.128.2.61 10.129.2.42 10.129.2.42 : waiting for response msg 10.129.2.42 : waiting for response msg 10.129.2.42 : waiting for response msg 10.129.2.42 : waiting for response msg 10.129.2.42 : joined (S,G) = (*, 232.43.211.234), pinging 10.129.2.42 : unicast, seq=1, size=69 bytes, dist=2, time=0.463ms 10.129.2.42 : unicast, seq=2, size=69 bytes, dist=2, time=0.578ms 10.129.2.42 : unicast, seq=3, size=69 bytes, dist=2, time=0.632ms 10.129.2.42 : unicast, seq=4, size=69 bytes, dist=2, time=0.652ms 10.129.2.42 : unicast, seq=5, size=69 bytes, dist=2, time=0.635ms 10.129.2.42 : unicast, seq=6, size=69 bytes, dist=2, time=0.626ms 10.129.2.42 : unicast, seq=7, size=69 bytes, dist=2, time=0.597ms 10.129.2.42 : unicast, seq=8, size=69 bytes, dist=2, time=0.618ms 10.129.2.42 : unicast, seq=9, size=69 bytes, dist=2, time=0.964ms 10.129.2.42 : unicast, seq=10, size=69 bytes, dist=2, time=0.619ms 10.129.2.42 : given amount of query messages was sent 10.129.2.42 : unicast, xmt/rcv/%loss = 10/10/0%, min/avg/max/std-dev = 0.463/0.638/0.964/0.126 10.129.2.42 : multicast, xmt/rcv/%loss = 10/0/100%, min/avg/max/std-dev = 0.000/0.000/0.000/0.000
Actual results:
Multicast packet loss is 100%
10.129.2.42 : multicast, xmt/rcv/%loss = 10/0/100%, min/avg/max/std-dev = 0.000/0.000/0.000/0.000
Expected results:
There should be no 100% packet loss.
Additional info:
No such issue in 4.15; tested on the same profile ipi-on-aws/versioned-installer-ci with 4.15.0-0.nightly-2024-05-31-131420, performing the same operations as the steps above.
The output for both multicast pods:
10.131.0.27 : unicast, xmt/rcv/%loss = 10/10/0%, min/avg/max/std-dev = 1.176/1.239/1.269/0.027
10.131.0.27 : multicast, xmt/rcv/%loss = 10/9/9% (seq>=2 0%), min/avg/max/std-dev = 1.227/1.304/1.755/0.170
and
10.129.2.16 : unicast, xmt/rcv/%loss = 10/10/0%, min/avg/max/std-dev = 1.101/1.264/1.321/0.065
10.129.2.16 : multicast, xmt/rcv/%loss = 10/10/0%, min/avg/max/std-dev = 1.230/1.351/1.890/0.191
Please review the following PR: https://github.com/openshift/network-tools/pull/125
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
This is a clone of issue OCPBUGS-41255. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-39438. The following is the description of the original issue:
—
Description of problem: If a customer applies ethtool configuration to the interface used in br-ex, that configuration will be dropped when br-ex is created. We need to read and apply the configuration from the interface to the phys0 connection profile, as described in https://issues.redhat.com/browse/RHEL-56741?focusedId=25465040&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-25465040
Version-Release number of selected component (if applicable): 4.16
How reproducible: Always
Steps to Reproduce:
1. Deploy a cluster with an NMState config that sets the ethtool.feature.esp-tx-csum-hw-offload field to "off"
2.
3.
Actual results: The ethtool setting is only applied to the interface profile which is disabled after configure-ovs runs
Expected results: The ethtool setting is present on the configure-ovs-created profile
Additional info:
Affected Platforms: VSphere. Probably baremetal too and possibly others.
Description of problem:
When I test PR https://github.com/openshift/machine-config-operator/pull/4083, there is a machineset that does not have any machines linked to it.
$ oc get machineset/rioliu-1220c-bz2gp-worker-f -n openshift-machine-api
NAME DESIRED CURRENT READY AVAILABLE AGE
rioliu-1220c-bz2gp-worker-f 0 0 3h47m
Many errors like the one below are found in the MCD log:
I1220 09:15:59.743704 1 machine_set_boot_image_controller.go:211] Error syncing machineset openshift-machine-api/rioliu-1220c-bz2gp-worker-f: failed to fetch architecture type of machineset rioliu-1220c-bz2gp-worker-f, err: could not find any machines linked to machineset, error: %!w(<nil>)
Because of the error above, the machineset patch is skipped in the reconcile loop, so the boot image info cannot be patched even though the machineset does not have any machines provisioned.
Version-Release number of selected component (if applicable):
How reproducible:
Consistently
Steps to Reproduce:
https://github.com/openshift/machine-config-operator/pull/4083#issuecomment-1864226629
Actual results:
The machineset is skipped in the reconcile loop due to the above error, so the boot image info cannot be patched.
Expected results:
The machineset should be updated even when no linked machines are found, because it may simply have been scaled down to 0 replicas.
Additional info:
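A hedged Go sketch of the expected behavior follows. It is not the actual MCO controller code; MachineSetInfo, archFromMachines, and the cluster-default fallback are assumptions. The idea is that, rather than aborting the reconcile when no machines are linked, the architecture lookup could fall back to a default so the boot image can still be patched for a 0-replica machineset.

package bootimagesketch

import "fmt"

// MachineSetInfo is a trimmed stand-in for the real MachineSet; only the
// pieces needed for this illustration are included.
type MachineSetInfo struct {
	Name     string
	Replicas int
}

// archFromMachines stands in for the existing lookup, which fails when no
// machines are linked to the machineset (for example at 0 replicas).
func archFromMachines(ms MachineSetInfo, linkedMachines []string) (string, error) {
	if len(linkedMachines) == 0 {
		return "", fmt.Errorf("could not find any machines linked to machineset %s", ms.Name)
	}
	return "x86_64", nil
}

// resolveArch falls back to a cluster-wide default architecture (a
// hypothetical parameter here) so a scaled-down machineset is still reconciled
// and its boot image reference can be patched, instead of being skipped.
func resolveArch(ms MachineSetInfo, linkedMachines []string, clusterDefaultArch string) (string, error) {
	arch, err := archFromMachines(ms, linkedMachines)
	if err == nil {
		return arch, nil
	}
	if clusterDefaultArch != "" {
		return clusterDefaultArch, nil
	}
	return "", fmt.Errorf("cannot determine architecture for machineset %s: %w", ms.Name, err)
}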
After the fix for OCPBUGSM-44759, we put timeouts on payload retrieval operations (verification and download); previously they were uncapped and under certain network circumstances could take hours to terminate. Testing the fix uncovered a problem where, after CVO passes the path with the timeouts, CVO starts logging errors for the core manifest reconciliation loop:
I0208 11:22:57.107819 1 sync_worker.go:993] Running sync for role "openshift-marketplace/marketplace-operator" (648 of 834)
I0208 11:22:57.107887 1 task_graph.go:474] Canceled worker 1 while waiting for work
I0208 11:22:57.107900 1 sync_worker.go:1013] Done syncing for configmap "openshift-apiserver-operator/trusted-ca-bundle" (444 of 834)
I0208 11:22:57.107911 1 task_graph.go:474] Canceled worker 0 while waiting for work
I0208 11:22:57.107918 1 task_graph.go:523] Workers finished
I0208 11:22:57.107925 1 task_graph.go:546] Result of work: [update context deadline exceeded at 8 of 834 Could not update role "openshift-marketplace/marketplace-operator" (648 of 834)]
I0208 11:22:57.107938 1 sync_worker.go:1169] Summarizing 1 errors
I0208 11:22:57.107947 1 sync_worker.go:1173] Update error 648 of 834: UpdatePayloadFailed Could not update role "openshift-marketplace/marketplace-operator" (648 of 834) (context.deadlineExceededError: context deadline exceeded)
E0208 11:22:57.107966 1 sync_worker.go:654] unable to synchronize image (waiting 3m39.457405047s): Could not update role "openshift-marketplace/marketplace-operator" (648 of 834)
This is caused by locks. The SyncWorker.Update method acquires its lock for its whole duration. The payloadRetriever.RetrievePayload method is called inside SyncWorker.Update, on the following call chain:
SyncWorker.Update -> SyncWorker.loadUpdatedPayload -> SyncWorker.syncPayload -> payloadRetriever.RetrievePayload
RetrievePayload can take 2 or 4 minutes before it timeouts, so CVO holds the lock for this whole wait.
The manifest reconciliation loop is implemented in the apply method. The whole apply method is bounded by a timeout context set to 2*minimum reconcile interval so it will be set to a value between 4 and 8 minutes. While in the reconciling mode, the manifest graph is split into multiple "tasks" where smaller sequences of these tasks are applied in parallel. Individual tasks in these series are iterated over and each iteration uses a consistentReporter to report status via its Update method, which also acquires the lock on the following call sequence:
SyncWorker.apply -> (for _, task := range tasks ...) -> consistentReporter.Update -> statusWrapper.Report -> SyncWorker.updateApplyStatus
This leads to the following sequence:
1. apply is called with a timeout between 4 and 8 minutes
2. in parallel, SyncWorker.Update starts and acquires the lock
3. tasks under apply wait on the reporter to acquire lock
4. after 2 or 4 minutes RetrievePayload under SyncWorker.Update timeout and terminate, SyncWorker.Update terminates and releases the lock
5. tasks under apply report results after briefly acquiring the lock, start to do their thing
6. in parallel, SyncWorker.Update starts again and acquires the lock
7. further iterations over tasks under apply wait on the reporter to acquire lock
8. context passed to apply times out
9. Canceled worker 0 while waiting for work... errors
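A minimal Go sketch of the locking pattern that would avoid the starvation described in the sequence above; this is not the CVO's actual code, and the worker type and retrievePayload stub are invented for illustration. The point is to snapshot the needed state under the mutex, release it before the slow retrieval, and re-acquire it only to publish the result, so status reporters never wait minutes for the lock.

package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

type worker struct {
	mu      sync.Mutex
	desired string // e.g. the payload image to retrieve
	status  string
}

// retrievePayload stands in for the slow verification/download step; it can
// take minutes before it times out.
func retrievePayload(ctx context.Context, image string) (string, error) {
	select {
	case <-time.After(2 * time.Second):
		return "payload-for-" + image, nil
	case <-ctx.Done():
		return "", ctx.Err()
	}
}

// update copies what it needs under the lock, releases it for the slow call,
// then re-locks only to publish the outcome. Report() below can therefore run
// concurrently instead of blocking for the whole retrieval.
func (w *worker) update(ctx context.Context) error {
	w.mu.Lock()
	image := w.desired
	w.mu.Unlock()

	ctx, cancel := context.WithTimeout(ctx, 4*time.Minute)
	defer cancel()
	payload, err := retrievePayload(ctx, image)

	w.mu.Lock()
	defer w.mu.Unlock()
	if err != nil {
		w.status = "retrieval failed: " + err.Error()
		return err
	}
	w.status = "retrieved " + payload
	return nil
}

// Report is what the manifest-apply tasks call; it only needs the lock briefly.
func (w *worker) Report(msg string) {
	w.mu.Lock()
	defer w.mu.Unlock()
	w.status = msg
}

func main() {
	w := &worker{desired: "quay.io/openshift-release-dev/ocp-release:example"}
	go w.Report("applying manifests")
	_ = w.update(context.Background())
	fmt.Println(w.status)
}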
4.13.0-0.ci.test-2023-02-06-062603 with https://github.com/openshift/cluster-version-operator/pull/896
always in certain cluster configuration
1. in a disconnected cluster, upgrade to an unreachable payload image with --force
2. observe the CVO log
CVO starts to fail reconciling manifests
no failures, cluster continues to try retrieving the image but no interference with manifest reconciliation
This problem was discovered by Evgeni Vakhonin while testing fix for OCPBUGSM-44759: https://bugzilla.redhat.com/show_bug.cgi?id=2090680#c22
https://github.com/openshift/cluster-version-operator/pull/896 uncovers this issue but still gets CVO into a better shape - previously the RetrievePayload could be running for a much longer time (hours), preventing the CVO from working at all.
When the cluster gets into this buggy state, the solution is to abort the upgrade that fails to verify or download.
Description of problem:
/var/log/etcd/etcd-health-probe.log exists on control plane nodes, but the code only touches it: https://github.com/openshift/cluster-etcd-operator/blob/master/bindata/etcd/pod.yaml#L26
etcd's /var/log/etcd/etcd-health-probe.log can be mistaken for an audit log, because audit logs live in the same directory tree for the apiserver and auth components:
/var/log/kube-apiserver/audit-2024-03-21T04-27-49.470.log
/var/log/oauth-server/audit.log
etcd-health-probe.log will therefore cause some misunderstanding for users.
How reproducible: always
Steps to Reproduce:
1. Log in to a control plane node
2. Check /var/log/etcd/etcd-health-probe.log
3. The file size is always zero
Actual results:
Expected results: remove this file from the code / don't touch this file
Additional info:
Please review the following PR: https://github.com/openshift/cluster-machine-approver/pull/218
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
Bootstrap process failed due to a coredns.yaml manifest generation issue:
Feb 04 05:14:34 yunjiang-p2-2r2b2-bootstrap bootkube.sh[11219]: I0204 05:14:34.966343 1 bootstrap.go:188] manifests/on-prem/coredns.yaml
Feb 04 05:14:34 yunjiang-p2-2r2b2-bootstrap bootkube.sh[11219]: F0204 05:14:34.966513 1 bootstrap.go:188] error rendering bootstrap manifests: failed to execute template: template: manifests/on-prem/coredns.yaml:34:32: executing "manifests/on-prem/coredns.yaml" at <onPremPlatformAPIServerInternalIPs .ControllerConfig>: error calling onPremPlatformAPIServerInternalIPs: invalid platform for API Server Internal IP
Feb 04 05:14:35 yunjiang-p2-2r2b2-bootstrap systemd[1]: bootkube.service: Main process exited, code=exited, status=255/EXCEPTION
Feb 04 05:14:35 yunjiang-p2-2r2b2-bootstrap systemd[1]: bootkube.service: Failed with result 'exit-code'.
Version-Release number of selected component (if applicable):
4.15.0-0.nightly-2024-02-03-192446 4.16.0-0.nightly-2024-02-03-221256
How reproducible:
Always
Steps to Reproduce:
1. Enable custom DNS on GCP: platform.gcp.userProvisionedDNS: Enabled and featureSet: TechPreviewNoUpgrade 2. 3.
Actual results:
coredns.yaml cannot be generated; bootstrap fails.
Expected results:
Bootstrap process succeeds.
Additional info:
Please review the following PR: https://github.com/openshift/ibm-vpc-block-csi-driver-operator/pull/96
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
This is a clone of issue OCPBUGS-41549. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-35358. The following is the description of the original issue:
—
I'm working with the GitOps operator (1.7), and when there is a high number of CRs (38,000 Application objects in this case) the related install plan gets stuck with the following error:
- lastTransitionTime: "2024-06-11T14:28:40Z" lastUpdateTime: "2024-06-11T14:29:42Z" message: 'error validating existing CRs against new CRD''s schema for "applications.argoproj.io": error listing resources in GroupVersionResource schema.GroupVersionResource{Group:"argoproj.io", Version:"v1alpha1", Resource:"applications"}: the server was unable to return a response in the time allotted, but may still be processing the request'
Even after waiting for a long time, the operator is unable to move forward, neither removing nor reinstalling its components.
In a lab, the issue was not present until we started to add load to the cluster (applications.argoproj.io); once we hit 26,000 applications, we were no longer able to upgrade or reinstall the operator.
Description of problem:
Environment file /etc/kubernetes/node.env is overwritten after a node restart. There is a typo in https://github.com/openshift/machine-config-operator/blob/master/templates/common/aws/files/usr-local-bin-aws-kubelet-nodename.yaml where the variable should be changed to NODEENV wherever NODENV is found.
Version-Release number of selected component (if applicable):
How reproducible:
Easy
Steps to Reproduce:
1. Change contents of /etc/kubernetes/node.env 2. Restart node 3. Notice changes are lost
Actual results:
Expected results:
/etc/kubernetes/node.env should not be changed after restart of a node
Additional info:
This is a clone of issue OCPBUGS-34784. The following is the description of the original issue:
—
INSIGHTOCP-1557 is a rule to check for any custom Prometheus instances that may impact the management of corresponding resources.
Resource to gather: Prometheus and Alertmanager in all namespaces
apiVersion: monitoring.coreos.com/v1 kind: Prometheus
apiVersion: monitoring.coreos.com/v1 kind: Alertmanager
Backport: OCP 4.12.z; 4.13.z; 4.14.z; 4.15.z
Additional info:
1) Get the Prometheus and Alertmanager in all namespaces
$ oc get prometheus -A
NAMESPACE NAME VERSION DESIRED READY RECONCILED AVAILABLE AGE
openshift-monitoring k8s 2.39.1 2 1 True Degraded 712d
test custom-prometheus 1 0 True False 25d
$ oc get alertmanager -A
NAMESPACE NAME VERSION DESIRED READY RECONCILED AVAILABLE AGE
openshift-monitoring main 2.39.1 2 1 True Degraded 712d
test custom-alertmanager 1 0 True False 25d
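For reference, a hedged Go sketch of how such a gatherer could list these CRs cluster-wide with the dynamic client; this is not the insights-operator's actual gatherer, and the function name and package layout are assumptions.

package gathersketch

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/rest"
)

// listMonitoringCRs lists Prometheus and Alertmanager objects in all
// namespaces, which is the data the INSIGHTOCP-1557 rule needs in order to
// spot custom instances.
func listMonitoringCRs(ctx context.Context, cfg *rest.Config) error {
	client, err := dynamic.NewForConfig(cfg)
	if err != nil {
		return err
	}
	for _, resource := range []string{"prometheuses", "alertmanagers"} {
		gvr := schema.GroupVersionResource{Group: "monitoring.coreos.com", Version: "v1", Resource: resource}
		// Listing without a namespace returns items across all namespaces.
		list, err := client.Resource(gvr).List(ctx, metav1.ListOptions{})
		if err != nil {
			return err
		}
		for _, item := range list.Items {
			fmt.Printf("%s %s/%s\n", resource, item.GetNamespace(), item.GetName())
		}
	}
	return nil
}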
This is a clone of issue OCPBUGS-1735. The following is the description of the original issue:
—
Description of problem:
When setting up a cluster on vSphere, sometimes a machine is powered off while in the "Provisioning" phase; this triggers a new machine creation, which reports the error "failed to Create machine: The name 'jima-ipi-27-d97wp-worker-7qn9b' already exists".
Version-Release number of selected component (if applicable):
4.12.0-0.ci.test-2022-09-26-235306-ci-ln-vh4qjyk-latest
How reproducible:
Sometimes; hit two times.
Steps to Reproduce:
1. Setup a vsphere cluster 2. 3.
Actual results:
Cluster installation failed, machine stuck in Provisioning status. $ oc get machine NAME PHASE TYPE REGION ZONE AGE jima-ipi-27-d97wp-master-0 Running 4h jima-ipi-27-d97wp-master-1 Running 4h jima-ipi-27-d97wp-master-2 Running 4h jima-ipi-27-d97wp-worker-7qn9b Provisioning 3h56m jima-ipi-27-d97wp-worker-dsqd2 Running 3h56m $ oc edit machine jima-ipi-27-d97wp-worker-7qn9b status: conditions: - lastTransitionTime: "2022-09-27T01:27:29Z" status: "True" type: Drainable - lastTransitionTime: "2022-09-27T01:27:29Z" message: Instance has not been created reason: InstanceNotCreated severity: Warning status: "False" type: InstanceExists - lastTransitionTime: "2022-09-27T01:27:29Z" status: "True" type: Terminable lastUpdated: "2022-09-27T01:27:29Z" phase: Provisioning providerStatus: conditions: - lastTransitionTime: "2022-09-27T01:36:09Z" message: The name 'jima-ipi-27-d97wp-worker-7qn9b' already exists. reason: MachineCreationSucceeded status: "False" type: MachineCreation taskRef: task-11363480 $ govc vm.info /SDDC-Datacenter/vm/jima-ipi-27-d97wp/jima-ipi-27-d97wp-worker-7qn9b Name: jima-ipi-27-d97wp-worker-7qn9b Path: /SDDC-Datacenter/vm/jima-ipi-27-d97wp/jima-ipi-27-d97wp-worker-7qn9b UUID: 422cb686-6585-f05a-af13-b2acac3da294 Guest name: Red Hat Enterprise Linux 8 (64-bit) Memory: 16384MB CPU: 8 vCPU(s) Power state: poweredOff Boot time: <nil> IP address: Host: 10.3.32.8 I0927 01:44:42.568599 1 session.go:91] No existing vCenter session found, creating new session I0927 01:44:42.633672 1 session.go:141] Find template by instance uuid: 9535891b-902e-410c-b9bb-e6a57aa6b25a I0927 01:44:42.641691 1 reconciler.go:270] jima-ipi-27-d97wp-worker-7qn9b: already exists, but was not powered on after clone, requeue I0927 01:44:42.641726 1 controller.go:380] jima-ipi-27-d97wp-worker-7qn9b: reconciling machine triggers idempotent create I0927 01:44:42.641732 1 actuator.go:66] jima-ipi-27-d97wp-worker-7qn9b: actuator creating machine I0927 01:44:42.659651 1 reconciler.go:935] task: task-11363480, state: error, description-id: VirtualMachine.clone I0927 01:44:42.659684 1 reconciler.go:951] jima-ipi-27-d97wp-worker-7qn9b: Updating provider status E0927 01:44:42.659696 1 actuator.go:57] jima-ipi-27-d97wp-worker-7qn9b error: jima-ipi-27-d97wp-worker-7qn9b: reconciler failed to Create machine: The name 'jima-ipi-27-d97wp-worker-7qn9b' already exists. I0927 01:44:42.659762 1 machine_scope.go:101] jima-ipi-27-d97wp-worker-7qn9b: patching machine I0927 01:44:42.660100 1 recorder.go:103] events "msg"="jima-ipi-27-d97wp-worker-7qn9b: reconciler failed to Create machine: The name 'jima-ipi-27-d97wp-worker-7qn9b' already exists." "object"={"kind":"Machine","namespace":"openshift-machine-api","name":"jima-ipi-27-d97wp-worker-7qn9b","uid":"9535891b-902e-410c-b9bb-e6a57aa6b25a","apiVersion":"machine.openshift.io/v1beta1","resourceVersion":"17614"} "reason"="FailedCreate" "type"="Warning" W0927 01:44:42.688562 1 controller.go:382] jima-ipi-27-d97wp-worker-7qn9b: failed to create machine: jima-ipi-27-d97wp-worker-7qn9b: reconciler failed to Create machine: The name 'jima-ipi-27-d97wp-worker-7qn9b' already exists. E0927 01:44:42.688651 1 controller.go:326] "msg"="Reconciler error" "error"="jima-ipi-27-d97wp-worker-7qn9b: reconciler failed to Create machine: The name 'jima-ipi-27-d97wp-worker-7qn9b' already exists." 
"controller"="machine-controller" "name"="jima-ipi-27-d97wp-worker-7qn9b" "namespace"="openshift-machine-api" "object"={"name":"jima-ipi-27-d97wp-worker-7qn9b","namespace":"openshift-machine-api"} "reconcileID"="d765f02c-bd54-4e6c-88a4-c578f16c7149" ... I0927 03:18:45.118110 1 actuator.go:66] jima-ipi-27-d97wp-worker-7qn9b: actuator creating machine E0927 03:18:45.131676 1 actuator.go:57] jima-ipi-27-d97wp-worker-7qn9b error: jima-ipi-27-d97wp-worker-7qn9b: reconciler failed to Create machine: ServerFaultCode: The object 'vim.Task:task-11363480' has already been deleted or has not been completely created I0927 03:18:45.131725 1 machine_scope.go:101] jima-ipi-27-d97wp-worker-7qn9b: patching machine I0927 03:18:45.131873 1 recorder.go:103] events "msg"="jima-ipi-27-d97wp-worker-7qn9b: reconciler failed to Create machine: ServerFaultCode: The object 'vim.Task:task-11363480' has already been deleted or has not been completely created" "object"={"kind":"Machine","namespace":"openshift-machine-api","name":"jima-ipi-27-d97wp-worker-7qn9b","uid":"9535891b-902e-410c-b9bb-e6a57aa6b25a","apiVersion":"machine.openshift.io/v1beta1","resourceVersion":"17614"} "reason"="FailedCreate" "type"="Warning" W0927 03:18:45.150393 1 controller.go:382] jima-ipi-27-d97wp-worker-7qn9b: failed to create machine: jima-ipi-27-d97wp-worker-7qn9b: reconciler failed to Create machine: ServerFaultCode: The object 'vim.Task:task-11363480' has already been deleted or has not been completely created E0927 03:18:45.150492 1 controller.go:326] "msg"="Reconciler error" "error"="jima-ipi-27-d97wp-worker-7qn9b: reconciler failed to Create machine: ServerFaultCode: The object 'vim.Task:task-11363480' has already been deleted or has not been completely created" "controller"="machine-controller" "name"="jima-ipi-27-d97wp-worker-7qn9b" "namespace"="openshift-machine-api" "object"={"name":"jima-ipi-27-d97wp-worker-7qn9b","namespace":"openshift-machine-api"} "reconcileID"="5d92bc1d-2f0d-4a0b-bb20-7f2c7a2cb5af" I0927 03:18:45.150543 1 controller.go:187] jima-ipi-27-d97wp-worker-dsqd2: reconciling Machine
Expected results:
Machine is created successfully.
Additional info:
machine-controller log: http://file.rdu.redhat.com/~zhsun/machine-controller.log
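A hedged Go sketch of the idempotent-create behavior one would expect here; this is not the actual machine-api vSphere reconciler, and the vmClient interface and vmHandle type are invented abstractions over govmomi. When the clone reports that the name already exists, the reconciler could adopt the existing VM and power it on instead of retrying the clone and failing forever.

package vspheresketch

import (
	"context"
	"fmt"
	"strings"
)

// vmClient is a hypothetical abstraction over the vSphere session, used only
// to illustrate the flow.
type vmClient interface {
	FindByName(ctx context.Context, name string) (*vmHandle, error)
	Clone(ctx context.Context, name string) error
}

type vmHandle struct {
	Name      string
	PoweredOn bool
}

func (v *vmHandle) PowerOn(ctx context.Context) error { v.PoweredOn = true; return nil }

// ensureVM makes machine creation idempotent: if a VM with this name already
// exists (as in the bug above), adopt it and power it on instead of reporting
// MachineCreation=False on every reconcile.
func ensureVM(ctx context.Context, c vmClient, name string) error {
	if vm, err := c.FindByName(ctx, name); err == nil && vm != nil {
		if !vm.PoweredOn {
			return vm.PowerOn(ctx)
		}
		return nil
	}
	if err := c.Clone(ctx, name); err != nil {
		if strings.Contains(err.Error(), "already exists") {
			// The clone raced with an earlier, partially completed attempt;
			// requeue and adopt the existing VM on the next reconcile.
			return fmt.Errorf("vm %s exists but is not ready yet, requeue: %w", name, err)
		}
		return err
	}
	return nil
}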
Description of problem:
Based on this and this component readiness data comparing success rates for those two particular tests, we are regressing ~7-10% between the current 4.15 master and 4.14.z (in other words, we made the product ~10% worse).
These jobs and their failures are all caused by increased etcd leader elections disrupting seemingly unrelated test cases across the VSphere AMD64 platform.
Since this particular platform's business significance is high, I'm setting this as "Critical" severity.
Please get in touch with me or Dean West if more teams need to be pulled into investigation and mitigation.
Version-Release number of selected component (if applicable):
4.15 / master
How reproducible:
Component Readiness Board
Actual results:
The etcd leader elections are elevated. Some jobs indicate it is due to disk i/o throughput OR network overload.
Expected results:
1. We NEED to understand what is causing this problem.
2. If we can mitigate this, we should.
3. If we cannot mitigate this, we need to document this or work with the VSphere infrastructure provider to fix this problem.
4. We optionally need a way to measure how often this happens in our fleet so we can evaluate how bad it is.
Additional info:
Description of problem:
The ca-west-1 region is missing from https://github.com/openshift/installer/blob/master/pkg/quota/aws/limits.go#L15
Version-Release number of selected component (if applicable):
4.15+
How reproducible:
always
Steps to Reproduce:
1. 2. 3.
Actual results:
Quota checking is skipped as if it was not supported
Expected results:
Additional info:
Description of problem:
~~~
F0120 03:20:42.221327 879146 driver.go:131] failed to get node "wb02.pdns-edtnabtf-arch01.nfvdev.tlabs.ca" information: Get "https://192.168.0.1:443/api/v1/nodes/wb02.pdns-edtnabtf-arch01.nfvdev.tlabs.ca": dial tcp 192.168.0.1:443: i/o timeout
~~~
Other pods on the affected node with the above config can hit the target service; however, pods that are hostNetworked appear to be failing:
$ oc get pod csi-rbdplugin-kpz7n -o yaml | grep hostNetwork
hostNetwork: true
Version-Release number of selected component (if applicable):
4.14
How reproducible:
Steps to Reproduce:
We have redeployed the cluster and have routingViaHost and ipForwarding both enabled.
We also pushed out a NODEIP_HINT configuration to all the nodes to make sure SDN is overlaid on the correct interface.
The default gateway has been moved to bond1.2039 on the 2 baremetal worker nodes.
wb01
wb02
Observe that hostNetworked pods go into CrashLoopBackOff.
Actual results:
Expected results:
Additional info:
See the first comment for data samples + must-gathers + sosreports
Description of problem:
compact agent e2e jobs are consistently failing the e2e test (when they manage to install):
[sig-node] Managed cluster should verify that nodes have no unexpected reboots [Late] [Suite:openshift/conformance/parallel]
Examining CI search I noticed that this failure is also occurring on many other jobs:
Version-Release number of selected component (if applicable):
In CI search, we can see failures in 4.15-4.17
How reproducible:
See CI search results
Steps to Reproduce:
1. 2. 3.
Actual results:
fail e2e
Expected results:
Pass
Additional info:
Description of problem:
Agent CI jobs started to fail on a pretty regular basis, especially the compact ones. Jobs time out due to either the console or authentication operators remaining in a degraded state. From the log analysis, they are not able to get a route. Both apiserver and etcd component logs report connection refused messages, possibly indicating an underlying network problem.
Version-Release number of selected component (if applicable):
4.16
How reproducible:
Pretty frequently
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem: Missing `useEffect` hook dependency warning errors in the console UI dev env
Version-Release number of selected component (if applicable): 4.15.0-0.ci.test-2024-02-28-133709-ci-ln-svyfg32-latest
How reproducible:
To reproduce:
Run `yarn lint --fix`
Description of problem:
In this PR, we started using watcher channels to wait for the job finished event from the periodic and on-demand data gathering jobs from IO.
However, as stated in this comment, part of maintaining a watcher is to re-establish it at the last received resource version whenever this channel closes.
This issue is currently causing flakiness in our test suite: an on-demand data gathering job is created, and when the job is about to finish, the watcher channel closes, so the DataGather instance associated with the job never has its insightsReport updated. Therefore the tests fail.
Version-Release number of selected component (if applicable):
How reproducible:
Sometimes. Very hard to reproduce, as it might have to do with the API server resyncing the watcher's cache.
Steps to Reproduce:
1. Create a data gathering job 2. You may see a log saying "watcher channel was closed unexpectedly" (see the check below)
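A quick check for the symptom, assuming the default Insights Operator deployment name and namespace:
```
# Look for the unexpected watcher-channel closure in the Insights Operator logs.
oc -n openshift-insights logs deployment/insights-operator --since=1h | grep -i "watcher channel was closed"
```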
Actual results:
The DataGather instance will not be updated with the insightsReport
Expected results:
When the job finishes, the archive is uploaded to ingress and the report is downloaded from the external data pipeline. This report should appear in the DataGather instance.
Additional info:
It's possible but flaky to reproduce with on-demand data gathering jobs but I've seen it happen with periodic ones as well.
On clusters with a large number of services with externalIPs or services of type LoadBalancer, the ovnkube-node initialization can take up to 50 minutes.
The problem is that after a node reboot done by the MCO, the unschedulable taint is removed from the node, so the API schedules pods to that node that get stuck in ContainerCreating, while other nodes continue to go down for reboot, making the workloads unavailable (if no PDB exists to protect the workload).
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
This is a clone of issue OCPBUGS-36301. The following is the description of the original issue:
—
Seen in a 4.16.1 CI run:
: [bz-Etcd] clusteroperator/etcd should not change condition/Available expand_less 1h28m39s { 2 unexpected clusteroperator state transitions during e2e test run. These did not match any known exceptions, so they cause this test-case to fail: Jun 27 14:17:18.966 E clusteroperator/etcd condition/Available reason/EtcdMembers_NoQuorum status/False EtcdMembersAvailable: 1 of 3 members are available, ip-10-0-71-113.us-west-1.compute.internal is unhealthy, ip-10-0-58-93.us-west-1.compute.internal is unhealthy Jun 27 14:17:18.966 - 75s E clusteroperator/etcd condition/Available reason/EtcdMembers_NoQuorum status/False EtcdMembersAvailable: 1 of 3 members are available, ip-10-0-71-113.us-west-1.compute.internal is unhealthy, ip-10-0-58-93.us-west-1.compute.internal is unhealthy
But further digging turned up no sign that quorum had had any difficulties. It seems like the difficulty was the GetMemberHealth structure, which currently allows timelines like:
That can leave 30+s gaps of nominal Healthy:false for MemberC when in fact MemberC was completely fine.
I suspect that the "was really short" Took:27.199µs got a "took too long" context deadline exceeded because GetMemberHealth has a 30s timeout per member, while many (all?) of its callers have a 30s DefaultClientTimeout. Which means by the time we get to MemberC, we've already spent our Context and we're starved of time to actually check MemberC. It may be more reliable to refactor and probe all known members in parallel, and to keep probing in the event of failures while you wait for the slowest member-probe to get back to you, because I suspect a re-probe of MemberC (or even a single probe that was granted reasonable time to complete) while we waited on MemberB would have succeeded and told us MemberC was actually fine.
Exposure is manageable, because this is self-healing, and quorum is actually ok. But still worth fixing because it spooks admins (and the origin CI test suite) if you tell them you're Available=False, and we want to save that for situations where the component is actually having trouble like quorum loss, and not burn signal-to-noise by claiming EtcdMembers_NoQuorum when it's really BriefIssuesScrapingMemberAHealthAndWeWillllTryAgainSoon.
Seen in 4.16.1, but the code is old, so likely a longstanding issue.
Luckily for customers, but unluckily for QE, network or whatever hiccups when connecting to members seem rare, so we don't trip the condition that exposes this issue often.
1. Figure out which order etcd is probing members in.
2. Stop the first or second member, in a way that makes its health probes time out ~30s.
3. Monitor the etcd ClusterOperator Available condition.
Available goes False claiming EtcdMembers_NoQuorum, as the operator starves itself of the time it needs to actually probe the third member.
Available stays True, as the etcd operator takes the full 30s to check on all members, and sees that two of them are completely happy.
This fix contains the following changes coming from updated version of kubernetes up to v1.29.9: Changelog: v1.29.9: https://github.com/kubernetes/kubernetes/blob/release-1.29/CHANGELOG/CHANGELOG-1.29.md#changelog-since-v1298
This is a clone of issue OCPBUGS-22293. The following is the description of the original issue:
—
Description of problem:
Upgrading from 4.13.5 to 4.13.17 fails at network operator upgrade
Version-Release number of selected component (if applicable):
How reproducible:
Not sure since we only had one cluster on 4.13.5.
Steps to Reproduce:
1. Have a cluster on version 4.13.5 with OVN-Kubernetes 2. Set desired update image to quay.io/openshift-release-dev/ocp-release@sha256:c1f2fa2170c02869484a4e049132128e216a363634d38abf292eef181e93b692 3. Wait until it reaches network operator
Actual results:
Error message: Error while updating operator configuration: could not apply (apps/v1, Kind=DaemonSet) openshift-ovn-kubernetes/ovnkube-master: failed to apply / update (apps/v1, Kind=DaemonSet) openshift-ovn-kubernetes/ovnkube-master: DaemonSet.apps "ovnkube-master" is invalid: [spec.template.spec.containers[1].lifecycle.preStop: Required value: must specify a handler type, spec.template.spec.containers[3].lifecycle.preStop: Required value: must specify a handler type]
Expected results:
Network operator upgrades successfully
Additional info:
Since I'm not able to attach files please gather all required debug data from https://access.redhat.com/support/cases/#/case/03645170
Please review the following PR: https://github.com/openshift/kube-rbac-proxy/pull/88
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
There's a typo in the openssl commands within the ovn-ipsec-containerized/ovn-ipsec-host daemonsets. The correct parameter is "-checkend", not "-checkedn".
Version-Release number of selected component (if applicable):
# oc get clusterversion NAME VERSION AVAILABLE PROGRESSING SINCE STATUS version 4.14.10 True False 7s Cluster version is 4.14.10
How reproducible:
Steps to Reproduce:
1. Enable IPsec encryption
# oc patch networks.operator.openshift.io cluster --type=merge -p '{"spec": {"defaultNetwork":{"ovnKubernetesConfig":{"ipsecConfig":{ }}}}}'
Actual results:
Examining the initContainer (ovn-keys) logs
# oc logs ovn-ipsec-containerized-7bcd2 -c ovn-keys
...
+ openssl x509 -noout -dates -checkedn 15770000 -in /etc/openvswitch/keys/ipsec-cert.pem
x509: Use -help for summary.
# oc get ds NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE ovn-ipsec-containerized 1 1 0 1 0 beta.kubernetes.io/os=linux 159m ovn-ipsec-host 1 1 1 1 1 beta.kubernetes.io/os=linux 159m ovnkube-node 1 1 1 1 1 beta.kubernetes.io/os=linux 3h44m
# oc get ds ovn-ipsec-containerized -o yaml | grep edn if ! openssl x509 -noout -dates -checkedn 15770000 -in $cert_pem; then # oc get ds ovn-ipsec-host -o yaml | grep edn if ! openssl x509 -noout -dates -checkedn 15770000 -in $cert_pem; then
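For reference, a corrected invocation using the intended -checkend flag (which fails if the certificate expires within the given number of seconds), against the same certificate path used by the daemonset:
```
# Correct spelling: -checkend takes a number of seconds and verifies the certificate is still valid at that point.
openssl x509 -noout -dates -checkend 15770000 -in /etc/openvswitch/keys/ipsec-cert.pem
```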
Description of problem:
Invalid CN is not bubbled up in the CR
Version-Release number of selected component (if applicable):
4.15.0-rc7
How reproducible:
always
Steps to Reproduce:
# generate a key with invalid CN openssl genrsa -out myuser4.key 2048 openssl req -new -key myuser4.key -out myuser4.csr -subj "/CN=baduser/O=system:masters" # get cert in the CSR # apply the CSR # Status remains in Accepted, but it is not Issued % oc get csr | grep 29ecg6n5bkugrh6io4his24ser3bt16n-5-customer-break-glass-csr 29ecg6n5bkugrh6io4his24ser3bt16n-5-customer-break-glass-csr 4m29s hypershift.openshift.io/ocm-integration-29ecg6n5bkugrh6io4his24ser3bt16n-ad-int1.customer-break-glass system:admin 60m Approved # No status in the CSR status: conditions: - lastTransitionTime: "2024-02-16T14:06:41Z" lastUpdateTime: "2024-02-16T14:06:41Z" message: The requisite approval resource exists. reason: ApprovalPresent status: "True" type: Approved # pki controller shows the error oc logs control-plane-pki-operator-bf6d75d5f-h95rf -n ocm-integration-29ecg6n5bkugrh6io4his24ser3bt16n-ad-int1 | grep "29ecg6n5bkugrh6io4his24ser3bt16n-5-customer-break-glass-csr" I0216 14:06:41.842414 1 event.go:298] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"ocm-integration-29ecg6n5bkugrh6io4his24ser3bt16n-ad-int1", Name:"control-plane-pki-operator", UID:"b63dbaa9-18f7-4ee6-8473-8a38bdb6f2df", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'CertificateSigningRequestApproved' "29ecg6n5bkugrh6io4his24ser3bt16n-5-customer-break-glass-csr" in is approved I0216 14:06:41.848623 1 event.go:298] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"ocm-integration-29ecg6n5bkugrh6io4his24ser3bt16n-ad-int1", Name:"control-plane-pki-operator", UID:"b63dbaa9-18f7-4ee6-8473-8a38bdb6f2df", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'CertificateSigningRequestInvalid' "29ecg6n5bkugrh6io4his24ser3bt16n-5-customer-break-glass-csr" is invalid: invalid certificate request: subject CommonName must begin with "system:customer-break-glass:"
Actual results:
Expected results:
status in the CR show failed and the error
Additional info:
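For comparison, a CSR that satisfies the CommonName constraint reported in the controller log would be generated along these lines (a sketch; the username is illustrative):
```
# Generate a key and CSR whose CommonName begins with the required "system:customer-break-glass:" prefix.
openssl genrsa -out myuser5.key 2048
openssl req -new -key myuser5.key -out myuser5.csr \
  -subj "/CN=system:customer-break-glass:myuser5/O=system:masters"
```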
Description of problem:
OAuth-Proxy breaks when it uses a Service Account as an OAuth client, as documented in https://docs.openshift.com/container-platform/4.15/authentication/using-service-accounts-as-oauth-client.html
Version-Release number of selected component (if applicable):
4.15
How reproducible:
100%
Steps to Reproduce:
1. install an OCP cluster without the ImageRegistry capability 2. deploy an oauth-proxy that uses an SA as its OAuth2 client 3. try to login to the oauth-proxy using valid credentials
Actual results:
The login fails, the oauth-server logs: 2024-02-05T13:30:56.059910994Z E0205 13:30:56.059873 1 osinserver.go:91] internal error: system:serviceaccount:my-namespace:my-sa has no tokens
Expected results:
The login succeeds
Additional info:
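A possible mitigation, assuming the root cause is that no API token secret exists for the service account on clusters installed without the ImageRegistry capability, is to create a token secret for the SA manually; a sketch with illustrative names:
```
# Create a long-lived token Secret bound to the service account used as the OAuth client.
cat << 'EOF' | oc apply -f -
apiVersion: v1
kind: Secret
metadata:
  name: my-sa-token
  namespace: my-namespace
  annotations:
    kubernetes.io/service-account.name: my-sa
type: kubernetes.io/service-account-token
EOF
```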
Please review the following PR: https://github.com/openshift/kubevirt-csi-driver/pull/37
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Please review the following PR: https://github.com/openshift/images/pull/154
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
This is a clone of issue OCPBUGS-34295. The following is the description of the original issue:
—
Description of problem:
Gathering bootstrap log bundles has been failing in CI with:
level=error msg=Attempted to gather debug logs after installation failure: must provide bootstrap host address
Version-Release number of selected component (if applicable):
How reproducible:
Not reliably; this is a race condition when serializing the machine manifests to disk.
Steps to Reproduce:
Can't reproduce locally; need to verify in CI.
Actual results:
can't pull bootstrap log bundle
Expected results:
grabs bootstrap log bundle
Additional info:
Description of problem:
The cloud provider feature of NTO doesn't work as expected
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. Create a cloud-provider profile such as: apiVersion: tuned.openshift.io/v1 kind: Tuned metadata: name: provider-aws namespace: openshift-cluster-node-tuning-operator spec: profile: - data: | [main] summary=GCE Cloud provider-specific profile # Your tuning for GCE Cloud provider goes here. [sysctl] vm.admin_reserve_kbytes=16386 name: provider-aws 2. 3.
Actual results:
the value of vm.admin_reserve_kbytes still using default value
Expected results:
the value of vm.admin_reserve_kbytes should change to 16386
Additional info:
Description of problem:
On a 4.14.5 fast-channel cluster in ARO, after the upgrade, when the customer tried to add a new node the MachineConfig was not applied and the node never joined the pool. This happens for every node and can only be remediated by SRE, not the customer.
Version-Release number of selected component (if applicable):
4.14.5 -candidate
How reproducible:
Every time a node is added to the cluster at this version.
Steps to Reproduce:
1. Install an ARO cluster 2. Upgrade it to 4.14 along fast channel 3. Add a node
Actual results:
message: >- could not Create/Update MachineConfig: Operation cannot be fulfilled on machineconfigs.machineconfiguration.openshift.io "99-worker-generated-kubelet": the object has been modified; please apply your changes to the latest version and try again status: 'False' type: Failure - lastTransitionTime: '2023-11-29T17:44:37Z' ~~~
Expected results:
Node is created and configured correctly.
Additional info:
MissingStaticPodControllerDegraded: static pod lifecycle failure - static pod: "kube-apiserver" in namespace: "openshift-kube-apiserver" for revision: 15 on node: "aro-cluster-REDACTED-master-0" didn't show up, waited: 4m45s
This is a clone of issue OCPBUGS-34005. The following is the description of the original issue:
—
Description of problem:
Intermittent error during the installation process when enabling Cluster API (CAPI) in the install-config for OCP 4.16 tech preview IPI installation on top of OSP. The error occurs during the post-machine creation hook, specifically related to Floating IP association.
Version-Release number of selected component (if applicable):
OCP: 4.16.0-0.nightly-2024-05-16-092402 TP enabled on top of OSP: RHOS-17.1-RHEL-9-20240123.n.1
How reproducible:
The issue occurs intermittently, sometimes the installation succeeds, and other times it fails.
Steps to Reproduce:
1. Install OSP 2. Initiate OCP installation with TP and CAPI enabled 3. Observe the installation logs of the failed installation.
Actual results:
The installation fails intermittently with the following error message: ... 2024-05-17 23:37:51.590 | level=debug msg=E0517 23:37:29.833599 266622 controller.go:329] "Reconciler error" err="failed to create cluster accessor: error creating http client and mapper for remote cluster \"openshift-cluster-api-guests/ostest-4qrz2\": error creating client for remote cluster \"openshift-cluster-api-guests/ostest-4qrz2\": error getting rest mapping: failed to get API group resources: unable to retrieve the complete list of server APIs: v1: Get \"https://api.ostest.shiftstack.com:6443/api/v1?timeout=10s\": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)" controller="machine" controllerGroup="cluster.x-k8s.io" controllerKind="Machine" Machine="openshift-cluster-api-guests/ostest-4qrz2-master-0" namespace="openshift-cluster-api-guests" name="ostest-4qrz2-master-0" reconcileID="985ba50c-2a1d-41f6-b494-f5af7dca2e7b" 2024-05-17 23:37:51.597 | level=debug msg=E0517 23:37:39.838706 266622 controller.go:329] "Reconciler error" err="failed to create cluster accessor: error creating http client and mapper for remote cluster \"openshift-cluster-api-guests/ostest-4qrz2\": error creating client for remote cluster \"openshift-cluster-api-guests/ostest-4qrz2\": error getting rest mapping: failed to get API group resources: unable to retrieve the complete list of server APIs: v1: Get \"https://api.ostest.shiftstack.com:6443/api/v1?timeout=10s\": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)" controller="machine" controllerGroup="cluster.x-k8s.io" controllerKind="Machine" Machine="openshift-cluster-api-guests/ostest-4qrz2-master-0" namespace="openshift-cluster-api-guests" name="ostest-4qrz2-master-0" reconcileID="dfe5f138-ac8e-4790-948f-72d6c8631f21" 2024-05-17 23:37:51.603 | level=debug msg=Machine ostest-4qrz2-master-0 is ready. Phase: Provisioned 2024-05-17 23:37:51.610 | level=debug msg=Machine ostest-4qrz2-master-1 is ready. Phase: Provisioned 2024-05-17 23:37:51.615 | level=debug msg=Machine ostest-4qrz2-master-2 is ready. Phase: Provisioned 2024-05-17 23:37:51.619 | level=info msg=Control-plane machines are ready 2024-05-17 23:37:51.623 | level=error msg=failed to fetch Cluster: failed to generate asset "Cluster": failed to create cluster: failed during post-machine creation hook: Resource not found: [POST https://10.46.44.159:13696/v2.0/floatingips], error message: {"NeutronError": {"type": "ExternalGatewayForFloatingIPNotFound", "message": "External network 654792e9-dead-485a-beec-f3c428ef71da is not reachable from subnet d9829374-f0de-4a41-a1c0-a2acdd4841da. Therefore, cannot associate Port 01c518a9-5d5f-42d8-a090-6e3151e8af3f with a Floating IP.", "detail": ""}} 2024-05-17 23:37:51.629 | level=info msg=Shutting down local Cluster API control plane... 2024-05-17 23:37:51.637 | level=info msg=Stopped controller: Cluster API 2024-05-17 23:37:51.643 | level=warning msg=process cluster-api-provider-openstack exited with error: signal: killed 2024-05-17 23:37:51.653 | level=info msg=Stopped controller: openstack infrastructure provider 2024-05-17 23:37:51.659 | level=info msg=Local Cluster API system has completed operations
Expected results:
The installation should complete successfully
Additional info: CAPI is enabled by adding the following to the install-config:
featureSet: 'CustomNoUpgrade' featureGates: ['ClusterAPIInstall=true']
Description of problem:
Running yarn or yarn install on the latest master branch of Console fails on macOS
$ cd /path/to/console/frontend $ yarn install
https://github.com/openshift/console/pull/13706#issuecomment-2051682156
$ ./scripts/check-patternfly-modules.sh && yarn prepare-husky && yarn generate Checking \e[0;33myarn.lock\e[0m file for PatternFly module resolutions grep: invalid option -- P usage: grep [-abcdDEFGHhIiJLlMmnOopqRSsUVvwXxZz] [-A num] [-B num] [-C[num]] [-e pattern] [-f file] [--binary-files=value] [--color=when] [--context[=num]] [--directories=action] [--label] [--line-buffered] [--null] [pattern] [file ...]
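The error comes from BSD grep on macOS, which does not support the -P (PCRE) flag used by check-patternfly-modules.sh. A hedged sketch of a portable guard the script could adopt, assuming GNU grep from Homebrew as the fallback (the pattern shown is a placeholder, not the script's real one):
```
# Prefer a grep that supports -P; on macOS, Homebrew's GNU grep is installed as 'ggrep'.
if echo | grep -P '' >/dev/null 2>&1; then
  GREP=grep
elif command -v ggrep >/dev/null 2>&1; then
  GREP=ggrep
else
  echo "GNU grep with -P support is required; try 'brew install grep'" >&2
  exit 1
fi
"$GREP" -P 'placeholder-resolution-pattern' yarn.lock
```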
This is a clone of issue OCPBUGS-35300. The following is the description of the original issue:
—
Description of problem:
ARO clusters fail to install with disconnected networking. We see master nodes hang during boot on machine-config-daemon-pull.service. Logs from the service indicate it cannot reach the public IP of the image registry. In ARO, image registries need to go via a proxy. Dnsmasq is used to inject proxy DNS answers, but machine-config-daemon-pull.service starts before ARO's dnsmasq.service.
Version-Release number of selected component (if applicable):
4.14.16
How reproducible:
Always
Steps to Reproduce:
For Fresh Install: 1. Create the required ARO vnet and subnets 2. Attach a route table to the subnets with a blackhole route 0.0.0.0/0 3. Create 4.14 ARO cluster with --apiserver-visibility=Private --ingress-visibility=Private --outbound-type=UserDefinedRouting [OR] Post Upgrade to 4.14: 1. Create a ARO 4.13 UDR. 2. ClusterUpgrade the cluster 4.13-> 4.14 , upgrade was successful 3. Create a new node (scale up), we run into the same issue.
Actual results:
For Fresh Install of 4.14: ERROR: (InternalServerError) Deployment failed. [OR] Post Upgrade to 4.14: Node doesn't come into a Ready State and Machine is stuck in Provisioned status.
Expected results:
Succeeded
Additional info:
We see in the node logs that machine-config-daemon-pull.service is unable to reach the image registry. ARO's dnsmasq was not yet started.
Previously, systemd ordering was set for ovs-configuration.service to start after (ARO's) dnsmasq.service. Perhaps that ordering should also have been applied to machine-config-daemon-pull.service.
See https://issues.redhat.com/browse/OCPBUGS-25406.
This is a clone of issue OCPBUGS-35215. The following is the description of the original issue:
—
Description of problem:
[0]
$ omc -n openshift-cluster-storage-operator logs vsphere-problem-detector-operator-78cbc7fdbb-2g9mx | grep -i -e datastore.go -e E0508 2024-05-08T07:44:05.842165300Z I0508 07:44:05.839356 1 datastore.go:329] checking datastore ds:///vmfs/volumes/vsan:526390016b19d2b5-21ae3fd76fa61150/ for permissions 2024-05-08T07:44:05.842165300Z I0508 07:44:05.839504 1 datastore.go:125] CheckStorageClasses: thin-csi: storage policy openshift-storage-policy-tc01-rpdd7: unable to find datastore with URL ds:///vmfs/volumes/vsan:526390016b19d2b5-21ae3fd76fa61150/ 2024-05-08T07:44:05.842165300Z I0508 07:44:05.839522 1 datastore.go:142] CheckStorageClasses checked 7 storage classes, 1 problems found 2024-05-08T07:44:05.848251057Z E0508 07:44:05.848212 1 operator.go:204] failed to run checks: StorageClass thin-csi: storage policy openshift-storage-policy-tc01-rpdd7: unable to find datastore with URL ds:///vmfs/volumes/vsan:526390016b19d2b5-21ae3fd76fa61150/ [...]
[1] https://github.com/openshift/vsphere-problem-detector/compare/release-4.13...release-4.14
[2] https://github.com/openshift/vsphere-problem-detector/blame/release-4.14/pkg/check/datastore.go#L328-L344
[3] https://github.com/openshift/vsphere-problem-detector/pull/119
[4] https://issues.redhat.com/browse/OCPBUGS-28879
Description of problem:
The signing test assumes a RHEL 8 base image with the corresponding selection of repositories. It should automatically do the right thing.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
When baselineCapabilitySet is set to None, we still see an SA named `deployer-controller` in the cluster.
Steps to Reproduce:
=================
1. Install 4.15 cluster with baselineCapabilitySet to None
2. Run command `oc get sa -A | grep deployer`
Actual Results:
================
[knarra@knarra openshift-tests-private]$ oc get sa -A | grep deployer
openshift-infra deployer-controller 0 63m
Expected Results:
==================
No SA related to deployer should be returned
When rolling back from 4.16 to 4.15, roll back the changes made to the cluster state so that the 4.15 version of the managed image pull secret generation can take over again.
This is a clone of issue OCPBUGS-30955. The following is the description of the original issue:
—
Description of problem:
Apply an NNCP to configure DNS, then edit the NNCP to update the nameserver; /etc/resolv.conf is not updated.
Version-Release number of selected component (if applicable):
OCP version: 4.16.0-0.nightly-2024-03-13-061822 knmstate operator version: kubernetes-nmstate-operator.4.16.0-202403111814
How reproducible:
always
Steps to Reproduce:
1. install knmstate operator 2. apply below nncp to configure dns on one of the node --- apiVersion: nmstate.io/v1 kind: NodeNetworkConfigurationPolicy metadata: name: dns-staticip-4 spec: nodeSelector: kubernetes.io/hostname: qiowang-031510-k4cjs-worker-0-rw4nt desiredState: dns-resolver: config: search: - example.org server: - 192.168.221.146 - 8.8.9.9 interfaces: - name: dummy44 type: dummy state: up ipv4: address: - ip: 192.0.2.251 prefix-length: 24 dhcp: false enabled: true auto-dns: false % oc apply -f dns-staticip-noroute.yaml nodenetworkconfigurationpolicy.nmstate.io/dns-staticip-4 created % oc get nncp NAME STATUS REASON dns-staticip-4 Available SuccessfullyConfigured % oc get nnce NAME STATUS STATUS AGE REASON qiowang-031510-k4cjs-worker-0-rw4nt.dns-staticip-4 Available 5s SuccessfullyConfigured 3. check dns on the node, dns configured correctly sh-5.1# cat /etc/resolv.conf # Generated by KNI resolv prepender NM dispatcher script search qiowang-031510.qe.devcluster.openshift.com example.org nameserver 192.168.221.146 nameserver 192.168.221.146 nameserver 8.8.9.9 # nameserver 192.168.221.1 sh-5.1# sh-5.1# cat /var/run/NetworkManager/resolv.conf # Generated by NetworkManager search example.org nameserver 192.168.221.146 nameserver 8.8.9.9 nameserver 192.168.221.1 sh-5.1# sh-5.1# nmcli | grep 'DNS configuration' -A 10 DNS configuration: servers: 192.168.221.146 8.8.9.9 domains: example.org interface: dummy44 ... ... 4. edit nncp, update nameserver, save the modification --- spec: desiredState: dns-resolver: config: search: - example.org server: - 192.168.221.146 - 8.8.8.8 <---- update from 8.8.9.9 to 8.8.8.8 interfaces: - ipv4: address: - ip: 192.0.2.251 prefix-length: 24 auto-dns: false dhcp: false enabled: true name: dummy44 state: up type: dummy nodeSelector: kubernetes.io/hostname: qiowang-031510-k4cjs-worker-0-rw4nt % oc edit nncp dns-staticip-4 nodenetworkconfigurationpolicy.nmstate.io/dns-staticip-4 edited % oc get nncp NAME STATUS REASON dns-staticip-4 Available SuccessfullyConfigured % oc get nnce NAME STATUS STATUS AGE REASON qiowang-031510-k4cjs-worker-0-rw4nt.dns-staticip-4 Available 8s SuccessfullyConfigured 5. check dns on the node again
Actual results:
the dns nameserver in file /etc/resolv.conf is not updated after nncp updated, file /var/run/NetworkManager/resolv.conf updated correctly: sh-5.1# cat /etc/resolv.conf # Generated by KNI resolv prepender NM dispatcher script search qiowang-031510.qe.devcluster.openshift.com example.org nameserver 192.168.221.146 nameserver 192.168.221.146 nameserver 8.8.9.9 <---- it is not updated # nameserver 192.168.221.1 sh-5.1# sh-5.1# cat /var/run/NetworkManager/resolv.conf # Generated by NetworkManager search example.org nameserver 192.168.221.146 nameserver 8.8.8.8 <---- updated correctly nameserver 192.168.221.1 sh-5.1# sh-5.1# nmcli | grep 'DNS configuration' -A 10 DNS configuration: servers: 192.168.221.146 8.8.8.8 domains: example.org interface: dummy44 ... ...
Expected results:
the dns nameserver in file /etc/resolv.conf can be updated accordingly
Additional info:
The customer's cloud-credential operator generates millions of the messages below per day in their GCP cluster.
They want to reduce or stop these logs, as they are consuming significant disk space. Also, their cloud-credential operator runs in Manual mode (see the check after the log excerpt below).
time="2024-06-21T08:37:42Z" level=warning msg="read-only creds not found, using root creds client" actuator=gcp cr=openshift-cloud-credential-operator/openshift-gcp-ccm secret=openshift-cloud-credential-operator/cloud-credential-operator-gcp-ro-creds time="2024-06-21T08:37:42Z" level=error msg="error creating GCP client" error="Secret \"gcp-credentials\" not found" time="2024-06-21T08:37:42Z" level=error msg="error determining whether a credentials update is needed" actuator=gcp cr=openshift-cloud-credential-operator/openshift-gcp-ccm error="unable to check whether credentialsRequest needs update" time="2024-06-21T08:37:42Z" level=error msg="error syncing credentials: error determining whether a credentials update is needed" controller=credreq cr=openshift-cloud-credential-operator/openshift-gcp-ccm secret=openshift-cloud-controller-manager/gcp-ccm-cloud-credentials time="2024-06-21T08:37:42Z" level=error msg="errored with condition: CredentialsProvisionFailure" controller=credreq cr=openshift-cloud-credential-operator/openshift-gcp-ccm secret=openshift-cloud-controller-manager/gcp-ccm-cloud-credentials time="2024-06-21T08:37:42Z" level=info msg="reconciling clusteroperator status" time="2024-06-21T08:37:42Z" level=info msg="operator detects timed access token enabled cluster (STS, Workload Identity, etc.)" controller=credreq cr=openshift-cloud-credential-operator/openshift-gcp-pd-csi-driver-operator time="2024-06-21T08:37:42Z" level=info msg="syncing credentials request" controller=credreq cr=openshift-cloud-credential-operator/openshift-gcp-pd-csi-driver-operator time="2024-06-21T08:37:42Z" level=warning msg="read-only creds not found, using root creds client" actuator=gcp cr=openshift-cloud-credential-operator/openshift-gcp-pd-csi-driver-operator secret=openshift-cloud-credential-operator/cloud-credential-operator-gcp-ro-creds
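To confirm the operator is indeed in Manual mode, as mentioned above, a quick check against the cluster-scoped CloudCredential resource:
```
# Print the configured credentials mode (expected to show "Manual" for this customer).
oc get cloudcredential cluster -o jsonpath='{.spec.credentialsMode}{"\n"}'
```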
Description of problem:
The default channel of 4.15, 4.16 clusters is stable-4.14.
Version-Release number of selected component (if applicable):
4.16.0-0.nightly-2024-01-03-193825
How reproducible:
Always
Steps to Reproduce:
1. Install a 4.16 cluster 2. Check default channel # oc adm upgrade warning: Cannot display available updates: Reason: VersionNotFound Message: Unable to retrieve available updates: currently reconciling cluster version 4.16.0-0.nightly-2024-01-03-193825 not found in the "stable-4.14" channel Cluster version is 4.16.0-0.nightly-2024-01-03-193825 Upgradeable=False Reason: MissingUpgradeableAnnotation Message: Cluster operator cloud-credential should not be upgraded between minor versions: Upgradeable annotation cloudcredential.openshift.io/upgradeable-to on cloudcredential.operator.openshift.io/cluster object needs updating before upgrade. See Manually Creating IAM documentation for instructions on preparing a cluster for upgrade. Upstream is unset, so the cluster will use an appropriate default. Channel: stable-4.14 3.
Actual results:
Default channel is stable-4.14 in a 4.16 cluster
Expected results:
Default channel should be stable-4.16 in a 4.16 cluster
Additional info:
4.15 cluster has the issue as well.
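Until the default is corrected, the channel can be set manually; a sketch, assuming a recent oc client and that the target channel exists for the installed minor version:
```
# Point the cluster at the channel matching its own minor version.
oc adm upgrade channel stable-4.16
```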
Observed during testing of candidate-4.15 image as of 2024-02-08.
This is an incomplete report as I haven't verified the reproducer yet or attempted to get a must-gather. I have observed this multiple times now, so I am confident it's a thing. I can't be confident that the procedure described here reliably reproduces it, or that all the described steps are required.
I have been using MCO to apply machine config to masters. This involves a rolling reboot of all masters.
During a rolling reboot I applied an update to CPMS. I observed the following sequence of events:
At this point there were only 2 nodes in the cluster:
and machines provisioning:
Hosted cluster time to provision is outside the SLO of 99% of cluster provisions completing in less than 360s.
Description of problem:
When we create a new HostedCluster with HyperShift, the OLM pods on the management cluster cannot be created correctly. Regardless of whether multi-arch or amd64 images are used, the OLM pods complain: exec /bin/opm: exec format error. All other pods are running correctly. The nodes on the management cluster are amd64.
Version-Release number of selected component (if applicable):
4.15.z
How reproducible:
Trigger a rehearsal of this example PR: https://github.com/openshift/release/pull/51141
Steps to Reproduce:
1. Trigger the rehearsal on the PR above: /pj-rehearse periodic-ci-opendatahub-io-ai-edge-main-test-ai-edge-periodic 2. Locate the cluster name in the log of the Pod test-ai-edge-periodic-hypershift-hostedcluster-create-hostedcluster 3. Log in to https://console-openshift-console.apps.hosted-mgmt.ci.devcluster.openshift.com/ 4. Enter the namespace for the ephemeral cluster created by the rehearsal 5. Check the Pods, looking for marketplace-related pods, like certified-operators-catalog-58f7bd7467-4l2s2
Actual results:
The Pods are Running
Expected results:
The Pods are either CrashLoop or ErrPullImage
Additional info:
Egress IP doesn't work in a multihomed VRF setup; packets cannot be delivered to the next hop for routing.
Topology description
SNO with following configuration:
Interface 1 - Machine network
Interface 2 - VRF with IP and Default Network.
Interface 3 - Interface in Main Routing table with static route
Configuration:
--- apiVersion: nmstate.io/v1 kind: NodeNetworkConfigurationPolicy metadata: name: vrf-1082-with-ip-iface-left-transport annotations: description: Create VLAN, IP Interface and VRF on Transport node LEFT spec: nodeSelector: transport/node: "left" desiredState: interfaces: - ipv4: address: - ip: 10.10.82.2 prefix-length: 24 enabled: true name: enp5s0f0.1082 state: up type: vlan vlan: base-iface: enp5s0f0 id: 1082 - name: vrf1082 state: up type: vrf vrf: port: - enp5s0f0.1082 route-table-id: 1082 route-rules: config: - ip-to: 172.30.0.0/16 priority: 998 route-table: 254 - ip-to: 10.128.0.0/14 priority: 998 route-table: 254 - ip-to: 169.254.169.0/29 priority: 998 route-table: 254 routes: config: - destination: 0.0.0.0/0 metric: 150 next-hop-address: 10.10.82.1 next-hop-interface: enp5s0f0.1082 table-id: 1082
The above creates an IP interface on the node
### List of VRFs [core@pool2-controller1 ~]$ ip l show vrf1082 6613: vrf1082: <NOARP,MASTER,UP,LOWER_UP> mtu 65575 qdisc noqueue state UP mode DEFAULT group default qlen 1000 link/ether 72:75:e4:f8:b4:7b brd ff:ff:ff:ff:ff:ff [core@pool2-controller1 ~]$ ip vrf list Name Table ----------------------- vrf1082 1082 ### Default routing table [core@pool2-controller1 ~]$ ip r default via 10.1.196.254 dev br-ex proto static metric 48 10.1.196.0/24 dev br-ex proto kernel scope link src 10.1.196.21 metric 48 10.128.0.0/14 via 10.131.0.1 dev ovn-k8s-mp0 10.131.0.0/23 dev ovn-k8s-mp0 proto kernel scope link src 10.131.0.2 169.254.169.0/29 dev br-ex proto kernel scope link src 169.254.169.2 169.254.169.1 dev br-ex src 10.1.196.21 169.254.169.3 via 10.131.0.1 dev ovn-k8s-mp0 172.30.0.0/16 via 169.254.169.4 dev br-ex src 169.254.169.2 mtu 1400 ### VRF Routing table 1082 [core@pool2-controller1 ~]$ ip r show table 1082 default via 10.10.82.1 dev enp5s0f0.1082 proto static metric 150 10.10.82.0/24 dev enp5s0f0.1082 proto kernel scope link src 10.10.82.2 metric 400 local 10.10.82.2 dev enp5s0f0.1082 proto kernel scope host src 10.10.82.2 local 10.10.82.110 dev enp5s0f0.1082 proto kernel scope host src 10.10.82.110 broadcast 10.10.82.255 dev enp5s0f0.1082 proto kernel scope link src 10.10.82.2
Deploy Application
---
# Create Namespace
apiVersion: v1
kind: Namespace
metadata:
name: egressip-test
labels:
egress: vrf1082
---
# Create EgressIP for the namespace
apiVersion: k8s.ovn.org/v1
kind: EgressIP
metadata:
name: egressip-vrf-1082
spec:
egressIPs:
- 100.10.82.110
namespaceSelector:
matchLabels:
egress: vrf1082
---
#Deploy APP
apiVersion: apps/v1
kind: Deployment
metadata:
name: server
namespace: egressip-test
spec:
selector:
matchLabels:
app: server
template:
metadata:
labels:
app: server
spec:
containers:
- name: server
image: quay.io/mancubus77/podman-banner
ports:
- name: http
containerPort: 8080
volumeMounts:
- name: npm-empty-dir
mountPath: /.npm
volumes:
- name: npm-empty-dir
emptyDir: {}
OCP Behaviour
With the configuration above, OVN-K behaves as it is supposed to:
# IPTables egress created [core@pool2-controller1 ~]$ sudo iptables -nvL OVN-KUBE-EGRESS-IP-MULTI-NIC -t nat Chain OVN-KUBE-EGRESS-IP-MULTI-NIC (1 references) pkts bytes target prot opt in out source destination 0 0 SNAT 0 -- * enp5s0f0.1082 10.130.0.45 0.0.0.0/0 to:10.10.82.110 0 0 SNAT 0 -- * enp5s0f0.1082 10.131.0.15 0.0.0.0/0 to:10.10.82.110 [ # IP Rule created [core@pool2-controller1 ~]$ ip rule | grep 6000 6000: from 10.131.0.15 lookup 7614 6000: from 10.130.0.45 lookup 7614
Expected behavior
Actual Behavior
## Command from pod ~ $ curl 1.1.1.1 #### ---=== PACKET DUMP ===--- # Packet leaving Pod's OVN Port 10:48:23.615730 0343c50016330fb P IP 10.131.0.15.57974 > 1.1.1.1.80: Flags [S], seq 2114359519, win 32640, options [mss 1360,sackOK,TS val 2792922673 ecr 0,nop,wscale 7], length 0 # Packet leaving OVN-Domain 10:48:23.615858 ovn-k8s-mp0 In IP 10.131.0.15.57974 > 1.1.1.1.80: Flags [S], seq 2114359519, win 32640, options [mss 1360,sackOK,TS val 2792922673 ecr 0,nop,wscale 7], length 0 # Node tries to resolve Destination IP via ARP (on vlan Interface) 10:48:23.615903 enp5s0f0.1082 Out ARP, Request who-has 1.1.1.1 tell 10.10.82.2, length 28
Root cause
According to the OVN-K source code, when an EgressIP node is added, the controller searches for routes associated with a given interface based on its ifindex. As the VRF has a routing table ID different from the main one (default, 254), the OVN-K controller doesn't know about any routes associated with the interface and creates the following rule per pod:
# IP RULE for 2 pods in the namespace [core@pool2-controller1 ~]$ ip rule 6000: from 10.131.0.15 lookup 7614 6000: from 10.130.0.45 lookup 7614 # Routing table 7614 [core@pool2-controller1 ~]$ ip route show table 7614 default dev enp5s0f0.1082
The entry above says that all traffic on this interface is directly attached (P2P); therefore, the Linux routing engine sends an ARP request in an attempt to find the MAC address of the destination (1.1.1.1 in this example).
Hack
To make it work, the default route (or the associated static route) from the VRF needs to be added to that per-pod routing table.
# Add proper route [core@pool2-controller1 ~]$ sudo ip route add default via 10.10.82.1 dev enp5s0f0.1082 table 7614 # Delete default route [core@pool2-controller1 ~]$ sudo ip route del default dev enp5s0f0.1082 table 7614 # Ensure route installed [core@pool2-controller1 ~]$ ip route show table 7614 default via 10.10.82.1 dev enp5s0f0.1082 default dev enp5s0f0.1082 metric 10
New behaviour
# Packet leaving Pod's OVN Port 11:01:25.915965 0343c50016330fb P IP 10.131.0.15.35796 > 1.1.1.1.80: Flags [S], seq 1447686540, win 32640, options [mss 1360,sackOK,TS val 2793704974 ecr 0,nop,wscale 7], length 0 # Packet leaving OVN-Domain 11:01:25.917868 ovn-k8s-mp0 In IP 10.131.0.15.35796 > 1.1.1.1.80: Flags [S], seq 1447686540, win 32640, options [mss 1360,sackOK,TS val 2793704974 ecr 0,nop,wscale 7], length 0 # Packet addresses to Default GW toward to 10.10.82.1 Router 11:04:11.136937 enp5s0f0.1082 Out ifindex 6614 b4:96:91:25:93:20 > b4:96:91:1d:7f:f0, ethertype IPv4 (0x0800), length 74: 10.10.82.110.kitim > 1.1.1.1.http: Flags [S], seq 2398495404, win 32640, options [mss 1360,sackOK,TS val 2794031032 ecr 0,nop,wscale 7], length 0 # Validate MAC [core@pool2-controller1 ~]$ arp -an | grep f0 ? (10.10.82.1) at b4:96:91:1d:7f:f0 [ether] on enp5s0f0.1082
Document with more details: https://docs.google.com/document/d/1ZLIqWjs85_zBZ9J92L63zwbds66gMAnLhShtlPFH9Ro/edit
==== This Jira covers only haproxy component ====
Description of problem:
Pods running in the namespace openshift-vsphere-infra are very verbose, printing as INFO messages that should be DEBUG. This excess of verbosity has an impact on CRI-O, on the node, and also on the Logging system. For instance, with 71 nodes, the number of logs coming from this namespace in 1 month was 450,000,000, meaning 1TB of logs written to disk on the node by CRI-O, read by the Red Hat log collector, and stored in the Log Store. Besides the impact on performance, it has a financial impact for the storage needed. Examples of logs that fit DEBUG better than INFO:
```
/// For keepalived, 4 messages are printed per node every 10 seconds; with 71 nodes this means 284 log entries every 10 seconds, about 1704 log entries per minute per keepalived pod
$ oc logs keepalived-master.example-0 -c keepalived-monitor |grep master.example-0|grep 2024-02-15T08:20:21 |wc -l
$ oc logs keepalived-master-example-0 -c keepalived-monitor |grep worker-example-0|grep 2024-02-15T08:20:21
2024-02-15T08:20:21.671390814Z time="2024-02-15T08:20:21Z" level=info msg="Searching for Node IP of worker-example-0. Using 'x.x.x.x/24' as machine network. Filtering out VIPs '[x.x.x.x x.x.x.x]'."
2024-02-15T08:20:21.671390814Z time="2024-02-15T08:20:21Z" level=info msg="For node worker-example-0 selected peer address x.x.x.x using NodeInternalIP"
2024-02-15T08:20:21.733399279Z time="2024-02-15T08:20:21Z" level=info msg="Searching for Node IP of worker-example-0. Using 'x.x.x.x' as machine network. Filtering out VIPs '[x.x.x.x x.x.x.x]'."
2024-02-15T08:20:21.733421398Z time="2024-02-15T08:20:21Z" level=info msg="For node worker-example-0 selected peer address x.x.x.x using NodeInternalIP"
/// For haproxy, 2 log lines are printed every 6 seconds for each master, which means 6 messages in the same 6-second window and about 60 messages/minute per pod
$ oc logs haproxy-master-0-example -c haproxy-monitor
...
2024-02-15T08:20:00.517159455Z time="2024-02-15T08:20:00Z" level=info msg="Searching for Node IP of master-example-0. Using 'x.x.x.x/24' as machine network. Filtering out VIPs '[x.x.x.x]'."
2024-02-15T08:20:00.517159455Z time="2024-02-15T08:20:00Z" level=info msg="For node master-example-0 selected peer address x.x.x.x using NodeInternalIP"
```
Version-Release number of selected component (if applicable):
OpenShift 4.14 VSphere IPI installation
How reproducible:
Always
Steps to Reproduce:
1. Install OpenShift 4.14 Vsphere IPI environment 2. Review the logs of the haproxy pods and keealived pods running in the namespace `openshift-vsphere-infra`
Actual results:
The haproxy-* and keepalived-* pods are very verbose, printing as INFO messages that should be DEBUG. Some of the messages are shown in the Description of problem in this bug.
Expected results:
Only relevant messages are printed as INFO, helping to reduce the verbosity of the pods running in the namespace `openshift-vsphere-infra`.
Additional info:
This is a clone of issue OCPBUGS-32476. The following is the description of the original issue:
—
Description of problem:
After installing the Pipelines Operator on a local cluster (OpenShift Local), the Pipelines features were shown in the Console.
But when selecting the Build option "Pipelines" a warning was shown:
The pipeline template for Dockerfiles is not available at this time.
Anyway, it was possible to push the Create button and create a Deployment. But because no build process was created, it couldn't start successfully.
About 20 minutes after the Pipelines Operator said it was successfully installed, the pipeline templates appeared in the openshift-pipelines namespace, and I could create a valid Deployment.
Version-Release number of selected component (if applicable):
How reproducible:
Sometimes, maybe depending on the internet connection speed.
Steps to Reproduce:
Actual results:
Expected results:
Additional info:
Component Readiness has found a potential regression in [sig-arch] events should not repeat pathologically for ns/openshift-multus.
Probability of significant regression: 99.96%
Sample (being evaluated) Release: 4.16
Start Time: 2024-02-19T00:00:00Z
End Time: 2024-03-04T23:59:59Z
Success Rate: 53.33%
Successes: 8
Failures: 7
Flakes: 0
Base (historical) Release: 4.15
Start Time: 2024-02-19T00:00:00Z
End Time: 2024-03-04T23:59:59Z
Success Rate: 100.00%
Successes: 24
Failures: 0
Flakes: 0
Description of problem:
On the customer feedback modal, there are 3 links for the user to give feedback to Red Hat; the third link lacks a title.
Version-Release number of selected component (if applicable):
4.15.0-0.nightly-2023-12-21-155123
How reproducible:
Always
Steps to Reproduce:
1.Login admin console. Click on "?"->"Share Feedback", check the links on the modal 2. 3.
Actual results:
1. The third link lacks a link title (the link for "Learn about opportunities to ……").
Expected results:
1. There is a link title "Inform the direction of Red Hat" in 4.14; it should also exist for 4.15.
Additional info:
screenshot for 4.14 page: https://drive.google.com/file/d/19AnPlE0h9WwvIjxV0gLuf5x27jLN7TLS/view?usp=drive_link screenshot for 4.15 page: https://drive.google.com/file/d/19MRjzNGRWfYnK-zcoMozh7Z7eaDDG2L-/view?usp=drive_link
This is a clone of issue OCPBUGS-33750. The following is the description of the original issue:
—
Description of problem:
Sometimes a DNS name configured in an EgressFirewall is not resolved.
Version-Release number of selected component (if applicable):
Using the build produced by openshift/cluster-network-operator#2131
How reproducible:
Steps to Reproduce:
% for i in {1..7};do oc create ns test$i;oc create -f data/egressfirewall/eg_policy_wildcard.yaml -n test$i; oc create -f data/list-for-pod.json -n test$i;sleep 1;done namespace/test1 created egressfirewall.k8s.ovn.org/default created replicationcontroller/test-rc created service/test-service created namespace/test2 created egressfirewall.k8s.ovn.org/default created replicationcontroller/test-rc created service/test-service created namespace/test3 created egressfirewall.k8s.ovn.org/default created replicationcontroller/test-rc created service/test-service created namespace/test4 created egressfirewall.k8s.ovn.org/default created replicationcontroller/test-rc created service/test-service created namespace/test5 created egressfirewall.k8s.ovn.org/default created replicationcontroller/test-rc created service/test-service created namespace/test6 created egressfirewall.k8s.ovn.org/default created replicationcontroller/test-rc created service/test-service created namespace/test7 created egressfirewall.k8s.ovn.org/default created replicationcontroller/test-rc created service/test-service created % cat data/egressfirewall/eg_policy_wildcard.yaml kind: EgressFirewall apiVersion: k8s.ovn.org/v1 metadata: name: default spec: egress: - type: Allow to: dnsName: "*.google.com" - type: Deny to: cidrSelector: 0.0.0.0/0 Then I created namespace test8, created egressfirewall and updated dns anme,it worked well. Then I deleted test8 After that I created namespace test11 as below steps, the issue happened again. % oc create ns test11 namespace/test11 created % oc create -f data/list-for-pod.json -n test11 replicationcontroller/test-rc created service/test-service created % oc create -f data/egressfirewall/eg_policy_dnsname1.yaml -n test11 egressfirewall.k8s.ovn.org/default created % oc get egressfirewall -n test11 NAME EGRESSFIREWALL STATUS default EgressFirewall Rules applied % oc get egressfirewall -n test11 -o yaml apiVersion: v1 items: - apiVersion: k8s.ovn.org/v1 kind: EgressFirewall metadata: creationTimestamp: "2024-05-16T05:32:07Z" generation: 1 name: default namespace: test11 resourceVersion: "101288" uid: 18e60759-48bf-4337-ac06-2e3252f1223a spec: egress: - to: dnsName: registry-1.docker.io type: Allow - ports: - port: 80 protocol: TCP to: dnsName: www.facebook.com type: Allow - to: cidrSelector: 0.0.0.0/0 type: Deny status: messages: - 'hrw-0516i-d884f-worker-a-m7769: EgressFirewall Rules applied' - 'hrw-0516i-d884f-master-0.us-central1-b.c.openshift-qe.internal: EgressFirewall Rules applied' - 'hrw-0516i-d884f-worker-b-q4fsm: EgressFirewall Rules applied' - 'hrw-0516i-d884f-master-1.us-central1-c.c.openshift-qe.internal: EgressFirewall Rules applied' - 'hrw-0516i-d884f-master-2.us-central1-f.c.openshift-qe.internal: EgressFirewall Rules applied' - 'hrw-0516i-d884f-worker-c-4kvgr: EgressFirewall Rules applied' status: EgressFirewall Rules applied kind: List metadata: resourceVersion: "" % oc get pods -n test11 NAME READY STATUS RESTARTS AGE test-rc-ffg4g 1/1 Running 0 61s test-rc-lw4r8 1/1 Running 0 61s % oc rsh -n test11 test-rc-ffg4g ~ $ curl registry-1.docker.io -I ^C ~ $ curl www.facebook.com ^C ~ $ ~ $ curl www.facebook.com --connect-timeout 5 curl: (28) Failed to connect to www.facebook.com port 80 after 2706 ms: Operation timed out ~ $ curl registry-1.docker.io --connect-timeout 5 curl: (28) Failed to connect to registry-1.docker.io port 80 after 4430 ms: Operation timed out ~ $ ^C ~ $ exit command terminated with exit code 130 % oc get dnsnameresolver -n openshift-ovn-kubernetes NAME AGE 
dns-67b687cfb5 7m47s dns-696b6747d9 2m12s dns-b6c74f6f4 2m12s % oc get dnsnameresolver dns-696b6747d9 -n openshift-ovn-kubernetes -o yaml apiVersion: network.openshift.io/v1alpha1 kind: DNSNameResolver metadata: creationTimestamp: "2024-05-16T05:32:07Z" generation: 1 name: dns-696b6747d9 namespace: openshift-ovn-kubernetes resourceVersion: "101283" uid: a8546ad8-b16d-4d81-a943-46bdd0d82aa5 spec: name: www.facebook.com. % oc get dnsnameresolver dns-696b6747d9 -n openshift-ovn-kubernetes -o yaml apiVersion: network.openshift.io/v1alpha1 kind: DNSNameResolver metadata: creationTimestamp: "2024-05-16T05:32:07Z" generation: 1 name: dns-696b6747d9 namespace: openshift-ovn-kubernetes resourceVersion: "101283" uid: a8546ad8-b16d-4d81-a943-46bdd0d82aa5 spec: name: www.facebook.com. % oc get dnsnameresolver dns-696b6747d9 -n openshift-ovn-kubernetes -o yaml apiVersion: network.openshift.io/v1alpha1 kind: DNSNameResolver metadata: creationTimestamp: "2024-05-16T05:32:07Z" generation: 1 name: dns-696b6747d9 namespace: openshift-ovn-kubernetes resourceVersion: "101283" uid: a8546ad8-b16d-4d81-a943-46bdd0d82aa5 spec: name: www.facebook.com.
Actual results:
DNS names like www.facebook.com configured in the EgressFirewall didn't get resolved to an IP.
Expected results:
EgressFirewall works as expected.
Additional info:
Description of problem:
Bare-metal UPI cluster nodes lose communication with other nodes, and this affects pod communication on those nodes as well. This issue can be fixed with an OVN rebuild of the DB on the nodes that are hitting the issue, but eventually the nodes will degrade and lose communication again. Note that despite an OVN rebuild fixing the issue temporarily, host networking is set to true, so it's using the kernel routing table. Update: observed on vSphere with routingViaHost: false, ipForwarding: global configuration as well.
Version-Release number of selected component (if applicable):
4.14.7, 4.14.30
How reproducible:
Can't reproduce locally but reproducible and repeatedly occurring in customer environment
Steps to Reproduce:
Identify a host node whose pods can't be reached from other hosts in default namespaces (tested via openshift-dns). Observe that curls to that peer pod consistently time out. Tcpdumps to the target pod show that packets are arriving and are acknowledged, but never route back to the client pod successfully (SYN/ACK seen at the pod network layer, not at geneve, so dropped before hitting the geneve tunnel). See the capture sketch at the end of this report.
Actual results:
Nodes will repeatedly degrade and lose communication despite fixing the issue with a ovn db rebuild (db rebuild only provides hours/days of respite, no permanent resolve).
Expected results:
Nodes should not be losing communication and even if they did it should not happen repeatedly
Additional info:
What's been tried so far ======================== - Multiple OVN rebuilds on different nodes (works but node will eventually hit issue again) - Flushing the conntrack (Doesn't work) - Restarting nodes (doesn't work) Data gathered ============= - Tcpdump from all interfaces for dns-pods going to port 7777 (to segregate traffic) - ovnkube-trace - SOSreports of two nodes having communication issues before an OVN rebuild - SOSreports of two nodes having communication issues after an OVN rebuild - OVS trace dumps of br-int and br-ex ==== More data in nested comments below.
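For the tcpdump data mentioned above, a capture along these lines can be used on an affected node (node name is a placeholder; port 7777 is the port used to segregate the test traffic):
```
# Capture the dns-pod test traffic on all interfaces of the affected node for 60 seconds.
oc debug node/<node-name> -- chroot /host timeout 60 tcpdump -i any -nn port 7777
```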
Please review the following PR: https://github.com/openshift/cluster-api-operator/pull/37
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
This is a clone of issue OCPBUGS-36462. The following is the description of the original issue:
—
Similar to OCPBUGS-20061, but for a different situation:
$ w3m -dump -cols 200 'https://search.dptools.openshift.org/?maxAge=48h&name=pull-ci-openshift-cluster-etcd-operator-master-e2e-aws-ovn-etcd-scaling&type=junit&search=clusteroperator/control-plane-machine-set+should+not+change+condition/Available' | grep 'failures match' | sort pull-ci-openshift-cluster-etcd-operator-master-e2e-aws-ovn-etcd-scaling (all) - 15 runs, 60% failed, 33% of failures match = 20% impact
In that test, since ETCD-329, the test suite deletes a control-plane Machine and waits for the ControlPlaneMachineSet controller to scale in a replacement. But in runs like this, the outgoing Node goes Ready=Unknown for not-yet-diagnosed reasons, and that somehow misses cpmso#294's inertia (maybe the running guard should be dropped?), and the ClusterOperator goes Available=False complaining about Missing 1 available replica(s).
It's not clear from the message which replica it's worried about (that would be helpful information to include in the message), but I suspect it's the Machine/Node that's in the deletion process. But regardless of the message, this does not seem like a situation worth a cluster-admin-midnight-page Available=False alarm.
Seen in dev-branch CI. I haven't gone back to check older 4.y.
CI Search shows 20% impact, see my earlier query in this message.
Run a bunch of pull-ci-openshift-cluster-etcd-operator-master-e2e-aws-ovn-etcd-scaling and check CI Search results.
20% impact
No hits.
This is a clone of issue OCPBUGS-42277. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-42231. The following is the description of the original issue:
—
Description of problem:
OCP conformance MonitorTests can fail depending on the order in which the CSI driver pods and ClusterRole are applied. The SA, ClusterRole, and ClusterRoleBinding should likely be applied before the deployment/pods.
Version-Release number of selected component (if applicable):
4.18.0
How reproducible:
60%
Steps to Reproduce:
1. Create IPI cluster on IBM Cloud 2. Run OCP Conformance w/ MonitorTests
Actual results:
: [sig-auth][Feature:SCC][Early] should not have pod creation failures during install [Suite:openshift/conformance/parallel] { fail [github.com/openshift/origin/test/extended/authorization/scc.go:76]: 1 pods failed before test on SCC errors Error creating: pods "ibm-vpc-block-csi-node-" is forbidden: unable to validate against any security context constraint: [provider "anyuid": Forbidden: not usable by user or serviceaccount, spec.volumes[0]: Invalid value: "hostPath": hostPath volumes are not allowed to be used, spec.volumes[1]: Invalid value: "hostPath": hostPath volumes are not allowed to be used, spec.volumes[2]: Invalid value: "hostPath": hostPath volumes are not allowed to be used, spec.volumes[3]: Invalid value: "hostPath": hostPath volumes are not allowed to be used, spec.volumes[4]: Invalid value: "hostPath": hostPath volumes are not allowed to be used, spec.volumes[5]: Invalid value: "hostPath": hostPath volumes are not allowed to be used, spec.volumes[6]: Invalid value: "hostPath": hostPath volumes are not allowed to be used, spec.volumes[7]: Invalid value: "hostPath": hostPath volumes are not allowed to be used, spec.volumes[9]: Invalid value: "hostPath": hostPath volumes are not allowed to be used, provider restricted-v2: .containers[0].runAsUser: Invalid value: 0: must be in the ranges: [1000180000, 1000189999], provider restricted-v2: .containers[1].runAsUser: Invalid value: 0: must be in the ranges: [1000180000, 1000189999], provider restricted-v2: .containers[1].privileged: Invalid value: true: Privileged containers are not allowed, provider restricted-v2: .containers[2].runAsUser: Invalid value: 0: must be in the ranges: [1000180000, 1000189999], provider "restricted": Forbidden: not usable by user or serviceaccount, provider "nonroot-v2": Forbidden: not usable by user or serviceaccount, provider "nonroot": Forbidden: not usable by user or serviceaccount, provider "hostmount-anyuid": Forbidden: not usable by user or serviceaccount, provider "machine-api-termination-handler": Forbidden: not usable by user or serviceaccount, provider "hostnetwork-v2": Forbidden: not usable by user or serviceaccount, provider "hostnetwork": Forbidden: not usable by user or serviceaccount, provider "hostaccess": Forbidden: not usable by user or serviceaccount, provider "privileged": Forbidden: not usable by user or serviceaccount] for DaemonSet.apps/v1/ibm-vpc-block-csi-node -n openshift-cluster-csi-drivers happened 7 times Ginkgo exit error 1: exit with code 1}
Expected results:
No pod creation failures from pods being validated against the wrong SCC because the ClusterRole/ClusterRoleBinding, etc. had not been applied yet.
Additional info:
Sorry, I did not see an IBM Cloud Storage component listed among the targeted Components for this bug, so I selected the generic Storage component. Please forward as necessary/possible. Items to consider: ClusterRole: https://github.com/openshift/ibm-vpc-block-csi-driver-operator/blob/master/assets/rbac/privileged_role.yaml ClusterRoleBinding: https://github.com/openshift/ibm-vpc-block-csi-driver-operator/blob/master/assets/rbac/node_privileged_binding.yaml The ibm-vpc-block-csi-node-* pods eventually reach Running using the privileged SCC. I do not know whether it is possible to stage the resources that get created first within the CSI Driver Operator (https://github.com/openshift/ibm-vpc-block-csi-driver-operator/blob/9288e5078f2fe3ce2e69a4be3d94622c164c3dbd/pkg/operator/starter.go#L98-L99) prior to the CSI Driver daemonset (`node.yaml`); perhaps order matters within the list. Example of failure in CI: https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_installer/8235/pull-ci-openshift-installer-master-e2e-ibmcloud-ovn/1836521032031145984
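As illustration only (not the operator's actual wiring), here is a minimal Go sketch of the proposed ordering, where the RBAC assets are applied strictly before the node DaemonSet. applyAsset is a hypothetical helper standing in for however the operator applies its embedded manifests.

```go
package main

import (
	"context"
	"fmt"
)

// applyAsset is a hypothetical helper that creates or updates one embedded
// manifest in the cluster; the real operator wires assets through
// library-go controllers instead.
func applyAsset(ctx context.Context, name string) error {
	fmt.Printf("applying %s\n", name)
	return nil
}

// applyInOrder applies the RBAC assets strictly before the workload assets,
// so the node DaemonSet is never admitted before the ClusterRole and
// ClusterRoleBinding that grant it the privileged SCC exist.
func applyInOrder(ctx context.Context) error {
	rbacFirst := []string{
		"serviceaccount.yaml",
		"rbac/privileged_role.yaml",
		"rbac/node_privileged_binding.yaml",
	}
	workloads := []string{
		"node.yaml",       // CSI node DaemonSet
		"controller.yaml", // CSI controller Deployment
	}
	for _, a := range append(rbacFirst, workloads...) {
		if err := applyAsset(ctx, a); err != nil {
			return fmt.Errorf("failed to apply %s: %w", a, err)
		}
	}
	return nil
}

func main() {
	if err := applyInOrder(context.Background()); err != nil {
		panic(err)
	}
}
```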
This is a clone of issue OCPBUGS-34901. The following is the description of the original issue:
—
Description of problem:
My CSV recently added a v1beta2 API version in addition to the existing v1beta1 version. When I create a v1beta2 CR and view it in the console, I see v1beta1 API fields and not the expected v1beta2 fields.
Version-Release number of selected component (if applicable):
4.15.14 (could affect other versions)
How reproducible:
Install 3.0.0 development version of Cryostat Operator
Steps to Reproduce:
1. operator-sdk run bundle quay.io/ebaron/cryostat-operator-bundle:ocpbugs-34901 2. cat << 'EOF' | oc create -f - apiVersion: operator.cryostat.io/v1beta2 kind: Cryostat metadata: name: cryostat-sample spec: enableCertManager: false EOF 3. Navigate to https://<openshift console>/k8s/ns/openshift-operators/clusterserviceversions/cryostat-operator.v3.0.0-dev/operator.cryostat.io~v1beta2~Cryostat/cryostat-sample 4. Observe v1beta1 properties are rendered including "Minimal Deployment" 5. Attempt to toggle "Minimal Deployment", observe that this fails.
Actual results:
v1beta1 properties are rendered in the details page instead of v1beta2 properties
Expected results:
v1beta2 properties are rendered in the details page
Additional info:
This is a clone of issue OCPBUGS-36904. The following is the description of the original issue:
—
Description of problem:
Subnets created by the installer are tagged with kubernetes.io/cluster/<infra_id> set to 'shared' instead of 'owned'.
Version-Release number of selected component (if applicable):
4.16.z
How reproducible:
Any time a 4.16 cluster is installed
Steps to Reproduce:
1. Install a fresh 4.16 cluster without providing an existing VPC.
Actual results:
Subnets are tagged with kubernetes.io/cluster/<infra_id>: shared
Expected results:
Subnets created by the installer are tagged with kubernetes.io/cluster/<infra_id>: owned
Additional info:
Slack discussion here - https://redhat-internal.slack.com/archives/C68TNFWA2/p1720728359424529
This is a clone of issue OCPBUGS-39225. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-38474. The following is the description of the original issue:
—
Description of problem:
AdditionalTrustedCA is not wired correctly, so the configmap is not found by its operator. This feature is meant to be exposed by XCMSTRAT-590, but at the moment it seems to be broken.
Version-Release number of selected component (if applicable):
4.16.5
How reproducible:
Always
Steps to Reproduce:
1. Create a configmap containing a registry and PEM cert, like https://github.com/openshift/openshift-docs/blob/ef75d891786604e78dcc3bcb98ac6f1b3a75dad1/modules/images-configuration-cas.adoc#L17 2. Refer to it in .spec.configuration.image.additionalTrustedCA.name 3. image-registry-config-operator is not able to find the cm and the CO is degraded
Actual results:
CO is degraded
Expected results:
The certs are used and the CO is not degraded.
Additional info:
I think we may be missing a copy of the configmap from the cluster namespace to the target namespace. The copy should also be deleted when the original is deleted.
% oc get hc -n ocm-adecorte-2d525fsstsvtbv1h8qss14pkv171qhdd -o jsonpath="{.items[0].spec.configuration.image.additionalTrustedCA}" | jq { "name": "registry-additional-ca-q9f6x5i4" }
% oc get cm -n ocm-adecorte-2d525fsstsvtbv1h8qss14pkv171qhdd registry-additional-ca-q9f6x5i4 NAME DATA AGE registry-additional-ca-q9f6x5i4 1 16m
logs of cluster-image-registry operator
E0814 13:22:32.586416 1 imageregistrycertificates.go:141] ImageRegistryCertificatesController: unable to sync: failed to update object *v1.ConfigMap, Namespace=openshift-image-registry, Name=image-registry-certificates: image-registry-certificates: configmap "registry-additional-ca-q9f6x5i4" not found, requeuing
CO is degraded
% oc get co
NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE
console 4.16.5 True False False 3h58m
csi-snapshot-controller 4.16.5 True False False 4h11m
dns 4.16.5 True False False 3h58m
image-registry 4.16.5 True False True 3h58m ImageRegistryCertificatesControllerDegraded: failed to update object *v1.ConfigMap, Namespace=openshift-image-registry, Name=image-registry-certificates: image-registry-certificates: configmap "registry-additional-ca-q9f6x5i4" not found
ingress 4.16.5 True False False 3h59m
insights 4.16.5 True False False 4h
kube-apiserver 4.16.5 True False False 4h11m
kube-controller-manager 4.16.5 True False False 4h11m
kube-scheduler 4.16.5 True False False 4h11m
kube-storage-version-migrator 4.16.5 True False False 166m
monitoring 4.16.5 True False False 3h55m
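If the missing piece really is the copy suggested in the additional info above, a rough client-go sketch could look like the following. The namespaces, the ConfigMap name, and the copyTrustedCA helper are placeholders for illustration, not the HyperShift implementation.

```go
package main

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// copyTrustedCA copies the additionalTrustedCA ConfigMap from the namespace
// where the HostedCluster lives into the hosted control plane namespace so
// the cluster-image-registry operator can find it. Namespaces and name are
// illustrative placeholders.
func copyTrustedCA(ctx context.Context, c kubernetes.Interface, srcNS, dstNS, name string) error {
	src, err := c.CoreV1().ConfigMaps(srcNS).Get(ctx, name, metav1.GetOptions{})
	if err != nil {
		return err
	}
	dst := &corev1.ConfigMap{
		ObjectMeta: metav1.ObjectMeta{Name: name, Namespace: dstNS},
		Data:       src.Data,
	}
	_, err = c.CoreV1().ConfigMaps(dstNS).Create(ctx, dst, metav1.CreateOptions{})
	return err
}

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	if err := copyTrustedCA(context.Background(), client,
		"clusters", "clusters-mycluster", "registry-additional-ca-q9f6x5i4"); err != nil {
		panic(err)
	}
}
```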
Please review the following PR: https://github.com/openshift/cluster-update-keys/pull/53
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
This is a clone of issue OCPBUGS-33397. The following is the description of the original issue:
—
Description of problem:
A node was cordoned manually. After several days, the machine-config-controller uncordoned the same node after rendering a new machine-config.
Version-Release number of selected component (if applicable):
4.13
Actual results:
The MCO rolled out the new config and the node was uncordoned by the MCO.
Expected results:
The MCO should treat an unschedulable node as not ready for performing the update. It may also halt the update on other nodes in the pool, depending on the maxUnavailable setting for that pool.
Additional info:
Description of the problem:
In a deployment with bonding and VLAN, during the booting of the provisioning image the system loses connectivity (even ping) as soon as two new network interfaces appear on the node. Those interfaces, created by the ironic-python-agent, are the slave interfaces with the VLAN added.
Version-Release number of selected component (if applicable):
4.12.48
How reproducible:
Always
Steps to Reproduce:
1. Deploy a cluster with bonding + VLAN
Actual results:
After investigation from the OpenStack team, it looks like having the option "enable_vlan_interfaces = all" enabled in "/etc/ironic-python-agent.conf" is what triggers the creation of the VLAN interfaces. These new interfaces are what cut the communication.
Expected results:
No extra vlan interfaces created, communication is not lost and installation succeeds.
Additional info:
How the customer crafted the test: as soon as the node starts pinging, they connect with SSH and set a password for the core user; once communication is lost (~1 min after it started pinging), they connect through the KVM interface with the core password. If we disable the ironic-python-agent and manually remove the created VLAN interfaces, the communication is restored. Installation works if LLDP is turned off at the switch. This issue was supposed to be fixed in these versions, according to the original JIRA which I have linked here. The team lead from that JIRA suggested the issue has to be fixed by re-vendoring ICC in the assisted-service, hence this JIRA's creation.
Description of problem:
The installation of OpenShift Container Platform 4.13.4 is failing fairly frequently compared to previous versions when installing with a proxy configured. The error reported by the MachineConfigPool is as shown below. - lastTransitionTime: "2023-07-04T10:36:44Z" message: 'Node master0.example.com is reporting: "machineconfig.machineconfiguration.openshift.io \"rendered-master-1e13d7d4ca10669d3d5a6a2bd532873a\" not found", Node master1.example.com is reporting: "machineconfig.machineconfiguration.openshift.io \"rendered-master-1e13d7d4ca10669d3d5a6a2bd532873a\" not found", Node master2.example.com is reporting: "machineconfig.machineconfiguration.openshift.io \"rendered-master-1e13d7d4ca10669d3d5a6a2bd532873a\" not found"' According to https://docs.google.com/document/d/1fgP6Kv1D-75e1Ot0Kg-W2qPyxWDp2_CALltlBLuseec/edit#heading=h.ny6l9ud82fxx this seems to be a known condition, but it's not clear how to prevent it from happening and therefore ensure installations work as expected. The major differences found between /etc/mcs-machine-config-content.json on the OpenShift Container Platform 4 Control-Plane Node and the rendered-master-${hash} are within the following files: - /etc/mco/proxy.env - /etc/kubernetes/kubelet-ca.crt
Version-Release number of selected component (if applicable):
OpenShift Container Platform 4.13.4
How reproducible:
Random
Steps to Reproduce:
1. Install OpenShift Container Platform 4.13.4 on AWS with platform:none, proxy defined and both machineCIDR and machineNetwork.cidr set.
Actual results:
Installation is stuck and will eventually fail as the MachineConfigPool is failing to rollout required MachineConfig for master MachineConfigPool - lastTransitionTime: "2023-07-04T10:36:44Z" message: 'Node master0.example.com is reporting: "machineconfig.machineconfiguration.openshift.io \"rendered-master-1e13d7d4ca10669d3d5a6a2bd532873a\" not found", Node master1.example.com is reporting: "machineconfig.machineconfiguration.openshift.io \"rendered-master-1e13d7d4ca10669d3d5a6a2bd532873a\" not found", Node master2.example.com is reporting: "machineconfig.machineconfiguration.openshift.io \"rendered-master-1e13d7d4ca10669d3d5a6a2bd532873a\" not found"'
Expected results:
Installation to work or else provide meaningful error messaging
Additional info:
https://docs.google.com/document/d/1fgP6Kv1D-75e1Ot0Kg-W2qPyxWDp2_CALltlBLuseec/edit#heading=h.ny6l9ud82fxx checked and then talked to Red Hat Engineering as it was not clear how to proceed
Please review the following PR: https://github.com/openshift/cluster-kube-controller-manager-operator/pull/774
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/k8s-prometheus-adapter/pull/100
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/cluster-api-provider-openstack/pull/300
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
This is a clone of issue OCPBUGS-35041. The following is the description of the original issue:
—
Description of problem:
For STS, an AWS creds file is injected with credentials_process for installer to use. That usually points to a command that loads a Secret containing the creds necessary to assume role. For CAPI, installer runs in an ephemeral envtest cluster. So when it runs that credentials_process (via the black box of passing the creds file to the AWS SDK) the command ends up requesting that Secret from the envtest kube API server… where it doesn’t exist. The Installer should avoid overriding KUBECONFIG whenever possible.
Version-Release number of selected component (if applicable):
4.16+
How reproducible:
always
Steps to Reproduce:
1. Deploy cluster with STS credentials 2. 3.
Actual results:
Install fails with: time="2024-06-02T23:50:17Z" level=debug msg="failed to get the service provider secret: secrets \"shawnnightly-aws-service-provider-secret\" not foundfailed to get the service provider secret: oc get events -n uhc-staging-2blaesc1478urglmcfk3r79a17n82lm3E0602 23:50:17.324137 151 awscluster_controller.go:327] \"failed to reconcile network\" err=<" time="2024-06-02T23:50:17Z" level=debug msg="\tfailed to create new managed VPC: failed to create vpc: ProcessProviderExecutionError: error in credential_process" time="2024-06-02T23:50:17Z" level=debug msg="\tcaused by: exit status 1" time="2024-06-02T23:50:17Z" level=debug msg=" > controller=\"awscluster\" controllerGroup=\"infrastructure.cluster.x-k8s.io\" controllerKind=\"AWSCluster\" AWSCluster=\"openshift-cluster-api-guests/shawnnightly-c8zdl\" namespace=\"openshift-cluster-api-guests\" name=\"shawnnightly-c8zdl\" reconcileID=\"e7524343-f598-4b71-a788-ad6975e92be7\" cluster=\"openshift-cluster-api-guests/shawnnightly-c8zdl\"" time="2024-06-02T23:50:17Z" level=debug msg="I0602 23:50:17.324204 151 recorder.go:104] \"Failed to create new managed VPC: ProcessProviderExecutionError: error in credential_process\\ncaused by: exit status 1\" logger=\"events\" type=\"Warning\" object={\"kind\":\"AWSCluster\",\"namespace\":\"openshift-cluster-api-guests\",\"name\":\"shawnnightly-c8zdl\",\"uid\":\"f20bd7ae-a8d2-4b16-91c2-c9525256bb46\",\"apiVersion\":\"infrastructure.cluster.x-k8s.io/v1beta2\",\"resourceVersion\":\"311\"} reason=\"FailedCreateVPC\""
Expected results:
No failures
Additional info:
Since HyperShift / Hosted Control Planes have adopted include.release.openshift.io/ibm-cloud-managed to tailor the resources of clusters running in the ROKS IBM environment, adding include.release.openshift.io/hypershift will allow HyperShift to express different profile choices than ROKS.
Description of problem:
control-plane-machine-set operator pod stuck into crashloopbackoff state with panic: runtime error: invalid memory address or nil pointer dereference while extracting the failureDomain from the controlplanemachineset. Below is the error trace for reference. ~~~ 2024-04-04T09:32:23.594257072Z I0404 09:32:23.594176 1 controller.go:146] "msg"="Finished reconciling control plane machine set" "controller"="controlplanemachinesetgenerator" "name"="cluster" "namespace"="openshift-machine-api" "reconcileID"="c282f3e3-9f9d-40df-a24e-417ba2ea4106" 2024-04-04T09:32:23.594257072Z I0404 09:32:23.594221 1 controller.go:125] "msg"="Reconciling control plane machine set" "controller"="controlplanemachinesetgenerator" "name"="cluster" "namespace"="openshift-machine-api" "reconcileID"="7f03c05f-2717-49e0-95f8-3e8b2ce2fc55" 2024-04-04T09:32:23.594274974Z I0404 09:32:23.594257 1 controller.go:146] "msg"="Finished reconciling control plane machine set" "controller"="controlplanemachinesetgenerator" "name"="cluster" "namespace"="openshift-machine-api" "reconcileID"="7f03c05f-2717-49e0-95f8-3e8b2ce2fc55" 2024-04-04T09:32:23.597509741Z I0404 09:32:23.597426 1 watch_filters.go:179] reconcile triggered by infrastructure change 2024-04-04T09:32:23.606311553Z I0404 09:32:23.606243 1 controller.go:220] "msg"="Starting workers" "controller"="controlplanemachineset" "worker count"=1 2024-04-04T09:32:23.606360950Z I0404 09:32:23.606340 1 controller.go:169] "msg"="Reconciling control plane machine set" "controller"="controlplanemachineset" "name"="cluster" "namespace"="openshift-machine-api" "reconcileID"="5dac54f4-57ab-419b-b258-79136ca8b400" 2024-04-04T09:32:23.609322467Z I0404 09:32:23.609217 1 panic.go:884] "msg"="Finished reconciling control plane machine set" "controller"="controlplanemachineset" "name"="cluster" "namespace"="openshift-machine-api" "reconcileID"="5dac54f4-57ab-419b-b258-79136ca8b400" 2024-04-04T09:32:23.609322467Z I0404 09:32:23.609271 1 controller.go:115] "msg"="Observed a panic in reconciler: runtime error: invalid memory address or nil pointer dereference" "controller"="controlplanemachineset" "reconcileID"="5dac54f4-57ab-419b-b258-79136ca8b400" 2024-04-04T09:32:23.612540681Z panic: runtime error: invalid memory address or nil pointer dereference [recovered] 2024-04-04T09:32:23.612540681Z panic: runtime error: invalid memory address or nil pointer dereference 2024-04-04T09:32:23.612540681Z [signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x1a5911c] 2024-04-04T09:32:23.612540681Z 2024-04-04T09:32:23.612540681Z goroutine 255 [running]: 2024-04-04T09:32:23.612540681Z sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile.func1() 2024-04-04T09:32:23.612571624Z /go/src/github.com/openshift/cluster-control-plane-machine-set-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:116 +0x1fa 2024-04-04T09:32:23.612571624Z panic({0x1c8ac60, 0x31c6ea0}) 2024-04-04T09:32:23.612571624Z /usr/lib/golang/src/runtime/panic.go:884 +0x213 2024-04-04T09:32:23.612571624Z github.com/openshift/cluster-control-plane-machine-set-operator/pkg/machineproviders/providers/openshift/machine/v1beta1/providerconfig.VSphereProviderConfig.ExtractFailureDomain(...) 
2024-04-04T09:32:23.612571624Z /go/src/github.com/openshift/cluster-control-plane-machine-set-operator/pkg/machineproviders/providers/openshift/machine/v1beta1/providerconfig/vsphere.go:120 2024-04-04T09:32:23.612571624Z github.com/openshift/cluster-control-plane-machine-set-operator/pkg/machineproviders/providers/openshift/machine/v1beta1/providerconfig.providerConfig.ExtractFailureDomain({{0x1f2a71a, 0x7}, {{{{...}, {...}}, {{...}, {...}, {...}, {...}, {...}, {...}, ...}, ...}}, ...}) 2024-04-04T09:32:23.612588145Z /go/src/github.com/openshift/cluster-control-plane-machine-set-operator/pkg/machineproviders/providers/openshift/machine/v1beta1/providerconfig/providerconfig.go:212 +0x23c ~~~
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
The control-plane-machine-set operator is stuck in a CrashLoopBackOff state during the cluster upgrade.
Expected results:
The control-plane-machine-set operator should upgrade without any errors.
Additional info:
This is happening during the cluster upgrade of a vSphere IPI cluster from OCP version 4.14.z to 4.15.6 and may impact other z-stream releases. From the official docs [1], providing the failure domain for the vSphere platform is a Tech Preview feature. [1] https://docs.openshift.com/container-platform/4.15/machine_management/control_plane_machine_management/cpmso-configuration.html#cpmso-yaml-failure-domain-vsphere_cpmso-configuration
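For illustration, the following is a hedged sketch of the kind of defensive nil check the stack trace points at; the types here are simplified stand-ins, not the real machine/v1beta1 provider config types.

```go
package main

import "fmt"

// failureDomain is a stand-in for the vSphere failure domain type in
// openshift/api; this sketch only illustrates the defensive nil check.
type failureDomain struct {
	Name string
}

type providerConfig struct {
	// infrastructure may legitimately be nil on clusters that never
	// configured vSphere failure domains (they are Tech Preview).
	infrastructure *struct {
		FailureDomains []failureDomain
	}
}

// extractFailureDomain returns an empty failure domain instead of
// dereferencing a nil infrastructure pointer, which is what the panic in
// the trace above suggests is happening.
func (p providerConfig) extractFailureDomain() (failureDomain, bool) {
	if p.infrastructure == nil || len(p.infrastructure.FailureDomains) == 0 {
		return failureDomain{}, false
	}
	return p.infrastructure.FailureDomains[0], true
}

func main() {
	var p providerConfig
	if fd, ok := p.extractFailureDomain(); ok {
		fmt.Println("failure domain:", fd.Name)
	} else {
		fmt.Println("no failure domain configured; skipping")
	}
}
```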
Description of problem:
$ oc get co machine-config NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE machine-config 4.16.0-0.ci-2024-03-01-110656 False False True 2m56s Cluster not available for [{operator 4.16.0-0.ci-2024-03-01-110656}]: MachineConfigNode.machineconfiguration.openshift.io "ip-10-0-24-212.us-east-2.compute.internal" is invalid: [metadata.ownerReferences.apiVersion: Invalid value: "": version must not be empty, metadata.ownerReferences.kind: Invalid value: "": kind must not be empty] MCO operator is failing with this error: 218", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Warning' reason: 'MachineConfigNodeFailed' Cluster not available for [{operator 4.16.0-0.ci-2024-03-01-110656}]: MachineConfigNode.machineconfiguration.openshift.io "ip-10-0-24-212.us-east-2.compute.internal" is invalid: [metadata.ownerReferences.apiVersion: Invalid value: "": version must not be empty, metadata.ownerReferences.kind: Invalid value: "": kind must not be empty] I0301 17:19:12.823035 1 event.go:364] Event(v1.ObjectReference{Kind:"", Namespace:"openshift-machine-config-operator", Name:"machine-config", UID:"c1bad7e7-26ff-47fb-8a2d-a0c03c04d218", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Warning' reason: 'OperatorDegraded: MachineConfigNodeFailed' Failed to resync 4.16.0-0.ci-2024-03-01-110656 because: MachineConfigNode.machineconfiguration.openshift.io "ip-10-0-49-207.us-east-2.compute.internal" is invalid: [metadata.ownerReferences.apiVersion: Invalid value: "": version must not be empty, metadata.ownerReferences.kind: Invalid value: "": kind must not be empty]
Version-Release number of selected component (if applicable):
$ oc get clusterversion NAME VERSION AVAILABLE PROGRESSING SINCE STATUS version 4.16.0-0.ci-2024-03-01-110656 True False 17m Error while reconciling 4.16.0-0.ci-2024-03-01-110656: the cluster operator machine-config is not available
How reproducible:
Always
Steps to Reproduce:
1. Enable techpreview oc patch featuregate cluster --type=merge -p '{"spec":{"featureSet": "TechPreviewNoUpgrade"}}'
Actual results:
machine-config CO is degraded
Expected results:
machine-config CO should not be degraded, no error should happen in MCO operator pod
Additional info:
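As an illustrative aside, the validation error above complains about an OwnerReference with empty apiVersion and kind. Below is a minimal sketch of a fully populated owner reference, assuming the owner is the corresponding Node object (the MachineConfigNode is named after the node); the name and UID are placeholders.

```go
package main

import (
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
)

// nodeOwnerRef builds an OwnerReference for the owning Node with APIVersion
// and Kind populated, which is exactly what the validation error above
// complains is missing.
func nodeOwnerRef(name string, uid types.UID) metav1.OwnerReference {
	return metav1.OwnerReference{
		APIVersion: "v1",   // must not be empty
		Kind:       "Node", // must not be empty
		Name:       name,
		UID:        uid,
	}
}

func main() {
	ref := nodeOwnerRef("ip-10-0-24-212.us-east-2.compute.internal", types.UID("example-uid"))
	fmt.Printf("%+v\n", ref)
}
```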
Description of problem:
For certain operations the CEO will check the etcd member health by creating a client directly and waiting for its status report. Under a situation of any member not being reachable for a longer period, we found the CEO was constantly getting stuck / deadlocked and couldn't move certain controllers forward. In OCPBUGS-12475 we introduced a health-check that would dump stack and automatically restart with the operator deployment health probe. In a more recent upgrade run we could find the culprit [1] to be a missing context during client initialization to etcd, making it stuck infinitely: W0229 02:55:46.820529 1 aliveness_checker.go:33] Controller [EtcdEndpointsController] didn't sync for a long time, declaring unhealthy and dumping stack goroutine 1426 [select]: github.com/openshift/cluster-etcd-operator/pkg/etcdcli.getMemberHealth({0x3272768?, 0xc002090310}, {0xc0000a6880, 0x3, 0xc001c98360?}) github.com/openshift/cluster-etcd-operator/pkg/etcdcli/health.go:64 +0x330 github.com/openshift/cluster-etcd-operator/pkg/etcdcli.(*etcdClientGetter).MemberHealth(0xc000c24540, {0x3272688, 0x4c20080}) github.com/openshift/cluster-etcd-operator/pkg/etcdcli/etcdcli.go:412 +0x18c github.com/openshift/cluster-etcd-operator/pkg/operator/ceohelpers.CheckSafeToScaleCluster({0x324ccd0?, 0xc000b6d5f0?}, {0x3284250?, 0xc0008dda10?}, {0x324e6c0, 0xc000ed4fb0}, {0x3250560, 0xc000ed4fd0}, {0x32908d0, 0xc000c24540}) github.com/openshift/cluster-etcd-operator/pkg/operator/ceohelpers/bootstrap.go:149 +0x28e github.com/openshift/cluster-etcd-operator/pkg/operator/ceohelpers.(*QuorumCheck).IsSafeToUpdateRevision(0x2893020?) github.com/openshift/cluster-etcd-operator/pkg/operator/ceohelpers/qourum_check.go:37 +0x46 github.com/openshift/cluster-etcd-operator/pkg/operator/etcdendpointscontroller.(*EtcdEndpointsController).syncConfigMap(0xc0002e28c0, {0x32726f8, 0xc0008e60a0}, {0x32801b0, 0xc001198540}) github.com/openshift/cluster-etcd-operator/pkg/operator/etcdendpointscontroller/etcdendpointscontroller.go:146 +0x5d8 github.com/openshift/cluster-etcd-operator/pkg/operator/etcdendpointscontroller.(*EtcdEndpointsController).sync(0xc0002e28c0, {0x32726f8, 0xc0008e60a0}, {0x325d240, 0xc003569e90}) github.com/openshift/cluster-etcd-operator/pkg/operator/etcdendpointscontroller/etcdendpointscontroller.go:66 +0x71 github.com/openshift/cluster-etcd-operator/pkg/operator/health.(*CheckingSyncWrapper).Sync(0xc000f21bc0, {0x32726f8?, 0xc0008e60a0?}, {0x325d240?, 0xc003569e90?}) github.com/openshift/cluster-etcd-operator/pkg/operator/health/checking_sync_wrapper.go:22 +0x43 github.com/openshift/library-go/pkg/controller/factory.(*baseController).reconcile(0xc00113cd80, {0x32726f8, 0xc0008e60a0}, {0x325d240?, 0xc003569e90?}) github.com/openshift/library-go@v0.0.0-20240124134907-4dfbf6bc7b11/pkg/controller/factory/base_controller.go:201 +0x43 goroutine 11640 [select]: google.golang.org/grpc.(*ClientConn).WaitForStateChange(0xc003707000, {0x3272768, 0xc002091260}, 0x3) google.golang.org/grpc@v1.58.3/clientconn.go:724 +0xb1 google.golang.org/grpc.DialContext({0x3272768, 0xc002091260}, {0xc003753740, 0x3c}, {0xc00355a880, 0x7, 0xc0023aa360?}) google.golang.org/grpc@v1.58.3/clientconn.go:295 +0x128e go.etcd.io/etcd/client/v3.(*Client).dial(0xc000895180, {0x32754a0?, 0xc001785670?}, {0xc0017856b0?, 0x28f6a80?, 0x28?}) go.etcd.io/etcd/client/v3@v3.5.10/client.go:303 +0x407 go.etcd.io/etcd/client/v3.(*Client).dialWithBalancer(0xc000895180, {0x0, 0x0, 0x0}) go.etcd.io/etcd/client/v3@v3.5.10/client.go:281 +0x1a9 
go.etcd.io/etcd/client/v3.newClient(0xc002484e70?) go.etcd.io/etcd/client/v3@v3.5.10/client.go:414 +0x91c go.etcd.io/etcd/client/v3.New(...) go.etcd.io/etcd/client/v3@v3.5.10/client.go:81 github.com/openshift/cluster-etcd-operator/pkg/etcdcli.newEtcdClientWithClientOpts({0xc0017853d0, 0x1, 0x1}, 0x0, {0x0, 0x0, 0x0?}) github.com/openshift/cluster-etcd-operator/pkg/etcdcli/etcdcli.go:127 +0x77d github.com/openshift/cluster-etcd-operator/pkg/etcdcli.checkSingleMemberHealth({0x32726f8, 0xc00318ac30}, 0xc002090460) github.com/openshift/cluster-etcd-operator/pkg/etcdcli/health.go:103 +0xc5 github.com/openshift/cluster-etcd-operator/pkg/etcdcli.getMemberHealth.func1() github.com/openshift/cluster-etcd-operator/pkg/etcdcli/health.go:58 +0x6c created by github.com/openshift/cluster-etcd-operator/pkg/etcdcli.getMemberHealth in goroutine 1426 github.com/openshift/cluster-etcd-operator/pkg/etcdcli/health.go:54 +0x2a5 [1] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.16-upgrade-from-stable-4.15-e2e-metal-ipi-upgrade-ovn-ipv6/1762965898773139456/
Version-Release number of selected component (if applicable):
any currently supported OCP version
How reproducible:
Always
Steps to Reproduce:
1. create a healthy cluster 2. make sure one etcd member never responds, but the node is still there (ie kubelet shutdown, blocking the etcd ports on a firewall) 3. wait for the CEO to restart pod on failing health probe and dump its stack (similar to the one above)
Actual results:
CEO controllers are getting deadlocked, but the operator will restart eventually after some time due to health probes failing
Expected results:
CEO should mark the member as unhealthy and continue its service without getting deadlocked and should not restart its pod by failing the health probe
Additional info:
clientv3.New doesn't take any timeout context, but tries to establish a connection forever https://github.com/openshift/cluster-etcd-operator/blob/master/pkg/etcdcli/etcdcli.go#L127-L130 There's a way to pass the "default context" via the client config, which is slightly misleading.
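As a hedged sketch of that suggestion (placeholder endpoint and timeouts, not the CEO's actual code), clientv3.Config accepts both a DialTimeout and a default Context, which bounds how long the health check can block:

```go
package main

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	// Bound both the dial and the client's default context so a dead member
	// cannot block clientv3.New (and the controller calling it) forever.
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"https://10.0.0.1:2379"}, // placeholder member endpoint
		DialTimeout: 5 * time.Second,
		Context:     ctx, // default context used for dial-out and internal operations
	})
	if err != nil {
		fmt.Println("member unreachable within timeout:", err)
		return
	}
	defer cli.Close()

	// Any status probe should also carry its own deadline.
	statusCtx, statusCancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer statusCancel()
	if _, err := cli.Status(statusCtx, "https://10.0.0.1:2379"); err != nil {
		fmt.Println("status check failed:", err)
	}
}
```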
Update i18n docs with how to enable additional supported languages
Description of problem:
OLM still checks the deleted CatalogSource in openshift-marketplace.
Version-Release number of selected component (if applicable):
4.13
How reproducible:
not always
Steps to Reproduce:
In daily CI we have met this issue several times, for example: https://gcsweb-qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/qe-private-deck/logs/periodic-ci-openshift-openshift-tests-private-release-4.13-amd64-nightly-gcp-ipi-sdn-p1-f7/1632127504539979776/artifacts/gcp-ipi-sdn-p1-f7/openshift-extended-test/build-log.txt prometheus-dependency1-cs has been deleted, but many Subscriptions failed to install due to ErrorPreventedResolution. "message": "failed to populate resolver cache from source prometheus-dependency1-cs/openshift-marketplace: failed to list bundles: rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp: lookup prometheus-dependency1-cs.openshift-marketplace.svc on 172.30.0.10:53: no such host\"", "reason": "ErrorPreventedResolution", "status": "True", "type": "ResolutionFailed" 2023-03-04T22:35:00.761837299Z time="2023-03-04T22:35:00Z" level=info msg="removed client for deleted catalogsource" source="{prometheus-dependency1-cs openshift-marketplace}" 4114 2023-03-04T22:39:38.039489890Z E0304 22:39:38.039410 1 queueinformer_operator.go:298] sync "e2e-test-olm-a-fa98jfef-sxnxr" failed: failed to populate resolver cache from source prometheus-dependency1-cs/openshift-marketplace: failed to list bundles: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp: lookup prometheus-dependency1-cs.openshift-marketplace.svc on 172.30.0.10:53: no such host"
Actual results:
The deleted CatalogSource impacts Subscription installation.
Expected results:
The deleted CatalogSource should not impact Subscription installation.
Additional info:
Please review the following PR: https://github.com/openshift/service-ca-operator/pull/227
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
A long-lived cluster updating into 4.16.0-ec.1 was bitten by the Engineering Candidate's month-or-more-old api-int CA rotation (details on early rotation in API-1687). After manually updating /var/lib/kubelet/kubeconfig to include the new CA (which OCPBUGS-25821 is working on automating), multus pods still complained about untrusted api-int:
$ oc -n openshift-multus logs multus-pz7zp | grep api-int | tail -n5 E0119 19:33:52.983918 3194 reflector.go:148] k8s.io/client-go/informers/factory.go:150: Failed to watch *v1.Pod: failed to list *v1.Pod: Get "https://api-int.build02.gcp.ci.openshift.org:6443/api/v1/pods?fieldSelector=spec.nodeName%3Dbuild0-gstfj-m-2.c.openshift-ci-build-farm.internal&resourceVersion=4723865081": tls: failed to verify certificate: x509: certificate signed by unknown authority 2024-01-19T19:33:55Z [error] Multus: [openshift-machine-api/cluster-autoscaler-default-f8dd547c7-dg9t5/f79ff01a-71c2-4f02-b48b-8c23c9e875ce]: error waiting for pod: Get "https://api-int.build02.gcp.ci.openshift.org:6443/api/v1/namespaces/openshift-machine-api/pods/cluster-autoscaler-default-f8dd547c7-dg9t5?timeout=1m0s": tls: failed to verify certificate: x509: certificate signed by unknown authority 2024-01-19T19:33:55Z [verbose] ADD finished CNI request ContainerID:"b554f8edca8ea7672119c1aa71a69e0368fefeb5f8ae2c2659f822b7fa8d3f62" Netns:"/var/run/netns/36923fe0-e28d-422f-8213-233086527baa" IfName:"eth0" Args:"IgnoreUnknown=1;K8S_POD_NAMESPACE=openshift-machine-api;K8S_POD_NAME=cluster-autoscaler-default-f8dd547c7-dg9t5;K8S_POD_INFRA_CONTAINER_ID=b554f8edca8ea7672119c1aa71a69e0368fefeb5f8ae2c2659f822b7fa8d3f62;K8S_POD_UID=f79ff01a-71c2-4f02-b48b-8c23c9e875ce" Path:"", result: "", err: error configuring pod [openshift-machine-api/cluster-autoscaler-default-f8dd547c7-dg9t5] networking: Multus: [openshift-machine-api/cluster-autoscaler-default-f8dd547c7-dg9t5/f79ff01a-71c2-4f02-b48b-8c23c9e875ce]: error waiting for pod: Get "https://api-int.build02.gcp.ci.openshift.org:6443/api/v1/namespaces/openshift-machine-api/pods/cluster-autoscaler-default-f8dd547c7-dg9t5?timeout=1m0s": tls: failed to verify certificate: x509: certificate signed by unknown authority 2024-01-19T19:34:00Z [error] Multus: [openshift-kube-storage-version-migrator/migrator-558d4d48b9-ggjpj/769153af-350b-492b-9589-ede2574aea85]: error waiting for pod: Get "https://api-int.build02.gcp.ci.openshift.org:6443/api/v1/namespaces/openshift-kube-storage-version-migrator/pods/migrator-558d4d48b9-ggjpj?timeout=1m0s": tls: failed to verify certificate: x509: certificate signed by unknown authority 2024-01-19T19:34:00Z [verbose] ADD finished CNI request ContainerID:"cfd0b8ca596411f1e26ae058fc9f015d6edeac407668420c023ff459860423eb" Netns:"/var/run/netns/bc7fbf17-c049-4241-a7dc-7e27acd3c8af" IfName:"eth0" Args:"IgnoreUnknown=1;K8S_POD_NAMESPACE=openshift-kube-storage-version-migrator;K8S_POD_NAME=migrator-558d4d48b9-ggjpj;K8S_POD_INFRA_CONTAINER_ID=cfd0b8ca596411f1e26ae058fc9f015d6edeac407668420c023ff459860423eb;K8S_POD_UID=769153af-350b-492b-9589-ede2574aea85" Path:"", result: "", err: error configuring pod [openshift-kube-storage-version-migrator/migrator-558d4d48b9-ggjpj] networking: Multus: [openshift-kube-storage-version-migrator/migrator-558d4d48b9-ggjpj/769153af-350b-492b-9589-ede2574aea85]: error waiting for pod: Get "https://api-int.build02.gcp.ci.openshift.org:6443/api/v1/namespaces/openshift-kube-storage-version-migrator/pods/migrator-558d4d48b9-ggjpj?timeout=1m0s": tls: failed to verify certificate: x509: certificate signed by unknown authority
The multus pod needed a delete/replace, and after that it recovered:
$ oc --as system:admin -n openshift-multus delete pod multus-pz7zp pod "multus-pz7zp" deleted $ oc -n openshift-multus get -o wide pods | grep 'NAME\|build0-gstfj-m-2.c.openshift-ci-build-farm.internal' NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES multus-additional-cni-plugins-wrdtt 1/1 Running 1 28h 10.0.0.3 build0-gstfj-m-2.c.openshift-ci-build-farm.internal <none> <none> multus-admission-controller-74d794678b-9s7kl 2/2 Running 0 27h 10.129.0.36 build0-gstfj-m-2.c.openshift-ci-build-farm.internal <none> <none> multus-hxmkz 1/1 Running 0 11s 10.0.0.3 build0-gstfj-m-2.c.openshift-ci-build-farm.internal <none> <none> network-metrics-daemon-dczvs 2/2 Running 2 28h 10.129.0.4 build0-gstfj-m-2.c.openshift-ci-build-farm.internal <none> <none> $ oc -n openshift-multus logs multus-hxmkz | grep -c api-int 0
That need for multus-pod deletion should be automated, to reduce the number of things that need manual touches when the api-int CA rolls.
Seen in 4.16.0-ec.1.
Several multus pods on this cluster were bitten. But others were not, including some on clusters with old kubeconfigs that did not contain the new CA. I'm not clear on what the trigger is; perhaps some clients escape immediate trouble by having existing api-int connections to servers from back when the servers used the old CA? But deleting the multus pod on a cluster whose /var/lib/kubelet/kubeconfig has not yet been updated will likely reproduce the breakage, at least until OCPBUGS-25821 is fixed.
Not entirely clear, but something like:
Multus still fails to trust api-int until the broken pod is deleted or the container otherwise restarts to notice the updated kubeconfig.
Multus pod automatically pulls in the updated kubeconfig.
One possible implementation would be a liveness probe failing on api-int trust issues, triggering the kubelet to roll the multus container, and the replacement multus container to come up and load the fresh kubeconfig.
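A rough sketch of that idea, assuming the container snapshots its CA bundle at startup to a path like /run/multus/ca-at-start.crt; both the path and the api-int host below are placeholders, not the real multus layout.

```go
package main

import (
	"crypto/tls"
	"crypto/x509"
	"fmt"
	"os"
)

// Rough liveness-probe sketch: verify the api-int serving cert against the
// CA bundle this container loaded when it started (copied aside at container
// start). Once the serving CA rotates past that snapshot, the probe exits
// non-zero, the kubelet restarts the container, and the fresh container
// re-reads the updated kubeconfig.
func main() {
	caPEM, err := os.ReadFile("/run/multus/ca-at-start.crt")
	if err != nil {
		fmt.Fprintln(os.Stderr, "cannot read startup CA snapshot:", err)
		os.Exit(1)
	}
	pool := x509.NewCertPool()
	if !pool.AppendCertsFromPEM(caPEM) {
		fmt.Fprintln(os.Stderr, "startup CA snapshot contains no certificates")
		os.Exit(1)
	}

	conn, err := tls.Dial("tcp", "api-int.example.cluster:6443", &tls.Config{RootCAs: pool})
	if err != nil {
		fmt.Fprintln(os.Stderr, "api-int no longer trusted with startup CA:", err)
		os.Exit(1) // liveness failure -> kubelet rolls the container
	}
	conn.Close()
}
```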
Description of problem:
1. With respect to the changes done in PR https://github.com/openshift/console/pull/13676, TaskRuns are fetched for Failed and Cancelled PipelineRuns. To further improve the performance of the PipelineRun list page, use pipelinerun.status.conditions.message for Failed TaskRuns as well; additionally, for any PipelineRun, if the pipelinerun.status.conditions.message string contains data about Task status, use that string instead of fetching TaskRuns. Example string: 'Tasks Completed: 2 (Failed: 1, Cancelled 0), Skipped: 1' 2. For a Failed PipelineRun, to show the log snippet, make the API call on click of the Failed status column in the list page.
Please review the following PR: https://github.com/openshift/csi-driver-shared-resource/pull/178
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
When running the 4.15 installer full function test, the three instance families below were detected and verified; they need to be appended to the installer doc [1]: - standardHBv4Family - standardMSMediumMemoryv3Family - standardMDSMediumMemoryv3Family [1] https://github.com/openshift/installer/blob/master/docs/user/azure/tested_instance_types_x86_64.md
Version-Release number of selected component (if applicable):
4.15
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
Topology links between VMs and non-VMs (such as a Pod or Deployment) are not shown.
Version-Release number of selected component (if applicable):
4.12.14
How reproducible:
Every time, via UI or annotation.
Steps to Reproduce:
1. Create a VM 2. Create a Pod/Deployment 3. Add the annotation or link via the UI
Actual results:
Only the annotation is updated.
Expected results:
Topology shows the linkage.
Additional info:
app.openshift.io/connects-to: >- [{"apiVersion":"kubevirt.io/v1","kind":"VirtualMachine","name":"es-master00"},{"apiVersion":"kubevirt.io/v1","kind":"VirtualMachine","name":"es-master01"},{"apiVersion":"kubevirt.io/v1","kind":"VirtualMachine","name":"es-master02"}]
We need to update packages_ironic.yml to be closer to the current opendev master upper constraints.
After the new packages are created we'll have to tag them and update the ironic-image configuration.
Description of problem:
Due to RHEL9 incorporating OpenSSL 3.0, HaProxy will refuse to start if provided with a cert using SHA1-based signature algorithm. RHEL9 is being introduced in 4.16. This means customers updating from 4.15 to 4.16 with a SHA1 cert will find their router in a failure state. My Notes from experimenting with various ways of using a cert in ingress: - Routes with SHA1 spec.tls.certificate WILL prevent HaProxy from reloading/starting - It is NOT limited to FIPs, I broke a non-FIPs cluster with this - Routes with SHA1 spec.tls.caCertificate will NOT prevent HaProxy starting, but route is rejected, due to extended route validation failure: - lastTransitionTime: "2024-01-04T20:18:01Z" message: 'spec.tls.certificate: Invalid value: "redacted certificate data": error verifying certificate: x509: certificate signed by unknown authority (possibly because of "x509: cannot verify signature: insecure algorithm SHA1-RSA (temporarily override with GODEBUG=x509sha1=1)" while trying to verify candidate authority certificate "www.exampleca.com")' - Routes with SHA1 spec.tls.destinationCACertificate will NOT prevent HaProxy from starting. It actually seems to work as expected - IngressController with SHA1 spec.defaultCertificate WILL prevent HaProxy from starting. - IngressController with SHA1 spec.clientTLS.clientCA will NOT prevent HaProxy from starting.
Version-Release number of selected component (if applicable):
4.16
How reproducible:
100%
Steps to Reproduce:
1. Create a Ingress Controller with spec.defaultCertificate or a Route with spec.tls.certificate as a SHA1 cert 2. Roll out the router
Actual results:
Router fails to start
Expected results:
Router should start
Additional info:
We've previously documented this via a story in the RHEL9 epic: https://issues.redhat.com/browse/NE-1449 The initial fix for this issue was merged as https://github.com/openshift/router/pull/555. This issue is currently causing some problems, notably causing the openshift/cluster-ingress-operator repository's TestRouteAdmissionPolicy E2E test to fail intermittently, which causes the e2e-azure, e2e-gcp-operator, and e2e-aws-operator CI jobs to fail intermittently. Note: in the solution, we only intend to reject routes with a SHA1 cert on spec.tls.certificate. An Ingress Controller with a SHA1 cert on spec.defaultCertificate will NOT be rejected.
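For illustration, here is a minimal Go sketch of how SHA1-signed certificates in a route's spec.tls.certificate PEM bundle could be detected; this is an assumption-level sketch, not the router's or cluster-ingress-operator's actual validation code.

```go
package main

import (
	"crypto/x509"
	"encoding/pem"
	"fmt"
)

// hasSHA1Signature reports whether any certificate in a PEM bundle (as it
// would appear in a route's spec.tls.certificate) uses a SHA1-based
// signature algorithm, which OpenSSL 3.0 on RHEL9 rejects.
func hasSHA1Signature(pemBytes []byte) (bool, error) {
	for block, rest := pem.Decode(pemBytes); block != nil; block, rest = pem.Decode(rest) {
		if block.Type != "CERTIFICATE" {
			continue
		}
		cert, err := x509.ParseCertificate(block.Bytes)
		if err != nil {
			return false, err
		}
		switch cert.SignatureAlgorithm {
		case x509.SHA1WithRSA, x509.ECDSAWithSHA1:
			return true, nil
		}
	}
	return false, nil
}

func main() {
	// Placeholder PEM; in practice this would be the route's certificate field.
	pemData := []byte("-----BEGIN CERTIFICATE-----\n...\n-----END CERTIFICATE-----\n")
	sha1Found, err := hasSHA1Signature(pemData)
	fmt.Println(sha1Found, err)
}
```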
Description of problem:
[AWS-EBS-CSI-Driver] allocatable volumes count incorrect in csinode for AWS arm instance types "c7gd.2xlarge , m7gd.xlarge"
Version-Release number of selected component (if applicable):
4.15.3
How reproducible:
Always
Steps to Reproduce:
1. Create an Openshift cluster on AWS with intance types "c7gd.2xlarge , m7gd.xlarge" 2. Check the csinode allocatable volumes count 3. Create statefulset with 1 pvc mounted and max allocatable volumes count replicas with nodeAffinity apiVersion: apps/v1 kind: StatefulSet metadata: name: statefulset-vol-limit spec: serviceName: "my-svc" replicas: $VOL_COUNT_LIMIT selector: matchLabels: app: my-svc template: metadata: labels: app: my-svc spec: affinity: nodeAffinity: requiredDuringSchedulingIgnoredDuringExecution: nodeSelectorTerms: - matchExpressions: - key: kubernetes.io/hostname operator: In values: - $NODE_NAME containers: - name: openshifttest image: quay.io/openshifttest/hello-openshift@sha256:56c354e7885051b6bb4263f9faa58b2c292d44790599b7dde0e49e7c466cf339 volumeMounts: - name: data mountPath: /mnt/storage tolerations: - key: "node-role.kubernetes.io/master" effect: "NoSchedule" volumeClaimTemplates: - metadata: name: doc gata spec: accessModes: [ "ReadWriteOnce" ] storageClassName: gp3-csi resources: requests: storage: 1Gi 4. The statefulset all replicas should all become ready.
Actual results:
In step 4, the statefulset 26th replica(pod) stuck at ContainerCreating caused by the volume couldn't be attached to the node(the csinode allocatable volumes count incorrect) $ oc get no/ip-10-0-22-114.ec2.internal -oyaml|grep 'instance' beta.kubernetes.io/instance-type: m7gd.xlarge node.kubernetes.io/instance-type: m7gd.xlarge $ oc get csinode/ip-10-0-22-114.ec2.internal -oyaml apiVersion: storage.k8s.io/v1 kind: CSINode metadata: annotations: storage.alpha.kubernetes.io/migrated-plugins: kubernetes.io/aws-ebs,kubernetes.io/azure-disk,kubernetes.io/azure-file,kubernetes.io/cinder,kubernetes.io/gce-pd,kubernetes.io/vsphere-volume creationTimestamp: "2024-03-20T02:16:34Z" name: ip-10-0-22-114.ec2.internal ownerReferences: - apiVersion: v1 kind: Node name: ip-10-0-22-114.ec2.internal uid: acb9a153-bb9b-4c4a-90c1-f3e095173ce2 resourceVersion: "19281" uid: 12507a73-898d-441a-a844-41c7de290b5b spec: drivers: - allocatable: count: 26 name: ebs.csi.aws.com nodeID: i-00ec014c5676a99d2 topologyKeys: - topology.ebs.csi.aws.com/zone $ export VOL_COUNT_LIMIT="26" $ export NODE_NAME="ip-10-0-22-114.ec2.internal" $ envsubst < sts-vol-limit.yaml| oc apply -f - statefulset.apps/statefulset-vol-limit created $ oc get sts NAME READY AGE statefulset-vol-limit 25/26 169m $ oc describe po/statefulset-vol-limit-25 Name: statefulset-vol-limit-25 Namespace: default Priority: 0 Service Account: default Node: ip-10-0-22-114.ec2.internal/10.0.22.114 Start Time: Wed, 20 Mar 2024 18:56:08 +0800 Labels: app=my-svc apps.kubernetes.io/pod-index=25 controller-revision-hash=statefulset-vol-limit-7db55989f7 statefulset.kubernetes.io/pod-name=statefulset-vol-limit-25 Annotations: k8s.ovn.org/pod-networks: {"default":{"ip_addresses":["10.128.2.53/23"],"mac_address":"0a:58:0a:80:02:35","gateway_ips":["10.128.2.1"],"routes":[{"dest":"10.128.0.0... 
Status: Pending IP: IPs: <none> Controlled By: StatefulSet/statefulset-vol-limit Containers: openshifttest: Container ID: Image: quay.io/openshifttest/hello-openshift@sha256:56c354e7885051b6bb4263f9faa58b2c292d44790599b7dde0e49e7c466cf339 Image ID: Port: <none> Host Port: <none> State: Waiting Reason: ContainerCreating Ready: False Restart Count: 0 Environment: <none> Mounts: /mnt/storage from data (rw) /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-zkwqx (ro) Conditions: Type Status Initialized True Ready False ContainersReady False PodScheduled True Volumes: data: Type: PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace) ClaimName: data-statefulset-vol-limit-25 ReadOnly: false kube-api-access-zkwqx: Type: Projected (a volume that contains injected data from multiple sources) TokenExpirationSeconds: 3607 ConfigMapName: kube-root-ca.crt ConfigMapOptional: <nil> DownwardAPI: true ConfigMapName: openshift-service-ca.crt ConfigMapOptional: <nil> QoS Class: BestEffort Node-Selectors: <none> Tolerations: node-role.kubernetes.io/master:NoSchedule node.kubernetes.io/not-ready:NoExecute op=Exists for 300s node.kubernetes.io/unreachable:NoExecute op=Exists for 300s Events: Type Reason Age From Message ---- ------ ---- ---- ------- Normal Scheduled 167m default-scheduler Successfully assigned default/statefulset-vol-limit-25 to ip-10-0-22-114.ec2.internal Warning FailedAttachVolume 166m (x2 over 166m) attachdetach-controller AttachVolume.Attach failed for volume "pvc-b43ec1d0-4fa3-4e87-a80b-6ad912160273" : rpc error: code = Internal desc = Could not attach volume "vol-0a7cb8c5859cf3f96" to node "i-00ec014c5676a99d2": context deadline exceeded Warning FailedAttachVolume 30s (x87 over 166m) attachdetach-controller AttachVolume.Attach failed for volume "pvc-b43ec1d0-4fa3-4e87-a80b-6ad912160273" : rpc error: code = Internal desc = Could not attach volume "vol-0a7cb8c5859cf3f96" to node "i-00ec014c5676a99d2": attachment of disk "vol-0a7cb8c5859cf3f96" failed, expected device to be attached but was attaching
Expected results:
In step4 The statefulset all replicas should all become ready.
Additional info:
For the AWS ARM instance types "c7gd.2xlarge, m7gd.xlarge" the allocatable count should be "25", not "26".
This is a clone of issue OCPBUGS-37054. The following is the description of the original issue:
—
Description of problem:
The 'Getting started resources' card on the Cluster overview includes a link to 'View all steps in documentation', but this link is not valid for ROSA and OSD so it should be hidden.
Description of problem:
In the 4.14 z-stream rollback job, I'm seeing test-case "[sig-network] pods should successfully create sandboxes by adding pod to network " fail. The job link is here https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.14-e2e-aws-ovn-upgrade-rollback-oldest-supported/1719037590788640768 The error is: 56 failures to create the sandbox ns/openshift-monitoring pod/prometheus-k8s-1 node/ip-10-0-48-75.us-east-2.compute.internal - 3314.57 seconds after deletion - reason/FailedCreatePodSandBox Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_prometheus-k8s-1_openshift-monitoring_95d1a457-3e1b-4ae3-8b57-8023eec5937d_0(5b36bc12b2964e85bcdbe60b275d6a12ea68cb18b81f16622a6cb686270c4eb3): error adding pod openshift-monitoring_prometheus-k8s-1 to CNI network "multus-cni-network": plugin type="multus-shim" name="multus-cni-network" failed (add): CmdAdd (shim): failed to send CNI request: Post "http://dummy/cni": EOF ns/openshift-monitoring pod/prometheus-k8s-1 node/ip-10-0-48-75.us-east-2.compute.internal - 3321.57 seconds after deletion - reason/FailedCreatePodSandBox Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_prometheus-k8s-1_openshift-monitoring_95d1a457-3e1b-4ae3-8b57-8023eec5937d_0(3cc0afc5bec362566e4c3bdaf822209377102c2e39aaa8ef5d99b0f4ba795aaf): error adding pod openshift-monitoring_prometheus-k8s-1 to CNI network "multus-cni-network": plugin type="multus-shim" name="multus-cni-network" failed (add): CmdAdd (shim): failed to send CNI request: Post "http://dummy/cni": dial unix /run/multus/socket/multus.sock: connect: connection refused
Version-Release number of selected component (if applicable):
4.14.0-0.nightly-2023-10-30-170011
How reproducible:
Flaky
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
The rollback test works by installing 4.14.0, then upgrading to the latest 4.14 nightly, and at some random point rolling back to 4.14.0.
Description of problem:
oc-mirror creates an invalid itms-oc-mirror.yaml file when working with an OCI image; when creating the ITMS from the file, we hit an error: oc create -f itms-oc-mirror.yaml The ImageTagMirrorSet "itms-operator-0" is invalid: spec.imageTagMirrors[0].source: Invalid value: "//app1/noo": spec.imageTagMirrors[0].source in body should match '^\*(?:\.(?:[a-zA-Z0-9]|[a-zA-Z0-9][a-zA-Z0-9-]*[a-zA-Z0-9]))+$|^((?:[a-zA-Z0-9]|[a-zA-Z0-9][a-zA-Z0-9-]*[a-zA-Z0-9])(?:(?:\.(?:[a-zA-Z0-9]|[a-zA-Z0-9][a-zA-Z0-9-]*[a-zA-Z0-9]))+)?(?::[0-9]+)?)(?:(?:/[a-z0-9]+(?:(?:(?:[._]|__|[-]*)[a-z0-9]+)+)?)+)?$'
Version-Release number of selected component (if applicable):
oc-mirror version WARNING: This version information is deprecated and will be replaced with the output from --short. Use --output=yaml|json to get the full version. Client Version: version.Info{Major:"", Minor:"", GitVersion:"4.16.0-202403251146.p0.g03ce0ca.assembly.stream.el9-03ce0ca", GitCommit:"03ce0ca797e73b6762fd3e24100ce043199519e9", GitTreeState:"clean", BuildDate:"2024-03-25T16:34:33Z", GoVersion:"go1.21.7 (Red Hat 1.21.7-1.el9) X:strictfipsruntime", Compiler:"gc", Platform:"linux/amd64"}
How reproducible:
always
Steps to Reproduce:
1) Copy the operator as OCI format to localhost: `skopeo copy --all docker://registry.redhat.io/redhat/redhat-operator-index:v4.15 oci:///app1/noo/redhat-operator-index --remove-signatures` 2) Use following imagesetconfigure for mirror: cat config-multi-op.yaml kind: ImageSetConfiguration apiVersion: mirror.openshift.io/v1alpha2 mirror: operators: - catalog: oci:///app1/noo/redhat-operator-index packages: - name: odf-operator `oc-mirror --config config-multi-op.yaml file://outmulitop --v2` 3) Do diskTomirror : `oc-mirror --config config-multi-op.yaml --from file://outmulitop --v2 docker://ec2-3-139-239-15.us-east-2.compute.amazonaws.com:5000/multi` 4) Create cluster resource with file: itms-oc-mirror.yaml `oc create -f itms-oc-mirror.yaml`
Actual results:
4) failed to create ImageTagMirrorSet oc create -f itms-oc-mirror.yaml The ImageTagMirrorSet "itms-operator-0" is invalid: spec.imageTagMirrors[0].source: Invalid value: "//app1/noo": spec.imageTagMirrors[0].source in body should match '^\*(?:\.(?:[a-zA-Z0-9]|[a-zA-Z0-9][a-zA-Z0-9-]*[a-zA-Z0-9]))+$|^((?:[a-zA-Z0-9]|[a-zA-Z0-9][a-zA-Z0-9-]*[a-zA-Z0-9])(?:(?:\.(?:[a-zA-Z0-9]|[a-zA-Z0-9][a-zA-Z0-9-]*[a-zA-Z0-9]))+)?(?::[0-9]+)?)(?:(?:/[a-z0-9]+(?:(?:(?:[._]|__|[-]*)[a-z0-9]+)+)?)+)?$' cat itms-oc-mirror.yaml --- apiVersion: config.openshift.io/v1 kind: ImageTagMirrorSet metadata: creationTimestamp: null name: itms-operator-0 spec: imageTagMirrors: - mirrors: - ec2-3-139-239-15.us-east-2.compute.amazonaws.com:5000/multi source: //app1/noo status: {}
Expected results:
4) The cluster resource should be created successfully.
This is a clone of issue OCPBUGS-43674. The following is the description of the original issue:
—
Description of problem:
The assisted service is throwing an error message stating that the Cloud Controller Manager (CCM) is not enabled, even though the CCM value is correctly set in the install-config file.
Version-Release number of selected component (if applicable):
4.18.0-0.nightly-2024-10-19-045205
How reproducible:
Always
Steps to Reproduce:
1. Prepare install-config and agent-config for external OCI platform. example of install-config configuration ....... ....... platform: external platformName: oci cloudControllerManager: External ....... ....... 2. Create agent ISO for external OCI platform 3. Boot up nodes using created agent ISO
Actual results:
Oct 21 16:40:47 agent-sno.private.agenttest.oraclevcn.com service[2829]: time="2024-10-21T16:40:47Z" level=info msg="Register cluster: agenttest with id 2666753a-0485-420b-b968-e8732da6898c and params {\"api_vips\":[],\"base_dns_domain\":\"abitest.oci-rhelcert.edge-sro.rhecoeng.com\",\"cluster_networks\":[{\"cidr\":\"10.128.0.0/14\",\"host_prefix\":23}],\"cpu_architecture\":\"x86_64\",\"high_availability_mode\":\"None\",\"ingress_vips\":[],\"machine_networks\":[{\"cidr\":\"10.0.0.0/20\"}],\"name\":\"agenttest\",\"network_type\":\"OVNKubernetes\",\"olm_operators\":null,\"openshift_version\":\"4.18.0-0.nightly-2024-10-19-045205\",\"platform\":{\"external\":{\"cloud_controller_manager\":\"\",\"platform_name\":\"oci\"},\"type\":\"external\"},\"pull_secret\":\"***\",\"schedulable_masters\":false,\"service_networks\":[{\"cidr\":\"172.30.0.0/16\"}],\"ssh_public_key\":\"ssh-rsa XXXXXXXXXXXX\",\"user_managed_networking\":true,\"vip_dhcp_allocation\":false}" func="github.com/openshift/assisted-service/internal/bminventory.(*bareMetalInventory).RegisterClusterInternal" file="/src/internal/bminventory/inventory.go:515" cluster_id=2666753a-0485-420b-b968-e8732da6898c go-id=2110 pkg=Inventory request_id=82e83b31-1c1b-4dea-b435-f7316a1965e
Expected results:
The cluster installation should be successful.
The failure is fairly rare globally but some platforms seem to see it more often. Last night we happened to see it twice in 10 azure runs and aggregation failed on it. It appears to be a longstanding issue however.
The following test catches the problem
[sig-arch] events should not repeat pathologically for ns/openshift-authentication-operator
And the error will show something similar to:
{ 1 events happened too frequently event happened 70 times, something is wrong: namespace/openshift-authentication-operator deployment/authentication-operator hmsg/16eeb8c913 - reason/OpenShiftAPICheckFailed "oauth.openshift.io.v1" failed with an attempt failed with statusCode = 503, err = the server is currently unable to handle the request From: 15:46:39Z To: 15:46:40Z result=reject }
This is quite severe for just 1 second. The intervals database shows occurrences of over 100.
Sippy's test page provides insight into what platforms see the problem more, and can be used to find job runs where this happens, but the runs from yesterday were:
This is a clone of issue OCPBUGS-43329. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-36236. The following is the description of the original issue:
—
Description of problem:
The installer for IBM Cloud currently only checks the first group of subnets (50) when searching for Subnet details by name. It should provide pagination support to search all subnets.
Version-Release number of selected component (if applicable):
4.17
How reproducible:
100%, though dependent on the order of subnets returned by the IBM Cloud APIs.
Steps to Reproduce:
1. Create 50+ IBM Cloud VPC Subnets 2. Use Bring Your Own Network (BYON) configuration (with Subnet names for CP and/or Compute) in install-config.yaml 3. Attempt to create manifests (openshift-install create manifests)
Actual results:
ERROR failed to fetch Master Machines: failed to load asset "Install Config": failed to create install config: [platform.ibmcloud.controlPlaneSubnets: Not found: "eu-de-subnet-paginate-1-cp-eu-de-1", platform.ibmcloud.controlPlaneSubnets: Not found: "eu-de-subnet-paginate-1-cp-eu-de-2", platform.ibmcloud.controlPlaneSubnets: Not found: "eu-de-subnet-paginate-1-cp-eu-de-3", platform.ibmcloud.controlPlaneSubnets: Invalid value: []string{"eu-de-subnet-paginate-1-cp-eu-de-1", "eu-de-subnet-paginate-1-cp-eu-de-2", "eu-de-subnet-paginate-1-cp-eu-de-3"}: number of zones (0) covered by controlPlaneSubnets does not match number of provided or default zones (3) for control plane in eu-de, platform.ibmcloud.computeSubnets: Not found: "eu-de-subnet-paginate-1-compute-eu-de-1", platform.ibmcloud.computeSubnets: Not found: "eu-de-subnet-paginate-1-compute-eu-de-2", platform.ibmcloud.computeSubnets: Not found: "eu-de-subnet-paginate-1-compute-eu-de-3", platform.ibmcloud.computeSubnets: Invalid value: []string{"eu-de-subnet-paginate-1-compute-eu-de-1", "eu-de-subnet-paginate-1-compute-eu-de-2", "eu-de-subnet-paginate-1-compute-eu-de-3"}: number of zones (0) covered by computeSubnets does not match number of provided or default zones (3) for compute[0] in eu-de]
Expected results:
Successful manifests and cluster creation
Additional info:
IBM Cloud is working on a fix
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
This is a clone of issue OCPBUGS-29687. The following is the description of the original issue:
—
Description of problem:
Security baselines such as CIS recommend consuming secrets as files rather than as environment variables:
5.4.1 Prefer using secrets as files over secrets as environment variables (Tenable): https://www.tenable.com/audits/items/CIS_Kubernetes_v1.6.1_Level_2_Master.audit:98de3da69271994afb6211cf86ae4c6b
Secrets in Kubernetes must not be stored as environment variables (STIG): https://www.stigviewer.com/stig/kubernetes/2021-04-14/finding/V-242415
However, the metal3 and metal3-image-customization Pods are using environment variables:
$ oc get pod -A -o jsonpath='{range .items[?(@..secretKeyRef)]} {.kind} {.metadata.name} {"\n"}{end}' | grep metal3
Pod metal3-66b59bbb76-8xzl7
Pod metal3-image-customization-965f5c8fc-h8zrk
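As background on the recommendation, a minimal sketch of the two patterns (all names here are hypothetical, not the actual metal3 manifests):
```
# Hedged sketch: the same Secret consumed as an environment variable
# (the secretKeyRef pattern the scans flag) versus mounted as a file.
apiVersion: v1
kind: Pod
metadata:
  name: env-var-example            # hypothetical
spec:
  containers:
  - name: app
    image: registry.example.com/app:latest   # hypothetical
    env:
    - name: IRONIC_PASSWORD                  # matched by the jsonpath query above
      valueFrom:
        secretKeyRef:
          name: ironic-credentials           # hypothetical Secret
          key: password
---
apiVersion: v1
kind: Pod
metadata:
  name: file-mount-example         # hypothetical
spec:
  containers:
  - name: app
    image: registry.example.com/app:latest
    volumeMounts:
    - name: credentials
      mountPath: /etc/ironic-credentials     # password is read from a file instead
      readOnly: true
  volumes:
  - name: credentials
    secret:
      secretName: ironic-credentials
```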
Version-Release number of selected component (if applicable):
4.14, 4.13, 4.12
How reproducible:
100%
Steps to Reproduce:
1. Install a new cluster using baremetal IPI
2. Run a compliance scan using compliance operator[1], or just look at the manifest of metal3 or metal3-image-customization pod
[1] https://docs.openshift.com/container-platform/4.14/security/compliance_operator/co-overview.html
Actual results:
Not compliant to CIS or other security baselines
Expected results:
Compliant to CIS or other security baselines
Additional info:
Currently the konnectivity agent has the following update strategy:
```
updateStrategy:
rollingUpdate:
maxUnavailable: 1
maxSurge: 0
```
We (IBM) suggest updating it to the following:
```
updateStrategy:
rollingUpdate:
maxUnavailable: 10%
type: RollingUpdate
```
In a big cluster, it would speed up the konnectivity-agent update. As the agents are independent, this would not hurt the service.
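For context, a sketch of where the proposed strategy would sit in the konnectivity-agent DaemonSet (only updateStrategy is the point; the other fields are illustrative):
```
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: konnectivity-agent
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: konnectivity-agent
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 10%        # proposed; the current default of 1 rolls nodes one at a time
  template:
    metadata:
      labels:
        app: konnectivity-agent
    spec:
      containers:
      - name: konnectivity-agent
        image: registry.example.com/apiserver-network-proxy-agent:latest   # illustrative
```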
Description of problem:
The install-config.yaml file lets a user set a server group policy for Control plane nodes, and one for Compute nodes, choosing from affinity, soft-affinity, anti-affinity, soft-anti-affinity. The installer will then create the server group if it doesn't exist. The server group policy defined in install-config for Compute nodes is ignored: the worker server group always has the same policy as the Control plane's.
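For reference, a minimal sketch of the install-config.yaml fields involved (values illustrative; the Compute pool's serverGroupPolicy is the one reported as ignored):
```
controlPlane:
  name: master
  platform:
    openstack:
      serverGroupPolicy: soft-anti-affinity
compute:
- name: worker
  platform:
    openstack:
      serverGroupPolicy: soft-affinity     # expected for the worker server group, but the control plane's policy is applied instead
```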
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. openshift-install create install-config
2. Set Compute's serverGroupPolicy to soft-affinity in install-config.yaml
3. openshift-install create cluster
4. Watch the server groups
Actual results:
both master and worker server groups have the default soft-anti-affinity policy
Expected results:
the worker server group should have soft-affinity as its policy
Additional info:
When a MachineAutoscaler references a currently-zero-Machine MachineSet that includes spec.template.spec.taints, the autoscaler fails to deserialize that MachineSet, which causes it to fail to autoscale that MachineSet. The autoscaler's deserialization logic should be improved to avoid failing on the presence of taints.
Reproduced on 4.14.10 and 4.16.0-ec.1. Expected to affect every release going back to at least 4.12, based on code inspection.
Always.
With a launch 4.14.10 gcp Cluster Bot cluster (logs):
$ oc adm upgrade Cluster version is 4.14.10 Upstream: https://api.integration.openshift.com/api/upgrades_info/graph Channel: candidate-4.14 (available channels: candidate-4.14, candidate-4.15) No updates available. You may still upgrade to a specific release image with --to-image or wait for new updates to be available. $ oc -n openshift-machine-api get machinesets.machine.openshift.io NAME DESIRED CURRENT READY AVAILABLE AGE ci-ln-s48f02k-72292-5z2hn-worker-a 1 1 1 1 29m ci-ln-s48f02k-72292-5z2hn-worker-b 1 1 1 1 29m ci-ln-s48f02k-72292-5z2hn-worker-c 1 1 1 1 29m ci-ln-s48f02k-72292-5z2hn-worker-f 0 0 29m
Pick that set with 0 nodes. They don't come with taints by default:
$ oc -n openshift-machine-api get -o json machineset.machine.openshift.io ci-ln-s48f02k-72292-5z2hn-worker-f | jq '.spec.template.spec.taints' null
So patch one in:
$ oc -n openshift-machine-api patch machineset.machine.openshift.io ci-ln-s48f02k-72292-5z2hn-worker-f --type json -p '[{"op": "add", "path": "/spec/template/spec/taints", "value": [{"effect":"NoSchedule","key":"node-role.kubernetes.io/ci","value":"ci"} ]}]' machineset.machine.openshift.io/ci-ln-s48f02k-72292-5z2hn-worker-f patched
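The patched MachineSet ends up with roughly this fragment (a sketch; only the taints stanza matters here):
```
# Fragment of spec.template.spec after the patch above.
spec:
  template:
    spec:
      taints:
      - key: node-role.kubernetes.io/ci
        value: ci
        effect: NoSchedule
```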
And set up autoscaling:
$ cat cluster-autoscaler.yaml
apiVersion: autoscaling.openshift.io/v1
kind: ClusterAutoscaler
metadata:
  name: default
spec:
  maxNodeProvisionTime: 30m
  scaleDown:
    enabled: true
$ oc apply -f cluster-autoscaler.yaml
clusterautoscaler.autoscaling.openshift.io/default created
I'm not all that familiar with autoscaling. Maybe the ClusterAutoscaler doesn't matter, and you need a MachineAutoscaler aimed at the chosen MachineSet?
$ cat machine-autoscaler.yaml
apiVersion: autoscaling.openshift.io/v1beta1
kind: MachineAutoscaler
metadata:
  name: test
  namespace: openshift-machine-api
spec:
  maxReplicas: 2
  minReplicas: 1
  scaleTargetRef:
    apiVersion: machine.openshift.io/v1beta1
    kind: MachineSet
    name: ci-ln-s48f02k-72292-5z2hn-worker-f
$ oc apply -f machine-autoscaler.yaml
machineautoscaler.autoscaling.openshift.io/test created
Checking the autoscaler's logs:
$ oc -n openshift-machine-api logs -l k8s-app=cluster-autoscaler --tail -1 | grep taint W0122 19:18:47.246369 1 clusterapi_unstructured.go:217] Unable to convert data to taint: %vmap[effect:NoSchedule key:node-role.kubernetes.io/ci value:ci] W0122 19:18:58.474000 1 clusterapi_unstructured.go:217] Unable to convert data to taint: %vmap[effect:NoSchedule key:node-role.kubernetes.io/ci value:ci] W0122 19:19:09.703748 1 clusterapi_unstructured.go:217] Unable to convert data to taint: %vmap[effect:NoSchedule key:node-role.kubernetes.io/ci value:ci] W0122 19:19:20.929617 1 clusterapi_unstructured.go:217] Unable to convert data to taint: %vmap[effect:NoSchedule key:node-role.kubernetes.io/ci value:ci] ...
And the MachineSet is failing to scale:
$ oc -n openshift-machine-api get machinesets.machine.openshift.io ci-ln-s48f02k-72292-5z2hn-worker-f NAME DESIRED CURRENT READY AVAILABLE AGE ci-ln-s48f02k-72292-5z2hn-worker-f 0 0 50m
While if I remove the taint:
$ oc -n openshift-machine-api patch machineset.machine.openshift.io ci-ln-s48f02k-72292-5z2hn-worker-f --type json -p '[{"op": "remove", "path": "/spec/template/spec/taints"}]' machineset.machine.openshift.io/ci-ln-s48f02k-72292-5z2hn-worker-f patched
The autoscaler... well, it's not scaling up new Machines like I'd expected, but at least it seems to have calmed down about the taint deserialization issue:
$ oc -n openshift-machine-api get machines.machine.openshift.io NAME PHASE TYPE REGION ZONE AGE ci-ln-s48f02k-72292-5z2hn-master-0 Running e2-custom-6-16384 us-central1 us-central1-a 53m ci-ln-s48f02k-72292-5z2hn-master-1 Running e2-custom-6-16384 us-central1 us-central1-b 53m ci-ln-s48f02k-72292-5z2hn-master-2 Running e2-custom-6-16384 us-central1 us-central1-c 53m ci-ln-s48f02k-72292-5z2hn-worker-a-fwskf Running e2-standard-4 us-central1 us-central1-a 45m ci-ln-s48f02k-72292-5z2hn-worker-b-qkwlt Running e2-standard-4 us-central1 us-central1-b 45m ci-ln-s48f02k-72292-5z2hn-worker-c-rlw4m Running e2-standard-4 us-central1 us-central1-c 45m $ oc -n openshift-machine-api get machinesets.machine.openshift.io ci-ln-s48f02k-72292-5z2hn-worker-f NAME DESIRED CURRENT READY AVAILABLE AGE ci-ln-s48f02k-72292-5z2hn-worker-f 0 0 53m $ oc -n openshift-machine-api logs -l k8s-app=cluster-autoscaler --tail 50 I0122 19:23:17.284762 1 static_autoscaler.go:552] No unschedulable pods I0122 19:23:17.687036 1 legacy.go:296] No candidates for scale down W0122 19:23:27.924167 1 clusterapi_unstructured.go:217] Unable to convert data to taint: %vmap[effect:NoSchedule key:node-role.kubernetes.io/ci value:ci] I0122 19:23:28.510701 1 static_autoscaler.go:552] No unschedulable pods I0122 19:23:28.909507 1 legacy.go:296] No candidates for scale down W0122 19:23:39.148266 1 clusterapi_unstructured.go:217] Unable to convert data to taint: %vmap[effect:NoSchedule key:node-role.kubernetes.io/ci value:ci] I0122 19:23:39.737359 1 static_autoscaler.go:552] No unschedulable pods I0122 19:23:40.135580 1 legacy.go:296] No candidates for scale down W0122 19:23:50.376616 1 clusterapi_unstructured.go:217] Unable to convert data to taint: %vmap[effect:NoSchedule key:node-role.kubernetes.io/ci value:ci] I0122 19:23:50.963064 1 static_autoscaler.go:552] No unschedulable pods I0122 19:23:51.364313 1 legacy.go:296] No candidates for scale down W0122 19:24:01.601764 1 clusterapi_unstructured.go:217] Unable to convert data to taint: %vmap[effect:NoSchedule key:node-role.kubernetes.io/ci value:ci] I0122 19:24:02.191330 1 static_autoscaler.go:552] No unschedulable pods I0122 19:24:02.589766 1 legacy.go:296] No candidates for scale down I0122 19:24:13.415183 1 static_autoscaler.go:552] No unschedulable pods I0122 19:24:13.815851 1 legacy.go:296] No candidates for scale down I0122 19:24:24.641190 1 static_autoscaler.go:552] No unschedulable pods I0122 19:24:25.040894 1 legacy.go:296] No candidates for scale down I0122 19:24:35.867194 1 static_autoscaler.go:552] No unschedulable pods I0122 19:24:36.266400 1 legacy.go:296] No candidates for scale down I0122 19:24:47.097656 1 static_autoscaler.go:552] No unschedulable pods I0122 19:24:47.498099 1 legacy.go:296] No candidates for scale down I0122 19:24:58.326025 1 static_autoscaler.go:552] No unschedulable pods I0122 19:24:58.726034 1 legacy.go:296] No candidates for scale down I0122 19:25:04.927980 1 node_instances_cache.go:156] Start refreshing cloud provider node instances cache I0122 19:25:04.938213 1 node_instances_cache.go:168] Refresh cloud provider node instances cache finished, refresh took 10.036399ms I0122 19:25:09.552086 1 static_autoscaler.go:552] No unschedulable pods I0122 19:25:09.952094 1 legacy.go:296] No candidates for scale down I0122 19:25:20.778317 1 static_autoscaler.go:552] No unschedulable pods I0122 19:25:21.178062 1 legacy.go:296] No candidates for scale down I0122 19:25:32.005246 1 static_autoscaler.go:552] No unschedulable pods I0122 19:25:32.404966 1 legacy.go:296] 
No candidates for scale down I0122 19:25:43.233637 1 static_autoscaler.go:552] No unschedulable pods I0122 19:25:43.633889 1 legacy.go:296] No candidates for scale down I0122 19:25:54.462009 1 static_autoscaler.go:552] No unschedulable pods I0122 19:25:54.861513 1 legacy.go:296] No candidates for scale down I0122 19:26:05.688410 1 static_autoscaler.go:552] No unschedulable pods I0122 19:26:06.088972 1 legacy.go:296] No candidates for scale down I0122 19:26:16.915156 1 static_autoscaler.go:552] No unschedulable pods I0122 19:26:17.315987 1 legacy.go:296] No candidates for scale down I0122 19:26:28.143877 1 static_autoscaler.go:552] No unschedulable pods I0122 19:26:28.543998 1 legacy.go:296] No candidates for scale down I0122 19:26:39.369085 1 static_autoscaler.go:552] No unschedulable pods I0122 19:26:39.770386 1 legacy.go:296] No candidates for scale down I0122 19:26:50.596923 1 static_autoscaler.go:552] No unschedulable pods I0122 19:26:50.997262 1 legacy.go:296] No candidates for scale down I0122 19:27:01.823577 1 static_autoscaler.go:552] No unschedulable pods I0122 19:27:02.223290 1 legacy.go:296] No candidates for scale down I0122 19:27:04.938943 1 node_instances_cache.go:156] Start refreshing cloud provider node instances cache I0122 19:27:04.947353 1 node_instances_cache.go:168] Refresh cloud provider node instances cache finished, refresh took 8.319938ms
Scale-from-zero MachineAutoscaler fails on taint-deserialization when the referenced MachineSet contains spec.template.spec.taints.
Scale-from-zero MachineAutoscaler works, even when the referenced MachineSet contains spec.template.spec.taints.
Please review the following PR: https://github.com/openshift/aws-ebs-csi-driver/pull/247
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
The Observe -> Alerting, Metrics, and Targets pages do not load as expected; a blank page is shown.
4.15.0-0.nightly-2023-12-07-041003
Always
1. Navigate to an Observe -> Alerting, Metrics, or Targets page directly 2. 3.
Blank page; no data is loaded
Pages work as normal
Failed to load resource: the server responded with a status of 404 (Not Found) /api/accounts_mgmt/v1/subscriptions?page=1&search=external_cluster_id%3D%2715ace915-53d3-4455-b7e3-b7a5a4796b5c%27:1 Failed to load resource: the server responded with a status of 403 (Forbidden) main-chunk-bb9ed989a7f7c65da39a.min.js:1 API call to get support level has failed r: Access denied due to cluster policy. at https://console-openshift-console.apps.ci-ln-9fl1l5t-76ef8.origin-ci-int-aws.dev.rhcloud.com/static/main-chunk-bb9ed989a7f7c65da39a.min.js:1:95279 (anonymous) @ main-chunk-bb9ed989a7f7c65da39a.min.js:1 /api/kubernetes/apis/operators.coreos.com/v1alpha1/namespaces/#ALL_NS#/clusterserviceversions?:1 Failed to load resource: the server responded with a status of 404 (Not Found) vendor-patternfly-5~main-chunk-95cb256d9fa7738d2c46.min.js:1 Modal: When using hasNoBodyWrapper or setting a custom header, ensure you assign an accessible name to the the modal container with aria-label or aria-labelledby.
Description of problem:
The oauthclients degraded condition never gets removed, meaning that once it is set due to an issue on a cluster, it won't be unset.
Version-Release number of selected component (if applicable):
How reproducible:
Sporadically, when the AuthStatusHandlerFailedApply condition is set on the console operator status conditions.
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
time="2024-05-10T10:06:43-04:00" level=error msg="failed to fetch Cluster: failed to generate asset \"Cluster\": failed to create cluster: failed provisioning resources after infrastructure ready: failed to create route53 records: failed to create records for api: InvalidChangeBatch: [\"\" is not a valid hosted zone id. is not a valid encrypted
Version-Release number of selected component (if applicable):
4.16.0-0.nightly-2024-05-05-102537
How reproducible:
Steps to Reproduce:
1. Install a C2S or an SC2S cluster via Cluster API
Actual results:
See description
Expected results:
Cluster could be created successfully on C2S/SC2S
Additional info:
Description of problem:
CI is permafailing all the way down to 4.12 because breaking changes are being side-loaded into old versions via a :latest tag on a fixture image.
Longer version - we faced a few different issues:
- We made a change to opm where it started to validate package names differently. This broke some of our tests because they had invalid package names.
- opm switched to a different cache backend, which led to the operatorhubio image being updated with the new cache backend, but that same image broke CI for older versions whose opm did not support the new backend.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
This is a clone of issue OCPBUGS-33671. The following is the description of the original issue:
—
Description of problem:
If one attempts to create, at the same time, more than one MachineOSConfig that requires a canonicalized secret, only one will build; the rest will not.
Version-Release number of selected component (if applicable):
4.16
How reproducible:
Always
Steps to Reproduce:
1. Create multiple MachineConfigPools. Wait for each MachineConfigPool to get a rendered config.
2. Create multiple MachineOSConfigs at the same time, one for each of the newly-created MachineConfigPools, that use a legacy Docker pull secret. A legacy Docker pull secret is one which does not have each of its secrets under a top-level auths key (see the sketch after these steps). One can use the builder-dockercfg secret in the MCO namespace for this purpose.
3. Wait for the machine-os-builder pod to start.
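As a sketch of what "legacy" means here (registry and auth values are placeholders; shown as YAML for readability, the actual payloads are JSON inside the Secret): the legacy .dockercfg payload keeps registries at the top level, while the canonical .dockerconfigjson payload nests them under auths.
```
# Legacy .dockercfg shape (no top-level auths key) - needs canonicalization:
quay.io:
  auth: "BASE64_AUTH_PLACEHOLDER"
  email: ""
---
# Canonical .dockerconfigjson shape:
auths:
  quay.io:
    auth: "BASE64_AUTH_PLACEHOLDER"
    email: ""
```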
Actual results:
Only one of the MachineOSBuilds begins building. The remaining MachineOSBuilds neither build nor get a status assigned to them. The root cause is that when they all attempt to use the same legacy Docker pull secret, one of them creates the canonicalized version of it, and the concurrent requests that follow fail because the canonicalized secret already exists.
Expected results:
Each MachineOSBuild should run whenever it is created, and it should have some kind of status assigned to it.
Additional info:
To accommodate upgrades of 4.12 to 4.13 on a fips cluster, a rhel8 binary needed to be included in the rhel9 based 4.13 ovn-kubernetes container image. See https://issues.redhat.com/browse/OCPBUGS-15962 for details.
This workaround is not needed on 4.14+ clusters, as minor upgrades from 4.12 will always land on 4.13.
A fix in the ovn-kubernetes repo needs to be accompanied by a config change in ocp-build-data, please coordinate with ART.
Please review the following PR: https://github.com/openshift/cluster-api/pull/191
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
This is a clone of issue OCPBUGS-42081. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-34647. The following is the description of the original issue:
—
Description of problem:
When we enable OCB functionality and we create a MC that configures an enforcing=0 kernel argument, the MCP is degraded, reporting this message: { "lastTransitionTime": "2024-05-30T09:37:06Z", "message": "Node ip-10-0-29-166.us-east-2.compute.internal is reporting: \"unexpected on-disk state validating against quay.io/mcoqe/layering@sha256:654149c7e25a1ada80acb8eedc3ecf9966a8d29e9738b39fcbedad44ddd15ed5: missing expected kernel arguments: [enforcing=0]\"", "reason": "1 nodes are reporting degraded status on sync", "status": "True", "type": "NodeDegraded" },
Version-Release number of selected component (if applicable):
IPI on AWS $ oc get clusterversion NAME VERSION AVAILABLE PROGRESSING SINCE STATUS version 4.16.0-0.nightly-2024-05-30-021120 True False 97m Error while reconciling 4.16.0-0.nightly-2024-05-30-021120: the cluster operator olm is not available
How reproducible:
Always
Steps to Reproduce:
1. Enable techpreview:
$ oc patch featuregate cluster --type=merge -p '{"spec":{"featureSet": "TechPreviewNoUpgrade"}}'
2. Configure a MachineOSConfig (MSOC) resource to enable OCB functionality in the worker pool. When we hit this problem we were using the mcoqe quay repository, a copy of the pull-secret for baseImagePullSecret and renderedImagePushSecret, and no currentImagePullSecret configured.
apiVersion: machineconfiguration.openshift.io/v1alpha1
kind: MachineOSConfig
metadata:
  name: worker
spec:
  machineConfigPool:
    name: worker
  # buildOutputs:
  #   currentImagePullSecret:
  #     name: ""
  buildInputs:
    imageBuilder:
      imageBuilderType: PodImageBuilder
    baseImagePullSecret:
      name: pull-copy
    renderedImagePushSecret:
      name: pull-copy
    renderedImagePushspec: "quay.io/mcoqe/layering:latest"
3. Create a MC to use the enforcing=0 kernel argument:
{ "kind": "List", "apiVersion": "v1", "metadata": {}, "items": [ { "apiVersion": "machineconfiguration.openshift.io/v1", "kind": "MachineConfig", "metadata": { "labels": { "machineconfiguration.openshift.io/role": "worker" }, "name": "change-worker-kernel-selinux-gvr393x2" }, "spec": { "config": { "ignition": { "version": "3.2.0" } }, "kernelArguments": [ "enforcing=0" ] } } ] }
Actual results:
The worker MCP is degraded reporting this message: oc get mcp worker -oyaml .... { "lastTransitionTime": "2024-05-30T09:37:06Z", "message": "Node ip-10-0-29-166.us-east-2.compute.internal is reporting: \"unexpected on-disk state validating against quay.io/mcoqe/layering@sha256:654149c7e25a1ada80acb8eedc3ecf9966a8d29e9738b39fcbedad44ddd15ed5: missing expected kernel arguments: [enforcing=0]\"", "reason": "1 nodes are reporting degraded status on sync", "status": "True", "type": "NodeDegraded" },
Expected results:
The MC should be applied without problems and SELinux should be running with enforcing=0.
Additional info:
Description of problem:
The konnectivity-agent on the data plane needs to resolve its proxy-server-url to connect to the control plane's konnectivity server. These agents currently use the default dnsPolicy, which is ClusterFirst, creating a dependency on CoreDNS. If CoreDNS is misconfigured or down, the agents won't be able to connect to the server, and all konnectivity-related traffic goes down (blocking updates, webhooks, logs, etc.). The correction would be to use dnsPolicy: Default in the konnectivity-agent daemonset on the data plane, so it uses the name resolution configuration from the node. This makes sure that the konnectivity-agent's proxy-server-url can be resolved even if CoreDNS is down or misconfigured. The konnectivity-agent control plane deployment shall not change, as it still needs to use CoreDNS because in that case a ClusterIP Service is configured as the proxy-server-url.
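A minimal sketch of the proposed change in the data-plane DaemonSet (only dnsPolicy is the point; image, args, and other fields are illustrative):
```
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: konnectivity-agent
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: konnectivity-agent
  template:
    metadata:
      labels:
        app: konnectivity-agent
    spec:
      dnsPolicy: Default   # resolve proxy-server-url via the node's resolv.conf instead of CoreDNS
      containers:
      - name: konnectivity-agent
        image: registry.example.com/apiserver-network-proxy-agent:latest   # illustrative
        args:
        - --proxy-server-host=konnectivity.example.com    # illustrative external URL resolved via node DNS
```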
Version-Release number of selected component (if applicable):
4.14, 4.15
How reproducible:
Break coreDNS configuration
Steps to Reproduce:
1. Put an invalid forwarder in dns.operator/default so that upstream DNS resolution fails 2. Rollout-restart the konnectivity-agent daemonset in kube-system
Actual results:
kubectl logs is failing
Expected results:
kubectl logs is working
Additional info:
This is a clone of issue OCPBUGS-30860. The following is the description of the original issue:
—
Description of problem:
Installation failed on 4.16 nightly build when waiting for install-complete. API is unavailable. level=info msg=Waiting up to 20m0s (until 5:00AM UTC) for the Kubernetes API at https://api.ci-op-4sgxj8jx-8482f.qe.azure.devcluster.openshift.com:6443... level=info msg=API v1.29.2+a0beecc up level=info msg=Waiting up to 30m0s (until 5:11AM UTC) for bootstrapping to complete... api available waiting for bootstrap to complete level=info msg=Waiting up to 20m0s (until 5:01AM UTC) for the Kubernetes API at https://api.ci-op-4sgxj8jx-8482f.qe.azure.devcluster.openshift.com:6443... level=info msg=API v1.29.2+a0beecc up level=info msg=Waiting up to 30m0s (until 5:11AM UTC) for bootstrapping to complete... level=info msg=It is now safe to remove the bootstrap resources level=info msg=Time elapsed: 15m54s Copying kubeconfig to shared dir as kubeconfig-minimal level=info msg=Destroying the bootstrap resources... level=info msg=Waiting up to 40m0s (until 5:39AM UTC) for the cluster at https://api.ci-op-4sgxj8jx-8482f.qe.azure.devcluster.openshift.com:6443 to initialize... W0313 04:59:34.272442 229 reflector.go:539] k8s.io/client-go/tools/watch/informerwatcher.go:146: failed to list *v1.ClusterVersion: Get "https://api.ci-op-4sgxj8jx-8482f.qe.azure.devcluster.openshift.com:6443/apis/config.openshift.io/v1/clusterversions?fieldSelector=metadata.name%3Dversion&limit=500&resourceVersion=0": dial tcp 172.212.184.131:6443: i/o timeout I0313 04:59:34.272658 229 trace.go:236] Trace[533197684]: "Reflector ListAndWatch" name:k8s.io/client-go/tools/watch/informerwatcher.go:146 (13-Mar-2024 04:59:04.271) (total time: 30000ms): Trace[533197684]: ---"Objects listed" error:Get "https://api.ci-op-4sgxj8jx-8482f.qe.azure.devcluster.openshift.com:6443/apis/config.openshift.io/v1/clusterversions?fieldSelector=metadata.name%3Dversion&limit=500&resourceVersion=0": dial tcp 172.212.184.131:6443: i/o timeout 30000ms (04:59:34.272) ... E0313 05:38:18.669780 229 reflector.go:147] k8s.io/client-go/tools/watch/informerwatcher.go:146: Failed to watch *v1.ClusterVersion: failed to list *v1.ClusterVersion: Get "https://api.ci-op-4sgxj8jx-8482f.qe.azure.devcluster.openshift.com:6443/apis/config.openshift.io/v1/clusterversions?fieldSelector=metadata.name%3Dversion&limit=500&resourceVersion=0": dial tcp 172.212.184.131:6443: i/o timeout level=error msg=Attempted to gather ClusterOperator status after installation failure: listing ClusterOperator objects: Get "https://api.ci-op-4sgxj8jx-8482f.qe.azure.devcluster.openshift.com:6443/apis/config.openshift.io/v1/clusteroperators": dial tcp 172.212.184.131:6443: i/o timeout level=error msg=Cluster initialization failed because one or more operators are not functioning properly. 
level=error msg=The cluster should be accessible for troubleshooting as detailed in the documentation linked below, level=error msg=https://docs.openshift.com/container-platform/latest/support/troubleshooting/troubleshooting-installations.html level=error msg=The 'wait-for install-complete' subcommand can then be used to continue the installation level=error msg=failed to initialize the cluster: timed out waiting for the condition On master node, seems that kube-apiserver is not running, [root@ci-op-4sgxj8jx-8482f-hppxj-master-0 ~]# crictl ps | grep apiserver e4b6cc9622b01 ec5ccd782eb003136d9cc1df51a2b20f8a2a489d72ffb894b92f50e363c7cb90 7 minutes ago Running kube-apiserver-cert-syncer 22 3ff4af6614409 kube-apiserver-ci-op-4sgxj8jx-8482f-hppxj-master-0 1249824fe5788 ec5ccd782eb003136d9cc1df51a2b20f8a2a489d72ffb894b92f50e363c7cb90 4 hours ago Running kube-apiserver-insecure-readyz 0 3ff4af6614409 kube-apiserver-ci-op-4sgxj8jx-8482f-hppxj-master-0 ca774b07284f0 ec5ccd782eb003136d9cc1df51a2b20f8a2a489d72ffb894b92f50e363c7cb90 4 hours ago Running kube-apiserver-cert-regeneration-controller 0 3ff4af6614409 kube-apiserver-ci-op-4sgxj8jx-8482f-hppxj-master-0 2931b9a2bbabd ec5ccd782eb003136d9cc1df51a2b20f8a2a489d72ffb894b92f50e363c7cb90 4 hours ago Running openshift-apiserver-check-endpoints 0 4136bf2183de1 apiserver-7df5bb879-xx74p 0c9534aec3b6b 8c9042f97c89d8c8519d6e6235bef5a5346f08e6d7d9864ef0f228b318b4c3de 4 hours ago Running openshift-apiserver 0 4136bf2183de1 apiserver-7df5bb879-xx74p db21a2dd1df33 ec5ccd782eb003136d9cc1df51a2b20f8a2a489d72ffb894b92f50e363c7cb90 4 hours ago Running guard 0 199e1f4e665b9 kube-apiserver-guard-ci-op-4sgxj8jx-8482f-hppxj-master-0 429110f9ea5a3 6a03f3f082f3719e79087d569b3cd1e718fb670d1261fbec9504662f1005b1a5 4 hours ago Running apiserver-watcher 0 7664f480df29d apiserver-watcher-ci-op-4sgxj8jx-8482f-hppxj-master-0 [root@ci-op-4sgxj8jx-8482f-hppxj-master-1 ~]# crictl ps | grep apiserver c64187e7adcc6 ec5ccd782eb003136d9cc1df51a2b20f8a2a489d72ffb894b92f50e363c7cb90 4 hours ago Running openshift-apiserver-check-endpoints 0 1a4a5b247c28a apiserver-7df5bb879-f6v5x ff98c52402288 8c9042f97c89d8c8519d6e6235bef5a5346f08e6d7d9864ef0f228b318b4c3de 4 hours ago Running openshift-apiserver 0 1a4a5b247c28a apiserver-7df5bb879-f6v5x 2f8a97f959409 faa1b95089d101cdc907d7affe310bbff5a9aa8f92c725dc6466afc37e731927 4 hours ago Running oauth-apiserver 0 ffa2c316a0cca apiserver-97fbc599c-2ftl7 72897e30e0df0 6a03f3f082f3719e79087d569b3cd1e718fb670d1261fbec9504662f1005b1a5 4 hours ago Running apiserver-watcher 0 3b6c3849ce91f apiserver-watcher-ci-op-4sgxj8jx-8482f-hppxj-master-1 [root@ci-op-4sgxj8jx-8482f-hppxj-master-2 ~]# crictl ps | grep apiserver 04c426f07573d faa1b95089d101cdc907d7affe310bbff5a9aa8f92c725dc6466afc37e731927 4 hours ago Running oauth-apiserver 0 2172a64fb1a38 apiserver-654dcb4cc6-tq8fj 4dcca5c0e9b99 6a03f3f082f3719e79087d569b3cd1e718fb670d1261fbec9504662f1005b1a5 4 hours ago Running apiserver-watcher 0 1cd99ec327199 apiserver-watcher-ci-op-4sgxj8jx-8482f-hppxj-master-2 And found below error in kubelet log, Mar 13 06:10:15 ci-op-4sgxj8jx-8482f-hppxj-master-0 kubenswrapper[23961]: E0313 06:10:15.004656 23961 kuberuntime_manager.go:1262] container &Container{Name:kube-apiserver,Image:quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:789f242b8bc721b697e265c6f9d025f45e56e990bfd32e331c633fe0b9f076bc,Command:[/bin/bash -ec],Args:[LOCK=/var/log/kube-apiserver/.lock Mar 13 06:10:15 ci-op-4sgxj8jx-8482f-hppxj-master-0 kubenswrapper[23961]: # We should be able to acquire the 
lock immediatelly. If not, it means the init container has not released it yet and kubelet or CRI-O started container prematurely. Mar 13 06:10:15 ci-op-4sgxj8jx-8482f-hppxj-master-0 kubenswrapper[23961]: exec {LOCK_FD}>${LOCK} && flock --verbose -w 30 "${LOCK_FD}" || { Mar 13 06:10:15 ci-op-4sgxj8jx-8482f-hppxj-master-0 kubenswrapper[23961]: echo "Failed to acquire lock for kube-apiserver. Please check setup container for details. This is likely kubelet or CRI-O bug." Mar 13 06:10:15 ci-op-4sgxj8jx-8482f-hppxj-master-0 kubenswrapper[23961]: exit 1 Mar 13 06:10:15 ci-op-4sgxj8jx-8482f-hppxj-master-0 kubenswrapper[23961]: } Mar 13 06:10:15 ci-op-4sgxj8jx-8482f-hppxj-master-0 kubenswrapper[23961]: if [ -f /etc/kubernetes/static-pod-certs/configmaps/trusted-ca-bundle/ca-bundle.crt ]; then Mar 13 06:10:15 ci-op-4sgxj8jx-8482f-hppxj-master-0 kubenswrapper[23961]: echo "Copying system trust bundle ..." Mar 13 06:10:15 ci-op-4sgxj8jx-8482f-hppxj-master-0 kubenswrapper[23961]: cp -f /etc/kubernetes/static-pod-certs/configmaps/trusted-ca-bundle/ca-bundle.crt /etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem Mar 13 06:10:15 ci-op-4sgxj8jx-8482f-hppxj-master-0 kubenswrapper[23961]: fi Mar 13 06:10:15 ci-op-4sgxj8jx-8482f-hppxj-master-0 kubenswrapper[23961]: exec watch-termination --termination-touch-file=/var/log/kube-apiserver/.terminating --termination-log-file=/var/log/kube-apiserver/termination.log --graceful-termination-duration=135s --kubeconfig=/etc/kubernetes/static-pod-resources/configmaps/kube-apiserver-cert-syncer-kubeconfig/kubeconfig -- hyperkube kube-apiserver --openshift-config=/etc/kubernetes/static-pod-resources/configmaps/config/config.yaml --advertise-address=${HOST_IP} -v=2 --permit-address-sharing Mar 13 06:10:15 ci-op-4sgxj8jx-8482f-hppxj-master-0 kubenswrapper[23961]: ],WorkingDir:,Ports:[]ContainerPort{ContainerPort{Name:,HostPort:6443,ContainerPort:6443,Protocol:TCP,HostIP:,},},Env:[]EnvVar{EnvVar{Name:POD_NAME,Value:,ValueFrom:&EnvVarSource{FieldRef:&ObjectFieldSelector{APIVersion:v1,FieldPath:metadata.name,},ResourceFieldRef:nil,ConfigMapKeyRef:nil,SecretKeyRef:nil,},},EnvVar{Name:POD_NAMESPACE,Value:,ValueFrom:&EnvVarSource{FieldRef:&ObjectFieldSelector{APIVersion:v1,FieldPath:metadata.namespace,},ResourceFieldRef:nil,ConfigMapKeyRef:nil,SecretKeyRef:nil,},},EnvVar{Name:STATIC_POD_VERSION,Value:4,ValueFrom:nil,},EnvVar{Name:HOST_IP,Value:,ValueFrom:&EnvVarSource{FieldRef:&ObjectFieldSelector{APIVersion:v1,FieldPath:status.hostIP,},ResourceFieldRef:nil,ConfigMapKeyRef:nil,SecretKeyRef:nil,},},EnvVar{Name:GOGC,Value:100,ValueFrom:nil,},},Resources:ResourceRequirements{Limits:ResourceList{},Requests:ResourceList{cpu: {{265 -3} {<nil>} 265m DecimalSI},memory: {{1073741824 0} {<nil>} 1Gi BinarySI},},Claims:[]ResourceClaim{},},VolumeMounts:[]VolumeMount{VolumeMount{Name:resource-dir,ReadOnly:false,MountPath:/etc/kubernetes/static-pod-resources,SubPath:,MountPropagation:nil,SubPathExpr:,},VolumeMount{Name:cert-dir,ReadOnly:false,MountPath:/etc/kubernetes/static-pod-certs,SubPath:,MountPropagation:nil,SubPathExpr:,},VolumeMount{Name:audit-dir,ReadOnly:false,MountPath:/var/log/kube-apiserver,SubPath:,MountPropagation:nil,SubPathExpr:,},},LivenessProbe:&Probe{ProbeHandler:ProbeHandler{Exec:nil,HTTPGet:&HTTPGetAction{Path:livez,Port:{0 6443 
},Host:,Scheme:HTTPS,HTTPHeaders:[]HTTPHeader{},},TCPSocket:nil,GRPC:nil,},InitialDelaySeconds:0,TimeoutSeconds:10,PeriodSeconds:10,SuccessThreshold:1,FailureThreshold:3,TerminationGracePeriodSeconds:nil,},ReadinessProbe:&Probe{ProbeHandler:ProbeHandler{Exec:nil,HTTPGet:&HTTPGetAction{Path:readyz,Port:{0 6443 },Host:,Scheme:HTTPS,HTTPHeaders:[]HTTPHeader{},},TCPSocket:nil,GRPC:nil,},InitialDelaySeconds:0,TimeoutSeconds:10,PeriodSeconds:5,SuccessThreshold:1,FailureThreshold:1,TerminationGracePeriodSeconds:nil,},Lifecycle:nil,TerminationMessagePath:/dev/termination-log,ImagePullPolicy:IfNotPresent,SecurityContext:&SecurityContext{Capabilities:nil,Privileged:*true,SELinuxOptions:nil,RunAsUser:nil,RunAsNonRoot:nil,ReadOnlyRootFilesystem:nil,AllowPrivilegeEscalation:nil,RunAsGroup:nil,ProcMount:nil,WindowsOptions:nil,SeccompProfile:nil,},Stdin:false,StdinOnce:false,TTY:false,EnvFrom:[]EnvFromSource{},TerminationMessagePolicy:FallbackToLogsOnError,VolumeDevices:[]VolumeDevice{},StartupProbe:&Probe{ProbeHandler:ProbeHandler{Exec:nil,HTTPGet:&HTTPGetAction{Path:healthz,Port:{0 6443 },Host:,Scheme:HTTPS,HTTPHeaders:[]HTTPHeader{},},TCPSocket:nil,GRPC:nil,},InitialDelaySeconds:0,TimeoutSeconds:10,PeriodSeconds:5,SuccessThreshold:1,FailureThreshold:30,TerminationGracePeriodSeconds:nil,},ResizePolicy:[]ContainerResizePolicy{},RestartPolicy:nil,} start failed in pod kube-apiserver-ci-op-4sgxj8jx-8482f-hppxj-master-0_openshift-kube-apiserver(196e0956694ff43707b03f4585f3b6cd): CreateContainerConfigError: host IP unknown; known addresses: []
Version-Release number of selected component (if applicable):
4.16 latest nightly build
How reproducible:
frequently
Steps to Reproduce:
1. Install cluster on 4.16 nightly build 2. 3.
Actual results:
Installation failed.
Expected results:
Installation is successful.
Additional info:
Searched CI jobs, found many jobs failed with same error, most are on azure platform. https://search.dptools.openshift.org/?search=failed+to+initialize+the+cluster%3A+timed+out+waiting+for+the+condition&maxAge=48h&context=1&type=junit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job
[sig-api-machinery] ValidatingAdmissionPolicy [Privileged:ClusterAdmin] [FeatureGate:ValidatingAdmissionPolicy] [Beta] should type check a CRD [Suite:openshift/conformance/parallel] [Suite:k8s]
This test appears to fail a little too often. It seems to only run on techpreview clusters (presumably the Beta tag in the name), but I was worried it's an indication something isn't ready to graduate from techpreview, so figured this is worth a bug.
Even so, a 93% pass rate is a little too low; I would like someone to investigate and get this test's pass rate up. When it fails it's typically the only thing killing the job run. Output is always:
{ fail [k8s.io/kubernetes@v1.29.0/test/e2e/apimachinery/validatingadmissionpolicy.go:349]: Expected <[]v1beta1.ExpressionWarning | len:0, cap:0>: nil to have length 2 Ginkgo exit error 1: exit with code 1}
View this link for sample job runs; I would focus on those with 2 failures, indicating this was the only failing test in the job.
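For context, a heavily hedged sketch of the kind of object the test exercises: a ValidatingAdmissionPolicy whose CEL expressions reference fields of a CRD, for which the API server is expected to populate status.typeChecking.expressionWarnings (the concrete fixture in the e2e test may differ; the CRD group, resource, and field names below are illustrative):
```
apiVersion: admissionregistration.k8s.io/v1beta1
kind: ValidatingAdmissionPolicy
metadata:
  name: example-crd-typecheck          # illustrative
spec:
  failurePolicy: Fail
  matchConstraints:
    resourceRules:
    - apiGroups: ["example.com"]       # illustrative CRD group
      apiVersions: ["v1"]
      operations: ["CREATE", "UPDATE"]
      resources: ["widgets"]
  validations:
  - expression: "object.spec.replicaz < 10"   # field not in the CRD schema, so type checking should warn
  - expression: "object.spec.size == 'large'" # a second expression for a second expected warning
```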
This is a clone of issue OCPBUGS-43746. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-38132. The following is the description of the original issue:
—
The CPO reconciliation aborts when the OIDC/LDAP IDP validation check fails, and this results in a failure to reconcile any components that are reconciled after that point in the code.
This failure should not be fatal to the CPO reconcile and should likely be reported as a condition on the HC.
xref
Customer incident
https://issues.redhat.com/browse/OCPBUGS-38071
RFE for bypassing the check
https://issues.redhat.com/browse/RFE-5638
PR to proxy the IDP check through the data plane network
https://github.com/openshift/hypershift/pull/4273
Included following regions
Please review the following PR: https://github.com/openshift/gcp-pd-csi-driver/pull/54
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
E1213 07:18:34.291004 1 run.go:74] "command failed" err="error while building transformers: KMSv1 is deprecated and will only receive security updates going forward. Use KMSv2 instead. Set --feature-gates=KMSv1=true to use the deprecated KMSv1 feature."
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
After setting an invalid release image on a HostedCluster, it is not possible to fix it by editing the HostedCluster and setting a valid release image.
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. Create a HostedCluster with an invalid release image 2. Edit HostedCluster and specify a valid release image 3.
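For reference, a sketch of the field edited in step 2 (cluster name and payload are illustrative; the release image is assumed to live at spec.release.image on the HostedCluster):
```
apiVersion: hypershift.openshift.io/v1beta1
kind: HostedCluster
metadata:
  name: example            # illustrative
  namespace: clusters
spec:
  release:
    image: quay.io/openshift-release-dev/ocp-release:4.15.0-x86_64   # replace the invalid image with a valid payload
```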
Actual results:
HostedCluster does not start using the new valid release image
Expected results:
HostedCluster starts using the valid release image.
Additional info:
Description of problem:
While researching OCPBUGS-30860, I came across read-only caches on Azure master nodes.
Version-Release number of selected component (if applicable):
4.10 - present
How reproducible:
Always
Steps to Reproduce:
1. Spin up a cluster 2. Observe in the Azure cloud dashboard that master nodes are only using the Read cache. 3.
Actual results:
Expected results:
Master nodes should be using the ReadWrite cache.
Additional info:
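For illustration, a sketch of where disk caching is expressed in an Azure providerSpec for the machine.openshift.io API (field placement assumed from the Azure machine provider spec; values illustrative, and master disks are normally provisioned by the installer rather than a MachineSet):
```
apiVersion: machine.openshift.io/v1beta1
kind: MachineSet
metadata:
  name: example-master               # illustrative
  namespace: openshift-machine-api
spec:
  template:
    spec:
      providerSpec:
        value:
          apiVersion: machine.openshift.io/v1beta1
          kind: AzureMachineProviderSpec
          osDisk:
            diskSizeGB: 1024
            cachingType: ReadWrite   # expected; the report observes ReadOnly on masters
            managedDisk:
              storageAccountType: Premium_LRS
```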
Description of problem:
4.17 introduces new auto node sizing values. To preserve backwards compatibility we need to backport a version file. Related: https://issues.redhat.com//browse/OCPNODE-2226
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.20   True        False         43h     Cluster version is 4.11.20
$ oc get clusterrolebinding system:openshift:kube-controller-manager:gce-cloud-provider -o yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  creationTimestamp: "2023-01-11T13:16:47Z"
  name: system:openshift:kube-controller-manager:gce-cloud-provider
  resourceVersion: "6079"
  uid: 82a81635-4535-4a51-ab83-d2a1a5b9a473
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: system:openshift:kube-controller-manager:gce-cloud-provider
subjects:
- kind: ServiceAccount
  name: cloud-provider
  namespace: kube-system
$ oc get sa cloud-provider -n kube-system
Error from server (NotFound): serviceaccounts "cloud-provider" not found
The serviceAccount cloud-provider does not exist, neither in kube-system nor in any other namespace. It is therefore not clear what this ClusterRoleBinding does, what use case it fulfills, and why it references a non-existing serviceAccount. From a security point of view, it is recommended to remove references to non-existing serviceAccounts from ClusterRoleBindings, as a potential attacker could abuse the current state by creating the necessary serviceAccount and gaining undesired permissions.
Version-Release number of selected component (if applicable):
OpenShift Container Platform 4 (all version from what we have found)
How reproducible:
Always
Steps to Reproduce:
1. Install OpenShift Container Platform 4 2. Run oc get clusterrolebinding system:openshift:kube-controller-manager:gce-cloud-provider -o yaml
Actual results:
$ oc get clusterrolebinding system:openshift:kube-controller-manager:gce-cloud-provider -o yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  creationTimestamp: "2023-01-11T13:16:47Z"
  name: system:openshift:kube-controller-manager:gce-cloud-provider
  resourceVersion: "6079"
  uid: 82a81635-4535-4a51-ab83-d2a1a5b9a473
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: system:openshift:kube-controller-manager:gce-cloud-provider
subjects:
- kind: ServiceAccount
  name: cloud-provider
  namespace: kube-system
$ oc get sa cloud-provider -n kube-system
Error from server (NotFound): serviceaccounts "cloud-provider" not found
Expected results:
The serviceAccount called cloud-provider to exist or otherwise the ClusterRoleBinding to be removed.
Additional info:
Finding related to a Security review done on the OpenShift Container Platform 4 - Platform
To support external OIDC on hypershift, but not on self-managed, we need different schemas for the authentication CRD on a default-hypershift versus a default-self-managed. This requires us to change rendering so that it honors the clusterprofile.
Then we have to update the installer to match, then update hypershift, then update the manifests.
Please review the following PR: https://github.com/openshift/machine-config-operator/pull/4152
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
After fixing https://issues.redhat.com/browse/OCPBUGS-29919 by merging https://github.com/openshift/baremetal-runtimecfg/pull/301 we have lost the ability to properly debug the logic that selects the Node IP used in runtimecfg.
In order to preserve the debuggability of this component, it should be possible to selectively enable verbose logs.
Please review the following PR: https://github.com/openshift/csi-external-provisioner/pull/90
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/operator-framework-olm/pull/826
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/multus-cni/pull/202
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/gcp-pd-csi-driver/pull/56
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
Even though fakefish is not a supported redfish interface, it is very useful to have it working for "special" scenarios, like NC-SI, while its support is implemented. On OCP 4.14 and later, converged flow is enabled by default, and on this configuration Ironic sends a soft power_off command to the ironic agent running on the ramdisk. Since this power operation is not going through the redfish interface, it is not processed by fakefish, preventing it from working on some NC-SI configurations, where a full power-off would mean the BMC loses power. Ironic already supports using out-of-band power off for the agent [1], so having an option to use it would be very helpful. [1]- https://opendev.org/openstack/ironic/commit/824ad1676bd8032fb4a4eb8ffc7625a376a64371
Version-Release number of selected component (if applicable):
Seen with OCP 4.14.26 and 4.14.33, expected to happen on later versions
How reproducible:
Always
Steps to Reproduce:
1. Deploy SNO node using ACM and fakefish as redfish interface 2. Check metal3-ironic pod logs
Actual results:
We can see a soft power_off command sent to the ironic agent running on the ramdisk: 2024-08-07 15:00:45.545 1 DEBUG ironic.drivers.modules.agent_client [None req-74c0c3ed-011f-4718-bdce-53f2ba412e85 - - - - - -] Executing agent command standby.power_off for node df006e90-02ee-4847-b532-be4838e844e6 with params {'wait': 'false', 'agent_token': '***'} _command /usr/lib/python3.9/site-packages/ironic/drivers/modules/agent_client.py:197 2024-08-07 15:00:45.551 1 DEBUG ironic.drivers.modules.agent_client [None req-74c0c3ed-011f-4718-bdce-53f2ba412e85 - - - - - -] Agent command standby.power_off for node df006e90-02ee-4847-b532-be4838e844e6 returned result None, error None, HTTP status code 200 _command /usr/lib/python3.9/site-packages/ironic/drivers/modules/agent_client.py:234
Expected results:
There should be an option to prevent this soft power_off command, so that all power actions happen via redfish. This would allow fakefish to capture them and behave as needed.
Additional info:
This is a clone of issue OCPBUGS-38846. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-37850. The following is the description of the original issue:
—
Occasional machine-config daemon panics in tech-preview CI jobs. For example this run has:
https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_cluster-version-operator/1076/pull-ci-openshift-cluster-version-operator-master-e2e-aws-ovn-techpreview/1819082707058036736
And the referenced logs include a full stack trace, the crux of which appears to be:
E0801 19:23:55.012345 2908 runtime.go:79] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference) goroutine 127 [running]: k8s.io/apimachinery/pkg/util/runtime.logPanic({0x2424b80, 0x4166150}) /go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:75 +0x85 k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0xc0004d5340?}) /go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:49 +0x6b panic({0x2424b80?, 0x4166150?}) /usr/lib/golang/src/runtime/panic.go:770 +0x132 github.com/openshift/machine-config-operator/pkg/helpers.ListPools(0xc0007c5208, {0x0, 0x0}) /go/src/github.com/openshift/machine-config-operator/pkg/helpers/helpers.go:142 +0x17d github.com/openshift/machine-config-operator/pkg/helpers.GetPoolsForNode({0x0, 0x0}, 0xc0007c5208) /go/src/github.com/openshift/machine-config-operator/pkg/helpers/helpers.go:66 +0x65 github.com/openshift/machine-config-operator/pkg/daemon.(*PinnedImageSetManager).handleNodeEvent(0xc000a98480, {0x27e9e60?, 0xc0007c5208}) /go/src/github.com/openshift/machine-config-operator/pkg/daemon/pinned_image_set.go:955 +0x92
$ w3m -dump -cols 200 'https://search.dptools.openshift.org/?name=^periodic&type=junit&search=machine-config-daemon.*Observed+a+panic' | grep 'failures match' periodic-ci-openshift-release-master-ci-4.17-e2e-azure-ovn-techpreview (all) - 37 runs, 62% failed, 13% of failures match = 8% impact periodic-ci-openshift-release-master-ci-4.16-e2e-azure-ovn-techpreview-serial (all) - 6 runs, 83% failed, 20% of failures match = 17% impact periodic-ci-openshift-release-master-ci-4.18-e2e-azure-ovn-techpreview (all) - 5 runs, 60% failed, 33% of failures match = 20% impact periodic-ci-openshift-multiarch-master-nightly-4.17-ocp-e2e-aws-ovn-arm64-techpreview-serial (all) - 10 runs, 40% failed, 25% of failures match = 10% impact periodic-ci-openshift-release-master-ci-4.17-e2e-aws-ovn-techpreview-serial (all) - 7 runs, 29% failed, 50% of failures match = 14% impact periodic-ci-openshift-release-master-nightly-4.17-e2e-vsphere-ovn-techpreview-serial (all) - 7 runs, 100% failed, 14% of failures match = 14% impact periodic-ci-openshift-release-master-nightly-4.18-e2e-vsphere-ovn-techpreview-serial (all) - 5 runs, 100% failed, 20% of failures match = 20% impact periodic-ci-openshift-multiarch-master-nightly-4.17-ocp-e2e-aws-ovn-arm64-techpreview (all) - 10 runs, 40% failed, 25% of failures match = 10% impact periodic-ci-openshift-release-master-ci-4.18-e2e-gcp-ovn-techpreview (all) - 5 runs, 40% failed, 50% of failures match = 20% impact periodic-ci-openshift-release-master-ci-4.16-e2e-aws-ovn-techpreview-serial (all) - 6 runs, 17% failed, 200% of failures match = 33% impact periodic-ci-openshift-release-master-nightly-4.16-e2e-vsphere-ovn-techpreview (all) - 6 runs, 17% failed, 100% of failures match = 17% impact periodic-ci-openshift-release-master-nightly-4.17-e2e-aws-ovn-single-node-techpreview-serial (all) - 7 runs, 100% failed, 14% of failures match = 14% impact periodic-ci-openshift-release-master-nightly-4.17-e2e-aws-ovn-single-node-techpreview (all) - 7 runs, 57% failed, 50% of failures match = 29% impact periodic-ci-openshift-release-master-ci-4.16-e2e-aws-ovn-techpreview (all) - 6 runs, 17% failed, 100% of failures match = 17% impact periodic-ci-openshift-release-master-ci-4.17-e2e-gcp-ovn-techpreview (all) - 18 runs, 17% failed, 33% of failures match = 6% impact periodic-ci-openshift-release-master-ci-4.16-e2e-gcp-ovn-techpreview (all) - 6 runs, 17% failed, 100% of failures match = 17% impact periodic-ci-openshift-multiarch-master-nightly-4.16-ocp-e2e-aws-ovn-arm64-techpreview-serial (all) - 11 runs, 18% failed, 50% of failures match = 9% impact periodic-ci-openshift-release-master-ci-4.17-e2e-azure-ovn-techpreview-serial (all) - 7 runs, 57% failed, 25% of failures match = 14% impact
Looks like ~15% impact in those CI runs that CI Search turns up.
Run lots of CI. Look for MCD panics.
CI Search results above.
No hits.
Description of problem:
In a local setup, this error appears when creating a Deployment with scaling in the Git form page: `Deployment in version "v1" cannot be handled as a Deployment: json: cannot unmarshal string into Go struct field DeploymentSpec.spec.replicas of type int32`
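The error indicates the console is submitting spec.replicas as a quoted string, while the API expects an integer. A minimal sketch of the shape the API accepts (Deployment name and image are illustrative):
```
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example             # illustrative
spec:
  replicas: 5               # must be an integer; sending "5" (a string) triggers the unmarshal error above
  selector:
    matchLabels:
      app: example
  template:
    metadata:
      labels:
        app: example
    spec:
      containers:
      - name: app
        image: registry.example.com/app:latest   # illustrative
```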
Version-Release number of selected component (if applicable):
4.16.0-0.nightly-2024-01-05-154400
How reproducible:
Everytime
Steps to Reproduce:
1. In the local setup go to the git form page 2. Enter a git repo and select deployment as the resource type 3. In scaling enter the value as '5' and click on Create button
Actual results:
Got this error: "Deployment in version "v1" cannot be handled as a Deployment: json: cannot unmarshal string into Go struct field DeploymentSpec.spec.replicas of type int32"
Expected results:
Deployment should be created
Additional info:
Happening with Deployment-config creation as well
Description of problem:
When running an Azure install, the installer noticeably hangs for a long time when running create manifests or create cluster. It will sit unresponsive for almost 2 minutes at:
DEBUG OpenShift Installer unreleased-master-9741-gbc9836aa9bd3a4f10d229bb6f87981dddf2adc92
DEBUG Built from commit bc9836aa9bd3a4f10d229bb6f87981dddf2adc92
DEBUG Fetching Metadata...
DEBUG Loading Metadata...
DEBUG Loading Cluster ID...
DEBUG Loading Install Config...
DEBUG Loading SSH Key...
DEBUG Loading Base Domain...
DEBUG Loading Platform...
DEBUG Loading Cluster Name...
DEBUG Loading Base Domain...
DEBUG Loading Platform...
DEBUG Loading Pull Secret...
DEBUG Loading Platform...
INFO Credentials loaded from file "/root/.azure/osServicePrincipal.json"
This could also be related to failures we see in CI such as this: https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_installer/8123/pull-ci-openshift-installer-master-e2e-azure-ovn/1773611162923962368
level=info msg=Consuming Worker Machines from target directory
level=info msg=Credentials loaded from file "/var/run/secrets/ci.openshift.io/cluster-profile/osServicePrincipal.json"
level=fatal msg=failed to fetch Terraform Variables: failed to generate asset "Terraform Variables": error connecting to Azure client: failed to list SKUs: compute.ResourceSkusClient#List: Failure responding to request: StatusCode=200 -- Original Error: Error occurred reading http.Response#Body - Error = 'read tcp 10.128.117.2:43870->4.150.240.10:443: read: connection reset by peer'
If the call takes too long and the context timeout is canceled, we might potentially see this error.
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. Run azure install 2. 3.
Actual results:
Expected results:
Additional info:
https://github.com/openshift/installer/pull/8134 has a partial fix
Description of problem:
4.16.0-0.nightly-2024-04-07-182401, Prometheus Operator 0.73.0, too many warnings for "'bearerTokenFile' is deprecated, use 'authorization' instead.", see below
$ oc -n openshift-monitoring logs -c prometheus-operator deploy/prometheus-operator level=info ts=2024-04-08T07:06:17.191301889Z caller=main.go:186 msg="Starting Prometheus Operator" version="(version=0.73.0, branch=rhaos-4.16-rhel-9, revision=3541f90)" level=info ts=2024-04-08T07:06:17.195797026Z caller=main.go:187 build_context="(go=go1.21.7 (Red Hat 1.21.7-1.el9) X:loopvar,strictfipsruntime, platform=linux/amd64, user=root, date=20240405-12:29:19, tags=strictfipsruntime)" level=info ts=2024-04-08T07:06:17.195888428Z caller=main.go:198 msg="namespaces filtering configuration " config="{allow_list=\"\",deny_list=\"\",prometheus_allow_list=\"openshift-monitoring\",alertmanager_allow_list=\"openshift-monitoring\",alertmanagerconfig_allow_list=\"\",thanosruler_allow_list=\"openshift-monitoring\"}" level=info ts=2024-04-08T07:06:17.212735844Z caller=main.go:227 msg="connection established" cluster-version=v1.29.3+e994e5d level=warn ts=2024-04-08T07:06:17.228748881Z caller=main.go:75 msg="resource \"scrapeconfigs\" (group: \"monitoring.coreos.com/v1alpha1\") not installed in the cluster" level=info ts=2024-04-08T07:06:17.25637504Z caller=operator.go:335 component=prometheus-controller msg="Kubernetes API capabilities" endpointslices=true level=warn ts=2024-04-08T07:06:17.258012256Z caller=main.go:75 msg="resource \"prometheusagents\" (group: \"monitoring.coreos.com/v1alpha1\") not installed in the cluster" level=info ts=2024-04-08T07:06:17.360652572Z caller=server.go:298 msg="starting insecure server" address=127.0.0.1:8080 level=info ts=2024-04-08T07:06:17.602723953Z caller=operator.go:283 component=thanos-controller msg="successfully synced all caches" level=info ts=2024-04-08T07:06:17.686834878Z caller=operator.go:313 component=alertmanager-controller msg="successfully synced all caches" level=info ts=2024-04-08T07:06:17.687014402Z caller=operator.go:572 component=alertmanager-controller key=openshift-monitoring/main msg="sync alertmanager" level=info ts=2024-04-08T07:06:17.696906656Z caller=operator.go:392 component=prometheus-controller msg="successfully synced all caches" level=info ts=2024-04-08T07:06:17.698997412Z caller=operator.go:766 component=prometheus-controller key=openshift-monitoring/k8s msg="sync prometheus" level=info ts=2024-04-08T07:06:17.904295505Z caller=operator.go:572 component=alertmanager-controller key=openshift-monitoring/main msg="sync alertmanager" level=warn ts=2024-04-08T07:06:18.111274725Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-apiserver-operator/openshift-apiserver-operator level=warn ts=2024-04-08T07:06:18.111387227Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-apiserver/openshift-apiserver level=warn ts=2024-04-08T07:06:18.111430218Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-apiserver/openshift-apiserver-operator-check-endpoints level=warn ts=2024-04-08T07:06:18.11149249Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." 
version=2.50.1 service_monitor=openshift-authentication-operator/authentication-operator level=warn ts=2024-04-08T07:06:18.111554601Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-authentication/oauth-openshift level=warn ts=2024-04-08T07:06:18.111637633Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-cloud-credential-operator/cloud-credential-operator level=warn ts=2024-04-08T07:06:18.111697614Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-cluster-csi-drivers/aws-ebs-csi-driver-controller-monitor level=warn ts=2024-04-08T07:06:18.111733495Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-cluster-csi-drivers/aws-ebs-csi-driver-controller-monitor level=warn ts=2024-04-08T07:06:18.111784766Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-cluster-csi-drivers/aws-ebs-csi-driver-controller-monitor level=warn ts=2024-04-08T07:06:18.111819506Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-cluster-csi-drivers/aws-ebs-csi-driver-controller-monitor level=warn ts=2024-04-08T07:06:18.111895078Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-cluster-csi-drivers/aws-ebs-csi-driver-controller-monitor level=warn ts=2024-04-08T07:06:18.111944309Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-cluster-csi-drivers/shared-resource-csi-driver-node-monitor level=warn ts=2024-04-08T07:06:18.11197813Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-cluster-csi-drivers/shared-resource-csi-driver-node-monitor level=warn ts=2024-04-08T07:06:18.112071132Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-cluster-csi-drivers/shared-resource-csi-driver-node-monitor level=warn ts=2024-04-08T07:06:18.112151634Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-cluster-machine-approver/cluster-machine-approver level=warn ts=2024-04-08T07:06:18.112226245Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-cluster-version/cluster-version-operator level=warn ts=2024-04-08T07:06:18.112256916Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." 
version=2.50.1 service_monitor=openshift-config-operator/config-operator level=warn ts=2024-04-08T07:06:18.112284327Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-console-operator/console-operator level=warn ts=2024-04-08T07:06:18.112310487Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-console/console level=warn ts=2024-04-08T07:06:18.112339628Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-controller-manager-operator/openshift-controller-manager-operator level=warn ts=2024-04-08T07:06:18.112370889Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-controller-manager/openshift-controller-manager level=warn ts=2024-04-08T07:06:18.112397339Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-dns-operator/dns-operator level=warn ts=2024-04-08T07:06:18.11243773Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-dns/dns-default level=warn ts=2024-04-08T07:06:18.112484231Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-etcd-operator/etcd-operator level=warn ts=2024-04-08T07:06:18.112532742Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-image-registry/image-registry level=warn ts=2024-04-08T07:06:18.112575493Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-ingress-operator/ingress-operator level=warn ts=2024-04-08T07:06:18.112648155Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-ingress/router-default level=warn ts=2024-04-08T07:06:18.112684775Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-insights/insights-operator level=warn ts=2024-04-08T07:06:18.112738886Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-kube-apiserver-operator/kube-apiserver-operator level=warn ts=2024-04-08T07:06:18.112771917Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-kube-apiserver/kube-apiserver level=warn ts=2024-04-08T07:06:18.112834288Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." 
version=2.50.1 service_monitor=openshift-kube-controller-manager-operator/kube-controller-manager-operator level=warn ts=2024-04-08T07:06:18.11288797Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-kube-controller-manager/kube-controller-manager level=warn ts=2024-04-08T07:06:18.112923101Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-kube-scheduler-operator/kube-scheduler-operator level=warn ts=2024-04-08T07:06:18.112974211Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-kube-scheduler/kube-scheduler level=warn ts=2024-04-08T07:06:18.113004992Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-kube-scheduler/kube-scheduler level=warn ts=2024-04-08T07:06:18.113031193Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-machine-api/cluster-autoscaler-operator level=warn ts=2024-04-08T07:06:18.113082674Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-machine-api/machine-api-controllers level=warn ts=2024-04-08T07:06:18.113111174Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-machine-api/machine-api-controllers level=warn ts=2024-04-08T07:06:18.113137205Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-machine-api/machine-api-controllers level=warn ts=2024-04-08T07:06:18.113180076Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-machine-api/machine-api-operator level=warn ts=2024-04-08T07:06:18.113207577Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-machine-config-operator/machine-config-controller level=warn ts=2024-04-08T07:06:18.113243277Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-machine-config-operator/machine-config-daemon level=warn ts=2024-04-08T07:06:18.113268968Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-machine-config-operator/machine-config-operator level=warn ts=2024-04-08T07:06:18.113303009Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-marketplace/marketplace-operator level=warn ts=2024-04-08T07:06:18.113566255Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." 
version=2.50.1 service_monitor=openshift-monitoring/promtail-monitor level=warn ts=2024-04-08T07:06:18.113659677Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-multus/monitor-multus-admission-controller level=warn ts=2024-04-08T07:06:18.113690037Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-multus/monitor-network level=warn ts=2024-04-08T07:06:18.113716478Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-network-diagnostics/network-check-source level=warn ts=2024-04-08T07:06:18.113760539Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-network-operator/network-operator level=warn ts=2024-04-08T07:06:18.113789389Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-oauth-apiserver/openshift-oauth-apiserver level=warn ts=2024-04-08T07:06:18.11382366Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-operator-lifecycle-manager/catalog-operator level=warn ts=2024-04-08T07:06:18.113849491Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-operator-lifecycle-manager/olm-operator level=warn ts=2024-04-08T07:06:18.113882881Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-operator-lifecycle-manager/package-server-manager-metrics level=warn ts=2024-04-08T07:06:18.113910142Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-ovn-kubernetes/monitor-ovn-control-plane-metrics level=warn ts=2024-04-08T07:06:18.113939212Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-ovn-kubernetes/monitor-ovn-node level=warn ts=2024-04-08T07:06:18.113965423Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-ovn-kubernetes/monitor-ovn-node level=warn ts=2024-04-08T07:06:18.114005374Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-route-controller-manager/openshift-route-controller-manager level=warn ts=2024-04-08T07:06:18.114032265Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-service-ca-operator/service-ca-operator level=warn ts=2024-04-08T07:06:18.114075275Z caller=promcfg.go:1806 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." 
version=2.50.1 level=info ts=2024-04-08T07:06:18.372521592Z caller=operator.go:766 component=prometheus-controller key=openshift-monitoring/k8s msg="sync prometheus" level=warn ts=2024-04-08T07:06:19.52908448Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-apiserver-operator/openshift-apiserver-operator level=warn ts=2024-04-08T07:06:19.529206143Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-apiserver/openshift-apiserver level=warn ts=2024-04-08T07:06:19.529264914Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-apiserver/openshift-apiserver-operator-check-endpoints level=warn ts=2024-04-08T07:06:19.529314545Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-authentication-operator/authentication-operator level=warn ts=2024-04-08T07:06:19.529363736Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-authentication/oauth-openshift level=warn ts=2024-04-08T07:06:19.529496399Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-cloud-credential-operator/cloud-credential-operator level=warn ts=2024-04-08T07:06:19.52954309Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-cluster-csi-drivers/aws-ebs-csi-driver-controller-monitor level=warn ts=2024-04-08T07:06:19.529610031Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-cluster-csi-drivers/aws-ebs-csi-driver-controller-monitor level=warn ts=2024-04-08T07:06:19.529675583Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-cluster-csi-drivers/aws-ebs-csi-driver-controller-monitor level=warn ts=2024-04-08T07:06:19.529722024Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-cluster-csi-drivers/aws-ebs-csi-driver-controller-monitor level=warn ts=2024-04-08T07:06:19.529773425Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-cluster-csi-drivers/aws-ebs-csi-driver-controller-monitor level=warn ts=2024-04-08T07:06:19.529840396Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-cluster-csi-drivers/shared-resource-csi-driver-node-monitor level=warn ts=2024-04-08T07:06:19.529940188Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." 
version=2.50.1 service_monitor=openshift-cluster-csi-drivers/shared-resource-csi-driver-node-monitor level=warn ts=2024-04-08T07:06:19.530042201Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-cluster-csi-drivers/shared-resource-csi-driver-node-monitor level=warn ts=2024-04-08T07:06:19.530145063Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-cluster-machine-approver/cluster-machine-approver level=warn ts=2024-04-08T07:06:19.530242295Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-cluster-version/cluster-version-operator level=warn ts=2024-04-08T07:06:19.530318036Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-config-operator/config-operator level=warn ts=2024-04-08T07:06:19.530379448Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-console-operator/console-operator level=warn ts=2024-04-08T07:06:19.530423309Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-console/console level=warn ts=2024-04-08T07:06:19.53046613Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-controller-manager-operator/openshift-controller-manager-operator level=warn ts=2024-04-08T07:06:19.530515121Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-controller-manager/openshift-controller-manager level=warn ts=2024-04-08T07:06:19.530600663Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-dns-operator/dns-operator level=warn ts=2024-04-08T07:06:19.530658014Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-dns/dns-default level=warn ts=2024-04-08T07:06:19.530718695Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-etcd-operator/etcd-operator level=warn ts=2024-04-08T07:06:19.530768006Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-image-registry/image-registry level=warn ts=2024-04-08T07:06:19.530829528Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-ingress-operator/ingress-operator level=warn ts=2024-04-08T07:06:19.530882449Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." 
version=2.50.1 service_monitor=openshift-ingress/router-default level=warn ts=2024-04-08T07:06:19.53093667Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-insights/insights-operator level=warn ts=2024-04-08T07:06:19.530991941Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-kube-apiserver-operator/kube-apiserver-operator level=warn ts=2024-04-08T07:06:19.531039122Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-kube-apiserver/kube-apiserver level=warn ts=2024-04-08T07:06:19.531094903Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-kube-controller-manager-operator/kube-controller-manager-operator level=warn ts=2024-04-08T07:06:19.531137024Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-kube-controller-manager/kube-controller-manager level=warn ts=2024-04-08T07:06:19.531180345Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-kube-scheduler-operator/kube-scheduler-operator level=warn ts=2024-04-08T07:06:19.531224986Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-kube-scheduler/kube-scheduler level=warn ts=2024-04-08T07:06:19.531270967Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-kube-scheduler/kube-scheduler level=warn ts=2024-04-08T07:06:19.531334098Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-machine-api/cluster-autoscaler-operator level=warn ts=2024-04-08T07:06:19.53138266Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-machine-api/machine-api-controllers level=warn ts=2024-04-08T07:06:19.5314245Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-machine-api/machine-api-controllers level=warn ts=2024-04-08T07:06:19.531463661Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-machine-api/machine-api-controllers level=warn ts=2024-04-08T07:06:19.531513562Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-machine-api/machine-api-operator level=warn ts=2024-04-08T07:06:19.531555783Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." 
version=2.50.1 service_monitor=openshift-machine-config-operator/machine-config-controller level=warn ts=2024-04-08T07:06:19.531626765Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-machine-config-operator/machine-config-daemon level=warn ts=2024-04-08T07:06:19.531689586Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-machine-config-operator/machine-config-operator level=warn ts=2024-04-08T07:06:19.531733467Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-marketplace/marketplace-operator level=warn ts=2024-04-08T07:06:19.532134636Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-monitoring/promtail-monitor level=warn ts=2024-04-08T07:06:19.532233158Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-multus/monitor-multus-admission-controller level=warn ts=2024-04-08T07:06:19.532507644Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-multus/monitor-network level=warn ts=2024-04-08T07:06:19.532567965Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-network-diagnostics/network-check-source level=warn ts=2024-04-08T07:06:19.532635257Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-network-operator/network-operator level=warn ts=2024-04-08T07:06:19.532683058Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-oauth-apiserver/openshift-oauth-apiserver level=warn ts=2024-04-08T07:06:19.532728279Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-operator-lifecycle-manager/catalog-operator level=warn ts=2024-04-08T07:06:19.53277187Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-operator-lifecycle-manager/olm-operator level=warn ts=2024-04-08T07:06:19.532821821Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-operator-lifecycle-manager/package-server-manager-metrics level=warn ts=2024-04-08T07:06:19.532863662Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-ovn-kubernetes/monitor-ovn-control-plane-metrics level=warn ts=2024-04-08T07:06:19.532904153Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." 
version=2.50.1 service_monitor=openshift-ovn-kubernetes/monitor-ovn-node level=warn ts=2024-04-08T07:06:19.532944204Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-ovn-kubernetes/monitor-ovn-node level=warn ts=2024-04-08T07:06:19.532990574Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-route-controller-manager/openshift-route-controller-manager level=warn ts=2024-04-08T07:06:19.533037166Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-service-ca-operator/service-ca-operator level=warn ts=2024-04-08T07:06:19.533089337Z caller=promcfg.go:1806 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1
Example ServiceMonitor with bearerTokenFile that causes the warning in prometheus-operator:
$ oc -n openshift-apiserver-operator get servicemonitor openshift-apiserver-operator -oyaml apiVersion: monitoring.coreos.com/v1 kind: ServiceMonitor ... spec: endpoints: - bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token interval: 30s metricRelabelings: - action: drop regex: etcd_(debugging|disk|request|server).* sourceLabels: ...
$ oc explain servicemonitor.spec.endpoints.bearerTokenFile
GROUP: monitoring.coreos.com
KIND: ServiceMonitor
VERSION: v1
FIELD: bearerTokenFile <string>
DESCRIPTION:
File to read bearer token for scraping the target.
Deprecated: use `authorization` instead.
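For reference, a minimal sketch of how a ServiceMonitor endpoint could use the `authorization` field instead of `bearerTokenFile`; the Secret name and key are hypothetical, and individual operators may choose a different mechanism for supplying the token:
~~~
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: openshift-apiserver-operator
  namespace: openshift-apiserver-operator
spec:
  endpoints:
  - interval: 30s
    # replaces the deprecated bearerTokenFile field
    authorization:
      type: Bearer
      credentials:
        # hypothetical Secret holding the scrape token
        name: metrics-client-token
        key: token
~~~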
Version-Release number of selected component (if applicable):
$ oc get clusterversion NAME VERSION AVAILABLE PROGRESSING SINCE STATUS version 4.16.0-0.nightly-2024-04-07-182401 True False 52m Cluster version is 4.16.0-0.nightly-2024-04-07-182401
How reproducible:
with Prometheus Operator 0.73.0
Steps to Reproduce:
1. check prometheus-operator logs
Actual results:
too many warnings for "'bearerTokenFile' is deprecated, use 'authorization' instead."
Expected results:
no warnings
Related with https://issues.redhat.com/browse/OCPBUGS-23000
The cluster autoscaler by default evicts all such pods, including those coming from DaemonSets. In the case of EFS CSI drivers, whose volumes are mounted as NFS volumes, this causes stale NFS mounts and prevents application workloads from terminating gracefully.
Version-Release number of selected component (if applicable):
4.11
How reproducible:
- While scaling down a node via the cluster-autoscaler-operator, the DaemonSet pods are being evicted.
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
CSI pods should not be evicted by the cluster autoscaler (at least not before workload termination), as this might produce data corruption.
Additional info:
It is possible to disable CSI pod eviction by adding the following annotation to the CSI driver pod: cluster-autoscaler.kubernetes.io/enable-ds-eviction: "false"
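A minimal sketch of where that annotation would go, assuming a hypothetical EFS CSI node DaemonSet; the annotation has to land on the pods, i.e. in the pod template metadata, not on the DaemonSet object itself:
~~~
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: efs-csi-node                       # hypothetical driver DaemonSet name
  namespace: openshift-cluster-csi-drivers
spec:
  selector:
    matchLabels:
      app: efs-csi-node
  template:
    metadata:
      labels:
        app: efs-csi-node
      annotations:
        # tells the cluster autoscaler not to evict these DaemonSet pods
        cluster-autoscaler.kubernetes.io/enable-ds-eviction: "false"
    spec:
      containers:
      - name: csi-driver
        image: registry.example.com/efs-csi-driver:latest   # placeholder image
~~~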
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
This is a clone of issue OCPBUGS-38963. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-33308. The following is the description of the original issue:
—
Description of problem:
When creating an OCP cluster on AWS and selecting "publish: Internal", the ingress operator may create external LB mappings to external subnets. This can occur if public subnets were specified in the install-config during installation. https://docs.openshift.com/container-platform/4.15/installing/installing_aws/installing-aws-private.html#private-clusters-about-aws_installing-aws-private A configuration validation should be added to the installer.
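A hedged install-config.yaml fragment illustrating the configuration the installer should reject (subnet IDs and other values are purely illustrative):
~~~
apiVersion: v1
metadata:
  name: private-cluster            # illustrative
publish: Internal
platform:
  aws:
    region: us-east-1
    # with publish: Internal, only private subnet IDs should be listed here;
    # a public subnet slipping into this list is what the validation should catch
    subnets:
    - subnet-0aaaaaaaaaaaaaaaa     # private subnet (illustrative ID)
    - subnet-0bbbbbbbbbbbbbbbb     # public subnet (illustrative ID)
~~~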
Version-Release number of selected component (if applicable):
4.14+ probably older versions as well.
How reproducible:
always
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Slack thread: https://redhat-internal.slack.com/archives/C68TNFWA2/p1714986876688959
Please review the following PR: https://github.com/openshift/kube-state-metrics/pull/108
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
All the apiservers:
Expose both `apiserver_request_slo_duration_seconds` and `apiserver_request_sli_duration_seconds`. The SLI metric was introduced in Kubernetes 1.26 as a replacement for `apiserver_request_slo_duration_seconds`, which was deprecated in Kubernetes 1.27. This change is only a renaming, so both metrics expose the same data. To avoid storing duplicated data in Prometheus, we need to drop `apiserver_request_slo_duration_seconds` in favor of `apiserver_request_sli_duration_seconds`.
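A minimal sketch of the kind of relabeling that would drop the deprecated metric at scrape time; which ServiceMonitor(s) actually need the rule is not specified here:
~~~
# ServiceMonitor endpoint fragment (illustrative)
metricRelabelings:
- action: drop
  sourceLabels:
  - __name__
  regex: apiserver_request_slo_duration_seconds
~~~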
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Please review the following PR: https://github.com/openshift/must-gather/pull/409
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/prometheus/pull/187
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/network-tools/pull/105
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
We need to make the controllerAvailabilityPolicy field immutable in the HostedCluster spec to ensure that customers cannot switch between SingleReplica and HighAvailability.
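One possible way to enforce this, sketched here as a CEL transition rule on the CRD schema; this is an assumption about the implementation, not the actual HyperShift change:
~~~
# fragment of the HostedCluster CRD schema (illustrative)
controllerAvailabilityPolicy:
  type: string
  x-kubernetes-validations:
  - rule: self == oldSelf
    message: controllerAvailabilityPolicy is immutable
~~~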
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Not an issue, just an upstream sync (or, if it is an issue: Multus is not up to date).
Description of problem:
There is a new zone in PowerVS called dal12. We need to add this zone to the list of supported zones in the installer.
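Once supported, selecting the new zone would look roughly like the following install-config.yaml fragment; the region/zone pairing shown here is an assumption for illustration:
~~~
platform:
  powervs:
    region: dal      # assumed region for the dal12 zone
    zone: dal12
~~~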
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. Deploy OpenShift cluster to the zone 2. 3.
Actual results:
Fails
Expected results:
Works
Additional info:
Please review the following PR: https://github.com/openshift/builder/pull/385
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
Snapshots taken to gather deprecation information from bundles are taken from the Subscription namespace instead of the CatalogSource namespace. This means that if the Subscription is in a different namespace from the CatalogSource, no bundles will be present in the snapshot.
How reproducible:
100%
Steps to Reproduce:
1. Create a CatalogSource with olm.deprecation entries. 2. Create a Subscription, in a different namespace, targeting a package with deprecations (see the sketch below).
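A minimal sketch of the reproducer resources, with illustrative names and namespaces:
~~~
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: my-catalog                 # illustrative
  namespace: olm-catalogs          # catalog namespace
spec:
  sourceType: grpc
  image: registry.example.com/my/catalog:latest   # catalog with olm.deprecation entries
---
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: my-operator
  namespace: operators             # different namespace from the CatalogSource
spec:
  channel: stable
  name: my-operator                # deprecated package
  source: my-catalog
  sourceNamespace: olm-catalogs
~~~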
Actual results:
No Deprecation Conditions will be present.
Expected results:
Deprecation Conditions should be present.
This is a clone of issue OCPBUGS-35368. The following is the description of the original issue:
—
Description of problem:
TestAllowedSourceRangesStatus test is flaking with the error: allowed_source_ranges_test.go:197: expected the annotation to be reflected in status.allowedSourceRanges: timed out waiting for the condition. I also notice it sometimes coincides with a TestScopeChange error. It may be related to LoadBalancer update operations, for example, https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_cluster-ingress-operator/978/pull-ci-openshift-cluster-ingress-operator-master-e2e-azure-operator/1800249453098045440
Version-Release number of selected component (if applicable):
4.17
How reproducible:
~25-50%
Steps to Reproduce:
1. Run cluster-ingress-operator TestAllowedSourceRangesStatus E2E tests 2. 3.
Actual results:
Test is flaking
Expected results:
Test shouldn't flake
Additional info:
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
If these pods are evicted, they lose all knowledge of existing DHCP leases, and any pods using DHCP IPAM will fail to renew their DHCP lease, even after the pod is re-created.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. Use a NAD with ipam: dhcp (a sketch follows below). 2. Delete the dhcp-daemon pod on the same node as your workload. 3. Observe the lease expire on the DHCP server / get reissued to a different pod, causing a network outage from duplicate addresses.
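A minimal NetworkAttachmentDefinition using DHCP IPAM for step 1; the name, namespace, and master interface are illustrative:
~~~
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: dhcp-net
  namespace: default
spec:
  config: |-
    {
      "cniVersion": "0.3.1",
      "type": "macvlan",
      "master": "ens3",
      "ipam": { "type": "dhcp" }
    }
~~~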
Actual results:
The dhcp-daemon pod is evicted before workload pods.
Expected results:
The dhcp-daemon pod should not get evicted before workloads, because it should run with system-node-critical priority.
Additional info:
All other Multus components run with system-node-critical priority: priority: 2000001000, priorityClassName: system-node-critical
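A sketch of the expected fix, mirroring the other Multus components (fragment of the dhcp-daemon DaemonSet pod template):
~~~
spec:
  template:
    spec:
      # run the DHCP daemon at the same priority as the other Multus components
      priorityClassName: system-node-critical
~~~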
Description of problem:
In Azure Stack, the Azure-Disk CSI Driver node pod CrashLoopBackOff: openshift-cluster-csi-drivers azure-disk-csi-driver-node-57rxv 1/3 CrashLoopBackOff 33 (3m55s ago) 59m 10.0.1.5 ci-op-q8b6n4iv-904ed-kp5mv-worker-mtcazs-m62cj <none> <none> openshift-cluster-csi-drivers azure-disk-csi-driver-node-8wvqm 1/3 CrashLoopBackOff 35 (29s ago) 67m 10.0.0.6 ci-op-q8b6n4iv-904ed-kp5mv-master-1 <none> <none> openshift-cluster-csi-drivers azure-disk-csi-driver-node-97ww5 1/3 CrashLoopBackOff 33 (12s ago) 67m 10.0.0.7 ci-op-q8b6n4iv-904ed-kp5mv-master-2 <none> <none> openshift-cluster-csi-drivers azure-disk-csi-driver-node-9hzw9 1/3 CrashLoopBackOff 35 (108s ago) 59m 10.0.1.4 ci-op-q8b6n4iv-904ed-kp5mv-worker-mtcazs-gjqmw <none> <none> openshift-cluster-csi-drivers azure-disk-csi-driver-node-glgzr 1/3 CrashLoopBackOff 34 (69s ago) 67m 10.0.0.8 ci-op-q8b6n4iv-904ed-kp5mv-master-0 <none> <none> openshift-cluster-csi-drivers azure-disk-csi-driver-node-hktfb 2/3 CrashLoopBackOff 48 (63s ago) 60m 10.0.1.6 ci-op-q8b6n4iv-904ed-kp5mv-worker-mtcazs-kdbpf <none> <none>
The CSI-Driver container log: panic: runtime error: invalid memory address or nil pointer dereference [signal SIGSEGV: segmentation violation code=0x1 addr=0xc8 pc=0x18ff5db] goroutine 228 [running]: sigs.k8s.io/cloud-provider-azure/pkg/provider.(*Cloud).GetZone(0xc00021ec00, {0xc0002d57d0?, 0xc00005e3e0?}) /go/src/github.com/openshift/azure-disk-csi-driver/vendor/sigs.k8s.io/cloud-provider-azure/pkg/provider/azure_zones.go:182 +0x2db sigs.k8s.io/azuredisk-csi-driver/pkg/azuredisk.(*Driver).NodeGetInfo(0xc000144000, {0x21ebbf0, 0xc0002d5470}, 0x273606a?) /go/src/github.com/openshift/azure-disk-csi-driver/pkg/azuredisk/nodeserver.go:336 +0x13b github.com/container-storage-interface/spec/lib/go/csi._Node_NodeGetInfo_Handler.func1({0x21ebbf0, 0xc0002d5470}, {0x1d71a60?, 0xc0003b0320}) /go/src/github.com/openshift/azure-disk-csi-driver/vendor/github.com/container-storage-interface/spec/lib/go/csi/csi.pb.go:7160 +0x72 sigs.k8s.io/azuredisk-csi-driver/pkg/csi-common.logGRPC({0x21ebbf0, 0xc0002d5470}, {0x1d71a60?, 0xc0003b0320?}, 0xc0003b0340, 0xc00050ae10) /go/src/github.com/openshift/azure-disk-csi-driver/pkg/csi-common/utils.go:80 +0x409 github.com/container-storage-interface/spec/lib/go/csi._Node_NodeGetInfo_Handler({0x1ec2f40?, 0xc000144000}, {0x21ebbf0, 0xc0002d5470}, 0xc000054680, 0x20167a0) /go/src/github.com/openshift/azure-disk-csi-driver/vendor/github.com/container-storage-interface/spec/lib/go/csi/csi.pb.go:7162 +0x135 google.golang.org/grpc.(*Server).processUnaryRPC(0xc000530000, {0x21ebbf0, 0xc0002d53b0}, {0x21f5f40, 0xc00057b1e0}, 0xc00011cb40, 0xc00052c810, 0x30fa1c8, 0x0) /go/src/github.com/openshift/azure-disk-csi-driver/vendor/google.golang.org/grpc/server.go:1343 +0xe03 google.golang.org/grpc.(*Server).handleStream(0xc000530000, {0x21f5f40, 0xc00057b1e0}, 0xc00011cb40) /go/src/github.com/openshift/azure-disk-csi-driver/vendor/google.golang.org/grpc/server.go:1737 +0xc4c google.golang.org/grpc.(*Server).serveStreams.func1.1() /go/src/github.com/openshift/azure-disk-csi-driver/vendor/google.golang.org/grpc/server.go:986 +0x86 created by google.golang.org/grpc.(*Server).serveStreams.func1 in goroutine 260 /go/src/github.com/openshift/azure-disk-csi-driver/vendor/google.golang.org/grpc/server.go:997 +0x145
The registrar container log: E0321 23:08:02.679727 1 main.go:103] Registration process failed with error: RegisterPlugin error -- plugin registration failed with err: rpc error: code = Unavailable desc = error reading from server: EOF, restarting registration container.
Version-Release number of selected component (if applicable):
4.16.0-0.nightly-2024-03-21-152650
How reproducible:
Seen in a CI profile; a manual install also failed earlier.
Steps to Reproduce:
See Description
Actual results:
Azure-Disk CSI Driver node pod CrashLoopBackOff
Expected results:
Azure-Disk CSI Driver node pod should be running
Additional info:
See gather-extra and must-gather: https://gcsweb-qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/qe-private-deck/logs/periodic-ci-openshift-openshift-tests-private-release-4.16-amd64-nightly-azure-stack-ipi-proxy-fips-f2/1770921405509013504/artifacts/azure-stack-ipi-proxy-fips-f2/
There was a glitch with the prometheus-adapter image after the 4.16 branching.
https://redhat-internal.slack.com/archives/C01CQA76KMX/p1702041610708869
Description of problem:
The installer does not precheck whether the node architecture and VM type are consistent on AWS and GCP; the check works on Azure.
Version-Release number of selected component (if applicable):
4.15.0-0.nightly-multi-2023-12-06-195439
How reproducible:
Always
Steps to Reproduce:
1. Configure the compute architecture field to arm64 but choose an amd64 instance type in the install-config (see the fragment below). 2. Create the cluster. 3. Check the installation.
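An install-config.yaml fragment reproducing the mismatch on AWS; the instance type is illustrative:
~~~
compute:
- name: worker
  architecture: arm64        # arm64 requested for the nodes...
  platform:
    aws:
      type: m5.xlarge        # ...but an amd64-only instance type
~~~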
Actual results:
Azure will precheck whether the architecture is consistent with the instance type when creating manifests, for example: 12-07 11:18:24.452 [INFO] Generating manifests files.....12-07 11:18:24.452 level=info msg=Credentials loaded from file "/home/jenkins/ws/workspace/ocp-common/Flexy-install/flexy/workdir/azurecreds20231207-285-jd7gpj" 12-07 11:18:56.474 level=error msg=failed to fetch Master Machines: failed to load asset "Install Config": failed to create install config: controlPlane.platform.azure.type: Invalid value: "Standard_D4ps_v5": instance type architecture 'Arm64' does not match install config architecture amd64 But AWS and GCP don't have this precheck; the installation fails later, after many resources have already been created. This case is more likely to happen in multi-arch clusters.
Expected results:
The installer should precheck that the architecture and VM type match, especially for platforms that support heterogeneous clusters (AWS, GCP, Azure).
Additional info:
Please review the following PR: https://github.com/openshift/agent-installer-utils/pull/32
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
Navigation: Workloads -> Deployments -> (select any Deployment from the list) -> Details -> Volumes -> Remove volume
Issue: The message "Are you sure you want to remove volume audit-policies from Deployment: apiserver?" is in English.
Observation: The translation is present in the release-4.15 branch file frontend/public/locales/ja/public.json
Version-Release number of selected component (if applicable):
4.15.0-rc.3
How reproducible:
Always
Steps to Reproduce:
1. 2. 3.
Actual results:
Content is in English
Expected results:
Content should be in selected language
Additional info:
Reference screenshot attached.
Description of problem:
- One node (the rendezvous node) failed to join the cluster and there are some pending CSRs. - omc get csr NAME AGE SIGNERNAME REQUESTOR REQUESTEDDURATION CONDITION csr-44qjs 21m kubernetes.io/kube-apiserver-client-kubelet system:serviceaccount:openshift-machine-config-operator:node-bootstrapper <none> Pending csr-9n9hc 5m kubernetes.io/kube-apiserver-client-kubelet system:serviceaccount:openshift-machine-config-operator:node-bootstrapper <none> Pending csr-9xw24 1h kubernetes.io/kube-apiserver-client-kubelet system:serviceaccount:openshift-machine-config-operator:node-bootstrapper <none> Pending csr-brm6f 1h kubernetes.io/kube-apiserver-client-kubelet system:serviceaccount:openshift-machine-config-operator:node-bootstrapper <none> Pending csr-dz75g 36m kubernetes.io/kube-apiserver-client-kubelet system:serviceaccount:openshift-machine-config-operator:node-bootstrapper <none> Pending csr-l8c7v 1h kubernetes.io/kube-apiserver-client-kubelet system:serviceaccount:openshift-machine-config-operator:node-bootstrapper <none> Pending csr-mv7w5 52m kubernetes.io/kube-apiserver-client-kubelet system:serviceaccount:openshift-machine-config-operator:node-bootstrapper <none> Pending csr-v6pgd 1h kubernetes.io/kube-apiserver-client-kubelet system:serviceaccount:openshift-machine-config-operator:node-bootstrapper <none> Pending
In order to complete the installation, the customer needs to approve those CSRs manually.
Steps to Reproduce:
agent-based installation.
Actual results:
The CSRs are in the Pending state.
Expected results:
The CSRs should be approved automatically.
Additional info:
Logs : https://drive.google.com/drive/folders/1UCgC6oMx28k-_WXy8w1iN_t9h9rtmnfo?usp=sharing
This is a clone of issue OCPBUGS-34649. The following is the description of the original issue:
—
Description of problem:
According to https://github.com/openshift/enhancements/pull/1502, all managed TLS artifacts (secrets, configmaps and files on disk) should have clear ownership and other necessary metadata. `metal3-ironic-tls` is created by cluster-baremetal-operator but doesn't have an ownership annotation.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
This is a clone of issue OCPBUGS-39419. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-38794. The following is the description of the original issue:
—
Description of problem:
HCP cluster is being updated but the nodepool is stuck updating: ~~~ NAME CLUSTER DESIRED NODES CURRENT NODES AUTOSCALING AUTOREPAIR VERSION UPDATINGVERSION UPDATINGCONFIG MESSAGE nodepool-dev-cluster dev 2 2 False False 4.15.22 True True ~~~
Version-Release number of selected component (if applicable):
Hosting OCP cluster 4.15 HCP 4.15.23
How reproducible:
N/A
Steps to Reproduce:
1. 2. 3.
Actual results:
Nodepool stuck in upgrade
Expected results:
Upgrade success
Additional info:
I have found this error repeating continually in the ignition-server pods: ~~~ {"level":"error","ts":"2024-08-20T09:02:19Z","msg":"Reconciler error","controller":"secret","controllerGroup":"","controllerKind":"Secret","Secret":{"name":"token-nodepool-dev-cluster-3146da34","namespace":"dev-dev"},"namespace":"dev-dev","name":"token-nodepool-dev-cluster-3146da34","reconcileID":"ec1f0a7f-1657-4245-99ef-c984977ff0f8","error":"error getting ignition payload: failed to download binaries: failed to extract image file: failed to extract image file: file not found","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/hypershift/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:329\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/hypershift/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/hypershift/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:227"} {"level":"info","ts":"2024-08-20T09:02:20Z","logger":"get-payload","msg":"discovered machine-config-operator image","image":"quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:f3b55cc8f88b9e6564fe6ad0bc431cd7270c0586a06d9b4a19ff2b518c461ede"} {"level":"info","ts":"2024-08-20T09:02:20Z","logger":"get-payload","msg":"created working directory","dir":"/payloads/get-payload4089452863"} {"level":"info","ts":"2024-08-20T09:02:28Z","logger":"get-payload","msg":"extracted image-references","time":"8s"} {"level":"info","ts":"2024-08-20T09:02:38Z","logger":"get-payload","msg":"extracted templates","time":"10s"} {"level":"info","ts":"2024-08-20T09:02:38Z","logger":"image-cache","msg":"retrieved cached file","imageRef":"quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:f3b55cc8f88b9e6564fe6ad0bc431cd7270c0586a06d9b4a19ff2b518c461ede","file":"usr/lib/os-release"} {"level":"info","ts":"2024-08-20T09:02:38Z","logger":"get-payload","msg":"read os-release","mcoRHELMajorVersion":"8","cpoRHELMajorVersion":"9"} {"level":"info","ts":"2024-08-20T09:02:38Z","logger":"get-payload","msg":"copying file","src":"usr/bin/machine-config-operator.rhel9","dest":"/payloads/get-payload4089452863/bin/machine-config-operator"} ~~~
This is a clone of issue OCPBUGS-34638. The following is the description of the original issue:
—
Description of problem:
For a cluster that has one worker machine of the A3 instance type, "destroy cluster" keeps reporting the failure below until the instance is stopped via "gcloud". WARNING failed to stop instance jiwei-0530b-q9t8w-worker-c-ck6s8 in zone us-central1-c: googleapi: Error 400: VM has a Local SSD attached but an undefined value for `discard-local-ssd`. If using gcloud, please add `--discard-local-ssd=false` or `--discard-local-ssd=true` to your command., badRequest
Version-Release number of selected component (if applicable):
4.16.0-0.nightly-multi-2024-05-29-143245
How reproducible:
Always
Steps to Reproduce:
1. "create install-config" and then "create manifests" 2. edit a worker machineset YAML, to specify "machineType: a3-highgpu-8g" along with "onHostMaintenance: Terminate" 3. "create cluster", and make sure it succeeds 4. "destroy cluster"
Actual results:
Uninstalling the cluster keeps reporting the error while stopping the instance.
Expected results:
"destroy cluster" should proceed without any warning/error, and delete everything finally.
Additional info:
FYI the .openshift-install.log is available at https://drive.google.com/file/d/15xIwzi0swDk84wqg32tC_4KfUahCalrL/view?usp=drive_link FYI to stop the A3 instance via "gcloud" by specifying "--discard-local-ssd=false" does succeed. $ gcloud compute instances list --format="table(creationTimestamp.date('%Y-%m-%d %H:%M:%S'):sort=1,zone,status,name,machineType,tags.items)" --filter="name~jiwei" 2>/dev/null CREATION_TIMESTAMP ZONE STATUS NAME MACHINE_TYPE ITEMS 2024-05-29 20:55:52 us-central1-a TERMINATED jiwei-0530b-q9t8w-master-0 n2-standard-4 ['jiwei-0530b-q9t8w-master'] 2024-05-29 20:55:52 us-central1-b TERMINATED jiwei-0530b-q9t8w-master-1 n2-standard-4 ['jiwei-0530b-q9t8w-master'] 2024-05-29 20:55:52 us-central1-c TERMINATED jiwei-0530b-q9t8w-master-2 n2-standard-4 ['jiwei-0530b-q9t8w-master'] 2024-05-29 21:10:08 us-central1-a TERMINATED jiwei-0530b-q9t8w-worker-a-rkxkk n2-standard-4 ['jiwei-0530b-q9t8w-worker'] 2024-05-29 21:10:19 us-central1-b TERMINATED jiwei-0530b-q9t8w-worker-b-qg6jv n2-standard-4 ['jiwei-0530b-q9t8w-worker'] 2024-05-29 21:10:31 us-central1-c RUNNING jiwei-0530b-q9t8w-worker-c-ck6s8 a3-highgpu-8g ['jiwei-0530b-q9t8w-worker'] $ gcloud compute instances stop jiwei-0530b-q9t8w-worker-c-ck6s8 --zone us-central1-c ERROR: (gcloud.compute.instances.stop) HTTPError 400: VM has a Local SSD attached but an undefined value for `discard-local-ssd`. If using gcloud, please add `--discard-local-ssd=false` or `--discard-local-ssd=true` to your command. $ gcloud compute instances stop jiwei-0530b-q9t8w-worker-c-ck6s8 --zone us-central1-c --discard-local-ssd=false Stopping instance(s) jiwei-0530b-q9t8w-worker-c-ck6s8...done. Updated [https://compute.googleapis.com/compute/v1/projects/openshift-qe/zones/us-central1-c/instances/jiwei-0530b-q9t8w-worker-c-ck6s8]. $ gcloud compute instances list --format="table(creationTimestamp.date('%Y-%m-%d %H:%M:%S'):sort=1,zone,status,name,machineType,tags.items)" --filter="name~jiwei" 2>/dev/null CREATION_TIMESTAMP ZONE STATUS NAME MACHINE_TYPE ITEMS 2024-05-29 20:55:52 us-central1-a TERMINATED jiwei-0530b-q9t8w-master-0 n2-standard-4 ['jiwei-0530b-q9t8w-master'] 2024-05-29 20:55:52 us-central1-b TERMINATED jiwei-0530b-q9t8w-master-1 n2-standard-4 ['jiwei-0530b-q9t8w-master'] 2024-05-29 20:55:52 us-central1-c TERMINATED jiwei-0530b-q9t8w-master-2 n2-standard-4 ['jiwei-0530b-q9t8w-master'] 2024-05-29 21:10:08 us-central1-a TERMINATED jiwei-0530b-q9t8w-worker-a-rkxkk n2-standard-4 ['jiwei-0530b-q9t8w-worker'] 2024-05-29 21:10:19 us-central1-b TERMINATED jiwei-0530b-q9t8w-worker-b-qg6jv n2-standard-4 ['jiwei-0530b-q9t8w-worker'] 2024-05-29 21:10:31 us-central1-c TERMINATED jiwei-0530b-q9t8w-worker-c-ck6s8 a3-highgpu-8g ['jiwei-0530b-q9t8w-worker'] $ gcloud compute instances delete -q jiwei-0530b-q9t8w-worker-c-ck6s8 --zone us-central1-c Deleted [https://www.googleapis.com/compute/v1/projects/openshift-qe/zones/us-central1-c/instances/jiwei-0530b-q9t8w-worker-c-ck6s8]. $
Please review the following PR: https://github.com/openshift/route-override-cni/pull/51
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/aws-ebs-csi-driver/pull/246
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/oc/pull/1627
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
We should add a Dockerfile that is optimized for running the tests locally. (The current Dockerfile assumes it is running with the CI setup.)
Description of problem:
When the user configures the install-config.yaml additionalTrustBundle field (for example, in a disconnected installation using a local registry), the user-ca-bundle configmap gets populated with more content than strictly required
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. Setup a local registry and mirror the content of an ocp release 2. Configure the install-config.yaml for a mirrored installation. In particular, configure the additionalTrustBundle field with the registry cert 3. Create the agent ISO, boot the nodes and wait for the installation to complete
Actual results:
The user-ca-bundle configmap does not contain only the registry cert; it is populated with additional CA content.
Expected results:
user-ca-bundle configmap with just the content of the install-config additionalTrustBundle field
Additional info:
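A quick way to see how much extra content ended up in the generated configmap, compared with the single registry cert supplied in additionalTrustBundle (a minimal sketch, assuming the standard openshift-config/user-ca-bundle location and the ca-bundle.crt data key):
$ oc get configmap user-ca-bundle -n openshift-config -o jsonpath='{.data.ca-bundle\.crt}' > user-ca-bundle.crt
$ # count the certificates present; with only the registry cert expected, this should be 1
$ grep -c 'BEGIN CERTIFICATE' user-ca-bundle.crt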
Description of problem:
- Investigate why the `name` and `namespace` properties are passed as arguments to the `k8sCreate` call in the Create YAML editor function. - Remove the `name` and `namespace` arguments from that `k8sCreate` call if it does not require a big change. Problem: if consoleFetchCommon takes an additional option (argument) and returns a response based on that option, as proposed in the "[Add support for returning response.header in consoleFetchCommon function|https://issues.redhat.com/browse/CONSOLE-3949]" story, the wrong and unused arguments in k8sCreate would cause consoleFetchCommon to return the entire response instead of the response body, which would break the Create Resource YAML functionality. Code: https://github.com/openshift/console/blob/master/frontend/public/components/edit-yaml.jsx#L334
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Please review the following PR: https://github.com/openshift/kubernetes-metrics-server/pull/24
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
When images have been skipped and no images have been mirrored, IDMS and ITMS files are still generated.
2024/05/15 15:38:25 [WARN] : ⚠️ --v2 flag identified, flow redirected to the oc-mirror v2 version. This is Tech Preview, it is still under development and it is not production ready.
2024/05/15 15:38:25 [INFO] : 👋 Hello, welcome to oc-mirror
2024/05/15 15:38:25 [INFO] : ⚙️ setting up the environment for you...
2024/05/15 15:38:25 [INFO] : 🔀 workflow mode: mirrorToMirror
2024/05/15 15:38:25 [INFO] : 🕵️ going to discover the necessary images...
2024/05/15 15:38:25 [INFO] : 🔍 collecting release images...
2024/05/15 15:38:25 [INFO] : 🔍 collecting operator images...
2024/05/15 15:38:25 [INFO] : 🔍 collecting additional images...
2024/05/15 15:38:25 [WARN] : [AdditionalImagesCollector] mirroring skipped : source image quay.io/cilium/cilium-etcd-operator:v2.0.7@sha256:04b8327f7f992693c2cb483b999041ed8f92efc8e14f2a5f3ab95574a65ea2dc has both tag and digest
2024/05/15 15:38:25 [WARN] : [AdditionalImagesCollector] mirroring skipped : source image quay.io/coreos/etcd:v3.5.4@sha256:a67fb152d4c53223e96e818420c37f11d05c2d92cf62c05ca5604066c37295e9 has both tag and digest
2024/05/15 15:38:25 [INFO] : 🚀 Start copying the images...
2024/05/15 15:38:25 [INFO] : === Results ===
2024/05/15 15:38:25 [INFO] : All release images mirrored successfully 0 / 0 ✅
2024/05/15 15:38:25 [INFO] : All operator images mirrored successfully 0 / 0 ✅
2024/05/15 15:38:25 [INFO] : All additional images mirrored successfully 0 / 0 ✅
2024/05/15 15:38:25 [INFO] : 📄 Generating IDMS and ITMS files...
2024/05/15 15:38:25 [INFO] : /app1/knarra/customertest1/working-dir/cluster-resources/idms-oc-mirror.yaml file created
2024/05/15 15:38:25 [INFO] : 📄 Generating CatalogSource file...
2024/05/15 15:38:25 [INFO] : mirror time : 715.644µs
2024/05/15 15:38:25 [INFO] : 👋 Goodbye, thank you for using oc-mirror
[fedora@preserve-fedora36 knarra]$ ls -l /app1/knarra/customertest1/working-dir/cluster-resources/idms-oc-mirror.yaml
-rw-r--r--. 1 fedora fedora 0 May 15 15:38 /app1/knarra/customertest1/working-dir/cluster-resources/idms-oc-mirror.yaml
[fedora@preserve-fedora36 knarra]$ cat /app1/knarra/customertest1/working-dir/cluster-resources/idms-oc-mirror.yaml
Version-Release number of selected component (if applicable):
4.16 oc-mirror
How reproducible:
Always
Steps to Reproduce:
1. Use the following imageSetConfig.yaml and run command `./oc-mirror --v2 -c /tmp/bug331961.yaml --workspace file:///app1/knarra/customertest1 docker://localhost:5000/bug331961 --dest-tls-verify=false` cat /tmp/imageSetConfig.yaml kind: ImageSetConfiguration apiVersion: mirror.openshift.io/v2alpha1 mirror: additionalImages: - name: quay.io/cilium/cilium-etcd-operator:v2.0.7@sha256:04b8327f7f992693c2cb483b999041ed8f92efc8e14f2a5f3ab95574a65ea2dc - name: quay.io/coreos/etcd:v3.5.4@sha256:a67fb152d4c53223e96e818420c37f11d05c2d92cf62c05ca5604066c37295e9
Actual results:
Nothing is mirrored and the listed images are skipped because they have both a tag and a digest, yet empty IDMS and ITMS files are still generated.
Expected results:
If nothing is mirrored, idms and itms files should not be generated.
Additional info:
https://issues.redhat.com/browse/OCPBUGS-33196
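Until the tool stops emitting empty files, a simple guard before applying the generated cluster resources can skip zero-length files (a sketch only; the path matches the working-dir from this report):
$ for f in /app1/knarra/customertest1/working-dir/cluster-resources/*.yaml; do
    if [ -s "$f" ]; then oc create -f "$f"; else echo "skipping empty file $f"; fi
  done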
This is a clone of issue OCPBUGS-35236. The following is the description of the original issue:
—
Attempts to update a cluster to a release payload with a signature published by Red Hat fail with the CVO unable to verify the signature, signalled by the ReleaseAccepted=False condition:
Retrieving payload failed version="4.16.0-rc.4" image="quay.io/openshift-release-dev/ocp-release@sha256:5e76f8c2cdc81fa40abb809ee5e2d56cb84f409aab773aa9b9c7e8ed8811bf74" failure=The update cannot be verified: unable to verify sha256:5e76f8c2cdc81fa40abb809ee5e2d56cb84f409aab773aa9b9c7e8ed8811bf74 against keyrings: verifier-public-key-redhat
CVO shows evidence of not being able to find the proper signature in its stores:
$ grep verifier-public-key-redhat cvo.log | head I0610 07:38:16.208595 1 event.go:364] Event(v1.ObjectReference{Kind:"ClusterVersion", Namespace:"openshift-cluster-version", Name:"version", UID:"", APIVersion:"config.openshift.io/v1", ResourceVersion:"", FieldPath:""}): type: 'Warning' reason: 'RetrievePayloadFailed' Retrieving payload failed version="4.16.0-rc.4" image="quay.io/openshift-release-dev/ocp-release@sha256:5e76f8c2cdc81fa40abb809ee5e2d56cb84f409aab773aa9b9c7e8ed8811bf74" failure=The update cannot be verified: unable to verify sha256:5e76f8c2cdc81fa40abb809ee5e2d56cb84f409aab773aa9b9c7e8ed8811bf74 against keyrings: verifier-public-key-redhat // [2024-06-10T07:38:16Z: prefix sha256-5e76f8c2cdc81fa40abb809ee5e2d56cb84f409aab773aa9b9c7e8ed8811bf74 in config map signatures-managed: no more signatures to check, 2024-06-10T07:38:16Z: ClusterVersion spec.signatureStores is an empty array. Unset signatureStores entirely if you want to enable the default signature stores] ...
Version-Release number of selected component (if applicable):
4.16.0-rc.3
4.16.0-rc.4
4.17.0-ec.0
How reproducible:
Seems always. All CI build farm clusters showed this behavior when trying to update from 4.16.0-rc.3
Steps to Reproduce:
1. Launch update to a version with a signature published by RH
Actual results:
ReleaseAccepted=False and update is stuck
Expected results:
ReleaseAccepted=True and update proceeds
Suspected culprit is https://github.com/openshift/cluster-version-operator/pull/1030/ so the fix may be a revert or an attempt to fix-forward, but revert seems safer at this point.
Evidence:
[1]
...ClusterVersion spec.signatureStores is an empty array. Unset signatureStores entirely if you want to enable the default signature store... W0610 07:58:59.095970 1 warnings.go:70] unknown field "spec.signatureStores"
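The CVO message itself points at the workaround: unset spec.signatureStores entirely so the default signature stores apply again. A minimal sketch of that check and patch (assuming the field is present and editable on the affected cluster):
$ # confirm the field exists but is an empty array
$ oc get clusterversion version -o jsonpath='{.spec.signatureStores}'
$ # remove the field entirely so the default signature stores are used
$ oc patch clusterversion version --type json -p '[{"op":"remove","path":"/spec/signatureStores"}]'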
https://github.com/openshift/runbooks/blob/master/alerts/AggregatedAPIErrors.md exists but is unused by us. We should
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
the example namespaced page is not working
Version-Release number of selected component (if applicable):
4.16.0-0.nightly-2024-04-22-023835
How reproducible:
Always
Steps to Reproduce:
1. Deploy console-demo-plugin manifests, enable the plugin $ oc apply -f https://raw.githubusercontent.com/openshift/console/master/dynamic-demo-plugin/oc-manifest.yaml $ oc patch console.operator cluster --type='json' -p='[{"op": "add", "path": "/spec/plugins/-", "value":"console-demo-plugin"}]' 2. Change to Demo perspective, click on `Example Namespaced Page` menu
Actual results:
2. An error page is returned: Cannot destructure property 'ns' of '(intermediate value)(intermediate value)(intermediate value)' as it is undefined.
Expected results:
2. A page with a namespace dropdown menu should be rendered
Additional info:
Remove odepaz/osherdp from all systems in Red Hat.
This is a clone of issue OCPBUGS-39209. The following is the description of the original issue:
—
Description of problem:
Attempting to Migrate from OpenShiftSDN to OVNKubernetes but experiencing the below Error once the Limited Live Migration is started.
+ exec /usr/bin/hybrid-overlay-node --node ip-10-241-1-192.us-east-2.compute.internal --config-file=/run/ovnkube-config/ovnkube.conf --bootstrap-kubeconfig=/var/lib/kubelet/kubeconfig --cert-dir=/etc/ovn/ovnkube-node-certs --cert-duration=24h I0829 14:06:20.313928 82345 config.go:2192] Parsed config file /run/ovnkube-config/ovnkube.conf I0829 14:06:20.314202 82345 config.go:2193] Parsed config: {Default:{MTU:8901 RoutableMTU:0 ConntrackZone:64000 HostMasqConntrackZone:0 OVNMasqConntrackZone:0 HostNodePortConntrackZone:0 ReassemblyConntrackZone:0 EncapType:geneve EncapIP: EncapPort:6081 InactivityProbe:100000 OpenFlowProbe:180 OfctrlWaitBeforeClear:0 MonitorAll:true OVSDBTxnTimeout:1m40s LFlowCacheEnable:true LFlowCacheLimit:0 LFlowCacheLimitKb:1048576 RawClusterSubnets:100.64.0.0/15/23 ClusterSubnets:[] EnableUDPAggregation:true Zone:global} Logging:{File: CNIFile: LibovsdbFile:/var/log/ovnkube/libovsdb.log Level:4 LogFileMaxSize:100 LogFileMaxBackups:5 LogFileMaxAge:0 ACLLoggingRateLimit:20} Monitoring:{RawNetFlowTargets: RawSFlowTargets: RawIPFIXTargets: NetFlowTargets:[] SFlowTargets:[] IPFIXTargets:[]} IPFIX:{Sampling:400 CacheActiveTimeout:60 CacheMaxFlows:0} CNI:{ConfDir:/etc/cni/net.d Plugin:ovn-k8s-cni-overlay} OVNKubernetesFeature:{EnableAdminNetworkPolicy:true EnableEgressIP:true EgressIPReachabiltyTotalTimeout:1 EnableEgressFirewall:true EnableEgressQoS:true EnableEgressService:true EgressIPNodeHealthCheckPort:9107 EnableMultiNetwork:true EnableMultiNetworkPolicy:false EnableStatelessNetPol:false EnableInterconnect:false EnableMultiExternalGateway:true EnablePersistentIPs:false EnableDNSNameResolver:false EnableServiceTemplateSupport:false} Kubernetes:{BootstrapKubeconfig: CertDir: CertDuration:10m0s Kubeconfig: CACert: CAData:[] APIServer:https://api-int.nonamenetwork.sandbox1730.opentlc.com:6443 Token: TokenFile: CompatServiceCIDR: RawServiceCIDRs:198.18.0.0/16 ServiceCIDRs:[] OVNConfigNamespace:openshift-ovn-kubernetes OVNEmptyLbEvents:false PodIP: RawNoHostSubnetNodes:migration.network.openshift.io/plugin= NoHostSubnetNodes:<nil> HostNetworkNamespace:openshift-host-network PlatformType:AWS HealthzBindAddress:0.0.0.0:10256 CompatMetricsBindAddress: CompatOVNMetricsBindAddress: CompatMetricsEnablePprof:false DNSServiceNamespace:openshift-dns DNSServiceName:dns-default} Metrics:{BindAddress: OVNMetricsBindAddress: ExportOVSMetrics:false EnablePprof:false NodeServerPrivKey: NodeServerCert: EnableConfigDuration:false EnableScaleMetrics:false} OvnNorth:{Address: PrivKey: Cert: CACert: CertCommonName: Scheme: ElectionTimer:0 northbound:false exec:<nil>} OvnSouth:{Address: PrivKey: Cert: CACert: CertCommonName: Scheme: ElectionTimer:0 northbound:false exec:<nil>} Gateway:{Mode:shared Interface: EgressGWInterface: NextHop: VLANID:0 NodeportEnable:true DisableSNATMultipleGWs:false V4JoinSubnet:100.64.0.0/16 V6JoinSubnet:fd98::/64 V4MasqueradeSubnet:169.254.169.0/29 V6MasqueradeSubnet:fd69::/125 MasqueradeIPs:{V4OVNMasqueradeIP:169.254.169.1 V6OVNMasqueradeIP:fd69::1 V4HostMasqueradeIP:169.254.169.2 V6HostMasqueradeIP:fd69::2 V4HostETPLocalMasqueradeIP:169.254.169.3 V6HostETPLocalMasqueradeIP:fd69::3 V4DummyNextHopMasqueradeIP:169.254.169.4 V6DummyNextHopMasqueradeIP:fd69::4 V4OVNServiceHairpinMasqueradeIP:169.254.169.5 V6OVNServiceHairpinMasqueradeIP:fd69::5} DisablePacketMTUCheck:false RouterSubnet: SingleNode:false DisableForwarding:false AllowNoUplink:false} MasterHA:{ElectionLeaseDuration:137 ElectionRenewDeadline:107 ElectionRetryPeriod:26} ClusterMgrHA:{ElectionLeaseDuration:137 
ElectionRenewDeadline:107 ElectionRetryPeriod:26} HybridOverlay:{Enabled:true RawClusterSubnets: ClusterSubnets:[] VXLANPort:4789} OvnKubeNode:{Mode:full DPResourceDeviceIdsMap:map[] MgmtPortNetdev: MgmtPortDPResourceName:} ClusterManager:{V4TransitSwitchSubnet:100.88.0.0/16 V6TransitSwitchSubnet:fd97::/64}} F0829 14:06:20.315468 82345 hybrid-overlay-node.go:54] illegal network configuration: built-in join subnet "100.64.0.0/16" overlaps cluster subnet "100.64.0.0/15"
The OpenShift Container Platform 4 cluster has been installed with the configuration below and therefore has a conflict, because the clusterNetwork overlaps with the OVNKubernetes join subnet.
$ oc get cm -n kube-system cluster-config-v1 -o yaml
apiVersion: v1
data:
install-config: |
additionalTrustBundlePolicy: Proxyonly
apiVersion: v1
baseDomain: sandbox1730.opentlc.com
compute:
- architecture: amd64
hyperthreading: Enabled
name: worker
platform: {}
replicas: 3
controlPlane:
architecture: amd64
hyperthreading: Enabled
name: master
platform: {}
replicas: 3
metadata:
creationTimestamp: null
name: nonamenetwork
networking:
clusterNetwork:
- cidr: 100.64.0.0/15
hostPrefix: 23
machineNetwork:
- cidr: 10.241.0.0/16
networkType: OpenShiftSDN
serviceNetwork:
- 198.18.0.0/16
platform:
aws:
region: us-east-2
publish: External
pullSecret: ""
Following the documented procedure, the steps below were executed, but the problem is still being reported.
oc patch network.operator.openshift.io cluster --type='merge' -p='{"spec":{"defaultNetwork":{"ovnKubernetesConfig":{"ipv4":{"internalJoinSubnet": "100.68.0.0/16"}}}}}'
Checking whether the change was applied shows that it is configured:
$ oc get network.operator cluster -o yaml apiVersion: operator.openshift.io/v1 kind: Network metadata: creationTimestamp: "2024-08-29T10:05:36Z" generation: 376 name: cluster resourceVersion: "135345" uid: 37f08c71-98fa-430c-b30f-58f82142788c spec: clusterNetwork: - cidr: 100.64.0.0/15 hostPrefix: 23 defaultNetwork: openshiftSDNConfig: enableUnidling: true mode: NetworkPolicy mtu: 8951 vxlanPort: 4789 ovnKubernetesConfig: egressIPConfig: {} gatewayConfig: ipv4: {} ipv6: {} routingViaHost: false genevePort: 6081 ipsecConfig: mode: Disabled ipv4: internalJoinSubnet: 100.68.0.0/16 mtu: 8901 policyAuditConfig: destination: "null" maxFileSize: 50 maxLogFiles: 5 rateLimit: 20 syslogFacility: local0 type: OpenShiftSDN deployKubeProxy: false disableMultiNetwork: false disableNetworkDiagnostics: false kubeProxyConfig: bindAddress: 0.0.0.0 logLevel: Normal managementState: Managed migration: mode: Live networkType: OVNKubernetes observedConfig: null operatorLogLevel: Normal serviceNetwork: - 198.18.0.0/16 unsupportedConfigOverrides: null useMultiNetworkPolicy: false
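Besides the operator CR, it may also help to check whether the override was actually rendered into the node-level OVN-Kubernetes configuration (a diagnostic sketch, assuming the standard ovnkube-config configmap in the openshift-ovn-kubernetes namespace):
$ # look for the join subnet values that ovnkube-node will consume via /run/ovnkube-config/ovnkube.conf
$ oc get cm ovnkube-config -n openshift-ovn-kubernetes -o yaml | grep -i -A1 join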
Following the above, the Limited Live Migration is triggered, which then suddenly stops because of the error shown.
oc patch Network.config.openshift.io cluster --type='merge' --patch '{"metadata":{"annotations":{"network.openshift.io/network-type-migration":""}},"spec":{"networkType":"OVNKubernetes"}}'
Version-Release number of selected component (if applicable):
OpenShift Container Platform 4.16.9
How reproducible:
Always
Steps to Reproduce:
1. Install OpenShift Container Platform 4 with OpenShiftSDN, the configuration shown above and then update to OpenShift Container Platform 4.16
2. Change internalJoinSubnet to prevent a conflict with the Join Subnet of OVNKubernetes (oc patch network.operator.openshift.io cluster --type='merge' -p='{"spec":{"defaultNetwork":{"ovnKubernetesConfig":{"ipv4":{"internalJoinSubnet": "100.68.0.0/16"}}}}}')
3. Initiate the Limited Live Migration running oc patch Network.config.openshift.io cluster --type='merge' --patch '{"metadata":{"annotations":{"network.openshift.io/network-type-migration":""}},"spec":{"networkType":"OVNKubernetes"}}'
4. Check the logs of ovnkube-node using oc logs ovnkube-node-XXXXX -c ovnkube-controller
Actual results:
+ exec /usr/bin/hybrid-overlay-node --node ip-10-241-1-192.us-east-2.compute.internal --config-file=/run/ovnkube-config/ovnkube.conf --bootstrap-kubeconfig=/var/lib/kubelet/kubeconfig --cert-dir=/etc/ovn/ovnkube-node-certs --cert-duration=24h I0829 14:06:20.313928 82345 config.go:2192] Parsed config file /run/ovnkube-config/ovnkube.conf I0829 14:06:20.314202 82345 config.go:2193] Parsed config: {Default:{MTU:8901 RoutableMTU:0 ConntrackZone:64000 HostMasqConntrackZone:0 OVNMasqConntrackZone:0 HostNodePortConntrackZone:0 ReassemblyConntrackZone:0 EncapType:geneve EncapIP: EncapPort:6081 InactivityProbe:100000 OpenFlowProbe:180 OfctrlWaitBeforeClear:0 MonitorAll:true OVSDBTxnTimeout:1m40s LFlowCacheEnable:true LFlowCacheLimit:0 LFlowCacheLimitKb:1048576 RawClusterSubnets:100.64.0.0/15/23 ClusterSubnets:[] EnableUDPAggregation:true Zone:global} Logging:{File: CNIFile: LibovsdbFile:/var/log/ovnkube/libovsdb.log Level:4 LogFileMaxSize:100 LogFileMaxBackups:5 LogFileMaxAge:0 ACLLoggingRateLimit:20} Monitoring:{RawNetFlowTargets: RawSFlowTargets: RawIPFIXTargets: NetFlowTargets:[] SFlowTargets:[] IPFIXTargets:[]} IPFIX:{Sampling:400 CacheActiveTimeout:60 CacheMaxFlows:0} CNI:{ConfDir:/etc/cni/net.d Plugin:ovn-k8s-cni-overlay} OVNKubernetesFeature:{EnableAdminNetworkPolicy:true EnableEgressIP:true EgressIPReachabiltyTotalTimeout:1 EnableEgressFirewall:true EnableEgressQoS:true EnableEgressService:true EgressIPNodeHealthCheckPort:9107 EnableMultiNetwork:true EnableMultiNetworkPolicy:false EnableStatelessNetPol:false EnableInterconnect:false EnableMultiExternalGateway:true EnablePersistentIPs:false EnableDNSNameResolver:false EnableServiceTemplateSupport:false} Kubernetes:{BootstrapKubeconfig: CertDir: CertDuration:10m0s Kubeconfig: CACert: CAData:[] APIServer:https://api-int.nonamenetwork.sandbox1730.opentlc.com:6443 Token: TokenFile: CompatServiceCIDR: RawServiceCIDRs:198.18.0.0/16 ServiceCIDRs:[] OVNConfigNamespace:openshift-ovn-kubernetes OVNEmptyLbEvents:false PodIP: RawNoHostSubnetNodes:migration.network.openshift.io/plugin= NoHostSubnetNodes:<nil> HostNetworkNamespace:openshift-host-network PlatformType:AWS HealthzBindAddress:0.0.0.0:10256 CompatMetricsBindAddress: CompatOVNMetricsBindAddress: CompatMetricsEnablePprof:false DNSServiceNamespace:openshift-dns DNSServiceName:dns-default} Metrics:{BindAddress: OVNMetricsBindAddress: ExportOVSMetrics:false EnablePprof:false NodeServerPrivKey: NodeServerCert: EnableConfigDuration:false EnableScaleMetrics:false} OvnNorth:{Address: PrivKey: Cert: CACert: CertCommonName: Scheme: ElectionTimer:0 northbound:false exec:<nil>} OvnSouth:{Address: PrivKey: Cert: CACert: CertCommonName: Scheme: ElectionTimer:0 northbound:false exec:<nil>} Gateway:{Mode:shared Interface: EgressGWInterface: NextHop: VLANID:0 NodeportEnable:true DisableSNATMultipleGWs:false V4JoinSubnet:100.64.0.0/16 V6JoinSubnet:fd98::/64 V4MasqueradeSubnet:169.254.169.0/29 V6MasqueradeSubnet:fd69::/125 MasqueradeIPs:{V4OVNMasqueradeIP:169.254.169.1 V6OVNMasqueradeIP:fd69::1 V4HostMasqueradeIP:169.254.169.2 V6HostMasqueradeIP:fd69::2 V4HostETPLocalMasqueradeIP:169.254.169.3 V6HostETPLocalMasqueradeIP:fd69::3 V4DummyNextHopMasqueradeIP:169.254.169.4 V6DummyNextHopMasqueradeIP:fd69::4 V4OVNServiceHairpinMasqueradeIP:169.254.169.5 V6OVNServiceHairpinMasqueradeIP:fd69::5} DisablePacketMTUCheck:false RouterSubnet: SingleNode:false DisableForwarding:false AllowNoUplink:false} MasterHA:{ElectionLeaseDuration:137 ElectionRenewDeadline:107 ElectionRetryPeriod:26} ClusterMgrHA:{ElectionLeaseDuration:137 
ElectionRenewDeadline:107 ElectionRetryPeriod:26} HybridOverlay:{Enabled:true RawClusterSubnets: ClusterSubnets:[] VXLANPort:4789} OvnKubeNode:{Mode:full DPResourceDeviceIdsMap:map[] MgmtPortNetdev: MgmtPortDPResourceName:} ClusterManager:{V4TransitSwitchSubnet:100.88.0.0/16 V6TransitSwitchSubnet:fd97::/64}} F0829 14:06:20.315468 82345 hybrid-overlay-node.go:54] illegal network configuration: built-in join subnet "100.64.0.0/16" overlaps cluster subnet "100.64.0.0/15"
Expected results:
The OVNKubernetes Limited Live Migration should recognize the internalJoinSubnet change and not report any CIDR/subnet overlap during the migration.
Additional info:
N/A
Affected Platforms:
OpenShift Container Platform 4.16 on AWS
This is a clone of issue OCPBUGS-42066. The following is the description of the original issue:
—
Description of problem:
We need to backport https://github.com/openshift/cluster-monitoring-operator/pull/2271 to 4.16 because the CMO e2e tests fail almost all the time in release-4.16.
Version-Release number of selected component (if applicable):
4.16
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
This is a non functional change.
Description of problem:
"[sig-apps][Feature:DeploymentConfig] deploymentconfigs when tagging images should successfully tag the deployed image [apigroup:apps.openshift.io][apigroup:authorization.openshift.io][apigroup:image.openshift.io] [Skipped:Disconnected] [Suite:openshift/conformance/parallel]" has the following warning: warnings.go:70] apps.openshift.io/v1 DeploymentConfig is deprecated in v4.14+, unavailable in v4.10000+
Version-Release number of selected component (if applicable):
4.15.0-0.nightly-2024-03-07-234116
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
This is a clone of issue OCPBUGS-30948. The following is the description of the original issue:
—
Description of problem:
When doing offline SDN migration, setting the parameter "spec.migration.features.egressIP" to "false" to disable automatic migration of egressIP configuration doesn't work.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. Launch a cluster with OpenShiftSDN. Configure an egressip to a node. 2. Start offline SDN migration. 3. In step-3, execute oc patch Network.operator.openshift.io cluster --type='merge' \ --patch '{ "spec": { "migration": { "networkType": "OVNKubernetes", "features": { "egressIP": false } } } }'
Actual results:
An egressip.k8s.ovn.org CR is created automatically.
Expected results:
No egressip CR shall be created for OVN-K
Additional info:
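A simple way to verify whether the flag was honored is to look for OVN-Kubernetes EgressIP objects after the migration (sketch only):
$ # with "features": {"egressIP": false}, no egressips.k8s.ovn.org objects should have been created
$ oc get egressips.k8s.ovn.org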
Description of problem:
1. The TaskRuns list page loads constantly for all projects 2. The archive icon is not displayed for some tasks on the TaskRun list page 3. On changing the namespace to All Projects, PipelineRuns and TaskRuns do not load properly
Version-Release number of selected component (if applicable):
4.15.z
How reproducible:
Always
Steps to Reproduce:
1. Create some TaskRuns 2. Go to the TaskRun list page 3. Select All Projects in the project dropdown
Actual results:
The screen keeps loading indefinitely
Expected results:
Should load TaskRuns from all projects
Additional info:
Description of problem:
Install IPI cluster against 4.15 nightly build on Azure MAG and Azure Stack Hub or with Azure workload identity, image-registry co is degraded with different errors. On MAG: $ oc get co image-registry NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE image-registry 4.15.0-0.nightly-2024-02-16-235514 True False True 5h44m AzurePathFixControllerDegraded: Migration failed: panic: Get "https://imageregistryjima41xvvww.blob.core.windows.net/jima415a-hfxfh-image-registry-vbibdmawmsvqckhvmmiwisebryohfbtm?comp=list&prefix=docker&restype=container": dial tcp: lookup imageregistryjima41xvvww.blob.core.windows.net on 172.30.0.10:53: no such host... $ oc get pod -n openshift-image-registry NAME READY STATUS RESTARTS AGE azure-path-fix-ssn5w 0/1 Error 0 5h47m cluster-image-registry-operator-86cdf775c7-7brn6 1/1 Running 1 (5h50m ago) 5h58m image-registry-5c6796b86d-46lvx 1/1 Running 0 5h47m image-registry-5c6796b86d-9st5d 1/1 Running 0 5h47m node-ca-48lsh 1/1 Running 0 5h44m node-ca-5rrsl 1/1 Running 0 5h47m node-ca-8sc92 1/1 Running 0 5h47m node-ca-h6trz 1/1 Running 0 5h47m node-ca-hm7s2 1/1 Running 0 5h47m node-ca-z7tv8 1/1 Running 0 5h44m $ oc logs azure-path-fix-ssn5w -n openshift-image-registry panic: Get "https://imageregistryjima41xvvww.blob.core.windows.net/jima415a-hfxfh-image-registry-vbibdmawmsvqckhvmmiwisebryohfbtm?comp=list&prefix=docker&restype=container": dial tcp: lookup imageregistryjima41xvvww.blob.core.windows.net on 172.30.0.10:53: no such hostgoroutine 1 [running]: main.main() /go/src/github.com/openshift/cluster-image-registry-operator/cmd/move-blobs/main.go:49 +0x125 The blob storage endpoint seems not correct, should be: $ az storage account show -n imageregistryjima41xvvww -g jima415a-hfxfh-rg --query primaryEndpoints { "blob": "https://imageregistryjima41xvvww.blob.core.usgovcloudapi.net/", "dfs": "https://imageregistryjima41xvvww.dfs.core.usgovcloudapi.net/", "file": "https://imageregistryjima41xvvww.file.core.usgovcloudapi.net/", "internetEndpoints": null, "microsoftEndpoints": null, "queue": "https://imageregistryjima41xvvww.queue.core.usgovcloudapi.net/", "table": "https://imageregistryjima41xvvww.table.core.usgovcloudapi.net/", "web": "https://imageregistryjima41xvvww.z2.web.core.usgovcloudapi.net/" } On Azure Stack Hub: $ oc get co image-registry NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE image-registry 4.15.0-0.nightly-2024-02-16-235514 True False True 3h32m AzurePathFixControllerDegraded: Migration failed: panic: open : no such file or directory... $ oc get pod -n openshift-image-registry NAME READY STATUS RESTARTS AGE azure-path-fix-8jdg7 0/1 Error 0 3h35m cluster-image-registry-operator-86cdf775c7-jwnd4 1/1 Running 1 (3h38m ago) 3h54m image-registry-658669fbb4-llv8z 1/1 Running 0 3h35m image-registry-658669fbb4-lmfr6 1/1 Running 0 3h35m node-ca-2jkjx 1/1 Running 0 3h35m node-ca-dcg2v 1/1 Running 0 3h35m node-ca-q6xmn 1/1 Running 0 3h35m node-ca-r46r2 1/1 Running 0 3h35m node-ca-s8jkb 1/1 Running 0 3h35m node-ca-ww6ql 1/1 Running 0 3h35m $ oc logs azure-path-fix-8jdg7 -n openshift-image-registry panic: open : no such file or directorygoroutine 1 [running]: main.main() /go/src/github.com/openshift/cluster-image-registry-operator/cmd/move-blobs/main.go:36 +0x145 On cluster with Azure workload identity: Some operator's PROGRESSING is True image-registry 4.15.0-0.nightly-2024-02-16-235514 True True False 43m Progressing: The deployment has not completed... pod azure-path-fix is in CreateContainerConfigError status, and get error in its Event. 
"state": { "waiting": { "message": "couldn't find key REGISTRY_STORAGE_AZURE_ACCOUNTKEY in Secret openshift-image-registry/image-registry-private-configuration", "reason": "CreateContainerConfigError" } }
Version-Release number of selected component (if applicable):
4.15.0-0.nightly-2024-02-16-235514
How reproducible:
Always
Steps to Reproduce:
1. Install IPI cluster on MAG or Azure Stack Hub or config Azure workload identity 2. 3.
Actual results:
Installation failed and image-registry operator is degraded
Expected results:
Installation is successful.
Additional info:
Seems that issue is related with https://github.com/openshift/image-registry/pull/393
This is a clone of issue OCPBUGS-32405. The following is the description of the original issue:
—
Description of problem:
When creating a serverless function in create serverless form, BuildConfig is not created
Version-Release number of selected component (if applicable):
4.15
How reproducible:
Always
Steps to Reproduce:
1.Install Serverless operator 2.Add https://github.com/openshift-dev-console/kn-func-node-cloudevents in create serverless form 3.Create the function and check BuildConfig page
Actual results:
BuildConfig is not created
Expected results:
Should create BuildConfig
Additional info:
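One quick check after submitting the form (assuming the function was created in the currently selected project):
$ # per the expected results, a BuildConfig should exist alongside the other generated resources
$ oc get buildconfig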
This is a clone of issue OCPBUGS-31685. The following is the description of the original issue:
—
See the bug reported here https://github.com/openshift/console/issues/13696
We want to use the latest version of CAPO in MAPO. We need to revendor CAPO to version 0.9 before the 4.16 FF.
There are several API changes that might require Matt's help.
Please review the following PR: https://github.com/openshift/cluster-config-operator/pull/390
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Not an issue, just an upstream sync (or the issue is that multus is not up to date).
Description of problem:
The image registry CO is not progressing on Azure Hosted Control Planes
Version-Release number of selected component (if applicable):
How reproducible:
Every time
Steps to Reproduce:
1. Create an Azure HCP 2. Create a kubeconfig for the guest cluster 3. Check the image-registry CO
Actual results:
image-registry co's message is Progressing: The registry is ready...
Expected results:
image-registry finishes progressing
Additional info:
I let it go for about 34m
% oc get co | grep -i image
image-registry 4.16.0-0.nightly-multi-2024-02-26-105325 True True False 34m Progressing: The registry is ready...
% oc get co/image-registry -oyaml
...
- lastTransitionTime: "2024-02-28T19:10:30Z"
  message: |-
    Progressing: The registry is ready
    NodeCADaemonProgressing: The daemon set node-ca is deployed
    AzurePathFixProgressing: The job does not exist
  reason: AzurePathFixNotFound::Ready
  status: "True"
  type: Progressing
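Since the Progressing condition points at a missing azure-path-fix job, one way to confirm on the guest cluster is to list the jobs in the registry namespace (a sketch, using the job name from the condition and from similar reports above):
$ # reason AzurePathFixNotFound suggests the azure-path-fix job was never created
$ oc get jobs -n openshift-image-registry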
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Please review the following PR: https://github.com/openshift/machine-api-provider-powervs/pull/67
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
No QA required, updating approvers across releases
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
[sig-builds][Feature:Builds][Slow] update failure status Build status OutOfMemoryKilled should contain OutOfMemoryKilled failure reason and message [apigroup:build.openshift.io] test is failing on 4.15 (e.g. https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_oc/1726/pull-ci-openshift-oc-release-4.15-e2e-aws-ovn-builds/1780913191149113344)
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
https://github.com/containerd/containerd/issues/8180 would be the reason for the failure: in 4.15 the expected status is OOMKilled, however the test fails after getting an Error status with the correct exit code (137).
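To see what the build container actually reported, the terminated state can be inspected on the build pod (pod and namespace names are illustrative placeholders):
$ # expected: reason OOMKilled; observed per the containerd issue: reason Error with exitCode 137
$ oc get pod <build-pod-name> -n <namespace> -o jsonpath='{.status.containerStatuses[0].state.terminated}'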
Please review the following PR: https://github.com/openshift/gcp-pd-csi-driver-operator/pull/117
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
Today, egress firewall rules with 'nodeSelector' only use the nodeIP in the OVN ACL rule. But a node may have secondary IPs other than the nodeIP. We should create the ACL with all the possible IPs of the selected node.
Version-Release number of selected component (if applicable):
How reproducible:
Create an egress firewall rule with 'nodeSelector'
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
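For reference, listing all addresses of a selected node shows the secondary IPs that the ACL would also need to cover (node name is hypothetical):
$ # the nodeIP is only one of the addresses a node can carry
$ oc get node <node-name> -o jsonpath='{.status.addresses}'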
Description of problem:
Manifests will be removed from the CCO image, so we have to start using the CCA (cluster-config-api) image for bootstrap.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
KAS bootstrap container fails
Expected results:
KAS bootstrap container succeeds
Additional info:
Please review the following PR: https://github.com/openshift/oauth-apiserver/pull/94
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
The kubelet is running with `unconfined_service_t`. It should run as `kubelet_exec_t`. This is causing all our plugins to fail because of SELinux denials.
sh-5.1# ps -AZ | grep kubelet
system_u:system_r:unconfined_service_t:s0 8719 ? 00:24:50 kubelet
This issue was previously observed and resolved in 4.14.10.
Version-Release number of selected component (if applicable):
OCP 4.15
How reproducible:
Run ps -AZ | grep kubelet to see kubelet running with wrong label
Steps to Reproduce:
1. 2. 3.
Actual results:
Kubelet is running as unconfined_service_t
Expected results:
Kubelet should run as kubelet_exec_t
Additional info:
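Two quick checks on an affected node (a sketch, assuming the standard RHCOS kubelet binary path; the expected labels follow the report above):
sh-5.1# ps -AZ | grep kubelet     # running context; the report expects a kubelet_exec_t-based context, not unconfined_service_t
sh-5.1# ls -Z /usr/bin/kubelet    # on-disk label of the kubelet binary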
This is a clone of issue OCPBUGS-33789. The following is the description of the original issue:
—
Description of problem:
Version-Release number of selected component (if applicable):
There is an intermittent issue with the UploadImage() implementation in github.com/nutanix-cloud-native/prism-go-client@v0.3.4, on which the OCP installer depends. When testing the OCP installer with ClusterAPIInstall=true, I frequently hit the error with UploadImage() when calling to upload the bootstrap image to PC from the local image file. The error logs: INFO creating the bootstrap image demo-ocp-cluster-g1-lrmwb-bootstrap-ign.iso (uuid: 75694edf-f9c4-4d9a-9a44-731a4d103cc8), taskUUID: c8eafd49-54e2-4fb9-a3df-c456863d71fd. INFO created the bootstrap image demo-ocp-cluster-g1-lrmwb-bootstrap-ign.iso (uuid: 75694edf-f9c4-4d9a-9a44-731a4d103cc8). INFO preparing to upload the bootstrap image demo-ocp-cluster-g1-lrmwb-bootstrap-ign.iso (uuid: 75694edf-f9c4-4d9a-9a44-731a4d103cc8) data from file /Users/yanhuali/Library/Caches/openshift-installer/image_cache/demo-ocp-cluster-g1-lrmwb-bootstrap-ign.iso ERROR failed to upload the bootstrap image data "demo-ocp-cluster-g1-lrmwb-bootstrap-ign.iso" from filepath /Users/yanhuali/Library/Caches/openshift-installer/image_cache/demo-ocp-cluster-g1-lrmwb-bootstrap-ign.iso: status: 400 Bad Request, error-response: { ERROR "api_version": "3.1", ERROR "code": 400, ERROR "message_list": [ ERROR { ERROR "message": "Given input is invalid. Image 75694edf-f9c4-4d9a-9a44-731a4d103cc8 is already complete", ERROR "reason": "INVALID_ARGUMENT" ERROR }ERROR ], ERROR "state": "ERROR" ERROR } ERROR failed to fetch Cluster: failed to generate asset "Cluster": failed to create cluster: failed preparing ignition data: failed to upload the bootstrap image data "demo-ocp-cluster-g1-lrmwb-bootstrap-ign.iso" from filepath /Users/yanhuali/Library/Caches/openshift-installer/image_cache/demo-ocp-cluster-g1-lrmwb-bootstrap-ign.iso: status: 400 Bad Request, error-response: { ERROR "api_version": "3.1", ERROR "code": 400, ERROR "message_list": [ ERROR { ERROR "message": "Given input is invalid. Image 75694edf-f9c4-4d9a-9a44-731a4d103cc8 is already complete", ERROR "reason": "INVALID_ARGUMENT" ERROR }ERROR ], ERROR "state": "ERROR" ERROR } The OCP installer code calling the prism-go-client function UploadImage() is here:https://github.com/openshift/installer/blob/master/pkg/infrastructure/nutanix/clusterapi/clusterapi.go#L172-L207
How reproducible:
Use OCP IPI 4.16 to provision a Nutanix OCP cluster with the install-config ClusterAPIInstall=true. This is an intermittent issue, so you need to repeat the test several times to reproduce.
Steps to Reproduce:
1. 2. 3.
Actual results:
The installer intermittently failed at uploading the bootstrap image data to PC from the local image data file.
Expected results:
The installer successfully to create the Nutanix OCP cluster with the install-config ClusterAPIInstall=true.
Additional info:
Description of problem:
Invalid IDMS files are being generated when the imageSetConfig file does not filter operators by channel, when mirroring from disk to mirror (disk2mirror).
Version-Release number of selected component (if applicable):
oc-mirror version WARNING: This version information is deprecated and will be replaced with the output from --short. Use --output=yaml|json to get the full version. Client Version: version.Info{Major:"", Minor:"", GitVersion:"4.16.0-202403251146.p0.g03ce0ca.assembly.stream.el9-03ce0ca", GitCommit:"03ce0ca797e73b6762fd3e24100ce043199519e9", GitTreeState:"clean", BuildDate:"2024-03-25T16:34:33Z", GoVersion:"go1.21.7 (Red Hat 1.21.7-1.el9) X:strictfipsruntime", Compiler:"gc", Platform:"linux/amd64"}
How reproducible:
always
Steps to Reproduce:
1) Use following imagesetconfig : cat config.yaml kind: ImageSetConfiguration apiVersion: mirror.openshift.io/v1alpha2 mirror: operators: - catalog: registry.redhat.io/redhat/redhat-operator-index:v4.15 packages: - name: devworkspace-operator minVersion: "0.23.0" - name: quay-operator maxVersion: "3.10.2" - name: cluster-logging minVersion: 5.8.3 maxVersion: 5.8.5 2) Do mirror2Disk and disk2Mirror : `oc-mirror --config config.yaml file://outnochannel --v2` `oc-mirror --config config.yaml --from file://outnochannel --v2 docker://ec2-3-17-164-23.us-east-2.compute.amazonaws.com:5000/default` 3) Create the catalogsource, idms, itms resources
Actual results:
4) Failed to create the ImageDigestMirrorSet:
oc create -f outnochannel/working-dir/cluster-resources/idms-oc-mirror.yaml
The ImageDigestMirrorSet "idms-operator-0" is invalid: spec.imageDigestMirrors[2].source: Invalid value: "registry.redhat.io/": spec.imageDigestMirrors[2].source in body should match '^\*(?:\.(?:[a-zA-Z0-9]|[a-zA-Z0-9][a-zA-Z0-9-]*[a-zA-Z0-9]))+$|^((?:[a-zA-Z0-9]|[a-zA-Z0-9][a-zA-Z0-9-]*[a-zA-Z0-9])(?:(?:\.(?:[a-zA-Z0-9]|[a-zA-Z0-9][a-zA-Z0-9-]*[a-zA-Z0-9]))+)?(?::[0-9]+)?)(?:(?:/[a-z0-9]+(?:(?:(?:[._]|__|[-]*)[a-z0-9]+)+)?)+)?$'
cat outnochannel/working-dir/cluster-resources/idms-oc-mirror.yaml
---
apiVersion: config.openshift.io/v1
kind: ImageDigestMirrorSet
metadata:
  creationTimestamp: null
  name: idms-operator-0
spec:
  imageDigestMirrors:
  - mirrors:
    - ec2-3-17-164-23.us-east-2.compute.amazonaws.com:5000/default/devworkspace
    source: registry.redhat.io/devworkspace
  - mirrors:
    - ec2-3-17-164-23.us-east-2.compute.amazonaws.com:5000/default/openshift4
    source: registry.redhat.io/openshift4
  - mirrors:
    - ec2-3-17-164-23.us-east-2.compute.amazonaws.com:5000/default
    source: registry.redhat.io/
  - mirrors:
    - ec2-3-17-164-23.us-east-2.compute.amazonaws.com:5000/default/quay
    source: registry.redhat.io/quay
  - mirrors:
    - ec2-3-17-164-23.us-east-2.compute.amazonaws.com:5000/default/rhel8
    source: registry.redhat.io/rhel8
  - mirrors:
    - ec2-3-17-164-23.us-east-2.compute.amazonaws.com:5000/default/openshift-logging
    source: registry.redhat.io/openshift-logging
status: {}
Expected results:
4) Creating the cluster resources should succeed.
Please review the following PR: https://github.com/openshift/oauth-proxy/pull/270
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Update hash creation to use sha512 instead of sha1
This is a clone of issue OCPBUGS-37052. The following is the description of the original issue:
—
Description of problem:
This is a followup of https://issues.redhat.com/browse/OCPBUGS-34996, in which comments led us to better understand the issue customers are facing. LDAP IDP traffic from the oauth pod seems to be going through the configured HTTP(S) proxy, while it should not due to it being a different protocol. This results in customers adding the ldap endpoint to their no-proxy config to circumvent the issue.
Version-Release number of selected component (if applicable):
4.15.11
How reproducible:
Steps to Reproduce:
(From the customer) 1. Configure LDAP IDP 2. Configure Proxy 3. LDAP IDP communication from the control plane oauth pod goes through proxy instead of going to the ldap endpoint directly
Actual results:
LDAP IDP communication from the control plane oauth pod goes through proxy
Expected results:
LDAP IDP communication from the control plane oauth pod should go to the ldap endpoint directly using the ldap protocol, it should not go through the proxy settings
Additional info:
For more information, see linked tickets.
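The workaround customers are applying today, adding the LDAP host to noProxy, looks roughly like this (hostname is illustrative; note that a merge patch replaces the existing noProxy string, so any existing entries need to be included in the value):
$ # ldap.example.com stands in for the customer's LDAP endpoint
$ oc patch proxy/cluster --type merge -p '{"spec":{"noProxy":"ldap.example.com"}}'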
This is a clone of issue OCPBUGS-33925. The following is the description of the original issue:
—
Description of problem:
When installing a 4.16 cluster while the same API public DNS record already exists, the installer reports Terraform Variables initialization errors, which is not expected since Terraform support should have been removed from the installer.
05-19 17:36:32.935 level=fatal msg=failed to fetch Terraform Variables: failed to fetch dependency of "Terraform Variables": failed to generate asset "Platform Provisioning Check": baseDomain: Invalid value: "qe.devcluster.openshift.com": the zone already has record sets for the domain of the cluster: [api.gpei-0519a.qe.devcluster.openshift.com. (A)]
Version-Release number of selected component (if applicable):
4.16.0-0.nightly-2024-05-18-212906, which has the CAPI install as default
How reproducible:
Steps to Reproduce:
1. Create a 4.16 cluster with the cluster name gpei-0519a 2. After the cluster installation finishes, try to create a second one with the same cluster name
Actual results:
05-19 17:36:26.390 level=debug msg=OpenShift Installer 4.16.0-0.nightly-2024-05-18-212906 05-19 17:36:26.390 level=debug msg=Built from commit 3eed76e1400cac88af6638bb097ada1607137f3f 05-19 17:36:26.390 level=debug msg=Fetching Metadata... 05-19 17:36:26.390 level=debug msg=Loading Metadata... 05-19 17:36:26.390 level=debug msg= Loading Cluster ID... 05-19 17:36:26.390 level=debug msg= Loading Install Config... 05-19 17:36:26.390 level=debug msg= Loading SSH Key... 05-19 17:36:26.390 level=debug msg= Loading Base Domain... 05-19 17:36:26.390 level=debug msg= Loading Platform... 05-19 17:36:26.390 level=debug msg= Loading Cluster Name... 05-19 17:36:26.390 level=debug msg= Loading Base Domain... 05-19 17:36:26.390 level=debug msg= Loading Platform... 05-19 17:36:26.390 level=debug msg= Loading Pull Secret... 05-19 17:36:26.390 level=debug msg= Loading Platform... 05-19 17:36:26.390 level=debug msg= Using Install Config loaded from state file 05-19 17:36:26.391 level=debug msg= Using Cluster ID loaded from state file 05-19 17:36:26.391 level=debug msg= Loading Install Config... 05-19 17:36:26.391 level=debug msg= Loading Bootstrap Ignition Config... 05-19 17:36:26.391 level=debug msg= Loading Ironic bootstrap credentials... 05-19 17:36:26.391 level=debug msg= Using Ironic bootstrap credentials loaded from state file 05-19 17:36:26.391 level=debug msg= Loading CVO Ignore... 05-19 17:36:26.391 level=debug msg= Loading Common Manifests... 05-19 17:36:26.391 level=debug msg= Loading Cluster ID... 05-19 17:36:26.391 level=debug msg= Loading Install Config... 05-19 17:36:26.391 level=debug msg= Loading Ingress Config... 05-19 17:36:26.391 level=debug msg= Loading Install Config... 05-19 17:36:26.391 level=debug msg= Using Ingress Config loaded from state file 05-19 17:36:26.391 level=debug msg= Loading DNS Config... 05-19 17:36:26.391 level=debug msg= Loading Install Config... 05-19 17:36:26.392 level=debug msg= Loading Cluster ID... 05-19 17:36:26.392 level=debug msg= Loading Platform Credentials Check... 05-19 17:36:26.392 level=debug msg= Loading Install Config... 05-19 17:36:26.392 level=debug msg= Using Platform Credentials Check loaded from state file 05-19 17:36:26.392 level=debug msg= Using DNS Config loaded from state file 05-19 17:36:26.392 level=debug msg= Loading Infrastructure Config... 05-19 17:36:26.392 level=debug msg= Loading Cluster ID... 05-19 17:36:26.392 level=debug msg= Loading Install Config... 05-19 17:36:26.392 level=debug msg= Loading Cloud Provider Config... 05-19 17:36:26.392 level=debug msg= Loading Install Config... 05-19 17:36:26.392 level=debug msg= Loading Cluster ID... 05-19 17:36:26.392 level=debug msg= Loading Platform Credentials Check... 05-19 17:36:26.392 level=debug msg= Using Cloud Provider Config loaded from state file 05-19 17:36:26.393 level=debug msg= Loading Additional Trust Bundle Config... 05-19 17:36:26.393 level=debug msg= Loading Install Config... 05-19 17:36:26.393 level=debug msg= Using Additional Trust Bundle Config loaded from state file 05-19 17:36:26.393 level=debug msg= Using Infrastructure Config loaded from state file 05-19 17:36:26.393 level=debug msg= Loading Network Config... 05-19 17:36:26.393 level=debug msg= Loading Install Config... 05-19 17:36:26.393 level=debug msg= Using Network Config loaded from state file 05-19 17:36:26.393 level=debug msg= Loading Proxy Config... 05-19 17:36:26.393 level=debug msg= Loading Install Config... 05-19 17:36:26.393 level=debug msg= Loading Network Config... 
05-19 17:36:26.393 level=debug msg= Using Proxy Config loaded from state file 05-19 17:36:26.393 level=debug msg= Loading Scheduler Config... 05-19 17:36:26.394 level=debug msg= Loading Install Config... 05-19 17:36:26.394 level=debug msg= Using Scheduler Config loaded from state file 05-19 17:36:26.394 level=debug msg= Loading Image Content Source Policy... 05-19 17:36:26.394 level=debug msg= Loading Install Config... 05-19 17:36:26.394 level=debug msg= Using Image Content Source Policy loaded from state file 05-19 17:36:26.394 level=debug msg= Loading Cluster CSI Driver Config... 05-19 17:36:26.394 level=debug msg= Loading Install Config... 05-19 17:36:26.394 level=debug msg= Loading Cluster ID... 05-19 17:36:26.394 level=debug msg= Using Cluster CSI Driver Config loaded from state file 05-19 17:36:26.394 level=debug msg= Loading Image Digest Mirror Set... 05-19 17:36:26.394 level=debug msg= Loading Install Config... 05-19 17:36:26.394 level=debug msg= Using Image Digest Mirror Set loaded from state file 05-19 17:36:26.394 level=debug msg= Loading Machine Config Server Root CA... 05-19 17:36:26.395 level=debug msg= Using Machine Config Server Root CA loaded from state file 05-19 17:36:26.395 level=debug msg= Loading Certificate (mcs)... 05-19 17:36:26.395 level=debug msg= Loading Machine Config Server Root CA... 05-19 17:36:26.395 level=debug msg= Loading Install Config... 05-19 17:36:26.395 level=debug msg= Using Certificate (mcs) loaded from state file 05-19 17:36:26.395 level=debug msg= Loading CVOOverrides... 05-19 17:36:26.395 level=debug msg= Using CVOOverrides loaded from state file 05-19 17:36:26.395 level=debug msg= Loading KubeCloudConfig... 05-19 17:36:26.395 level=debug msg= Using KubeCloudConfig loaded from state file 05-19 17:36:26.395 level=debug msg= Loading KubeSystemConfigmapRootCA... 05-19 17:36:26.395 level=debug msg= Using KubeSystemConfigmapRootCA loaded from state file 05-19 17:36:26.395 level=debug msg= Loading MachineConfigServerTLSSecret... 05-19 17:36:26.396 level=debug msg= Using MachineConfigServerTLSSecret loaded from state file 05-19 17:36:26.396 level=debug msg= Loading OpenshiftConfigSecretPullSecret... 05-19 17:36:26.396 level=debug msg= Using OpenshiftConfigSecretPullSecret loaded from state file 05-19 17:36:26.396 level=debug msg= Using Common Manifests loaded from state file 05-19 17:36:26.396 level=debug msg= Loading Openshift Manifests... 05-19 17:36:26.396 level=debug msg= Loading Install Config... 05-19 17:36:26.396 level=debug msg= Loading Cluster ID... 05-19 17:36:26.396 level=debug msg= Loading Kubeadmin Password... 05-19 17:36:26.396 level=debug msg= Using Kubeadmin Password loaded from state file 05-19 17:36:26.396 level=debug msg= Loading OpenShift Install (Manifests)... 05-19 17:36:26.396 level=debug msg= Using OpenShift Install (Manifests) loaded from state file 05-19 17:36:26.397 level=debug msg= Loading Feature Gate Config... 05-19 17:36:26.397 level=debug msg= Loading Install Config... 05-19 17:36:26.397 level=debug msg= Using Feature Gate Config loaded from state file 05-19 17:36:26.397 level=debug msg= Loading CloudCredsSecret... 05-19 17:36:26.397 level=debug msg= Using CloudCredsSecret loaded from state file 05-19 17:36:26.397 level=debug msg= Loading KubeadminPasswordSecret... 05-19 17:36:26.397 level=debug msg= Using KubeadminPasswordSecret loaded from state file 05-19 17:36:26.397 level=debug msg= Loading RoleCloudCredsSecretReader... 
05-19 17:36:26.397 level=debug msg= Using RoleCloudCredsSecretReader loaded from state file 05-19 17:36:26.397 level=debug msg= Loading Baremetal Config CR... 05-19 17:36:26.397 level=debug msg= Using Baremetal Config CR loaded from state file 05-19 17:36:26.397 level=debug msg= Loading Image... 05-19 17:36:26.397 level=debug msg= Loading Install Config... 05-19 17:36:26.398 level=debug msg= Using Image loaded from state file 05-19 17:36:26.398 level=debug msg= Loading AzureCloudProviderSecret... 05-19 17:36:26.398 level=debug msg= Using AzureCloudProviderSecret loaded from state file 05-19 17:36:26.398 level=debug msg= Using Openshift Manifests loaded from state file 05-19 17:36:26.398 level=debug msg= Using CVO Ignore loaded from state file 05-19 17:36:26.398 level=debug msg= Loading Install Config... 05-19 17:36:26.398 level=debug msg= Loading Kubeconfig Admin Internal Client... 05-19 17:36:26.398 level=debug msg= Loading Certificate (admin-kubeconfig-client)... 05-19 17:36:26.398 level=debug msg= Loading Certificate (admin-kubeconfig-signer)... 05-19 17:36:26.398 level=debug msg= Using Certificate (admin-kubeconfig-signer) loaded from state file 05-19 17:36:26.398 level=debug msg= Using Certificate (admin-kubeconfig-client) loaded from state file 05-19 17:36:26.399 level=debug msg= Loading Certificate (kube-apiserver-complete-server-ca-bundle)... 05-19 17:36:26.399 level=debug msg= Loading Certificate (kube-apiserver-localhost-ca-bundle)... 05-19 17:36:26.399 level=debug msg= Loading Certificate (kube-apiserver-localhost-signer)... 05-19 17:36:26.399 level=debug msg= Using Certificate (kube-apiserver-localhost-signer) loaded from state file 05-19 17:36:26.399 level=debug msg= Using Certificate (kube-apiserver-localhost-ca-bundle) loaded from state file 05-19 17:36:26.399 level=debug msg= Loading Certificate (kube-apiserver-service-network-ca-bundle)... 05-19 17:36:26.399 level=debug msg= Loading Certificate (kube-apiserver-service-network-signer)... 05-19 17:36:26.399 level=debug msg= Using Certificate (kube-apiserver-service-network-signer) loaded from state file 05-19 17:36:26.399 level=debug msg= Using Certificate (kube-apiserver-service-network-ca-bundle) loaded from state file 05-19 17:36:26.400 level=debug msg= Loading Certificate (kube-apiserver-lb-ca-bundle)... 05-19 17:36:26.400 level=debug msg= Loading Certificate (kube-apiserver-lb-signer)... 05-19 17:36:26.400 level=debug msg= Using Certificate (kube-apiserver-lb-signer) loaded from state file 05-19 17:36:26.400 level=debug msg= Using Certificate (kube-apiserver-lb-ca-bundle) loaded from state file 05-19 17:36:26.400 level=debug msg= Using Certificate (kube-apiserver-complete-server-ca-bundle) loaded from state file 05-19 17:36:26.400 level=debug msg= Loading Install Config... 05-19 17:36:26.400 level=debug msg= Using Kubeconfig Admin Internal Client loaded from state file 05-19 17:36:26.400 level=debug msg= Loading Kubeconfig Kubelet... 05-19 17:36:26.400 level=debug msg= Loading Certificate (kube-apiserver-complete-server-ca-bundle)... 05-19 17:36:26.400 level=debug msg= Loading Certificate (kubelet-client)... 05-19 17:36:26.401 level=debug msg= Loading Certificate (kubelet-bootstrap-kubeconfig-signer)... 05-19 17:36:26.401 level=debug msg= Using Certificate (kubelet-bootstrap-kubeconfig-signer) loaded from state file 05-19 17:36:26.401 level=debug msg= Using Certificate (kubelet-client) loaded from state file 05-19 17:36:26.401 level=debug msg= Loading Install Config... 
05-19 17:36:26.401 level=debug msg= Using Kubeconfig Kubelet loaded from state file 05-19 17:36:26.401 level=debug msg= Loading Kubeconfig Admin Client (Loopback)... 05-19 17:36:26.401 level=debug msg= Loading Certificate (admin-kubeconfig-client)... 05-19 17:36:26.401 level=debug msg= Loading Certificate (kube-apiserver-localhost-ca-bundle)... 05-19 17:36:26.401 level=debug msg= Loading Install Config... 05-19 17:36:26.401 level=debug msg= Using Kubeconfig Admin Client (Loopback) loaded from state file 05-19 17:36:26.401 level=debug msg= Loading Master Ignition Customization Check... 05-19 17:36:26.402 level=debug msg= Loading Install Config... 05-19 17:36:26.402 level=debug msg= Loading Machine Config Server Root CA... 05-19 17:36:26.402 level=debug msg= Loading Master Ignition Config... 05-19 17:36:26.402 level=debug msg= Loading Install Config... 05-19 17:36:26.402 level=debug msg= Loading Machine Config Server Root CA... 05-19 17:36:26.402 level=debug msg= Loading Master Ignition Config from both state file and target directory 05-19 17:36:26.402 level=debug msg= On-disk Master Ignition Config matches asset in state file 05-19 17:36:26.402 level=debug msg= Using Master Ignition Config loaded from state file 05-19 17:36:26.402 level=debug msg= Using Master Ignition Customization Check loaded from state file 05-19 17:36:26.402 level=debug msg= Loading Worker Ignition Customization Check... 05-19 17:36:26.402 level=debug msg= Loading Install Config... 05-19 17:36:26.402 level=debug msg= Loading Machine Config Server Root CA... 05-19 17:36:26.403 level=debug msg= Loading Worker Ignition Config... 05-19 17:36:26.403 level=debug msg= Loading Install Config... 05-19 17:36:26.403 level=debug msg= Loading Machine Config Server Root CA... 05-19 17:36:26.403 level=debug msg= Loading Worker Ignition Config from both state file and target directory 05-19 17:36:26.403 level=debug msg= On-disk Worker Ignition Config matches asset in state file 05-19 17:36:26.403 level=debug msg= Using Worker Ignition Config loaded from state file 05-19 17:36:26.403 level=debug msg= Using Worker Ignition Customization Check loaded from state file 05-19 17:36:26.403 level=debug msg= Loading Master Machines... 05-19 17:36:26.403 level=debug msg= Loading Cluster ID... 05-19 17:36:26.403 level=debug msg= Loading Platform Credentials Check... 05-19 17:36:26.403 level=debug msg= Loading Install Config... 05-19 17:36:26.403 level=debug msg= Loading Image... 05-19 17:36:26.404 level=debug msg= Loading Master Ignition Config... 05-19 17:36:26.404 level=debug msg= Using Master Machines loaded from state file 05-19 17:36:26.404 level=debug msg= Loading Worker Machines... 05-19 17:36:26.404 level=debug msg= Loading Cluster ID... 05-19 17:36:26.404 level=debug msg= Loading Platform Credentials Check... 05-19 17:36:26.404 level=debug msg= Loading Install Config... 05-19 17:36:26.404 level=debug msg= Loading Image... 05-19 17:36:26.404 level=debug msg= Loading Release... 05-19 17:36:26.404 level=debug msg= Loading Install Config... 05-19 17:36:26.404 level=debug msg= Using Release loaded from state file 05-19 17:36:26.404 level=debug msg= Loading Worker Ignition Config... 05-19 17:36:26.404 level=debug msg= Using Worker Machines loaded from state file 05-19 17:36:26.404 level=debug msg= Loading Common Manifests... 05-19 17:36:26.404 level=debug msg= Loading Openshift Manifests... 05-19 17:36:26.404 level=debug msg= Loading Proxy Config... 05-19 17:36:26.405 level=debug msg= Loading Certificate (admin-kubeconfig-ca-bundle)... 
05-19 17:36:26.405 level=debug msg= Loading Certificate (admin-kubeconfig-signer)... 05-19 17:36:26.405 level=debug msg= Using Certificate (admin-kubeconfig-ca-bundle) loaded from state file 05-19 17:36:26.405 level=debug msg= Loading Certificate (aggregator)... 05-19 17:36:26.405 level=debug msg= Using Certificate (aggregator) loaded from state file 05-19 17:36:26.405 level=debug msg= Loading Certificate (aggregator-ca-bundle)... 05-19 17:36:26.405 level=debug msg= Loading Certificate (aggregator-signer)... 05-19 17:36:26.405 level=debug msg= Using Certificate (aggregator-signer) loaded from state file 05-19 17:36:26.405 level=debug msg= Using Certificate (aggregator-ca-bundle) loaded from state file 05-19 17:36:26.405 level=debug msg= Loading Certificate (system:kube-apiserver-proxy)... 05-19 17:36:26.405 level=debug msg= Loading Certificate (aggregator-signer)... 05-19 17:36:26.406 level=debug msg= Using Certificate (system:kube-apiserver-proxy) loaded from state file 05-19 17:36:26.406 level=debug msg= Loading Certificate (aggregator-signer)... 05-19 17:36:26.406 level=debug msg= Loading Certificate (system:kube-apiserver-proxy)... 05-19 17:36:26.406 level=debug msg= Loading Certificate (aggregator)... 05-19 17:36:26.406 level=debug msg= Using Certificate (system:kube-apiserver-proxy) loaded from state file 05-19 17:36:26.406 level=debug msg= Loading Bootstrap SSH Key Pair... 05-19 17:36:26.406 level=debug msg= Using Bootstrap SSH Key Pair loaded from state file 05-19 17:36:26.406 level=debug msg= Loading User-provided Service Account Signing key... 05-19 17:36:26.406 level=debug msg= Using User-provided Service Account Signing key loaded from state file 05-19 17:36:26.406 level=debug msg= Loading Cloud Provider CA Bundle... 05-19 17:36:26.406 level=debug msg= Loading Install Config... 05-19 17:36:26.407 level=debug msg= Using Cloud Provider CA Bundle loaded from state file 05-19 17:36:26.407 level=debug msg= Loading Certificate (journal-gatewayd)... 05-19 17:36:26.407 level=debug msg= Loading Machine Config Server Root CA... 05-19 17:36:26.407 level=debug msg= Using Certificate (journal-gatewayd) loaded from state file 05-19 17:36:26.407 level=debug msg= Loading Certificate (kube-apiserver-lb-ca-bundle)... 05-19 17:36:26.407 level=debug msg= Loading Certificate (kube-apiserver-external-lb-server)... 05-19 17:36:26.407 level=debug msg= Loading Certificate (kube-apiserver-lb-signer)... 05-19 17:36:26.407 level=debug msg= Loading Install Config... 05-19 17:36:26.407 level=debug msg= Using Certificate (kube-apiserver-external-lb-server) loaded from state file 05-19 17:36:26.407 level=debug msg= Loading Certificate (kube-apiserver-internal-lb-server)... 05-19 17:36:26.407 level=debug msg= Loading Certificate (kube-apiserver-lb-signer)... 05-19 17:36:26.408 level=debug msg= Loading Install Config... 05-19 17:36:26.408 level=debug msg= Using Certificate (kube-apiserver-internal-lb-server) loaded from state file 05-19 17:36:26.408 level=debug msg= Loading Certificate (kube-apiserver-lb-signer)... 05-19 17:36:26.408 level=debug msg= Loading Certificate (kube-apiserver-localhost-ca-bundle)... 05-19 17:36:26.408 level=debug msg= Loading Certificate (kube-apiserver-localhost-server)... 05-19 17:36:26.408 level=debug msg= Loading Certificate (kube-apiserver-localhost-signer)... 05-19 17:36:26.408 level=debug msg= Using Certificate (kube-apiserver-localhost-server) loaded from state file 05-19 17:36:26.408 level=debug msg= Loading Certificate (kube-apiserver-localhost-signer)... 
05-19 17:36:26.408 level=debug msg= Loading Certificate (kube-apiserver-service-network-ca-bundle)... 05-19 17:36:26.408 level=debug msg= Loading Certificate (kube-apiserver-service-network-server)... 05-19 17:36:26.409 level=debug msg= Loading Certificate (kube-apiserver-service-network-signer)... 05-19 17:36:26.409 level=debug msg= Loading Install Config... 05-19 17:36:26.409 level=debug msg= Using Certificate (kube-apiserver-service-network-server) loaded from state file 05-19 17:36:26.409 level=debug msg= Loading Certificate (kube-apiserver-service-network-signer)... 05-19 17:36:26.409 level=debug msg= Loading Certificate (kube-apiserver-complete-server-ca-bundle)... 05-19 17:36:26.409 level=debug msg= Loading Certificate (kube-apiserver-complete-client-ca-bundle)... 05-19 17:36:26.409 level=debug msg= Loading Certificate (admin-kubeconfig-ca-bundle)... 05-19 17:36:26.409 level=debug msg= Loading Certificate (kubelet-client-ca-bundle)... 05-19 17:36:26.409 level=debug msg= Loading Certificate (kubelet-signer)... 05-19 17:36:26.409 level=debug msg= Using Certificate (kubelet-signer) loaded from state file 05-19 17:36:26.410 level=debug msg= Using Certificate (kubelet-client-ca-bundle) loaded from state file 05-19 17:36:26.410 level=debug msg= Loading Certificate (kube-control-plane-ca-bundle)... 05-19 17:36:26.410 level=debug msg= Loading Certificate (kube-control-plane-signer)... 05-19 17:36:26.410 level=debug msg= Using Certificate (kube-control-plane-signer) loaded from state file 05-19 17:36:26.410 level=debug msg= Loading Certificate (kube-apiserver-lb-signer)... 05-19 17:36:26.410 level=debug msg= Loading Certificate (kube-apiserver-localhost-signer)... 05-19 17:36:26.410 level=debug msg= Loading Certificate (kube-apiserver-service-network-signer)... 05-19 17:36:26.410 level=debug msg= Using Certificate (kube-control-plane-ca-bundle) loaded from state file 05-19 17:36:26.410 level=debug msg= Loading Certificate (kube-apiserver-to-kubelet-ca-bundle)... 05-19 17:36:26.411 level=debug msg= Loading Certificate (kube-apiserver-to-kubelet-signer)... 05-19 17:36:26.411 level=debug msg= Using Certificate (kube-apiserver-to-kubelet-signer) loaded from state file 05-19 17:36:26.411 level=debug msg= Using Certificate (kube-apiserver-to-kubelet-ca-bundle) loaded from state file 05-19 17:36:26.411 level=debug msg= Loading Certificate (kubelet-bootstrap-kubeconfig-ca-bundle)... 05-19 17:36:26.411 level=debug msg= Loading Certificate (kubelet-bootstrap-kubeconfig-signer)... 05-19 17:36:26.411 level=debug msg= Using Certificate (kubelet-bootstrap-kubeconfig-ca-bundle) loaded from state file 05-19 17:36:26.411 level=debug msg= Using Certificate (kube-apiserver-complete-client-ca-bundle) loaded from state file 05-19 17:36:26.411 level=debug msg= Loading Certificate (kube-apiserver-to-kubelet-ca-bundle)... 05-19 17:36:26.411 level=debug msg= Loading Certificate (kube-apiserver-to-kubelet-client)... 05-19 17:36:26.412 level=debug msg= Loading Certificate (kube-apiserver-to-kubelet-signer)... 05-19 17:36:26.412 level=debug msg= Using Certificate (kube-apiserver-to-kubelet-client) loaded from state file 05-19 17:36:26.412 level=debug msg= Loading Certificate (kube-apiserver-to-kubelet-signer)... 05-19 17:36:26.412 level=debug msg= Loading Certificate (kube-control-plane-ca-bundle)... 05-19 17:36:26.412 level=debug msg= Loading Certificate (kube-control-plane-kube-controller-manager-client)... 05-19 17:36:26.412 level=debug msg= Loading Certificate (kube-control-plane-signer)... 
05-19 17:36:26.412 level=debug msg= Using Certificate (kube-control-plane-kube-controller-manager-client) loaded from state file 05-19 17:36:26.412 level=debug msg= Loading Certificate (kube-control-plane-kube-scheduler-client)... 05-19 17:36:26.412 level=debug msg= Loading Certificate (kube-control-plane-signer)... 05-19 17:36:26.412 level=debug msg= Using Certificate (kube-control-plane-kube-scheduler-client) loaded from state file 05-19 17:36:26.413 level=debug msg= Loading Certificate (kube-control-plane-signer)... 05-19 17:36:26.413 level=debug msg= Loading Certificate (kubelet-bootstrap-kubeconfig-ca-bundle)... 05-19 17:36:26.413 level=debug msg= Loading Certificate (kubelet-client-ca-bundle)... 05-19 17:36:26.413 level=debug msg= Loading Certificate (kubelet-client)... 05-19 17:36:26.413 level=debug msg= Loading Certificate (kubelet-signer)... 05-19 17:36:26.413 level=debug msg= Loading Certificate (kubelet-serving-ca-bundle)... 05-19 17:36:26.413 level=debug msg= Loading Certificate (kubelet-signer)... 05-19 17:36:26.413 level=debug msg= Using Certificate (kubelet-serving-ca-bundle) loaded from state file 05-19 17:36:26.413 level=debug msg= Loading Certificate (mcs)... 05-19 17:36:26.413 level=debug msg= Loading Machine Config Server Root CA... 05-19 17:36:26.413 level=debug msg= Loading Key Pair (service-account.pub)... 05-19 17:36:26.414 level=debug msg= Using Key Pair (service-account.pub) loaded from state file 05-19 17:36:26.414 level=debug msg= Loading Release Image Pull Spec... 05-19 17:36:26.414 level=debug msg= Using Release Image Pull Spec loaded from state file 05-19 17:36:26.414 level=debug msg= Loading Image... 05-19 17:36:26.414 level=debug msg= Loading Bootstrap Ignition Config from both state file and target directory 05-19 17:36:26.414 level=debug msg= On-disk Bootstrap Ignition Config matches asset in state file 05-19 17:36:26.414 level=debug msg= Using Bootstrap Ignition Config loaded from state file 05-19 17:36:26.414 level=debug msg=Using Metadata loaded from state file 05-19 17:36:26.414 level=debug msg=Reusing previously-fetched Metadata 05-19 17:36:26.415 level=info msg=Consuming Worker Ignition Config from target directory 05-19 17:36:26.415 level=debug msg=Purging asset "Worker Ignition Config" from disk 05-19 17:36:26.415 level=info msg=Consuming Master Ignition Config from target directory 05-19 17:36:26.415 level=debug msg=Purging asset "Master Ignition Config" from disk 05-19 17:36:26.415 level=info msg=Consuming Bootstrap Ignition Config from target directory 05-19 17:36:26.415 level=debug msg=Purging asset "Bootstrap Ignition Config" from disk 05-19 17:36:26.415 level=debug msg=Fetching Master Ignition Customization Check... 05-19 17:36:26.415 level=debug msg=Reusing previously-fetched Master Ignition Customization Check 05-19 17:36:26.415 level=debug msg=Fetching Worker Ignition Customization Check... 05-19 17:36:26.415 level=debug msg=Reusing previously-fetched Worker Ignition Customization Check 05-19 17:36:26.415 level=debug msg=Fetching Terraform Variables... 05-19 17:36:26.415 level=debug msg=Loading Terraform Variables... 05-19 17:36:26.416 level=debug msg= Loading Cluster ID... 05-19 17:36:26.416 level=debug msg= Loading Install Config... 05-19 17:36:26.416 level=debug msg= Loading Image... 05-19 17:36:26.416 level=debug msg= Loading Release... 05-19 17:36:26.416 level=debug msg= Loading BootstrapImage... 05-19 17:36:26.416 level=debug msg= Loading Install Config... 05-19 17:36:26.416 level=debug msg= Loading Image... 
05-19 17:36:26.416 level=debug msg= Loading Bootstrap Ignition Config... 05-19 17:36:26.416 level=debug msg= Loading Master Ignition Config... 05-19 17:36:26.416 level=debug msg= Loading Master Machines... 05-19 17:36:26.416 level=debug msg= Loading Worker Machines... 05-19 17:36:26.416 level=debug msg= Loading Ironic bootstrap credentials... 05-19 17:36:26.416 level=debug msg= Loading Platform Provisioning Check... 05-19 17:36:26.416 level=debug msg= Loading Install Config... 05-19 17:36:26.416 level=debug msg= Loading Common Manifests... 05-19 17:36:26.417 level=debug msg= Fetching Cluster ID... 05-19 17:36:26.417 level=debug msg= Reusing previously-fetched Cluster ID 05-19 17:36:26.417 level=debug msg= Fetching Install Config... 05-19 17:36:26.417 level=debug msg= Reusing previously-fetched Install Config 05-19 17:36:26.417 level=debug msg= Fetching Image... 05-19 17:36:26.417 level=debug msg= Reusing previously-fetched Image 05-19 17:36:26.417 level=debug msg= Fetching Release... 05-19 17:36:26.417 level=debug msg= Reusing previously-fetched Release 05-19 17:36:26.417 level=debug msg= Fetching BootstrapImage... 05-19 17:36:26.417 level=debug msg= Fetching Install Config... 05-19 17:36:26.417 level=debug msg= Reusing previously-fetched Install Config 05-19 17:36:26.417 level=debug msg= Fetching Image... 05-19 17:36:26.417 level=debug msg= Reusing previously-fetched Image 05-19 17:36:26.417 level=debug msg= Generating BootstrapImage... 05-19 17:36:26.417 level=debug msg= Fetching Bootstrap Ignition Config... 05-19 17:36:26.418 level=debug msg= Reusing previously-fetched Bootstrap Ignition Config 05-19 17:36:26.418 level=debug msg= Fetching Master Ignition Config... 05-19 17:36:26.418 level=debug msg= Reusing previously-fetched Master Ignition Config 05-19 17:36:26.418 level=debug msg= Fetching Master Machines... 05-19 17:36:26.418 level=debug msg= Reusing previously-fetched Master Machines 05-19 17:36:26.418 level=debug msg= Fetching Worker Machines... 05-19 17:36:26.418 level=debug msg= Reusing previously-fetched Worker Machines 05-19 17:36:26.418 level=debug msg= Fetching Ironic bootstrap credentials... 05-19 17:36:26.418 level=debug msg= Reusing previously-fetched Ironic bootstrap credentials 05-19 17:36:26.418 level=debug msg= Fetching Platform Provisioning Check... 05-19 17:36:26.418 level=debug msg= Fetching Install Config... 05-19 17:36:26.418 level=debug msg= Reusing previously-fetched Install Config 05-19 17:36:26.418 level=debug msg= Generating Platform Provisioning Check... 05-19 17:36:26.419 level=info msg=Credentials loaded from the "flexy-installer" profile in file "/home/installer1/workspace/ocp-common/Flexy-install@2/flexy/workdir/awscreds20240519-580673-bzyw8l" 05-19 17:36:32.935 level=fatal msg=failed to fetch Terraform Variables: failed to fetch dependency of "Terraform Variables": failed to generate asset "Platform Provisioning Check": baseDomain: Invalid value: "qe.devcluster.openshift.com": the zone already has record sets for the domain of the cluster: [api.gpei-0519a.qe.devcluster.openshift.com. (A)]
Expected results:
Remove all TF checks on AWS/vSphere/Nutanix platforms
Additional info:
Description of problem: The "[sig-arch] events should not repeat pathologically for ns/openshift-dns" test is permafailing in the periodic-ci-shiftstack-shiftstack-ci-main-periodic-4.15-e2e-openstack-ovn-serial job
{ 1 events happened too frequently event happened 114 times, something is wrong: namespace/openshift-dns hmsg/d0c68b9435 service/dns-default - reason/TopologyAwareHintsDisabled Insufficient Node information: allocatable CPU or zone not specified on one or more nodes, addressType: IPv4 From: 17:11:03Z To: 17:11:04Z result=reject }
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Please review the following PR: https://github.com/openshift/csi-operator/pull/114
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
Core CAPI CRDs are not deployed on unsupported platforms even when they are explicitly needed by other operators. An example of this is vSphere clusters: CAPI is not yet supported on vSphere, but the CAPI IPAM CRDs are needed by operators other than the usual consumers, the cluster-capi-operator and the CAPI controllers.
Version-Release number of selected component (if applicable):
How reproducible:
Launch a techpreview cluster for an unsupported platform (e.g. vsphere/azure). Check that the Core CAPI CRDs are not present.
Steps to Reproduce:
$ oc get crds | grep cluster.x-k8s.io
Actual results:
Core CAPI CRDs are not present (only the metal ones)
Expected results:
Core CAPI CRDs should be present
Additional info:
Description of problem:
3 conformance tests are failing in 4.16 when running the conformance/parallel test suite in OpenShift on OpenStack downstream (D/S) CI:
time="2024-04-05T16:44:56Z" level=info msg="Decoding provider" clusterState="<nil>" discover=false dryRun=false func=DecodeProvider providerType="{\"type\":\"skeleton\",\"ProjectID\":\"\",\"Region\":\"\",\"Zone\":\"nova\",\"NumNodes\":3,\"MultiMaster\":true,\"MultiZone\":false,\"Zones\":[\"nova\"],\"ConfigFile\":\"\",\"Disconnected\":false,\"SingleReplicaTopology\":false,\"NetworkPlugin\":\"OVNKubernetes\",\"HasIPv4\":true,\"HasIPv6\":true,\"IPFamily\":\"ipv4\",\"HasSCTP\":false,\"IsProxied\":false,\"IsIBMROKS\":false,\"HasNoOptionalCapabilities\":false}" time="2024-04-05T16:44:56Z" level=warning msg="config was nil" func=DecodeProvider Running Suite: OpenShift e2e suite - /home/stack ================================================ Random Seed: 1712335495 - will randomize all specs Will run 1 of 1 specs ------------------------------ [sig-arch][Late][Jira:"kube-apiserver"] all registered tls artifacts must have no metadata violation regressions [Suite:openshift/conformance/parallel] github.com/openshift/origin/test/extended/operators/certs.go:202 Apr 5 16:44:57.900: INFO: microshift-version configmap not found [FAILED] in [BeforeAll] - github.com/openshift/origin/test/extended/operators/certs.go:120 @ 04/05/24 16:45:01.242 • [FAILED] [3.378 seconds] [sig-arch][Late][Jira:"kube-apiserver"] [BeforeAll] all registered tls artifacts must have no metadata violation regressions [Suite:openshift/conformance/parallel] [BeforeAll] github.com/openshift/origin/test/extended/operators/certs.go:94 [It] github.com/openshift/origin/test/extended/operators/certs.go:202 [FAILED] Unexpected error: <*errors.errorString | 0xc00062c060>: unable to determine openshift-tests image: exit status 1: error: unable to read image registry.ci.openshift.org/ocp/release@sha256:8142e7b7720bd37879ec5919cb6bce0d436f119516694bcf0788372faf45a0e0: unauthorized: authentication required { s: "unable to determine openshift-tests image: exit status 1: error: unable to read image registry.ci.openshift.org/ocp/release@sha256:8142e7b7720bd37879ec5919cb6bce0d436f119516694bcf0788372faf45a0e0: unauthorized: authentication required\n", } occurred In [BeforeAll] at: github.com/openshift/origin/test/extended/operators/certs.go:120 @ 04/05/24 16:45:01.242 ------------------------------ Summarizing 1 Failure: [FAIL] [sig-arch][Late][Jira:"kube-apiserver"] [BeforeAll] all registered tls artifacts must have no metadata violation regressions [Suite:openshift/conformance/parallel] github.com/openshift/origin/test/extended/operators/certs.go:120 Ran 1 of 1 Specs in 3.378 seconds FAIL! -- 0 Passed | 1 Failed | 0 Pending | 0 Skipped fail [github.com/openshift/origin/test/extended/operators/certs.go:120]: Unexpected error: <*errors.errorString | 0xc00062c060>: unable to determine openshift-tests image: exit status 1: error: unable to read image registry.ci.openshift.org/ocp/release@sha256:8142e7b7720bd37879ec5919cb6bce0d436f119516694bcf0788372faf45a0e0: unauthorized: authentication required { s: "unable to determine openshift-tests image: exit status 1: error: unable to read image registry.ci.openshift.org/ocp/release@sha256:8142e7b7720bd37879ec5919cb6bce0d436f119516694bcf0788372faf45a0e0: unauthorized: authentication required\n", } occurred Ginkgo exit error 1: exit with code 1
Version-Release number of selected component (if applicable):
release-4.16 and master branches in origin.
How reproducible:
Always
Description of problem:
As the live migration process may take hours for a large cluster, workload in the cluster may trigger cluster expansion by adding new nodes. We need to support adding new nodes while an SDN live migration is in progress. We need to backport this to 4.15.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
Deployment cannot be scaled up/down when an HPA is associated with it.
Version-Release number of selected component (if applicable):
4.12
How reproducible:
100%
Steps to Reproduce:
1. Create a test deployment $ oc new-app httpd 2. Create a HPA for the deployment $ oc autoscale deployment/httpd --min 1 --max 10 --cpu-percent 10 3. Scale down the deployment via script or manually to 0 replicas. $ oc scale deployment/httpd --replicas=0 4. The HPA shows below status that it cannot scale up until the deployment is scaled up. ~~~ - type: ScalingActive status: 'False' lastTransitionTime: '2023-10-24T10:00:01Z' reason: ScalingDisabled message: scaling is disabled since the replica count of the target is zero ~~~ 5. Since the scale up/down is disabled, the users will not be able to scale up the deployment from GUI. The only option is to do it from CLI.
Actual results:
The scale up/down arrows are disabled and users are unable to start the deployment.
Expected results:
The scale up/down arrows should be enabled, or another option should be provided that allows scaling up the deployment.
Additional info:
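For reference, scaling through the scale subresource (what `oc scale` calls) is not affected by the HPA's ScalingActive=False condition, so a programmatic scale-up from zero always works. A minimal client-go sketch under that assumption (namespace, kubeconfig path, and replica count are illustrative, not taken from this report):

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a client from the local kubeconfig (illustrative only).
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Read the current scale of the deployment and bump it to one replica.
	// This goes through the scale subresource, so it is not blocked by the
	// HPA's "scaling is disabled since the replica count ... is zero" state.
	scale, err := client.AppsV1().Deployments("default").GetScale(context.TODO(), "httpd", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}
	scale.Spec.Replicas = 1
	if _, err := client.AppsV1().Deployments("default").UpdateScale(context.TODO(), "httpd", scale, metav1.UpdateOptions{}); err != nil {
		panic(err)
	}
	fmt.Println("scaled deployment/httpd to 1 replica")
}
```

The console arrows could take the same path; the sketch only illustrates that nothing on the API side prevents the scale-up.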
Description of problem:
A non-functional filter component should not appear in the resources section on the Search page in phone view
Version-Release number of selected component (if applicable):
4.16.0-0.nightly-2023-12-15-211129
How reproducible:
Always
Steps to Reproduce:
1. Change to phone view for the browser (Browser -> F12 - Toggle device toolbar) eg: iPhone 14 Pro Max 2. Navigate to Home -> Search page, select one resource eg: APIRequestCounts 3. Check the component in the resources panel
Actual results:
There is a non-functional filter icon under the 'Create APIRequestCount' button
Expected results:
Remove the filter component in phone view, or make sure the filter works in phone view if customers need it
Additional info:
https://drive.google.com/file/d/1Fwb8EGznWkA1z3cVVzGcJJjMFkjkuUhK/view?usp=drive_link
This is a clone of issue OCPBUGS-34538. The following is the description of the original issue:
—
Description of problem:
Since we aim to remove PF4 and ReactRouter5 in 4.18, we need to deprecate these shared modules in 4.16 to give plugin creators time to update their plugins.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
The 4.13 CPO fails to reconcile
{"level":"error","ts":"2024-04-03T18:45:28Z","msg":"Reconciler error","controller":"hostedcontrolplane","controllerGroup":"hypershift.openshift.io","controllerKind":"HostedControlPlane","hostedControlPlane":{"name":"sjenning-guest","namespace":"clusters-sjenning-guest"},"namespace":"clusters-sjenning-guest","name":"sjenning-guest","reconcileID":"35a91dd1-0066-4c81-a6a4-14770ffff61d","error":"failed to update control plane: failed to reconcile router: failed to reconcile router role: roles.rbac.authorization.k8s.io \"router\" is forbidden: user \"system:serviceaccount:clusters-sjenning-guest:control-plane-operator\" (groups=[\"system:serviceaccounts\" \"system:serviceaccounts:clusters-sjenning-guest\" \"system:authenticated\"]) is attempting to grant RBAC permissions not currently held:\n{APIGroups:[\"security.openshift.io\"], Resources:[\"securitycontextconstraints\"], ResourceNames:[\"hostnetwork\"], Verbs:[\"use\"]}","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/hypershift/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:273\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/hypershift/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:234"}
When installation fails, status_info is reporting an incorrect status
Most likely it is one of these two scenarios:
Description of problem:
Update OWNERS file in route-controller-manager repository.
Version-Release number of selected component (if applicable):
4.15
How reproducible:
n/a
Steps to Reproduce:
n/a
Actual results:
n/a
Expected results:
n/a
Additional info:
Description of the problem:
Before we create a cluster, the UI fetches the list of OpenShift versions with latest_only=false and with latest_only=true.
The CPU architectures returned are not the same in both responses.
Example:
Latest: "4.11.59": { "cpu_architectures": [ "x86_64" ], "display_name": "4.11.59", "support_level": "end-of-life" }, "4.12.53": { "cpu_architectures": [ "x86_64" ], "display_name": "4.12.53", "support_level": "maintenance" },
from all:
"4.11.59": { "cpu_architectures": [ "x86_64", "arm64" ], "display_name": "4.11.59", "support_level": "end-of-life" }, "4.12.53": { "cpu_architectures": [ "x86_64", "arm64" ], "display_name": "4.12.53", "support_level": "maintenance" },
How reproducible:
Always
Steps to reproduce:
Before creating a cluster, open the browser debug tools and inspect the OpenShift versions returned, once with latest_only=true and once with latest_only=false.
The same CPU architectures are expected in both responses.
Actual results:
Expected results:
This is a clone of issue OCPBUGS-35530. The following is the description of the original issue:
—
Description of problem:
Bootstrap destroy failed in CI with: level=fatal msg=error destroying bootstrap resources failed during the destroy bootstrap hook: failed to remove bootstrap SSH rule: failed to update AWSCluster during bootstrap destroy: Operation cannot be fulfilled on awsclusters.infrastructure.cluster.x-k8s.io "ci-op-nk1s6685-77004-4gb4d": the object has been modified; please apply your changes to the latest version and try again
Version-Release number of selected component (if applicable):
How reproducible:
Unclear. CI search returns no results. Observed it as a single failure (aws-ovn job, linked below) in the testing of https://amd64.ocp.releases.ci.openshift.org/releasestream/4.17.0-0.nightly/release/4.17.0-0.nightly-2024-06-15-004118
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Two possible solutions:
Description of problem:
Starting with OpenShift 4.8 (https://docs.openshift.com/container-platform/4.8/release_notes/ocp-4-8-release-notes.html#ocp-4-8-notable-technical-changes), all pods get bound SA tokens. Currently, instead of expiring the token, we use `service-account-extend-token-expiration`, which extends a bound token's validity to one year and warns when a token is used that would otherwise have expired. We want to disable this behavior in a future OpenShift release, which would break the OpenShift web console.
Version-Release number of selected component (if applicable):
4.8 - 4.14
How reproducible:
100%
Steps to Reproduce:
1. install a fresh cluster 2. wait ~1hr since console pods were deployed for the token rotation to occur 3. log in to the console and click around 4. check the kube-apiserver audit logs events for the "authentication.k8s.io/stale-token" annotation
Actual results:
many occurrences (I doubt I'll be able to upload a text file, so I'll show a few audit events in the first comment).
Expected results:
The web-console re-reads the SA token regularly so that it never uses an expired token
Additional info:
In a theoretical case where a console pod lasts for a year, it's going to break and won't be able to authenticate to the kube-apiserver. We are planning on disallowing the use of stale tokens in a future release and we need to make sure that the core platform is not broken so that the metrics we collect from the clusters in the wild are not polluted.
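One way to satisfy the expected behavior without caching a token for the pod's lifetime is to re-read the projected token file on every use, since the kubelet keeps that file refreshed as the token rotates. A minimal sketch of that idea, not the console's actual implementation:

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// saTokenPath is the standard projected service account token location.
const saTokenPath = "/var/run/secrets/kubernetes.io/serviceaccount/token"

// currentToken re-reads the token file on every call instead of caching it
// once at startup, so rotations performed by the kubelet are picked up and
// a long-lived pod never presents a stale token to the kube-apiserver.
func currentToken() (string, error) {
	b, err := os.ReadFile(saTokenPath)
	if err != nil {
		return "", fmt.Errorf("reading service account token: %w", err)
	}
	return strings.TrimSpace(string(b)), nil
}

func main() {
	tok, err := currentToken()
	if err != nil {
		panic(err)
	}
	fmt.Printf("token length: %d\n", len(tok))
}
```

In practice a small in-memory cache with a short TTL would avoid a file read per request while still tracking rotation.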
I took a look at Component Readiness today and noticed that "[sig-cluster-lifecycle] cluster upgrade should complete in a reasonable time" is permafailing. I modified the sample start time and saw that it appears to have started around February 19th.
Is this expected with 4.16 or do we have a problem?
Component Readiness has found a potential regression in [sig-cluster-lifecycle] cluster upgrade should complete in a reasonable time.
Probability of significant regression: 100.00%
Sample (being evaluated) Release: 4.16
Start Time: 2024-02-27T00:00:00Z
End Time: 2024-03-04T23:59:59Z
Success Rate: 0.00%
Successes: 0
Failures: 4
Flakes: 0
Base (historical) Release: 4.15
Start Time: 2024-02-01T00:00:00Z
End Time: 2024-02-28T23:59:59Z
Success Rate: 100.00%
Successes: 47
Failures: 0
Flakes: 0
Please review the following PR: https://github.com/openshift/operator-framework-operator-controller/pull/86
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
ResourceYAMLEditor doesn't have a readOnly prop, which would allow hiding the Save button in the YAML editor so that users cannot edit the resource. https://github.com/openshift/console/blob/master/frontend/packages/console-dynamic-plugin-sdk/docs/api.md#resourceyamleditor
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
The periodic-ci-openshift-assisted-service-master-edge-e2e-ai-operator-disconnected-capi-periodic job is permafailing; see https://prow.ci.openshift.org/job-history/gs/test-platform-results/logs/periodic-ci-openshift-assisted-service-master-edge-e2e-ai-operator-disconnected-capi-periodic
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
tlsSecurityProfile definitions do not align with the documentation. When using `oc explain`, the field descriptions note that certain values are unsupported, but the same values are listed as supported in the OpenShift documentation. This needs to be clarified, and the spacing in the descriptions should be fixed as they are hard to read.
Version-Release number of selected component (if applicable):
4.14.1
How reproducible:
⇒ oc explain ingresscontroller.spec.tlsSecurityProfile.modern
Steps to Reproduce:
1. Check the `oc explain` output
Actual results:
⇒ oc explain ingresscontroller.spec.tlsSecurityProfile.modern KIND: IngressController VERSION: operator.openshift.io/v1DESCRIPTION: modern is a TLS security profile based on: https://wiki.mozilla.org/Security/Server_Side_TLS#Modern_compatibility and looks like this (yaml): ciphers: - TLS_AES_128_GCM_SHA256 - TLS_AES_256_GCM_SHA384 - TLS_CHACHA20_POLY1305_SHA256 minTLSVersion: TLSv1.3 NOTE: Currently unsupported.
Expected results:
An output that aligns with the documentation regarding supported/unsupported TLS versions. Additionally, fixing the output format would be useful, as it is very hard to read in its current form. The 4.14 documentation states: ``` The HAProxy Ingress Controller image supports TLS 1.3 and the Modern profile. ```
Additional info:
The `apiserver` CR should also be checked for the same thing.
Description of problem:
During automated test execution for the dev-console package, it is observed that Cypress fails the ongoing test due to an "uncaught:exception : ResizeObserver limit exceed" error, even though there is no visible failure in the UI.
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info: Screenshot
This is a clone of issue OCPBUGS-34619. The following is the description of the original issue:
—
Description of problem:
`make test` is failing in the openshift/coredns repo due to a TestImportOrdering failure. This is due to the recent addition of the github.com/openshift/coredns-ocp-dnsnameresolver external plugin and the fact that CoreDNS doesn't generate zplugin.go with the correct formatting, so TestImportOrdering fails after generation.
Version-Release number of selected component (if applicable):
4.16-4.17
How reproducible:
100%
Steps to Reproduce:
1. make test
Actual results:
TestImportOrdering failure
Expected results:
TestImportOrdering should not fail
Additional info:
I created an upstream issue and PR: https://github.com/coredns/coredns/pull/6692 which recently merged. We will just need to carry-patch this in 4.17 and 4.16. The CoreDNS 1.11.3 rebase https://github.com/openshift/coredns/pull/118 is blocked on this.
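For context, the failure described here is formatting drift in a generated file. A hedged illustration of that kind of check, using go/format on the generated file (this is not CoreDNS's actual TestImportOrdering code, and the file path is illustrative):

```go
package main

import (
	"bytes"
	"fmt"
	"go/format"
	"os"
)

// main reports whether a generated Go file (zplugin.go is the file named in
// this bug) is already gofmt-clean; generation that emits unformatted output
// makes a check like this fail until the generator formats its result.
func main() {
	src, err := os.ReadFile("core/plugin/zplugin.go") // path is illustrative
	if err != nil {
		panic(err)
	}
	formatted, err := format.Source(src)
	if err != nil {
		panic(err)
	}
	if !bytes.Equal(src, formatted) {
		fmt.Println("zplugin.go is not gofmt-formatted (import ordering or spacing differs)")
		os.Exit(1)
	}
	fmt.Println("zplugin.go is formatted correctly")
}
```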
Description of problem:
- Pods that reside in a namespace utilizing EgressIP are experiencing intermittent TCP IO timeouts when attempting to communicate with external services.
❯ oc exec gitlab-runner-aj-02-56998875b-n6xxb -- bash -c 'while true; do timeout 3 bash -c "</dev/tcp/10.135.108.56/443" && echo "Connection success" || echo "Connection timeout"; sleep 0.5; done'
Connection success
Connection timeout
Connection timeout
Connection timeout
Connection timeout
Connection timeout
Connection success
Connection timeout
Connection success
# Get pod node and podIP variable for the problematic pod
❯ oc get pod gitlab-runner-aj-02-56998875b-n6xxb -ojson 2>/dev/null | jq -r '"\(.metadata.name) \(.spec.nodeName) \(.status.podIP)"' | read -r pod node podip
# Find the ovn-kubernetes pod running on the same node as gitlab-runner-aj-02-56998875b-n6xxb
❯ oc get pods -n openshift-ovn-kubernetes -lapp=ovnkube-node -ojson | jq --arg node "$node" -r '.items[] | select(.spec.nodeName == $node)| .metadata.name' | read -r ovn_pod
# Collect each possible logical switch port address into variable LSP_ADDRESSES
❯ LSP_ADDRESSES=$(oc -n openshift-ovn-kubernetes exec ${ovn_pod} -it -c northd -- bash -c 'ovn-nbctl lsp-list transit_switch | while read guid name; do printf "%s " "${name}"; ovn-nbctl lsp-get-addresses "${guid}"; done')
# List the logical router policy for the problematic pod
❯ oc -n openshift-ovn-kubernetes exec ${ovn_pod} -c northd -- ovn-nbctl find logical_router_policy match="\"ip4.src == ${podip}\""
_uuid               : c55bec59-6f9a-4f01-a0b1-67157039edb8
action              : reroute
external_ids        : {name=gitlab-runner-caasandpaas-egress}
match               : "ip4.src == 172.40.114.40"
nexthop             : []
nexthops            : ["100.88.0.22", "100.88.0.57"]
options             : {}
priority            : 100
# Check whether each nexthop entry exists in the LSP addresses table
❯ echo $LSP_ADDRESSES | grep 100.88.0.22
(tstor-c1nmedi01-9x2g9-worker-cloud-paks-m9t6b) 0a:58:64:58:00:16 100.88.0.22/16
❯ echo $LSP_ADDRESSES | grep 100.88.0.57
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1.
2.
3.
Actual results:
Expected results:
Additional info:
Description of problem:
When installing a new vSphere cluster with static IPs, control plane machine sets (CPMS) are also enabled in TechPreviewNoUpgrade, and the installer applies an incorrect config to the CPMS, resulting in the masters being recreated.
Version-Release number of selected component (if applicable):
4.15
How reproducible:
always
Steps to Reproduce:
1. create install-config.yaml with static IPs following documentation 2. run `openshift-install create cluster` 3. as install progresses, watch the machines definitions
Actual results:
new master machines are created
Expected results:
all machines are the same as what was created by the installer.
Additional info:
This is a clone of issue OCPBUGS-36378. The following is the description of the original issue:
—
Description of problem:
When creating cluster with service principal certificate, as known issues OCPBUGS-36360, installer exited with error. # ./openshift-install create cluster --dir ipi6 INFO Credentials loaded from file "/root/.azure/osServicePrincipal.json" WARNING Using client certs to authenticate. Please be warned cluster does not support certs and only the installer does. INFO Consuming Install Config from target directory WARNING FeatureSet "CustomNoUpgrade" is enabled. This FeatureSet does not allow upgrades and may affect the supportability of the cluster. INFO Creating infrastructure resources... INFO Started local control plane with envtest INFO Stored kubeconfig for envtest in: /tmp/jima/ipi6/.clusterapi_output/envtest.kubeconfig WARNING Using client certs to authenticate. Please be warned cluster does not support certs and only the installer does. INFO Running process: Cluster API with args [-v=2 --diagnostics-address=0 --health-addr=127.0.0.1:36847 --webhook-port=38905 --webhook-cert-dir=/tmp/envtest-serving-certs-941163289 --kubeconfig=/tmp/jima/ipi6/.clusterapi_output/envtest.kubeconfig] INFO Running process: azure infrastructure provider with args [-v=2 --health-addr=127.0.0.1:44743 --webhook-port=35373 --webhook-cert-dir=/tmp/envtest-serving-certs-3807817663 --feature-gates=MachinePool=false --kubeconfig=/tmp/jima/ipi6/.clusterapi_output/envtest.kubeconfig] INFO Running process: azureaso infrastructure provider with args [-v=0 -metrics-addr=0 -health-addr=127.0.0.1:45179 -webhook-port=37401 -webhook-cert-dir=/tmp/envtest-serving-certs-1364466879 -crd-pattern= -crd-management=none] ERROR failed to fetch Cluster: failed to generate asset "Cluster": failed to create cluster: failed to run cluster api system: failed to run controller "azureaso infrastructure provider": failed to start controller "azureaso infrastructure provider": timeout waiting for process cluster-api-provider-azureaso to start successfully (it may have failed to start, or stopped unexpectedly before becoming ready) INFO Shutting down local Cluster API control plane... INFO Local Cluster API system has completed operations From output, local cluster API system is shut down. But when checking processes, only parent process installer exit, CAPI related processes are still running. 
When local control plane is running: # ps -ef|grep cluster | grep -v grep root 13355 6900 39 08:07 pts/1 00:00:13 ./openshift-install create cluster --dir ipi6 root 13365 13355 2 08:08 pts/1 00:00:00 ipi6/cluster-api/etcd --advertise-client-urls=http://127.0.0.1:41341 --data-dir=ipi6/.clusterapi_output/etcd --listen-client-urls=http://127.0.0.1:41341 --listen-peer-urls=http://127.0.0.1:34081 --unsafe-no-fsync=true root 13373 13355 55 08:08 pts/1 00:00:10 ipi6/cluster-api/kube-apiserver --allow-privileged=true --authorization-mode=RBAC --bind-address=127.0.0.1 --cert-dir=/tmp/k8s_test_framework_50606349 --client-ca-file=/tmp/k8s_test_framework_50606349/client-cert-auth-ca.crt --disable-admission-plugins=ServiceAccount --etcd-servers=http://127.0.0.1:41341 --secure-port=38483 --service-account-issuer=https://127.0.0.1:38483/ --service-account-key-file=/tmp/k8s_test_framework_50606349/sa-signer.crt --service-account-signing-key-file=/tmp/k8s_test_framework_50606349/sa-signer.key --service-cluster-ip-range=10.0.0.0/24 root 13385 13355 0 08:08 pts/1 00:00:00 ipi6/cluster-api/cluster-api -v=2 --diagnostics-address=0 --health-addr=127.0.0.1:36847 --webhook-port=38905 --webhook-cert-dir=/tmp/envtest-serving-certs-941163289 --kubeconfig=/tmp/jima/ipi6/.clusterapi_output/envtest.kubeconfig root 13394 13355 6 08:08 pts/1 00:00:00 ipi6/cluster-api/cluster-api-provider-azure -v=2 --health-addr=127.0.0.1:44743 --webhook-port=35373 --webhook-cert-dir=/tmp/envtest-serving-certs-3807817663 --feature-gates=MachinePool=false --kubeconfig=/tmp/jima/ipi6/.clusterapi_output/envtest.kubeconfig After installer exited: # ps -ef|grep cluster | grep -v grep root 13365 1 1 08:08 pts/1 00:00:01 ipi6/cluster-api/etcd --advertise-client-urls=http://127.0.0.1:41341 --data-dir=ipi6/.clusterapi_output/etcd --listen-client-urls=http://127.0.0.1:41341 --listen-peer-urls=http://127.0.0.1:34081 --unsafe-no-fsync=true root 13373 1 45 08:08 pts/1 00:00:35 ipi6/cluster-api/kube-apiserver --allow-privileged=true --authorization-mode=RBAC --bind-address=127.0.0.1 --cert-dir=/tmp/k8s_test_framework_50606349 --client-ca-file=/tmp/k8s_test_framework_50606349/client-cert-auth-ca.crt --disable-admission-plugins=ServiceAccount --etcd-servers=http://127.0.0.1:41341 --secure-port=38483 --service-account-issuer=https://127.0.0.1:38483/ --service-account-key-file=/tmp/k8s_test_framework_50606349/sa-signer.crt --service-account-signing-key-file=/tmp/k8s_test_framework_50606349/sa-signer.key --service-cluster-ip-range=10.0.0.0/24 root 13385 1 0 08:08 pts/1 00:00:00 ipi6/cluster-api/cluster-api -v=2 --diagnostics-address=0 --health-addr=127.0.0.1:36847 --webhook-port=38905 --webhook-cert-dir=/tmp/envtest-serving-certs-941163289 --kubeconfig=/tmp/jima/ipi6/.clusterapi_output/envtest.kubeconfig root 13394 1 0 08:08 pts/1 00:00:00 ipi6/cluster-api/cluster-api-provider-azure -v=2 --health-addr=127.0.0.1:44743 --webhook-port=35373 --webhook-cert-dir=/tmp/envtest-serving-certs-3807817663 --feature-gates=MachinePool=false --kubeconfig=/tmp/jima/ipi6/.clusterapi_output/envtest.kubeconfig Another scenario, ran capi-based installer on the small disk, and installer stuck there and didn't exit until interrupted until <Ctrl> + C. Then checked that all CAPI related processes were still running, only installer process was killed. 
[root@jima09id-vm-1 jima]# ./openshift-install create cluster --dir ipi4 INFO Credentials loaded from file "/root/.azure/osServicePrincipal.json" INFO Consuming Install Config from target directory WARNING FeatureSet "CustomNoUpgrade" is enabled. This FeatureSet does not allow upgrades and may affect the supportability of the cluster. INFO Creating infrastructure resources... INFO Started local control plane with envtest INFO Stored kubeconfig for envtest in: /tmp/jima/ipi4/.clusterapi_output/envtest.kubeconfig INFO Running process: Cluster API with args [-v=2 --diagnostics-address=0 --health-addr=127.0.0.1:42017 --webhook-port=41085 --webhook-cert-dir=/tmp/envtest-serving-certs-1774658110 --kubeconfig=/tmp/jima/ipi4/.clusterapi_output/envtest.kubeconfig] INFO Running process: azure infrastructure provider with args [-v=2 --health-addr=127.0.0.1:38387 --webhook-port=37783 --webhook-cert-dir=/tmp/envtest-serving-certs-1319713198 --feature-gates=MachinePool=false --kubeconfig=/tmp/jima/ipi4/.clusterapi_output/envtest.kubeconfig] FATAL failed to extract "ipi4/cluster-api/cluster-api-provider-azureaso": write ipi4/cluster-api/cluster-api-provider-azureaso: no space left on device ^CWARNING Received interrupt signal ^C[root@jima09id-vm-1 jima]# [root@jima09id-vm-1 jima]# ps -ef|grep cluster | grep -v grep root 12752 1 0 07:38 pts/1 00:00:00 ipi4/cluster-api/etcd --advertise-client-urls=http://127.0.0.1:38889 --data-dir=ipi4/.clusterapi_output/etcd --listen-client-urls=http://127.0.0.1:38889 --listen-peer-urls=http://127.0.0.1:38859 --unsafe-no-fsync=true root 12760 1 4 07:38 pts/1 00:00:09 ipi4/cluster-api/kube-apiserver --allow-privileged=true --authorization-mode=RBAC --bind-address=127.0.0.1 --cert-dir=/tmp/k8s_test_framework_3790461974 --client-ca-file=/tmp/k8s_test_framework_3790461974/client-cert-auth-ca.crt --disable-admission-plugins=ServiceAccount --etcd-servers=http://127.0.0.1:38889 --secure-port=44429 --service-account-issuer=https://127.0.0.1:44429/ --service-account-key-file=/tmp/k8s_test_framework_3790461974/sa-signer.crt --service-account-signing-key-file=/tmp/k8s_test_framework_3790461974/sa-signer.key --service-cluster-ip-range=10.0.0.0/24 root 12769 1 0 07:38 pts/1 00:00:00 ipi4/cluster-api/cluster-api -v=2 --diagnostics-address=0 --health-addr=127.0.0.1:42017 --webhook-port=41085 --webhook-cert-dir=/tmp/envtest-serving-certs-1774658110 --kubeconfig=/tmp/jima/ipi4/.clusterapi_output/envtest.kubeconfig root 12781 1 0 07:38 pts/1 00:00:00 ipi4/cluster-api/cluster-api-provider-azure -v=2 --health-addr=127.0.0.1:38387 --webhook-port=37783 --webhook-cert-dir=/tmp/envtest-serving-certs-1319713198 --feature-gates=MachinePool=false --kubeconfig=/tmp/jima/ipi4/.clusterapi_output/envtest.kubeconfig root 12851 6900 1 07:41 pts/1 00:00:00 ./openshift-install destroy cluster --dir ipi4
Version-Release number of selected component (if applicable):
4.17 nightly build
How reproducible:
Always
Steps to Reproduce:
1. Run capi-based installer 2. Installer failed to start some capi process and exited 3.
Actual results:
Installer process exited, but capi related processes are still running
Expected results:
Both installer and all capi related processes are exited.
Additional info:
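The symptom above is the classic orphaned-subprocess pattern: the parent exits (or is interrupted) without tearing down the helpers it spawned. A minimal sketch of one way to guarantee cleanup on Linux, not the installer's actual code, is to start each helper in its own process group and signal the whole group on the way out:

```go
package main

import (
	"os/exec"
	"syscall"
)

// runManaged starts a helper binary in its own process group and returns a
// stop function that signals the entire group, so processes spawned by the
// helper do not outlive the parent either.
func runManaged(binary string, args ...string) (stop func() error, err error) {
	cmd := exec.Command(binary, args...)
	cmd.SysProcAttr = &syscall.SysProcAttr{Setpgid: true}
	if startErr := cmd.Start(); startErr != nil {
		return nil, startErr
	}
	return func() error {
		// A negative PID addresses the process group created by Setpgid.
		return syscall.Kill(-cmd.Process.Pid, syscall.SIGTERM)
	}, nil
}

func main() {
	stop, err := runManaged("sleep", "300") // stand-in for etcd/kube-apiserver helpers
	if err != nil {
		panic(err)
	}
	// Deferred cleanup runs even on early fatal exits within this function,
	// so a failed later step still tears the helpers down.
	defer stop()
}
```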
Because the prow workflow references the older path, we saw consistent failures for the /test versions job.
RHOCP installation on RHOSP fails with an error
~~~
$ ansible-playbook -i inventory.yaml security-groups.yaml
fatal: [localhost]: FAILED! => {"changed": false, "msg": "Incompatible openstacksdk library found: Version MUST be >=1.0 and <=None, but 0.36.5 is smaller than minimum version 1.0."}
~~~
Packages Installed :
ansible-2.9.27-1.el8ae.noarch Fri Oct 13 06:56:05 2023
python3-netaddr-0.7.19-8.el8.noarch Fri Oct 13 06:55:44 2023
python3-openstackclient-4.0.2-2.20230404115110.54bf2c0.el8ost.noarch Tue Nov 21 01:38:32 2023
python3-openstacksdk-0.36.5-2.20220111021051.feda828.el8ost.noarch Fri Oct 13 06:55:52 2023
Document followed :
https://docs.openshift.com/container-platform/4.13/installing/installing_openstack/installing-openstack-user.html#installation-osp-downloading-modules_installing-openstack-user
This is a clone of issue OCPBUGS-31073. The following is the description of the original issue:
—
Description of problem
I had a version of MTC installed on my cluster when it was running a prior version. I had deleted it some time ago, long before upgrading to 4.15. I upgraded it to 4.15 and needed to reinstall to take a look at something, but found the operator would not install.
I originally tried with 4.15.0, but on failure upgraded to 4.15.3 to see if it would resolve the issue; it did not.
Version-Release number of selected component (if applicable):
$ oc version Client Version: 4.15.3 Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3 Server Version: 4.15.3 Kubernetes Version: v1.28.7+6e2789b
How reproducible:
Always as far as I can tell. I have at least two clusters where I was able to reproduce it.
Steps to Reproduce:
1. Install Migration Toolkit for Containers on OpenShift 4.14 2. Uninstall it 3. Upgrade to 4.15 4. Try to install it again
Actual results:
The operator never installs. The UI just shows "Upgrade status: Unknown Failure". Observe the catalog operator logs and note errors like: E0319 21:35:57.350591 1 queueinformer_operator.go:319] sync {"update" "openshift-migration"} failed: bundle unpacking failed with an error: [roles.rbac.authorization.k8s.io "c1572438804f004fb90b6768c203caad96c47331f7ecc4f68c3cf6b43b0acfd" already exists, roles.rbac.authorization.k8s.io "724788f6766aa5ba19b24ef4619b6a8e8e856b8b5fb96e1380f0d3f5b9dcb7a" already exists] If you delete the roles, you'll get the same for rolebindings, then the same for jobs.batch, and then configmaps.
Expected results:
Operator just installs
Additional info:
If you clean up all these resources the operator will install successfully.
Description of problem:
When creating an IAM role with a "path" (https://docs.aws.amazon.com/IAM/latest/UserGuide/reference_identifiers.html#identifiers-friendly-names), its "principal", when applied to trust policies or VPC Endpoint Service allowed principals, confusingly does not include the path. That is, for the following rolesRef on a hostedcluster:
spec:
  platform:
    aws:
      rolesRef:
        controlPlaneOperatorARN: arn:aws:iam::765374464689:role/adecorte/ad-int-path1-y4y2-kube-system-control-plane-operator
        imageRegistryARN: arn:aws:iam::765374464689:role/adecorte/ad-int-path1-y4y2-openshift-image-registry-installer-cloud-crede
        ingressARN: arn:aws:iam::765374464689:role/adecorte/ad-int-path1-y4y2-openshift-ingress-operator-cloud-credentials
        kubeCloudControllerARN: arn:aws:iam::765374464689:role/adecorte/ad-int-path1-y4y2-kube-system-kube-controller-manager
        networkARN: arn:aws:iam::765374464689:role/adecorte/ad-int-path1-y4y2-openshift-cloud-network-config-controller-clou
        nodePoolManagementARN: arn:aws:iam::765374464689:role/adecorte/ad-int-path1-y4y2-kube-system-capa-controller-manager
        storageARN: arn:aws:iam::765374464689:role/adecorte/ad-int-path1-y4y2-openshift-cluster-csi-drivers-ebs-cloud-creden
The actual valid principal that should be added to the VPC Endpoint Service's allowed principals is:
arn:aws:iam::765374464689:role/ad-int-path1-y4y2-kube-system-control-plane-operator
instead of
arn:aws:iam::765374464689:role/adecorte/ad-int-path1-y4y2-kube-system-control-plane-operator
However, for all other cases, the full ARN including the path should be used, e.g. https://github.com/openshift/hypershift/blob/082e880d0a492a357663d620fa58314a4a477730/hypershift-operator/controllers/hostedcluster/internal/platform/aws/aws.go#L237-L273
Version-Release number of selected component (if applicable):
4.14.1
How reproducible:
100%
Steps to Reproduce:
ROSA HCP-specific steps:
1. rosa create account-roles --path /anything/ -m auto -y 2. rosa create cluster --hosted-cp 3. ...etc 4a. Observe on the hosted cluster AWS Account that the VPC Endpoint cannot be created with the error: 'failed to create vpc endpoint: InvalidServiceName' 4b. Observe on the management cluster that CPO is failing to update the VPC Endpoint Service's allowed principals with the error: Client.InvalidPrincipal 5. If the contents of .spec.platform.aws.rolesRef.controlPlaneOperatorARN are manually applied to the additional allowed principals with the path component removed, then the problems are largely fixed on the hosted cluster side. VPC Endpoint is created, worker nodes can spin up, etc.
Actual results:
The VPC Endpoint Service is attempting and failing to get this applied to its additional allowed principals: arn:aws:iam::${ACCOUNT_ID}:role/path/name
Expected results:
The VPC Endpoint Service gets this applied to its additional allowed principals: arn:aws:iam::${ACCOUNT_ID}:role/name
Additional info:
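For illustration, the transformation the VPC Endpoint Service principal needs is simply dropping the path segment between "role/" and the final role name, while all other references keep the full ARN. A hedged sketch (not HyperShift's actual helper) using the ARN from this report:

```go
package main

import (
	"fmt"
	"strings"
)

// principalFromRoleARN strips the optional IAM path from a role ARN, e.g.
// arn:aws:iam::123456789012:role/some/path/my-role ->
// arn:aws:iam::123456789012:role/my-role.
// Trust policies and other references should keep using the full ARN.
func principalFromRoleARN(arn string) string {
	const marker = ":role/"
	i := strings.Index(arn, marker)
	if i < 0 {
		return arn // not a role ARN; leave unchanged
	}
	prefix := arn[:i+len(marker)]
	rest := arn[i+len(marker):]
	// The role name is the last path segment.
	if j := strings.LastIndex(rest, "/"); j >= 0 {
		rest = rest[j+1:]
	}
	return prefix + rest
}

func main() {
	fmt.Println(principalFromRoleARN(
		"arn:aws:iam::765374464689:role/adecorte/ad-int-path1-y4y2-kube-system-control-plane-operator"))
	// Prints: arn:aws:iam::765374464689:role/ad-int-path1-y4y2-kube-system-control-plane-operator
}
```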
As part of our investigation into GCP disruption we want an endpoint separate from the cluster under test but inside GCP to monitor for connectivity.
One approach is to use a GCP Cloud Function with an HTTP trigger
Another alternative is to stand up our own server and collect logging
We need to consider the cost of implementation, the cost of maintenance, and how well the implementation lines up with our overall test scenario (we want to use this as a control to compare with reaching a pod within a cluster under test)
We may want to also consider standing up similar endpoints in AWS and Azure in the future.
A separate story will cover monitoring the endpoint from within Origin
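As a sense of scale for the Cloud Function option, the endpoint itself can be tiny. A sketch of an HTTP-triggered Go function that simply answers 200, which is all a disruption monitor needs to distinguish "GCP reachable" from "cluster-side problem" (package and function names are illustrative, not an existing implementation):

```go
// Package probe holds a minimal HTTP-triggered Cloud Function body.
package probe

import (
	"fmt"
	"net/http"
)

// Health is the exported entry point given at deploy time; it answers 200 OK
// so an external monitor has a stable target to poll for connectivity.
func Health(w http.ResponseWriter, r *http.Request) {
	w.WriteHeader(http.StatusOK)
	fmt.Fprintln(w, "ok")
}
```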
Description of problem:
From a test run in [1] we can't be sure whether the call to etcd was really deadlocked or just waiting for a result. Currently the CheckingSyncWrapper only defines "alive" as a sync func that has not returned an error. This can be wrong in scenarios where a member is down and perpetually not reachable. Instead, we wanted to detect deadlock situations where the sync loop is just stuck for a prolonged period of time. [1] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.16-upgrade-from-stable-4.15-e2e-metal-ipi-upgrade-ovn-ipv6/1762965898773139456/
Version-Release number of selected component (if applicable):
>4.14
How reproducible:
Always
Steps to Reproduce:
1. create a healthy cluster 2. make sure one etcd member never responds, but the node is still there (ie kubelet shutdown, blocking the etcd ports on a firewall) 3. wait for the CEO to restart pod on failing health probe and dump its stack
Actual results:
CEO controllers are returning errors, but might not deadlock, which currently results in a restart
Expected results:
CEO should mark the member as unhealthy and continue its service without getting deadlocked and should not restart its pod by failing the health probe
Additional info:
highly related to OCPBUGS-30169
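A hedged sketch of the distinction being asked for: instead of treating "returned without error" as alive, record when each sync last returned at all and only fail the health probe once no return has been observed for a generous timeout. This is an illustration of the idea, not the operator's actual CheckingSyncWrapper:

```go
package main

import (
	"sync/atomic"
	"time"
)

// checkingSync wraps a sync func and records the last time it returned,
// regardless of whether it returned an error. A member being down then shows
// up as repeated errors (still "alive"), while a true deadlock shows up as
// no return at all for a prolonged period.
type checkingSync struct {
	lastReturn atomic.Int64 // unix nanos of the last completed sync pass
	sync       func() error
}

func (c *checkingSync) run() error {
	err := c.sync()
	c.lastReturn.Store(time.Now().UnixNano())
	return err
}

// healthy reports whether the sync loop has completed a pass recently enough;
// this is the property a liveness probe should check before restarting the pod.
func (c *checkingSync) healthy(maxStall time.Duration) bool {
	last := time.Unix(0, c.lastReturn.Load())
	return time.Since(last) < maxStall
}

func main() {
	s := &checkingSync{sync: func() error { return nil }}
	s.lastReturn.Store(time.Now().UnixNano())
	_ = s.run()
	_ = s.healthy(5 * time.Minute)
}
```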
This is a clone of issue OCPBUGS-42108. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-38012. The following is the description of the original issue:
—
Description of problem:
Customers are unable to scale up OCP nodes when the initial setup was done with OCP 4.8/4.9 and the cluster was then upgraded to 4.15.22/4.15.23. At first the customer observed that the node scale-up failed and /etc/resolv.conf was empty on the new nodes. As a workaround, the customer copied the resolv.conf content from a correct resolv.conf, after which setup of the new node continued. They then inspected the rendered MachineConfig assembled from 00-worker and suspected that something was wrong with the on-prem-resolv-prepender.service definition. As a further workaround, the customer manually changed this service definition, which allowed them to scale up new nodes.
Version-Release number of selected component (if applicable):
4.15 , 4.16
How reproducible:
100%
Steps to Reproduce:
1. Install OCP vSphere IPI cluster version 4.8 or 4.9 2. Check "on-prem-resolv-prepender.service" service definition 3. Upgrade it to 4.15.22 or 4.15.23 4. Check if the node scaling is working 5. Check "on-prem-resolv-prepender.service" service definition
Actual results:
Unable to scaleup node with default service definition. After manually making changes in the service definition , scaling is working.
Expected results:
Node sclaing should work without making any manual changes in the service definition.
Additional info:
on-prem-resolv-prepender.service content on clusters built with 4.8 / 4.9 and then upgraded to 4.15.22 / 4.15.23:
~~~
[Unit]
Description=Populates resolv.conf according to on-prem IPI needs
# Per https://issues.redhat.com/browse/OCPBUGS-27162 there is a problem if this is started before crio-wipe
After=crio-wipe.service

[Service]
Type=oneshot
Restart=on-failure
RestartSec=10
StartLimitIntervalSec=0
ExecStart=/usr/local/bin/resolv-prepender.sh
EnvironmentFile=/run/resolv-prepender/env
~~~
After manually correcting the service definition as below, scaling works on 4.15.22 / 4.15.23:
~~~
[Unit]
Description=Populates resolv.conf according to on-prem IPI needs
# Per https://issues.redhat.com/browse/OCPBUGS-27162 there is a problem if this is started before crio-wipe
After=crio-wipe.service
StartLimitIntervalSec=0 -----------> this

[Service]
Type=oneshot
#Restart=on-failure -----------> this
RestartSec=10
ExecStart=/usr/local/bin/resolv-prepender.sh
EnvironmentFile=/run/resolv-prepender/env
~~~
Below is the on-prem-resolv-prepender.service on a freshly installed 4.15.23 where scaling is working fine:
~~~
[Unit]
Description=Populates resolv.conf according to on-prem IPI needs
# Per https://issues.redhat.com/browse/OCPBUGS-27162 there is a problem if this is started before crio-wipe
After=crio-wipe.service
StartLimitIntervalSec=0

[Service]
Type=oneshot
Restart=on-failure
RestartSec=10
ExecStart=/usr/local/bin/resolv-prepender.sh
EnvironmentFile=/run/resolv-prepender/env
~~~
This was observed in the rendered MachineConfig, which is assembled from 00-worker.
Please review the following PR: https://github.com/openshift/cloud-provider-vsphere/pull/59
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
This is a clone of issue OCPBUGS-33661. The following is the description of the original issue:
—
Description of problem:
`preserveBootstrapIgnition` was named after the implementation details in terraform for how to make deleting S3 objects optional. The motivation behind the change was that some customers run installs in subscriptions where policies do not allow deleting s3 objects. They didn't want the install to fail because of that. With the move from terraform to capi/capa, this is now implemented differently: capa always tries to delete the s3 objects but will ignore any permission errors if `preserveBootstrapIgnition` is set. We should rename this option so it's clear that the objects will be deleted if there are enough permissions. My suggestion is to name something similar to what's named in CAPA: `allowBestEffortDeleteIgnition`. Ideally we should deprecate `preserveBootstrapIgnition` in 4.16 and remove it in 4.17.
Version-Release number of selected component (if applicable):
4.14+ but I don't think we want to change this for terraform-based installs
How reproducible:
always
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
https://github.com/openshift/installer/pull/7288
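For context, a minimal sketch of where this knob lives in an install-config; only `preserveBootstrapIgnition` is the current field, and the replacement name shown is the suggestion above, not a settled API:
~~~
# Hypothetical install-config.yaml excerpt (values are placeholders).
apiVersion: v1
metadata:
  name: mycluster
platform:
  aws:
    region: us-east-1
    # current field (to be deprecated): do not fail the install when the
    # bootstrap ignition S3 objects cannot be deleted
    preserveBootstrapIgnition: true
    # proposed replacement with clearer semantics (name is an assumption):
    # allowBestEffortDeleteIgnition: true
~~~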
Please review the following PR: https://github.com/openshift/cluster-autoscaler-operator/pull/305
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
This is a clone of issue OCPBUGS-35533. The following is the description of the original issue:
—
Description of problem:
Failed to deploy the cluster with the following error:
time="2024-06-13T14:01:11Z" level=debug msg="Creating the security group rules"
time="2024-06-13T14:01:19Z" level=error msg="failed to fetch Cluster: failed to generate asset \"Cluster\": failed to create cluster: failed during pre-provisioning: failed to create security groups: failed to create the security group rule on group \"cb9a607c-9799-4186-bc22-26f141ce91aa\" for IPv4 tcp on ports 1936-1936: Bad request with: [POST https://10.46.44.159:13696/v2.0/security-group-rules], error message: {\"NeutronError\": {\"type\": \"SecurityGroupRuleParameterConflict\", \"message\": \"Conflicting value ethertype IPv4 for CIDR fd2e:6f44:5dd8:c956::/64\", \"detail\": \"\"}}"
time="2024-06-13T14:01:20Z" level=debug msg="OpenShift Installer 4.17.0-0.nightly-2024-06-13-083330"
time="2024-06-13T14:01:20Z" level=debug msg="Built from commit 6bc75dfebaca79ecf302263af7d32d50c31f371a"
time="2024-06-13T14:01:20Z" level=debug msg="Loading Install Config..."
time="2024-06-13T14:01:20Z" level=debug msg=" Loading SSH Key..."
time="2024-06-13T14:01:20Z" level=debug msg=" Loading Base Domain..."
time="2024-06-13T14:01:20Z" level=debug msg=" Loading Platform..."
time="2024-06-13T14:01:20Z" level=debug msg=" Loading Cluster Name..."
time="2024-06-13T14:01:20Z" level=debug msg=" Loading Base Domain..."
time="2024-06-13T14:01:20Z" level=debug msg=" Loading Platform..."
time="2024-06-13T14:01:20Z" level=debug msg=" Loading Pull Secret..."
time="2024-06-13T14:01:20Z" level=debug msg=" Loading Platform..."
time="2024-06-13T14:01:20Z" level=debug msg="Using Install Config loaded from state file"
time="2024-06-13T14:01:20Z" level=debug msg="Loading Agent Config..."
time="2024-06-13T14:01:20Z" level=info msg="Waiting up to 40m0s (until 2:41PM UTC) for the cluster at https://api.ostest.shiftstack.com:6443 to initialize..."
Version-Release number of selected component (if applicable):
4.17.0-0.nightly-2024-06-13-083330
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Customer is asking for this flag so they can keep using v1 code even when v2 will be the default.
This change is killing payloads and thus the org is blocked at a fairly critical time.
Hitting the test: [OLM][invariant] alert/KubePodNotReady should not be at or above info in ns/openshift-marketplace
Possibly more.
Suspected to be related to https://github.com/openshift/origin/pull/28741, which merged Apr 25 20:11 UTC; the sippy db shows the failures starting at 21:19, and before that nothing for months.
Looks related to:
Failed to pull image "registry.redhat.io/redhat/redhat-marketplace-index:v4.16": copying system image from manifest list: reading signatures: parsing signature https://registry.redhat.io/containers/sigstore/redhat/redhat-marketplace-index@sha256=7ff75c6598abd1a2abe9fa3db8a805fa552798361272b983ea07c9e9ef22d686/signature-2: unrecognized signature format, starting with binary 0x3c
We suspect there is a problem with the images and the failure may be legitimate.
Use the sippy test details to view the pass rate for the test which exposes this day by day.
Please review the following PR: https://github.com/openshift/node_exporter/pull/140
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Component Readiness has found a potential regression in [sig-instrumentation] Prometheus [apigroup:image.openshift.io] when installed on the cluster shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured [Early][apigroup:config.openshift.io] [Skipped:Disconnected] [Suite:openshift/conformance/parallel].
Probability of significant regression: 99.79%
Sample (being evaluated) Release: 4.16
Start Time: 2024-05-08T00:00:00Z
End Time: 2024-05-14T23:59:59Z
Success Rate: 83.33%
Successes: 25
Failures: 5
Flakes: 0
Base (historical) Release: 4.15
Start Time: 2024-02-01T00:00:00Z
End Time: 2024-02-28T23:59:59Z
Success Rate: 100.00%
Successes: 67
Failures: 0
Flakes: 1
Notes:
We need to fix the following error:
I0514 12:31:46.014919 1 vsphere_check.go:272] CheckAccountPermissions failed: specified folder not found: folder '/IBMCdatacenter/vm/ci-op-1qvr0jdj-10b01' not found
This seems to be caused by the new powercli script: it is creating the folder based on the infra id instead of the cluster name. We'll change this to match the expected name and verify that the error is resolved.
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
This is a clone of issue OCPBUGS-33592. The following is the description of the original issue:
—
Description of problem:
While investigating a problem with OpenShift Container Platform 4 - Node scaling, I found the below messages reported in my OpenShift Container Platform 4 - Cluster.
E0513 11:15:09.331353 1 orchestrator.go:450] Couldn't get autoscaling options for ng: MachineSet/openshift-machine-api/test-12345-batch-amd64-us-east-2c
E0513 11:15:09.331365 1 orchestrator.go:450] Couldn't get autoscaling options for ng: MachineSet/openshift-machine-api/test-12345-batch-arm64-us-east-2c
I0513 11:15:09.331529 1 orchestrator.go:546] Pod project-100/curl-67f84bd857-h92wb can't be scheduled on MachineSet/openshift-machine-api/test-12345-batch-amd64-us-east-2c, predicate checking error: node(s) didn't match Pod's node affinity/selector; predicateName=NodeAffinity; reasons: node(s) didn't match Pod's node affinity/selector; debugInfo=
I0513 11:15:09.331684 1 orchestrator.go:157] No pod can fit to MachineSet/openshift-machine-api/test-12345-batch-amd64-us-east-2c
E0513 11:15:09.332076 1 orchestrator.go:507] Failed to get autoscaling options for node group MachineSet/openshift-machine-api/test-12345-batch-arm64-us-east-2c: Not implemented
I0513 11:15:09.332100 1 orchestrator.go:185] Best option to resize: MachineSet/openshift-machine-api/test-12345-batch-arm64-us-east-2c
I0513 11:15:09.332110 1 orchestrator.go:189] Estimated 1 nodes needed in MachineSet/openshift-machine-api/test-12345-batch-arm64-us-east-2c
I0513 11:15:09.332135 1 orchestrator.go:295] Final scale-up plan: [{MachineSet/openshift-machine-api/test-12345-batch-arm64-us-east-2c 0->1 (max: 12)}]
The same events are reported in must-gathers reviewed from customers.
Given that we have https://github.com/kubernetes/autoscaler/issues/6037 and https://github.com/kubernetes/autoscaler/issues/6676 that appear to be solved via https://github.com/kubernetes/autoscaler/pull/6677 and https://github.com/kubernetes/autoscaler/pull/6038, I'm wondering whether we should pull in those changes, as they seem to eventually impact automated scaling of OpenShift Container Platform 4 - Node(s).
Version-Release number of selected component (if applicable):
OpenShift Container Platform 4.15
How reproducible:
Always
Steps to Reproduce:
1. Setup OpenShift Container Platform 4 with ClusterAutoscaler configured
2. Trigger scaling activity and verify the cluster-autoscaler-default logs
Actual results:
Logs like the below are being reported.
E0513 11:15:09.331353 1 orchestrator.go:450] Couldn't get autoscaling options for ng: MachineSet/openshift-machine-api/test-12345-batch-amd64-us-east-2c
E0513 11:15:09.331365 1 orchestrator.go:450] Couldn't get autoscaling options for ng: MachineSet/openshift-machine-api/test-12345-batch-arm64-us-east-2c
I0513 11:15:09.331529 1 orchestrator.go:546] Pod project-100/curl-67f84bd857-h92wb can't be scheduled on MachineSet/openshift-machine-api/test-12345-batch-amd64-us-east-2c, predicate checking error: node(s) didn't match Pod's node affinity/selector; predicateName=NodeAffinity; reasons: node(s) didn't match Pod's node affinity/selector; debugInfo=
I0513 11:15:09.331684 1 orchestrator.go:157] No pod can fit to MachineSet/openshift-machine-api/test-12345-batch-amd64-us-east-2c
E0513 11:15:09.332076 1 orchestrator.go:507] Failed to get autoscaling options for node group MachineSet/openshift-machine-api/test-12345-batch-arm64-us-east-2c: Not implemented
I0513 11:15:09.332100 1 orchestrator.go:185] Best option to resize: MachineSet/openshift-machine-api/test-12345-batch-arm64-us-east-2c
I0513 11:15:09.332110 1 orchestrator.go:189] Estimated 1 nodes needed in MachineSet/openshift-machine-api/test-12345-batch-arm64-us-east-2c
I0513 11:15:09.332135 1 orchestrator.go:295] Final scale-up plan: [{MachineSet/openshift-machine-api/test-12345-batch-arm64-us-east-2c 0->1 (max: 12)}]
Expected results:
Scale-up of OpenShift Container Platform 4 - Node to happen without errors being reported
I0513 11:15:09.331529 1 orchestrator.go:546] Pod project-100/curl-67f84bd857-h92wb can't be scheduled on MachineSet/openshift-machine-api/test-12345-batch-amd64-us-east-2c, predicate checking error: node(s) didn't match Pod's node affinity/selector; predicateName=NodeAffinity; reasons: node(s) didn't match Pod's node affinity/selector; debugInfo=
I0513 11:15:09.331684 1 orchestrator.go:157] No pod can fit to MachineSet/openshift-machine-api/test-12345-batch-amd64-us-east-2c
I0513 11:15:09.332100 1 orchestrator.go:185] Best option to resize: MachineSet/openshift-machine-api/test-12345-batch-arm64-us-east-2c
I0513 11:15:09.332110 1 orchestrator.go:189] Estimated 1 nodes needed in MachineSet/openshift-machine-api/test-12345-batch-arm64-us-east-2c
I0513 11:15:09.332135 1 orchestrator.go:295] Final scale-up plan: [{MachineSet/openshift-machine-api/test-12345-batch-arm64-us-east-2c 0->1 (max: 12)}]
Additional info:
Please review https://github.com/kubernetes/autoscaler/issues/6037 and https://github.com/kubernetes/autoscaler/issues/6676 as they seem to document the problem and also have a solution linked/merged
Description of problem:
New spot VMs fail to be created by machinesets defining providerSpec.value.spotVMOptions in Azure regions without Availability Zones. The azure-controller logs the error: "Azure Spot Virtual Machine is not supported in Availability Set." A new availabilitySet is created for each machineset in non-zonal regions, but this only works with normal nodes. Spot VMs and availabilitySets are incompatible as per the Microsoft docs for this error: "You need to choose to either use an Azure Spot Virtual Machine or use a VM in an availability set, you can't choose both." From: https://learn.microsoft.com/en-us/azure/virtual-machines/error-codes-spot
Version-Release number of selected component (if applicable):
n/a
How reproducible:
Always
Steps to Reproduce:
1. Follow the instructions to create a machineset to provision spot VMs: https://docs.openshift.com/container-platform/4.12/machine_management/creating_machinesets/creating-machineset-azure.html#machineset-creating-non-guaranteed-instance_creating-machineset-azure
2. New machines will be in Failed state:
$ oc get machines -A
NAMESPACE               NAME                                            PHASE    TYPE   REGION   ZONE   AGE
openshift-machine-api   mabad-test-l5x58-worker-southindia-spot-c4qr5   Failed                          7m17s
openshift-machine-api   mabad-test-l5x58-worker-southindia-spot-dtzsn   Failed                          7m17s
openshift-machine-api   mabad-test-l5x58-worker-southindia-spot-tzrhw   Failed                          7m28s
3. Events in the failed machines show errors creating spot VMs with availabilitySets:
Events:
  Type     Reason        Age   From              Message
  ----     ------        ----  ----              -------
  Warning  FailedCreate  28s   azure-controller  InvalidConfiguration: failed to reconcile machine "mabad-test-l5x58-worker-southindia-spot-dx78z": failed to create vm mabad-test-l5x58-worker-southindia-spot-dx78z: failure sending request for machine mabad-test-l5x58-worker-southindia-spot-dx78z: cannot create vm: compute.VirtualMachinesClient#CreateOrUpdate: Failure sending request: StatusCode=0 -- Original Error: autorest/azure: Service returned an error. Status=<nil> Code="OperationNotAllowed" Message="Azure Spot Virtual Machine is not supported in Availability Set. For more information, see http://aka.ms/AzureSpot/errormessages."
Actual results:
Machines stay in Failed state and nodes are not created
Expected results:
Machines get created and new spot VM nodes added to the cluster.
Additional info:
This problem was identified from a customer alert in an ARO cluster. ICM for ref (requires b- MSFT account): https://portal.microsofticm.com/imp/v3/incidents/incident/455463992/summary
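For reference, a minimal sketch of the providerSpec fields involved; values are placeholders and the commented availability set field name is an assumption about what the controller attaches, not verified against the provider API:
~~~
# Hypothetical MachineSet excerpt for a non-zonal Azure region.
# spotVMOptions requests a Spot VM; in non-zonal regions the controller also
# attaches an availability set, which Azure rejects for Spot VMs.
apiVersion: machine.openshift.io/v1beta1
kind: MachineSet
spec:
  template:
    spec:
      providerSpec:
        value:
          location: southindia     # example non-zonal region
          zone: ""                 # no availability zone
          spotVMOptions: {}        # request a Spot VM
          # availabilitySet: <generated per machineset>  # incompatible with Spot
~~~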
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
When extracting `oc` and `openshift-install` from a release payload, the warnings below are shown, which might be confusing for the user. To make this clear, please update the warning to add the image names to the kubectl version mismatch message in addition to the version list.
Version-Release number of selected component (if applicable):
Always
How reproducible:
Always
Steps to Reproduce:
1. Run command to extract oc & openshift-install using `oc adm extract`
2. Run oc adm release info --commits <payload>
3.
Actual results:
$ oc adm release info --commits registry.ci.openshift.org/ocp/release:4.16.0-0.ci-2024-03-05-032119
warning: multiple versions reported for the kubectl: 1.29.1,1.28.2,1.29.0
Expected results:
Show the image names that need a Kubernetes bump along with the kubectl version list.
Additional info:
Thread here: https://redhat-internal.slack.com/archives/GK58XC2G2/p1709565188855519
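For reference, a rough sketch of the commands behind the steps above, using the payload from this report (flags are the standard `oc adm release extract` / `oc adm release info` options):
~~~
# Extract the clients from a release payload (step 1)
oc adm release extract --command=oc --to=./clients registry.ci.openshift.org/ocp/release:4.16.0-0.ci-2024-03-05-032119
oc adm release extract --command=openshift-install --to=./clients registry.ci.openshift.org/ocp/release:4.16.0-0.ci-2024-03-05-032119

# Inspect the payload; this is where the kubectl version mismatch warning is printed (step 2)
oc adm release info --commits registry.ci.openshift.org/ocp/release:4.16.0-0.ci-2024-03-05-032119
~~~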
This is a clone of issue OCPBUGS-39573. The following is the description of the original issue:
—
Description of problem:
Enabling the topology tests in CI
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
This is a clone of issue OCPBUGS-34037. The following is the description of the original issue:
—
Open Github Security Advisory for: containers/image
https://github.com/advisories/GHSA-6wvf-f2vw-3425
The ARO SRE team became aware of this advisory against our installer fork. Upstream installer is also pinning a vulnerable version of containerd.
The advisory recommends updating to version 5.30.1.
Description of problem:
New hypershift scheduler is not replacing '.' with ',' in subnet label values, resulting in invalid subnet annotations for load balancer services.
Version-Release number of selected component (if applicable):
4.16.0
How reproducible:
Always
Steps to Reproduce:
1. Using new hypershift request serving node scheduler, create a HostedCluster.
2. Use nodes that are labeled with subnets separated by periods instead of commas.
Actual results:
HostedCluster fails to roll out because router services are not deployed.
Expected results:
HostedCluster provisions successfully.
Additional info:
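A minimal sketch of the expected translation from node label values back to the service annotation; the helper name and example values are illustrative, not the scheduler's actual code:
~~~
// Hypothetical helper: node labels cannot contain commas, so subnets are
// stored with "." as the separator; the load balancer service annotation
// expects a comma-separated list, so the scheduler must translate it back.
package main

import (
	"fmt"
	"strings"
)

func subnetsAnnotationFromLabel(labelValue string) string {
	// e.g. "subnet-0abc.subnet-1def" -> "subnet-0abc,subnet-1def"
	return strings.ReplaceAll(labelValue, ".", ",")
}

func main() {
	fmt.Println(subnetsAnnotationFromLabel("subnet-0abc.subnet-1def"))
}
~~~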
Build02, a years old cluster currently running 4.15.0-ec.2 with TechPreviewNoUpgrade, has been Available=False for days:
$ oc get -o json clusteroperator monitoring | jq '.status.conditions[] | select(.type == "Available")' { "lastTransitionTime": "2024-01-14T04:09:52Z", "message": "UpdatingMetricsServer: reconciling MetricsServer Deployment failed: updating Deployment object failed: waiting for DeploymentRollout of openshift-monitoring/metrics-server: context deadline exceeded", "reason": "UpdatingMetricsServerFailed", "status": "False", "type": "Available" }
Both pods had been having CA trust issues. We deleted one pod, and it's replacement is happy:
$ oc -n openshift-monitoring get -l app.kubernetes.io/component=metrics-server pods NAME READY STATUS RESTARTS AGE metrics-server-9cc8bfd56-dd5tx 1/1 Running 0 136m metrics-server-9cc8bfd56-k2lpv 0/1 Running 0 36d
The young, happy pod has occasional node-removed noise, which is expected in this cluster with high levels of compute-node autoscaling:
$ oc -n openshift-monitoring logs --tail 3 metrics-server-9cc8bfd56-dd5tx E0117 17:16:13.492646 1 scraper.go:140] "Failed to scrape node" err="Get \"https://10.0.32.33:10250/metrics/resource\": dial tcp 10.0.32.33:10250: connect: connection refused" node="build0-gstfj-ci-builds-worker-b-srjk5" E0117 17:16:28.611052 1 scraper.go:140] "Failed to scrape node" err="Get \"https://10.0.32.33:10250/metrics/resource\": dial tcp 10.0.32.33:10250: connect: connection refused" node="build0-gstfj-ci-builds-worker-b-srjk5" E0117 17:16:56.898453 1 scraper.go:140] "Failed to scrape node" err="Get \"https://10.0.32.33:10250/metrics/resource\": context deadline exceeded" node="build0-gstfj-ci-builds-worker-b-srjk5"
While the old, sad pod is complaining about unknown authorities:
$ oc -n openshift-monitoring logs --tail 3 metrics-server-9cc8bfd56-k2lpv E0117 17:19:09.612161 1 scraper.go:140] "Failed to scrape node" err="Get \"https://10.0.0.3:10250/metrics/resource\": tls: failed to verify certificate: x509: certificate signed by unknown authority" node="build0-gstfj-m-2.c.openshift-ci-build-farm.internal" E0117 17:19:09.620872 1 scraper.go:140] "Failed to scrape node" err="Get \"https://10.0.32.90:10250/metrics/resource\": tls: failed to verify certificate: x509: certificate signed by unknown authority" node="build0-gstfj-ci-prowjobs-worker-b-cg7qd" I0117 17:19:14.538837 1 server.go:187] "Failed probe" probe="metric-storage-ready" err="no metrics to serve"
More details in the Additional details section, but the timeline seems to have been something like:
So addressing the metrics-server /etc/tls/kubelet-serving-ca-bundle/ca-bundle.crt change detection should resolve this use-case. And triggering a container or pod restart would be an aggressive-but-sufficient mechanism, although loading the new data without rolling the process would be less invasive.
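A minimal sketch of the less-invasive option, assuming a file watch on the projected ConfigMap path mentioned above; the reload hook and wiring are hypothetical, not the monitoring operator's actual implementation:
~~~
// Hypothetical watcher: reload the kubelet-serving CA bundle when the
// projected ConfigMap volume changes, instead of restarting the pod.
package main

import (
	"log"

	"github.com/fsnotify/fsnotify"
)

const caBundlePath = "/etc/tls/kubelet-serving-ca-bundle/ca-bundle.crt"

func watchCABundle(reload func(path string) error) error {
	w, err := fsnotify.NewWatcher()
	if err != nil {
		return err
	}
	defer w.Close()
	// Watch the directory: ConfigMap volumes update via symlink swaps,
	// so events fire on the directory rather than the file itself.
	if err := w.Add("/etc/tls/kubelet-serving-ca-bundle"); err != nil {
		return err
	}
	for {
		select {
		case ev := <-w.Events:
			log.Printf("CA bundle event: %v", ev)
			if err := reload(caBundlePath); err != nil {
				log.Printf("reload failed: %v", err)
			}
		case werr := <-w.Errors:
			log.Printf("watch error: %v", werr)
		}
	}
}

func main() {
	_ = watchCABundle(func(path string) error {
		log.Printf("re-reading %s and rebuilding the scraper TLS config", path)
		return nil
	})
}
~~~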
4.15.0-ec.3, which has fast CA rotation, see discussion in API-1687.
Unclear.
Unclear.
metrics-server pods having trouble with CA trust when attempting to scrape nodes.
metrics-server pods successfully trusting kubelets when scraping nodes.
The monitoring operator sets up the metrics server with --kubelet-certificate-authority=/etc/tls/kubelet-serving-ca-bundle/ca-bundle.crt, which is the "Path to the CA to use to validate the Kubelet's serving certificates" and is mounted from the kubelet-serving-ca-bundle ConfigMap. But that mount point only contains openshift-kube-controller-manager-operator_csr-signer-signer@... CAs:
$ oc --as system:admin -n openshift-monitoring debug pod/metrics-server-9cc8bfd56-k2lpv -- cat /etc/tls/kubelet-serving-ca-bundle/ca-bundle.crt | while openssl x509 -noout -text; do :; done | grep '^Certificate:\|Issuer\|Subject:\|Not ' Starting pod/metrics-server-9cc8bfd56-k2lpv-debug-gtctn ... Removing debug pod ... Certificate: Issuer: CN = openshift-kube-controller-manager-operator_csr-signer-signer@1701614554 Not Before: Dec 3 14:42:33 2023 GMT Not After : Feb 1 14:42:34 2024 GMT Subject: CN = openshift-kube-controller-manager-operator_csr-signer-signer@1701614554 Certificate: Issuer: CN = openshift-kube-controller-manager-operator_csr-signer-signer@1701614554 Not Before: Dec 20 03:16:35 2023 GMT Not After : Jan 19 03:16:36 2024 GMT Subject: CN = kube-csr-signer_@1703042196 Certificate: Issuer: CN = openshift-kube-controller-manager-operator_csr-signer-signer@1704206554 Not Before: Jan 4 03:16:35 2024 GMT Not After : Feb 3 03:16:36 2024 GMT Subject: CN = kube-csr-signer_@1704338196 Certificate: Issuer: CN = openshift-kube-controller-manager-operator_csr-signer-signer@1704206554 Not Before: Jan 2 14:42:34 2024 GMT Not After : Mar 2 14:42:35 2024 GMT Subject: CN = openshift-kube-controller-manager-operator_csr-signer-signer@1704206554 unable to load certificate 137730753918272:error:0909006C:PEM routines:get_name:no start line:../crypto/pem/pem_lib.c:745:Expecting: TRUSTED CERTIFICATE
While actual kubelets seem to be using certs signed by kube-csr-signer_@1704338196 (which is one of the Subjects in /etc/tls/kubelet-serving-ca-bundle/ca-bundle.crt):
$ oc get -o wide -l node-role.kubernetes.io/master= nodes NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME build0-gstfj-m-0.c.openshift-ci-build-farm.internal Ready master 3y240d v1.28.3+20a5764 10.0.0.4 <none> Red Hat Enterprise Linux CoreOS 415.92.202311271112-0 (Plow) 5.14.0-284.41.1.el9_2.x86_64 cri-o://1.28.2-2.rhaos4.15.gite7be4e1.el9 build0-gstfj-m-1.c.openshift-ci-build-farm.internal Ready master 3y240d v1.28.3+20a5764 10.0.0.5 <none> Red Hat Enterprise Linux CoreOS 415.92.202311271112-0 (Plow) 5.14.0-284.41.1.el9_2.x86_64 cri-o://1.28.2-2.rhaos4.15.gite7be4e1.el9 build0-gstfj-m-2.c.openshift-ci-build-farm.internal Ready master 3y240d v1.28.3+20a5764 10.0.0.3 <none> Red Hat Enterprise Linux CoreOS 415.92.202311271112-0 (Plow) 5.14.0-284.41.1.el9_2.x86_64 cri-o://1.28.2-2.rhaos4.15.gite7be4e1.el9 $ oc --as system:admin -n openshift-monitoring debug pod/metrics-server-9cc8bfd56-k2lpv -- openssl s_client -connect 10.0.0.3:10250 -showcerts </dev/null Starting pod/metrics-server-9cc8bfd56-k2lpv-debug-ksl2k ... Can't use SSL_get_servername depth=0 O = system:nodes, CN = system:node:build0-gstfj-m-2.c.openshift-ci-build-farm.internal verify error:num=20:unable to get local issuer certificate verify return:1 depth=0 O = system:nodes, CN = system:node:build0-gstfj-m-2.c.openshift-ci-build-farm.internal verify error:num=21:unable to verify the first certificate verify return:1 depth=0 O = system:nodes, CN = system:node:build0-gstfj-m-2.c.openshift-ci-build-farm.internal verify return:1 CONNECTED(00000003) --- Certificate chain 0 s:O = system:nodes, CN = system:node:build0-gstfj-m-2.c.openshift-ci-build-farm.internal i:CN = kube-csr-signer_@1704338196 -----BEGIN CERTIFICATE----- MIIC5DCCAcygAwIBAgIQAbKVl+GS6s2H20EHAWl4WzANBgkqhkiG9w0BAQsFADAm MSQwIgYDVQQDDBtrdWJlLWNzci1zaWduZXJfQDE3MDQzMzgxOTYwHhcNMjQwMTE3 MDMxNDMwWhcNMjQwMjAzMDMxNjM2WjBhMRUwEwYDVQQKEwxzeXN0ZW06bm9kZXMx SDBGBgNVBAMTP3N5c3RlbTpub2RlOmJ1aWxkMC1nc3Rmai1tLTIuYy5vcGVuc2hp ZnQtY2ktYnVpbGQtZmFybS5pbnRlcm5hbDBZMBMGByqGSM49AgEGCCqGSM49AwEH A0IABFqT+UgohFAxJrGYQUeYsEhNB+ufFo14xYDedKBCeNzMhaC+5/I4UN1e1u2X PH7J4ncmH+M/LXI7v+YfEIG7cH+jgZ0wgZowDgYDVR0PAQH/BAQDAgeAMBMGA1Ud JQQMMAoGCCsGAQUFBwMBMAwGA1UdEwEB/wQCMAAwHwYDVR0jBBgwFoAU394ABuS2 9i0qss9AKk/mQ9lhJ88wRAYDVR0RBD0wO4IzYnVpbGQwLWdzdGZqLW0tMi5jLm9w ZW5zaGlmdC1jaS1idWlsZC1mYXJtLmludGVybmFshwQKAAADMA0GCSqGSIb3DQEB CwUAA4IBAQCiKelqlgK0OHFqDPdIR+RRdjXoCfFDa0JGCG0z60LYJV6Of5EPv0F/ vGZdM/TyGnPT80lnLCh2JGUvneWlzQEZ7LEOgXX8OrAobijiFqDZFlvVwvkwWNON rfucLQWDFLHUf/yY0EfB0ZlM8Sz4XE8PYB6BXYvgmUIXS1qkV9eGWa6RPLsOnkkb q/dTLE/tg8cz24IooDC8lmMt/wCBPgsq9AnORgNdZUdjCdh9DpDWCw0E4csSxlx2 H1qlH5TpTGKS8Ox9JAfdAU05p/mEhY9PEPSMfdvBZep1xazrZyQIN9ckR2+11Syw JlbEJmapdSjIzuuKBakqHkDgoq4XN0KM -----END CERTIFICATE----- --- Server certificate subject=O = system:nodes, CN = system:node:build0-gstfj-m-2.c.openshift-ci-build-farm.internal issuer=CN = kube-csr-signer_@1704338196 --- Acceptable client certificate CA names OU = openshift, CN = admin-kubeconfig-signer CN = openshift-kube-controller-manager-operator_csr-signer-signer@1699022534 CN = kube-csr-signer_@1700450189 CN = kube-csr-signer_@1701746196 CN = openshift-kube-controller-manager-operator_csr-signer-signer@1701614554 CN = openshift-kube-apiserver-operator_kube-apiserver-to-kubelet-signer@1691004449 CN = openshift-kube-apiserver-operator_kube-control-plane-signer@1702234292 CN = openshift-kube-apiserver-operator_kube-control-plane-signer@1699642292 OU = openshift, CN = kubelet-bootstrap-kubeconfig-signer CN = 
openshift-kube-apiserver-operator_node-system-admin-signer@1678905372 Requested Signature Algorithms: RSA-PSS+SHA256:ECDSA+SHA256:Ed25519:RSA-PSS+SHA384:RSA-PSS+SHA512:RSA+SHA256:RSA+SHA384:RSA+SHA512:ECDSA+SHA384:ECDSA+SHA512:RSA+SHA1:ECDSA+SHA1 Shared Requested Signature Algorithms: RSA-PSS+SHA256:ECDSA+SHA256:Ed25519:RSA-PSS+SHA384:RSA-PSS+SHA512:RSA+SHA256:RSA+SHA384:RSA+SHA512:ECDSA+SHA384:ECDSA+SHA512 Peer signing digest: SHA256 Peer signature type: ECDSA Server Temp Key: X25519, 253 bits --- SSL handshake has read 1902 bytes and written 383 bytes Verification error: unable to verify the first certificate --- New, TLSv1.3, Cipher is TLS_AES_128_GCM_SHA256 Server public key is 256 bit Secure Renegotiation IS NOT supported Compression: NONE Expansion: NONE No ALPN negotiated Early data was not sent Verify return code: 21 (unable to verify the first certificate) --- DONE Removing debug pod ... $ openssl x509 -noout -text <<EOF 2>/dev/null > -----BEGIN CERTIFICATE----- MIIC5DCCAcygAwIBAgIQAbKVl+GS6s2H20EHAWl4WzANBgkqhkiG9w0BAQsFADAm MSQwIgYDVQQDDBtrdWJlLWNzci1zaWduZXJfQDE3MDQzMzgxOTYwHhcNMjQwMTE3 MDMxNDMwWhcNMjQwMjAzMDMxNjM2WjBhMRUwEwYDVQQKEwxzeXN0ZW06bm9kZXMx SDBGBgNVBAMTP3N5c3RlbTpub2RlOmJ1aWxkMC1nc3Rmai1tLTIuYy5vcGVuc2hp ZnQtY2ktYnVpbGQtZmFybS5pbnRlcm5hbDBZMBMGByqGSM49AgEGCCqGSM49AwEH A0IABFqT+UgohFAxJrGYQUeYsEhNB+ufFo14xYDedKBCeNzMhaC+5/I4UN1e1u2X PH7J4ncmH+M/LXI7v+YfEIG7cH+jgZ0wgZowDgYDVR0PAQH/BAQDAgeAMBMGA1Ud JQQMMAoGCCsGAQUFBwMBMAwGA1UdEwEB/wQCMAAwHwYDVR0jBBgwFoAU394ABuS2 9i0qss9AKk/mQ9lhJ88wRAYDVR0RBD0wO4IzYnVpbGQwLWdzdGZqLW0tMi5jLm9w ZW5zaGlmdC1jaS1idWlsZC1mYXJtLmludGVybmFshwQKAAADMA0GCSqGSIb3DQEB CwUAA4IBAQCiKelqlgK0OHFqDPdIR+RRdjXoCfFDa0JGCG0z60LYJV6Of5EPv0F/ vGZdM/TyGnPT80lnLCh2JGUvneWlzQEZ7LEOgXX8OrAobijiFqDZFlvVwvkwWNON rfucLQWDFLHUf/yY0EfB0ZlM8Sz4XE8PYB6BXYvgmUIXS1qkV9eGWa6RPLsOnkkb q/dTLE/tg8cz24IooDC8lmMt/wCBPgsq9AnORgNdZUdjCdh9DpDWCw0E4csSxlx2 H1qlH5TpTGKS8Ox9JAfdAU05p/mEhY9PEPSMfdvBZep1xazrZyQIN9ckR2+11Syw JlbEJmapdSjIzuuKBakqHkDgoq4XN0KM -----END CERTIFICATE----- > EOF ... Issuer: CN = kube-csr-signer_@1704338196 Validity Not Before: Jan 17 03:14:30 2024 GMT Not After : Feb 3 03:16:36 2024 GMT Subject: O = system:nodes, CN = system:node:build0-gstfj-m-2.c.openshift-ci-build-farm.internal ...
The monitoring operator populates the openshift-monitoring kubelet-serving-ca-bundle ConfigMap using data from the openshift-config-managed kubelet-serving-ca ConfigMap, and that propagation is working, but does not contain the kube-csr-signer_ CA:
$ oc -n openshift-config-managed get -o json configmap kubelet-serving-ca | jq -r '.data["ca-bundle.crt"]' | while openssl x509 -noout -text; do :; done | grep '^Certificate:\|Issuer\|Subject:\|Not ' Certificate: Issuer: CN = openshift-kube-controller-manager-operator_csr-signer-signer@1701614554 Not Before: Dec 3 14:42:33 2023 GMT Not After : Feb 1 14:42:34 2024 GMT Subject: CN = openshift-kube-controller-manager-operator_csr-signer-signer@1701614554 Certificate: Issuer: CN = openshift-kube-controller-manager-operator_csr-signer-signer@1701614554 Not Before: Dec 20 03:16:35 2023 GMT Not After : Jan 19 03:16:36 2024 GMT Subject: CN = kube-csr-signer_@1703042196 Certificate: Issuer: CN = openshift-kube-controller-manager-operator_csr-signer-signer@1704206554 Not Before: Jan 4 03:16:35 2024 GMT Not After : Feb 3 03:16:36 2024 GMT Subject: CN = kube-csr-signer_@1704338196 Certificate: Issuer: CN = openshift-kube-controller-manager-operator_csr-signer-signer@1704206554 Not Before: Jan 2 14:42:34 2024 GMT Not After : Mar 2 14:42:35 2024 GMT Subject: CN = openshift-kube-controller-manager-operator_csr-signer-signer@1704206554 unable to load certificate 140531510617408:error:0909006C:PEM routines:get_name:no start line:../crypto/pem/pem_lib.c:745:Expecting: TRUSTED CERTIFICATE $ oc -n openshift-config-managed get -o json configmap kubelet-serving-ca | jq -r '.data["ca-bundle.crt"]' | sha1sum a32ab44dff8030c548087d70fea599b0d3fab8af - $ oc -n openshift-monitoring get -o json configmap kubelet-serving-ca-bundle | jq -r '.data["ca-bundle.crt"]' | sha1sum a32ab44dff8030c548087d70fea599b0d3fab8af -
Flipping over to the kubelet side, nothing in the machine-config operator's template is jumping out at me as a key/cert pair for serving on 10250. The kubelet seems to set up server certs via serverTLSBootstrap: true. But we don't seem to set the beta RotateKubeletServerCertificate, so I'm not clear on how these are supposed to rotate on the kubelet side. But there are CSRs from kubelets requesting serving certs:
$ oc get certificatesigningrequests | grep 'NAME\|kubelet-serving' NAME AGE SIGNERNAME REQUESTOR REQUESTEDDURATION CONDITION csr-8stgd 51m kubernetes.io/kubelet-serving system:node:build0-gstfj-ci-builds-worker-b-xkdw2 <none> Approved,Issued csr-blbjx 9m1s kubernetes.io/kubelet-serving system:node:build0-gstfj-ci-longtests-worker-b-5w9dz <none> Approved,Issued csr-ghxh5 64m kubernetes.io/kubelet-serving system:node:build0-gstfj-ci-builds-worker-b-sdwdn <none> Approved,Issued csr-hng85 33m kubernetes.io/kubelet-serving system:node:build0-gstfj-ci-longtests-worker-d-7d7h2 <none> Approved,Issued csr-hvqxz 24m kubernetes.io/kubelet-serving system:node:build0-gstfj-ci-builds-worker-b-fp6wb <none> Approved,Issued csr-vc52m 50m kubernetes.io/kubelet-serving system:node:build0-gstfj-ci-builds-worker-b-xlmt6 <none> Approved,Issued csr-vflcm 40m kubernetes.io/kubelet-serving system:node:build0-gstfj-ci-builds-worker-b-djpgq <none> Approved,Issued csr-xfr7d 51m kubernetes.io/kubelet-serving system:node:build0-gstfj-ci-builds-worker-b-8v4vk <none> Approved,Issued csr-zhzbs 51m kubernetes.io/kubelet-serving system:node:build0-gstfj-ci-builds-worker-b-rqr68 <none> Approved,Issued $ oc get -o json certificatesigningrequests csr-blbjx { "apiVersion": "certificates.k8s.io/v1", "kind": "CertificateSigningRequest", "metadata": { "creationTimestamp": "2024-01-17T19:20:43Z", "generateName": "csr-", "name": "csr-blbjx", "resourceVersion": "4719586144", "uid": "5f12d236-3472-485f-8037-3896f51a809c" }, "spec": { "groups": [ "system:nodes", "system:authenticated" ], "request": "LS0tLS1CRUdJTiBDRVJUSUZJQ0FURSBSRVFVRVNULS0tLS0KTUlJQlh6Q0NBUVFDQVFBd1ZqRVZNQk1HQTFVRUNoTU1jM2x6ZEdWdE9tNXZaR1Z6TVQwd093WURWUVFERXpSegplWE4wWlcwNmJtOWtaVHBpZFdsc1pEQXRaM04wWm1vdFkya3RiRzl1WjNSbGMzUnpMWGR2Y210bGNpMWlMVFYzCk9XUjZNRmt3RXdZSEtvWkl6ajBDQVFZSUtvWkl6ajBEQVFjRFFnQUV5Y0dhSDMvZ3F4ZHNZWkdmQXovTEpoZVgKd1o0Z1VRbjB6TlZUenJncHpvd1VPOGR6NTN4UUZTOTRibm40NldlZFg3Q2xidUpVSUpUN2pCblV1WEdnZktCTQpNRW9HQ1NxR1NJYjNEUUVKRGpFOU1Ec3dPUVlEVlIwUkJESXdNSUlvWW5WcGJHUXdMV2R6ZEdacUxXTnBMV3h2CmJtZDBaWE4wY3kxM2IzSnJaWEl0WWkwMWR6bGtlb2NFQ2dBZ0F6QUtCZ2dxaGtqT1BRUURBZ05KQURCR0FpRUEKMHlRVzZQOGtkeWw5ZEEzM3ppQTJjYXVJdlhidTVhczNXcUZLYWN2bi9NSUNJUURycEQyVEtScHJOU1I5dExKTQpjZ0ZpajN1dVNieVJBcEJ5NEE1QldEZm02UT09Ci0tLS0tRU5EIENFUlRJRklDQVRFIFJFUVVFU1QtLS0tLQo=", "signerName": "kubernetes.io/kubelet-serving", "usages": [ "digital signature", "server auth" ], "username": "system:node:build0-gstfj-ci-longtests-worker-b-5w9dz" }, "status": { "certificate": 
"LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSUN6ekNDQWJlZ0F3SUJBZ0lSQUlGZ1NUd0ovVUJLaE1hWlE4V01KcEl3RFFZSktvWklodmNOQVFFTEJRQXcKSmpFa01DSUdBMVVFQXd3YmEzVmlaUzFqYzNJdGMybG5ibVZ5WDBBeE56QTBNek00TVRrMk1CNFhEVEkwTURFeApOekU1TVRVME0xb1hEVEkwTURJd016QXpNVFl6Tmxvd1ZqRVZNQk1HQTFVRUNoTU1jM2x6ZEdWdE9tNXZaR1Z6Ck1UMHdPd1lEVlFRREV6UnplWE4wWlcwNmJtOWtaVHBpZFdsc1pEQXRaM04wWm1vdFkya3RiRzl1WjNSbGMzUnoKTFhkdmNtdGxjaTFpTFRWM09XUjZNRmt3RXdZSEtvWkl6ajBDQVFZSUtvWkl6ajBEQVFjRFFnQUV5Y0dhSDMvZwpxeGRzWVpHZkF6L0xKaGVYd1o0Z1VRbjB6TlZUenJncHpvd1VPOGR6NTN4UUZTOTRibm40NldlZFg3Q2xidUpVCklKVDdqQm5VdVhHZ2ZLT0JrakNCanpBT0JnTlZIUThCQWY4RUJBTUNCNEF3RXdZRFZSMGxCQXd3Q2dZSUt3WUIKQlFVSEF3RXdEQVlEVlIwVEFRSC9CQUl3QURBZkJnTlZIU01FR0RBV2dCVGYzZ0FHNUxiMkxTcXl6MEFxVCtaRAoyV0VuenpBNUJnTlZIUkVFTWpBd2dpaGlkV2xzWkRBdFozTjBabW90WTJrdGJHOXVaM1JsYzNSekxYZHZjbXRsCmNpMWlMVFYzT1dSNmh3UUtBQ0FETUEwR0NTcUdTSWIzRFFFQkN3VUFBNElCQVFBRE5ad0pMdkp4WWNta2RHV08KUm5ocC9rc3V6akJHQnVHbC9VTmF0RjZScml3eW9mdmpVNW5Kb0RFbGlLeHlDQ2wyL1d5VXl5a2hMSElBK1drOQoxZjRWajIrYmZFd0IwaGpuTndxQThudFFabS90TDhwalZ5ZzFXM0VwR2FvRjNsZzRybDA1cXBwcjVuM2l4WURJClFFY2ZuNmhQUnlKN056dlFCS0RwQ09lbU8yTFllcGhqbWZGY2h5VGRZVGU0aE9IOW9TWTNMdDdwQURIM2kzYzYKK3hpMDhhV09LZmhvT3IybTVBSFBVN0FkTjhpVUV0M0dsYzI0SGRTLzlLT05tT2E5RDBSSk9DMC8zWk5sKzcvNAoyZDlZbnYwaTZNaWI3OGxhNk5scFB0L2hmOWo5TlNnMDN4OFZYRVFtV21zN29xY1FWTHMxRHMvWVJ4VERqZFphCnEwMnIKLS0tLS1FTkQgQ0VSVElGSUNBVEUtLS0tLQo=", "conditions": [ { "lastTransitionTime": "2024-01-17T19:20:43Z", "lastUpdateTime": "2024-01-17T19:20:43Z", "message": "This CSR was approved by the Node CSR Approver (cluster-machine-approver)", "reason": "NodeCSRApprove", "status": "True", "type": "Approved" } ] } } $ oc get -o json certificatesigningrequests csr-blbjx | jq -r '.status.certificate | @base64d' | openssl x509 -noout -text | grep '^Certificate:\|Issuer\|Subject:\|Not ' Certificate: Issuer: CN = kube-csr-signer_@1704338196 Not Before: Jan 17 19:15:43 2024 GMT Not After : Feb 3 03:16:36 2024 GMT Subject: O = system:nodes, CN = system:node:build0-gstfj-ci-longtests-worker-b-5w9dz
So that's approved by cluster-machine-approver, but signerName: kubernetes.io/kubelet-serving is an upstream Kubernetes component documented here, and the signer is implemented by kube-controller-manager.
This is a clone of issue OCPBUGS-36406. The following is the description of the original issue:
—
Seen in a 4.15.19 cluster, the PrometheusOperatorRejectedResources alert was firing, but did not link a runbook, despite the runbook existing since MON-2358.
Seen in 4.15.19, but likely applies to all versions where the PrometheusOperatorRejectedResources alert exists.
Every time.
Check the cluster console at /monitoring/alertrules?rowFilter-alerting-rule-source=platform&name=PrometheusOperatorRejectedResources, and click through to the alert definition.
No mention of runbooks.
A Runbook section linking the runbook.
I haven't dug into the upstream/downstream sync process, but the runbook information likely needs to at least show up here, although that may or may not be the root location for injecting our canonical runbook into the upstream-sourced alert.
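For reference, a minimal sketch of what the fix looks like on the rule side; the runbook URL/path is an assumption and the exact injection point depends on the upstream/downstream sync mentioned above:
~~~
# Hypothetical PrometheusRule excerpt adding the runbook link.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: prometheus-operator-rules
  namespace: openshift-monitoring
spec:
  groups:
  - name: prometheus-operator
    rules:
    - alert: PrometheusOperatorRejectedResources
      annotations:
        description: Prometheus operator rejected {{ $value }} resources.
        # assumed runbook location in openshift/runbooks:
        runbook_url: https://github.com/openshift/runbooks/blob/master/alerts/cluster-monitoring-operator/PrometheusOperatorRejectedResources.md
      expr: min_over_time(prometheus_operator_managed_resources{state="rejected"}[5m]) > 0
      labels:
        severity: warning
~~~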
This is a clone of issue OCPBUGS-34986. The following is the description of the original issue:
—
Description of problem:
A non-existent oauth.config.openshift.io resource is listed on the Global Configuration page
Version-Release number of selected component (if applicable):
4.16.0-0.nightly-2024-06-05-082646
How reproducible:
Always
Steps to Reproduce:
1. visit global configuration page /settings/cluster/globalconfig
2. check listed items on the page
3.
Actual results:
2. There are two OAuth.config.openshift.io entries; one is linking to /k8s/cluster/config.openshift.io~v1~OAuth/oauth-config, which returns 404: Not Found
$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.16.0-0.nightly-2024-06-05-082646   True        False         171m    Cluster version is 4.16.0-0.nightly-2024-06-05-082646
$ oc get oauth.config.openshift.io
NAME      AGE
cluster   3h26m
Expected results:
From the CLI output we can see there is only one oauth.config.openshift.io resource, but the console shows an extra 'oauth-config' entry. Only one oauth.config.openshift.io resource should be listed.
Additional info:
Description of problem:
Failed to create RHCOS image when creating Azure infrastructure
Steps to Reproduce & actual results:
fxie-mac:hypershift fxie$ hypershift create infra azure --name $CLUSTER_NAME --azure-creds $HOME/.azure/osServicePrincipal.json --base-domain $BASE_DOMAIN --infra-id $INFRA_ID --location eastus --output-file $OUTPUT_INFRA_FILE 2024-03-20T14:26:23+08:00 INFO Using credentials from file {"path": "/Users/fxie/.azure/osServicePrincipal.json"} 2024-03-20T14:26:30+08:00 INFO Successfully created resource group {"name": "fxie-hcp-1-fxie-hcp-1-13639"} 2024-03-20T14:26:32+08:00 INFO Successfully created managed identity {"name": "/subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourcegroups/fxie-hcp-1-fxie-hcp-1-13639/providers/Microsoft.ManagedIdentity/userAssignedIdentities/fxie-hcp-1-fxie-hcp-1-13639"} 2024-03-20T14:26:32+08:00 INFO Assigning role to managed identity, this may take some time 2024-03-20T14:26:51+08:00 INFO Successfully assigned contributor role to managed identity {"name": "/subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourcegroups/fxie-hcp-1-fxie-hcp-1-13639/providers/Microsoft.ManagedIdentity/userAssignedIdentities/fxie-hcp-1-fxie-hcp-1-13639"} 2024-03-20T14:26:55+08:00 INFO Successfully created network security group {"name": "fxie-hcp-1-fxie-hcp-1-13639-nsg"} 2024-03-20T14:27:01+08:00 INFO Successfully created vnet {"name": "fxie-hcp-1-fxie-hcp-1-13639"} 2024-03-20T14:27:35+08:00 INFO Successfully created private DNS zone {"name": "fxie-hcp-1-azurecluster.qe.azure.devcluster.openshift.com"} 2024-03-20T14:28:09+08:00 INFO Successfully created private DNS zone link 2024-03-20T14:28:12+08:00 INFO Successfully created public IP address for guest cluster egress load balancer 2024-03-20T14:28:15+08:00 INFO Successfully created guest cluster egress load balancer 2024-03-20T14:28:37+08:00 INFO Successfully created storage account {"name": "clusterzw22c"} 2024-03-20T14:28:38+08:00 INFO Successfully created blob container {"name": "vhd"} 2024-03-20T14:28:38+08:00 ERROR Failed to create infrastructure {"error": "failed to create RHCOS image: the image source url must be from an azure blob storage, otherwise upload will fail with an `One of the request inputs is out of range` error"} github.com/openshift/hypershift/cmd/infra/azure.NewCreateCommand.func2 /Users/fxie/Projects/hypershift/cmd/infra/azure/create.go:114 github.com/spf13/cobra.(*Command).execute /Users/fxie/Projects/hypershift/vendor/github.com/spf13/cobra/command.go:983 github.com/spf13/cobra.(*Command).ExecuteC /Users/fxie/Projects/hypershift/vendor/github.com/spf13/cobra/command.go:1115 github.com/spf13/cobra.(*Command).Execute /Users/fxie/Projects/hypershift/vendor/github.com/spf13/cobra/command.go:1039 github.com/spf13/cobra.(*Command).ExecuteContext /Users/fxie/Projects/hypershift/vendor/github.com/spf13/cobra/command.go:1032 main.main /Users/fxie/Projects/hypershift/main.go:78 runtime.main /usr/local/go/src/runtime/proc.go:267 Error: failed to create RHCOS image: the image source url must be from an azure blob storage, otherwise upload will fail with an `One of the request inputs is out of range` error failed to create RHCOS image: the image source url must be from an azure blob storage, otherwise upload will fail with an `One of the request inputs is out of range` error
Please review the following PR: https://github.com/openshift/cloud-provider-ibm/pull/61
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/csi-external-snapshotter/pull/129
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/aws-pod-identity-webhook/pull/180
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
Found in QE's CI (with the vsphere-agent profile): the storage CO is not available and the vsphere-problem-detector-operator pod is in CrashLoopBackOff with a panic. (Find the must-gather here: https://gcsweb-qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/qe-private-deck/logs/periodic-ci-openshift-openshift-tests-private-release-4.15-amd64-nightly-vsphere-agent-disconnected-ha-f14/1734850632575094784/artifacts/vsphere-agent-disconnected-ha-f14/gather-must-gather/)
The storage CO reports "unable to find VM by UUID":
- lastTransitionTime: "2023-12-13T09:15:27Z"
  message: "VSphereCSIDriverOperatorCRAvailable: VMwareVSphereControllerAvailable: unable to find VM ci-op-782gwsbd-b3d4e-master-2 by UUID \nVSphereProblemDetectorDeploymentControllerAvailable: Waiting for Deployment"
  reason: VSphereCSIDriverOperatorCR_VMwareVSphereController_vcenter_api_error::VSphereProblemDetectorDeploymentController_Deploying
  status: "False"
  type: Available
(But I did not see the "unable to find VM by UUID" message in the vsphere-problem-detector-operator log in the must-gather.)
The vsphere-problem-detector-operator log:
2023-12-13T10:10:56.620216117Z I1213 10:10:56.620159 1 vsphere_check.go:149] Connected to vcenter.devqe.ibmc.devcluster.openshift.com as ci_user_01@devqe.ibmc.devcluster.openshift.com
2023-12-13T10:10:56.625161719Z I1213 10:10:56.625108 1 vsphere_check.go:271] CountVolumeTypes passed
2023-12-13T10:10:56.625291631Z I1213 10:10:56.625258 1 zones.go:124] Checking tags for multi-zone support.
2023-12-13T10:10:56.625449771Z I1213 10:10:56.625433 1 zones.go:202] No FailureDomains configured. Skipping check.
2023-12-13T10:10:56.625497726Z I1213 10:10:56.625487 1 vsphere_check.go:271] CheckZoneTags passed
2023-12-13T10:10:56.625531795Z I1213 10:10:56.625522 1 info.go:44] vCenter version is 8.0.2, apiVersion is 8.0.2.0 and build is 22617221
2023-12-13T10:10:56.625562833Z I1213 10:10:56.625555 1 vsphere_check.go:271] ClusterInfo passed
2023-12-13T10:10:56.625603236Z I1213 10:10:56.625594 1 datastore.go:312] checking datastore /DEVQEdatacenter/datastore/vsanDatastore for permissions
2023-12-13T10:10:56.669205822Z panic: runtime error: invalid memory address or nil pointer dereference
2023-12-13T10:10:56.669338411Z [signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x23096cb]
2023-12-13T10:10:56.669565413Z
2023-12-13T10:10:56.669591144Z goroutine 550 [running]:
2023-12-13T10:10:56.669838383Z github.com/openshift/vsphere-problem-detector/pkg/operator.getVM(0xc0005da6c0, 0xc0002d3b80)
2023-12-13T10:10:56.669991749Z github.com/openshift/vsphere-problem-detector/pkg/operator/vsphere_check.go:319 +0x3eb
2023-12-13T10:10:56.670212441Z github.com/openshift/vsphere-problem-detector/pkg/operator.(*vSphereChecker).enqueueSingleNodeChecks.func1()
2023-12-13T10:10:56.670289644Z github.com/openshift/vsphere-problem-detector/pkg/operator/vsphere_check.go:238 +0x55
2023-12-13T10:10:56.670490453Z github.com/openshift/vsphere-problem-detector/pkg/operator.(*CheckThreadPool).worker.func1(0xc000c88760?, 0x0?)
2023-12-13T10:10:56.670702592Z github.com/openshift/vsphere-problem-detector/pkg/operator/pool.go:40 +0x55
2023-12-13T10:10:56.671142070Z github.com/openshift/vsphere-problem-detector/pkg/operator.(*CheckThreadPool).worker(0xc000c78660, 0xc000c887a0?)
2023-12-13T10:10:56.671331852Z github.com/openshift/vsphere-problem-detector/pkg/operator/pool.go:41 +0xe7 2023-12-13T10:10:56.671529761Z github.com/openshift/vsphere-problem-detector/pkg/operator.NewCheckThreadPool.func1() 2023-12-13T10:10:56.671589925Z github.com/openshift/vsphere-problem-detector/pkg/operator/pool.go:28 +0x25 2023-12-13T10:10:56.671776328Z created by github.com/openshift/vsphere-problem-detector/pkg/operator.NewCheckThreadPool 2023-12-13T10:10:56.671847478Z github.com/openshift/vsphere-problem-detector/pkg/operator/pool.go:27 +0x73
Version-Release number of selected component (if applicable):
4.15.0-0.nightly-2023-12-11-033133
How reproducible:
Steps to Reproduce:
1. See description 2. 3.
Actual results:
vpd is panic
Expected results:
vpd should not panic
Additional info:
I guess it is a privileges issue, but our pod should not panic.
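A minimal, self-contained sketch of the kind of defensive check that avoids the nil dereference in getVM; the function shape and error text are assumptions for illustration, the real fix lives in vsphere_check.go:
~~~
// Hypothetical guard (illustration only): fail the check with an error
// instead of panicking when the VM lookup returns nothing, e.g. because
// the vCenter user lacks privileges to see the VM.
package main

import (
	"errors"
	"fmt"
)

type virtualMachine struct{ name string }

// findVMByUUID stands in for the real vSphere lookup.
func findVMByUUID(uuid string) (*virtualMachine, error) {
	return nil, nil // simulates "not found" without an error
}

func checkNodeVM(nodeName, uuid string) error {
	vm, err := findVMByUUID(uuid)
	if err != nil {
		return fmt.Errorf("unable to find VM %s by UUID: %w", nodeName, err)
	}
	if vm == nil {
		// previously this case fell through and dereferenced vm -> panic
		return errors.New("unable to find VM " + nodeName + " by UUID")
	}
	fmt.Println("found VM", vm.name)
	return nil
}

func main() {
	if err := checkNodeVM("ci-op-master-2", "42-example-uuid"); err != nil {
		fmt.Println("check failed:", err)
	}
}
~~~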
Please review the following PR: https://github.com/openshift/operator-framework-olm/pull/658
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
This is a clone of issue OCPBUGS-34706. The following is the description of the original issue:
—
Description of problem:
Regression of OCPBUGS-12739
level=warning msg="Couldn't unmarshall OVN annotations: ''. Skipping." err="unexpected end of JSON input"
Upstream OVN changed the node annotation from "k8s.ovn.org/host-addresses" to "k8s.ovn.org/host-cidrs" in OpenShift 4.14
https://github.com/ovn-org/ovn-kubernetes/pull/3915
We might need to fix baremetal-runtimecfg
diff --git a/pkg/config/node.go b/pkg/config/node.go
index 491dd4f..078ad77 100644
--- a/pkg/config/node.go
+++ b/pkg/config/node.go
@@ -367,10 +367,10 @@ func getNodeIpForRequestedIpStack(node v1.Node, filterIps []string, machineNetwo
 	log.Debugf("For node %s can't find address using NodeInternalIP. Fallback to OVN annotation.", node.Name)

 	var ovnHostAddresses []string
-	if err := json.Unmarshal([]byte(node.Annotations["k8s.ovn.org/host-addresses"]), &ovnHostAddresses); err != nil {
+	if err := json.Unmarshal([]byte(node.Annotations["k8s.ovn.org/host-cidrs"]), &ovnHostAddresses); err != nil {
 		log.WithFields(logrus.Fields{
 			"err": err,
-		}).Warnf("Couldn't unmarshall OVN annotations: '%s'. Skipping.", node.Annotations["k8s.ovn.org/host-addresses"])
+		}).Warnf("Couldn't unmarshall OVN annotations: '%s'. Skipping.", node.Annotations["k8s.ovn.org/host-cidrs"])
 	}
Version-Release number of selected component (if applicable):
4.16.0-0.nightly-2024-05-30-130713
How reproducible:
Frequent
Steps to Reproduce:
1. Deploy vsphere IPv4 cluster
2. Convert to Dualstack IPv4/IPv6
3. Add machine network and IPv6 apiServerInternalIPs and ingressIPs
4. Check keepalived.conf:
for f in $(oc get pods -n openshift-vsphere-infra -l app=vsphere-infra-vrrp --no-headers -o custom-columns=N:.metadata.name ) ; do oc -n openshift-vsphere-infra exec -c keepalived $f -- cat /etc/keepalived/keepalived.conf | tee $f-keepalived.conf ; done
Actual results:
IPv6 VIP is not in keepalived.conf
Expected results:
Something like:
vrrp_instance rbrattai_INGRESS_1 {
    state BACKUP
    interface br-ex
    virtual_router_id 129
    priority 20
    advert_int 1
    unicast_src_ip fd65:a1a8:60ad:271c::cc
    unicast_peer {
        fd65:a1a8:60ad:271c:9af:16a9:cb4f:d75c
        fd65:a1a8:60ad:271c:86ec:8104:1bc2:ab12
        fd65:a1a8:60ad:271c:5f93:c9cf:95f:9a6d
        fd65:a1a8:60ad:271c:bb4:de9e:6d58:89e7
        fd65:a1a8:60ad:271c:3072:2921:890:9263
    }
    ...
    virtual_ipaddress {
        fd65:a1a8:60ad:271c::1117/128
    }
    ...
}
Description of problem:
TuneD unnecessarily restarts twice when the current TuneD profile content changes and a new TuneD profile is selected at the same time.
Version-Release number of selected component (if applicable):
All NTO versions are affected.
How reproducible:
Depends on the order of k8s object updates (races), but nearly 100% reproducible.
Steps to Reproduce:
1. Install SNO
2. Label your SNO node with label "profile"
3. Create the following CR:
apiVersion: tuned.openshift.io/v1
kind: Tuned
metadata:
  name: openshift-profile
  namespace: openshift-cluster-node-tuning-operator
spec:
  profile:
  - data: |
      [main]
      summary=Custom OpenShift profile 1
      include=openshift-node
      [sysctl]
      kernel.pty.max=4096
    name: openshift-profile-1
  - data: |
      [main]
      summary=Custom OpenShift profile 2
      include=openshift-node
      [sysctl]
      kernel.pty.max=8192
    name: openshift-profile-2
  recommend:
  - match:
    - label: profile
    priority: 20
    profile: openshift-profile-1
4. Apply the following CR:
apiVersion: tuned.openshift.io/v1
kind: Tuned
metadata:
  name: openshift-profile
  namespace: openshift-cluster-node-tuning-operator
spec:
  profile:
  - data: |
      [main]
      summary=Custom OpenShift profile 1
      include=openshift-node
      [sysctl]
      kernel.pty.max=8192
    name: openshift-profile-1
  - data: |
      [main]
      summary=Custom OpenShift profile 2
      include=openshift-node
      [sysctl]
      kernel.pty.max=8192
    name: openshift-profile-2
  recommend:
  - match:
    - label: profile
    priority: 20
    profile: openshift-profile-2
Actual results:
You'll see two restarts/applications of openshift-profile-1:
$ cat tuned-operand.log |grep "profile-1' applied"
2024-04-19 06:10:54,685 INFO tuned.daemon.daemon: static tuning from profile 'openshift-profile-1' applied
2024-04-19 06:13:23,627 INFO tuned.daemon.daemon: static tuning from profile 'openshift-profile-1' applied
Expected results:
Only 1 application of openshift-profile-1:
$ cat tuned-operand.log |grep "profile-1' applied"
2024-04-19 07:20:31,600 INFO tuned.daemon.daemon: static tuning from profile 'openshift-profile-1' applied
Additional info:
This is a clone of issue OCPBUGS-37821. The following is the description of the original issue:
—
Openshift Dedicated is in the process of developing an offering of GCP clusters that uses only short-lived credentials from the end user. For these clusters to be deployed, the pod running the Openshift Installer needs to function with GCP credentials that fit the short-lived credential formats. This worked in prior Installer versions, such as 4.14, but was not an explicit requirement.
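For reference, a rough sketch of the short-lived (Workload Identity Federation) credential shape the installer pod would be given instead of a long-lived service account key; all identifiers and paths are placeholders:
~~~
{
  "type": "external_account",
  "audience": "//iam.googleapis.com/projects/123456789/locations/global/workloadIdentityPools/example-pool/providers/example-provider",
  "subject_token_type": "urn:ietf:params:oauth:token-type:jwt",
  "token_url": "https://sts.googleapis.com/v1/token",
  "service_account_impersonation_url": "https://iamcredentials.googleapis.com/v1/projects/-/serviceAccounts/installer@example-project.iam.gserviceaccount.com:generateAccessToken",
  "credential_source": {
    "file": "/var/run/secrets/openshift/serviceaccount/token",
    "format": { "type": "text" }
  }
}
~~~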
Description of problem:
There is a problem with the logic change in https://github.com/openshift/machine-config-operator/pull/4196 that is causing Kubelet to fail to start after a reboot on OpenShiftSDN deployments. This is currently breaking all of the v4 metal jobs.
Version-Release number of selected component (if applicable):
4.16
How reproducible:
Always
Steps to Reproduce:
1. Deploy baremetal cluster with OpenShiftSDN 2. 3.
Actual results:
Nodes fail to join cluster
Expected results:
Successful cluster deployment
Additional info:
Description of problem: In an environment with the following zones, topology was disabled while it should be enabled by default
$ openstack availability zone list --compute
+-----------+-------------+
| Zone Name | Zone Status |
+-----------+-------------+
| AZ-0      | available   |
| AZ-1      | available   |
| AZ-2      | available   |
+-----------+-------------+
$ openstack availability zone list --volume
+-----------+-------------+
| Zone Name | Zone Status |
+-----------+-------------+
| nova      | available   |
| AZ-0      | available   |
| AZ-1      | available   |
| AZ-2      | available   |
+-----------+-------------+
We have a check that verifies the number of zones is identical for compute and volumes. This check should be removed. However, we want to ensure that for every compute zone there is a matching volume zone, as sketched below.
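A minimal sketch of the intended replacement check, assuming we already have the two zone lists from the Availability Zone API; the function name and wiring are illustrative:
~~~
// Hypothetical validation: instead of requiring the compute and volume zone
// counts to match, require that every compute zone has a matching volume zone.
package main

import "fmt"

func validateZones(computeZones, volumeZones []string) error {
	volumes := make(map[string]struct{}, len(volumeZones))
	for _, z := range volumeZones {
		volumes[z] = struct{}{}
	}
	for _, z := range computeZones {
		if _, ok := volumes[z]; !ok {
			return fmt.Errorf("compute availability zone %q has no matching volume availability zone", z)
		}
	}
	return nil
}

func main() {
	compute := []string{"AZ-0", "AZ-1", "AZ-2"}
	volume := []string{"nova", "AZ-0", "AZ-1", "AZ-2"}
	fmt.Println(validateZones(compute, volume)) // <nil>: extra volume zones are fine
}
~~~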
Description of problem:
If you allow the installer to provision a Power VS Workspace instead of bringing your own, it can sometimes fail when creating a network. This is because the Power Edge Router can sometimes take up to a minute to configure.
Version-Release number of selected component (if applicable):
How reproducible:
Infrequent, but will probably hit it within 50-100 runs
Steps to Reproduce:
1. Install on Power VS with IPI with serviceInstanceGUID not set in the install-config.yaml
2. Occasionally you'll observe a failure due to the workspace not being ready for networks
Actual results:
Failure
Expected results:
Success
Additional info:
Not consistently reproducible
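A minimal sketch of the kind of wait/retry that would tolerate the Power Edge Router still configuring, using the polling helper from k8s.io/apimachinery; the create call and intervals are placeholders, not the installer's actual code:
~~~
// Hypothetical retry around network creation: keep retrying until the
// workspace is ready for networks, up to 2 minutes, instead of failing
// the install on the first attempt.
package main

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

func createNetworkWithRetry(create func() error) error {
	return wait.PollImmediate(10*time.Second, 2*time.Minute, func() (bool, error) {
		if err := create(); err != nil {
			fmt.Println("network create not ready yet, retrying:", err)
			return false, nil // keep polling instead of failing immediately
		}
		return true, nil
	})
}

func main() {
	attempts := 0
	err := createNetworkWithRetry(func() error {
		attempts++
		if attempts < 3 {
			return fmt.Errorf("workspace Power Edge Router still configuring")
		}
		return nil
	})
	fmt.Println("result:", err)
}
~~~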
Began permafailing somewhere in https://amd64.ocp.releases.ci.openshift.org/releasestream/4.16.0-0.ci/release/4.16.0-0.ci-2024-03-14-214308
{ fail [github.com/openshift/origin/test/e2e/upgrade/upgrade.go:199]: cluster is still being upgraded: registry.build02.ci.openshift.org/ci-op-588yb1d9/release@sha256:98f570afbb8492d9b393eecc929266e987ba75088af72b234b81d2702d63f75e Ginkgo exit error 1: exit with code 1} {Cluster did not complete upgrade: timed out waiting for the condition: Could not update customresourcedefinition "infrastructures.config.openshift.io" (48 of 887): the object is invalid, possibly due to local cluster configuration }
We suspect the latter message implicates https://github.com/openshift/api/pull/1802 and a revert is open now.
Slack thread: https://redhat-internal.slack.com/archives/C01CQA76KMX/p1710501463301079
Please review the following PR: https://github.com/openshift/machine-api-provider-gcp/pull/73
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
This is a clone of issue OCPBUGS-39220. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-37491. The following is the description of the original issue:
—
Description of problem:
co/ingress is always good even operator pod log error: 2024-07-24T06:42:09.580Z ERROR operator.canary_controller wait/backoff.go:226 error performing canary route check {"error": "error sending canary HTTP Request: Timeout: Get \"https://canary-openshift-ingress-canary.apps.hongli-aws.qe.devcluster.openshift.com\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"}
Version-Release number of selected component (if applicable):
4.17.0-0.nightly-2024-07-20-191204
How reproducible:
100%
Steps to Reproduce:
1. Install an AWS cluster
2. Update ingresscontroller/default and add "endpointPublishingStrategy.loadBalancer.allowedSourceRanges", e.g.
spec:
  endpointPublishingStrategy:
    loadBalancer:
      allowedSourceRanges:
      - 1.1.1.2/32
3. The above setting drops most traffic to the LB, so some operators degrade
Actual results:
co/authentication and console degraded but co/ingress is still good
$ oc get co
NAME             VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication   4.17.0-0.nightly-2024-07-20-191204   False       False         True       22m     OAuthServerRouteEndpointAccessibleControllerAvailable: Get "https://oauth-openshift.apps.hongli-aws.qe.devcluster.openshift.com/healthz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
console          4.17.0-0.nightly-2024-07-20-191204   False       False         True       22m     RouteHealthAvailable: failed to GET route (https://console-openshift-console.apps.hongli-aws.qe.devcluster.openshift.com): Get "https://console-openshift-console.apps.hongli-aws.qe.devcluster.openshift.com": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
ingress          4.17.0-0.nightly-2024-07-20-191204   True        False         False      3h58m
Check the ingress operator log and see:
2024-07-24T06:59:09.588Z ERROR operator.canary_controller wait/backoff.go:226 error performing canary route check {"error": "error sending canary HTTP Request: Timeout: Get \"https://canary-openshift-ingress-canary.apps.hongli-aws.qe.devcluster.openshift.com\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"}
Expected results:
co/ingress status should reflect the real condition in a timely manner
Additional info:
Even though the co/ingress status can be updated in some scenarios, it is always less sensitive than authentication and console; we always rely on authentication/console to know whether the routes are healthy, so the purpose of the ingress canary route becomes meaningless.
Align with the version-less-ness of `rhel-coreos` and `fedora-coreos` and shorten the overall tag.
Both tags are currently aliases; `centos-stream-coreos-9` will be removed in the future.
This story is to track the i18n upload/download routine tasks which are performed every sprint.
A.C.
Please review the following PR: https://github.com/openshift/bond-cni/pull/60
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
Whenever I click on a card in the operator hub and developer hub the console window refreshes.
Version-Release number of selected component (if applicable):
4.16
How reproducible:
Every time
Steps to Reproduce:
1. Go to operator hub or developer hub 2. Select any card
Actual results:
Window refreshes
Expected results:
The window should not refresh and show the side panel for the card
Additional info:
Functionality around this is still inconsistent.
The correct format for a patch is `something.patch_something_else`
Valid patch filename examples would be
`something.patch`
`something.patch_something_else`
`something.patch.patch_something`
Invalid patch filename examples would be
`something.patch.something`
`something.patch.something.else`
Code and validation need to be consistent in how this is enforced.
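As a sanity check, here is a hypothetical shell snippet encoding one possible reading of the rule above (every dot-separated suffix after the base name must begin with "patch"); the regex is illustrative only, not the project's actual validator:

for f in something.patch something.patch_something_else something.patch.patch_something something.patch.something; do
  # classify each name against the assumed rule
  if echo "$f" | grep -Eq '^[^.]+(\.patch[^.]*)+$'; then echo "valid:   $f"; else echo "invalid: $f"; fi
done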
Please review the following PR: https://github.com/openshift/machine-api-provider-azure/pull/90
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
This is a clone of issue OCPBUGS-41500. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-39081. The following is the description of the original issue:
—
If the network to the bootstrap VM is slow, the extract-machine-os.service can time out (after 180s). If this happens, it will be restarted, but services that depend on it (like ironic) will never be started even once it succeeds. systemd added support for Restart=on-failure for Type=oneshot services, but they still don't behave the same way as other types of services.
This can be simulated in dev-scripts by doing:
sudo tc qdisc add dev ostestbm root netem rate 33Mbit
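To illustrate the moving parts, a hypothetical systemd drop-in for experimentation (not the actual fix; the 600s value and path are arbitrary) would look like this:

# /etc/systemd/system/extract-machine-os.service.d/10-timeout.conf
[Service]
# Give the oneshot service longer to pull the image over a slow link
TimeoutStartSec=600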
Please review the following PR: https://github.com/openshift/cloud-provider-kubevirt/pull/30
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
This is a clone of issue OCPBUGS-33090. The following is the description of the original issue:
—
Description of problem:
When application grouping is unchecked in the display filters under the expand section, the topology display is distorted and the application name is also missing.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. Have some deployments 2. In topology unselect the application grouping in the display filter 3.
Actual results:
Topology shows distorted UI and Application name is missing.
Expected results:
The UI should render correctly and the application name should be present.
Additional info:
Screenshot:
https://drive.google.com/file/d/1z80qLrr5v-K8ZFDa3P-n7SoDMaFtuxI7/view?usp=sharing
This is a clone of issue OCPBUGS-39414. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-30811. The following is the description of the original issue:
—
Description of problem:
On CI, all the OpenStack- and Ansible-related software is taken from pip and ansible-galaxy instead of the OS repositories.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Please review the following PR: https://github.com/openshift/openstack-cinder-csi-driver-operator/pull/149
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
This is a clone of issue OCPBUGS-33662. The following is the description of the original issue:
—
Description of problem:
We should not require the s3:DeleteObject permission for installs when the `preserveBootstrapIgnition` option is set in the install-config.
Version-Release number of selected component (if applicable):
4.14+
How reproducible:
always
Steps to Reproduce:
1. Use an account without the permission 2. Set `preserveBootstrapIgnition: true` in the install-config.yaml 3. Try to deploy a cluster
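For reference, a minimal install-config fragment for step 2 might look like the following (the exact placement of the field under platform.aws is assumed here and should be checked against the installer documentation for your release):

platform:
  aws:
    region: us-east-2          # example value
    preserveBootstrapIgnition: true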
Actual results:
INFO Credentials loaded from the "denys3" profile in file "/home/cloud-user/.aws/credentials"
INFO Consuming Install Config from target directory
WARNING Action not allowed with tested creds action=s3:DeleteBucket
WARNING Action not allowed with tested creds action=s3:DeleteObject
WARNING Action not allowed with tested creds action=s3:DeleteObject
WARNING Tested creds not able to perform all requested actions
FATAL failed to fetch Cluster: failed to fetch dependency of "Cluster": failed to generate asset "Platform Permissions Check": validate AWS credentials: current credentials insufficient for performing cluster installation
Expected results:
No permission errors.
Additional info:
Please review the following PR: https://github.com/openshift/gcp-pd-csi-driver-operator/pull/106
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/cluster-storage-operator/pull/435
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Two quality-of-life improvements for e2e charts:
Tracker issue for bootimage bump in 4.16. This issue should block issues which need a bootimage bump to fix.
The previous bump was OCPBUGS-30600.
Description of problem:
Multiline EC private and public keys are causing systemd services to fail during cluster installations. While this issue does not result in a complete cluster failure, it generates warnings in the `journalctl` logs, which could confuse users when diagnosing installation issues. The root cause is systemd's inability to properly parse multiline keys, leading to service crashes and unnecessary log noise. This should be addressed to improve the clarity of logs and prevent misleading warnings during the cluster setup process.
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. Create a 4.16 cluster using ABI 2. Observe the journalctl logs 3. You will see warnings/errors mentioning that the EC keys (EC_PUBLIC_KEY_PEM) are not parsed correctly
Actual results:
Parsing errors related to EC keys
Expected results:
1. No parsing errors related to EC keys. Example: agent-register-infraenv.service: Ignoring invalid environment assignment 2. No multiline public key in /usr/local/share/assisted-service/assisted-service.env. Public key should be base64 encoded
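One illustrative way to satisfy the second point (a sketch only; the key path is hypothetical and the consuming service would need to decode the value):

# Store the PEM as a single base64 line so the systemd environment-file parser
# never sees embedded newlines.
EC_PUBLIC_KEY_PEM="$(base64 -w0 /path/to/ec_public_key.pem)"
echo "EC_PUBLIC_KEY_PEM=${EC_PUBLIC_KEY_PEM}" >> assisted-service.env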
Additional info:
Description of problem:
After installing an OpenShift IPI vSphere cluster, the coredns-monitor containers in the "openshift-vsphere-infra" namespace continuously report the message: "Failed to read ip from file /run/nodeip-configuration/ipv4" error="open /run/nodeip-configuration/ipv4: no such file or directory". The file "/run/nodeip-configuration/ipv4" present on the nodes is not actually mounted into the coredns pods. This does not appear to have any impact on the functionality of the cluster, but a "failed" message on the container can trigger alarms or needless investigation of the cluster.
Version-Release number of selected component (if applicable):
Any 4.12, 4.13, 4.14
How reproducible:
Always
Steps to Reproduce:
1. Install OpenShift IPI vSphere cluster 2. Wait for the installation to complete 3. Read the logs of any coredns-monitor container in the "openshift-vsphere-infra" namespace
Actual results:
coredns-monitor continuously reports the failure message, misleading a cluster administrator into searching for a real issue.
Expected results:
coredns-monitor should not report this failure message if there is nothing to fix.
Additional info:
The same issue happens in Baremetal IPI clusters.
This is a clone of issue OCPBUGS-32696. The following is the description of the original issue:
—
Description of problem:
In Openshift web console, Dashboards tab, data is not getting loaded for "Prometheus/Overview" Dashboard
Version-Release number of selected component (if applicable):
4.16.0-ec.5
How reproducible:
OCP 4.16.0-ec.5 cluster deployed on Power using UPI installer
Steps to Reproduce:
1. Deploy 4.16.0-ec.5 cluster using UPI installer 2. Login to web console 3. Select "Dashboards" panel under "Observe" tab 4. Select "Prometheus/Overview" from the "Dashboard" drop down
Actual results:
Data/graphs are not getting loaded. "No datapoints found." message is being displayed in all panels
Expected results:
Data/Graphs should be displayed
Additional info:
Screenshots and must-gather.log are available at https://drive.google.com/drive/folders/1XnotzYBC_UDN97j_LNVygwrc77Tmmbtx?usp=drive_link
Status of Prometheus pods:
[root@ha-416-sajam-bastion-0 ~]# oc get pods -n openshift-monitoring | grep prometheus
prometheus-adapter-dc7f96748-mczvq 1/1 Running 0 3h18m
prometheus-adapter-dc7f96748-vl4n8 1/1 Running 0 3h18m
prometheus-k8s-0 6/6 Running 0 7d2h
prometheus-k8s-1 6/6 Running 0 7d2h
prometheus-operator-677d4c87bd-8prnx 2/2 Running 0 7d2h
prometheus-operator-admission-webhook-54549595bb-gp9bw 1/1 Running 0 7d3h
prometheus-operator-admission-webhook-54549595bb-lsb2p 1/1 Running 0 7d3h
[root@ha-416-sajam-bastion-0 ~]#
Logs of Prometheus pods are available at https://drive.google.com/drive/folders/13DhLsQYneYpouuSsxYJ4VFhVrdJfQx8P?usp=drive_link
This is a clone of issue OCPBUGS-34959. The following is the description of the original issue:
—
Description of problem:
The tech preview jobs can sometimes fail: https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.16-e2e-vsphere-ovn-techpreview-serial/1787262709813743616 It seems early on the pinnedimageset controller can panic: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.16-e2e-vsphere-ovn-techpreview-serial/1787262709813743616/artifacts/e2e-vsphere-ovn-techpreview-serial/gather-extra/artifacts/pods/openshift-machine-config-operator_machine-config-controller-66559c9856-58g4w_machine-config-controller_previous.log Although it is fine on future syncs: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.16-e2e-vsphere-ovn-techpreview-serial/1787262709813743616/artifacts/e2e-vsphere-ovn-techpreview-serial/gather-extra/artifacts/pods/openshift-machine-config-operator_machine-config-controller-66559c9856-58g4w_machine-config-controller.log
Version-Release number of selected component (if applicable):
4.16.0 techpreview only
How reproducible:
Unsure
Steps to Reproduce:
See CI
Actual results:
Expected results:
Don't panic
Additional info:
This is a clone of issue OCPBUGS-37427. The following is the description of the original issue:
—
https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/openshift_installer/8758/pull-ci-openshift-installer-master-e2e-vsphere-ovn-zones/1815430759268225024/artifacts/e2e-vsphere-ovn-zones/ipi-install-install/artifacts/.openshift_install-1721673823.log
Description of problem:
OKD/FCOS uses FCOS for its bootimage which lacks several tools and services such as oc and crio that the rendezvous host of the Agent-based Installer needs to set up a bootstrap control plane.
Version-Release number of selected component (if applicable):
4.13.0 4.14.0 4.15.0
Description of the problem:
Environment: ACM running on a SNO. ACM is dual stack, but spoke cluster is IPv6 only
Attempting to deploy the spoke cluster via ZTP fails with proxy errors appearing in the ironic-python-agent logs:
Mar 26 19:11:53 sancrvdu3.cran-openshift.bete.ericy.com podman[2825]: 2024-03-26 19:11:53.361 1 CRITICAL ironic-python-agent [-] Unhandled error: requests.exceptions.ProxyError: HTTPSConnectionPool(host='10.240.92.11', port=5050): Max retries exceeded with url: /v1/continue (Caused by ProxyError('Cannot connect to proxy.', OSError('Tunnel connection failed: 403 Forbidden'))) Mar 26 19:11:53 sancrvdu3.cran-openshift.bete.ericy.com podman[2825]: 2024-03-26 19:11:53.361 1 ERROR ironic-python-agent Traceback (most recent call last): Mar 26 19:11:53 sancrvdu3.cran-openshift.bete.ericy.com podman[2825]: 2024-03-26 19:11:53.361 1 ERROR ironic-python-agent File "/usr/lib/python3.9/site-packages/urllib3/connectionpool.py", line 696, in urlopen Mar 26 19:11:53 sancrvdu3.cran-openshift.bete.ericy.com podman[2825]: 2024-03-26 19:11:53.361 1 ERROR ironic-python-agent self._prepare_proxy(conn) Mar 26 19:11:53 sancrvdu3.cran-openshift.bete.ericy.com podman[2825]: 2024-03-26 19:11:53.361 1 ERROR ironic-python-agent File "/usr/lib/python3.9/site-packages/urllib3/connectionpool.py", line 964, in _prepare_proxy Mar 26 19:11:53 sancrvdu3.cran-openshift.bete.ericy.com podman[2825]: 2024-03-26 19:11:53.361 1 ERROR ironic-python-agent conn.connect() Mar 26 19:11:53 sancrvdu3.cran-openshift.bete.ericy.com podman[2825]: 2024-03-26 19:11:53.361 1 ERROR ironic-python-agent File "/usr/lib/python3.9/site-packages/urllib3/connection.py", line 366, in connect Mar 26 19:11:53 sancrvdu3.cran-openshift.bete.ericy.com podman[2825]: 2024-03-26 19:11:53.361 1 ERROR ironic-python-agent self._tunnel() Mar 26 19:11:53 sancrvdu3.cran-openshift.bete.ericy.com podman[2825]: 2024-03-26 19:11:53.361 1 ERROR ironic-python-agent File "/usr/lib64/python3.9/http/client.py", line 930, in _tunnel Mar 26 19:11:53 sancrvdu3.cran-openshift.bete.ericy.com podman[2825]: 2024-03-26 19:11:53.361 1 ERROR ironic-python-agent raise OSError(f"Tunnel connection failed:") Mar 26 19:11:53 sancrvdu3.cran-openshift.bete.ericy.com podman[2825]: 2024-03-26 19:11:53.361 1 ERROR ironic-python-agent OSError: Tunnel connection failed: 403 Forbidden Mar 26 19:11:53 sancrvdu3.cran-openshift.bete.ericy.com podman[2825]: 2024-03-26 19:11:53.361 1 ERROR ironic-python-agent Mar 26 19:11:53 sancrvdu3.cran-openshift.bete.ericy.com podman[2825]: 2024-03-26 19:11:53.361 1 ERROR ironic-python-agent During handling of the above exception, another exception occurred: Mar 26 19:11:53 sancrvdu3.cran-openshift.bete.ericy.com podman[2825]: 2024-03-26 19:11:53.361 1 ERROR ironic-python-agent Mar 26 19:11:53 sancrvdu3.cran-openshift.bete.ericy.com podman[2825]: 2024-03-26 19:11:53.361 1 ERROR ironic-python-agent Traceback (most recent call last): Mar 26 19:11:53 sancrvdu3.cran-openshift.bete.ericy.com podman[2825]: 2024-03-26 19:11:53.361 1 ERROR ironic-python-agent File "/usr/lib/python3.9/site-packages/requests/adapters.py", line 439, in send Mar 26 19:11:53 sancrvdu3.cran-openshift.bete.ericy.com podman[2825]: 2024-03-26 19:11:53.361 1 ERROR ironic-python-agent resp = conn.urlopen( Mar 26 19:11:53 sancrvdu3.cran-openshift.bete.ericy.com podman[2825]: 2024-03-26 19:11:53.361 1 ERROR ironic-python-agent File "/usr/lib/python3.9/site-packages/urllib3/connectionpool.py", line 755, in urlopen Mar 26 19:11:53 sancrvdu3.cran-openshift.bete.ericy.com podman[2825]: 2024-03-26 19:11:53.361 1 ERROR ironic-python-agent retries = retries.increment( Mar 26 19:11:53 sancrvdu3.cran-openshift.bete.ericy.com podman[2825]: 2024-03-26 19:11:53.361 1 ERROR ironic-python-agent File 
"/usr/lib/python3.9/site-packages/urllib3/util/retry.py", line 574, in increment Mar 26 19:11:53 sancrvdu3.cran-openshift.bete.ericy.com podman[2825]: 2024-03-26 19:11:53.361 1 ERROR ironic-python-agent raise MaxRetryError(_pool, url, error or ResponseError(cause)) Mar 26 19:11:53 sancrvdu3.cran-openshift.bete.ericy.com podman[2825]: 2024-03-26 19:11:53.361 1 ERROR ironic-python-agent urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='10.240.92.11', port=5050): Max retries exceeded with url: /v1/continue (Caused by ProxyError('Cannot connect to proxy.', OSError('Tunnel connection failed: 403 Forbidden')))
How reproducible:
Always
Steps to reproduce:
1. Deploy ACM on a dual-stack SNO
2. Configure ACM for ZTP/GitOps
3. Use ZTP to deploy a spoke IPv6 only SNO
Actual results:
Proxy errors when the ironic agent attempts to communicate with ACM. The ironic-python-agent.conf incorrectly specifies the IPv4 endpoint:
$ cat /etc/ironic-python-agent.conf
[DEFAULT]
api_url = https://192.168.92.11:6385
inspection_callback_url = https://192.168.92.11:5050/v1/continue
insecure = True
enable_vlan_interfaces = all
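For contrast, a hypothetical IPv6-correct variant of the same file (the addresses here are invented for illustration) would be expected to carry bracketed IPv6 literals:

[DEFAULT]
api_url = https://[fd01::11]:6385
inspection_callback_url = https://[fd01::11]:5050/v1/continue
insecure = True
enable_vlan_interfaces = all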
Expected results:
Spoke cluster is successfully deployed via ZTP.
Please review the following PR: https://github.com/openshift/prometheus-alertmanager/pull/87
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/cluster-capi-operator/pull/150
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
This is a regression due to the fix for https://issues.redhat.com/browse/OCPBUGS-23069.
When using dual-stack networks with networks other than OVN or SDN a validation failure results. For example when using this networking config:
networking:
  clusterNetwork:
  - cidr: 10.128.0.0/14
    hostPrefix: 25
  - cidr: fd01::/48
    hostPrefix: 64
  networkType: Calico
The following error will be returned:
{ "id": "network-prefix-valid", "status": "failure", "message": "Unexpected status ValidationError" },
When the clusterNetwork prefixes are removed the following error will result:
{ "id": "network-prefix-valid", "status": "failure", "message": "Invalid Cluster Network prefix: Host prefix, now 0, must be a positive integer." },
This is a clone of issue OCPBUGS-35039. The following is the description of the original issue:
—
Description of problem:
If there was no DHCP Network Name, then the destroy code would skip deleting the DHCP resource. Now we add a test to see if the DHCP backing VM is in ERROR state. And, if so, delete it.
Please review the following PR: https://github.com/openshift/must-gather/pull/395
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
This is an extension of https://issues.redhat.com/browse/HOSTEDCP-190, in which we are adding container resource preservation to more hosted control plane components.
Please review the following PR: https://github.com/openshift/prometheus/pull/195
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
This is a clone of issue OCPBUGS-35400. The following is the description of the original issue:
—
Description of problem:
without specifying "kmsKeyServiceAccount" for controlPlane leads to creating bootstrap and control-plane machines failure
Version-Release number of selected component (if applicable):
4.16.0-0.nightly-multi-2024-06-12-211551
How reproducible:
Always
Steps to Reproduce:
1. "create install-config" and then insert disk encryption settings, but not set "kmsKeyServiceAccount" for controlPlane (see [2]) 2. "create cluster" (see [3])
Actual results:
"create cluster" failed with below error: ERROR failed to fetch Cluster: failed to generate asset "Cluster": failed to create cluster: failed to create control-plane manifest: GCPMachine.infrastructure.cluster.x-k8s.io "jiwei-0613d-capi-84z69-bootstrap" is invalid: spec.rootDiskEncryptionKey.kmsKeyServiceAccount: Invalid value: "": spec.rootDiskEncryptionKey.kmsKeyServiceAccount in body should match '[-_[A-Za-z0-9]+@[-_[A-Za-z0-9]+.iam.gserviceaccount.com
Expected results:
Installation should succeed.
Additional info:
FYI the QE test case: OCP-61160 - [IPI-on-GCP] install cluster with different custom managed keys for control-plane and compute nodes https://polarion.engineering.redhat.com/polarion/#/project/OSE/workitem?id=OCP-61160
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
Routers are restarting due to memory issues
Version-Release number of selected component (if applicable):
OCP 4.12.45
How reproducible:
not easy
Routers restart due to memory issues:
~~~
3h40m Warning ProbeError pod/router-default-56c9f67f66-j8xwn Readiness probe error: Get "http://localhost:1936/healthz/ready": context deadline exceeded (Client.Timeout exceeded while awaiting headers)...
3h40m Warning Unhealthy pod/router-default-56c9f67f66-j8xwn Readiness probe failed: Get "http://localhost:1936/healthz/ready": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
3h40m Warning ProbeError pod/router-default-56c9f67f66-j8xwn Liveness probe error: Get "http://localhost:1936/healthz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)...
3h40m Warning Unhealthy pod/router-default-56c9f67f66-j8xwn Liveness probe failed: Get "http://localhost:1936/healthz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
3h40m Normal Killing pod/router-default-56c9f67f66-j8xwn Container router failed liveness probe, will be restarted
3h40m Warning ProbeError pod/router-default-56c9f67f66-j8xwn Readiness probe error: HTTP probe failed with statuscode: 500...
3h40m Warning Unhealthy pod/router-default-56c9f67f66-j8xwn Readiness probe failed: HTTP probe failed with statuscode: 500
~~~
The node hosts only the router replica, and from Prometheus it can be verified that the routers consume all the memory in a short period of time, about 20G within an hour. At some point the number of haproxy processes increases and ends up consuming all memory resources, leading to a service disruption in a production environment. As console is one of the services with the highest activity per the router stats, so far the customer deletes the console pod and the process count drops from 45 to 12. The customer would like guidance on how to identify the process that is consuming the memory; haproxy monitoring is enabled but no dashboard is available. Router stats from when the router had 8G/6G/3G of memory available have been requested.
Additional info:
The customer claims this happens only in OCP 4.12.45; another active cluster is still on 4.10.39 and does not show the problem. The upgrade is blocked because of this.
Requested actions:
* hard-stop-after might be an option, but the customer expects information about the side effects of this configuration.
* How to reset the console connection from haproxy?
* Is there any documentation about haproxy Prometheus queries?
Description of problem:
Fix spelling "Rememeber" to "Remember"
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Please review the following PR: https://github.com/operator-framework/operator-marketplace/pull/554
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/ibm-vpc-block-csi-driver-operator/pull/98
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Caught by the test: Undiagnosed panic detected in pod
Sample job run:
Error message
{ pods/openshift-controller-manager_controller-manager-6b66bf5587-6ghjk_controller-manager.log.gz:E0426 23:06:02.367266 1 runtime.go:79] Observed a panic: &runtime.TypeAssertionError{_interface:(*abi.Type)(0x3c6a2a0), concrete:(*abi.Type)(0x3e612c0), asserted:(*abi.Type)(0x419cdc0), missingMethod:""} (interface conversion: interface {} is cache.DeletedFinalStateUnknown, not *v1.Secret) pods/openshift-controller-manager_controller-manager-6b66bf5587-6ghjk_controller-manager.log.gz:E0426 23:06:03.368403 1 runtime.go:79] Observed a panic: &runtime.TypeAssertionError{_interface:(*abi.Type)(0x3c6a2a0), concrete:(*abi.Type)(0x3e612c0), asserted:(*abi.Type)(0x419cdc0), missingMethod:""} (interface conversion: interface {} is cache.DeletedFinalStateUnknown, not *v1.Secret) pods/openshift-controller-manager_controller-manager-6b66bf5587-6ghjk_controller-manager.log.gz:E0426 23:06:04.370157 1 runtime.go:79] Observed a panic: &runtime.TypeAssertionError{_interface:(*abi.Type)(0x3c6a2a0), concrete:(*abi.Type)(0x3e612c0), asserted:(*abi.Type)(0x419cdc0), missingMethod:""} (interface conversion: interface {} is cache.DeletedFinalStateUnknown, not *v1.Secret)}
Sippy indicates it's happening a small percentage of the time since around Apr 25th.
Took out the last payload so labeling trt-incident for now.
See the linked OCPBUG for the actual component.
Description of problem:
Install a private cluster using Azure workload identity; the installation failed because no worker machines were provisioned.
install-config:
----------------------
platform:
  azure:
    region: eastus
    networkResourceGroupName: jima971b-12015319-rg
    virtualNetwork: jima971b-vnet
    controlPlaneSubnet: jima971b-master-subnet
    computeSubnet: jima971b-worker-subnet
    resourceGroupName: jima971b-rg
publish: Internal
credentialsMode: Manual
A detailed check on the cluster found that the machine-api/ingress/image-registry operators reported permission issues and have no access to the customer vnet.
$ oc get machine -n openshift-machine-api
NAME PHASE TYPE REGION ZONE AGE
jima971b-qqjb7-master-0 Running Standard_D8s_v3 eastus 2 5h14m
jima971b-qqjb7-master-1 Running Standard_D8s_v3 eastus 3 5h14m
jima971b-qqjb7-master-2 Running Standard_D8s_v3 eastus 1 5h15m
jima971b-qqjb7-worker-eastus1-mtc47 Failed 4h52m
jima971b-qqjb7-worker-eastus2-ph8bk Failed 4h52m
jima971b-qqjb7-worker-eastus3-hpmvj Failed 4h52m
Errors on worker machine:
--------------------
errorMessage: 'failed to reconcile machine "jima971b-qqjb7-worker-eastus1-mtc47": network.SubnetsClient#Get: Failure responding to request: StatusCode=403 -- Original Error: autorest/azure: Service returned an error. Status=403 Code="AuthorizationFailed" Message="The client ''705eb743-7c91-4a16-a7cf-97164edc0341'' with object id ''705eb743-7c91-4a16-a7cf-97164edc0341'' does not have authorization to perform action ''Microsoft.Network/virtualNetworks/subnets/read'' over scope ''/subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/jima971b-12015319-rg/providers/Microsoft.Network/virtualNetworks/jima971b-vnet/subnets/jima971b-worker-subnet'' or the scope is invalid. If access was recently granted, please refresh your credentials."'
errorReason: InvalidConfiguration
After manually creating a custom role with the missing permissions for machine-api/ingress/cloud-controller-manager/image-registry, and assigning it to the machine-api/ingress/cloud-controller-manager/image-registry user-assigned identities on the scope of the customer vnet, the cluster recovered and became running.
Permissions for machine-api/cloud-controller-manager/ingress on the customer vnet:
"Microsoft.Network/virtualNetworks/subnets/read",
"Microsoft.Network/virtualNetworks/subnets/join/action"
Permissions for image-registry on the customer vnet:
"Microsoft.Network/virtualNetworks/subnets/read",
"Microsoft.Network/virtualNetworks/subnets/join/action",
"Microsoft.Network/virtualNetworks/join/action"
Version-Release number of selected component (if applicable):
4.15 nightly build
How reproducible:
always on recent 4.15 payload
Steps to Reproduce:
1. prepare install-config with private cluster configuration + credentialsMode: Manual 2. using ccoctl tool to create workload identity 3. install cluster
Actual results:
Installation failed due to permission issues
Expected results:
ccoctl also needs to assign customer role to machine-api/ccm/image-registry user-assigned identity on scope of customer vnet if it is configured in install-config
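For illustration, a custom role covering the actions listed above could be sketched roughly as follows (the role name and scope are placeholders, not values taken from this bug):

{
  "Name": "ocp-byo-vnet-minimal",
  "Description": "Hypothetical custom role for the BYO-vnet identities",
  "Actions": [
    "Microsoft.Network/virtualNetworks/subnets/read",
    "Microsoft.Network/virtualNetworks/subnets/join/action",
    "Microsoft.Network/virtualNetworks/join/action"
  ],
  "AssignableScopes": ["/subscriptions/<subscription-id>/resourceGroups/<network-rg>"]
}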
Additional info:
Issue is only detected on 4.15, it works on 4.14.
This is a clone of issue OCPBUGS-39109. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-38011. The following is the description of the original issue:
—
Description of problem:
This seems to be a requirement to set Project/namespace. However, in the CLI, RoleBinding objects can be created without a namespace with no issues.
$ oc describe rolebinding.rbac.authorization.k8s.io/monitor
Name: monitor
Labels: <none>
Annotations: <none>
Role:
Kind: ClusterRole
Name: view
Subjects:
Kind Name Namespace
---- ---- ---------
ServiceAccount monitor
—
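For reference, the binding described above corresponds roughly to the following manifest (a sketch; when created via oc, the metadata namespace defaults to the current project and the subject namespace is simply left unset):

apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: monitor
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: view
subjects:
- kind: ServiceAccount
  name: monitor        # no namespace set on the subject, mirroring the CLI behaviour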
This is inconsistent with the dev console, causing confusion for developers and administrators and making things cumbersome.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. Login to the web console for Developer. 2. Select Project on the left. 3. Select 'Project Access' tab. 4. Add access -> Select Sevice Account on the dropdown
Actual results:
Save button is not active when no project is selected
Expected results:
The Save button should be enabled even though the Project is not selected, so that the binding can be created just as it is handled in the CLI.
Additional info:
Take a look at this document: https://docs.google.com/spreadsheets/d/16L3_0Jvug2XGCzOva7IH7YRIdCbqrIHJ2M9yyr9K6gk/edit#gid=1106887856 and remove unnecessary Unleash features
Description of problem:
Backport volumegroupsnapshot fixes to OCP 4.16; below are the PRs that need to be backported to external-snapshotter for OCP 4.16
Description of problem:
When external TCP traffic is IP fragmented with no DF flag set and is targeted at a pod external IP, the fragmented packets are answered with a TCP RST and are not delivered to the pod's application socket.
Version-Release number of selected component (if applicable):
$ oc version
Client Version: 4.14.8
Kustomize Version: v5.0.1
Server Version: 4.14.7
Kubernetes Version: v1.27.8+4fab27b
How reproducible:
I built a reproducer for this issue on KVM hosted OCP claster.
I can simulate the same traffic as can be seen in the customer's network.
So we do have a solid reproducer for the issue.
Details are in the JIRA updates.
Steps to Reproduce:
I wrote a simple C-based tcp_server/tcp_client application for testing.
The client simply sends a file towards the server from a networking namespace with pMTU discovery disabled. The server app runs in a pod and simply waits for connections, then reads the data from the socket and stores the received file into /tmp.
Along the way from the client namespace there is a veth pair with MTU 1000, while the path MTU is 1500.
This is enough to get ip packets fragmented along the way from the client to the server.
Details of the setup and testing steps are in the JIRA comments.
Actual results:
$ oc get network.operator -o yaml | grep routingViaHost
routingViaHost: false
All fragmented packets are answered with a TCP RST and are not delivered to the application socket in the pod.
Expected results:
Fragmented packets are delivered to the application socket running in a pod with
$ oc get network.operator -o yaml | grep routingViaHost
routingViaHost: false
Additional info:
There is a WA to prevent the issue.
$ oc get network.operator -o yaml | grep routingViaHost
routingViaHost: true
Makes the fragmented traffic arrive at the application socket in the pod.
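For completeness, one way to apply that workaround (field path taken from the OVN-Kubernetes gateway config; double-check it for your version):

oc patch network.operator cluster --type=merge \
  -p '{"spec":{"defaultNetwork":{"ovnKubernetesConfig":{"gatewayConfig":{"routingViaHost":true}}}}}'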
I can assist with the reproducer and testing on the test env.
Regards Michal Tesar
This is a clone of issue OCPBUGS-30889. The following is the description of the original issue:
—
Description of problem:
Trying to delete an application deployed using Serverless, with a user with limited permissions, causes the "Delete application" form to complain:
pipelines.tekton.dev is forbidden: User "uitesting" cannot list resource "pipelines" in API group "tekton.dev" in the namespace "test-cluster-local"
This prevents the deletion. Worth adding that the cluster doesn't have Pipelines installed.
See the screenshot: https://drive.google.com/file/d/1bsQ_NFO_grj_fE-UInUJXum39bPsHJh1
Version-Release number of selected component (if applicable):
4.15.0
How reproducible:
Always
Steps to Reproduce:
1. Create a limited user 2. Deploy some application, not necessarily a Serverless one 3. Try to delete the "application" using the Dev Console
Actual results:
An irrelevant error is shown, preventing the deletion: pipelines.tekton.dev is forbidden: User "uitesting" cannot list resource "pipelines" in API group "tekton.dev" in the namespace "test-cluster-local"
Expected results:
The app should be removed, with everything that's labelled with it.
Please review the following PR: https://github.com/openshift/cloud-provider-gcp/pull/48
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
The Hypershift CLI requires a security group ID when creating a nodepool; otherwise it fails to create the nodepool.
jiezhao-mac:hypershift jiezhao$ ./bin/hypershift create nodepool aws --name=test --cluster-name=jie-test --node-count=3
2024-02-20T11:29:19-05:00 ERROR Failed to create nodepool {"error": "security group ID was not specified and cannot be determined from default nodepool"}
github.com/openshift/hypershift/cmd/nodepool/core.(*CreateNodePoolOptions).CreateRunFunc.func1
        /Users/jiezhao/hypershift-test/hypershift/cmd/nodepool/core/create.go:39
github.com/spf13/cobra.(*Command).execute
        /Users/jiezhao/hypershift-test/hypershift/vendor/github.com/spf13/cobra/command.go:983
github.com/spf13/cobra.(*Command).ExecuteC
        /Users/jiezhao/hypershift-test/hypershift/vendor/github.com/spf13/cobra/command.go:1115
github.com/spf13/cobra.(*Command).Execute
        /Users/jiezhao/hypershift-test/hypershift/vendor/github.com/spf13/cobra/command.go:1039
github.com/spf13/cobra.(*Command).ExecuteContext
        /Users/jiezhao/hypershift-test/hypershift/vendor/github.com/spf13/cobra/command.go:1032
main.main
        /Users/jiezhao/hypershift-test/hypershift/main.go:78
runtime.main
        /usr/local/Cellar/go/1.20.4/libexec/src/runtime/proc.go:250
Error: security group ID was not specified and cannot be determined from default nodepool
security group ID was not specified and cannot be determined from default nodepool
jiezhao-mac:hypershift jiezhao$
Version-Release number of selected component (if applicable):
How reproducible:
always
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Nodepool creation should succeed without a security group specified in the hypershift CLI.
Additional info:
Please review the following PR: https://github.com/openshift/csi-driver-shared-resource-operator/pull/94
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
This is a clone of issue OCPBUGS-33410. The following is the description of the original issue:
—
Description of problem:
For IPI on vSphere: enable CAPI in the installer and install the cluster. After destroying the cluster, the destroy log says "All folders deleted", but the cluster folder still exists in the vSphere Client. Example:
05-08 20:24:38.765 level=debug msg=Delete Folder
05-08 20:24:40.649 level=debug msg=All folders deleted
05-08 20:24:40.649 level=debug msg=Delete StoragePolicy=openshift-storage-policy-wwei-0429g-fdwqc
05-08 20:24:41.576 level=info msg=Destroyed StoragePolicy=openshift-storage-policy-wwei-0429g-fdwqc
05-08 20:24:41.576 level=debug msg=Delete Tag=wwei-0429g-fdwqc
05-08 20:24:43.463 level=info msg=Deleted Tag=wwei-0429g-fdwqc
05-08 20:24:43.463 level=debug msg=Delete TagCategory=openshift-wwei-0429g-fdwqc
05-08 20:24:44.825 level=info msg=Deleted TagCategory=openshift-wwei-0429g-fdwqc
$ govc ls /DEVQEdatacenter/vm | grep wwei-0429g-fdwqc
/DEVQEdatacenter/vm/wwei-0429g-fdwqc
Version-Release number of selected component (if applicable):
4.16.0-0.nightly-2024-05-07-025557
How reproducible:
destroy a cluster that was installed with CAPI
Steps to Reproduce:
1. Install a cluster with CAPI 2. Destroy the cluster and check the cluster folder in the vSphere client
Actual results:
cluster folder still exists.
Expected results:
The cluster folder should not exist in the vSphere client after a successful destroy.
Additional info:
This is a clone of issue OCPBUGS-33963. The following is the description of the original issue:
—
Description of problem:
kube-apiserver was stuck updating versions when upgrading from 4.1 to 4.16 with an AWS IPI installation
Version-Release number of selected component (if applicable):
4.16.0-0.nightly-2024-05-01-111315
How reproducible:
always
Steps to Reproduce:
1. IPI Install an AWS 4.1 cluster, upgrade it to 4.16 2. Upgrade was stuck in 4.15 to 4.16, waiting on etcd, kube-apiserver updating
Actual results:
1. Upgrade was stuck in 4.15 to 4.16, waiting on etcd, kube-apiserver updating
$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.15.0-0.nightly-2024-05-16-091947   True        True          39m     Working towards 4.16.0-0.nightly-2024-05-16-092402: 111 of 894 done (12% complete)
Expected results:
Upgrade should be successful.
Additional info:
Must-gather: https://gcsweb-qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/qe-private-deck/logs/periodic-ci-openshift-openshift-tests-private-release-4.16-amd64-nightly-4.16-upgrade-from-stable-4.1-aws-ipi-f30/1791391925467615232/artifacts/aws-ipi-f30/gather-must-gather/artifacts/must-gather.tar Checked the must-gather logs, $ omg get clusterversion -oyaml ... conditions: - lastTransitionTime: '2024-05-17T09:35:29Z' message: Done applying 4.15.0-0.nightly-2024-05-16-091947 status: 'True' type: Available - lastTransitionTime: '2024-05-18T06:31:41Z' message: 'Multiple errors are preventing progress: * Cluster operator kube-apiserver is updating versions * Could not update flowschema "openshift-etcd-operator" (82 of 894): the server does not recognize this resource, check extension API servers' reason: MultipleErrors status: 'True' type: Failing $ omg get co | grep -v '.*True.*False.*False' NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE kube-apiserver 4.15.0-0.nightly-2024-05-16-091947 True True False 10m $ omg get pod -n openshift-kube-apiserver NAME READY STATUS RESTARTS AGE installer-40-ip-10-0-136-146.ec2.internal 0/1 Succeeded 0 2h29m installer-41-ip-10-0-143-206.ec2.internal 0/1 Succeeded 0 2h25m installer-43-ip-10-0-154-116.ec2.internal 0/1 Succeeded 0 2h22m installer-44-ip-10-0-154-116.ec2.internal 0/1 Succeeded 0 1h35m kube-apiserver-guard-ip-10-0-136-146.ec2.internal 1/1 Running 0 2h24m kube-apiserver-guard-ip-10-0-143-206.ec2.internal 1/1 Running 0 2h24m kube-apiserver-guard-ip-10-0-154-116.ec2.internal 0/1 Running 0 2h24m kube-apiserver-ip-10-0-136-146.ec2.internal 5/5 Running 0 2h27m kube-apiserver-ip-10-0-143-206.ec2.internal 5/5 Running 0 2h24m kube-apiserver-ip-10-0-154-116.ec2.internal 4/5 Running 17 1h34m revision-pruner-39-ip-10-0-136-146.ec2.internal 0/1 Succeeded 0 2h44m revision-pruner-39-ip-10-0-143-206.ec2.internal 0/1 Succeeded 0 2h50m revision-pruner-39-ip-10-0-154-116.ec2.internal 0/1 Succeeded 0 2h52m revision-pruner-40-ip-10-0-136-146.ec2.internal 0/1 Succeeded 0 2h29m revision-pruner-40-ip-10-0-143-206.ec2.internal 0/1 Succeeded 0 2h29m revision-pruner-40-ip-10-0-154-116.ec2.internal 0/1 Succeeded 0 2h29m revision-pruner-41-ip-10-0-136-146.ec2.internal 0/1 Succeeded 0 2h26m revision-pruner-41-ip-10-0-143-206.ec2.internal 0/1 Succeeded 0 2h26m revision-pruner-41-ip-10-0-154-116.ec2.internal 0/1 Succeeded 0 2h26m revision-pruner-42-ip-10-0-136-146.ec2.internal 0/1 Succeeded 0 2h24m revision-pruner-42-ip-10-0-143-206.ec2.internal 0/1 Succeeded 0 2h23m revision-pruner-42-ip-10-0-154-116.ec2.internal 0/1 Succeeded 0 2h23m revision-pruner-43-ip-10-0-136-146.ec2.internal 0/1 Succeeded 0 2h23m revision-pruner-43-ip-10-0-143-206.ec2.internal 0/1 Succeeded 0 2h23m revision-pruner-43-ip-10-0-154-116.ec2.internal 0/1 Succeeded 0 2h23m revision-pruner-44-ip-10-0-136-146.ec2.internal 0/1 Succeeded 0 1h35m revision-pruner-44-ip-10-0-143-206.ec2.internal 0/1 Succeeded 0 1h35m revision-pruner-44-ip-10-0-154-116.ec2.internal 0/1 Succeeded 0 1h35m Checked the kube-apiserver kube-apiserver-ip-10-0-154-116.ec2.internal logs, seems something wring with informers, $ grep 'informers not started yet' current.log | wc -l 360 $ grep 'informers not started yet' current.log 2024-05-18T06:34:51.888804183Z [-]informer-sync failed: 4 informers not started yet: [*v1.PriorityLevelConfiguration *v1.Secret *v1.FlowSchema *v1.ConfigMap] 2024-05-18T06:34:51.889350484Z [-]informer-sync failed: 4 informers not started yet: [*v1.PriorityLevelConfiguration *v1.FlowSchema *v1.Secret 
*v1.ConfigMap] 2024-05-18T06:34:52.004808401Z [-]informer-sync failed: 2 informers not started yet: [*v1.FlowSchema *v1.PriorityLevelConfiguration] 2024-05-18T06:34:52.095516498Z [-]informer-sync failed: 2 informers not started yet: [*v1.PriorityLevelConfiguration *v1.FlowSchema] ...
Tracker issue for bootimage bump in 4.16. This issue should block issues which need a bootimage bump to fix.
The previous bump was OCPBUGS-34693.
OCP version: 4.15.0
We have monitoring alerts configured against a cluster in our longevity setup.
After receiving alerts for metal3 - we examined the graph for the pod.
The graph indicates a continuous steady growth of memory consumption.
openshift/cluster-dns-operator#394 removed these flags, and the standalone manifests do not use them.
Currently causing an e2e outage.
Description of problem:
Creation of a second hostedcluster in the same namespace fails with the error "failed to set secret''s owner reference" in the status of the second hostedcluster's yaml.
~~~
conditions:
- lastTransitionTime: "2024-04-02T06:57:18Z"
  message: 'failed to reconcile the CLI secrets: failed to set secret''s owner reference'
  observedGeneration: 1
  reason: ReconciliationError
  status: "False"
  type: ReconciliationSucceeded
~~~
Note that the hosted control plane namespace is still different for both clusters. The customer is just following the doc - https://access.redhat.com/documentation/en-us/red_hat_advanced_cluster_management_for_kubernetes/2.9/html/clusters/cluster_mce_overview#creating-a-hosted-cluster-bm - for both clusters, and only the hostedcluster CR is created in the same namespace.
Version-Release number of selected component (if applicable):
4.14
How reproducible:
Steps to Reproduce:
1. Create a hostedcluster as per the doc https://access.redhat.com/documentation/en-us/red_hat_advanced_cluster_management_for_kubernetes/2.9/html/clusters/cluster_mce_overview#creating-a-hosted-cluster-bm 2. Create another hostedcluster in the same namespace where the first hostedcluster was created. 3. Second hostedcluster fails to proceed with the said error.
Actual results:
The hostedcluster creation fails
Expected results:
The hostedcluster creation should succeed
Additional info:
This is a clone of issue OCPBUGS-33758. The following is the description of the original issue:
—
Description of problem:
We have runbook for OVNKubernetesNorthdInactive: https://github.com/openshift/runbooks/blob/master/alerts/cluster-network-operator/OVNKubernetesNorthdInactive.md But the runbook url is not added for alert OVNKubernetesNorthdInactive: 4.12: https://github.com/openshift/cluster-network-operator/blob/c1a891129c310d01b8d6940f1eefd26058c0f5b6/bindata/network/ovn-kubernetes/managed/alert-rules-control-plane.yaml#L350 4.13: https://github.com/openshift/cluster-network-operator/blob/257435702312e418be694f4b98b8fe89557030c6/bindata/network/ovn-kubernetes/managed/alert-rules-control-plane.yaml#L350
Version-Release number of selected component (if applicable):
4.12.z, 4.13.z
How reproducible:
always
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
The following "error" shows up when running a gcp destroy: Invalid instance ci-op-nlm7chi8-8411c-4tl9r-master-0 in target pool af84a3203fc714c64a8043fdc814386f, target pool will not be destroyed" It is a bit misleading as this alerts when the resource is simply not part of the cluster.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
openshift-install is unable to generate an aarch64 iso: FATAL failed to write asset (Agent Installer ISO) to disk: missing boot.catalog file
Version-Release number of selected component (if applicable):
4.16
How reproducible:
100 %
Steps to Reproduce:
1. Create an install_config.yaml with controlplane.architecture and compute.architecture = arm64 2. openshift-install agent create image --log-level debug
Actual results:
DEBUG Generating Agent Installer ISO...
INFO Consuming Install Config from target directory
DEBUG Purging asset "Install Config" from disk
INFO Consuming Agent Config from target directory
DEBUG Purging asset "Agent Config" from disk
DEBUG initDisk(): start
DEBUG initDisk(): regular file
FATAL failed to write asset (Agent Installer ISO) to disk: missing boot.catalog file
Expected results:
agent.aarch64.iso is created
Additional info:
Seems to be related to this PR: https://github.com/openshift/installer/pull/7896 boot.catalog is also referenced in the assisted-image-service here: https://github.com/openshift/installer/blob/master/vendor/github.com/openshift/assisted-image-service/pkg/isoeditor/isoutil.go#L155
Display a warning message from kube-apiserver when creating resource integration tests
A.C.
Implement integration tests for the "display a warning toast notification after creating/updating a resource" action
Additional:
Use the Cypress feature to stub the resource response and the `cy.intercept` method to invoke the resource creation endpoint
Please review the following PR: https://github.com/openshift/cloud-provider-ibm/pull/62
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
e2e tests are unable to create a prometheus client when legacy service account API tokens are not auto-generated.
Cloned for TRT incident tracking:
During jobs that upgrade to 4.16 from 4.15, the testing of unauthenticated build webhook invocation fails (I suspect due to the existing rolebindings from 4.15 surviving the upgrade).
[sig-builds][Feature:Builds][webhook] TestWebhook [apigroup:build.openshift.io][apigroup:image.openshift.io] [Suite:openshift/conformance/parallel] . . . STEP: testing unauthenticated forbidden webhooks @ 05/07/24 20:03:20.024 STEP: executing the webhook to get the build object @ 05/07/24 20:03:20.024 [FAILED] in [It] - github.com/openshift/origin/test/extended/builds/webhook.go:36 @ 05/07/24 20:03:20.148
We need to reenable the e2e integration tests as soon as the operator is available again.
Please review the following PR: https://github.com/openshift/egress-router-cni/pull/80
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
This is a clone of issue OCPBUGS-36479. The following is the description of the original issue:
—
Description of problem:
As part of https://issues.redhat.com/browse/CFE-811, we added a featuregate "RouteExternalCertificate" to release the feature as TP, and all the code implementations were behind this gate. However, it seems https://github.com/openshift/api/pull/1731 inadvertently duplicated "ExternalRouteCertificate" as "RouteExternalCertificate".
Version-Release number of selected component (if applicable):
4.16
How reproducible:
100%
Steps to Reproduce:
$ oc get featuregates.config.openshift.io cluster -oyaml
<......>
spec:
  featureSet: TechPreviewNoUpgrade
status:
  featureGates:
    enabled:
    - name: ExternalRouteCertificate
    - name: RouteExternalCertificate
<......>
Actual results:
Both RouteExternalCertificate and ExternalRouteCertificate were added in the API
Expected results:
We should have only one featuregate "RouteExternalCertificate" and the same should be displayed in https://docs.openshift.com/container-platform/4.16/nodes/clusters/nodes-cluster-enabling-features.html
Additional info:
Git commits https://github.com/openshift/api/commit/11f491c2c64c3f47cea6c12cc58611301bac10b3 https://github.com/openshift/api/commit/ff31f9c1a0e4553cb63c3e530e46a3e8d2e30930 Slack thread: https://redhat-internal.slack.com/archives/C06EK9ZH3Q8/p1719867937186219
Please review the following PR: https://github.com/openshift/operator-framework-catalogd/pull/36
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Upon debugging, nodes are stuck in NotReady state and CNI is not initialised on them.
Seeing the following error log in cluster network operator
failed parsing certificate data from ConfigMap "openshift-service-ca.crt": failed to parse certificate PEM
CNO operator logs: https://docs.google.com/document/d/1hor1r9ue4gnetkXm9mh8AKa7vm8zNBPhUQqWCbbnnUc/edit?usp=sharing
This is happening on a management cluster that is configured to use legacy service CA's:
$ oc get kubecontrollermanager/cluster -o yaml --as system:admin
apiVersion: operator.openshift.io/v1
kind: KubeControllerManager
metadata:
  name: cluster
spec:
  logLevel: Normal
  managementState: Managed
  operatorLogLevel: Normal
  unsupportedConfigOverrides: null
  useMoreSecureServiceCA: false
In newer clusters, useMoreSecureServiceCA is set to true.
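For illustration only (not necessarily the supported remediation), flipping that flag on the management cluster would look roughly like:

oc patch kubecontrollermanager cluster --type=merge -p '{"spec":{"useMoreSecureServiceCA":true}}'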
Please review the following PR: https://github.com/openshift/prom-label-proxy/pull/364
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
Flake's gonna flake
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. Run e2e test 2. ... 3. Profit
Actual results:
Red
Expected results:
Green
Additional info:
Conditional risks have looser naming restrictions:
// +kubebuilder:validation:Required
// +kubebuilder:validation:MinLength=1
// +required
Name string `json:"name"`
...than condition Reason field:
// +required
// +kubebuilder:validation:Required
// +kubebuilder:validation:MaxLength=1024
// +kubebuilder:validation:MinLength=1
// +kubebuilder:validation:Pattern=`^[A-Za-z]([A-Za-z0-9_,:]*[A-Za-z0-9_])?$`
Reason string `json:"reason" protobuf:"bytes,5,opt,name=reason"`
CVO can use a risk name as a reason, so when the name of an applying risk is invalid, CVO will keep trying to update the ClusterVersion resource with an invalid value.
4.14
always
Make the cluster consume update graph data containing a conditional edge with a risk whose name does not follow the Condition.Reason restriction, e.g. uses a - character. The risk needs to apply to the cluster. For example:
{ "nodes": [ {"version": "CLUSTER-BOT-VERSION", "payload": "CLUSTER-BOT-PAYLOAD"}, {"version": "4.12.22", "payload": "quay.io/openshift-release-dev/ocp-release@sha256:1111111111111111111111111111111111111111111111111111111111111111"} ], "conditionalEdges": [ { "edges": [{"from": "CLUSTER-BOT-VERSION", "to": "4.12.22"}], "risks": [ { "url": "https://github.com/openshift/api/blob/8891815aa476232109dccf6c11b8611d209445d9/vendor/k8s.io/apimachinery/pkg/apis/meta/v1/types.go#L1515C4-L1520", "name": "OCPBUGS-9050", "message": "THere is no validation that risk name is a valid Condition.Reason so let's just use a - character in it.", "matchingRules": [{"type": "PromQL", "promql": { "promql": "group by (type) (cluster_infrastructure_provider)"}}] } ] } ] }
Then, observe the ClusterVersion status field after the cluster has a chance to evaluate the risk:
$ oc get clusterversion version -o jsonpath='{.status.conditions}' | jq '.[] | select(.type=="Progressing" or .type=="Available" or .type=="Failing")'
{ "lastTransitionTime": "2023-09-01T13:21:49Z", "status": "False", "type": "Available" } { "lastTransitionTime": "2023-09-01T13:21:49Z", "message": "ClusterVersion.config.openshift.io \"version\" is invalid: status.conditionalUpdates[0].conditions[0].reason: Invalid value: \"OCPBUGS-9050\": status.conditionalUpdates[0].conditions[0].reason in body should match '^[A-Za-z]([A-Za-z0-9_,:]*[A-Za-z0-9_])?$'", "status": "True", "type": "Failing" } { "lastTransitionTime": "2023-09-01T13:14:34Z", "message": "Error ensuring the cluster version is up to date: ClusterVersion.config.openshift.io \"version\" is invalid: status.conditionalUpdates[0].conditions[0].reason: Invalid value: \"OCPBUGS-9050\": status.conditionalUpdates[0].conditions[0].reason in body should match '^[A-Za-z]([A-Za-z0-9_,:]*[A-Za-z0-9_])?$'", "status": "False", "type": "Progressing" }
No errors, CVO continues to work and either uses some sanitized version of the name as the reason, or maybe uses something generic, like RiskApplies.
CVO does not get stuck after consuming data from external source
1. We should CI PRs to o/cincinnati-graph-data so we never create invalid data
2. We should sanitize the field in CVO code so that CVO never attempts to submit an invalid ClusterVersion.status.conditionalUpdates.condition.reason
3. We should further restrict the conditional update risk name in the CRD so it is guaranteed compatible with Condition.Reason
4. We should sanitize the field in CVO code after it is read from OSUS so that CVO never attempts to submit an invalid (after we do 3) ClusterVersion.conditionalUpdates.name (a minimal sanitization sketch follows this list)
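The following is a minimal sketch of the kind of sanitization suggested in points 2 and 4 above; it is illustrative only and not the actual CVO code. It maps an arbitrary risk name such as "OCPBUGS-9050" onto a string that satisfies the Condition.Reason pattern ^[A-Za-z]([A-Za-z0-9_,:]*[A-Za-z0-9_])?$, falling back to a generic "RiskApplies" reason when nothing usable remains.
~~~
package sanitize

import "regexp"

// invalidReasonChars matches every character that Condition.Reason does not allow.
var invalidReasonChars = regexp.MustCompile(`[^A-Za-z0-9_,:]`)

// reasonFromRiskName turns a conditional-update risk name into a valid Condition.Reason.
func reasonFromRiskName(name string) string {
	// Replace forbidden characters (e.g. "-") with "_".
	r := invalidReasonChars.ReplaceAllString(name, "_")
	// "," and ":" are allowed mid-string but not as the final character.
	for len(r) > 0 && (r[len(r)-1] == ',' || r[len(r)-1] == ':') {
		r = r[:len(r)-1]
	}
	if r == "" {
		return "RiskApplies" // generic fallback, as suggested above
	}
	// The first character must be a letter.
	if !isLetter(r[0]) {
		r = "Risk" + r
	}
	return r
}

func isLetter(c byte) bool {
	return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z')
}
~~~
With this, "OCPBUGS-9050" becomes "OCPBUGS_9050", which the ClusterVersion status update would accept.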
Please review the following PR: https://github.com/openshift/whereabouts-cni/pull/212
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
On ipv6-primary dualstack, it is observed that the test:
"[sig-installer][Suite:openshift/openstack][lb][Serial] The Openstack platform should re-use an existing UDP Amphora LoadBalancer when new svc is created on Openshift with the proper annotation"
fails because CCM considers the service "internal":
I0216 10:13:07.053922 1 loadbalancer.go:2113] "EnsureLoadBalancer" cluster="kubernetes" service="e2e-test-openstack-sprfn/udp-lb-shared2-svc" E0216 10:13:07.124915 1 controller.go:298] error processing service e2e-test-openstack-sprfn/udp-lb-shared2-svc (retrying with exponential backoff): failed to ensure load balancer: internal Service cannot share a load balancer I0216 10:13:07.125445 1 event.go:307] "Event occurred" object="e2e-test-openstack-sprfn/udp-lb-shared2-svc" fieldPath="" kind="Service" apiVersion="v1" type="Warning" reason="SyncLoadBalancerFailed" message="Error syncing load balancer: failed to ensure load balancer: internal Service cannot share a load balancer"
However, neither LB has the below annotation:
"service.beta.kubernetes.io/openstack-internal-load-balancer": "true"
Versions:
4.15.0-0.nightly-2024-02-14-052317
RHOS-16.2-RHEL-8-20230510.n.1
This is a clone of issue OCPBUGS-39313. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-29497. The following is the description of the original issue:
—
While updating an HC with controllerAvailabilityPolicy of SingleReplica, the HCP doesn't fully roll out, with 3 pods stuck in Pending:
multus-admission-controller-5b5c95684b-v5qgd 0/2 Pending 0 4m36s
network-node-identity-7b54d84df4-dxx27 0/3 Pending 0 4m12s
ovnkube-control-plane-647ffb5f4d-hk6fg 0/3 Pending 0 4m21s
This is because these deployments all have requiredDuringSchedulingIgnoredDuringExecution zone anti-affinity and maxUnavailable: 25% (i.e. 1).
Thus the old pod blocks scheduling of the new pod.
Description of problem:
While attempting to provision 300 clusters every hour of mixed cluster sizes (SNO, compact, and standard), it appears that the metal3 baremetal operator has hit a failure to provision any clusters. Out of the 1850 attempted clusters, only 282 successfully provisioned (mostly SNO). There are many errors in the baremetal operator log, some of which are actual stack traces, but it is unclear whether this is the actual reason why clusters began to fail to install, with 100% failing to install on the 3rd wave and beyond.
Version-Release number of selected component (if applicable):
Hub OCP - 4.14.0-rc.2 Deployed Cluster OCP - 4.14.0-rc.2 ACM - 2.9.0-DOWNSTREAM-2023-09-27-22-12-46
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Some of the errors found in the logs: {"level":"error","ts":"2023-09-28T22:39:56Z","msg":"Reconciler error","controller":"baremetalhost","controllerGroup":"metal3.io","controllerKind":"BareMetalHost","BareMetalHost":{"name":"vm01343","namespace":"compact-00046"},"namespace":"compact-00046","name":"vm01343","reconcileID":"4bbfa52f-12a6-4983-b86b-01086491de9f","error":"action \"provisioning\" failed: failed to provision: failed to change provisioning state to \"active\": Internal Server Error","errorVerbose":"Internal Server Error\nfailed to change provisioning state to \"active\"\ngithub.com/metal3-io/baremetal-operator/pkg/provisioner/ironic.(*ironicProvisioner).tryChangeNodeProvisionState\n\t/go/src/github.com/metal3-io/baremetal-operator/pkg/provisioner/ironic/ironic.go:740\ngithub.com/metal3-io/baremetal-operator/pkg/provisioner/ironic.(*ironicProvisioner).changeNodeProvisionState\n\t/go/src/github.com/metal3-io/baremetal-operator/pkg/provisioner/ironic/ironic.go:750\ngithub.com/metal3-io/baremetal-operator/pkg/provisioner/ironic.(*ironicProvisioner).Provision\n\t/go/src/github.com/metal3-io/baremetal-operator/pkg/provisioner/ironic/ironic.go:1604\ngithub.com/metal3-io/baremetal-operator/controllers/metal3%2eio.(*BareMetalHostReconciler).actionProvisioning\n\t/go/src/github.com/metal3-io/baremetal-operator/controllers/metal3.io/baremetalhost_controller.go:1179\ngithub.com/metal3-io/baremetal-operator/controllers/metal3%2eio.(*hostStateMachine).handleProvisioning\n\t/go/src/github.com/metal3-io/baremetal-operator/controllers/metal3.io/host_state_machine.go:527\ngithub.com/metal3-io/baremetal-operator/controllers/metal3%2eio.(*hostStateMachine).ReconcileState\n\t/go/src/github.com/metal3-io/baremetal-operator/controllers/metal3.io/host_state_machine.go:202\ngithub.com/metal3-io/baremetal-operator/controllers/metal3%2eio.(*BareMetalHostReconciler).Reconcile\n\t/go/src/github.com/metal3-io/baremetal-operator/controllers/metal3.io/baremetalhost_controller.go:225\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:118\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:314\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:265\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:226\nruntime.goexit\n\t/usr/lib/golang/src/runtime/asm_amd64.s:1598\nfailed to 
provision\ngithub.com/metal3-io/baremetal-operator/controllers/metal3%2eio.(*BareMetalHostReconciler).actionProvisioning\n\t/go/src/github.com/metal3-io/baremetal-operator/controllers/metal3.io/baremetalhost_controller.go:1188\ngithub.com/metal3-io/baremetal-operator/controllers/metal3%2eio.(*hostStateMachine).handleProvisioning\n\t/go/src/github.com/metal3-io/baremetal-operator/controllers/metal3.io/host_state_machine.go:527\ngithub.com/metal3-io/baremetal-operator/controllers/metal3%2eio.(*hostStateMachine).ReconcileState\n\t/go/src/github.com/metal3-io/baremetal-operator/controllers/metal3.io/host_state_machine.go:202\ngithub.com/metal3-io/baremetal-operator/controllers/metal3%2eio.(*BareMetalHostReconciler).Reconcile\n\t/go/src/github.com/metal3-io/baremetal-operator/controllers/metal3.io/baremetalhost_controller.go:225\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:118\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:314\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:265\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:226\nruntime.goexit\n\t/usr/lib/golang/src/runtime/asm_amd64.s:1598\naction \"provisioning\" 
failed\ngithub.com/metal3-io/baremetal-operator/controllers/metal3%2eio.(*BareMetalHostReconciler).Reconcile\n\t/go/src/github.com/metal3-io/baremetal-operator/controllers/metal3.io/baremetalhost_controller.go:229\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:118\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:314\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:265\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:226\nruntime.goexit\n\t/usr/lib/golang/src/runtime/asm_amd64.s:1598","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:324\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:265\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:226"} {"level":"info","ts":"2023-09-29T16:11:24Z","logger":"provisioner.ironic","msg":"error caught while checking endpoint","host":"standard-00241~vm03618","endpoint":"https://metal3-state.openshift-machine-api.svc.cluster.local:6388/v1/","error":"Bad Gateway"}
Please review the following PR: https://github.com/openshift/installer/pull/7819
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
This is a clone of issue OCPBUGS-33694. The following is the description of the original issue:
—
Description of problem:
kubelet does not start after reboot due to dependency issue
Version-Release number of selected component (if applicable):
OCP 4.14.23
How reproducible:
Every time at customer end
Steps to Reproduce:
1. Upgrade Openshift cluster (OVN based) with kdump enabled to OCP 4.14.23 2. Check kubelet and crio status
Actual results:
kubelet and crio services are in a dead state and do not start automatically after reboot; manual intervention is needed.

$ cat sos_commands/crio/systemctl_status_crio
○ crio.service - Container Runtime Interface for OCI (CRI-O)
     Loaded: loaded (/usr/lib/systemd/system/crio.service; disabled; preset: disabled)
    Drop-In: /etc/systemd/system/crio.service.d
             └─01-kubens.conf, 05-mco-ordering.conf, 10-mco-default-env.conf, 10-mco-default-madv.conf, 10-mco-profile-unix-socket.conf, 20-nodenet.conf
     Active: inactive (dead)
       Docs: https://github.com/cri-o/cri-o

$ cat sos_commands/openshift/systemctl_status_kubelet
○ kubelet.service - Kubernetes Kubelet
     Loaded: loaded (/etc/systemd/system/kubelet.service; enabled; preset: disabled)
    Drop-In: /etc/systemd/system/kubelet.service.d
             └─01-kubens.conf, 10-mco-default-env.conf, 10-mco-default-madv.conf, 20-logging.conf, 20-nodenet.conf
     Active: inactive (dead)
Expected results:
kubelet and crio should start automatically.
Additional info:
I suspect the recent patch to wait until kdump starts has introduced a systemd ordering cycle: https://github.com/openshift/machine-config-operator/pull/4213/files

May 09 19:12:05 network01 systemd[1]: network-online.target: Found dependency on kdump.service/start
May 09 19:13:48 network01 systemd[1]: ovs-configuration.service: Found ordering cycle on kdump.service/start
May 09 19:13:48 network01 systemd[1]: ovs-configuration.service: Job kdump.service/start deleted to break ordering cycle starting with ovs-configuration.service/start
May 12 21:20:57 network01 systemd[1]: node-valid-hostname.service: Found dependency on kdump.service/start
May 12 21:21:00 network01 kdumpctl[1389]: kdump: kexec: loaded kdump kernel
May 12 21:21:00 network01 kdumpctl[1389]: kdump: Starting kdump: [OK]
May 12 21:25:28 network01 systemd[1]: kdump.service: Found ordering cycle on network-online.target/start
May 12 21:25:28 network01 systemd[1]: kdump.service: Found dependency on node-valid-hostname.service/start
May 12 21:25:28 network01 systemd[1]: kdump.service: Found dependency on ovs-configuration.service/start
May 12 21:25:28 network01 systemd[1]: kdump.service: Found dependency on kdump.service/start
May 12 21:25:28 network01 systemd[1]: kdump.service: Job network-online.target/start deleted to break ordering cycle starting with kdump.service/start
May 12 21:25:31 network01 kdumpctl[1284]: kdump: kexec: loaded kdump kernel
May 12 21:25:31 network01 kdumpctl[1284]: kdump: Starting kdump: [OK]

To break a cycle, systemd deletes a job that is part of the cycle, so the corresponding service is not started. Disabling kdump and rebooting the node helps; kubelet and crio then start automatically.
# systemctl disable kdump
# systemctl reboot
Make sure systemctl list-jobs does not show any pending jobs; once it is completed, we can check the status of kubelet.
# systemctl list-jobs
# systemctl status kubelet
Add documentation on how to debug Azure nodes
Description of problem:
Specify "metadataService.authentication: Required" config for cluster: platform.aws.defaultMachinePlatform.metadataService.authentication: Required Or compute.plarform.aws.metadataService.authentication: Required controlPlane.plarform.aws.metadataService.authentication: Required Creating cluster got following error: INFO Created manifest *v1.Namespace, namespace= name=openshift-cluster-api-guests ... INFO Creating Route53 records for control plane load balancer ERROR failed to fetch Cluster: failed to generate asset "Cluster": failed to create cluster: failed to create control-plane manifest: AWSMachine.infrastructure.cluster.x-k8s.io "yunjiang-cap4-fp88c-bootstrap" is invalid: spec.instanceMetadataOptions.httpTokens: Unsupported value: "Required": supported values: "optional", "required" INFO Shutting down local Cluster API control plane... ... "Required" is a valid value: openshift-install explain installconfig.platform.aws.defaultMachinePlatform.metadataService.authentication KIND: InstallConfig VERSION: v1 RESOURCE: <string> Valid Values: "Required","Optional" ...
Version-Release number of selected component (if applicable):
4.16.0-0.nightly-2024-05-08-222442
How reproducible:
Steps to Reproduce:
1.See description 2. 3.
Actual results:
See description
Expected results:
No issues with above configurations
Additional info:
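A hedged sketch of the value mapping implied by the error above; this is illustrative only and not the installer's actual fix. The install-config accepts "Required"/"Optional", while the CAPA AWSMachine httpTokens field only accepts the lowercase "required"/"optional", so the value has to be normalized before the control-plane manifest is generated.
~~~
package aws

import "strings"

// httpTokensValue maps the install-config metadataService.authentication value
// onto the lowercase form expected by the generated AWSMachine manifest.
func httpTokensValue(installConfigValue string) string {
	switch installConfigValue {
	case "Required":
		return "required"
	case "Optional", "":
		return "optional"
	default:
		return strings.ToLower(installConfigValue)
	}
}
~~~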
This is a clone of issue OCPBUGS-37837. The following is the description of the original issue:
—
In our vertical scaling test, after we delete a machine, we rely on the `status.readyReplicas` field of the ControlPlaneMachineSet (CPMS) to indicate that it has successfully created a new machine that lets us scale up before we scale down.
https://github.com/openshift/origin/blob/3deedee4ae147a03afdc3d4ba86bc175bc6fc5a8/test/extended/etcd/vertical_scaling.go#L76-L87
As we've seen in the past, that status field isn't a reliable indicator of machine scale-up: status.readyReplicas might stay at 3 because the soon-to-be-removed node that is pending deletion can go Ready=Unknown in runs such as the following: https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_cluster-etcd-operator/1286/pull-ci-openshift-cluster-etcd-operator-master-e2e-aws-ovn-etcd-scaling/1808186565449486336
The test then ends up timing out waiting for status.readyReplicas=4 even though the scale-up and scale-down may already have happened.
This shows up across scaling tests on all platforms as:
fail [github.com/openshift/origin/test/extended/etcd/vertical_scaling.go:81]: Unexpected error: <*errors.withStack | 0xc002182a50>: scale-up: timed out waiting for CPMS to show 4 ready replicas: timed out waiting for the condition { error: <*errors.withMessage | 0xc00304c3a0>{ cause: <wait.errInterrupted>{ cause: <*errors.errorString | 0xc0003ca800>{ s: "timed out waiting for the condition", }, }, msg: "scale-up: timed out waiting for CPMS to show 4 ready replicas", },
In hindsight, all we care about is whether the deleted machine's member is replaced by another machine's member; we can ignore the flapping of node and machine statuses while we wait for the scale-up and then scale-down of members to happen. So we can relax or replace the check on status.readyReplicas by just looking at the membership change (see the sketch after the PS below).
PS: We can also update the outdated Godoc comments for the test to mention that it relies on CPMSO to create a machine for us https://github.com/openshift/origin/blob/3deedee4ae147a03afdc3d4ba86bc175bc6fc5a8/test/extended/etcd/vertical_scaling.go#L34-L38
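A hedged sketch of the relaxed check described above; this is not the actual origin test code. Rather than waiting for the CPMS to report status.readyReplicas=4, it polls until the deleted machine's etcd member is gone and the expected member count is restored. The listMemberNames function is passed in because the real helper for listing etcd members lives in the test suite.
~~~
package scaling

import (
	"context"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// waitForMemberReplacement waits until the deleted member has disappeared from the
// etcd membership and the membership is back at the expected size.
func waitForMemberReplacement(ctx context.Context, listMemberNames func(context.Context) ([]string, error), deletedMember string, expected int) error {
	return wait.PollUntilContextTimeout(ctx, 10*time.Second, 30*time.Minute, true,
		func(ctx context.Context) (bool, error) {
			names, err := listMemberNames(ctx)
			if err != nil {
				return false, nil // tolerate transient errors and keep polling
			}
			for _, n := range names {
				if n == deletedMember {
					return false, nil // old member still present
				}
			}
			return len(names) == expected, nil
		})
}
~~~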
Description of problem:
HyperShift control plane pods that support auditing (i.e. Kubernetes API server, OpenShift API server, and OpenShift oauth API server) maintain auditing log files that may consume many GB of container ephemeral storage in short period of time. We need to reduce the size of logs in these containers by modifying audit-log-maxbackup and audit-log-maxsize. This should not change the functionality of the audit logs since all we do is output to stdout in the containerd logs.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Please review the following PR: https://github.com/openshift/node_exporter/pull/141
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/csi-node-driver-registrar/pull/65
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
This is a clone of issue OCPBUGS-34683. The following is the description of the original issue:
—
Description of problem:
When building https://github.com/kubevirt-ui/kubevirt-plugin from its release-4.16 branch, following warnings are issued during the webpack build:
WARNING in shared module react No required version specified and unable to automatically determine one. Unable to find required version for "react" in description file (/home/vszocs/work/kubevirt-plugin/node_modules/react/package.json). It need to be in dependencies, devDependencies or peerDependencies.
These warnings should not appear during the plugin build.
Root cause seems to be the webpack module federation code, which attempts to auto-detect the actual build version of shared modules; this code seems to be unreliable, and warnings such as the one above are anything but helpful.
How reproducible: always on kubevirt-plugin branch release-4.16
Steps to Reproduce:
1. git clone https://github.com/kubevirt-ui/kubevirt-plugin
2. cd kubevirt-plugin
3. yarn && yarn dev
Our test that watches for alerts firing that we've never seen before has picked something up on https://amd64.ocp.releases.ci.openshift.org/releasestream/4.16.0-0.ci/release/4.16.0-0.ci-2024-02-02-031544
[sig-trt][invariant] No new alerts should be firing (0s)
{ Found alerts firing which are new or less than two weeks old, which should not be firing: PrometheusOperatorRejectedResources has no test data, this alert appears new and should not be firing}
It hit about 3-4 out of 10 on both the azure and aws aggregated jobs. Could be a regression, could be something really rare.
For years, the TechPreviewNoUpgrade alert has used:
cluster_feature_set{name!="", namespace="openshift-kube-apiserver-operator"} == 0
But recently testing 4.12.19, I saw the alert pending with name="LatencySensitive". Alerting on that not-useful-since-4.6 feature set is fine (OCPBUGS-14497), but TechPreviewNoUpgrade isn't a great name when the actual feature set is LatencySensitive. And the summary and description don't apply to LatencySensitive either.
The buggy expr / alertname pair shipped in 4.3.0.
All the time.
1. Install a cluster like 4.12.19.
2. Set the LatencySensitive feature set:
$ oc patch featuregate cluster --type=json --patch='[{"op":"add","path":"/spec/featureSet","value":"LatencySensitive"}]'
3. Check alerts with /monitoring/alerts?rowFilter-alert-source=platform&resource-list-text=TechPreviewNoUpgrade in the web console.
TechPreviewNoUpgrade is pending or firing.
Something appropriate to LatencySensitive, like a generic alert that covers all non-default feature sets, is pending or firing.
Please review the following PR: https://github.com/openshift/azure-workload-identity/pull/8
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
ovnkube-node doesn't issue a CSR to get new certificates when node is suspended for 30 days
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. Setup a libvirt cluster on machine
2. Disable chronyd on all nodes and host machine
3. Suspend nodes
4. Change time on host 30 days forward
5. Resume nodes
6. Wait for API server to come up
7. Wait for all operators to become ready
Actual results:
ovnkube-node would attempt to use expired certs: 2024-01-21T01:24:41.576365431+00:00 stderr F I0121 01:24:41.573615 8852 master.go:740] Adding or Updating Node "test-infra-cluster-4832ebf8-master-0" 2024-04-20T01:25:08.519622252+00:00 stderr F I0420 01:25:08.516550 8852 services_controller.go:567] Deleting service openshift-operator-lifecycle-manager/packageserver-service 2024-04-20T01:25:08.900228370+00:00 stderr F I0420 01:25:08.898580 8852 services_controller.go:567] Deleting service openshift-operator-lifecycle-manager/packageserver-service 2024-04-20T01:25:17.137956433+00:00 stderr F I0420 01:25:17.137891 8852 obj_retry.go:296] Retry object setup: *v1.Pod openshift-operator-lifecycle-manager/collect-profiles-28559595-wfdsp 2024-04-20T01:25:17.137956433+00:00 stderr F I0420 01:25:17.137933 8852 obj_retry.go:358] Adding new object: *v1.Pod openshift-operator-lifecycle-manager/collect-profiles-28559595-wfdsp 2024-04-20T01:25:17.137997952+00:00 stderr F I0420 01:25:17.137979 8852 obj_retry.go:370] Retry add failed for *v1.Pod openshift-operator-lifecycle-manager/collect-profiles-28559595-wfdsp, will try again later: failed to obtain IPs to add remote pod openshift-operator-lifecycle-manager/collect-profiles-28559595-wfdsp: suppressed error logged: pod openshift-operator-lifecycle-manager/collect-profiles-28559595-wfdsp: no pod IPs found 2024-04-20T01:25:19.099635059+00:00 stderr F I0420 01:25:19.099057 8852 egressservice_zone_node.go:110] Processing sync for Egress Service node test-infra-cluster-4832ebf8-master-1 2024-04-20T01:25:19.099635059+00:00 stderr F I0420 01:25:19.099080 8852 egressservice_zone_node.go:113] Finished syncing Egress Service node test-infra-cluster-4832ebf8-master-1: 35.077µs 2024-04-20T01:25:22.245550966+00:00 stderr F W0420 01:25:22.242774 8852 base_network_controller_namespace.go:458] Unable to remove remote zone pod's openshift-controller-manager/controller-manager-5485d88c84-xztxq IP address from the namespace address-set, err: pod openshift-controller-manager/controller-manager-5485d88c84-xztxq: no pod IPs found 2024-04-20T01:25:22.262446336+00:00 stderr F W0420 01:25:22.261351 8852 base_network_controller_namespace.go:458] Unable to remove remote zone pod's openshift-route-controller-manager/route-controller-manager-6b5868f887-n6jj9 IP address from the namespace address-set, err: pod openshift-route-controller-manager/route-controller-manager-6b5868f887-n6jj9: no pod IPs found 2024-04-20T01:25:27.154790226+00:00 stderr F I0420 01:25:27.154744 8852 egressservice_zone_node.go:110] Processing sync for Egress Service node test-infra-cluster-4832ebf8-worker-0 2024-04-20T01:25:27.154790226+00:00 stderr F I0420 01:25:27.154770 8852 egressservice_zone_node.go:113] Finished syncing Egress Service node test-infra-cluster-4832ebf8-worker-0: 31.72µs 2024-04-20T01:25:27.172301639+00:00 stderr F I0420 01:25:27.168666 8852 egressservice_zone_node.go:110] Processing sync for Egress Service node test-infra-cluster-4832ebf8-master-2 2024-04-20T01:25:27.172301639+00:00 stderr F I0420 01:25:27.168692 8852 egressservice_zone_node.go:113] Finished syncing Egress Service node test-infra-cluster-4832ebf8-master-2: 34.346µs 2024-04-20T01:25:27.196078022+00:00 stderr F I0420 01:25:27.194311 8852 egressservice_zone_node.go:110] Processing sync for Egress Service node test-infra-cluster-4832ebf8-master-0 2024-04-20T01:25:27.196078022+00:00 stderr F I0420 01:25:27.194339 8852 egressservice_zone_node.go:113] Finished syncing Egress Service node test-infra-cluster-4832ebf8-master-0: 
40.027µs 2024-04-20T01:25:27.196078022+00:00 stderr F I0420 01:25:27.194582 8852 master.go:740] Adding or Updating Node "test-infra-cluster-4832ebf8-master-0" 2024-04-20T01:25:27.215435944+00:00 stderr F I0420 01:25:27.215387 8852 master.go:740] Adding or Updating Node "test-infra-cluster-4832ebf8-master-0" 2024-04-20T01:25:35.789830706+00:00 stderr F I0420 01:25:35.789782 8852 egressservice_zone_node.go:110] Processing sync for Egress Service node test-infra-cluster-4832ebf8-worker-1 2024-04-20T01:25:35.790044794+00:00 stderr F I0420 01:25:35.790025 8852 egressservice_zone_node.go:113] Finished syncing Egress Service node test-infra-cluster-4832ebf8-worker-1: 250.227µs 2024-04-20T01:25:37.596875642+00:00 stderr F I0420 01:25:37.596834 8852 iptables.go:358] "Running" command="iptables-save" arguments=["-t","nat"] 2024-04-20T01:25:47.138312366+00:00 stderr F I0420 01:25:47.138266 8852 obj_retry.go:296] Retry object setup: *v1.Pod openshift-operator-lifecycle-manager/collect-profiles-28559595-wfdsp 2024-04-20T01:25:47.138382299+00:00 stderr F I0420 01:25:47.138370 8852 obj_retry.go:358] Adding new object: *v1.Pod openshift-operator-lifecycle-manager/collect-profiles-28559595-wfdsp 2024-04-20T01:25:47.138453866+00:00 stderr F I0420 01:25:47.138440 8852 obj_retry.go:370] Retry add failed for *v1.Pod openshift-operator-lifecycle-manager/collect-profiles-28559595-wfdsp, will try again later: failed to obtain IPs to add remote pod openshift-operator-lifecycle-manager/collect-profiles-28559595-wfdsp: suppressed error logged: pod openshift-operator-lifecycle-manager/collect-profiles-28559595-wfdsp: no pod IPs found 2024-04-20T01:26:17.138583468+00:00 stderr F I0420 01:26:17.138544 8852 obj_retry.go:296] Retry object setup: *v1.Pod openshift-operator-lifecycle-manager/collect-profiles-28559595-wfdsp 2024-04-20T01:26:17.138640587+00:00 stderr F I0420 01:26:17.138629 8852 obj_retry.go:358] Adding new object: *v1.Pod openshift-operator-lifecycle-manager/collect-profiles-28559595-wfdsp 2024-04-20T01:26:17.138708817+00:00 stderr F I0420 01:26:17.138696 8852 obj_retry.go:370] Retry add failed for *v1.Pod openshift-operator-lifecycle-manager/collect-profiles-28559595-wfdsp, will try again later: failed to obtain IPs to add remote pod openshift-operator-lifecycle-manager/collect-profiles-28559595-wfdsp: suppressed error logged: pod openshift-operator-lifecycle-manager/collect-profiles-28559595-wfdsp: no pod IPs found 2024-04-20T01:26:39.474787436+00:00 stderr F I0420 01:26:39.474744 8852 reflector.go:790] k8s.io/client-go/informers/factory.go:159: Watch close - *v1.EndpointSlice total 130 items received 2024-04-20T01:26:39.475670148+00:00 stderr F E0420 01:26:39.475653 8852 reflector.go:147] k8s.io/client-go/informers/factory.go:159: Failed to watch *v1.EndpointSlice: the server has asked for the client to provide credentials (get endpointslices.discovery.k8s.io) 2024-04-20T01:26:40.786339334+00:00 stderr F I0420 01:26:40.786255 8852 reflector.go:325] Listing and watching *v1.EndpointSlice from k8s.io/client-go/informers/factory.go:159 2024-04-20T01:26:40.806238387+00:00 stderr F W0420 01:26:40.804542 8852 reflector.go:535] k8s.io/client-go/informers/factory.go:159: failed to list *v1.EndpointSlice: Unauthorized 2024-04-20T01:26:40.806238387+00:00 stderr F E0420 01:26:40.804571 8852 reflector.go:147] k8s.io/client-go/informers/factory.go:159: Failed to watch *v1.EndpointSlice: failed to list *v1.EndpointSlice: Unauthorized
Expected results:
ovnkube-node detects that cert is expired, requests new certs via CSR flow and reloads them
Additional info:
CI periodic to check this flow: https://prow.ci.openshift.org/job-history/gs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.16-e2e-metal-ovn-sno-cert-rotation-suspend-30d (artifacts contain an sosreport). This applies to SNO and HA clusters, and it works as expected when nodes are properly shut down instead of suspended.
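A hedged, illustrative sketch of the expected behaviour above, not the actual ovn-kubernetes implementation: after the node resumes, check whether the client certificate has already expired and, if so, go through the CSR flow again instead of retrying with stale credentials.
~~~
package certs

import (
	"crypto/x509"
	"encoding/pem"
	"fmt"
	"time"
)

// certExpired reports whether the PEM-encoded client certificate is already past
// its NotAfter timestamp at the given time.
func certExpired(pemBytes []byte, now time.Time) (bool, error) {
	block, _ := pem.Decode(pemBytes)
	if block == nil {
		return false, fmt.Errorf("no PEM data found in certificate")
	}
	cert, err := x509.ParseCertificate(block.Bytes)
	if err != nil {
		return false, err
	}
	return now.After(cert.NotAfter), nil
}
~~~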
This is a clone of issue OCPBUGS-33733. The following is the description of the original issue:
—
Description of problem:
When creating a Serverless Function via the Web Console from a Git repository, the validation claims that the builder strategy is not s2i. However, if the build strategy is not set in func.yaml, then s2i should be assumed implicitly and there should be no error. There should be an error only if the strategy is explicitly set to something other than s2i in func.yaml.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. Try to create Serverless function from git repository where func.yaml does not explicitly specify builder. 2. The Serverless Function cannot be created because of the validation.
Actual results:
The Function cannot be created.
Expected results:
The function can be created.
Additional info:
This is a clone of issue OCPBUGS-34877. The following is the description of the original issue:
—
Description of problem:
`oc adm prune deployments` does not work and gives the below error when using the --replica-sets option.
[root@weyb1525 ~]# oc adm prune deployments --orphans --keep-complete=1 --keep-failed=0 --keep-younger-than=1440m --replica-sets --v=6 I0603 09:55:39.588085 1540280 loader.go:373] Config loaded from file: /root/openshift-install/paas-03.build.net.intra.laposte.fr/auth/kubeconfig I0603 09:55:39.890672 1540280 round_trippers.go:553] GET https://api-int.paas-03.build.net.intra.laposte.fr:6443/apis/apps.openshift.io/v1/deploymentconfigs 200 OK in 301 milliseconds Warning: apps.openshift.io/v1 DeploymentConfig is deprecated in v4.14+, unavailable in v4.10000+ I0603 09:55:40.529367 1540280 round_trippers.go:553] GET https://api-int.paas-03.build.net.intra.laposte.fr:6443/apis/apps/v1/deployments 200 OK in 65 milliseconds I0603 09:55:41.369413 1540280 round_trippers.go:553] GET https://api-int.paas-03.build.net.intra.laposte.fr:6443/api/v1/replicationcontrollers 200 OK in 706 milliseconds I0603 09:55:43.083804 1540280 round_trippers.go:553] GET https://api-int.paas-03.build.net.intra.laposte.fr:6443/apis/apps/v1/replicasets 200 OK in 118 milliseconds I0603 09:55:43.320700 1540280 prune.go:58] Creating deployment pruner with keepYoungerThan=24h0m0s, orphans=true, replicaSets=true, keepComplete=1, keepFailed=0 Dry run enabled - no modifications will be made. Add --confirm to remove deployments panic: interface conversion: interface {} is *v1.Deployment, not *v1.DeploymentConfig goroutine 1 [running]: github.com/openshift/oc/pkg/cli/admin/prune/deployments.(*dataSet).GetDeployment(0xc007fa9bc0, {0x5052780?, 0xc00a0b67b0?}) /go/src/github.com/openshift/oc/pkg/cli/admin/prune/deployments/data.go:171 +0x3d6 github.com/openshift/oc/pkg/cli/admin/prune/deployments.(*orphanReplicaResolver).Resolve(0xc006ec87f8) /go/src/github.com/openshift/oc/pkg/cli/admin/prune/deployments/resolvers.go:78 +0x1a6 github.com/openshift/oc/pkg/cli/admin/prune/deployments.(*mergeResolver).Resolve(0x55?) /go/src/github.com/openshift/oc/pkg/cli/admin/prune/deployments/resolvers.go:28 +0xcf github.com/openshift/oc/pkg/cli/admin/prune/deployments.(*pruner).Prune(0x5007c40?, {0x50033e0, 0xc0083c19e0}) /go/src/github.com/openshift/oc/pkg/cli/admin/prune/deployments/prune.go:96 +0x2f github.com/openshift/oc/pkg/cli/admin/prune/deployments.PruneDeploymentsOptions.Run({0x0, 0x1, 0x1, 0x4e94914f0000, 0x1, 0x0, {0x0, 0x0}, {0x5002d00, 0xc000ba78c0}, ...}) /go/src/github.com/openshift/oc/pkg/cli/admin/prune/deployments/deployments.go:206 +0xa03 github.com/openshift/oc/pkg/cli/admin/prune/deployments.NewCmdPruneDeployments.func1(0xc0005f4900?, {0xc0006db020?, 0x0?, 0x6?}) /go/src/github.com/openshift/oc/pkg/cli/admin/prune/deployments/deployments.go:78 +0x118 github.com/spf13/cobra.(*Command).execute(0xc0005f4900, {0xc0006dafc0, 0x6, 0x6}) /go/src/github.com/openshift/oc/vendor/github.com/spf13/cobra/command.go:944 +0x847 github.com/spf13/cobra.(*Command).ExecuteC(0xc000e5b800) /go/src/github.com/openshift/oc/vendor/github.com/spf13/cobra/command.go:1068 +0x3bd github.com/spf13/cobra.(*Command).Execute(...) /go/src/github.com/openshift/oc/vendor/github.com/spf13/cobra/command.go:992 k8s.io/component-base/cli.run(0xc000e5b800) /go/src/github.com/openshift/oc/vendor/k8s.io/component-base/cli/run.go:146 +0x317 k8s.io/component-base/cli.RunNoErrOutput(...) /go/src/github.com/openshift/oc/vendor/k8s.io/component-base/cli/run.go:84 main.main() /go/src/github.com/openshift/oc/cmd/oc/oc.go:77 +0x365
Version-Release number of selected component (if applicable):
How reproducible:
Run oc adm prune deployments command with --replica-sets option
# oc adm prune deployments --keep-younger-than=168h --orphans --keep-complete=5 --keep-failed=1 --replica-sets=true
Actual results:
It fails with the below error: panic: interface conversion: interface {} is *v1.Deployment, not *v1.DeploymentConfig
Expected results:
It should not fail and should work as expected.
Additional info:
Slack thread https://redhat-internal.slack.com/archives/CKJR6200N/p1717519017531979
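A minimal sketch of the kind of guard that would avoid the panic above; this is illustrative only, not the actual oc fix. With --replica-sets the pruner's data set can contain plain Deployments, so the unchecked assertion to *DeploymentConfig in data.go must be replaced by a type switch that handles (or rejects) each kind explicitly.
~~~
package prune

import (
	"fmt"

	appsv1 "github.com/openshift/api/apps/v1"
	kappsv1 "k8s.io/api/apps/v1"
)

// asDeploymentConfig returns the item as a DeploymentConfig, or an error if the
// item is a different kind (e.g. a Deployment pulled in by --replica-sets).
func asDeploymentConfig(item interface{}) (*appsv1.DeploymentConfig, error) {
	switch obj := item.(type) {
	case *appsv1.DeploymentConfig:
		return obj, nil
	case *kappsv1.Deployment:
		return nil, fmt.Errorf("%s/%s is a Deployment, not a DeploymentConfig", obj.Namespace, obj.Name)
	default:
		return nil, fmt.Errorf("unexpected object type %T", item)
	}
}
~~~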
Description of problem:
`ensureSigningCertKeyPair` and `ensureTargetCertKeyPair` always update the secret type. If the secret requires a metadata update, its previous content will not be retained.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. Install a 4.6 cluster (or make sure installer-generated secrets have `type: SecretTypeTLS` instead of `type: kubernetes.io/tls`)
2. Run the secret sync
3. Check the secret contents
Actual results:
Secret was regenerated with new content
Expected results:
Existing content should be preserved; the content should not be modified.
Additional info:
This causes an api-int CA update for clusters born in 4.6 or earlier.
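A minimal sketch of the expected behaviour, not the library-go implementation: when a secret only needs a type or metadata update, carry the existing certificate material over to the desired secret so the signing CA is not regenerated.
~~~
package certsync

import corev1 "k8s.io/api/core/v1"

// withPreservedContent copies the existing tls.crt/tls.key into the desired secret
// so that a metadata-only or type-only update does not throw away the current CA.
func withPreservedContent(existing, desired *corev1.Secret) *corev1.Secret {
	out := desired.DeepCopy()
	if len(existing.Data["tls.crt"]) > 0 && len(existing.Data["tls.key"]) > 0 {
		out.Data = map[string][]byte{
			"tls.crt": existing.Data["tls.crt"],
			"tls.key": existing.Data["tls.key"],
		}
	}
	return out
}
~~~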
Description of problem:
When trying to run the console in local development with auth, the run-bridge.sh script fails.
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. Follow step for local development of console with auth - https://github.com/openshift/console/tree/master?tab=readme-ov-file#openshift-with-authentication 2. 3.
Actual results:
The run-bridge.sh scripts fails with:
$ ./examples/run-bridge.sh ++ oc whoami --show-server ++ oc -n openshift-config-managed get configmap monitoring-shared-config -o 'jsonpath={.data.alertmanagerPublicURL}' ++ oc -n openshift-config-managed get configmap monitoring-shared-config -o 'jsonpath={.data.thanosPublicURL}' + ./bin/bridge --base-address=http://localhost:9000 --ca-file=examples/ca.crt --k8s-auth=openshift --k8s-mode=off-cluster --k8s-mode-off-cluster-endpoint=https://api.lprabhu-030420240903.devcluster.openshift.com:6443 --k8s-mode-off-cluster-skip-verify-tls=true --listen=http://127.0.0.1:9000 --public-dir=./frontend/public/dist --user-auth=openshift --user-auth-oidc-client-id=console-oauth-client --user-auth-oidc-client-secret-file=examples/console-client-secret --user-auth-oidc-ca-file=examples/ca.crt --k8s-mode-off-cluster-alertmanager=https://alertmanager-main-openshift-monitoring.apps.lprabhu-030420240903.devcluster.openshift.com --k8s-mode-off-cluster-thanos=https://thanos-querier-openshift-monitoring.apps.lprabhu-030420240903.devcluster.openshift.com W0403 14:25:07.936281 49352 authoptions.go:99] Flag inactivity-timeout is set to less then 300 seconds and will be ignored! F0403 14:25:07.936827 49352 main.go:539] Failed to create k8s HTTP client: failed to read token file "/var/run/secrets/kubernetes.io/serviceaccount/token": open /var/run/secrets/kubernetes.io/serviceaccount/token: no such file or directory
Expected results:
Bridge runs fine
Additional info:
Description of problem:
Creating an ovn-k8s-cni-overlay NetworkAttachmentDefinition generates incorrect YAML
Version-Release number of selected component (if applicable):
4.16.9
How reproducible:
100%
Steps to Reproduce:
1. Console 2.Networking 3. NAD 4. Create 5.Network type = OVN localnet
Actual results:
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  annotations:
    k8s.v1.cni.cncf.io/resourceName: openshift.io/   <--------- wrong
  creationTimestamp: "2024-09-13T04:58:15Z"
  generation: 1
  name: network-aware-ermine
  namespace: default
  resourceVersion: "43545754"
  uid: 543537e3-6981-4d43-a2cb-4f77b9b70824
spec:
  config: '{"name":"asdasdsa","type":"ovn-k8s-cni-overlay","cniVersion":"0.4.0","topology":"localnet","vlanID":3000,"mtu":1500,"netAttachDefName":"default/network-aware-ermine"}'
Expected results:
Without that annotation
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  creationTimestamp: "2024-09-13T04:58:15Z"
  generation: 1
  name: network-aware-ermine
  namespace: default
  resourceVersion: "43545754"
  uid: 543537e3-6981-4d43-a2cb-4f77b9b70824
spec:
  config: '{"name":"asdasdsa","type":"ovn-k8s-cni-overlay","cniVersion":"0.4.0","topology":"localnet","vlanID":3000,"mtu":1500,"netAttachDefName":"default/network-aware-ermine"}'
Additional info:
This makes pods and virtual machines using the NAD fail to start with "`Invalid value: openshift.io/: name part must be non-empty`"
Description of problem:
Dev console BuildConfig creation got a [the server does not allow this method on the requested resource] error when not setting metadata.namespace
How reproducible:
The test case is shown below
Steps to Reproduce:
Use the below to create a BuildConfig in the GUI page of the openshift console -> Developer -> Builds -> Create BuildConfig -> YAML view
~~~
apiVersion: build.openshift.io/v1
kind: BuildConfig
metadata:
  name: mywebsite
  labels:
    name: mywebsite
spec:
  triggers:
  - type: ImageChange
    imageChange: {}
  - type: ConfigChange
  source:
    type: Git
    git:
      uri: https://github.com/monodot/container-up
    contextDir: httpd-hello-world
  strategy:
    type: Docker
    dockerStrategy:
      dockerfilePath: Dockerfile
      from:
        kind: ImageStreamTag
        name: httpd:latest
        namespace: testbuild
  output:
    to:
      kind: ImageStreamTag
      name: mywebsite:latest
~~~
Actual results:
Get [the server does not allow this method on the requested resource] error
Expected results:
We found that not setting metadata.namespace via the CLI does not trigger this error, and the customer reports that the 4.11 GUI console does not trigger it either. Does that mean the code changed in 4.13?
Creating a Serverless Deployment with the "Scaling" "Min Pods"/"Max Pods" options set uses the deprecated Knative annotations "autoscaling.knative.dev/minScale" / "maxScale"; the correct current ones are "autoscaling.knative.dev/min-scale" / "max-scale".
The same problem applies to "autoscaling.knative.dev/targetUtilizationPercentage", which should be "autoscaling.knative.dev/target-utilization-percentage".
Serverless operator
The created ksvc resource has
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/maxScale: "3"
        autoscaling.knative.dev/minScale: "2"
        autoscaling.knative.dev/targetUtilizationPercentage: "70"
The created ksvc should have
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/max-scale: "3"
        autoscaling.knative.dev/min-scale: "2"
        autoscaling.knative.dev/target-utilization-percentage: "70"
4.14.8
none required ATM, current serverless still supports the deprecated "minScale"/"maxScale" annotations.
https://issues.redhat.com/browse/SRVKS-910
Description of problem:
When new nodes are scaled up, the PinnedImageSets status in the MCP takes a long while to be updated. Nevertheless, after a long time (sometimes about 30mins) they are correctly updated.
Version-Release number of selected component (if applicable):
pre-merge: https://github.com/openshift/machine-config-operator/pull/4303
How reproducible:
Always
Steps to Reproduce:
1. Create a pinnedimageset for the worker node with one only image (so that it is fast) 2. Wait until the pinned image is pinned in all worker nodes 3. Scale up 10 new worker nodes, or thereabouts
Actual results:
When all new nodes are created, the image is correctly pinned in all of them, but the status in the MCP is not fully synced until a long time later. Sometimes even 25 or 30 minutes.
oc get mcp worker -o yaml
....
  poolSynchronizersStatus:
  - availableMachineCount: 12
    machineCount: 12
    poolSynchronizerType: PinnedImageSets
    readyMachineCount: 12
    unavailableMachineCount: 0
    updatedMachineCount: 10
  readyMachineCount: 12
Expected results:
The status in the MCP should be updated earlier once all nodes have finished pinning the image.
Additional info:
Description of the problem:
Per the latest decision, RH is not going to support installing an OCP cluster on Nutanix with nested virtualization. Thus the checkbox "Install OpenShift Virtualization" on the "Operators" page should be disabled when the "Nutanix" platform is selected on the "Cluster Details" page.
Slack discussion thread
https://redhat-internal.slack.com/archives/C0211848DBN/p1706640683120159
Nutanix
https://portal.nutanix.com/page/documents/kbs/details?targetId=kA00e000000XeiHCAS
BMH is the custom resource that is used to add a new host. An Agent, on the other hand, is created automatically when a host registers. Since there is a need to control agent labels, the following agent label support was added:
In order to add an entry that controls agent label, a new BMH annotation
needs to be added.
The annotation key is prefixed with the string
'bmac.agent-install.openshift.io.agent-label.'. The remainder of the
annotation is considered the label key.
The value of the annotation is a JSON dictionary with 2 possible keys.
The key 'operation' can contain one of the values ["add","delete"], meaning the label is either added or deleted.
The dictionary key 'value' contains the label value. For example, the annotation 'bmac.agent-install.openshift.io.agent-label.my-label' with the value '{"operation": "add", "value": "true"}' would add the label my-label=true to the corresponding Agent.
Please review the following PR: https://github.com/openshift/csi-external-attacher/pull/66
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
The degradation of the storage operator occurred because it couldn't locate the node by UUID. I noticed that the providerID was present for node 0, but it was blank for the other nodes. A successful installation can be achieved on day 2 by executing step 4 after step 7 from this document: https://access.redhat.com/solutions/6677901. Additionally, if we provide credentials from the install-config, it's necessary to add the uninitialized taint to the node (oc adm taint node "$NODE" node.cloudprovider.kubernetes.io/uninitialized=true:NoSchedule) after the bootstrap completes.
Version-Release number of selected component (if applicable):
4.15
How reproducible:
100%
Steps to Reproduce:
1. Create an agent ISO image 2. Boot the created ISO on vSphere VM
Actual results:
Installation is failing due to storage operator unable to find the node by UUID.
Expected results:
Storage operator should be installed without any issue.
Additional info:
Slack discussion: https://redhat-internal.slack.com/archives/C02SPBZ4GPR/p1702893456002729
This is a clone of issue OCPBUGS-39110. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-29528. The following is the description of the original issue:
—
Description of problem:
Camel K provides a list of Kamelets that are able to act as an event source or sink for a Knative eventing message broker. Usually the list of Kamelets installed with the Camel K operator is displayed in the Developer Catalog list of available event sources with the provider "Apache Software Foundation" or "Red Hat Integration". When a user adds a custom Kamelet custom resource to the user namespace, the list of default Kamelets coming from the Camel K operator is gone. The Developer Catalog event source list then only displays the custom Kamelet but not the default ones.
Version-Release number of selected component (if applicable):
How reproducible:
Apply a custom Kamelet custom resource to the user namespace and open the list of available event sources in Dev Console Developer Catalog.
Steps to Reproduce:
1. Install the global Camel K operator in the operator namespace (e.g. openshift-operators)
2. List all available event sources in the "default" user namespace and see all Kamelets listed as event sources/sinks
3. Add a custom Kamelet custom resource to the default namespace
4. See the list of available event sources only listing the custom Kamelet; the default Kamelets are gone from that list
Actual results:
Default Kamelets that act as event source/sink are only displayed in the Developer Catalog when there is no custom Kamelet added to a namespace.
Expected results:
Default Kamelets coming with the Camel K operator (installed in the operator namespace) should always be part of the Developer Catalog list of available event sources/sinks. When the user adds more custom Kamelets these should be listed, too.
Additional info:
Reproduced with Camel K operator 2.2 and OCP 4.14.8
screenshots: https://drive.google.com/drive/folders/1mTpr1IrASMT76mWjnOGuexFr9-mP0y3i?usp=drive_link
Red Hat OpenShift Container Platform subscriptions are often measured against underlying cores. However, the metrics for cores are unreliable with some known edge cases. Namely, when virtualization is used, depending on a variety of factors, the hypervisor doesn't report the underlying cores, and instead reports a core per "cpu" where "cpu" is a schedulable executor (possibly backed by a single hyperthreaded executor). In order to address, we assume a ratio of 2-vCPU to 1 core, and divide the "cores" value by 2 to normalize when we detect that hyperthreading information was not reported, when we're on x86-64 CPU architecture, and when the cluster is not a bare-metal cluster.
At this time, x86-64 virtualized clusters are the ones affected.
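A minimal sketch of the normalization described above; the names are illustrative and this is not the actual telemetry code. When hyperthreading information is not reported, the CPU architecture is x86-64, and the cluster is not bare metal, a 2-vCPU-per-core ratio is assumed and the reported value is divided by 2.
~~~
package subscription

// normalizedCores applies the 2-vCPU-to-1-core assumption only when the conditions
// described above hold; otherwise the reported core count is trusted as-is.
func normalizedCores(reportedCores int, hyperthreadingReported bool, arch string, bareMetal bool) int {
	if !hyperthreadingReported && arch == "amd64" && !bareMetal {
		return reportedCores / 2
	}
	return reportedCores
}
~~~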
This is a clone of issue OCPBUGS-36681. The following is the description of the original issue:
—
Description of problem:
Azure HC fails to create the AzureMachineTemplate if a MachineIdentityID is not provided.
E0705 19:09:23.783858 1 controller.go:329] "Reconciler error" err="failed to parse ProviderID : invalid resource ID: id cannot be empty" controller="azuremachinetemplate" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AzureMachineTemplate" AzureMachineTemplate="clusters-hostedcp-1671-hc/hostedcp-1671-hc-f412695a" namespace="clusters-hostedcp-1671-hc" name="hostedcp-1671-hc-f412695a" reconcileID="74581db2-0ac0-4a30-abfc-38f07b8247cc"
https://github.com/openshift/hypershift/blob/84f594bd2d44e03aaac2d962b0d548d75505fed7/hypershift-operator/controllers/nodepool/azure.go#L52 does not check first to see whether a MachineIdentityID was provided before adding the UserAssignedIdentity field.
Version-Release number of selected component (if applicable):
How reproducible:
Every time
Steps to Reproduce:
1. Create an Azure HC without a MachineIdentityID
Actual results:
Azure HC fails to create AzureMachineTemplate properly, nodes aren't created, and HC is in a failed state.
Expected results:
Azure HC creates AzureMachineTemplate properly, nodes are created, and HC is in a completed state.
Additional info:
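A minimal sketch of the missing guard, using illustrative types that stand in for the real HyperShift/CAPZ structs; this is not the actual fix. UserAssignedIdentities is only populated when a MachineIdentityID was provided, so an empty ID is never parsed as a resource ID ("id cannot be empty") during AzureMachineTemplate reconciliation.
~~~
package azure

// azureSpec is a stand-in for the relevant fields of the node pool / machine template spec.
type azureSpec struct {
	MachineIdentityID      string
	UserAssignedIdentities []string
}

// applyMachineIdentity adds the identity only when one was actually provided.
func applyMachineIdentity(spec *azureSpec) {
	if spec.MachineIdentityID == "" {
		return // nothing to add; avoids the empty resource ID parse error
	}
	spec.UserAssignedIdentities = append(spec.UserAssignedIdentities, spec.MachineIdentityID)
}
~~~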
In v2 there is an issue when mirroring multiple catalogs in the same mirroring flow.
Currently the openshift-baremetal-install binary is dynamically linked to libvirt-client, meaning that it is only possible to run it on a RHEL system with libvirt installed.
A new version of the libvirt bindings, v1.8010.0, allows the library to be loaded only on demand, so that users who do not execute any libvirt code can run the rest of the installer without needing to install libvirt. (See this comment from Dan Berrangé.) In practice, the "rest of the installer" is everything except the baremetal destroy cluster command (which destroys the bootstrap storage pool - though only if the bootstrap itself has already been successfully destroyed - and has probably never been used by anybody ever). The Terraform providers all run in a separate binary.
There is also a pure-go libvirt library that can be used even within a statically-linked binary on any platform, even when interacting with libvirt. The libvirt terraform provider that does almost all of our interaction with libvirt already uses this library.
This is a clone of issue OCPBUGS-34416. The following is the description of the original issue:
—
Description of problem:
Specifying N2D machine types for compute and controlPlane machines, with "confidentialCompute: Enabled", "create cluster" got the error "Confidential Instance Config is only supported for compatible cpu platforms" [1], while the real cause is the missing setting "onHostMaintenance: Terminate". That being said, the 4.16 error is misleading; suggest being consistent with the 4.15 [2] / 4.14 [3] error messages. FYI Confidential VM is supported on N2D machine types (see https://cloud.google.com/confidential-computing/confidential-vm/docs/supported-configurations#machine-type-cpu-zone).
Version-Release number of selected component (if applicable):
4.16.0-0.nightly-2024-05-21-221942
How reproducible:
Always
Steps to Reproduce:
1. Please refer to [1]
Actual results:
The error message is like "Confidential Instance Config is only supported for compatible cpu platforms", which is misleading.
Expected results:
The 4.15 [2] / 4.14 [3] error messages, which look better.
Additional info:
FYI it is about QE test case OCP-60212 scenario b.
Description of the problem:
Some requests of this type:
{"x_request_id":"05e66411-7612-46bb-86c2-69bf7096b6da","protocol":"HTTP/1.1","authority":"zoscaru4s08w1mz.api.openshift.com","user_agent":"Go-http-client/2.0","method":"GET","response_flags":"UC","x_forwarded_for":"163.244.72.2,10.128.10.16,23.21.192.204","bytes_rx":0,"duration":13,"bytes_tx":95,"response_code":503,"timestamp":"2024-01-31T15:52:41.418Z","upstream_duration":null,"path":"/api/assisted-install/v2/infra-envs/84596f7d-0138-4f57-ada4-be72aea031a5/hosts/62148a92-8588-b591-5f7e-046bf1136b3b/instructions?timestamp=1706716359"} |
are causing 503s because the application crashes. Fortunately it is only a goroutine that crashes, so the main loop keeps running and other requests seem unaffected
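The panic below points at getTangServersFromHostIgnition dereferencing missing host data; here is a hedged sketch of the kind of nil/empty guard that would turn the crash into a handled error. The Host type and field names are simplified stand-ins, not the actual assisted-service model.

package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// Host is a minimal stand-in for the assisted-service host model (assumption).
type Host struct {
	APIVipConnectivity *string // JSON blob that may legitimately be nil/empty
}

// getTangServersFromHostIgnition sketches the nil/empty checks that would
// prevent the panic: return an error instead of dereferencing missing data.
func getTangServersFromHostIgnition(host *Host) ([]string, error) {
	if host == nil || host.APIVipConnectivity == nil || *host.APIVipConnectivity == "" {
		return nil, errors.New("host has no ignition/connectivity data yet")
	}
	var payload struct {
		TangServers []string `json:"tang_servers"` // hypothetical field name
	}
	if err := json.Unmarshal([]byte(*host.APIVipConnectivity), &payload); err != nil {
		return nil, fmt.Errorf("parsing host ignition payload: %w", err)
	}
	return payload.TangServers, nil
}

func main() {
	servers, err := getTangServersFromHostIgnition(&Host{})
	fmt.Println(servers, err) // [] host has no ignition/connectivity data yet
}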
How reproducible:
Not sure what conditions we need to meet but plenty of requests of this type can be found from prod logs
Steps to reproduce:
1.
2.
3.
Actual results:
2024/01/31 16:41:31 http: panic serving 127.0.0.1:39486: runtime error: invalid memory address or nil pointer dereference goroutine 575931 [running]: net/http.(*conn).serve.func1() /usr/local/go/src/net/http/server.go:1854 +0xbf panic({0x4369120, 0x6bee680}) /usr/local/go/src/runtime/panic.go:890 +0x263 github.com/openshift/assisted-service/internal/host/hostcommands.(*tangConnectivityCheckCmd).getTangServersFromHostIgnition(0x34?, 0x484da60?) /assisted-service/internal/host/hostcommands/tang_connectivity_check_cmd.go:41 +0x3e github.com/openshift/assisted-service/internal/host/hostcommands.(*tangConnectivityCheckCmd).GetSteps(0xc000a5c440, {0xc00057c1c8?, 0x48adb49?}, 0xc000f5d180) /assisted-service/internal/host/hostcommands/tang_connectivity_check_cmd.go:104 +0x112 github.com/openshift/assisted-service/internal/host/hostcommands.(*InstructionManager).GetNextSteps(0xc000ad0780, {0x50cf558, 0xc00444b4a0}, 0xc000f5d180) /assisted-service/internal/host/hostcommands/instruction_manager.go:178 +0xa2f github.com/openshift/assisted-service/internal/host.(*Manager).GetNextSteps(0xc0003e4990?, {0x50cf558?, 0xc00444b4a0?}, 0x0?) /assisted-service/internal/host/host.go:548 +0x48 github.com/openshift/assisted-service/internal/bminventory.(*bareMetalInventory).V2GetNextSteps(0xc000da4800, {0x50cf558, 0xc00444b4a0}, {0xc00203b500, 0xc002b9f460, {0xc001daa690, 0x24}, {0xc001daa6f0, 0x24}, 0xc0015c8158}) /assisted-service/internal/bminventory/inventory.go:5357 +0x1b8 github.com/openshift/assisted-service/restapi.HandlerAPI.func54({0xc00203b500, 0xc002b9f460, {0xc001daa690, 0x24}, {0xc001daa6f0, 0x24}, 0xc0015c8158}, {0x408d980?, 0xc000fb67e0?}) /assisted-service/restapi/configure_assisted_install.go:654 +0xf4 github.com/openshift/assisted-service/restapi/operations/installer.V2GetNextStepsHandlerFunc.Handle(0xc00203b300?, {0xc00203b500, 0xc002b9f460, {0xc001daa690, 0x24}, {0xc001daa6f0, 0x24}, 0xc0015c8158}, {0x408d980, 0xc000fb67e0}) /assisted-service/restapi/operations/installer/v2_get_next_steps.go:19 +0x7a github.com/openshift/assisted-service/restapi/operations/installer.(*V2GetNextSteps).ServeHTTP(0xc00169e468, {0x50bfe00, 0xc001399d20}, 0xc00203b500) /assisted-service/restapi/operations/installer/v2_get_next_steps.go:66 +0x2dd github.com/go-openapi/runtime/middleware.NewOperationExecutor.func1({0x50bfe00, 0xc001399d20}, 0xc00203b500) /assisted-service/vendor/github.com/go-openapi/runtime/middleware/operation.go:28 +0x59 net/http.HandlerFunc.ServeHTTP(0x0?, {0x50bfe00?, 0xc001399d20?}, 0x17334b7?) /usr/local/go/src/net/http/server.go:2122 +0x2f github.com/openshift/assisted-service/internal/metrics.Handler.func1.1() /assisted-service/internal/metrics/reporter.go:37 +0x31 github.com/slok/go-http-metrics/middleware.Middleware.Measure({{{0x50bfdd0, 0xc0014ce0e0}, {0x48922ea, 0x12}, 0x0, 0x0, 0x0}}, {0x0, 0x0}, {0x50d23c0, ...}, ...) /assisted-service/vendor/github.com/slok/go-http-metrics/middleware/middleware.go:117 +0x30e github.com/openshift/assisted-service/internal/metrics.Handler.func1({0x50cdea0?, 0xc0005ec770}, 0xc00203b500) /assisted-service/internal/metrics/reporter.go:36 +0x35f net/http.HandlerFunc.ServeHTTP(0x50cf558?, {0x50cdea0?, 0xc0005ec770?}, 0xc002b9ee80?) /usr/local/go/src/net/http/server.go:2122 +0x2f github.com/openshift/assisted-service/pkg/context.ContextHandler.func1.1({0x50cdea0, 0xc0005ec770}, 0xc00203b400) /assisted-service/pkg/context/param.go:95 +0xc8 net/http.HandlerFunc.ServeHTTP(0xc001735430?, {0x50cdea0?, 0xc0005ec770?}, 0xc0026625f8?) 
/usr/local/go/src/net/http/server.go:2122 +0x2f github.com/go-openapi/runtime/middleware.NewRouter.func1({0x50cdea0, 0xc0005ec770}, 0xc00203b200) /assisted-service/vendor/github.com/go-openapi/runtime/middleware/router.go:77 +0x257 net/http.HandlerFunc.ServeHTTP(0x7fc8dc2c3820?, {0x50cdea0?, 0xc0005ec770?}, 0xc00009f000?) /usr/local/go/src/net/http/server.go:2122 +0x2f github.com/go-openapi/runtime/middleware.Redoc.func1({0x50cdea0, 0xc0005ec770}, 0x418b480?) /assisted-service/vendor/github.com/go-openapi/runtime/middleware/redoc.go:72 +0x242 net/http.HandlerFunc.ServeHTTP(0x1?, {0x50cdea0?, 0xc0005ec770?}, 0x0?) /usr/local/go/src/net/http/server.go:2122 +0x2f github.com/go-openapi/runtime/middleware.Spec.func1({0x50cdea0, 0xc0005ec770}, 0x486ac6f?) /assisted-service/vendor/github.com/go-openapi/runtime/middleware/spec.go:46 +0x18c net/http.HandlerFunc.ServeHTTP(0xc001136380?, {0x50cdea0?, 0xc0005ec770?}, 0xc00203b200?) /usr/local/go/src/net/http/server.go:2122 +0x2f github.com/rs/cors.(*Cors).Handler.func1({0x50cdea0, 0xc0005ec770}, 0xc00203b200) /assisted-service/vendor/github.com/rs/cors/cors.go:281 +0x1c4 net/http.HandlerFunc.ServeHTTP(0x0?, {0x50cdea0?, 0xc0005ec770?}, 0x4?) /usr/local/go/src/net/http/server.go:2122 +0x2f github.com/NYTimes/gziphandler.GzipHandlerWithOpts.func1.1({0x50cd990, 0xc0012e8540}, 0xc0017358f0?) /assisted-service/vendor/github.com/NYTimes/gziphandler/gzip.go:336 +0x24e net/http.HandlerFunc.ServeHTTP(0x100c0017359e8?, {0x50cd990?, 0xc0012e8540?}, 0x10?) /usr/local/go/src/net/http/server.go:2122 +0x2f github.com/openshift/assisted-service/pkg/app.WithMetricsResponderMiddleware.func1({0x50cd990?, 0xc0012e8540?}, 0x16f0a19?) /assisted-service/pkg/app/middleware.go:32 +0xb0 net/http.HandlerFunc.ServeHTTP(0xc000a26900?, {0x50cd990?, 0xc0012e8540?}, 0xc00444a7e0?) /usr/local/go/src/net/http/server.go:2122 +0x2f github.com/openshift/assisted-service/pkg/app.WithHealthMiddleware.func1({0x50cd990?, 0xc0012e8540?}, 0x5094901?) /assisted-service/pkg/app/middleware.go:55 +0x162 net/http.HandlerFunc.ServeHTTP(0x50cf4b0?, {0x50cd990?, 0xc0012e8540?}, 0x5094980?) /usr/local/go/src/net/http/server.go:2122 +0x2f github.com/openshift/assisted-service/pkg/requestid.handler.ServeHTTP({{0x50aa9c0?, 0xc0008a4eb0?}}, {0x50cd990, 0xc0012e8540}, 0xc00203b100) /assisted-service/pkg/requestid/requestid.go:69 +0x1ad github.com/openshift/assisted-service/internal/spec.WithSpecMiddleware.func1({0x50cd990?, 0xc0012e8540?}, 0xc00203b100?) /assisted-service/internal/spec/spec.go:38 +0x9b net/http.HandlerFunc.ServeHTTP(0xc00124ec35?, {0x50cd990?, 0xc0012e8540?}, 0x170a0ce?) /usr/local/go/src/net/http/server.go:2122 +0x2f net/http.serverHandler.ServeHTTP({0xc002cb8f30?}, {0x50cd990, 0xc0012e8540}, 0xc00203b100) /usr/local/go/src/net/http/server.go:2936 +0x316 net/http.(*conn).serve(0xc00189d320, {0x50cf558, 0xc0011fc240}) /usr/local/go/src/net/http/server.go:1995 +0x612 created by net/http.(*Server).Serve /usr/local/go/src/net/http/server.go:3089 +0x5ed
Expected results:
Description of problem:
Some events have time-related information set to null (firstTimestamp, lastTimestamp, eventTime)
Version-Release number of selected component (if applicable):
cluster-logging.v5.8.0
How reproducible:
100%
Steps to Reproduce:
1. Stop one of the masters 2. Start the master 3. Wait until the ENV stabilizes 4. oc get events -A | grep unknown
Actual results:
oc get events -A | grep unknow default <unknown> Normal TerminationStart namespace/kube-system Received signal to terminate, becoming unready, but keeping serving default <unknown> Normal TerminationPreShutdownHooksFinished namespace/kube-system All pre-shutdown hooks have been finished default <unknown> Normal TerminationMinimalShutdownDurationFinished namespace/kube-system The minimal shutdown duration of 0s finished ....
Expected results:
All time related information is set correctly
Additional info:
This causes issues with external monitoring systems. Events with no timestamp either never show up or push other events out of the view, depending on the sort order of the timestamp. The operator of the environment then has trouble seeing what is happening there.
Please review the following PR: https://github.com/openshift/baremetal-operator/pull/328
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
Currently MachineConfiguration is only effective with the name 'cluster', but we can create multiple MachineConfigurations with other names, for example:
apiVersion: operator.openshift.io/v1
kind: MachineConfiguration
metadata:
  name: ndp-file-action-none
  namespace: openshift-machine-config-operator
spec:
  nodeDisruptionPolicy:
    files:
    - path: /etc/test
      actions:
      - type: None
But only 'cluster' takes effect, which confuses users.
Version-Release number of selected component (if applicable):
How reproducible:
always
Steps to Reproduce:
Create a MachineConfiguration with any name except 'cluster'
Actual results:
The new MachineConfiguration won't take effect
Expected results:
If the functionality only works with the 'cluster' object, creation of CRs with other names should be rejected
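For illustration, a minimal sketch of the name check that this expected behaviour implies (a hypothetical helper, not the actual MCO admission code):

package main

import "fmt"

// validateMachineConfigurationName rejects any MachineConfiguration whose
// name is not "cluster", so unusable objects are refused at creation time
// instead of being silently ignored.
func validateMachineConfigurationName(name string) error {
	const onlySupportedName = "cluster"
	if name != onlySupportedName {
		return fmt.Errorf("metadata.name must be %q; %q is not supported", onlySupportedName, name)
	}
	return nil
}

func main() {
	fmt.Println(validateMachineConfigurationName("ndp-file-action-none")) // error
	fmt.Println(validateMachineConfigurationName("cluster"))              // <nil>
}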
Additional info:
This is a clone of issue OCPBUGS-35906. The following is the description of the original issue:
—
Description of problem:
FeatureGate accepts an unknown value
Version-Release number of selected component (if applicable):
4.16 and 4.17
How reproducible:
Always
Steps to Reproduce:
oc patch featuregate cluster --type=json -p '[{"op": "replace", "path": "/spec/featureSet", "value": "unknownghfh"}]' featuregate.config.openshift.io/cluster patched oc get featuregate cluster -o yaml apiVersion: config.openshift.io/v1 kind: FeatureGate metadata: annotations: include.release.openshift.io/self-managed-high-availability: "true" creationTimestamp: "2024-06-21T07:20:25Z" generation: 2 name: cluster resourceVersion: "56172" uid: c900a975-78ea-4076-8e56-e5517e14b55e spec: featureSet: unknownghfh
Actual results:
featuregate.config.openshift.io/cluster patched
metadata: annotations: include.release.openshift.io/self-managed-high-availability: "true" creationTimestamp: "2024-06-21T07:20:25Z" generation: 2 name: cluster resourceVersion: "56172" uid: c900a975-78ea-4076-8e56-e5517e14b55e spec: featureSet: unknownghfh
Expected results:
Should not take invalid value and give error
oc patch featuregate cluster --type=json -p '[{"op": "replace", "path": "/spec/featureSet", "value": "unknownghfh"}]'
The FeatureGate "cluster" is invalid: spec.featureSet: Unsupported value: "unknownghfh": supported values: "", "CustomNoUpgrade", "LatencySensitive", "TechPreviewNoUpgrade"
Additional info:
https://github.com/openshift/kubernetes/commit/facd3b18622d268a4780de1ad94f7da763351425
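For reference, the expected behaviour is plain enum-style validation of spec.featureSet; a small illustrative sketch follows (the supported values come from the expected error above, the helper itself is hypothetical):

package main

import "fmt"

// supportedFeatureSets mirrors the values listed in the expected error message;
// this kind of validation is what the referenced change is expected to enforce.
var supportedFeatureSets = map[string]bool{
	"":                     true,
	"CustomNoUpgrade":      true,
	"LatencySensitive":     true,
	"TechPreviewNoUpgrade": true,
}

func validateFeatureSet(value string) error {
	if !supportedFeatureSets[value] {
		return fmt.Errorf("spec.featureSet: Unsupported value: %q", value)
	}
	return nil
}

func main() {
	fmt.Println(validateFeatureSet("unknownghfh"))          // error
	fmt.Println(validateFeatureSet("TechPreviewNoUpgrade")) // <nil>
}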
Description of problem:
No IPsec on the cluster after deleting the NS MCs while ipsecConfig mode is `Full`, on a cluster upgraded from 4.14 to a 4.15 build
Version-Release number of selected component (if applicable):
bot build on https://github.com/openshift/cluster-network-operator/pull/2191
How reproducible:
Always
Steps to Reproduce:
Steps: 1. Cluster with EW+NS IPsec (4.14), upgraded to the above bot build to check ipsecConfig modes 2. ipsecConfig mode changed to Full 3. Deleted the NS MCs 4. New MCs spawned as `80-ipsec-master-extensions` and `80-ipsec-worker-extensions` 5. The cluster settled with no IPsec at all (no ovn-ipsec-host ds) 6. Mode is still Full
Actual results:
Mode Full effectively replicated the Disabled state after the above steps
Expected results:
Only the NS IPsec should have gone away; the EW IPsec should have persisted
Additional info:
Please review the following PR: https://github.com/openshift/containernetworking-plugins/pull/150
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
The default catalog source pod never gets updates; users have to recreate it manually to get it updated. Here is a must-gather log for your debugging: https://drive.google.com/file/d/16_tFq5QuJyc_n8xkDFyK83TdTkrsVFQe/view?usp=drive_link
I went through the code and found the `updateStrategy` depends on the `ImageID`, see
// imageID returns the ImageID of the primary catalog source container or an empty string if the image ID isn't available yet.
// Note: the pod must be running and the container in a ready status to return a valid ImageID.
func imageID(pod *corev1.Pod) string {
	if len(pod.Status.ContainerStatuses) < 1 {
		logrus.WithField("CatalogSource", pod.GetName()).Warn("pod status unknown")
		return ""
	}
	return pod.Status.ContainerStatuses[0].ImageID
}
But, for those default catalog source pods, their `pod.Status.ContainerStatuses[0].ImageID` will never change since it's the `opm` image, not index image.
jiazha-mac:~ jiazha$ oc get pods redhat-operators-mpvzm -o=jsonpath={.status.containerStatuses} |jq [ { "containerID": "cri-o://115bd207312c7c8c36b63bfd251c085a701c58df2a48a1232711e15d7595675d", "image": "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:965fe452763fd402ca8d8b4a3fdb13587673c8037f215c0ffcd76b6c4c24635e", "imageID": "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:965fe452763fd402ca8d8b4a3fdb13587673c8037f215c0ffcd76b6c4c24635e", "lastState": {}, "name": "registry-server", "ready": true, "restartCount": 1, "started": true, "state": { "running": { "startedAt": "2024-03-26T04:21:41Z" } } } ]
The imageID() func should return the index image ID for those default catalog sources.
jiazha-mac:~ jiazha$ oc get pods redhat-operators-mpvzm -o=jsonpath={.status.initContainerStatuses[1]} |jq { "containerID": "cri-o://4cd6e1f45e23aadc27b8152126eb2761a37da61c4845017a06bb6f2203659f5c", "image": "registry.redhat.io/redhat/redhat-operator-index:v4.15", "imageID": "registry.redhat.io/redhat/redhat-operator-index@sha256:19010760d38e1a898867262698e22674d99687139ab47173e2b4665e588635e1", "lastState": {}, "name": "extract-content", "ready": true, "restartCount": 1, "started": false, "state": { "terminated": { "containerID": "cri-o://4cd6e1f45e23aadc27b8152126eb2761a37da61c4845017a06bb6f2203659f5c", "exitCode": 0, "finishedAt": "2024-03-26T04:21:39Z", "reason": "Completed", "startedAt": "2024-03-26T04:21:27Z" } } }
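A hedged sketch of the suggested change: prefer the extract-content init container's ImageID (which tracks the index image) and fall back to the primary container otherwise. The init container name and the surrounding marketplace wiring are assumptions based on the pod status shown above.

package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// imageIDForUpdateCheck returns the ImageID that actually tracks the index
// image: for extracted-content catalogs that is the "extract-content" init
// container, not the opm-based registry-server container.
func imageIDForUpdateCheck(pod *corev1.Pod) string {
	for _, s := range pod.Status.InitContainerStatuses {
		if s.Name == "extract-content" && s.ImageID != "" {
			return s.ImageID
		}
	}
	if len(pod.Status.ContainerStatuses) > 0 {
		return pod.Status.ContainerStatuses[0].ImageID
	}
	return ""
}

func main() {
	pod := &corev1.Pod{}
	pod.Status.InitContainerStatuses = []corev1.ContainerStatus{{
		Name:    "extract-content",
		ImageID: "registry.redhat.io/redhat/redhat-operator-index@sha256:example",
	}}
	fmt.Println(imageIDForUpdateCheck(pod))
}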
Version-Release number of selected component (if applicable):
4.15.2
How reproducible:
always
Steps to Reproduce:
1. Install OCP 4.16.0 2. Wait for the redhat-operator catalog source to update 3.
Actual results:
The redhat-operator catalog source never gets updates.
Expected results:
These default catalog sources should get updates according to the `updateStrategy`.
jiazha-mac:~ jiazha$ oc get catalogsource redhat-operators -o yaml apiVersion: operators.coreos.com/v1alpha1 kind: CatalogSource metadata: annotations: operatorframework.io/managed-by: marketplace-operator target.workload.openshift.io/management: '{"effect": "PreferredDuringScheduling"}' creationTimestamp: "2024-03-20T15:48:59Z" generation: 1 name: redhat-operators namespace: openshift-marketplace resourceVersion: "12217605" uid: cc0fc420-c9d8-4c7d-997e-f0893b4c497f spec: displayName: Red Hat Operators grpcPodConfig: extractContent: cacheDir: /tmp/cache catalogDir: /configs memoryTarget: 30Mi nodeSelector: kubernetes.io/os: linux node-role.kubernetes.io/master: "" priorityClassName: system-cluster-critical securityContextConfig: restricted tolerations: - effect: NoSchedule key: node-role.kubernetes.io/master operator: Exists - effect: NoExecute key: node.kubernetes.io/unreachable operator: Exists tolerationSeconds: 120 - effect: NoExecute key: node.kubernetes.io/not-ready operator: Exists tolerationSeconds: 120 icon: base64data: "" mediatype: "" image: registry.redhat.io/redhat/redhat-operator-index:v4.15 priority: -100 publisher: Red Hat sourceType: grpc updateStrategy: registryPoll: interval: 10m status: connectionState: address: redhat-operators.openshift-marketplace.svc:50051 lastConnect: "2024-03-27T06:35:36Z" lastObservedState: READY latestImageRegistryPoll: "2024-03-27T10:23:16Z" registryService: createdAt: "2024-03-20T15:56:03Z" port: "50051" protocol: grpc serviceName: redhat-operators serviceNamespace: openshift-marketplace
Additional info:
I also checked the currentPodsWithCorrectImageAndSpec, but no hash changed due to the pod.spec are the same always.
time="2024-03-26T03:22:01Z" level=info msg="of 1 pods matching label selector, 1 have the correct images and matching hash" correctHash=true correctImages=true current-pod.name=redhat-operators-mpvzm current-pod.namespace=openshift-marketplace time="2024-03-26T03:27:01Z" level=info msg="of 1 pods matching label selector, 1 have the correct images and matching hash" catalogsource.name=redhat-operators catalogsource.namespace=openshift-marketplace correctHash=true correctImages=true current-pod.name=redhat-operators-mpvzm current-pod.namespace=openshift-marketplace id=xW0cW time="2024-03-26T03:27:01Z" level=info msg="of 1 pods matching label selector, 1 have the correct images and matching hash" catalogsource.name=redhat-operators catalogsource.namespace=openshift-marketplace correctHash=true correctImages=true current-pod.name=redhat-operators-mpvzm current-pod.namespace=openshift-marketplace id=xW0cW time="2024-03-26T03:27:02Z" level=info msg="of 1 pods matching label selector, 1 have the correct images and matching hash" catalogsource.name=redhat-operators catalogsource.namespace=openshift-marketplace correctHash=true correctImages=true current-pod.name=redhat-operators-mpvzm current-pod.namespace=openshift-marketplace id=vq5VA time="2024-03-26T03:27:03Z" level=info msg="of 1 pods matching label selector, 1 have the correct images and matching hash" catalogsource.name=redhat-operators catalogsource.namespace=openshift-marketplace correctHash=true correctImages=true current-pod.name=redhat-operators-mpvzm current-pod.namespace=openshift-marketplace id=vq5VA
Description of problem:
The installer requires `s3:HeadBucket` even though no such permission exists. The correct permission for the `HeadBucket` action is `s3:ListBucket`: https://docs.aws.amazon.com/AmazonS3/latest/API/API_HeadBucket.html
Version-Release number of selected component (if applicable):
4.16
How reproducible:
always
Steps to Reproduce:
1. Install a cluster using a role with limited permissions 2. 3.
Actual results:
level=warning msg=Action not allowed with tested creds action=iam:DeleteUserPolicy level=warning msg=Tested creds not able to perform all requested actions level=warning msg=Action not allowed with tested creds action=s3:HeadBucket level=warning msg=Tested creds not able to perform all requested actions level=fatal msg=failed to fetch Cluster: failed to fetch dependency of "Cluster": failed to generate asset "Platform Permissions Check": validate AWS credentials: AWS credentials cannot be used to either create new creds or use as-is Installer exit with code 1
Expected results:
Installer should check only for s3:ListBucket
Additional info:
Some templating code can be simplified now.
Description of problem:
This is only applicable to systems that install a performance profile. There seems to be a race condition where not all systemd-spawned processes are being moved to /sys/fs/cgroup/cpuset/system.slice. This is supposed to be done by the one-shot cpuset-configure.service. Here is a list of processes I see on one lab system that are still in the root directory: /usr/bin/dbus-broker-launch --scope system --audit dbus-broker --log 4 --controller 9 --machine-id 071fd738af0146859d2c04b7fea6d276 --max-bytes 536870912 --max-fds 4096 --max-matches 131072 --audit /usr/sbin/NetworkManager --no-daemon /usr/sbin/dnsmasq -k /sbin/agetty -o -p -- \u --noclear - linux sshd: core@pts/0
Version-Release number of selected component (if applicable):
4.14, 4.15
How reproducible:
Steps to Reproduce:
1. Reboot a SNO with a performance profile applied 2. 3.
Actual results:
Expected results:
Additional info:
This is a clone of issue OCPBUGS-34828. The following is the description of the original issue:
—
Multiple PRs are failing with this error...
Deploy git workload with devfile from topology page: A-04-TC01: Create the different workloads from Add page (18s) CypressError: `cy.focus()` can only be called on a single element. Your subject contained 14 elements. https://on.cypress.io/focus
time="2024-01-04T05:30:45-05:00" level=fatal msg="failed to fetch Terraform Variables: failed to fetch dependency of \"Terraform Variables\": failed to generate asset \"Platform Provisioning Check\": platform.vsphere: Internal error: vCenter is failing to retrieve config product version information for the ESXi host: "
The story is to track the i18n upload/download routine tasks which are performed every sprint.
A.C.
- Upload strings to Memsource at the start of the sprint and reach out to the localization team
- Download translated strings from Memsource when it is ready
- Review the translated strings and open a pull request
- Open a followup story for next sprint
Please review the following PR: https://github.com/openshift/cluster-network-operator/pull/2308
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
In a recently installed cluster running 4.13.29, after configuring the cluster-wide proxy, the "vsphere-problem-detector" is not picking up the proxy configuration. As the pod cannot reach vSphere, it fails to run checks: 2024-02-01T09:28:00.150332407Z E0201 09:28:00.150292 1 operator.go:199] failed to run checks: failed to connect to vsphere.local: Post "https://vsphere.local/sdk": dial tcp 172.16.1.3:443: i/o timeout The pod does not get the expected cluster proxy settings: - name: HTTPS_PROXY value: http://proxy.local:3128 - name: HTTP_PROXY value: http://proxy.local:3128 Other storage-related pods do get the configuration shown above. This causes the vsphere-problem-detector to fail its connections to vSphere, and hence its health checks.
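For context, Go HTTP clients normally pick up a cluster-wide proxy only through the HTTPS_PROXY/HTTP_PROXY environment variables; the standard-library sketch below shows that resolution, which is why a pod that never receives those variables dials vSphere directly. Values are illustrative.

package main

import (
	"fmt"
	"net/http"
	"net/url"
	"os"
)

func main() {
	// These would normally be injected into the pod spec by the operator.
	os.Setenv("HTTPS_PROXY", "http://proxy.local:3128")

	target, _ := url.Parse("https://vsphere.local/sdk")
	req := &http.Request{URL: target}

	// http.ProxyFromEnvironment is what http.DefaultTransport uses; without
	// the env vars it returns nil and the client dials vSphere directly.
	proxyURL, err := http.ProxyFromEnvironment(req)
	fmt.Println(proxyURL, err) // http://proxy.local:3128 <nil>
}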
Version-Release number of selected component (if applicable):
4.13.29
How reproducible:
Always
Steps to Reproduce:
1.Configure cluster-wide proxy in the environment. 2. Wait for the change 3. Check the pod configuration
Actual results:
vSphere health checks failing
Expected results:
vSphere health checks working through the cluster proxy
Additional info:
Description of problem:
We need to backport https://github.com/cri-o/cri-o/pull/7744 into 1.28 of crio. CI is failing on upgrades due to a feature not in 1.28.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
This is a clone of issue OCPBUGS-34900. The following is the description of the original issue:
—
Description of problem:
The following jobs have been failing at the bootstrap stage with the error message "level=error msg=Bootstrap failed to complete: timed out waiting for the condition".
https://prow.ci.openshift.org/job-history/origin-ci-test/logs/periodic-ci-shiftstack-shiftstack-ci-main-periodic-4.17-e2e-openstack-csi-manila
https://prow.ci.openshift.org/job-history/origin-ci-test/logs/periodic-ci-shiftstack-shiftstack-ci-main-periodic-4.17-e2e-openstack-csi-cinder
https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-shiftstack-shiftstack-ci-main-periodic-4.17-e2e-openstack-nfv-mellanox/1797334785262096384
https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-shiftstack-shiftstack-ci-main-periodic-4.17-e2e-openstack-proxy/1797330506849718272
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Please review the following PR: https://github.com/openshift/cluster-control-plane-machine-set-operator/pull/270
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/csi-operator/pull/115
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/cluster-ingress-operator/pull/1033
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/cluster-kube-apiserver-operator/pull/1597
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Seeing this in hypershift e2e. I think it is racing with the Infrastructure status being populated and PlatformStatus being nil.
I0501 00:13:11.951062 1 azurepathfixcontroller.go:324] Started AzurePathFixController I0501 00:13:11.951056 1 base_controller.go:73] Caches are synced for LoggingSyncer I0501 00:13:11.951072 1 imageregistrycertificates.go:214] Started ImageRegistryCertificatesController I0501 00:13:11.951077 1 base_controller.go:110] Starting #1 worker of LoggingSyncer controller ... E0501 00:13:11.951369 1 runtime.go:79] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference) goroutine 534 [running]: k8s.io/apimachinery/pkg/util/runtime.logPanic({0x2d6bd00?, 0x57a60e0}) /go/src/github.com/openshift/cluster-image-registry-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:75 +0x85 k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0x3bcb370?}) /go/src/github.com/openshift/cluster-image-registry-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:49 +0x6b panic({0x2d6bd00?, 0x57a60e0?}) /usr/lib/golang/src/runtime/panic.go:914 +0x21f github.com/openshift/cluster-image-registry-operator/pkg/operator.(*AzurePathFixController).sync(0xc000003d40) /go/src/github.com/openshift/cluster-image-registry-operator/pkg/operator/azurepathfixcontroller.go:171 +0x97 github.com/openshift/cluster-image-registry-operator/pkg/operator.(*AzurePathFixController).processNextWorkItem(0xc000003d40) /go/src/github.com/openshift/cluster-image-registry-operator/pkg/operator/azurepathfixcontroller.go:154 +0x292 github.com/openshift/cluster-image-registry-operator/pkg/operator.(*AzurePathFixController).runWorker(...) /go/src/github.com/openshift/cluster-image-registry-operator/pkg/operator/azurepathfixcontroller.go:133 k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0x30?) /go/src/github.com/openshift/cluster-image-registry-operator/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:226 +0x33 k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc001186820?, {0x3bd1320, 0xc000cace40}, 0x1, 0xc000ca2540) /go/src/github.com/openshift/cluster-image-registry-operator/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:227 +0xaf k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc0011bac00?, 0x3b9aca00, 0x0, 0xd0?, 0x447f9c?) /go/src/github.com/openshift/cluster-image-registry-operator/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:204 +0x7f k8s.io/apimachinery/pkg/util/wait.Until(0x0?, 0xc001385f68?, 0xc001385f78?) /go/src/github.com/openshift/cluster-image-registry-operator/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:161 +0x1e created by github.com/openshift/cluster-image-registry-operator/pkg/operator.(*AzurePathFixController).Run in goroutine 248 /go/src/github.com/openshift/cluster-image-registry-operator/pkg/operator/azurepathfixcontroller.go:322 +0x1a6 panic: runtime error: invalid memory address or nil pointer dereference [recovered] panic: runtime error: invalid memory address or nil pointer dereference [signal SIGSEGV: segmentation violation code=0x1 addr=0x8 pc=0x2966e97]
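The stack points at AzurePathFixController.sync dereferencing the Infrastructure status before it is populated; a guard of roughly this shape (a hedged sketch using the openshift/api types, not the operator's actual code) would let the controller return an error and retry instead of panicking.

package main

import (
	"errors"
	"fmt"

	configv1 "github.com/openshift/api/config/v1"
)

// azurePlatformStatus returns the Azure platform status only when the
// Infrastructure status has actually been populated, so callers can requeue
// instead of dereferencing a nil PlatformStatus.
func azurePlatformStatus(infra *configv1.Infrastructure) (*configv1.AzurePlatformStatus, error) {
	if infra == nil || infra.Status.PlatformStatus == nil || infra.Status.PlatformStatus.Azure == nil {
		return nil, errors.New("infrastructure platform status not populated yet; retrying")
	}
	return infra.Status.PlatformStatus.Azure, nil
}

func main() {
	_, err := azurePlatformStatus(&configv1.Infrastructure{})
	fmt.Println(err)
}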
Please review the following PR: https://github.com/openshift/csi-external-resizer/pull/155
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
Unable to view the alerts and metrics pages; a blank page is shown.
Version-Release number of selected component (if applicable):
4.15.0-nightly
How reproducible:
Always
Steps to Reproduce:
Click on any alert under "Notification Panel" to view more, and you will be redirected to the alert page.
Actual results:
The user is unable to view any alerts or metrics.
Expected results:
The user should be able to view all and individual alerts and metrics.
Additional info:
N.A
Description of problem:
In the tested HCP external OIDC env, when issuerCertificateAuthority is set, console pods are stuck in ContainerCreating status. The reason is that the CA configmap is not propagated to the openshift-console namespace by the console operator.
Version-Release number of selected component (if applicable):
Latest 4.16 and 4.15 nightly payloads
How reproducible:
Always
Steps to Reproduce:
1. Configure HCP external OIDC env with issuerCertificateAuthority set. 2. Check oc get pods -A
Actual results:
2. Before OCPBUGS-31319 is fixed, console pods are in CrashLoopBackOff status. After OCPBUGS-31319 is fixed, or after manually copying the CA configmap to the openshift-config namespace as a workaround, console pods are stuck in ContainerCreating status until the CA configmap is manually copied to the openshift-console namespace too. Console login is affected.
Expected results:
2. The console operator should be responsible for copying the CA to the openshift-console namespace, and console login should succeed.
Additional info:
In https://redhat-internal.slack.com/archives/C060D1W96LB/p1711548626625499 , Seth from the HyperShift dev side requested creating this separate console bug to unblock the PR merge and backport for OCPBUGS-31319, so creating it here.
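A minimal client-go sketch of the expected behaviour, i.e. syncing the issuer CA ConfigMap into the openshift-console namespace; names and the surrounding controller wiring are assumptions.

package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// syncCAConfigMap copies the OIDC issuer CA ConfigMap from its source
// namespace into openshift-console so the console pods can mount it.
func syncCAConfigMap(ctx context.Context, client kubernetes.Interface, name, srcNS, dstNS string) error {
	src, err := client.CoreV1().ConfigMaps(srcNS).Get(ctx, name, metav1.GetOptions{})
	if err != nil {
		return fmt.Errorf("reading source ConfigMap: %w", err)
	}
	dst := &corev1.ConfigMap{
		ObjectMeta: metav1.ObjectMeta{Name: name, Namespace: dstNS},
		Data:       src.Data,
	}
	_, err = client.CoreV1().ConfigMaps(dstNS).Create(ctx, dst, metav1.CreateOptions{})
	if apierrors.IsAlreadyExists(err) {
		_, err = client.CoreV1().ConfigMaps(dstNS).Update(ctx, dst, metav1.UpdateOptions{})
	}
	return err
}

func main() {
	// Wiring up a real clientset (in-cluster or kubeconfig) is omitted here.
	_ = syncCAConfigMap
}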
Please review the following PR: https://github.com/openshift/cluster-api-provider-aws/pull/488
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
In PowerVS, when I try to deploy a 4.16 cluster, I see the following:
Description of problem:
[inner hamzy@li-3d08e84c-2e1c-11b2-a85c-e2db7bb078fc hamzy-release]$ oc get pods -n openshift-cloud-controller-manager NAME READY STATUS RESTARTS AGE powervs-cloud-controller-manager-6b6fbcc9db-9rhtj 0/1 CrashLoopBackOff 4 (10s ago) 2m47s powervs-cloud-controller-manager-6b6fbcc9db-wnvck 0/1 CrashLoopBackOff 3 (49s ago) 2m46s [inner hamzy@li-3d08e84c-2e1c-11b2-a85c-e2db7bb078fc hamzy-release]$ oc logs pod/powervs-cloud-controller-manager-6b6fbcc9db-9rhtj -n openshift-cloud-controller-manager Error from server: no preferred addresses found; known addresses: [] [inner hamzy@li-3d08e84c-2e1c-11b2-a85c-e2db7bb078fc hamzy-release]$ oc logs pod/powervs-cloud-controller-manager-6b6fbcc9db-wnvck -n openshift-cloud-controller-manager Error from server: no preferred addresses found; known addresses: []
Version-Release number of selected component (if applicable):
4.16.0-0.nightly-ppc64le-2024-01-07-111144
How reproducible:
Always
Steps to Reproduce:
1. Deploy OpenShift cluster
On the master-0 node, I see:
[core@rdr-hamzy-test-wdc06-fs5m2-master-0 ~]$ sudo crictl ps -a CONTAINER IMAGE CREATED STATE NAME ATTEMPT POD ID POD a048556553827 ec3035a371e09312254a277d5eb9affba2930adbd4018f7557899a2f3d76bc88 18 seconds ago Exited kube-rbac-proxy 7 0381a589d57cd cluster-cloud-controller-manager-operator-94dd5b468-kxqw5 a326f7ec83ddb 60f5c9455518c79a9797cfbeab0b3530dae1bf77554eccc382ff12d99053efd1 11 minutes ago Running config-sync-controllers 0 0381a589d57cd cluster-cloud-controller-manager-operator-94dd5b468-kxqw5 ddaa6999b5b86 quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:60eff87ed56ee4761fd55caa4712e6bea47dccaa11c59ba53a6d5697eacc7d32 11 minutes ago Running cluster-cloud-controller-manager 0 0381a589d57cd cluster-cloud-controller-manager-operator-94dd5b468-kxqw5
The failing pod has this as its log:
[core@rdr-hamzy-test-wdc06-fs5m2-master-0 ~]$ sudo crictl logs a048556553827 Flag --logtostderr has been deprecated, will be removed in a future release, see https://github.com/kubernetes/enhancements/tree/master/keps/sig-instrumentation/2845-deprecate-klog-specific-flags-in-k8s-components I0108 18:09:12.320332 1 flags.go:64] FLAG: --add-dir-header="false" I0108 18:09:12.320401 1 flags.go:64] FLAG: --allow-paths="[]" I0108 18:09:12.320413 1 flags.go:64] FLAG: --alsologtostderr="false" I0108 18:09:12.320420 1 flags.go:64] FLAG: --auth-header-fields-enabled="false" I0108 18:09:12.320427 1 flags.go:64] FLAG: --auth-header-groups-field-name="x-remote-groups" I0108 18:09:12.320435 1 flags.go:64] FLAG: --auth-header-groups-field-separator="|" I0108 18:09:12.320441 1 flags.go:64] FLAG: --auth-header-user-field-name="x-remote-user" I0108 18:09:12.320447 1 flags.go:64] FLAG: --auth-token-audiences="[]" I0108 18:09:12.320454 1 flags.go:64] FLAG: --client-ca-file="" I0108 18:09:12.320460 1 flags.go:64] FLAG: --config-file="/etc/kube-rbac-proxy/config-file.yaml" I0108 18:09:12.320467 1 flags.go:64] FLAG: --help="false" I0108 18:09:12.320473 1 flags.go:64] FLAG: --http2-disable="false" I0108 18:09:12.320479 1 flags.go:64] FLAG: --http2-max-concurrent-streams="100" I0108 18:09:12.320486 1 flags.go:64] FLAG: --http2-max-size="262144" I0108 18:09:12.320492 1 flags.go:64] FLAG: --ignore-paths="[]" I0108 18:09:12.320500 1 flags.go:64] FLAG: --insecure-listen-address="" I0108 18:09:12.320506 1 flags.go:64] FLAG: --kubeconfig="" I0108 18:09:12.320512 1 flags.go:64] FLAG: --log-backtrace-at=":0" I0108 18:09:12.320520 1 flags.go:64] FLAG: --log-dir="" I0108 18:09:12.320526 1 flags.go:64] FLAG: --log-file="" I0108 18:09:12.320531 1 flags.go:64] FLAG: --log-file-max-size="1800" I0108 18:09:12.320537 1 flags.go:64] FLAG: --log-flush-frequency="5s" I0108 18:09:12.320543 1 flags.go:64] FLAG: --logtostderr="true" I0108 18:09:12.320550 1 flags.go:64] FLAG: --oidc-ca-file="" I0108 18:09:12.320556 1 flags.go:64] FLAG: --oidc-clientID="" I0108 18:09:12.320564 1 flags.go:64] FLAG: --oidc-groups-claim="groups" I0108 18:09:12.320570 1 flags.go:64] FLAG: --oidc-groups-prefix="" I0108 18:09:12.320576 1 flags.go:64] FLAG: --oidc-issuer="" I0108 18:09:12.320581 1 flags.go:64] FLAG: --oidc-sign-alg="[RS256]" I0108 18:09:12.320590 1 flags.go:64] FLAG: --oidc-username-claim="email" I0108 18:09:12.320595 1 flags.go:64] FLAG: --one-output="false" I0108 18:09:12.320601 1 flags.go:64] FLAG: --proxy-endpoints-port="0" I0108 18:09:12.320608 1 flags.go:64] FLAG: --secure-listen-address="0.0.0.0:9258" I0108 18:09:12.320614 1 flags.go:64] FLAG: --skip-headers="false" I0108 18:09:12.320620 1 flags.go:64] FLAG: --skip-log-headers="false" I0108 18:09:12.320626 1 flags.go:64] FLAG: --stderrthreshold="2" I0108 18:09:12.320631 1 flags.go:64] FLAG: --tls-cert-file="/etc/tls/private/tls.crt" I0108 18:09:12.320637 1 flags.go:64] FLAG: --tls-cipher-suites="[TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305]" I0108 18:09:12.320654 1 flags.go:64] FLAG: --tls-min-version="VersionTLS12" I0108 18:09:12.320661 1 flags.go:64] FLAG: --tls-private-key-file="/etc/tls/private/tls.key" I0108 18:09:12.320667 1 flags.go:64] FLAG: --tls-reload-interval="1m0s" I0108 18:09:12.320674 1 flags.go:64] FLAG: --upstream="http://127.0.0.1:9257/" I0108 18:09:12.320681 1 flags.go:64] FLAG: 
--upstream-ca-file="" I0108 18:09:12.320686 1 flags.go:64] FLAG: --upstream-client-cert-file="" I0108 18:09:12.320692 1 flags.go:64] FLAG: --upstream-client-key-file="" I0108 18:09:12.320697 1 flags.go:64] FLAG: --upstream-force-h2c="false" I0108 18:09:12.320703 1 flags.go:64] FLAG: --v="3" I0108 18:09:12.320709 1 flags.go:64] FLAG: --version="false" I0108 18:09:12.320719 1 flags.go:64] FLAG: --vmodule="" I0108 18:09:12.320735 1 kube-rbac-proxy.go:578] Reading config file: /etc/kube-rbac-proxy/config-file.yaml I0108 18:09:12.321427 1 kube-rbac-proxy.go:285] Valid token audiences: I0108 18:09:12.321473 1 kube-rbac-proxy.go:399] Reading certificate files E0108 18:09:12.321519 1 run.go:74] "command failed" err="failed to initialize certificate reloader: error loading certificates: error loading certificate: open /etc/tls/private/tls.crt: no such file or directory"
When I describe the pod, I see:
[inner hamzy@li-3d08e84c-2e1c-11b2-a85c-e2db7bb078fc hamzy-release]$ oc describe pod/powervs-cloud-controller-manager-6b6fbcc9db-9rhtj -n openshift-cloud-controller-manager Name: powervs-cloud-controller-manager-6b6fbcc9db-9rhtj Namespace: openshift-cloud-controller-manager Priority: 2000000000 Priority Class Name: system-cluster-critical Service Account: cloud-controller-manager Node: rdr-hamzy-test-wdc06-fs5m2-master-2/ Start Time: Mon, 08 Jan 2024 11:57:45 -0600 Labels: infrastructure.openshift.io/cloud-controller-manager=PowerVS k8s-app=powervs-cloud-controller-manager pod-template-hash=6b6fbcc9db Annotations: operator.openshift.io/config-hash: 09205e81b4dc20086c29ddbdd3fccc29a675be94b2779756a0e748dd9ba91e40 Status: Running IP: IPs: <none> Controlled By: ReplicaSet/powervs-cloud-controller-manager-6b6fbcc9db Containers: cloud-controller-manager: Container ID: cri-o://4365a326d05ecaac8e4114efabb4a46e01a308459ad30438d742b4829c24a717 Image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:3dd2cf78ddeed971d38731d27ce293501547b960cefc3aadaa220186eded8a09 Image ID: 65401afa73528f9a425a9d7f5dee8a9de8d9d3d82c8fd84cd653b16409093836 Port: 10258/TCP Host Port: 10258/TCP Command: /bin/bash -c #!/bin/bash set -o allexport if [[ -f /etc/kubernetes/apiserver-url.env ]]; then source /etc/kubernetes/apiserver-url.env fi exec /bin/ibm-cloud-controller-manager \ --bind-address=$(POD_IP_ADDRESS) \ --use-service-account-credentials=true \ --configure-cloud-routes=false \ --cloud-provider=ibm \ --cloud-config=/etc/ibm/cloud.conf \ --profiling=false \ --leader-elect=true \ --leader-elect-lease-duration=137s \ --leader-elect-renew-deadline=107s \ --leader-elect-retry-period=26s \ --leader-elect-resource-namespace=openshift-cloud-controller-manager \ --tls-cipher-suites=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305_SHA256,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305_SHA256,TLS_AES_128_GCM_SHA256,TLS_CHACHA20_POLY1305_SHA256,TLS_AES_256_GCM_SHA384 \ --v=2 State: Waiting Reason: CrashLoopBackOff Last State: Terminated Reason: Error Exit Code: 1 Started: Mon, 08 Jan 2024 12:35:12 -0600 Finished: Mon, 08 Jan 2024 12:35:12 -0600 Ready: False Restart Count: 12 Requests: cpu: 75m memory: 60Mi Liveness: http-get https://:10258/healthz delay=300s timeout=160s period=10s #success=1 #failure=3 Environment: POD_IP_ADDRESS: (v1:status.podIP) VPCCTL_CLOUD_CONFIG: /etc/ibm/cloud.conf ENABLE_VPC_PUBLIC_ENDPOINT: true Mounts: /etc/ibm from cloud-conf (rw) /etc/kubernetes from host-etc-kube (ro) /etc/pki/ca-trust/extracted/pem from trusted-ca (ro) /etc/vpc from ibm-cloud-credentials (rw) /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-z5xdm (ro) Conditions: Type Status PodReadyToStartContainers True Initialized True Ready False ContainersReady False PodScheduled True Volumes: trusted-ca: Type: ConfigMap (a volume populated by a ConfigMap) Name: ccm-trusted-ca Optional: false host-etc-kube: Type: HostPath (bare host directory volume) Path: /etc/kubernetes HostPathType: Directory cloud-conf: Type: ConfigMap (a volume populated by a ConfigMap) Name: cloud-conf Optional: false ibm-cloud-credentials: Type: Secret (a volume populated by a Secret) SecretName: ibm-cloud-credentials Optional: false kube-api-access-z5xdm: Type: Projected (a volume that contains injected data from multiple sources) TokenExpirationSeconds: 3607 ConfigMapName: kube-root-ca.crt ConfigMapOptional: 
<nil> DownwardAPI: true ConfigMapName: openshift-service-ca.crt ConfigMapOptional: <nil> QoS Class: Burstable Node-Selectors: node-role.kubernetes.io/master= Tolerations: node-role.kubernetes.io/master:NoSchedule op=Exists node.cloudprovider.kubernetes.io/uninitialized:NoSchedule op=Exists node.kubernetes.io/memory-pressure:NoSchedule op=Exists node.kubernetes.io/not-ready:NoExecute op=Exists for 120s node.kubernetes.io/not-ready:NoSchedule op=Exists node.kubernetes.io/unreachable:NoExecute op=Exists for 120s Events: Type Reason Age From Message ---- ------ ---- ---- ------- Normal Scheduled 38m default-scheduler Successfully assigned openshift-cloud-controller-manager/powervs-cloud-controller-manager-6b6fbcc9db-9rhtj to rdr-hamzy-test-wdc06-fs5m2-master-2 Normal Pulling 38m kubelet Pulling image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:3dd2cf78ddeed971d38731d27ce293501547b960cefc3aadaa220186eded8a09" Normal Pulled 37m kubelet Successfully pulled image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:3dd2cf78ddeed971d38731d27ce293501547b960cefc3aadaa220186eded8a09" in 36.694s (36.694s including waiting) Normal Started 36m (x4 over 37m) kubelet Started container cloud-controller-manager Normal Created 35m (x5 over 37m) kubelet Created container cloud-controller-manager Normal Pulled 35m (x4 over 37m) kubelet Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:3dd2cf78ddeed971d38731d27ce293501547b960cefc3aadaa220186eded8a09" already present on machine Warning BackOff 2m57s (x166 over 37m) kubelet Back-off restarting failed container cloud-controller-manager in pod powervs-cloud-controller-manager-6b6fbcc9db-9rhtj_openshift-cloud-controller-manager(bf58b824-b1a2-4d2e-8735-22723642a24a)
This is a clone of issue OCPBUGS-35440. The following is the description of the original issue:
—
Description of problem:
Because of a bug in upstream CAPA, the Load Balancer ingress rules are continuously revoked and then authorized, causing unnecessary AWS API calls and cluster provision delays.
Version-Release number of selected component (if applicable):
4.16+
How reproducible:
always
Steps to Reproduce:
1. 2. 3.
Actual results:
A constant loop of revoke-authorize of ingress rules.
Expected results:
Rules should be revoked only when needed (for example, when the installer removes the allow-all ssh rule). In the other cases, rules should be authorized only once.
Additional info:
Upstream issue created: https://github.com/kubernetes-sigs/cluster-api-provider-aws/issues/5023 PR submitted upstream: https://github.com/kubernetes-sigs/cluster-api-provider-aws/pull/5024
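The expected behaviour amounts to diffing desired versus current ingress rules and only issuing authorize/revoke calls for the difference; a schematic sketch of that idea (not the actual CAPA fix):

package main

import "fmt"

// Rule is a simplified stand-in for a security-group ingress rule.
type Rule struct {
	Proto    string
	FromPort int
	ToPort   int
	CIDR     string
}

// diffRules returns the rules to authorize (desired but absent) and to revoke
// (present but no longer desired). Reconciling from this diff means no AWS
// calls are made when the rule sets already match.
func diffRules(current, desired []Rule) (toAuthorize, toRevoke []Rule) {
	in := func(r Rule, set []Rule) bool {
		for _, s := range set {
			if s == r {
				return true
			}
		}
		return false
	}
	for _, d := range desired {
		if !in(d, current) {
			toAuthorize = append(toAuthorize, d)
		}
	}
	for _, c := range current {
		if !in(c, desired) {
			toRevoke = append(toRevoke, c)
		}
	}
	return toAuthorize, toRevoke
}

func main() {
	rules := []Rule{{"tcp", 6443, 6443, "0.0.0.0/0"}}
	add, del := diffRules(rules, rules)
	fmt.Println(len(add), len(del)) // 0 0: nothing to do, no API churn
}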
Please review the following PR: https://github.com/openshift/csi-driver-shared-resource/pull/160
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
$ oc adm upgrade info: An upgrade is in progress. Working towards 4.15.0-rc.4: 701 of 873 done (80% complete), waiting on operator-lifecycle-manager Upstream: https://api.openshift.com/api/upgrades_info/v1/graph Channel: candidate-4.15 (available channels: candidate-4.15, candidate-4.16) No updates available. You may still upgrade to a specific release image with --to-image or wait for new updates to be available. $ oc get pods -n openshift-operator-lifecycle-manager NAME READY STATUS RESTARTS AGE catalog-operator-db86b7466-gdp4g 1/1 Running 0 9h collect-profiles-28443465-9zzbk 0/1 Completed 0 34m collect-profiles-28443480-kkgtk 0/1 Completed 0 19m collect-profiles-28443495-shvs7 0/1 Completed 0 4m10s olm-operator-56cb759d88-q2gr7 0/1 CrashLoopBackOff 8 (3m27s ago) 20m package-server-manager-7cf46947f6-sgnlk 2/2 Running 0 9h packageserver-7b795b79f-thxfw 1/1 Running 1 14d packageserver-7b795b79f-w49jj 1/1 Running 0 4d17h
Version-Release number of selected component (if applicable):
How reproducible:
Unknown
Steps to Reproduce:
Upgrade from 4.15.0-rc.2 to 4.15.0-rc.4
Actual results:
The upgrade is unable to proceed
Expected results:
The upgrade can proceed
Additional info:
Description of problem:
When using the oc cli to query information about release images it is not possible to use the --certificate-authority option to specify an alternative CA bundle for verifying connections to the target registry.
Version-Release number of selected component (if applicable): 4.14.5
How reproducible: 100%
Steps to Reproduce:
1. oc adm release info --registry-config ./auth.json --certificate-authority ./tls-ca-bundle.pem quay.io/openshift-release-dev/ocp-release:4.14.9-x86_64
Actual results:
error: unable to read image quay.io/openshift-release-dev/ocp-release:4.14.9-x86_64: Get "https://quay.io/v2/": tls: failed to verify certificate: x509: certificate signed by unknown authority
Expected results:
Something beginning with: Name: 4.14.9 Digest: sha256:f5eaf0248779a0478cfd83f055d56dc7d755937800a68ad55f6047c503977c44 Created: 2024-01-12T06:48:42Z OS/Arch: linux/amd64 Manifests: 680 Metadata files: 1 Pull From: quay.io/openshift-release-dev/ocp-release@sha256:f5eaf0248779a0478cfd83f055d56dc7d755937800a68ad55f6047c503977c44 Release Metadata:
Additional info:
To fully verify that this was an issue I went through the following steps which should show that the oc command is not using the CA bundle in the provided file and that the command would have worked if oc was using the provided bundle // show the command works with the system CA bundle # oc adm release info --registry-config ./auth.json quay.io/openshift-release-dev/ocp-release:4.14.9-x86_64 | head Name: 4.14.9 Digest: sha256:f5eaf0248779a0478cfd83f055d56dc7d755937800a68ad55f6047c503977c44 Created: 2024-01-12T06:48:42Z OS/Arch: linux/amd64 Manifests: 680 Metadata files: 1 Pull From: quay.io/openshift-release-dev/ocp-release@sha256:f5eaf0248779a0478cfd83f055d56dc7d755937800a68ad55f6047c503977c44 Release Metadata: // move the system CA bundle to the local directory # mv /etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem . // show the same command now fails without that bundle file # oc adm release info --registry-config ./auth.json quay.io/openshift-release-dev/ocp-release:4.14.9-x86_64 | head error: unable to read image quay.io/openshift-release-dev/ocp-release:4.14.9-x86_64: Get "https://quay.io/v2/": tls: failed to verify certificate: x509: certificate signed by unknown authority // show using that same bundle file with --certificate-authority doesn't work # oc adm release info --registry-config ./auth.json --certificate-authority ./tls-ca-bundle.pem quay.io/openshift-release-dev/ocp-release:4.14.9-x86_64 | head error: unable to read image quay.io/openshift-release-dev/ocp-release:4.14.9-x86_64: Get "https://quay.io/v2/": tls: failed to verify certificate: x509: certificate signed by unknown authority Additionally this also seems to be a problem for at least the following commands as well: oc image info oc adm release extract
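Honoring --certificate-authority essentially means building the registry HTTP client with a root CA pool loaded from the given file; a standard-library sketch of that idea (not the oc implementation):

package main

import (
	"crypto/tls"
	"crypto/x509"
	"fmt"
	"net/http"
	"os"
)

// clientWithCABundle returns an HTTP client whose TLS verification uses only
// the CAs found in bundlePath, which is what --certificate-authority implies.
func clientWithCABundle(bundlePath string) (*http.Client, error) {
	pem, err := os.ReadFile(bundlePath)
	if err != nil {
		return nil, err
	}
	pool := x509.NewCertPool()
	if !pool.AppendCertsFromPEM(pem) {
		return nil, fmt.Errorf("no certificates found in %s", bundlePath)
	}
	return &http.Client{
		Transport: &http.Transport{
			TLSClientConfig: &tls.Config{RootCAs: pool},
		},
	}, nil
}

func main() {
	client, err := clientWithCABundle("./tls-ca-bundle.pem")
	if err != nil {
		fmt.Println(err)
		return
	}
	resp, err := client.Get("https://quay.io/v2/")
	fmt.Println(resp, err)
}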
Please review the following PR: https://github.com/openshift/telemeter/pull/522
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
If I use custom CVO capabilities via the install config, I can create a capability set that disables the Ingress capability. However, once the cluster boots up, the Ingress capability will always be enabled. This creates a dissonance between the desired install config and what happens. It would be better to fail the install at install-config validation to prevent that dissonance.
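A sketch of what failing at install-config validation could look like, assuming a simplified representation of the computed capability set (the real installer types and capability handling differ):

package main

import (
	"errors"
	"fmt"
)

// validateCapabilities rejects an install config whose computed capability
// set does not include Ingress, since the cluster will force-enable it anyway.
func validateCapabilities(enabled map[string]bool) error {
	if !enabled["Ingress"] {
		return errors.New("capabilities: the Ingress capability is required and cannot be disabled")
	}
	return nil
}

func main() {
	fmt.Println(validateCapabilities(map[string]bool{"Ingress": false})) // error at install-config validation
	fmt.Println(validateCapabilities(map[string]bool{"Ingress": true}))  // <nil>
}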
Version-Release number of selected component (if applicable):
4.16
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of the problem:
Installing a cluster. During installation we hit timeout warnings and the installation got stuck in the installing state.
No more events appear in the events log, and the installation is still in the same state after ~48 hours.
It looks like it is stuck forever.
test-infra-cluster-cfb47d07_608f175e-aa23-493d-8a5c-d5bcaf15468f(1).tar
Screencast from 2024-03-15 21-05-34.webm
How reproducible:
Steps to reproduce:
1.
2.
3.
Actual results:
Expected results:
Please review the following PR: https://github.com/openshift/oauth-server/pull/140
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
This is a clone of issue OCPBUGS-32710. The following is the description of the original issue:
—
Description of problem:
When virtualHostedStyle is enabled with regionEndpoint set in config.imageregistry/cluster, the image registry fails to run. Errors thrown: time="2024-04-22T14:14:31.057192227Z" level=error msg="s3aws: RequestError: send request failed\ncaused by: Get \"https://s3-fips.us-west-1.amazonaws.com/ci-ln-67zbmzk-76ef8-4n6wb-image-registry-us-west-1-xjyfbabyboc?list-type=2&max-keys=1&prefix=\": dial tcp: lookup s3-fips.us-west-1.amazonaws.com on 172.30.0.10:53: no such host" go.version="go1.20.12 X:strictfipsruntime"
Version-Release number of selected component (if applicable):
4.14.18
How reproducible:
always
Steps to Reproduce:
1. $ oc get config.imageregistry/cluster -ojsonpath="{.status.storage}"|jq { "managementState": "Managed", "s3": { "bucket": "ci-ln-67zbmzk-76ef8-4n6wb-image-registry-us-west-1-xjyfbabyboc", "encrypt": true, "region": "us-west-1", "regionEndpoint": "https://s3-fips.us-west-1.amazonaws.com", "trustedCA": { "name": "" }, "virtualHostedStyle": true } } 2. Check registry pod $ oc get co image-registry NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE image-registry 4.15.5 True True True 79m Degraded: Registry deployment has timed out progressing: ReplicaSet "image-registry-b6c58998d" has timed out progressing
Actual results:
$ oc get pods image-registry-b6c58998d-m8pnb -oyaml| yq '.spec.containers[0].env' - name: REGISTRY_STORAGE_S3_REGIONENDPOINT value: https://s3-fips.us-west-1.amazonaws.com [...] - name: REGISTRY_STORAGE_S3_VIRTUALHOSTEDSTYLE value: "true" [...] $ oc logs image-registry-b6c58998d-m8pnb [...] time="2024-04-22T14:14:31.057192227Z" level=error msg="s3aws: RequestError: send request failed\ncaused by: Get \"https://s3-fips.us-west-1.amazonaws.com/ci-ln-67zbmzk-76ef8-4n6wb-image-registry-us-west-1-xjyfbabyboc?list-type=2&max-keys=1&prefix=\": dial tcp: lookup s3-fips.us-west-1.amazonaws.com on 172.30.0.10:53: no such host" go.version="go1.20.12 X:strictfipsruntime"
Expected results:
virtual hosted-style should work
Additional info:
Description of the problem:
Per the latest decision, RH is not going to support installing an OCP cluster on vSphere with nested virtualization. Thus the "Install OpenShift Virtualization" checkbox on the "Operators" page should be disabled when the "vSphere" platform is selected on the "Cluster Details" page.
Slack discussion thread
https://redhat-internal.slack.com/archives/C0211848DBN/p1706640683120159
Nutanix
https://portal.nutanix.com/page/documents/kbs/details?targetId=kA00e000000XeiHCAS
Description of problem:
Converting IPv6-primary dual stack to IPv6 single stack causes control plane failures. OVN masters are periodically in CLBO state. OVN master logs: http://shell.lab.bos.redhat.com/~anusaxen/convert/ MG is not working as the cluster lands in bad shape. Happy to share the cluster if needed for debugging
Version-Release number of selected component (if applicable):
4.13.0-0.nightly-2023-04-21-084440
How reproducible:
Always
Steps to Reproduce:
1.Bring cluster on IPv6 Primary Dual Stack 2.Edit network.config.openshift.io from dual stack to single like as follow spec: clusterNetwork: - cidr: fd01::/48 hostPrefix: 64 - cidr: 10.128.0.0/14 hostPrefix: 23 externalIP: policy: {} networkType: OVNKubernetes serviceNetwork: - fd02::/112 - 172.30.0.0/16 status: clusterNetwork: - cidr: fd01::/48 hostPrefix: 64 - cidr: 10.128.0.0/14 hostPrefix: 23 clusterNetworkMTU: 1400 networkType: OVNKubernetes serviceNetwork: - fd02::/112 - 172.30.0.0/16 TO apiVersion: v1 items: - apiVersion: config.openshift.io/v1 kind: Network metadata: creationTimestamp: "2023-04-27T14:11:37Z" generation: 3 name: cluster resourceVersion: "81045" uid: 28f15675-e739-4262-9acc-4c2c0df4b38d spec: clusterNetwork: - cidr: fd01::/48 hostPrefix: 64 externalIP: policy: {} networkType: OVNKubernetes serviceNetwork: - fd02::/112 status: clusterNetwork: - cidr: fd01::/48 hostPrefix: 64 clusterNetworkMTU: 1400 networkType: OVNKubernetes serviceNetwork: - fd02::/112 kind: List metadata: resourceVersion: "" 3. Wait for control plane components to roll out successfully
Actual results:
Cluster fails with network, ETCD, Kube API and ingress failures
Expected results:
Cluster should convert to IPv6 single stack without any issues
Additional info:
Must-gathers are not working due to various control plane components restricting it
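For reference, a minimal sketch (assuming the dual-stack CIDRs shown in the steps above) of performing the conversion non-interactively with oc patch and then checking the affected operators:
$ oc patch network.config.openshift.io cluster --type=merge -p '{"spec":{"clusterNetwork":[{"cidr":"fd01::/48","hostPrefix":64}],"serviceNetwork":["fd02::/112"]}}'
$ oc get co | grep -E 'etcd|kube-apiserver|network|ingress'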
Ecosystem QE is preparing to create a release-4.16 branch within our test repos. Many packages are currently using v0.29 modules, which will not be compatible with v0.28. It would be ideal if we can update the k8s modules to v0.29 to prevent us from needing to re-implement the assisted APIs.
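A minimal sketch (assuming the test repos follow the standard k8s.io module layout) of what the bump would look like in one repository:
$ go get k8s.io/api@v0.29.0 k8s.io/apimachinery@v0.29.0 k8s.io/client-go@v0.29.0
$ go mod tidy
$ grep 'k8s.io/.* v0.28' go.mod || echo "no v0.28 k8s.io modules remain"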
Please review the following PR: https://github.com/openshift/csi-operator/pull/126
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
OCP 4.14.5
multicluster-engine.v2.4.1
advanced-cluster-management.v2.9.0
Attempt to create a spoke cluster:
apiVersion: extensions.hive.openshift.io/v1beta1 kind: AgentClusterInstall metadata: creationTimestamp: "2023-12-08T16:59:25Z" finalizers: - agentclusterinstall.agent-install.openshift.io/ai-deprovision generation: 1 name: infraenv-spoke namespace: infraenv-spoke ownerReferences: - apiVersion: hive.openshift.io/v1 kind: ClusterDeployment name: infraenv-spoke uid: 34f1fe43-2af2-4880-b4ca-fb9ab8df13df resourceVersion: "3468594" uid: 79a42bdf-db1f-4500-b689-8b3813bd27a6 spec: clusterDeploymentRef: name: infraenv-spoke imageSetRef: name: 4.14-test networking: clusterNetwork: - cidr: 10.128.0.0/14 hostPrefix: 23 serviceNetwork: - 172.30.0.0/16 userManagedNetworking: true provisionRequirements: controlPlaneAgents: 3 workerAgents: 2 status: conditions: - lastProbeTime: "2023-12-08T16:59:30Z" lastTransitionTime: "2023-12-08T16:59:30Z" message: SyncOK reason: SyncOK status: "True" type: SpecSynced - lastProbeTime: "2023-12-08T16:59:30Z" lastTransitionTime: "2023-12-08T16:59:30Z" message: The cluster is not ready to begin the installation reason: ClusterNotReady status: "False" type: RequirementsMet - lastProbeTime: "2023-12-08T16:59:30Z" lastTransitionTime: "2023-12-08T16:59:30Z" message: 'The cluster''s validations are failing: ' reason: ValidationsFailing status: "False" type: Validated - lastProbeTime: "2023-12-08T16:59:30Z" lastTransitionTime: "2023-12-08T16:59:30Z" message: The installation has not yet started reason: InstallationNotStarted status: "False" type: Completed - lastProbeTime: "2023-12-08T16:59:30Z" lastTransitionTime: "2023-12-08T16:59:30Z" message: The installation has not failed reason: InstallationNotFailed status: "False" type: Failed - lastProbeTime: "2023-12-08T16:59:30Z" lastTransitionTime: "2023-12-08T16:59:30Z" message: The installation is waiting to start or in progress reason: InstallationNotStopped status: "False" type: Stopped debugInfo: eventsURL: https://assisted-service-rhacm.apps.sno-0.qe.lab.redhat.com/api/assisted-install/v2/events?api_key=eyJhbGciOiJFUzI1NiIsInR5cCI6IkpXVCJ9.eyJjbHVzdGVyX2lkIjoiZjU4MjVmMTctNTg0OS00OTljLWE1NDctNjJmMDc4ZDU3MDJiIn0.qpSeZuqLwZ3cr3qn6AZo665o1ANp45YVE6IWUv7Gdn1RmapG4HZaxsUUY4iswkRMiqIfka_pLHFnBeVzXSTbrg&cluster_id=f5825f17-5849-499c-a547-62f078d5702b logsURL: "" state: insufficient stateInfo: Cluster is not ready for install platformType: None progress: totalPercentage: 0 userManagedNetworking: true apiVersion: agent-install.openshift.io/v1beta1 kind: InfraEnv metadata: creationTimestamp: "2023-12-08T16:59:26Z" finalizers: - infraenv.agent-install.openshift.io/ai-deprovision generation: 1 name: infraenv-spoke namespace: infraenv-spoke resourceVersion: "3468794" uid: 6254bbb3-5531-4665-bb78-f073b439b023 spec: clusterRef: name: infraenv-spoke namespace: infraenv-spoke cpuArchitecture: s390x ipxeScriptType: "" nmStateConfigLabelSelector: {} pullSecretRef: name: infraenv-spoke-pull-secret status: agentLabelSelector: matchLabels: infraenvs.agent-install.openshift.io: infraenv-spoke bootArtifacts: initrd: "" ipxeScript: "" kernel: "" rootfs: "" conditions: - lastTransitionTime: "2023-12-08T16:59:51Z" message: 'Failed to create image: cannot use Minimal ISO because it''s not compatible with the s390x architecture on version 4.14.6 of OpenShift' reason: ImageCreationError status: "False" type: ImageCreated debugInfo: eventsURL: "" oc get clusterimagesets.hive.openshift.io 4.14-test -o yaml apiVersion: hive.openshift.io/v1 kind: ClusterImageSet metadata: creationTimestamp: "2023-12-08T18:11:29Z" generation: 1 name: 4.14-test resourceVersion: 
"3514589" uid: 32e2ba8d-6bb7-4e4b-b3a5-63fa8224d144 spec: releaseImage: registry.ci.openshift.org/ocp-s390x/release-s390x@sha256:f024a617c059bf2cbf4a669c2a19ab4129e78a007c6863b64dd73a413c0bdf46 oc get agentserviceconfigs.agent-install.openshift.io agent -o yaml apiVersion: agent-install.openshift.io/v1beta1 kind: AgentServiceConfig metadata: creationTimestamp: "2023-12-08T18:10:42Z" finalizers: - agentserviceconfig.agent-install.openshift.io/ai-deprovision generation: 1 name: agent resourceVersion: "3514534" uid: ef204896-25f1-4ff3-ae60-c80c2f45cd30 spec: databaseStorage: accessModes: - ReadWriteOnce resources: requests: storage: 10Gi filesystemStorage: accessModes: - ReadWriteOnce resources: requests: storage: 20Gi imageStorage: accessModes: - ReadWriteOnce resources: requests: storage: 10Gi mirrorRegistryRef: name: mirror-registry-ca osImages: - cpuArchitecture: x86_64 openshiftVersion: "4.14" rootFSUrl: "" url: https://mirror.openshift.com/pub/openshift-v4/amd64/dependencies/rhcos/4.14/latest/rhcos-live.x86_64.iso version: "4.14" - cpuArchitecture: arm64 openshiftVersion: "4.14" rootFSUrl: "" url: https://mirror.openshift.com/pub/openshift-v4/aarch64/dependencies/rhcos/4.14/latest/rhcos-live.aarch64.iso version: "4.14" - cpuArchitecture: ppc64le openshiftVersion: "4.14" rootFSUrl: "" url: https://mirror.openshift.com/pub/openshift-v4/ppc64le/dependencies/rhcos/4.14/latest/rhcos-live.ppc64le.iso version: "4.14" - cpuArchitecture: s390x openshiftVersion: "4.14" rootFSUrl: "" url: https://mirror.openshift.com/pub/openshift-v4/s390x/dependencies/rhcos/4.14/latest/rhcos-live.s390x.iso version: "4.14" status: conditions: - lastTransitionTime: "2023-12-08T18:10:42Z" message: AgentServiceConfig reconcile completed without error. reason: ReconcileSucceeded status: "True" type: ReconcileCompleted - lastTransitionTime: "2023-12-08T18:11:23Z" message: All the deployments managed by Infrastructure-operator are healthy. reason: DeploymentSucceeded status: "True" type: DeploymentsHealthy
Description of problem:
The mirrorToDisk command fails when the imagesetconfig includes multiple catalogs (v2docker2 + oci)
Version-Release number of selected component (if applicable):
oc-mirror version WARNING: This version information is deprecated and will be replaced with the output from --short. Use --output=yaml|json to get the full version. Client Version: version.Info{Major:"", Minor:"", GitVersion:"4.16.0-202403251146.p0.g03ce0ca.assembly.stream.el9-03ce0ca", GitCommit:"03ce0ca797e73b6762fd3e24100ce043199519e9", GitTreeState:"clean", BuildDate:"2024-03-25T16:34:33Z", GoVersion:"go1.21.7 (Red Hat 1.21.7-1.el9) X:strictfipsruntime", Compiler:"gc", Platform:"linux/amd64"}
How reproducible:
always
Steps to Reproduce:
1) Copy the oci redhat-operator-index multi-arch with skopeo: `skopeo copy --all docker://registry.redhat.io/redhat/redhat-operator-index:v4.15 oci:///app1/noo/redhat-operator-index --remove-signatures`
2) Set the imagesetconfig with multiple catalogs:
cat config-multi-cs.yaml kind: ImageSetConfiguration apiVersion: mirror.openshift.io/v1alpha2 mirror: operators: - catalog: oci:///app1/noo/redhat-operator-index packages: - name: aws-load-balancer-operator - catalog: registry.redhat.io/redhat/redhat-operator-index:v4.15 packages: - name: elasticsearch-operator - catalog: registry.redhat.io/redhat/redhat-marketplace-index:v4.15 packages: - name: datadog-operator-certified-rhmp - catalog: registry.redhat.io/redhat/certified-operator-index:v4.15 packages: - name: portworx-certified - catalog: registry.redhat.io/redhat/community-operator-index:v4.15 packages: - name: seldon-operator
3) Run the mirrorToDisk: `oc-mirror --config config-multi-cs.yaml file://multics --v2`
Actual results:
3) mirror command failed: oc-mirror --config config-multi-cs.yaml file://multics --v2 --v2 flag identified, flow redirected to the oc-mirror v2 version. PLEASE DO NOT USE that. V2 is still under development and it is not ready to be used. 2024/04/02 06:41:10 [INFO] : mode mirrorToDisk 2024/04/02 06:41:10 [INFO] : local storage registry will log to /app1/0401/multics/working-dir/logs/registry.log 2024/04/02 06:41:10 [INFO] : starting local storage on localhost:55000 2024/04/02 06:41:10 [INFO] : copying cincinnati response to multics/working-dir/release-filters 2024/04/02 06:41:10 [INFO] : total release images to copy 0 2024/04/02 06:41:10 [INFO] : copying operator image oci:///app1/noo/redhat-operator-index 2024/04/02 06:41:15 [INFO] : manifest 8bfb449c24d03d6ddbd05d3de9fe7a7dae4a2ecdb8f84487f28d24d6ca2d175c 2024/04/02 06:41:15 [INFO] : label /configs 2024/04/02 06:41:29 [INFO] : copying operator image registry.redhat.io/redhat/redhat-operator-index:v4.15 2024/04/02 06:41:39 [INFO] : manifest c866c3b4dac531016e4798f3232bb40e07d7dabd7d628d575de53d1821e51a50 2024/04/02 06:41:39 [INFO] : label /configs 2024/04/02 06:41:53 [INFO] : copying operator image registry.redhat.io/redhat/redhat-marketplace-index:v4.15 2024/04/02 06:42:01 [INFO] : manifest 895c1659c4337aaa963263f002fefa938087e40011cd2c1331ef4780d62fd1a7 2024/04/02 06:42:01 [INFO] : label /configs 2024/04/02 06:42:07 [INFO] : copying operator image registry.redhat.io/redhat/certified-operator-index:v4.15 2024/04/02 06:42:13 [INFO] : manifest 5c32d95f0c6d873454f2fcd6a9750dbadf638b8db8ada3d0b1c282d80b0dbcb3 2024/04/02 06:42:13 [INFO] : label /configs 2024/04/02 06:42:21 [INFO] : copying operator image registry.redhat.io/redhat/community-operator-index:v4.15 2024/04/02 06:42:27 [INFO] : manifest befb55a98886578684023b155c5889a845defb35d0d09c52b8738b851ee4eec2 2024/04/02 06:42:27 [INFO] : label /configs 2024/04/02 06:42:36 [INFO] : related images length 10 2024/04/02 06:42:36 [INFO] : images to copy (before duplicates) 26 error closing log file registry.log: close multics/working-dir/logs/registry.log: file already closed 2024/04/02 06:42:36 [ERROR] : unable to parse image correctly
Expected results:
No error; the mirrorToDisk command should complete successfully.
The provisioning CR is now created with a paused annotation (since https://github.com/openshift/installer/pull/8346)
On baremetal IPI installs, this annotation is removed at the conclusion of bootstrapping.
On assisted/ABI installs there is nothing to remove it, so cluster-baremetal-operator never deploys anything.
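A minimal sketch of confirming the stuck state on an assisted/ABI cluster (the exact annotation key set by the installer is not repeated here, so the command simply dumps all annotations on the CR):
$ oc get provisioning provisioning-configuration -o jsonpath='{.metadata.annotations}{"\n"}'
$ oc -n openshift-machine-api get pods   # cluster-baremetal-operator deploys nothing while the CR stays paused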
Description of problem:
After upgrading to 4.16.0-0.nightly-2024-02-23-013505 from 4.15.0-rc.8 (gcp-ipi-disc-priv-oidc-f14), openshift-cloud-network-config-controller is in CrashLoopBackOff with "Error building cloud provider client, err: error: cannot initialize google client". A must-gather is available. The other job (gcp-ipi-oidc-rt-fips-f14) failed with the same error.
https://qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gs/qe-private-deck/logs/periodic-ci-openshift-openshift-tests-private-release-4.16-amd64-nightly-4.16-upgrade-from-stable-4.15-gcp-ipi-disc-priv-oidc-f14/1761337726575054848
Version-Release number of selected component (if applicable):
4.16.0-0.nightly-2024-02-23-013505
How reproducible:
Steps to Reproduce:
1. Upgrade a cluster from 4.15.0-rc.8 to 4.16.0-0.nightly-2024-02-23-013505 (profiles gcp-ipi-disc-priv-oidc-f14 / gcp-ipi-oidc-rt-fips-f14). 2. Check the openshift-cloud-network-config-controller pods after the upgrade.
Actual results:
containerStatuses: - containerID: cri-o://b7dc826c4004583a4195f953bb7c858f3645b3ba864db65c69282fc8b7a9a9e8 image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:02a0ea00865bda78b3b04056dc9e4f596dae74996ecc1fcdee7fbe8d603e33f1 imageID: 9dfa10971dce332900b111bbe6a28df76e1d6e0c5b9c132c3abfff80ea0afa9c lastState: terminated: containerID: cri-o://b7dc826c4004583a4195f953bb7c858f3645b3ba864db65c69282fc8b7a9a9e8 exitCode: 255 finishedAt: "2024-02-24T15:20:08Z" message: | r,UID:,APIVersion:apps/v1,ResourceVersion:,FieldPath:,},Reason:FeatureGatesInitialized,Message:FeatureGates updated to featuregates.Features{Enabled:[]v1.FeatureGateName{\"AlibabaPlatform\", \"AzureWorkloadIdentity\", \"BuildCSIVolumes\", \"CloudDualStackNodeIPs\", \"ExternalCloudProvider\", \"ExternalCloudProviderAzure\", \"ExternalCloudProviderExternal\", \"ExternalCloudProviderGCP\", \"KMSv1\", \"NetworkLiveMigration\", \"OpenShiftPodSecurityAdmission\", \"PrivateHostedZoneAWS\", \"VSphereControlPlaneMachineSet\"}, Disabled:[]v1.FeatureGateName{\"AdminNetworkPolicy\", \"AutomatedEtcdBackup\", \"CSIDriverSharedResource\", \"ClusterAPIInstall\", \"DNSNameResolver\", \"DisableKubeletCloudCredentialProviders\", \"DynamicResourceAllocation\", \"EventedPLEG\", \"GCPClusterHostedDNS\", \"GCPLabelsTags\", \"GatewayAPI\", \"InsightsConfigAPI\", \"InstallAlternateInfrastructureAWS\", \"MachineAPIOperatorDisableMachineHealthCheckController\", \"MachineAPIProviderOpenStack\", \"MachineConfigNodes\", \"ManagedBootImages\", \"MaxUnavailableStatefulSet\", \"MetricsServer\", \"MixedCPUsAllocation\", \"NodeSwap\", \"OnClusterBuild\", \"PinnedImages\", \"RouteExternalCertificate\", \"SignatureStores\", \"SigstoreImageVerification\", \"TranslateStreamCloseWebsocketRequests\", \"UpgradeStatus\", \"VSphereStaticIPs\", \"ValidatingAdmissionPolicy\", \"VolumeGroupSnapshot\"}},Source:EventSource{Component:cloud-network-config-controller-86bc6cf968-54kkg,Host:,},FirstTimestamp:2024-02-24 15:20:07.457325229 +0000 UTC m=+0.107570685,LastTimestamp:2024-02-24 15:20:07.457325229 +0000 UTC m=+0.107570685,Count:1,Type:Normal,EventTime:0001-01-01 00:00:00 +0000 UTC,Series:nil,Action:,Related:nil,ReportingController:cloud-network-config-controller-86bc6cf968-54kkg,ReportingInstance:,}" F0224 15:20:08.633010 1 main.go:138] Error building cloud provider client, err: error: cannot initialize google client, err: Get "http://169.254.169.254/computeMetadata/v1/universe/universe_domain": dial tcp 169.254.169.254:80: connect: connection refused reason: Error startedAt: "2024-02-24T15:20:07Z" name: controller ready: false restartCount: 12 started: false state: waiting: message: back-off 5m0s restarting failed container=controller pod=cloud-network-config-controller-86bc6cf968-54kkg_openshift-cloud-network-config-controller(95a0c264-ad8b-4fb0-9218-5b2b84fb8194) reason: CrashLoopBackOff
Expected results:
CNCC won't crash after upgrade
Additional info:
Description of problem:
If there are TaskRuns with the same name in 2 different namespaces, the TaskRuns list page for All namespaces shows only one record because the names collide.
Version-Release number of selected component (if applicable):
4.15
How reproducible:
Always
Steps to Reproduce:
1. Create TaskRun using https://gist.github.com/karthikjeeyar/eb1bbdf9157431f5c875eb55ce47580c in 2 different namespace 2. Go to TaskRun list page 3. Select All Projects
Actual results:
Only one entry is shown
Expected results:
Both entries should be visible
Additional info:
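For reference, a minimal reproduction sketch from the CLI (namespace names are hypothetical; taskrun.yaml is the TaskRun from the gist above saved locally):
$ for ns in demo-a demo-b; do oc new-project "$ns"; oc apply -n "$ns" -f taskrun.yaml; done
$ oc get taskruns -A   # the CLI lists both entries; the console list page shows only one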
This is a clone of issue OCPBUGS-34181. The following is the description of the original issue:
—
In the agent installer, assisted-service must always use the openshift-baremetal-installer binary (which is dynamically linked) to ensure that if the target cluster is in FIPS mode the installer will be able to run. (This was implemented in MGMT-15150.)
A recent change for OCPBUGS-33227 has switched to using the statically-linked openshift-installer for 4.16 and later. This breaks FIPS on the agent-based installer.
It appears that CI tests for the agent installer (the compact-ipv4 job runs with FIPS enabled) did not detect this, because we are unable to correctly determine the "version" of OpenShift being installed when it is in fact a CI payload.
This is a clone of issue OCPBUGS-36932. The following is the description of the original issue:
—
Description of problem:
Customer defines a proxy in its HostedCluster resource definition. The variables are propagated to some pods, but not to the oauth one:
oc describe pod kube-apiserver-5f5dbf78dc-8gfgs | grep PROX
HTTP_PROXY: http://ocpproxy.corp.example.com:8080
HTTPS_PROXY: http://ocpproxy.corp.example.com:8080
NO_PROXY: .....
oc describe pod oauth-openshift-6d7b7c79f8-2cf99| grep PROX
HTTP_PROXY: socks5://127.0.0.1:8090
HTTPS_PROXY: socks5://127.0.0.1:8090
ALL_PROXY: socks5://127.0.0.1:8090
NO_PROXY: kube-apiserver
apiVersion: hypershift.openshift.io/v1beta1
kind: HostedCluster
...
spec:
autoscaling: {}
clusterID: 9c8db607-b291-4a72-acc7-435ec23a72ea
configuration:
.....
proxy:
httpProxy: http://ocpproxy.corp.example.com:8080
httpsProxy: http://ocpproxy.corp.example.com:8080
Version-Release number of selected component (if applicable): 4.14
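A minimal sketch (the hosted control plane namespace is hypothetical) of comparing the proxy environment across the affected pods:
$ for p in $(oc -n clusters-example get pods -o name | grep -E 'kube-apiserver|oauth-openshift'); do echo "== $p"; oc -n clusters-example get "$p" -o yaml | grep -iE 'http_proxy|https_proxy|no_proxy|all_proxy'; done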
Complement https://hypershift-docs.netlify.app/how-to/distribute-hosted-cluster-workloads/ to include Share everything / share nothing / dedicated behaviour and requirements. -> https://docs.google.com/document/d/1eSaqR7rUwelq0PRC_trL3d5vxLMlJmLytHtFZyFOpvg/edit
Please review the following PR: https://github.com/openshift/cluster-api-provider-ovirt/pull/176
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
In https://amd64.ocp.releases.ci.openshift.org/releasestream/4.16.0-0.ci/release/4.16.0-0.ci-2024-02-06-031624, I see several PRs involving moving crio metrics. This payload is being rejected on TargetDown alert failures on AWS minor upgrades.
Example job: https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-[…]e-from-stable-4.15-e2e-aws-ovn-upgrade/1754707749708500992
[sig-node][invariant] alert/TargetDown should not be at or above info in ns/kube-system expand_less 0s { TargetDown was at or above info for at least 1m58s on platformidentification.JobType
(maxAllowed=1s): pending for 15m0s, firing for 1m58s: Feb 06 04:48:42.698 - 118s W namespace/kube-system alert/TargetDown alertstate/firing severity/warning ALERTS{alertname="TargetDown", alertstate="firing", job="crio", namespace="kube-system", prometheus="openshift-monitoring/k8s", service="kubelet", severity="warning"}}
Please review the following PR: https://github.com/openshift/gcp-pd-csi-driver-operator/pull/101
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
Perf & scale team is running scale tests to find out the maximum supported egress IPs and came across this issue. When we have 55339 egress ip objects (each egress ip object with one egress ip address) in a 118-worker-node baremetal cluster, the multus-admission-controller pod is stuck in CrashLoopBackOff state. "oc describe pod" command output is copied here http://storage.scalelab.redhat.com/anilvenkata/multus-admission/multus-admission-controller-84b896c8-kmvdk.describe "oc describe pod" shows that the names of all 55339 egress ips are passed to the container's exec command
#cat multus-admission-controller-84b896c8-kmvdk.describe | grep ignore-namespaces | tr ',' '\n' | grep -c egressip 55339
and the exec command is failing as this argument list is too long.
# oc logs -n openshift-multus multus-admission-controller-84b896c8-kmvdk Defaulted container "multus-admission-controller" out of: multus-admission-controller, kube-rbac-proxy exec /bin/bash: argument list too long
# oc get co network NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE network 4.14.16 True True False 35d Deployment "/openshift-multus/multus-admission-controller" update is rolling out (1 out of 3 updated)
# oc describe pod -n openshift-multus multus-admission-controller-84b896c8-kmvdk > multus-admission-controller-84b896c8-kmvdk.describe
# oc get pods -n openshift-multus | grep multus-admission-controller multus-admission-controller-6c58c66ff9-5x9hn 2/2 Running 0 35d multus-admission-controller-6c58c66ff9-zv9pd 2/2 Running 0 35d multus-admission-controller-84b896c8-kmvdk 1/2 CrashLoopBackOff 26 (2m56s ago) 110m
As this environment has 55338 namespaces (each namespace with 1 pod and 1 eip object), it will be hard to capture a must-gather.
Version-Release number of selected component (if applicable):
4.14.16
How reproducible:
always
Steps to Reproduce:
1. use kube-burner to create 55339 egress ip obejct, each object with one egress ip address. 2. We will see multus-admission-controller pod stuck in CrashLoopBackOff
Actual results:
Expected results:
Additional info:
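A minimal sketch of confirming that the generated container command exceeds the kernel argument-size limit (the grep is only a rough estimate of the bytes contributed by the egress IP names):
$ getconf ARG_MAX
$ oc -n openshift-multus get deployment multus-admission-controller -o yaml | grep -o 'egressip[^,]*' | wc -c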
Description of problem:
The sdn image inherits from the cli image to get the oc binary. Change this to install the openshift-clients rpm instead.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1.
2.
3.
Actual results:
Expected results:
Additional info:
Add the possibility to have other search filters in the resource list toolbar
For now, using the props and some hacks we were able to change the Name search into an IP search but we would like to have both.
Please review the following PR: https://github.com/openshift/ibm-vpc-block-csi-driver/pull/60
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
When running oc-mirror against yaml that includes community-operator-index the process terminates prematurely
Version-Release number of selected component (if applicable):
$ oc-mirror version WARNING: This version information is deprecated and will be replaced with the output from --short. Use --output=yaml|json to get the full version. Client Version: version.Info{Major:"", Minor:"", GitVersion:"4.16.0-202404221110.p0.g0e2235f.assembly.stream.el9-0e2235f", GitCommit:"0e2235f4a51ce0a2d51cfc87227b1c76bc7220ea", GitTreeState:"clean", BuildDate:"2024-04-22T16:05:56Z", GoVersion:"go1.21.9 (Red Hat 1.21.9-1.el9_4) X:strictfipsruntime", Compiler:"gc", Platform:"linux/amd64"}
How reproducible:
$ cat imageset-config.yaml apiVersion: mirror.openshift.io/v1alpha2 kind: ImageSetConfiguration archiveSize: 4 mirror: platform: channels: - name: stable-4.15 type: ocp graph: true operators: - catalog: registry.redhat.io/redhat/redhat-operator-index:v4.15 full: false - catalog: registry.redhat.io/redhat/certified-operator-index:v4.15 full: false - catalog: registry.redhat.io/redhat/community-operator-index:v4.15 full: false additionalImages: - name: registry.redhat.io/ubi8/ubi:latest helm: {} $ oc-mirror --v2 -c imageset-config.yaml --loglevel debug --workspace file:////data/oc-mirror/workdir/ docker://registry.local.momolab.io:8443 Last 10 lines: 2024/04/29 06:01:40 [DEBUG] : source docker://public.ecr.aws/aws-controllers-k8s/apigatewayv2-controller:1.0.7 2024/04/29 06:01:40 [DEBUG] : destination docker://registry.local.momolab.io:8443/aws-controllers-k8s/apigatewayv2-controller:1.0.7 2024/04/29 06:01:40 [DEBUG] : source docker://quay.io/openshift-community-operators/ack-apigatewayv2-controller@sha256:c6844909fa2fdf8aabf1c6762a2871d85fb3491e4c349990f46e4cd1e7ecc099 2024/04/29 06:01:40 [DEBUG] : destination docker://registry.local.momolab.io:8443/openshift-community-operators/ack-apigatewayv2-controller:c6844909fa2fdf8aabf1c6762a2871d85fb3491e4c349990f46e4cd1e7ecc099 2024/04/29 06:01:40 [DEBUG] : source docker://quay.io/openshift-community-operators/openshift-nfd-operator@sha256:880517267f12e0ca4dd9621aa196c901eb1f754e5ec990a1459d0869a8c17451 2024/04/29 06:01:40 [DEBUG] : destination docker://registry.local.momolab.io:8443/openshift-community-operators/openshift-nfd-operator:880517267f12e0ca4dd9621aa196c901eb1f754e5ec990a1459d0869a8c17451 2024/04/29 06:01:40 [DEBUG] : source docker://quay.io/openshift/origin-cluster-nfd-operator:4.10 2024/04/29 06:01:40 [DEBUG] : destination docker://registry.local.momolab.io:8443/openshift/origin-cluster-nfd-operator:4.10 2024/04/29 06:01:40 [ERROR] : [OperatorImageCollector] unable to parse image registry.redhat.io/openshift4/ose-kube-rbac-proxy correctly 2024/04/29 06:01:40 [INFO] : 👋 Goodbye, thank you for using oc-mirror error closing log file registry.log: close /data/oc-mirror/workdir/working-dir/logs/registry.log: file already closed 2024/04/29 06:01:40 [ERROR] : unable to parse image registry.redhat.io/openshift4/ose-kube-rbac-proxy correctly
Steps to Reproduce:
1. Run oc-mirror command as above with debug enabled 2. Wait a few minutes 3. oc-mirror fails
Actual results:
oc-mirror fails when openshift-community-operator is included
Expected results:
oc-mirror to complete
Additional info:
I have the debug logs, which I can attach.
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
This is a clone of issue OCPBUGS-35752. The following is the description of the original issue:
—
Description of problem:
When installing a fresh 4.16-rc.5 on AWS, the following logs are shown: time="2024-06-18T16:47:23+02:00" level=debug msg="I0618 16:47:23.596147 4921 logger.go:75] \"enabling EKS controllers and webhooks\" logger=\"setup\"" time="2024-06-18T16:47:23+02:00" level=debug msg="I0618 16:47:23.596154 4921 logger.go:81] \"EKS IAM role creation\" logger=\"setup\" enabled=false" time="2024-06-18T16:47:23+02:00" level=debug msg="I0618 16:47:23.596159 4921 logger.go:81] \"EKS IAM additional roles\" logger=\"setup\" enabled=false" time="2024-06-18T16:47:23+02:00" level=debug msg="I0618 16:47:23.596164 4921 logger.go:81] \"enabling EKS control plane controller\" logger=\"setup\"" time="2024-06-18T16:47:23+02:00" level=debug msg="I0618 16:47:23.596184 4921 logger.go:81] \"enabling EKS bootstrap controller\" logger=\"setup\"" time="2024-06-18T16:47:23+02:00" level=debug msg="I0618 16:47:23.596198 4921 logger.go:81] \"enabling EKS managed cluster controller\" logger=\"setup\"" time="2024-06-18T16:47:23+02:00" level=debug msg="I0618 16:47:23.596215 4921 logger.go:81] \"enabling EKS managed machine pool controller\" logger=\"setup\"" That is somehow strange and may have side effects. It seems the EKS CAPA is enabled by default (see additional info)
Version-Release number of selected component (if applicable):
4.16-rc.5
How reproducible:
Always
Steps to Reproduce:
1. Install an cluster (even an SNO works) on AWS using IPI
Actual results:
EKS feature enabled
Expected results:
EKS feature not enabled
Additional info:
https://github.com/kubernetes-sigs/cluster-api-provider-aws/blob/main/feature/feature.go#L99
This is a clone of issue OCPBUGS-39402. The following is the description of the original issue:
—
There is a typo here: https://github.com/openshift/installer/blob/release-4.18/upi/openstack/security-groups.yaml#L370
It should be os_subnet6_range.
That task is only run if os_master_schedulable is defined and greater than 0 in the inventory.yaml
Description of problem
Listing DeploymentConfigs triggers a warning notification, which is not required for the Display warning policy feature. This warning response is set in the cluster by default. See the warning response below:
299 - "apps.openshift.io/v1 DeploymentConfig is deprecated in v4.14+, unavailable in v4.10000+"
Version-Release number of selected component (if applicable):
4.16
How reproducible:
Steps to Reproduce:
1. Click on the Deployment Config sub nav
2. The Admission Webhook notification is displayed
3.
Additional info: I think this is good since the CLI behavior is like that too. Will discuss this behavior in the next stand up.
Actual results:
Expected results:
Additional info:
Please review the following PR: https://github.com/openshift/csi-external-provisioner/pull/80
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/aws-ebs-csi-driver/pull/250
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
This is a clone of issue OCPBUGS-33869. The following is the description of the original issue:
—
Download and merge French and Spanish languages translations in the OCP Console.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
This is a clone of issue OCPBUGS-43350. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-42732. The following is the description of the original issue:
—
Description of problem:
The operator cannot succeed removing resources when networkAccess is set to Removed. It looks like the authorization error changes from bloberror.AuthorizationPermissionMismatch to bloberror.AuthorizationFailure after the storage account becomes private (networkAccess: Internal). This is either caused by weird behavior in the azure sdk, or in the azure api itself. The easiest way to solve it is to also handle bloberror.AuthorizationFailure here: https://github.com/openshift/cluster-image-registry-operator/blob/master/pkg/storage/azure/azure.go?plain=1#L1145 The error condition is the following: status: conditions: - lastTransitionTime: "2024-09-27T09:04:20Z" message: "Unable to delete storage container: DELETE https://imageregistrywxj927q6bpj.blob.core.windows.net/wxj-927d-jv8fc-image-registry-rwccleepmieiyukdxbhasjyvklsshhee\n--------------------------------------------------------------------------------\nRESPONSE 403: 403 This request is not authorized to perform this operation.\nERROR CODE: AuthorizationFailure\n--------------------------------------------------------------------------------\n\uFEFF<?xml version=\"1.0\" encoding=\"utf-8\"?><Error><Code>AuthorizationFailure</Code><Message>This request is not authorized to perform this operation.\nRequestId:ababfe86-301e-0005-73bd-10d7af000000\nTime:2024-09-27T09:10:46.1231255Z</Message></Error>\n--------------------------------------------------------------------------------\n" reason: AzureError status: Unknown type: StorageExists - lastTransitionTime: "2024-09-27T09:02:26Z" message: The registry is removed reason: Removed status: "True" type: Available
Version-Release number of selected component (if applicable):
4.18, 4.17, 4.16 (needs confirmation), 4.15 (needs confirmation)
How reproducible:
Always
Steps to Reproduce:
1. Get an Azure cluster
2. In the operator config, set networkAccess to Internal
3. Wait until the operator reconciles the change (watch networkAccess in status with `oc get configs.imageregistry/cluster -oyaml |yq '.status.storage'`)
4. In the operator config, set management state to removed: `oc patch configs.imageregistry/cluster -p '{"spec":{"managementState":"Removed"}}' --type=merge`
5. Watch the cluster operator conditions for the error
Actual results:
Expected results:
Additional info:
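For reference, a minimal sketch of steps 2-4 above as commands (the networkAccess field path is an assumption based on the operator config shown; values assume an Azure cluster):
$ oc patch configs.imageregistry/cluster --type=merge -p '{"spec":{"storage":{"azure":{"networkAccess":{"type":"Internal"}}}}}'
$ oc get configs.imageregistry/cluster -oyaml | yq '.status.storage'
$ oc patch configs.imageregistry/cluster --type=merge -p '{"spec":{"managementState":"Removed"}}'
$ oc get co image-registry -oyaml | yq '.status.conditions'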
This is a clone of issue OCPBUGS-38569. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-38551. The following is the description of the original issue:
—
Description of problem:
If multiple NICs are configured in install-config, the installer will provision nodes properly but will fail in bootstrap due to API validation. > 4.17 will support multiple NICs, < 4.17 will not and will fail. Aug 15 18:30:57 2.252.83.01.in-addr.arpa cluster-bootstrap[4889]: [#1672] failed to create some manifests: Aug 15 18:30:57 2.252.83.01.in-addr.arpa cluster-bootstrap[4889]: "cluster-infrastructure-02-config.yml": failed to create infrastructures.v1.config.openshift.io/cluster -n : Infrastructure.config.openshift.io "cluster" is invalid: [spec.platformSpec.vsphere.failureDomains[0].topology.networks: Too many: 2: must have at most 1 items, <nil>: Invalid value: "null": some validation rules were not checked because the object was invalid; correct the existing errors to complete validation]
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Please review the following PR: https://github.com/openshift/cluster-openshift-controller-manager-operator/pull/337
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
Openshift Installer supports HTTP Proxy configuration in a restricted environment. However, it seems the bootstrap node doesn't use the given proxy when it grabs ignition assets.
Version-Release number of selected component (if applicable):
4.14.0-0.nightly-2023-04-27-113605
How reproducible:
Always
Steps to Reproduce:
1. try IPI installation in a restricted/disconnected network with "publish: Internal", and without using Google Private Access
Actual results:
The installation failed, because bootstrap node failed to fetch its ignition config.
Expected results:
The installation should succeed.
Additional info:
We previously fixed a similar issue on AWS (and Alibabacloud) via https://bugzilla.redhat.com/show_bug.cgi?id=2090836.
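For reference, a minimal sketch (proxy endpoint and noProxy values are hypothetical) of the install-config.yaml stanzas used for this kind of restricted, internal-publish install:
$ cat >> install-config.yaml <<'EOF'
proxy:
  httpProxy: http://proxy.example.com:3128
  httpsProxy: http://proxy.example.com:3128
  noProxy: .internal.example.com
publish: Internal
EOF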
Please review the following PR: https://github.com/openshift/cloud-network-config-controller/pull/129
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/csi-livenessprobe/pull/58
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/machine-os-images/pull/34
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
I noticed this today when looking at component readiness. A ~5% drop in the pass rate may seem minor, but these can certainly add up. This test passed 713 times in a row on 4.14. You can see today's failure here.
Details below:
-------
Component Readiness has found a potential regression in [sig-cluster-lifecycle] pathological event should not see excessive Back-off restarting failed containers.
Probability of significant regression: 99.96%
Sample (being evaluated) Release: 4.15
Start Time: 2024-01-17T00:00:00Z
End Time: 2024-01-23T23:59:59Z
Success Rate: 94.83%
Successes: 55
Failures: 3
Flakes: 0
Base (historical) Release: 4.14
Start Time: 2023-10-04T00:00:00Z
End Time: 2023-10-31T23:59:59Z
Success Rate: 100.00%
Successes: 713
Failures: 0
Flakes: 4
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
The hypershift operator introduced Azure customer-managed keys etcd encryption in https://github.com/openshift/hypershift/pull/3183. The implementation will not work in any Azure cloud other than Azure Public Cloud, because the keyvault URL is hardcoded to vault.azure.net (https://github.com/openshift/hypershift/blob/cd4d4c69a64d8983da04d7bb26ea39a72109e135/control-plane-operator/controllers/hostedcontrolplane/hostedcontrolplane_controller.go#L4871), which is the public cloud keyvault domain suffix only. The cloud-specific keyvault domain suffixes are listed here: https://learn.microsoft.com/en-us/azure/key-vault/general/about-keys-secrets-certificates#dns-suffixes-for-object-identifiers
Version-Release number of selected component (if applicable):
Since https://github.com/openshift/hypershift/pull/3183 was merged
How reproducible:
Every time
Steps to Reproduce:
1. 2. 3.
Actual results:
The keyvault domain is hardcoded to work specifically for public cloud, but will not for azure gov cloud when using etcd encryption with customer-managed keys
Expected results:
The keyvault domain to fetch from will use the correct cloud's domain suffix as outlined here: https://learn.microsoft.com/en-us/azure/key-vault/general/about-keys-secrets-certificates#dns-suffixes-for-object-identifiers
Additional info:
Description of problem:
Trying to install AWS EFS Driver 4.15 in 4.16 OCP. And driver pods get stuck with the below error: $ oc get pods NAME READY STATUS RESTARTS AGE aws-ebs-csi-driver-controller-5f85b66c6-5gw8n 11/11 Running 0 80m aws-ebs-csi-driver-controller-5f85b66c6-r5lzm 11/11 Running 0 80m aws-ebs-csi-driver-node-4mcjp 3/3 Running 0 76m aws-ebs-csi-driver-node-82hmk 3/3 Running 0 76m aws-ebs-csi-driver-node-p7g8j 3/3 Running 0 80m aws-ebs-csi-driver-node-q9bnd 3/3 Running 0 75m aws-ebs-csi-driver-node-vddmg 3/3 Running 0 80m aws-ebs-csi-driver-node-x8cwl 3/3 Running 0 80m aws-ebs-csi-driver-operator-5c77fbb9fd-dc94m 1/1 Running 0 80m aws-efs-csi-driver-controller-6c4c6f8c8c-725f4 4/4 Running 0 11m aws-efs-csi-driver-controller-6c4c6f8c8c-nvtl7 4/4 Running 0 12m aws-efs-csi-driver-node-2frs7 0/3 Pending 0 6m29s aws-efs-csi-driver-node-5cpb8 0/3 Pending 0 6m26s aws-efs-csi-driver-node-bchg5 0/3 Pending 0 6m28s aws-efs-csi-driver-node-brndb 0/3 Pending 0 6m27s aws-efs-csi-driver-node-qcc4m 0/3 Pending 0 6m27s aws-efs-csi-driver-node-wpk5d 0/3 Pending 0 6m27s aws-efs-csi-driver-operator-6b54c78484-gvxrt 1/1 Running 0 13m Events: Type Reason Age From Message ---- ------ ---- ---- ------- Warning FailedScheduling 6m58s default-scheduler 0/6 nodes are available: 1 node(s) didn't have free ports for the requested pod ports. preemption: 0/6 nodes are available: 1 node(s) didn't have free ports for the requested pod ports, 5 node(s) didn't match Pod's node affinity/selector. Warning FailedScheduling 3m42s (x2 over 4m24s) default-scheduler 0/6 nodes are available: 1 node(s) didn't have free ports for the requested pod ports. preemption: 0/6 nodes are available: 1 node(s) didn't have free ports for the requested pod ports, 5 node(s) didn't match Pod's node affinity/selector.
Version-Release number of selected component (if applicable):
4.15
How reproducible:
all the time
Steps to Reproduce:
1. Install AWS EFS CSI driver 4.15 in 4.16 OCP 2. 3.
Actual results:
EFS CSI drive node pods are stuck in pending state
Expected results:
All pod should be running.
Additional info:
More info on the initial debug here: https://redhat-internal.slack.com/archives/CBQHQFU0N/p1715757611210639
Description of problem:
The bubble box is displayed with the wrong layout
Version-Release number of selected component (if applicable):
4.14.0-0.nightly-2023-11-16-110328
How reproducible:
Always
Steps to Reproduce:
1. Make sure there is no pod under the project you are using 2. Navigate to Networking -> NetworkPolicies -> Create NetworkPolicy page, click the 'affected pods' link in the Pod selector section 3. Check the layout in the bubble component
Actual results:
the layout is incorrect (shared file: https://drive.google.com/file/d/1I8e2ZkiFO2Gu4nSt9kJ6JmRG3LdvkE-u/view?usp=drive_link )
Expected results:
layout should be correct
Additional info:
This is a clone of issue OCPBUGS-34079. The following is the description of the original issue:
—
Description of problem:
If a cluster admin creates a new MachineOSConfig that references a legacy pull secret, the canonicalized version of this secret that gets created is not updated whenever the original pull secret changes.
How reproducible:
Always
Steps to Reproduce:
.
Actual results:
The canonicalized version of the pull secret is never updated with the contents of the legacy-style pull secret.
Expected results:
Ideally, the canonicalized version of the pull secret should be updated since BuildController created it.
Additional info:
This occurs because when the legacy pull secret is initially detected, BuildController canonicalizes it and then updates the MachineOSConfig with the name of the canonicalized secret. The next time this secret is referenced, the original secret does not get read.
Description of problem:
A user noticed on cluster deletion that the IPI-generated service instance was not cleaned up. Add more debugging statements to find out why.
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. Create cluster 2. Delete cluster
Actual results:
Expected results:
Additional info:
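A minimal sketch of step 2 with debug logging enabled so the deprovision output (including any skipped service instance) is captured; the directory name is hypothetical:
$ openshift-install destroy cluster --dir ./cluster-dir --log-level debug 2>&1 | tee destroy.log
$ grep -i 'service instance' destroy.log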
Description of problem:
RHEL8 workers fail to go ready, ovn-controller node component is crashlooping with 2024-03-29T20:41:34.082252221Z + sourcedir=/usr/libexec/cni/ 2024-03-29T20:41:34.082269221Z + case "${rhelmajor}" in 2024-03-29T20:41:34.082269221Z + sourcedir=/usr/libexec/cni/rhel8 2024-03-29T20:41:34.082276361Z + cp -f /usr/libexec/cni/rhel8/ovn-k8s-cni-overlay /cni-bin-dir/ 2024-03-29T20:41:34.083575440Z cp: cannot stat '/usr/libexec/cni/rhel8/ovn-k8s-cni-overlay': No such file or directory
Version-Release number of selected component (if applicable):
4.16.0
How reproducible:
100% since https://github.com/openshift/ovn-kubernetes/pull/2083 merged
Steps to Reproduce:
1. run periodic-ci-openshift-release-master-nightly-4.16-e2e-aws-ovn-workers-rhel8
Actual results:
Expected results:
Additional info:
Description of problem:
revert "force cert rotation every couple days for development" in 4.16 Below is the steps to verify this bug: # oc adm release info --commits registry.ci.openshift.org/ocp/release:4.11.0-0.nightly-2022-06-25-081133|grep -i cluster-kube-apiserver-operator cluster-kube-apiserver-operator https://github.com/openshift/cluster-kube-apiserver-operator 7764681777edfa3126981a0a1d390a6060a840a3 # git log --date local --pretty="%h %an %cd - %s" 776468 |grep -i "#1307" 08973b820 openshift-ci[bot] Thu Jun 23 22:40:08 2022 - Merge pull request #1307 from tkashem/revert-cert-rotation # oc get clusterversions.config.openshift.io NAME VERSION AVAILABLE PROGRESSING SINCE STATUS version 4.11.0-0.nightly-2022-06-25-081133 True False 64m Cluster version is 4.11.0-0.nightly-2022-06-25-081133 $ cat scripts/check_secret_expiry.sh FILE="$1" if [ ! -f "$1" ]; then echo "must provide \$1" && exit 0 fi export IFS=$'\n' for i in `cat "$FILE"` do if `echo "$i" | grep "^#" > /dev/null`; then continue fi NS=`echo $i | cut -d ' ' -f 1` SECRET=`echo $i | cut -d ' ' -f 2` rm -f tls.crt; oc extract secret/$SECRET -n $NS --confirm > /dev/null echo "Check cert dates of $SECRET in project $NS:" openssl x509 -noout --dates -in tls.crt; echo done $ cat certs.txt openshift-kube-controller-manager-operator csr-signer-signer openshift-kube-controller-manager-operator csr-signer openshift-kube-controller-manager kube-controller-manager-client-cert-key openshift-kube-apiserver-operator aggregator-client-signer openshift-kube-apiserver aggregator-client openshift-kube-apiserver external-loadbalancer-serving-certkey openshift-kube-apiserver internal-loadbalancer-serving-certkey openshift-kube-apiserver service-network-serving-certkey openshift-config-managed kube-controller-manager-client-cert-key openshift-config-managed kube-scheduler-client-cert-key openshift-kube-scheduler kube-scheduler-client-cert-key Checking the Certs, they are with one day expiry times, this is as expected. 
# ./check_secret_expiry.sh certs.txt Check cert dates of csr-signer-signer in project openshift-kube-controller-manager-operator: notBefore=Jun 27 04:41:38 2022 GMT notAfter=Jun 28 04:41:38 2022 GMT Check cert dates of csr-signer in project openshift-kube-controller-manager-operator: notBefore=Jun 27 04:52:21 2022 GMT notAfter=Jun 28 04:41:38 2022 GMT Check cert dates of kube-controller-manager-client-cert-key in project openshift-kube-controller-manager: notBefore=Jun 27 04:52:26 2022 GMT notAfter=Jul 27 04:52:27 2022 GMT Check cert dates of aggregator-client-signer in project openshift-kube-apiserver-operator: notBefore=Jun 27 04:41:37 2022 GMT notAfter=Jun 28 04:41:37 2022 GMT Check cert dates of aggregator-client in project openshift-kube-apiserver: notBefore=Jun 27 04:52:26 2022 GMT notAfter=Jun 28 04:41:37 2022 GMT Check cert dates of external-loadbalancer-serving-certkey in project openshift-kube-apiserver: notBefore=Jun 27 04:52:26 2022 GMT notAfter=Jul 27 04:52:27 2022 GMT Check cert dates of internal-loadbalancer-serving-certkey in project openshift-kube-apiserver: notBefore=Jun 27 04:52:49 2022 GMT notAfter=Jul 27 04:52:50 2022 GMT Check cert dates of service-network-serving-certkey in project openshift-kube-apiserver: notBefore=Jun 27 04:52:28 2022 GMT notAfter=Jul 27 04:52:29 2022 GMT Check cert dates of kube-controller-manager-client-cert-key in project openshift-config-managed: notBefore=Jun 27 04:52:26 2022 GMT notAfter=Jul 27 04:52:27 2022 GMT Check cert dates of kube-scheduler-client-cert-key in project openshift-config-managed: notBefore=Jun 27 04:52:47 2022 GMT notAfter=Jul 27 04:52:48 2022 GMT Check cert dates of kube-scheduler-client-cert-key in project openshift-kube-scheduler: notBefore=Jun 27 04:52:47 2022 GMT notAfter=Jul 27 04:52:48 2022 GMT # # cat check_secret_expiry_within.sh #!/usr/bin/env bash # usage: ./check_secret_expiry_within.sh 1day # or 15min, 2days, 2day, 2month, 1year WITHIN=${1:-24hours} echo "Checking validity within $WITHIN ..." oc get secret --insecure-skip-tls-verify -A -o json | jq -r '.items[] | select(.metadata.annotations."auth.openshift.io/certificate-not-after" | . != null and fromdateiso8601<='$( date --date="+$WITHIN" +%s )') | "\(.metadata.annotations."auth.openshift.io/certificate-not-before") \(.metadata.annotations."auth.openshift.io/certificate-not-after") \(.metadata.namespace)\t\(.metadata.name)"' # ./check_secret_expiry_within.sh 1day Checking validity within 1day ... 2022-06-27T04:41:37Z 2022-06-28T04:41:37Z openshift-kube-apiserver-operator aggregator-client-signer 2022-06-27T04:52:26Z 2022-06-28T04:41:37Z openshift-kube-apiserver aggregator-client 2022-06-27T04:52:21Z 2022-06-28T04:41:38Z openshift-kube-controller-manager-operator csr-signer 2022-06-27T04:41:38Z 2022-06-28T04:41:38Z openshift-kube-controller-manager-operator csr-signer-signer
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
See https://issues.redhat.com/browse/OCPBUGS-26053
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. Create an ETP=Local LB Service on LGW for some v6 workload (assign IP to lb with MetalLB or manually) 2. Set static routes to a node hosting a pod on the client 3. Attempt reaching the IPv6 Service fails
Actual results:
Expected results:
Additional info:
This is a clone of issue OCPBUGS-34360. The following is the description of the original issue:
—
Description of problem:
After upgrading to OpenShift 4.14, the must-gather took much longer than before.
Version-Release number of selected component (if applicable):
4.14
How reproducible:
Always
Steps to Reproduce:
1. Run oc adm must-gather 2. Wait for it to complete 3.
Actual results:
For a cluster with around 50 nodes, the must-gather took about 30 minutes.
Expected results:
For a cluster with around 50 nodes, the must-gather can finish in about 10 minutes.
Additional info:
It seems the gather_ppc collection script is related here. https://github.com/openshift/must-gather/blob/release-4.14/collection-scripts/gather_ppc
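A minimal sketch of timing the collection so the before/after effect of the gather_ppc script can be compared:
$ time oc adm must-gather --dest-dir=/tmp/mg-$(date +%s)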
Please review the following PR: https://github.com/openshift/cluster-kube-controller-manager-operator/pull/779
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
When running agent-based installation with arm64 and multi payload, after booting the iso file, assisted-service raise the error, and the installation fail to start: Openshift version 4.16.0-0.nightly-arm64-2024-04-02-182838 for CPU architecture arm64 is not supported: no release image found for openshiftVersion: '4.16.0-0.nightly-arm64-2024-04-02-182838' and CPU architecture 'arm64'" go-id=419 pkg=Inventory request_id=5817b856-ca79-43c0-84f1-b38f733c192f The same error when running the installation with multi-arch build in assisted-service.log: Openshift version 4.16.0-0.nightly-multi-2024-04-01-135550 for CPU architecture multi is not supported: no release image found for openshiftVersion: '4.16.0-0.nightly-multi-2024-04-01-135550' and CPU architecture 'multi'" go-id=306 pkg=Inventory request_id=21a47a40-1de9-4ee3-9906-a2dd90b14ec8 Amd64 build works fine for now.
Version-Release number of selected component (if applicable):
How reproducible:
always
Steps to Reproduce:
1. Create agent iso file with openshift-install binary: openshift-install agent create image with arm64/multi payload 2. Booting the iso file 3. Track the "openshift-install agent wait-for bootstrap-complete" output and assisted-service log
Actual results:
The installation can't start with error
Expected results:
The installation is working fine
Additional info:
assisted-service log: https://docs.google.com/spreadsheets/d/1Jm-eZDrVz5so4BxsWpUOlr3l_90VmJ8FVEvqUwG8ltg/edit#gid=0 Job fail url: multi payload: https://qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gs/qe-private-deck/logs/periodic-ci-openshift-openshift-tests-private-release-4.16-multi-nightly-baremetal-compact-agent-ipv4-dhcp-day2-amd-mixarch-f14/1774134780246364160 arm64 payload: https://qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gs/qe-private-deck/logs/periodic-ci-openshift-openshift-tests-private-release-4.16-arm64-nightly-baremetal-pxe-ha-agent-ipv4-static-connected-f14/1773354788239446016
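For reference, a minimal sketch of steps 1 and 3 above (the release pullspec placeholder and directory name are hypothetical):
$ OPENSHIFT_INSTALL_RELEASE_IMAGE_OVERRIDE=<arm64-or-multi-nightly-pullspec> openshift-install agent create image --dir ./agent-install --log-level debug
$ openshift-install agent wait-for bootstrap-complete --dir ./agent-install --log-level debug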
Tracker issue for bootimage bump in 4.16. This issue should block issues which need a bootimage bump to fix.
The previous bump was OCPBUGS-32041.
Description of problem:
In the service details page, under the Revision and Route tabs, the user sees a "No resource found" message although a Revision and Route are created for that service
Version-Release number of selected component (if applicable):
4.15.z
How reproducible:
Always
Steps to Reproduce:
1. Install serverless operator
2. Create serving instance
3. Create knative service/function
4. Go to details page
Actual results:
User is not able to see Revision and Route created for the service
Expected results:
User should be able to see Revision and Route created for the service
Additional info:
Please review the following PR: https://github.com/openshift/prom-label-proxy/pull/359
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/operator-framework-olm/pull/630
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
HardwareDetails is a pointer and we fail to check whether it's nil. The installer panics when attempting to gather logs from masters.
This is a clone of issue OCPBUGS-33792. The following is the description of the original issue:
—
Description of problem:
The ingress operator E2E tests are perma-failing with a prometheus service account issue:
=== CONT TestAll/parallel/TestRouteMetricsControllerRouteAndNamespaceSelector route_metrics_test.go:86: prometheus service account not found
=== CONT TestAll/parallel/TestRouteMetricsControllerOnlyNamespaceSelector route_metrics_test.go:86: prometheus service account not found
=== CONT TestAll/parallel/TestRouteMetricsControllerOnlyRouteSelector route_metrics_test.go:86: prometheus service account not found
We need to bump openshift/library-go to pick up https://github.com/openshift/library-go/pull/1697, whose NewPrometheusClient function switches from using the legacy service account API to the TokenRequest API.
Version-Release number of selected component (if applicable):
4.16
How reproducible:
100%
Steps to Reproduce:
1. Run e2e-[aws|gcp|azure]-operator E2E tests on cluster-ingress-operator
Actual results:
route_metrics_test.go:86: prometheus service account not found
Expected results:
No failure
Additional info:
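For reference, a hedged sketch of what requesting a prometheus token through the TokenRequest API looks like with client-go. The openshift-monitoring namespace and prometheus-k8s service account names are assumptions about the test environment, and this is not the library-go implementation itself.

package main

import (
	"context"
	"fmt"

	authenticationv1 "k8s.io/api/authentication/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// requestPrometheusToken asks the TokenRequest API for a short-lived token bound
// to the prometheus-k8s service account, rather than reading a pre-provisioned
// service account token secret (which no longer exists by default).
func requestPrometheusToken(ctx context.Context, client kubernetes.Interface) (string, error) {
	exp := int64(3600)
	tr := &authenticationv1.TokenRequest{
		Spec: authenticationv1.TokenRequestSpec{ExpirationSeconds: &exp},
	}
	resp, err := client.CoreV1().ServiceAccounts("openshift-monitoring").
		CreateToken(ctx, "prometheus-k8s", tr, metav1.CreateOptions{})
	if err != nil {
		return "", fmt.Errorf("requesting prometheus token: %w", err)
	}
	return resp.Status.Token, nil
}

func main() {
	cfg, err := rest.InClusterConfig() // or a kubeconfig-based config in the e2e harness
	if err != nil {
		panic(err)
	}
	token, err := requestPrometheusToken(context.Background(), kubernetes.NewForConfigOrDie(cfg))
	if err != nil {
		panic(err)
	}
	fmt.Println("token length:", len(token))
}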
Update the tekton files per the migration instructions for 4.14, 4.15, & 4.16 branches.
Please review the following PR: https://github.com/openshift/ibm-vpc-block-csi-driver/pull/60
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
This is a clone of issue OCPBUGS-34869. The following is the description of the original issue:
—
Description of problem:
During IPI CAPI cluster creation, it is possible that the load balancer is currently busy. So wrap AddIPToLoadBalancerPool in a PollUntilContextCancel loop.
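A rough sketch of wrapping the call in wait.PollUntilContextCancel so a transient "load balancer busy" response is retried instead of failing the reconcile; addIPToLoadBalancerPool here is a placeholder for the real CAPI/Power VS call, not the actual implementation.

package main

import (
	"context"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// addIPToLoadBalancerPool stands in for the real cloud API call, which can return
// a transient error while the load balancer is busy.
func addIPToLoadBalancerPool(ctx context.Context, ip string) error {
	// ... call the cloud API ...
	return nil
}

// addIPWithRetry retries the call until it succeeds or the surrounding context is
// cancelled, instead of giving up on the first busy response.
func addIPWithRetry(ctx context.Context, ip string) error {
	return wait.PollUntilContextCancel(ctx, 10*time.Second, true, func(ctx context.Context) (bool, error) {
		if err := addIPToLoadBalancerPool(ctx, ip); err != nil {
			// Treat the error as transient and poll again.
			return false, nil
		}
		return true, nil
	})
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Minute)
	defer cancel()
	_ = addIPWithRetry(ctx, "192.0.2.10")
}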
Please review the following PR: https://github.com/openshift/kubernetes-autoscaler/pull/271
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
Sample job: https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-qe-ocp-qe-perfscale-ci-main-azure-4.15-nightly-x86-data-path-9nodes/1760228008968327168
Version-Release number of selected component (if applicable):
How reproducible:
Anytime there is an error from the move-blobs command
Steps to Reproduce:
1. 2. 3.
Actual results:
An error message is shown
Expected results:
A panic is shown followed by the error message
Additional info:
After some investigation, the issues we have seen with util-linux missing from some images are due to the CentOS Stream base image not installing subscription-manager
[root@92063ff10998 /]# yum install subscription-manager
CentOS Stream 9 - BaseOS 3.3 MB/s | 8.9 MB 00:02
CentOS Stream 9 - AppStream 2.1 MB/s | 17 MB 00:08
CentOS Stream 9 - Extras packages 14 kB/s | 17 kB 00:01
Dependencies resolved.
==============================================================================================================================================================================================================================================
Package Architecture Version Repository Size
==============================================================================================================================================================================================================================================
Installing:
subscription-manager aarch64 1.29.40-1.el9 baseos 911 k
Installing dependencies:
acl aarch64 2.3.1-4.el9 baseos 71 k
checkpolicy aarch64 3.6-1.el9 baseos 348 k
cracklib aarch64 2.9.6-27.el9 baseos 95 k
cracklib-dicts aarch64 2.9.6-27.el9 baseos 3.6 M
dbus aarch64 1:1.12.20-8.el9 baseos 3.7 k
dbus-broker aarch64 28-7.el9 baseos 166 k
dbus-common noarch 1:1.12.20-8.el9 baseos 15 k
dbus-libs aarch64 1:1.12.20-8.el9 baseos 150 k
diffutils aarch64 3.7-12.el9 baseos 392 k
dmidecode aarch64 1:3.3-7.el9 baseos 70 k
gobject-introspection aarch64 1.68.0-11.el9 baseos 248 k
iproute aarch64 6.2.0-5.el9 baseos 818 k
kmod-libs aarch64 28-9.el9 baseos 62 k
libbpf aarch64 2:1.3.0-2.el9 baseos 172 k
libdb aarch64 5.3.28-53.el9 baseos 712 k
libdnf-plugin-subscription-manager aarch64 1.29.40-1.el9 baseos 63 k
libeconf aarch64 0.4.1-4.el9 baseos 26 k
libfdisk aarch64 2.37.4-18.el9 baseos 150 k
libmnl aarch64 1.0.4-16.el9 baseos 28 k
libpwquality aarch64 1.4.4-8.el9 baseos 119 k
libseccomp aarch64 2.5.2-2.el9 baseos 72 k
libselinux-utils aarch64 3.6-1.el9 baseos 190 k
libuser aarch64 0.63-13.el9 baseos 405 k
libutempter aarch64 1.2.1-6.el9 baseos 27 k
openssl aarch64 1:3.2.1-1.el9 baseos 1.3 M
pam aarch64 1.5.1-19.el9 baseos 627 k
passwd aarch64 0.80-12.el9 baseos 121 k
policycoreutils aarch64 3.6-2.1.el9 baseos 242 k
policycoreutils-python-utils noarch 3.6-2.1.el9 baseos 77 k
psmisc aarch64 23.4-3.el9 baseos 243 k
python3-audit aarch64 3.1.2-2.el9 baseos 83 k
python3-chardet noarch 4.0.0-5.el9 baseos 239 k
python3-cloud-what aarch64 1.29.40-1.el9 baseos 77 k
python3-dateutil noarch 1:2.8.1-7.el9 baseos 288 k
python3-dbus aarch64 1.2.18-2.el9 baseos 144 k
python3-decorator noarch 4.4.2-6.el9 baseos 28 k
python3-distro noarch 1.5.0-7.el9 baseos 37 k
python3-dnf-plugins-core noarch 4.3.0-15.el9 baseos 264 k
python3-gobject-base aarch64 3.40.1-6.el9 baseos 184 k
python3-gobject-base-noarch noarch 3.40.1-6.el9 baseos 161 k
python3-idna noarch 2.10-7.el9.1 baseos 102 k
python3-iniparse noarch 0.4-45.el9 baseos 47 k
python3-inotify noarch 0.9.6-25.el9 baseos 53 k
python3-librepo aarch64 1.14.5-2.el9 baseos 48 k
python3-libselinux aarch64 3.6-1.el9 baseos 183 k
python3-libsemanage aarch64 3.6-1.el9 baseos 79 k
python3-policycoreutils noarch 3.6-2.1.el9 baseos 2.1 M
python3-pysocks noarch 1.7.1-12.el9 baseos 35 k
python3-requests noarch 2.25.1-8.el9 baseos 125 k
python3-setools aarch64 4.4.4-1.el9 baseos 595 k
python3-setuptools noarch 53.0.0-12.el9 baseos 944 k
python3-six noarch 1.15.0-9.el9 baseos 37 k
python3-subscription-manager-rhsm aarch64 1.29.40-1.el9 baseos 162 k
python3-systemd aarch64 234-18.el9 baseos 89 k
python3-urllib3 noarch 1.26.5-5.el9 baseos 215 k
subscription-manager-rhsm-certificates noarch 20220623-1.el9 baseos 21 k
systemd aarch64 252-33.el9 baseos 4.0 M
systemd-libs aarch64 252-33.el9 baseos 641 k
systemd-pam aarch64 252-33.el9 baseos 271 k
systemd-rpm-macros noarch 252-33.el9 baseos 69 k
usermode aarch64 1.114-4.el9 baseos 189 k
util-linux aarch64 2.37.4-18.el9 baseos 2.3 M
util-linux-core aarch64 2.37.4-18.el9 baseos 463 k
virt-what aarch64 1.25-5.el9 baseos 33 k
which aarch64 2.21-29.el9 baseos 41 k
Transaction Summary
==============================================================================================================================================================================================================================================
Install 66 Packages
Total download size: 26 M
Installed size: 92 M
Is this ok [y/N]:
subscription-manager does bring in quite a few things. We can probably get away with installing only:
systemd util-linux iproute dbus
We may still hit some edge cases where something works in OCP but doesn't in OKD due to a missing package; we have hit at least 6 or 7 containers using tools from util-linux so far.
Description of problem:
The user may provide a DNS domain that lives outside GCP. Once custom DNS is enabled, the installer should skip the DNS zone validation: level=fatal msg="failed to fetch Terraform Variables: failed to generate asset \"Terraform Variables\": failed to get GCP public zone: no matching public DNS Zone found"
Version-Release number of selected component (if applicable):
4.15.0-0.nightly-2024-02-03-192446 4.16.0-0.nightly-2024-02-03-221256
How reproducible:
Always
Steps to Reproduce:
1. Enable custom DNS on GCP: set platform.gcp.userProvisionedDNS: Enabled and featureSet: TechPreviewNoUpgrade 2. Configure a baseDomain that does not exist on GCP.
Actual results:
See description.
Expected results:
Installer should skip the validation, as the custom domain may not exist on GCP
Additional info:
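A hypothetical sketch of the expected behavior, skipping the public zone lookup entirely when userProvisionedDNS is enabled; the type and function names are illustrative, not the installer's actual code.

package main

import "fmt"

// gcpPlatform is an illustrative stand-in for the relevant install-config fields.
type gcpPlatform struct {
	UserProvisionedDNS bool
	BaseDomain         string
}

// publicZoneID only queries GCP for a matching public zone when the installer is
// expected to manage DNS; with custom DNS the baseDomain may not exist in GCP at all.
func publicZoneID(p gcpPlatform) (string, error) {
	if p.UserProvisionedDNS {
		// Custom DNS: do not require a matching GCP public zone.
		return "", nil
	}
	// ... query the GCP DNS API for a zone matching p.BaseDomain ...
	return "", fmt.Errorf("no matching public DNS Zone found for %q", p.BaseDomain)
}

func main() {
	_, err := publicZoneID(gcpPlatform{UserProvisionedDNS: true, BaseDomain: "example.corp"})
	fmt.Println(err) // <nil>: validation skipped
}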
Description of problem:
Workload hints test cases get stuck when the existing profile is similar to changes proposed in some of the test cases
Version-Release number of selected component (if applicable):
4.16
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
When we run opm on RHEL 8, we get the following errors:
./opm: /lib64/libc.so.6: version `GLIBC_2.32' not found (required by ./opm)
./opm: /lib64/libc.so.6: version `GLIBC_2.33' not found (required by ./opm)
./opm: /lib64/libc.so.6: version `GLIBC_2.34' not found (required by ./opm)
Note: this happens with 4.15.0-ec.3. The 4.14 binary works, and a binary compiled from the latest code also works.
Version-Release number of selected component (if applicable):
4.15.0-ec.3
How reproducible:
always
Steps to Reproduce:
[root@preserve-olm-env2 slavecontainer]# curl -s -k -L https://mirror2.openshift.com/pub/openshift-v4/x86_64/clients/ocp-dev-preview/candidate/opm-linux-4.15.0-ec.3.tar.gz -o opm.tar.gz && tar -xzvf opm.tar.gz opm [root@preserve-olm-env2 slavecontainer]# ./opm version ./opm: /lib64/libc.so.6: version `GLIBC_2.32' not found (required by ./opm) ./opm: /lib64/libc.so.6: version `GLIBC_2.33' not found (required by ./opm) ./opm: /lib64/libc.so.6: version `GLIBC_2.34' not found (required by ./opm) [root@preserve-olm-env2 slavecontainer]# curl -s -l -L https://mirror2.openshift.com/pub/openshift-v4/x86_64/clients/ocp/latest-4.14/opm-linux-4.14.5.tar.gz -o opm.tar.gz && tar -xzvf opm.tar.gz opm [root@preserve-olm-env2 slavecontainer]# opm version Version: version.Version{OpmVersion:"639fc1203", GitCommit:"639fc12035292dec74a16b306226946c8da404a2", BuildDate:"2023-11-21T08:03:15Z", GoOs:"linux", GoArch:"amd64"} [root@preserve-olm-env2 kuiwang]# cd operator-framework-olm/ [root@preserve-olm-env2 operator-framework-olm]# git branch gs * master release-4.10 release-4.11 release-4.12 release-4.13 release-4.8 release-4.9 [root@preserve-olm-env2 operator-framework-olm]# git pull origin master remote: Enumerating objects: 1650, done. remote: Counting objects: 100% (1650/1650), done. remote: Compressing objects: 100% (831/831), done. remote: Total 1650 (delta 727), reused 1617 (delta 711), pack-reused 0 Receiving objects: 100% (1650/1650), 2.03 MiB | 12.81 MiB/s, done. Resolving deltas: 100% (727/727), completed with 468 local objects. From github.com:openshift/operator-framework-olm * branch master -> FETCH_HEAD 639fc1203..85c579f9b master -> origin/master Updating 639fc1203..85c579f9b Fast-forward go.mod | 120 +- go.sum | 240 ++-- manifests/0000_50_olm_00-pprof-secret.yaml ... create mode 100644 vendor/google.golang.org/protobuf/types/dynamicpb/types.go [root@preserve-olm-env2 operator-framework-olm]# rm -fr bin/opm [root@preserve-olm-env2 operator-framework-olm]# make build/opm make bin/opm make[1]: Entering directory '/data/kuiwang/operator-framework-olm' go build -ldflags "-X 'github.com/operator-framework/operator-registry/cmd/opm/version.gitCommit=85c579f9be61aaea11e90b6c870452c72107300a' -X 'github.com/operator-framework/operator-registry/cmd/opm/version.opmVersion=85c579f9b' -X 'github.com/operator-framework/operator-registry/cmd/opm/version.buildDate=2023-12-11T06:12:50Z'" -mod=vendor -tags "json1" -o bin/opm github.com/operator-framework/operator-registry/cmd/opm make[1]: Leaving directory '/data/kuiwang/operator-framework-olm' [root@preserve-olm-env2 operator-framework-olm]# which opm /data/kuiwang/operator-framework-olm/bin/opm [root@preserve-olm-env2 operator-framework-olm]# opm version Version: version.Version{OpmVersion:"85c579f9b", GitCommit:"85c579f9be61aaea11e90b6c870452c72107300a", BuildDate:"2023-12-11T06:12:50Z", GoOs:"linux", GoArch:"amd64"}
Actual results:
Expected results:
Additional info:
Description of problem:
Kube apiserver pod keeps crashing when tested against v1.29 rebase
Version-Release number of selected component (if applicable):
4.16
How reproducible:
always
Steps to Reproduce:
1. Run hypershift e2e agains v1.29 rebase 2. 3.
Actual results:
Fails https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/openshift-kubernetes-1810-periodics-e2e-aws-ovn/1732734494688940032
Expected results:
Succeeds
Additional info:
Kube apiserver pod is crashlooping with: E1208 21:17:06.619997 1 run.go:74] "command failed" err="group version flowcontrol.apiserver.k8s.io/v1alpha1 that has not been registered"
Description of problem:
Cluster install fails on IBMCloud, nodes tainted with node.cloudprovider.kubernetes.io/uninitialized=true:NoSchedule
Version-Release number of selected component (if applicable):
from 4.16.0-0.nightly-2023-12-22-210021 last PASS version: 4.16.0-0.nightly-2023-12-20-061023
How reproducible:
Always
Steps to Reproduce:
1. Install a cluster on IBMCloud, we use auto flexy template: aos-4_16/ipi-on-ibmcloud/versioned-installer liuhuali@Lius-MacBook-Pro huali-test % oc get clusterversion NAME VERSION AVAILABLE PROGRESSING SINCE STATUS version False True 92m Unable to apply 4.16.0-0.nightly-2023-12-25-200355: an unknown error has occurred: MultipleErrors liuhuali@Lius-MacBook-Pro huali-test % oc get co NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE authentication baremetal cloud-controller-manager 4.16.0-0.nightly-2023-12-25-200355 True False False 89m cloud-credential cluster-autoscaler config-operator console control-plane-machine-set csi-snapshot-controller dns etcd image-registry ingress insights kube-apiserver kube-controller-manager kube-scheduler kube-storage-version-migrator machine-api machine-approver machine-config marketplace monitoring network node-tuning openshift-apiserver openshift-controller-manager openshift-samples operator-lifecycle-manager operator-lifecycle-manager-catalog operator-lifecycle-manager-packageserver service-ca storage liuhuali@Lius-MacBook-Pro huali-test % oc get node NAME STATUS ROLES AGE VERSION huliu-ibma-qbg48-master-0 NotReady control-plane,master 89m v1.29.0+b0d609f huliu-ibma-qbg48-master-1 NotReady control-plane,master 89m v1.29.0+b0d609f huliu-ibma-qbg48-master-2 NotReady control-plane,master 89m v1.29.0+b0d609f liuhuali@Lius-MacBook-Pro huali-test % oc describe node huliu-ibma-qbg48-master-0 Name: huliu-ibma-qbg48-master-0 Roles: control-plane,master Labels: beta.kubernetes.io/arch=amd64 beta.kubernetes.io/os=linux kubernetes.io/arch=amd64 kubernetes.io/hostname=huliu-ibma-qbg48-master-0 kubernetes.io/os=linux node-role.kubernetes.io/control-plane= node-role.kubernetes.io/master= node.openshift.io/os_id=rhcos Annotations: volumes.kubernetes.io/controller-managed-attach-detach: true CreationTimestamp: Wed, 27 Dec 2023 18:02:21 +0800 Taints: node-role.kubernetes.io/master:NoSchedule node.cloudprovider.kubernetes.io/uninitialized=true:NoSchedule node.kubernetes.io/not-ready:NoSchedule Unschedulable: false Lease: HolderIdentity: huliu-ibma-qbg48-master-0 AcquireTime: <unset> RenewTime: Wed, 27 Dec 2023 19:32:24 +0800 Conditions: Type Status LastHeartbeatTime LastTransitionTime Reason Message ---- ------ ----------------- ------------------ ------ ------- MemoryPressure False Wed, 27 Dec 2023 19:32:21 +0800 Wed, 27 Dec 2023 18:02:21 +0800 KubeletHasSufficientMemory kubelet has sufficient memory available DiskPressure False Wed, 27 Dec 2023 19:32:21 +0800 Wed, 27 Dec 2023 18:02:21 +0800 KubeletHasNoDiskPressure kubelet has no disk pressure PIDPressure False Wed, 27 Dec 2023 19:32:21 +0800 Wed, 27 Dec 2023 18:02:21 +0800 KubeletHasSufficientPID kubelet has sufficient PID available Ready False Wed, 27 Dec 2023 19:32:21 +0800 Wed, 27 Dec 2023 18:02:21 +0800 KubeletNotReady container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: No CNI configuration file in /etc/kubernetes/cni/net.d/. Has your network provider started? 
Addresses: Capacity: cpu: 4 ephemeral-storage: 104266732Ki hugepages-1Gi: 0 hugepages-2Mi: 0 memory: 16391716Ki pods: 250 Allocatable: cpu: 3500m ephemeral-storage: 95018478229 hugepages-1Gi: 0 hugepages-2Mi: 0 memory: 15240740Ki pods: 250 System Info: Machine ID: 0ae21a012be844f18c5871f6eaefb85b System UUID: 0ae21a01-2be8-44f1-8c58-71f6eaefb85b Boot ID: fbe619e2-8ff5-4cdb-b6a4-cd6830ccc568 Kernel Version: 5.14.0-284.45.1.el9_2.x86_64 OS Image: Red Hat Enterprise Linux CoreOS 416.92.202312250319-0 (Plow) Operating System: linux Architecture: amd64 Container Runtime Version: cri-o://1.28.2-9.rhaos4.15.git6d902a3.el9 Kubelet Version: v1.29.0+b0d609f Kube-Proxy Version: v1.29.0+b0d609f Non-terminated Pods: (0 in total) Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits Age --------- ---- ------------ ---------- --------------- ------------- --- Allocated resources: (Total limits may be over 100 percent, i.e., overcommitted.) Resource Requests Limits -------- -------- ------ cpu 0 (0%) 0 (0%) memory 0 (0%) 0 (0%) ephemeral-storage 0 (0%) 0 (0%) hugepages-1Gi 0 (0%) 0 (0%) hugepages-2Mi 0 (0%) 0 (0%) Events: Type Reason Age From Message ---- ------ ---- ---- ------- Normal NodeHasNoDiskPressure 90m (x7 over 90m) kubelet Node huliu-ibma-qbg48-master-0 status is now: NodeHasNoDiskPressure Normal NodeHasSufficientPID 90m (x7 over 90m) kubelet Node huliu-ibma-qbg48-master-0 status is now: NodeHasSufficientPID Normal NodeHasSufficientMemory 90m (x7 over 90m) kubelet Node huliu-ibma-qbg48-master-0 status is now: NodeHasSufficientMemory Normal RegisteredNode 90m node-controller Node huliu-ibma-qbg48-master-0 event: Registered Node huliu-ibma-qbg48-master-0 in Controller Normal RegisteredNode 73m node-controller Node huliu-ibma-qbg48-master-0 event: Registered Node huliu-ibma-qbg48-master-0 in Controller Normal RegisteredNode 53m node-controller Node huliu-ibma-qbg48-master-0 event: Registered Node huliu-ibma-qbg48-master-0 in Controller Normal RegisteredNode 32m node-controller Node huliu-ibma-qbg48-master-0 event: Registered Node huliu-ibma-qbg48-master-0 in Controller Normal RegisteredNode 12m node-controller Node huliu-ibma-qbg48-master-0 event: Registered Node huliu-ibma-qbg48-master-0 in Controller liuhuali@Lius-MacBook-Pro huali-test % oc get pod -n openshift-cloud-controller-manager NAME READY STATUS RESTARTS AGE ibm-cloud-controller-manager-787645668b-djqnr 0/1 CrashLoopBackOff 22 (2m29s ago) 90m ibm-cloud-controller-manager-787645668b-pgkh2 0/1 Error 15 (5m8s ago) 52m liuhuali@Lius-MacBook-Pro huali-test % oc describe pod ibm-cloud-controller-manager-787645668b-pgkh2 -n openshift-cloud-controller-manager Name: ibm-cloud-controller-manager-787645668b-pgkh2 Namespace: openshift-cloud-controller-manager Priority: 2000000000 Priority Class Name: system-cluster-critical Node: huliu-ibma-qbg48-master-2/ Start Time: Wed, 27 Dec 2023 18:41:23 +0800 Labels: infrastructure.openshift.io/cloud-controller-manager=IBMCloud k8s-app=ibm-cloud-controller-manager pod-template-hash=787645668b Annotations: operator.openshift.io/config-hash: 82a75c6ff86a490b0dac9c8c9b91f1987da0e646a42d72c33c54cbde3c29395b Status: Running IP: IPs: <none> Controlled By: ReplicaSet/ibm-cloud-controller-manager-787645668b Containers: cloud-controller-manager: Container ID: cri-o://c56e246f64c770146c30b7a894f6a4d974159551dbb9d1ea31c238e516a0f854 Image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:76aedf175591ff1675c891e5c80d02ee7425a6b3a98c34427765f402ca050218 Image ID: 
e494d0d4b28e31170a4a2792bb90701c7f1e81c78c03e3686c5f0e601801937e Port: 10258/TCP Host Port: 10258/TCP Command: /bin/bash -c #!/bin/bash set -o allexport if [[ -f /etc/kubernetes/apiserver-url.env ]]; then source /etc/kubernetes/apiserver-url.env fi exec /bin/ibm-cloud-controller-manager \ --bind-address=$(POD_IP_ADDRESS) \ --use-service-account-credentials=true \ --configure-cloud-routes=false \ --cloud-provider=ibm \ --cloud-config=/etc/ibm/cloud.conf \ --profiling=false \ --leader-elect=true \ --leader-elect-lease-duration=137s \ --leader-elect-renew-deadline=107s \ --leader-elect-retry-period=26s \ --leader-elect-resource-namespace=openshift-cloud-controller-manager \ --tls-cipher-suites=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305_SHA256,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305_SHA256,TLS_AES_128_GCM_SHA256,TLS_CHACHA20_POLY1305_SHA256,TLS_AES_256_GCM_SHA384 \ --v=2 State: Waiting Reason: CrashLoopBackOff Last State: Terminated Reason: Error Exit Code: 1 Started: Wed, 27 Dec 2023 19:33:23 +0800 Finished: Wed, 27 Dec 2023 19:33:23 +0800 Ready: False Restart Count: 15 Requests: cpu: 75m memory: 60Mi Liveness: http-get https://:10258/healthz delay=300s timeout=160s period=10s #success=1 #failure=3 Environment: POD_IP_ADDRESS: (v1:status.podIP) VPCCTL_CLOUD_CONFIG: /etc/ibm/cloud.conf VPCCTL_PUBLIC_ENDPOINT: false Mounts: /etc/ibm from cloud-conf (rw) /etc/kubernetes from host-etc-kube (ro) /etc/pki/ca-trust/extracted/pem from trusted-ca (ro) /etc/vpc from ibm-cloud-credentials (rw) /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-cbd4b (ro) Conditions: Type Status PodReadyToStartContainers True Initialized True Ready False ContainersReady False PodScheduled True Volumes: trusted-ca: Type: ConfigMap (a volume populated by a ConfigMap) Name: ccm-trusted-ca Optional: false host-etc-kube: Type: HostPath (bare host directory volume) Path: /etc/kubernetes HostPathType: Directory cloud-conf: Type: ConfigMap (a volume populated by a ConfigMap) Name: cloud-conf Optional: false ibm-cloud-credentials: Type: Secret (a volume populated by a Secret) SecretName: ibm-cloud-credentials Optional: false kube-api-access-cbd4b: Type: Projected (a volume that contains injected data from multiple sources) TokenExpirationSeconds: 3607 ConfigMapName: kube-root-ca.crt ConfigMapOptional: <nil> DownwardAPI: true ConfigMapName: openshift-service-ca.crt ConfigMapOptional: <nil> QoS Class: Burstable Node-Selectors: node-role.kubernetes.io/master= Tolerations: node-role.kubernetes.io/master:NoSchedule op=Exists node.cloudprovider.kubernetes.io/uninitialized:NoSchedule op=Exists node.kubernetes.io/memory-pressure:NoSchedule op=Exists node.kubernetes.io/not-ready:NoExecute op=Exists for 120s node.kubernetes.io/not-ready:NoSchedule op=Exists node.kubernetes.io/unreachable:NoExecute op=Exists for 120s Events: Type Reason Age From Message ---- ------ ---- ---- ------- Normal Scheduled 52m default-scheduler Successfully assigned openshift-cloud-controller-manager/ibm-cloud-controller-manager-787645668b-pgkh2 to huliu-ibma-qbg48-master-2 Normal Pulling 52m kubelet Pulling image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:76aedf175591ff1675c891e5c80d02ee7425a6b3a98c34427765f402ca050218" Normal Pulled 52m kubelet Successfully pulled image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:76aedf175591ff1675c891e5c80d02ee7425a6b3a98c34427765f402ca050218" 
in 3.431s (3.431s including waiting) Normal Created 50m (x5 over 52m) kubelet Created container cloud-controller-manager Normal Started 50m (x5 over 52m) kubelet Started container cloud-controller-manager Normal Pulled 50m (x4 over 52m) kubelet Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:76aedf175591ff1675c891e5c80d02ee7425a6b3a98c34427765f402ca050218" already present on machine Warning BackOff 2m19s (x240 over 52m) kubelet Back-off restarting failed container cloud-controller-manager in pod ibm-cloud-controller-manager-787645668b-pgkh2_openshift-cloud-controller-manager(d7f93ecf-cd14-450e-a986-028559a775b3) liuhuali@Lius-MacBook-Pro huali-test %
Actual results:
cluster install failed on IBMCloud
Expected results:
cluster install succeeds on IBMCloud
Additional info:
Description of problem:
Panic thrown by origin-tests
Version-Release number of selected component (if applicable):
How reproducible:
always
Steps to Reproduce:
1. Create aws or rosa 4.15 cluster 2. run origin tests 3.
Actual results:
time="2024-03-07T17:03:50Z" level=info msg="resulting interval message" message="{RegisteredNode Node ip-10-0-8-83.ec2.internal event: Registered Node ip-10-0-8-83.ec2.internal in Controller map[reason:RegisteredNode roles:worker]}" E0307 17:03:50.319617 71 runtime.go:79] Observed a panic: runtime.boundsError{x:24, y:23, signed:true, code:0x3} (runtime error: slice bounds out of range [24:23]) goroutine 310 [running]: k8s.io/apimachinery/pkg/util/runtime.logPanic({0x84c6f20?, 0xc006fdc588}) k8s.io/apimachinery@v0.29.0/pkg/util/runtime/runtime.go:75 +0x99 k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0xc008c38120?}) k8s.io/apimachinery@v0.29.0/pkg/util/runtime/runtime.go:49 +0x75 panic({0x84c6f20, 0xc006fdc588}) runtime/panic.go:884 +0x213 github.com/openshift/origin/pkg/monitortests/testframework/watchevents.nodeRoles(0x0?) github.com/openshift/origin/pkg/monitortests/testframework/watchevents/event.go:251 +0x1e5 github.com/openshift/origin/pkg/monitortests/testframework/watchevents.recordAddOrUpdateEvent({0x96bcc00, 0xc0076e3310}, {0x7f2a0e47a1b8, 0xc007732330}, {0x281d36d?, 0x0?}, {0x9710b50, 0xc000c5e000}, {0x9777af, 0xedd7be6b7, ...}, ...) github.com/openshift/origin/pkg/monitortests/testframework/watchevents/event.go:116 +0x41b github.com/openshift/origin/pkg/monitortests/testframework/watchevents.startEventMonitoring.func2({0x8928f00?, 0xc00b528c80}) github.com/openshift/origin/pkg/monitortests/testframework/watchevents/event.go:65 +0x185 k8s.io/client-go/tools/cache.(*FakeCustomStore).Add(0x8928f00?, {0x8928f00?, 0xc00b528c80?}) k8s.io/client-go@v0.29.0/tools/cache/fake_custom_store.go:35 +0x31 k8s.io/client-go/tools/cache.watchHandler({0x0?, 0x0?, 0xe16d020?}, {0x9694a10, 0xc006b00180}, {0x96d2780, 0xc0078afe00}, {0x96f9e28?, 0x8928f00}, 0x0, ...) k8s.io/client-go@v0.29.0/tools/cache/reflector.go:756 +0x603 k8s.io/client-go/tools/cache.(*Reflector).watch(0xc0005dcc40, {0x0?, 0x0?}, 0xc005cdeea0, 0xc005bf8c40?) k8s.io/client-go@v0.29.0/tools/cache/reflector.go:437 +0x53b k8s.io/client-go/tools/cache.(*Reflector).ListAndWatch(0xc0005dcc40, 0xc005cdeea0) k8s.io/client-go@v0.29.0/tools/cache/reflector.go:357 +0x453 k8s.io/client-go/tools/cache.(*Reflector).Run.func1() k8s.io/client-go@v0.29.0/tools/cache/reflector.go:291 +0x26 k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0x10?) k8s.io/apimachinery@v0.29.0/pkg/util/wait/backoff.go:226 +0x3e k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc007974ec0?, {0x9683f80, 0xc0078afe50}, 0x1, 0xc005cdeea0) k8s.io/apimachinery@v0.29.0/pkg/util/wait/backoff.go:227 +0xb6 k8s.io/client-go/tools/cache.(*Reflector).Run(0xc0005dcc40, 0xc005cdeea0) k8s.io/client-go@v0.29.0/tools/cache/reflector.go:290 +0x17d created by github.com/openshift/origin/pkg/monitortests/testframework/watchevents.startEventMonitoring github.com/openshift/origin/pkg/monitortests/testframework/watchevents/event.go:83 +0x6a5 panic: runtime error: slice bounds out of range [24:23] [recovered] panic: runtime error: slice bounds out of range [24:23]
Expected results:
execution of tests
Additional info:
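For illustration only, a small Go sketch showing how prefix handling with strings.CutPrefix avoids this class of slice-bounds panic; this is not the actual origin nodeRoles implementation, just the same idea under assumed inputs.

package main

import (
	"fmt"
	"strings"
)

// nodeRoles extracts node roles from labels without manual index arithmetic.
// Slicing label[len(prefix)+1:] by hand panics with "slice bounds out of range"
// when a label is exactly the prefix; strings.CutPrefix cannot.
func nodeRoles(labels map[string]string) []string {
	const prefix = "node-role.kubernetes.io/"
	var roles []string
	for k := range labels {
		if role, ok := strings.CutPrefix(k, prefix); ok && role != "" {
			roles = append(roles, role)
		}
	}
	return roles
}

func main() {
	fmt.Println(nodeRoles(map[string]string{
		"node-role.kubernetes.io/worker": "",
		"node-role.kubernetes.io/":       "", // would break naive index slicing
	}))
}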
Description of problem:
When deploying to Power VS with endpoint overrides set in the provider status, the operator will ignore the overrides.
Version-Release number of selected component (if applicable):
4.16
How reproducible:
Easily
Steps to Reproduce:
1. Set overrides in platform status 2. Deploy cluster-image-registry-operator 3. Endpoints are ignored
Actual results:
Specified endpoints are ignored
Expected results:
Specified endpoints are used
Additional info:
As a (user persona), I want to be able to:
so that I can achieve
Description of criteria:
Detail about what is specifically not being delivered in the story
This requires/does not require a design proposal.
This requires/does not require a feature gate.
Description of problem:
CNV upgrades from v4.14.1 to v4.15.0 (unreleased) are not starting due to out of sync operatorCondition.
We see:
$ oc get csv NAME DISPLAY VERSION REPLACES PHASE kubevirt-hyperconverged-operator.v4.14.1 OpenShift Virtualization 4.14.1 kubevirt-hyperconverged-operator.v4.14.0 Replacing kubevirt-hyperconverged-operator.v4.15.0 OpenShift Virtualization 4.15.0 kubevirt-hyperconverged-operator.v4.14.1 Pending
And on the v4.15.0 CSV:
$ oc get csv kubevirt-hyperconverged-operator.v4.15.0 -o yaml .... status: cleanup: {} conditions: - lastTransitionTime: "2023-12-19T01:50:48Z" lastUpdateTime: "2023-12-19T01:50:48Z" message: requirements not yet checked phase: Pending reason: RequirementsUnknown - lastTransitionTime: "2023-12-19T01:50:48Z" lastUpdateTime: "2023-12-19T01:50:48Z" message: 'operator is not upgradeable: the operatorcondition status "Upgradeable"="True" is outdated' phase: Pending reason: OperatorConditionNotUpgradeable lastTransitionTime: "2023-12-19T01:50:48Z" lastUpdateTime: "2023-12-19T01:50:48Z" message: 'operator is not upgradeable: the operatorcondition status "Upgradeable"="True" is outdated' phase: Pending reason: OperatorConditionNotUpgradeable
and if we check the pending operator condition (v4.14.1) we see:
$ oc get operatorcondition kubevirt-hyperconverged-operator.v4.14.1 -o yaml apiVersion: operators.coreos.com/v2 kind: OperatorCondition metadata: creationTimestamp: "2023-12-16T17:10:17Z" generation: 18 labels: operators.coreos.com/kubevirt-hyperconverged.openshift-cnv: "" name: kubevirt-hyperconverged-operator.v4.14.1 namespace: openshift-cnv ownerReferences: - apiVersion: operators.coreos.com/v1alpha1 blockOwnerDeletion: false controller: true kind: ClusterServiceVersion name: kubevirt-hyperconverged-operator.v4.14.1 uid: 7db79d4b-e69e-4af8-9335-6269cf004440 resourceVersion: "4116127" uid: 347306c9-865a-42b8-b2c9-69192b0e350a spec: conditions: - lastTransitionTime: "2023-12-18T18:47:23Z" message: "" reason: Upgradeable status: "True" type: Upgradeable deployments: - hco-operator - hco-webhook - hyperconverged-cluster-cli-download - cluster-network-addons-operator - virt-operator - ssp-operator - cdi-operator - hostpath-provisioner-operator - mtq-operator serviceAccounts: - hyperconverged-cluster-operator - cluster-network-addons-operator - kubevirt-operator - ssp-operator - cdi-operator - hostpath-provisioner-operator - mtq-operator - cluster-network-addons-operator - kubevirt-operator - ssp-operator - cdi-operator - hostpath-provisioner-operator - mtq-operator status: conditions: - lastTransitionTime: "2023-12-18T09:41:06Z" message: "" observedGeneration: 11 reason: Upgradeable status: "True" type: Upgradeable
where metadata.generation (18) is not in sync with status.conditions[*].observedGeneration (11).
Even manually editing spec.conditions.lastTransitionTime causes a change in metadata.generation (as expected), but this doesn't trigger any reconciliation in OLM, so status.conditions[*].observedGeneration remains at 11.
$ oc get operatorcondition kubevirt-hyperconverged-operator.v4.14.1 -o yaml apiVersion: operators.coreos.com/v2 kind: OperatorCondition metadata: creationTimestamp: "2023-12-16T17:10:17Z" generation: 19 labels: operators.coreos.com/kubevirt-hyperconverged.openshift-cnv: "" name: kubevirt-hyperconverged-operator.v4.14.1 namespace: openshift-cnv ownerReferences: - apiVersion: operators.coreos.com/v1alpha1 blockOwnerDeletion: false controller: true kind: ClusterServiceVersion name: kubevirt-hyperconverged-operator.v4.14.1 uid: 7db79d4b-e69e-4af8-9335-6269cf004440 resourceVersion: "4147472" uid: 347306c9-865a-42b8-b2c9-69192b0e350a spec: conditions: - lastTransitionTime: "2023-12-18T18:47:25Z" message: "" reason: Upgradeable status: "True" type: Upgradeable deployments: - hco-operator - hco-webhook - hyperconverged-cluster-cli-download - cluster-network-addons-operator - virt-operator - ssp-operator - cdi-operator - hostpath-provisioner-operator - mtq-operator serviceAccounts: - hyperconverged-cluster-operator - cluster-network-addons-operator - kubevirt-operator - ssp-operator - cdi-operator - hostpath-provisioner-operator - mtq-operator - cluster-network-addons-operator - kubevirt-operator - ssp-operator - cdi-operator - hostpath-provisioner-operator - mtq-operator status: conditions: - lastTransitionTime: "2023-12-18T09:41:06Z" message: "" observedGeneration: 11 reason: Upgradeable status: "True" type: Upgradeable
since its observedGeneration is out of sync, this check:
https://github.com/operator-framework/operator-lifecycle-manager/blob/master/pkg/controller/operators/olm/operatorconditions.go#L44C1-L48
fails and the upgrade never starts.
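A simplified Go sketch of that staleness check, using the metav1.Condition helpers; it mirrors the idea (the Upgradeable condition only counts when its observedGeneration matches metadata.generation) rather than the exact upstream code.

package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// upgradeableOutdated reports whether the Upgradeable status condition is stale
// relative to the OperatorCondition's metadata.generation. With generation=18 and
// observedGeneration=11 this returns true and the CSV stays Pending.
func upgradeableOutdated(generation int64, conditions []metav1.Condition) bool {
	cond := meta.FindStatusCondition(conditions, "Upgradeable")
	if cond == nil {
		return false
	}
	return cond.ObservedGeneration != 0 && cond.ObservedGeneration != generation
}

func main() {
	conds := []metav1.Condition{{
		Type:               "Upgradeable",
		Status:             metav1.ConditionTrue,
		ObservedGeneration: 11,
	}}
	fmt.Println(upgradeableOutdated(18, conds)) // true -> "Upgradeable"="True" is outdated
}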
I suspect (I'm only guessing) that it could be a regression introduced with the memory optimization for https://issues.redhat.com/browse/OCPBUGS-17157 .
Version-Release number of selected component (if applicable):
OCP 4.15.0-ec.3
How reproducible:
- Not reproducible (with the same CNV bundles) on OCP v4.14.z. - Pretty high (but not 100%) on OCP 4.15.0-ec.3
Steps to Reproduce:
1. Try triggering a CNV v4.14.1 -> v4.15.0 on OCP 4.15.0-ec.3 2. 3.
Actual results:
The OLM is not reacting to changes on spec.conditions on the pending operator condition, so metadata.generation is constantly out of sync with status.conditions[*].observedGeneration and so the CSV is reported as message: 'operator is not upgradeable: the operatorcondition status "Upgradeable"="True" is outdated' phase: Pending reason: OperatorConditionNotUpgradeable
Expected results:
OLM correctly reconciles the operatorCondition and the upgrade starts
Additional info:
Not reproducible with exactly the same bundle (origin and target) on OCP v4.14.z
Description of problem:
When the replica count for a nodepool is set to 0, the message for the nodepool is "NotFound". This message should not be displayed if the desired replica count is 0.
Version-Release number of selected component (if applicable):
How reproducible:
Create a nodepool and set the replica to 0
Steps to Reproduce:
1. Create a hosted cluster 2. Set the replica for the nodepool to 0 3.
Actual results:
NodePool message is "NotFound"
Expected results:
NodePool message should be empty
Additional info:
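A minimal sketch of the expected guard, using hypothetical names rather than the real hypershift controller code: when the desired replica count is 0, no "NotFound" message should be set.

package main

import "fmt"

// nodePoolMessage suppresses the "NotFound" message when nothing is expected to exist.
func nodePoolMessage(desiredReplicas int32, foundMachines int) string {
	if desiredReplicas == 0 {
		// Zero replicas were requested, so an empty message is correct.
		return ""
	}
	if foundMachines == 0 {
		return "NotFound"
	}
	return ""
}

func main() {
	fmt.Printf("%q\n", nodePoolMessage(0, 0)) // "" rather than "NotFound"
	fmt.Printf("%q\n", nodePoolMessage(2, 0)) // "NotFound"
}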
This is a clone of issue OCPBUGS-34020. The following is the description of the original issue:
—
Description of problem:
Mirroring sometimes fails for a variety of reasons, and when the mirror fails the current code does not generate the IDMS and ITMS files. Even if the user tries to mirror the operators two or three times, the operators do not get mirrored and no resources are created to make use of the operators that have already been mirrored. This bug is to create the IDMS and ITMS files even if mirroring fails.
Version-Release number of selected component (if applicable):
4.16
How reproducible:
Always
Steps to Reproduce:
1. Install latest oc-mirror 2. Use the ImageSetConfig.yaml below apiVersion: mirror.openshift.io/v1alpha2 kind: ImageSetConfiguration archiveSize: 4 mirror: operators: - catalog: registry.redhat.io/redhat/certified-operator-index:v4.15 full: false # only mirror the latest versions - catalog: registry.redhat.io/redhat/redhat-operator-index:v4.15 full: false # only mirror the latest versions 3. Mirror using the command `oc-mirror -c config.yaml docker://localhost:5000/m2m --dest-skip-verify=false --workspace=file://test`
Actual results:
Mirroring fails and does not generate any idms or itms files
Expected results:
IDMS and ITMS files should be generated for the mirrored operators, even if mirroring fails
Additional info:
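A hedged sketch of the requested behavior using a deferred write, so manifests are emitted for whatever was mirrored successfully even when the run as a whole fails; writeIDMSITMS and the mirrored list are placeholder stand-ins, not oc-mirror's real internals.

package main

import (
	"fmt"
	"os"
)

// mirrorOperators always emits cluster resources for the images that did make it
// across, regardless of whether a later image failed to copy.
func mirrorOperators(workspace string) (err error) {
	var mirrored []string

	defer func() {
		if werr := writeIDMSITMS(workspace, mirrored); werr != nil && err == nil {
			err = werr
		}
	}()

	// ... mirror catalogs, appending successfully copied images to `mirrored`;
	// a failure here still reaches the deferred manifest generation ...
	mirrored = append(mirrored, "registry.redhat.io/redhat/redhat-operator-index:v4.15")
	return fmt.Errorf("simulated mirror failure")
}

// writeIDMSITMS is a placeholder for generating the IDMS/ITMS manifests.
func writeIDMSITMS(workspace string, images []string) error {
	return os.WriteFile(workspace+"/idms-itms.txt", []byte(fmt.Sprintf("%v\n", images)), 0o644)
}

func main() {
	fmt.Println(mirrorOperators(os.TempDir()))
}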
Unknown machine config nodes can be listed whose names do not match any node in the current cluster. In my cluster there are 6 nodes, but I can see 10 machine config nodes.
// current node $ oc get node NAME STATUS ROLES AGE VERSION ip-10-0-12-209.us-east-2.compute.internal Ready worker 3h48m v1.28.3+59b90bd ip-10-0-23-177.us-east-2.compute.internal Ready control-plane,master 3h54m v1.28.3+59b90bd ip-10-0-32-216.us-east-2.compute.internal Ready control-plane,master 3h54m v1.28.3+59b90bd ip-10-0-42-207.us-east-2.compute.internal Ready worker 53m v1.28.3+59b90bd ip-10-0-71-71.us-east-2.compute.internal Ready worker 3h46m v1.28.3+59b90bd ip-10-0-81-190.us-east-2.compute.internal Ready control-plane,master 3h54m v1.28.3+59b90bd // current mcn $ oc get machineconfignode NAME UPDATED UPDATEPREPARED UPDATEEXECUTED UPDATEPOSTACTIONCOMPLETE UPDATECOMPLETE RESUMED ip-10-0-12-209.us-east-2.compute.internal True False False False False False ip-10-0-23-177.us-east-2.compute.internal True False False False False False ip-10-0-32-216.us-east-2.compute.internal True False False False False False ip-10-0-42-207.us-east-2.compute.internal True False False False False False ip-10-0-53-5.us-east-2.compute.internal True False False False False False ip-10-0-56-84.us-east-2.compute.internal True False False False False False ip-10-0-58-210.us-east-2.compute.internal True False False False False False ip-10-0-58-99.us-east-2.compute.internal False True True Unknown False False ip-10-0-71-71.us-east-2.compute.internal True False False False False False ip-10-0-81-190.us-east-2.compute.internal True False False False False False
Version-Release number of selected component (if applicable):
4.15.0-0.nightly-2023-12-04-162702
How reproducible:
Consistently
Steps to Reproduce:
1. setup cluster with 4.15.0-0.nightly-2023-12-04-162702 on aws 2. enable featureSet: TechPreviewNoUpgrade 3. apply file based mc few times. 4. check node list 5. check machine config node list
Actual results:
there are some unknown machine config nodes found
Expected results:
machine config node number should be same as cluster node number
Additional info:
must-gather: https://drive.google.com/file/d/1-VTismwXXZ9sYMHi8hDL7vhwzjuMn92n/view?usp=drive_link
Description of problem:
A CAPI machine cannot be deleted by the installer during cluster destroy. Checking on the GCP console, we found this machine lacks the label (kubernetes-io-cluster-clusterid: owned); if this label is added manually on the GCP console, the machine can then be deleted by the installer during cluster destroy.
Version-Release number of selected component (if applicable):
4.12.0-0.nightly-2022-10-05-053337
How reproducible:
Always
Steps to Reproduce:
1.Follow the steps here https://bugzilla.redhat.com/show_bug.cgi?id=2107999#c9 to create a capi machine liuhuali@Lius-MacBook-Pro huali-test % oc get machine.cluster.x-k8s.io -n openshift-cluster-api NAME CLUSTER NODENAME PROVIDERID PHASE AGE VERSION capi-ms-mtchm huliu-gcpx-c55vm gce://openshift-qe/us-central1-a/capi-gcp-machine-template-gcw9t Provisioned 51m 2.Destroy the cluster The cluster destroyed successfully, but checked on GCP console, found the capi machine is still there.
labels of capi machine
labels of mapi machine
Actual results:
capi machine cannot be deleted by installer during cluster destroy
Expected results:
capi machine should be deleted by installer during cluster destroy
Additional info:
We also checked on AWS; the case worked well there, and we found there is a tag (kubernetes.io/cluster/clusterid: owned) on CAPI machines.
https://github.com/openshift/console/pull/13420 updated the console to use the new OpenShift branding for the favicon, but this change was not applied to oauth-templates.
Because the installer generates some of the keys that will remain present in the cluster (e.g. the signing key for the admin kubeconfig), it should also run in an environment where FIPS is enabled.
Because it is very easy to fail to notice that the keys were generated in a non-FIPS-certified environment, we should enforce this by checking that fips_enabled is true if the target cluster is to have FIPS enabled.
Colin Walters has a patch for this.
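A minimal sketch of such a check (not the patch referenced above): read /proc/sys/crypto/fips_enabled on the installer host and fail validation when the install-config requests FIPS but the host is not running in FIPS mode. The validation wiring here is an assumption; the sysctl file is the standard FIPS indicator on Linux.

package main

import (
	"fmt"
	"os"
	"strings"
)

// hostFIPSEnabled reports whether the host kernel is running in FIPS mode.
func hostFIPSEnabled() (bool, error) {
	b, err := os.ReadFile("/proc/sys/crypto/fips_enabled")
	if err != nil {
		return false, err
	}
	return strings.TrimSpace(string(b)) == "1", nil
}

// validateFIPS enforces that FIPS clusters are only created from a FIPS-enabled host.
func validateFIPS(installConfigFIPS bool) error {
	if !installConfigFIPS {
		return nil
	}
	enabled, err := hostFIPSEnabled()
	if err != nil {
		return fmt.Errorf("cannot determine host FIPS mode: %w", err)
	}
	if !enabled {
		return fmt.Errorf("install-config requests fips: true but the installer host is not running in FIPS mode")
	}
	return nil
}

func main() {
	fmt.Println(validateFIPS(true))
}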
Description of problem:
We see failures in this test: [Jira:"Networking / router"] monitor test service-type-load-balancer-availability setup (15m1s): failed during setup: error waiting for load balancer: timed out waiting for service "service-test" to have a load balancer: timed out waiting for the condition
See https://search.ci.openshift.org/?search=error+waiting+for+load+balancer&maxAge=168h&context=1&type=bug%2Bissue%2Bjunit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job to find recent ones.
Example job: https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-azure-sdn-upgrade/1754402739040817152
This has failed payloads like:
https://amd64.ocp.releases.ci.openshift.org/releasestream/4.16.0-0.ci/release/4.16.0-0.ci-2024-02-01-211543
https://amd64.ocp.releases.ci.openshift.org/releasestream/4.15.0-0.ci/release/4.15.0-0.ci-2024-02-02-061913
https://amd64.ocp.releases.ci.openshift.org/releasestream/4.15.0-0.ci/release/4.15.0-0.ci-2024-02-02-001913
Version-Release number of selected component (if applicable):
4.15 and 4.16
How reproducible:
intermittent as shown in the search.ci query above
Steps to Reproduce:
1. run the e2e tests on 4.15 and 4.16 2. 3.
Actual results:
timeouts on getting load balancer
Expected results:
no timeout and successful load balancer
Additional info:
https://issues.redhat.com/browse/TRT-1486 has more info thread: https://redhat-internal.slack.com/archives/C01CQA76KMX/p1707142256956139
Description of problem:
When deploying to a Power VS workspace created after February 14th, 2024, the workspace will not be found by the installer.
Version-Release number of selected component (if applicable):
How reproducible:
Easily.
Steps to Reproduce:
1. Create a Power VS Workspace 2. Specify it in the install config 3. Attempt to deploy 4. Fail with "...is not a valid guid" error.
Actual results:
Failure to deploy to service instance
Expected results:
Should deploy to service instance
Additional info:
Description of problem:
When building the agent ISO with the debug log level enabled, a number of FAT32 error messages are logged. They do not hamper the ISO creation, but they make the log output very noisy and harder to read (and moreover they are not necessary).
Version-Release number of selected component (if applicable):
4.16
How reproducible:
Create an agent ISO (the configuration does not matter) with the debug log level
Steps to Reproduce:
1.$ openshift-install agent create image --log-level=debug
Actual results:
The output contains several traces like the following:
... level=debug msg=trying fat32 level=debug msg=fat32 failed: error reading MS-DOS Boot Sector: could not read FAT32 BIOS Parameter Block from boot sector: could not read embedded DOS 3.31 BPB: error reading embedded DOS 2.0 BPB: invalid sector size 37008 provided in DOS 2.0 BPB. Must be 512 level=debug msg=trying iso9660 with physical block size 0 ...
Expected results:
The above traces are not shown
Additional info:
Description of problem:
The audit-logs container for the kas, oapi, and oauth apiservers does not terminate within the TerminationGracePeriodSeconds timer, because the container does not exit when it receives SIGTERM. When testing without the audit-logs container, the oapi and oauth apiservers terminate gracefully within a 90-110 second range. The kas still does not terminate with that container gone, and I have a hunch that the konnectivity container also ignores SIGTERM (I waited 10 minutes and it still did not time out). So this issue is to change the audit-logs logic to terminate gracefully and to increase TerminationGracePeriodSeconds from the default of 30s to 120s.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
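A minimal Go sketch of a log-forwarding loop that exits promptly on SIGTERM, which is the termination behavior described above; the tailing itself is a placeholder rather than the actual sidecar implementation.

package main

import (
	"context"
	"log"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	// Tie the loop's lifetime to SIGTERM/SIGINT so the container exits inside the
	// grace period instead of waiting for the kubelet's SIGKILL.
	ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGTERM, syscall.SIGINT)
	defer stop()

	for {
		select {
		case <-ctx.Done():
			log.Println("received termination signal, flushing and exiting")
			return
		case <-time.After(time.Second):
			// ... tail and forward audit log lines ...
		}
	}
}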
User Story:
This story is used to merge a PR in the openshift/origin repository:
https://github.com/openshift/origin/pull/28382
Acceptance criteria:
Description of problem:
Installation failed on 4.16 nightly build when waiting for install-complete. API is unavailable. level=info msg=Waiting up to 20m0s (until 5:00AM UTC) for the Kubernetes API at https://api.ci-op-4sgxj8jx-8482f.qe.azure.devcluster.openshift.com:6443... level=info msg=API v1.29.2+a0beecc up level=info msg=Waiting up to 30m0s (until 5:11AM UTC) for bootstrapping to complete... api available waiting for bootstrap to complete level=info msg=Waiting up to 20m0s (until 5:01AM UTC) for the Kubernetes API at https://api.ci-op-4sgxj8jx-8482f.qe.azure.devcluster.openshift.com:6443... level=info msg=API v1.29.2+a0beecc up level=info msg=Waiting up to 30m0s (until 5:11AM UTC) for bootstrapping to complete... level=info msg=It is now safe to remove the bootstrap resources level=info msg=Time elapsed: 15m54s Copying kubeconfig to shared dir as kubeconfig-minimal level=info msg=Destroying the bootstrap resources... level=info msg=Waiting up to 40m0s (until 5:39AM UTC) for the cluster at https://api.ci-op-4sgxj8jx-8482f.qe.azure.devcluster.openshift.com:6443 to initialize... W0313 04:59:34.272442 229 reflector.go:539] k8s.io/client-go/tools/watch/informerwatcher.go:146: failed to list *v1.ClusterVersion: Get "https://api.ci-op-4sgxj8jx-8482f.qe.azure.devcluster.openshift.com:6443/apis/config.openshift.io/v1/clusterversions?fieldSelector=metadata.name%3Dversion&limit=500&resourceVersion=0": dial tcp 172.212.184.131:6443: i/o timeout I0313 04:59:34.272658 229 trace.go:236] Trace[533197684]: "Reflector ListAndWatch" name:k8s.io/client-go/tools/watch/informerwatcher.go:146 (13-Mar-2024 04:59:04.271) (total time: 30000ms): Trace[533197684]: ---"Objects listed" error:Get "https://api.ci-op-4sgxj8jx-8482f.qe.azure.devcluster.openshift.com:6443/apis/config.openshift.io/v1/clusterversions?fieldSelector=metadata.name%3Dversion&limit=500&resourceVersion=0": dial tcp 172.212.184.131:6443: i/o timeout 30000ms (04:59:34.272) ... E0313 05:38:18.669780 229 reflector.go:147] k8s.io/client-go/tools/watch/informerwatcher.go:146: Failed to watch *v1.ClusterVersion: failed to list *v1.ClusterVersion: Get "https://api.ci-op-4sgxj8jx-8482f.qe.azure.devcluster.openshift.com:6443/apis/config.openshift.io/v1/clusterversions?fieldSelector=metadata.name%3Dversion&limit=500&resourceVersion=0": dial tcp 172.212.184.131:6443: i/o timeout level=error msg=Attempted to gather ClusterOperator status after installation failure: listing ClusterOperator objects: Get "https://api.ci-op-4sgxj8jx-8482f.qe.azure.devcluster.openshift.com:6443/apis/config.openshift.io/v1/clusteroperators": dial tcp 172.212.184.131:6443: i/o timeout level=error msg=Cluster initialization failed because one or more operators are not functioning properly. 
level=error msg=The cluster should be accessible for troubleshooting as detailed in the documentation linked below, level=error msg=https://docs.openshift.com/container-platform/latest/support/troubleshooting/troubleshooting-installations.html level=error msg=The 'wait-for install-complete' subcommand can then be used to continue the installation level=error msg=failed to initialize the cluster: timed out waiting for the condition On master node, seems that kube-apiserver is not running, [root@ci-op-4sgxj8jx-8482f-hppxj-master-0 ~]# crictl ps | grep apiserver e4b6cc9622b01 ec5ccd782eb003136d9cc1df51a2b20f8a2a489d72ffb894b92f50e363c7cb90 7 minutes ago Running kube-apiserver-cert-syncer 22 3ff4af6614409 kube-apiserver-ci-op-4sgxj8jx-8482f-hppxj-master-0 1249824fe5788 ec5ccd782eb003136d9cc1df51a2b20f8a2a489d72ffb894b92f50e363c7cb90 4 hours ago Running kube-apiserver-insecure-readyz 0 3ff4af6614409 kube-apiserver-ci-op-4sgxj8jx-8482f-hppxj-master-0 ca774b07284f0 ec5ccd782eb003136d9cc1df51a2b20f8a2a489d72ffb894b92f50e363c7cb90 4 hours ago Running kube-apiserver-cert-regeneration-controller 0 3ff4af6614409 kube-apiserver-ci-op-4sgxj8jx-8482f-hppxj-master-0 2931b9a2bbabd ec5ccd782eb003136d9cc1df51a2b20f8a2a489d72ffb894b92f50e363c7cb90 4 hours ago Running openshift-apiserver-check-endpoints 0 4136bf2183de1 apiserver-7df5bb879-xx74p 0c9534aec3b6b 8c9042f97c89d8c8519d6e6235bef5a5346f08e6d7d9864ef0f228b318b4c3de 4 hours ago Running openshift-apiserver 0 4136bf2183de1 apiserver-7df5bb879-xx74p db21a2dd1df33 ec5ccd782eb003136d9cc1df51a2b20f8a2a489d72ffb894b92f50e363c7cb90 4 hours ago Running guard 0 199e1f4e665b9 kube-apiserver-guard-ci-op-4sgxj8jx-8482f-hppxj-master-0 429110f9ea5a3 6a03f3f082f3719e79087d569b3cd1e718fb670d1261fbec9504662f1005b1a5 4 hours ago Running apiserver-watcher 0 7664f480df29d apiserver-watcher-ci-op-4sgxj8jx-8482f-hppxj-master-0 [root@ci-op-4sgxj8jx-8482f-hppxj-master-1 ~]# crictl ps | grep apiserver c64187e7adcc6 ec5ccd782eb003136d9cc1df51a2b20f8a2a489d72ffb894b92f50e363c7cb90 4 hours ago Running openshift-apiserver-check-endpoints 0 1a4a5b247c28a apiserver-7df5bb879-f6v5x ff98c52402288 8c9042f97c89d8c8519d6e6235bef5a5346f08e6d7d9864ef0f228b318b4c3de 4 hours ago Running openshift-apiserver 0 1a4a5b247c28a apiserver-7df5bb879-f6v5x 2f8a97f959409 faa1b95089d101cdc907d7affe310bbff5a9aa8f92c725dc6466afc37e731927 4 hours ago Running oauth-apiserver 0 ffa2c316a0cca apiserver-97fbc599c-2ftl7 72897e30e0df0 6a03f3f082f3719e79087d569b3cd1e718fb670d1261fbec9504662f1005b1a5 4 hours ago Running apiserver-watcher 0 3b6c3849ce91f apiserver-watcher-ci-op-4sgxj8jx-8482f-hppxj-master-1 [root@ci-op-4sgxj8jx-8482f-hppxj-master-2 ~]# crictl ps | grep apiserver 04c426f07573d faa1b95089d101cdc907d7affe310bbff5a9aa8f92c725dc6466afc37e731927 4 hours ago Running oauth-apiserver 0 2172a64fb1a38 apiserver-654dcb4cc6-tq8fj 4dcca5c0e9b99 6a03f3f082f3719e79087d569b3cd1e718fb670d1261fbec9504662f1005b1a5 4 hours ago Running apiserver-watcher 0 1cd99ec327199 apiserver-watcher-ci-op-4sgxj8jx-8482f-hppxj-master-2 And found below error in kubelet log, Mar 13 06:10:15 ci-op-4sgxj8jx-8482f-hppxj-master-0 kubenswrapper[23961]: E0313 06:10:15.004656 23961 kuberuntime_manager.go:1262] container &Container{Name:kube-apiserver,Image:quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:789f242b8bc721b697e265c6f9d025f45e56e990bfd32e331c633fe0b9f076bc,Command:[/bin/bash -ec],Args:[LOCK=/var/log/kube-apiserver/.lock Mar 13 06:10:15 ci-op-4sgxj8jx-8482f-hppxj-master-0 kubenswrapper[23961]: # We should be able to acquire the 
lock immediatelly. If not, it means the init container has not released it yet and kubelet or CRI-O started container prematurely. Mar 13 06:10:15 ci-op-4sgxj8jx-8482f-hppxj-master-0 kubenswrapper[23961]: exec {LOCK_FD}>${LOCK} && flock --verbose -w 30 "${LOCK_FD}" || { Mar 13 06:10:15 ci-op-4sgxj8jx-8482f-hppxj-master-0 kubenswrapper[23961]: echo "Failed to acquire lock for kube-apiserver. Please check setup container for details. This is likely kubelet or CRI-O bug." Mar 13 06:10:15 ci-op-4sgxj8jx-8482f-hppxj-master-0 kubenswrapper[23961]: exit 1 Mar 13 06:10:15 ci-op-4sgxj8jx-8482f-hppxj-master-0 kubenswrapper[23961]: } Mar 13 06:10:15 ci-op-4sgxj8jx-8482f-hppxj-master-0 kubenswrapper[23961]: if [ -f /etc/kubernetes/static-pod-certs/configmaps/trusted-ca-bundle/ca-bundle.crt ]; then Mar 13 06:10:15 ci-op-4sgxj8jx-8482f-hppxj-master-0 kubenswrapper[23961]: echo "Copying system trust bundle ..." Mar 13 06:10:15 ci-op-4sgxj8jx-8482f-hppxj-master-0 kubenswrapper[23961]: cp -f /etc/kubernetes/static-pod-certs/configmaps/trusted-ca-bundle/ca-bundle.crt /etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem Mar 13 06:10:15 ci-op-4sgxj8jx-8482f-hppxj-master-0 kubenswrapper[23961]: fi Mar 13 06:10:15 ci-op-4sgxj8jx-8482f-hppxj-master-0 kubenswrapper[23961]: exec watch-termination --termination-touch-file=/var/log/kube-apiserver/.terminating --termination-log-file=/var/log/kube-apiserver/termination.log --graceful-termination-duration=135s --kubeconfig=/etc/kubernetes/static-pod-resources/configmaps/kube-apiserver-cert-syncer-kubeconfig/kubeconfig -- hyperkube kube-apiserver --openshift-config=/etc/kubernetes/static-pod-resources/configmaps/config/config.yaml --advertise-address=${HOST_IP} -v=2 --permit-address-sharing Mar 13 06:10:15 ci-op-4sgxj8jx-8482f-hppxj-master-0 kubenswrapper[23961]: ],WorkingDir:,Ports:[]ContainerPort{ContainerPort{Name:,HostPort:6443,ContainerPort:6443,Protocol:TCP,HostIP:,},},Env:[]EnvVar{EnvVar{Name:POD_NAME,Value:,ValueFrom:&EnvVarSource{FieldRef:&ObjectFieldSelector{APIVersion:v1,FieldPath:metadata.name,},ResourceFieldRef:nil,ConfigMapKeyRef:nil,SecretKeyRef:nil,},},EnvVar{Name:POD_NAMESPACE,Value:,ValueFrom:&EnvVarSource{FieldRef:&ObjectFieldSelector{APIVersion:v1,FieldPath:metadata.namespace,},ResourceFieldRef:nil,ConfigMapKeyRef:nil,SecretKeyRef:nil,},},EnvVar{Name:STATIC_POD_VERSION,Value:4,ValueFrom:nil,},EnvVar{Name:HOST_IP,Value:,ValueFrom:&EnvVarSource{FieldRef:&ObjectFieldSelector{APIVersion:v1,FieldPath:status.hostIP,},ResourceFieldRef:nil,ConfigMapKeyRef:nil,SecretKeyRef:nil,},},EnvVar{Name:GOGC,Value:100,ValueFrom:nil,},},Resources:ResourceRequirements{Limits:ResourceList{},Requests:ResourceList{cpu: {{265 -3} {<nil>} 265m DecimalSI},memory: {{1073741824 0} {<nil>} 1Gi BinarySI},},Claims:[]ResourceClaim{},},VolumeMounts:[]VolumeMount{VolumeMount{Name:resource-dir,ReadOnly:false,MountPath:/etc/kubernetes/static-pod-resources,SubPath:,MountPropagation:nil,SubPathExpr:,},VolumeMount{Name:cert-dir,ReadOnly:false,MountPath:/etc/kubernetes/static-pod-certs,SubPath:,MountPropagation:nil,SubPathExpr:,},VolumeMount{Name:audit-dir,ReadOnly:false,MountPath:/var/log/kube-apiserver,SubPath:,MountPropagation:nil,SubPathExpr:,},},LivenessProbe:&Probe{ProbeHandler:ProbeHandler{Exec:nil,HTTPGet:&HTTPGetAction{Path:livez,Port:{0 6443 
},Host:,Scheme:HTTPS,HTTPHeaders:[]HTTPHeader{},},TCPSocket:nil,GRPC:nil,},InitialDelaySeconds:0,TimeoutSeconds:10,PeriodSeconds:10,SuccessThreshold:1,FailureThreshold:3,TerminationGracePeriodSeconds:nil,},ReadinessProbe:&Probe{ProbeHandler:ProbeHandler{Exec:nil,HTTPGet:&HTTPGetAction{Path:readyz,Port:{0 6443 },Host:,Scheme:HTTPS,HTTPHeaders:[]HTTPHeader{},},TCPSocket:nil,GRPC:nil,},InitialDelaySeconds:0,TimeoutSeconds:10,PeriodSeconds:5,SuccessThreshold:1,FailureThreshold:1,TerminationGracePeriodSeconds:nil,},Lifecycle:nil,TerminationMessagePath:/dev/termination-log,ImagePullPolicy:IfNotPresent,SecurityContext:&SecurityContext{Capabilities:nil,Privileged:*true,SELinuxOptions:nil,RunAsUser:nil,RunAsNonRoot:nil,ReadOnlyRootFilesystem:nil,AllowPrivilegeEscalation:nil,RunAsGroup:nil,ProcMount:nil,WindowsOptions:nil,SeccompProfile:nil,},Stdin:false,StdinOnce:false,TTY:false,EnvFrom:[]EnvFromSource{},TerminationMessagePolicy:FallbackToLogsOnError,VolumeDevices:[]VolumeDevice{},StartupProbe:&Probe{ProbeHandler:ProbeHandler{Exec:nil,HTTPGet:&HTTPGetAction{Path:healthz,Port:{0 6443 },Host:,Scheme:HTTPS,HTTPHeaders:[]HTTPHeader{},},TCPSocket:nil,GRPC:nil,},InitialDelaySeconds:0,TimeoutSeconds:10,PeriodSeconds:5,SuccessThreshold:1,FailureThreshold:30,TerminationGracePeriodSeconds:nil,},ResizePolicy:[]ContainerResizePolicy{},RestartPolicy:nil,} start failed in pod kube-apiserver-ci-op-4sgxj8jx-8482f-hppxj-master-0_openshift-kube-apiserver(196e0956694ff43707b03f4585f3b6cd): CreateContainerConfigError: host IP unknown; known addresses: []
Version-Release number of selected component (if applicable):
4.16 latest nightly build
How reproducible:
frequently
Steps to Reproduce:
1. Install cluster on 4.16 nightly build 2. 3.
Actual results:
Installation failed.
Expected results:
Installation is successful.
Additional info:
Searching CI jobs, we found many jobs failed with the same error; most are on the Azure platform. https://search.dptools.openshift.org/?search=failed+to+initialize+the+cluster%3A+timed+out+waiting+for+the+condition&maxAge=48h&context=1&type=junit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job
Chinese translation in topology was invalid, see https://github.com/openshift/console/pull/13458
Description of problem:
When using an old-version "oc" client to extract a command newly introduced into the release payload, the error says the command does not support the operating system "linux", which is confusing. [root@gpei-test-rhel9 0423]# ./oc version Client Version: 4.15.10 Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3 Kubernetes Version: v1.27.12+7bee54d [root@gpei-test-rhel9 0423]# ./oc adm release extract --registry-config ~/.docker/config --command=openshift-install-fips --to ./ registry.ci.openshift.org/ocp/release:4.16.0-0.ci-2024-04-23-065741 error: command "openshift-install-fips" does not support the operating system "linux" For the oc client extracted from the same payload, it works well. [root@gpei-test-rhel9 fips]# ./oc version Client Version: 4.16.0-0.ci-2024-04-23-065741 Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3 Kubernetes Version: v1.27.12+7bee54d [root@gpei-test-rhel9 fips]# ./oc adm release extract --registry-config ~/.docker/config --command=openshift-install-fips --to ./ registry.ci.openshift.org/ocp/release:4.16.0-0.ci-2024-04-23-065741 [root@gpei-test-rhel9 fips]# ls oc openshift-install-fips The expected behavior would be an error like "command "openshift-install-fips" is not supported in current oc client" or something similar, rather than a message saying it does not support the operating system "linux".
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
Test case failure: OpenShift alerting rules [apigroup:image.openshift.io] should have description and summary annotations. The obtained response has unmarshalling errors: Failed to fetch alerting rules: unable to parse response invalid character 's' after object key
Expected output: The response should be well-formed and the unmarshalling should have worked
OpenShift Version: 4.13 & 4.14
Cloud Provider/Platform: PowerVS
Prow Job Link/Must gather path: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-multiarch-master-nightly-4.13-ocp-e2e-ovn-ppc64le-powervs/1700992665824268288/artifacts/ocp-e2e-ovn-ppc64le-powervs/
release-4.16 of openshift/cloud-provider-openstack is missing some commits that were backported in the upstream project into the release-1.29 branch.
We should import them in our downstream fork.
There is currently no way to specify AgentLabelSelector when creating a nodepool via the hypershift CLI
Please review the following PR: https://github.com/openshift/csi-node-driver-registrar/pull/57
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Component Readiness has found a potential regression in [sig-arch][Early] CRDs for openshift.io should have subresource.status [Suite:openshift/conformance/parallel].
Probability of significant regression: 98.48%
Sample (being evaluated) Release: 4.16
Start Time: 2024-03-21T00:00:00Z
End Time: 2024-03-27T23:59:59Z
Success Rate: 89.29%
Successes: 25
Failures: 3
Flakes: 0
Base (historical) Release: 4.15
Start Time: 2024-02-01T00:00:00Z
End Time: 2024-02-28T23:59:59Z
Success Rate: 99.28%
Successes: 138
Failures: 1
Flakes: 0
Examining these test failures, we found a fairly random grouping of failing tests; as a group, these are likely a significant part of why Component Readiness is reporting so much red on metal right now.
In March the metal team modified some configuration such that a portion of metal jobs can now land in a couple of new environments, one of them being ibmcloud.
The test linked above helped find the pattern: opening the spyglass chart in Prow shows a clear signature that we then found in many other failed metal jobs:
All of these failures line up within the same vertical space, indicating the problems happened at the same time, and the pod-logs section is as full as ever.
Derek Higgins has pulled ibmcloud out of rotation until they can try SSD-backed storage for etcd.
This bug is for introduction of a test that will make this symptom of etcd being very unhealthy visible as a test failure, both to communicate to engineers who look at the runs and help them understand this critical failure, and to help us locate runs affected because no single existing test can really do this today.
Azure and GCP jobs normally log these etcd warnings 3-5k times in a CI run; these ibmcloud runs were showing 30-70k. A limit of 10k was chosen based on examining the data in BigQuery; only 50 jobs have exceeded that this month, all of them metal and agent jobs.
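For illustration only, a minimal sketch of what such a threshold test could look like, assuming a helper that is handed the pooled etcd log lines for the run; the function names and the exact warning substring are assumptions, not the actual origin implementation:

package etcdcheck

import (
	"fmt"
	"strings"
)

// maxEtcdSlowWarnings mirrors the 10k limit described above.
const maxEtcdSlowWarnings = 10000

// checkEtcdOverload fails when etcd slow-apply warnings exceed the limit.
// The matched substring is illustrative; a real test would match the exact
// etcd log format.
func checkEtcdOverload(logLines []string) error {
	count := 0
	for _, line := range logLines {
		if strings.Contains(line, "apply request took too long") {
			count++
		}
	}
	if count > maxEtcdSlowWarnings {
		return fmt.Errorf("etcd logged %d slow-apply warnings (limit %d); etcd is likely overloaded", count, maxEtcdSlowWarnings)
	}
	return nil
}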
Please review the following PR: https://github.com/openshift/monitoring-plugin/pull/103
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
nothing happens when user clicks on the 'Configure' button next to AlertmanagerReceiversNotConfigured alert
Version-Release number of selected component (if applicable):
4.16.0-0.nightly-2024-03-11-041450
How reproducible:
Always
Steps to Reproduce:
1. navigate to Home -> Overview, locate the AlertmanagerReceiversNotConfigured alert in 'Status' card 2. click the 'Configure' button next to AlertmanagerReceiversNotConfigured alert
Actual results:
nothing happens
Expected results:
user should be taken to alert manager configuration page /monitoring/alertmanagerconfig
Additional info:
This is a clone of issue OCPBUGS-35731. The following is the description of the original issue:
—
Description of problem:
A ServiceAccount is not deleted due to a race condition in the controller manager. When deleting the SA, this is logged in the controller manager:
2024-06-17T15:57:47.793991942Z I0617 15:57:47.793942 1 image_pull_secret_controller.go:233] "Internal registry pull secret auth data does not contain the correct number of entries" ns="test-qtreoisu" name="sink-eguqqiwm-dockercfg-vh8mw" expected=3 actual=0 2024-06-17T15:57:47.794120755Z I0617 15:57:47.794080 1 image_pull_secret_controller.go:163] "Refreshing image pull secret" ns="test-qtreoisu" name="sink-eguqqiwm-dockercfg-vh8mw" serviceaccount="sink-eguqqiwm"
As a result, the Secret is updated and the ServiceAccount owning the Secret is updated by the controller via server-side apply operation as can be seen in the managedFields:
{ "apiVersion":"v1", "imagePullSecrets":[ { "name":"default-dockercfg-vdck9" }, { "name":"kn-test-image-pull-secret" }, { "name":"sink-eguqqiwm-dockercfg-vh8mw" } ], "kind":"ServiceAccount", "metadata":{ "annotations":{ "openshift.io/internal-registry-pull-secret-ref":"sink-eguqqiwm-dockercfg-vh8mw" }, "creationTimestamp":"2024-06-17T15:57:47Z", "managedFields":[ { "apiVersion":"v1", "fieldsType":"FieldsV1", "fieldsV1":{ "f:imagePullSecrets":{ }, "f:metadata":{ "f:annotations":{ "f:openshift.io/internal-registry-pull-secret-ref":{ } } }, "f:secrets":{ "k:{\"name\":\"sink-eguqqiwm-dockercfg-vh8mw\"}":{ } } }, "manager":"openshift.io/image-registry-pull-secrets_service-account-controller", "operation":"Apply", "time":"2024-06-17T15:57:47Z" } ], "name":"sink-eguqqiwm", "namespace":"test-qtreoisu", "resourceVersion":"104739", "uid":"eaae8d0e-8714-4c2e-9d20-c0c1a221eecc" }, "secrets":[ { "name":"sink-eguqqiwm-dockercfg-vh8mw" } ] }"Events":{ "metadata":{ }, "items":null }
The ServiceAccount then hangs there and is NOT deleted.
We have seen this only on OCP 4.16 (not on older versions), but already several times, for example in this CI run, which also has must-gather logs that can be investigated.
Another run is here
The controller code is new in 4.16 and it seems to be a regression.
Version-Release number of selected component (if applicable):
4.16.0-0.nightly-2024-06-14-130320
How reproducible:
It happens sometimes in our CI runs where we want to delete a ServiceAccount but it's hanging there. The test doesn't try to delete it again. It tries only once.
Steps to Reproduce:
The following reproducer works for me. Some service accounts keep hanging there after running the script
#!/usr/bin/env bash
kubectl create namespace test
for i in `seq 100`; do
  (
    kubectl create sa "my-sa-${i}" -n test
    kubectl wait --for=jsonpath="{.metadata.annotations.openshift\\.io/internal-registry-pull-secret-ref}" sa/my-sa-${i}
    kubectl delete sa/my-sa-${i}
    kubectl wait --for=delete sa/my-sa-${i} --timeout=60s
  )&
done
wait
Actual results:
ServiceAccount not deleted
Expected results:
ServiceAccount deleted
Additional info:
Please review the following PR: https://github.com/openshift/bond-cni/pull/62
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
This is a clone of issue OCPBUGS-33715. The following is the description of the original issue:
—
Description of problem:
console-operator is fetching the organization ID from OCM on every sync call, which is too often. We need to reduce the fetch period.
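A minimal sketch of one way to reduce the fetch period, assuming a simple time-based cache around the OCM lookup; all names here are illustrative, not the console-operator's actual code:

package ocmcache

import (
	"context"
	"sync"
	"time"
)

// orgIDCache refreshes the organization ID at most once per ttl instead of on
// every sync call. fetch stands in for the assumed OCM lookup.
type orgIDCache struct {
	mu      sync.Mutex
	value   string
	fetched time.Time
	ttl     time.Duration
	fetch   func(ctx context.Context) (string, error)
}

func (c *orgIDCache) Get(ctx context.Context) (string, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.value != "" && time.Since(c.fetched) < c.ttl {
		return c.value, nil // still fresh, skip the OCM round trip
	}
	v, err := c.fetch(ctx)
	if err != nil {
		return "", err
	}
	c.value, c.fetched = v, time.Now()
	return v, nil
}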
Version-Release number of selected component (if applicable):
4.16
How reproducible:
Always
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Please review the following PR: https://github.com/openshift/cluster-api-provider-alibaba/pull/49
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
We deprecated the default field manager in CNO, but it is still used by default in calls like Patch(). We need to update all calls to use an explicit fieldManager, and add a test to verify that deprecated managers are not used.
Since 4.14 https://github.com/openshift/cluster-network-operator/commit/db57a477b10f517bc4ae501d95cc7b8398a8755c#diff-33ef32bf6c23acb95f5902d7097b7a1d5128ca061167ec0716715b0b9eeaa5f6R31 (more specifically, since the sigs.k8s.io/controller-runtime bump) we have been exposed to this bug https://github.com/kubernetes-sigs/controller-runtime/pull/2435/commits/a6b9c0b672c77a79fff4d5bc03221af1e1fe21fa which made the default fieldManager "Go-http-client" instead of "cluster-network-operator".
This means that the "cluster-network-operator" deprecation doesn't really work, since the manager has a different name. The manager name, when unset, comes from https://github.com/kubernetes/kubernetes/blob/b85c9bbf1ac911a2a2aed2d5c1f5eaf5956cc199/staging/src/k8s.io/client-go/rest/config.go#L498 and is then cropped at https://github.com/openshift/cluster-network-operator/blob/5f18e4231f291bf5a01812974b0b4dff19c77f2c/vendor/k8s.io/apiserver/pkg/endpoints/handlers/create.go#L253-L260.
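For illustration, a hedged sketch of passing an explicit field manager with controller-runtime so the server-side apply owner is "cluster-network-operator" rather than the user-agent-derived default; the surrounding client and object are assumed, and this is not the actual CNO code:

package cnoapply

import (
	"context"

	"sigs.k8s.io/controller-runtime/pkg/client"
)

// applyWithExplicitManager applies obj with a named field manager instead of
// relying on the default name derived from the HTTP user agent.
func applyWithExplicitManager(ctx context.Context, c client.Client, obj client.Object) error {
	return c.Patch(ctx, obj, client.Apply,
		client.FieldOwner("cluster-network-operator"),
		client.ForceOwnership,
	)
}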
Identified changes needed (may be more):
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. Check CNO logs for deprecated field manager logs 2. oc logs -l name=network-operator --tail=-1 -n openshift-network-operator|grep "Depreciated field manager" 3.
Actual results:
Expected results:
Additional info:
Description of problem:
Unable to run disk to mirror in enclave environment.
Version:
WARNING: This version information is deprecated and will be replaced with the output from --short. Use --output=yaml|json to get the full version. Client Version: version.Info{Major:"", Minor:"", GitVersion:"4.16.0-202406131906.p0.g7c0889f.assembly.stream.el9-7c0889f", GitCommit:"7c0889f4bd343ccaaba5f33b7b861db29b1e5e49", GitTreeState:"clean", BuildDate:"2024-06-13T22:07:44Z", GoVersion:"go1.21.9 (Red Hat 1.21.9-1.el9_4) X:strictfipsruntime", Compiler:"gc", Platform:"linux/amd64"}
How reproducible:
Always
Steps to Reproduce:
1. Mirror to disk on internet facing machine 2. scp the tar to a disconnected machine, different folder 3. Disk to mirror
Actual results:
[ec2-user@ip-10-0-1-197 ~]$ oc-mirror --v2 -c isc_resources.yaml --from file:///home/ec2-user/entreprise-content/ docker://localhost:5000 2024/07/15 08:42:40 [WARN] : ⚠️ --v2 flag identified, flow redirected to the oc-mirror v2 version. This is Tech Preview, it is still under development and it is not production ready. 2024/07/15 08:42:40 [INFO] : 👋 Hello, welcome to oc-mirror 2024/07/15 08:42:40 [INFO] : ⚙️ setting up the environment for you... 2024/07/15 08:42:40 [INFO] : 🔀 workflow mode: diskToMirror I0715 08:42:40.646736 40155 client.go:44] Usage of the UPDATE_URL_OVERRIDE environment variable is unsupported 2024/07/15 08:49:34 [INFO] : 🕵️ going to discover the necessary images... 2024/07/15 08:49:34 [INFO] : 🔍 collecting release images... 2024/07/15 08:49:34 [ERROR] : [ReleaseImageCollector] open /home/skhoury/demo/working-dir/hold-release/ocp-release/4.15.17-x86_64/release-manifests/image-references: no such file or directory 2024/07/15 08:49:34 [INFO] : 👋 Goodbye, thank you for using oc-mirror error closing log file registry.log: close /home/ec2-user/entreprise-content/working-dir/logs/registry.log: file already closed 2024/07/15 08:49:34 [ERROR] : [ReleaseImageCollector] open /home/skhoury/demo/working-dir/hold-release/ocp-release/4.15.17-x86_64/release-manifests/image-references: no such file or directory
Expected results:
Success. The folder ReleaseImageCollector should have used is /home/ec2-user/entreprise-content/working-dir/hold-release/ocp-release/4.15.17-x86_64/release-manifests/image-references
Additional info:
This looks very much like a 'downstream a thing' process, but only making a modification to an existing one.
Currently, the operator-framework-olm monorepo generates a self-hosting catalog from operator-registry.Dockerfile. This image also contains cross-compiled opm binaries for windows and mac, and joins the payload as ose-operator-registry.
To separate concerns, this introduces a new operator-framework-cli image which will be based on scratch, not self-hosting in any way, and just a container to convey repeatably produced o-f CLIs. Right now, this will focus on opm for olm v0 only, but others can be added in future.
This fix contains the following changes coming from updated version of kubernetes up to v1.29.4:
Changelog:
v1.29.4: https://github.com/kubernetes/kubernetes/blob/release-1.29/CHANGELOG/CHANGELOG-1.29.md#changelog-since-v1293
Description of problem:
Disable the create button on the Network Attachment Definitions page for regular users. A regular user cannot create NADs but can list them; the button should be disabled just like the "Create" button on the MigrationPolicies page, and a tooltip should be added to the button.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Customer pentest shows that the Server header is returned by admin console when browsing
https://console-openshift-console$domain/locales/resource.json?lng=en&ns=plugin__odf-console
This could disclose version information that a potential attacker could map to known CVEs.
Response header:
Server: nginx/1.20.1
The management interface for idrac-redfish in the redfish BMO module is wrongly set to "ipxe", causing the error:
"Could not find the following interface in the 'ironic.hardware.interfaces.management' entrypoint: ipxe."
We need to set it to "idracRedfish".
This is a clone of issue OCPBUGS-38375. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-37782. The following is the description of the original issue:
—
Description of problem:
ci/prow/security is failing on google.golang.org/grpc/metadata
Version-Release number of selected component (if applicable):
4.15
How reproducible:
always
Steps to Reproduce:
1. run ci/pro/security job on 4.15 pr 2. 3.
Actual results:
Medium severity vulnerability found in google.golang.org/grpc/metadata
Expected results:
Additional info:
When metal3-plugin is enabled it adds Disks and NICs tabs to Nodes details page.
We would like to remove these tabs because:
Description of problem:
Currently, in oc-mirror v2, the user has no way of determining whether an error that occurs during a mirror is an actual error or a flake. We need to provide a way for the user to determine this easily, which will improve the product and user experience.
Version-Release number of selected component (if applicable):
[knarra@knarra-thinkpadx1carbon7th openshift-tests-private]$ oc-mirror version WARNING: This version information is deprecated and will be replaced with the output from --short. Use --output=yaml|json to get the full version. Client Version: version.Info{Major:"", Minor:"", GitVersion:"4.16.0-202404221110.p0.g0e2235f.assembly.stream.el9-0e2235f", GitCommit:"0e2235f4a51ce0a2d51cfc87227b1c76bc7220ea", GitTreeState:"clean", BuildDate:"2024-04-22T16:05:56Z", GoVersion:"go1.21.9 (Red Hat 1.21.9-1.el9_4) X:strictfipsruntime", Compiler:"gc", Platform:"linux/amd64"}
How reproducible:
Always
Steps to Reproduce:
1. Install latest oc-mirror v2 2. Run mirror2disk via command `oc-mirror -c <config.yaml> file://out --v2` 3. Now run disk2mirror via command `oc-mirror -c <config.yaml> --from file://out docker:<localhost:5000>/mirror
Actual results:
sometimes mirror fails with error 2024/04/24 13:15:38 [ERROR] : [Worker] err: copying image 3/3 from manifest list: reading blob sha256:418a8fe842682e4eadab6f16a6ac8d30550665a2510090aa9a29c607d5063e67: fetching blob: unauthorized: Access to the requested resource is not authorized
Expected results:
There should be a way for the user to determine whether the error was an actual error or just a transient flake. The above error is not intuitive.
Additional info:
Discussed here https://redhat-internal.slack.com/archives/C050P27C71S/p1714025295014549
Please review the following PR: https://github.com/openshift/must-gather/pull/406
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/azure-workload-identity/pull/10
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
level=error msg=failed to fetch Cluster: failed to generate asset "Cluster": failed to create cluster: failed to create bootstrap: unsupported configuration: No emulator found for arch 'x86_64'
See TODOs in https://github.com/openshift/cluster-monitoring-operator/pull/2160/files
Remove the DeleteConfigMapByNamespaceAndName util (if not used by others)
Description of problem:
An error should be printed when a single-arch image is specified with a non-matching arch via --filter-by-os
Version-Release number of selected component (if applicable):
oc version Client Version: 4.16.0-202403121314.p0.gc92b507.assembly.stream-c92b507
How reproducible:
Always
Steps to Reproduce:
1) Use `filter-by-os linux/amd64` for the image only with arch : arm64 `oc image info quay.io/openshift-release-dev/ocp-release:4.16.0-ec.4-aarch64 --filter-by-os linux/amd64 2) Use invalid `--filter-by-os linux/invalid` for the image `oc image info quay.io/openshift-release-dev/ocp-release:4.16.0-ec.4-aarch64 --filter-by-os linux/invalid`
Actual results:
1) Succeed with no error or warning oc image info quay.io/openshift-release-dev/ocp-release:4.16.0-ec.4-aarch64 --filter-by-os linux/amd64 Name: quay.io/openshift-release-dev/ocp-release:4.16.0-ec.4-aarch64 Digest: sha256:0c13de057d9f75c40999778bb924f654be1d0def980acbe8a00096e6bf18cc2a Media Type: application/vnd.docker.distribution.manifest.v2+json Created: 16d ago Image Size: 155.5MB in 5 layers Layers: 75.95MB sha256:f90c4920e095dc91c490dd9ed7920d18e0327ddedcf5e10d2887e80ccae94fd7 42.16MB sha256:a974fa00e888c491ab67f8d63456937bbaffbebb530db5ee2f9f5193fc5bb910 10.2MB sha256:c391a61f467f437cf6a0ba00c394aa4dbc107ecf56edd91a018de97ca4cd16bc 26.07MB sha256:0e78634759d2f9c988dbf5ee73a7ed9a5d3b4ec28dcad5dd9086544826bbde05 1.115MB sha256:277f2a9ba38386db697a1cbde875c1ec79988a632d006c6d697d0a79911d9955 OS: linux Arch: arm64 Entrypoint: /usr/bin/cluster-version-operator Environment: PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin container=oci GODEBUG=x509ignoreCN=0,madvdontneed=1 __doozer=merge BUILD_RELEASE=202403070215.p0.g6a76ba9.assembly.stream.el9 BUILD_VERSION=v4.16.0 OS_GIT_MAJOR=4 OS_GIT_MINOR=16 OS_GIT_PATCH=0 OS_GIT_TREE_STATE=clean OS_GIT_VERSION=4.16.0-202403070215.p0.g6a76ba9.assembly.stream.el9-6a76ba9 SOURCE_GIT_TREE_STATE=clean __doozer_group=openshift-4.16 __doozer_key=cluster-version-operator __doozer_version=v4.16.0 OS_GIT_COMMIT=6a76ba9 SOURCE_DATE_EPOCH=1709342193 SOURCE_GIT_COMMIT=6a76ba95ed441893e1bdf6616c47701c0464b7f4 SOURCE_GIT_TAG=v1.0.0-1176-g6a76ba95 SOURCE_GIT_URL=https://github.com/openshift/cluster-version-operator Labels: io.openshift.release=4.16.0-ec.4 io.openshift.release.base-image-digest=sha256:fa1b36be29e72ca5c180ce8cc599a1f0871fa5aacd3153ed4cefc84038cd439a 2) succeed with no error or warning: oc image info quay.io/openshift-release-dev/ocp-release:4.16.0-ec.4-aarch64 --filter-by-os linux/invalid Name: quay.io/openshift-release-dev/ocp-release:4.16.0-ec.4-aarch64 Digest: sha256:0c13de057d9f75c40999778bb924f654be1d0def980acbe8a00096e6bf18cc2a Media Type: application/vnd.docker.distribution.manifest.v2+json Created: 16d ago Image Size: 155.5MB in 5 layers Layers: 75.95MB sha256:f90c4920e095dc91c490dd9ed7920d18e0327ddedcf5e10d2887e80ccae94fd7 42.16MB sha256:a974fa00e888c491ab67f8d63456937bbaffbebb530db5ee2f9f5193fc5bb910 10.2MB sha256:c391a61f467f437cf6a0ba00c394aa4dbc107ecf56edd91a018de97ca4cd16bc 26.07MB sha256:0e78634759d2f9c988dbf5ee73a7ed9a5d3b4ec28dcad5dd9086544826bbde05 1.115MB sha256:277f2a9ba38386db697a1cbde875c1ec79988a632d006c6d697d0a79911d9955 OS: linux Arch: arm64 Entrypoint: /usr/bin/cluster-version-operator Environment: PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin container=oci GODEBUG=x509ignoreCN=0,madvdontneed=1 __doozer=merge BUILD_RELEASE=202403070215.p0.g6a76ba9.assembly.stream.el9 BUILD_VERSION=v4.16.0 OS_GIT_MAJOR=4 OS_GIT_MINOR=16 OS_GIT_PATCH=0 OS_GIT_TREE_STATE=clean OS_GIT_VERSION=4.16.0-202403070215.p0.g6a76ba9.assembly.stream.el9-6a76ba9 SOURCE_GIT_TREE_STATE=clean __doozer_group=openshift-4.16 __doozer_key=cluster-version-operator __doozer_version=v4.16.0 OS_GIT_COMMIT=6a76ba9 SOURCE_DATE_EPOCH=1709342193 SOURCE_GIT_COMMIT=6a76ba95ed441893e1bdf6616c47701c0464b7f4 SOURCE_GIT_TAG=v1.0.0-1176-g6a76ba95 SOURCE_GIT_URL=https://github.com/openshift/cluster-version-operator Labels: io.openshift.release=4.16.0-ec.4 io.openshift.release.base-image-digest=sha256:fa1b36be29e72ca5c180ce8cc599a1f0871fa5aacd3153ed4cefc84038cd439a [root@localhost Doc]# echo $? 0
Expected results:
1) If the image is not a manifest list, an error should be printed since there is nothing to filter, or at least a warning that this is not a manifest-list image; 2) An error should be printed for the invalid arch.
Additional info:
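As a hedged sketch of the expected behaviour, a validation along these lines could reject the flag for non-manifest-list images and malformed platforms; the function and parameter names are illustrative only, not the actual oc code:

package imagefilter

import (
	"fmt"
	"strings"
)

// validateFilterByOS errors when --filter-by-os cannot do anything useful:
// the value is malformed, or the image is a single manifest (nothing to filter)
// whose platform does not match the requested one.
func validateFilterByOS(isManifestList bool, imageOS, imageArch, filter string) error {
	if filter == "" {
		return nil
	}
	parts := strings.SplitN(filter, "/", 2)
	if len(parts) != 2 || parts[0] == "" || parts[1] == "" {
		return fmt.Errorf("--filter-by-os must be of the form os/arch, got %q", filter)
	}
	if !isManifestList && (parts[0] != imageOS || parts[1] != imageArch) {
		return fmt.Errorf("image is %s/%s and is not a manifest list; --filter-by-os=%s matches nothing", imageOS, imageArch, filter)
	}
	return nil
}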
ZTP manifests generated by the openshift-install agent create cluster-manifests command do not contain the correct Group/Version/Kind type metadata.
This means that they could not be applied to an OpenShift cluster to use with ZTP as intended.
Reproducer:
1. On a GCP cluster, create an ingress controller with internal load balancer scope, like this:
apiVersion: operator.openshift.io/v1 kind: IngressController metadata: name: foo namespace: openshift-ingress-operator spec: domain: foo.<cluster-domain> endpointPublishingStrategy: type: LoadBalancerService loadBalancer: dnsManagementPolicy: Managed scope: Internal
2. Wait for load balancer service to complete rollout
$ oc -n openshift-ingress get service router-foo NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE router-foo LoadBalancer 172.30.101.233 10.0.128.5 80:32019/TCP,443:32729/TCP 81s
3. Edit ingress controller to set spec.endpointPublishingStrategy.loadBalancer.scope to External
the load balancer service (router-foo in this case) should get an external IP address, but currently it keeps the 10.x.x.x address that was already assigned.
Please review the following PR: https://github.com/openshift/cluster-capi-operator/pull/153
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
In CNO, we need to change the cni-sysctl-allowlist daemonset to hostNetwork: true so that the multus daemonset and the cni-sysctl-allowlist daemonset upgrade successfully.
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Clone of https://issues.redhat.com/browse/OCPBUGS-32141 for 4.16
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Description of problem:
VIPs are on a different network than the machine network on a 4.14 cluster
failing cluster: 4.14
Infrastructure
--------------
Platform: VSphere
Install Type: IPI
apiServerInternalIP: 10.8.0.83
apiServerInternalIPs: 10.8.0.83
ingressIP: 10.8.0.84
ingressIPs: 10.8.0.84
All internal IP addresses of all nodes match the Machine Network.
Machine Network: 10.8.42.0/23
Node name IP Address Matches CIDR
..............................................................................................................
sv1-prd-ocp-int-bn8ln-master-0 10.8.42.24 YES
sv1-prd-ocp-int-bn8ln-master-1 10.8.42.35 YES
sv1-prd-ocp-int-bn8ln-master-2 10.8.42.36 YES
sv1-prd-ocp-int-bn8ln-worker-0-5rbwr 10.8.42.32 YES
sv1-prd-ocp-int-bn8ln-worker-0-h7fq7 10.8.42.49 YES
logs from one of the haproxy pods
oc logs -n openshift-vsphere-infra haproxy-sv1-prd-ocp-int-bn8ln-master-0 haproxy-monitor
.....
2024-04-02T18:48:57.534824711Z time="2024-04-02T18:48:57Z" level=info msg="An error occurred while trying to read master nodes details from api-vip:kube-apiserver: failed find a interface for the ip 10.8.0.83"
2024-04-02T18:48:57.534849744Z time="2024-04-02T18:48:57Z" level=info msg="Trying to read master nodes details from localhost:kube-apiserver"
2024-04-02T18:48:57.544507441Z time="2024-04-02T18:48:57Z" level=error msg="Could not retrieve subnet for IP 10.8.0.83" err="failed find a interface for the ip 10.8.0.83"
2024-04-02T18:48:57.544507441Z time="2024-04-02T18:48:57Z" level=error msg="Failed to retrieve API members information" kubeconfigPath=/var/lib/kubelet/kubeconfig
2024-04-02T18:48:57.544507441Z time="2024-04-02T18:48:57Z" level=info msg="GetLBConfig failed, sleep half of interval and retry" kubeconfigPath=/var/lib/kubelet/kubeconfig
2024-04-02T18:49:00.572652095Z time="2024-04-02T18:49:00Z" level=error msg="Could not retrieve subnet for IP 10.8.0.83" err="failed find a interface for the ip 10.8.0.83"
There is a kcs that addresses this:
https://access.redhat.com/solutions/7037425
However, this same configuration works in production on 4.12
working cluster:
Infrastructure
--------------
Platform: VSphere
Install Type: IPI
apiServerInternalIP: 10.8.0.73
apiServerInternalIPs: 10.8.0.73
ingressIP: 10.8.0.72
ingressIPs: 10.8.0.72
All internal IP addresses of all nodes match the Machine Network.
Machine Network: 10.8.38.0/23
Node name IP Address Matches CIDR
..............................................................................................................
sb1-prd-ocp-int-qls2m-cp4d-4875s 10.8.38.29 YES
sb1-prd-ocp-int-qls2m-cp4d-phczw 10.8.38.19 YES
sb1-prd-ocp-int-qls2m-cp4d-ql5sj 10.8.38.43 YES
sb1-prd-ocp-int-qls2m-cp4d-svzl7 10.8.38.27 YES
sb1-prd-ocp-int-qls2m-cp4d-x286s 10.8.38.18 YES
sb1-prd-ocp-int-qls2m-cp4d-xk48m 10.8.38.40 YES
sb1-prd-ocp-int-qls2m-master-0 10.8.38.25 YES
sb1-prd-ocp-int-qls2m-master-1 10.8.38.24 YES
sb1-prd-ocp-int-qls2m-master-2 10.8.38.30 YES
sb1-prd-ocp-int-qls2m-worker-njzdx 10.8.38.15 YES
sb1-prd-ocp-int-qls2m-worker-rhqn5 10.8.38.39 YES
logs from one of the haproxy pods
2023-08-18T21:12:19.730010034Z time="2023-08-18T21:12:19Z" level=info msg="API is not reachable through HAProxy"
2023-08-18T21:12:19.755357706Z time="2023-08-18T21:12:19Z" level=info msg="Config change detected" configChangeCtr=1 curConfig="{6443 9445 29445 [
{sb1-prd-ocp-int-qls2m-master-0 10.8.38.25 6443} {sb1-prd-ocp-int-qls2m-master-2 10.8.38.30 6443}] }"
{sb1-prd-ocp-int-qls2m-master-2 10.8.38.30 6443}
] }"
The data is being redirected
found this in the sos report: sos_commands/firewall_tables/
nft_-a_list_ruleset
table ip nat { # handle 2
chain PREROUTING
chain INPUT
{ # handle 2 type nat hook input priority 100; policy accept; }chain POSTROUTING
{ # handle 3 type nat hook postrouting priority srcnat; policy accept; counter packets 245475292 bytes 16221809463 jump OVN-KUBE-EGRESS-SVC # handle 25 oifname "ovn-k8s-mp0" counter packets 58115015 bytes 4184247096 jump OVN-KUBE-SNAT-MGMTPORT # handle 16 counter packets 187360548 bytes 12037581317 jump KUBE-POSTROUTING # handle 10 }chain OUTPUT
{ # handle 4 type nat hook output priority -100; policy accept; oifname "lo" meta l4proto tcp ip daddr 10.8.0.73 tcp dport 6443 counter packets 0 bytes 0 redirect to :9445 # handle 67 counter packets 245122162 bytes 16200621351 jump OVN-KUBE-EXTERNALIP # handle 29 counter packets 245122163 bytes 16200621411 jump OVN-KUBE-NODEPORT # handle 27 counter packets 245122166 bytes 16200621591 jump OVN-KUBE-ITP # handle 24 }... many more lines ...
This code was not added by the customer
None of the redirect statements are in the same file for 4.14 (the failing cluster)
ocp 4.14 (if applicable)
How reproducible: 100%
Steps to Reproduce: This is the install script that our ansible job uses to install 4.12. If you need it cleared up let me know; all the items in {{}} are just variables for file paths.
cp -r {{ item.0.cluster_name }}/install-config.yaml {{ openshift_base }}{{ item.0.cluster_name }}/
./openshift-install create manifests --dir {{ openshift_base }}{{ item.0.cluster_name }}/
cp -r machineconfigs/* {{ openshift_base }}{{ item.0.cluster_name }}/openshift/
cp -r {{ item.0.cluster_name }}/customizations/* {{ openshift_base }}{{ item.0.cluster_name }}/openshift/
./openshift-install create ignition-configs --dir {{ openshift_base }}{{ item.0.cluster_name }}/
./openshift-install create cluster --dir {{ openshift_base }}{{ item.0.cluster_name }} --log-level=debug
We are installing IPI on VMware. API and Ingress VIPs are configured on our external load balancer appliance (Citrix ADCs, if that matters).
Actual results:
haproxy pods crashloop and do not work. In 4.14, following the same install workflow, neither the API nor the Ingress VIP binds to masters or workers, and we see HAProxy crashlooping.
Expected results:
For 4.12: following completion of a 4.12 install, if we look in VMware at our master and worker nodes, all of them have an IP address from the machine network assigned to them, and one node among the masters and one among the workers has the corresponding VIP bound to it as well.
Additional info:
Description of problem:
In a dualstack cluster, we use IPv4 URLs for callbacks to Ironic and Inspector. If the host only has IPv6 networking, the provisioning will fail. This issue affects both the normal IPI and ZTP with the converged flow.
A similar issue has been fixed as part of METAL-163 where we use the BMC's address family to determine which URL to send to it. This bug is somewhat simpler: we can provide IPA with several URLs and let it decide which one it can use. This way, only small changes to IPA itself, ICC and CBO are required.
The fix will only affect virtual media deployments without provisioning network or with virtualMediaViaExternalNetwork:true. We don't have a good dualstack story around provisioning networks anyway.
Upstream IPA request: https://bugs.launchpad.net/ironic/+bug/2045548
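A minimal sketch of the approach described above, handing IPA a list of callback URLs covering both address families so the agent can pick the one it can reach; the function, hosts, and URL shapes are illustrative assumptions, not the actual ICC/CBO code:

package ipacallbacks

import "fmt"

// callbackURLs returns one URL per configured address family; an IPv6-only
// host can then ignore the IPv4 entry and vice versa.
func callbackURLs(v4Host, v6Host string, port int) []string {
	var urls []string
	if v4Host != "" {
		urls = append(urls, fmt.Sprintf("http://%s:%d", v4Host, port))
	}
	if v6Host != "" {
		urls = append(urls, fmt.Sprintf("http://[%s]:%d", v6Host, port))
	}
	return urls
}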
This is a clone of issue OCPBUGS-41908. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-39246. The following is the description of the original issue:
—
Description of problem:
Alerts with non-standard severity labels are sent to Telemeter.
Version-Release number of selected component (if applicable):
All supported versions
How reproducible:
Always
Steps to Reproduce:
1. Create an always firing alerting rule with severity=foo. 2. Make sure that telemetry is enabled for the cluster. 3.
Actual results:
The alert can be seen on the telemeter server side.
Expected results:
The alert is dropped by the telemeter allow-list.
Additional info:
Red Hat operators should use standard severities: https://github.com/openshift/enhancements/blob/master/enhancements/monitoring/alerting-consistency.md#style-guide Looking at the current data, it looks like ~2% of the alerts reported to Telemeter have an invalid severity.
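For illustration, a hedged sketch of the kind of check a rule-consistency test (or a guard in front of the telemeter allow-list) could apply; the severity set comes from the style guide linked above, everything else is an assumption:

package alertlint

import "fmt"

// standardSeverities are the values allowed by the alerting style guide.
var standardSeverities = map[string]bool{
	"critical": true,
	"warning":  true,
	"info":     true,
}

// checkSeverity flags alerting rules whose severity label is non-standard and
// should therefore never reach Telemeter.
func checkSeverity(alertName, severity string) error {
	if !standardSeverities[severity] {
		return fmt.Errorf("alert %s has non-standard severity %q", alertName, severity)
	}
	return nil
}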
Description of problem:
console-config sets telemeterClientDisabled: true even though the telemeter client is NOT disabled
Version-Release number of selected component (if applicable):
a cluster launched by image built with cluster-bot: build 4.16-ci,openshift/console#13677,openshift/console-operator#877
How reproducible:
Always
Steps to Reproduce:
1. Check if telemeter client is enabled $ oc -n openshift-monitoring get pod | grep telemeter-clienttelemeter-client-7cc8bf56db-7wcs5 3/3 Running 0 83m $ oc get cm cluster-monitoring-config -n openshift-monitoring Error from server (NotFound): configmaps "cluster-monitoring-config" not found 2. Check console-config settings $ oc get cm console-config -n openshift-console -o yaml apiVersion: v1 data: console-config.yaml: | apiVersion: console.openshift.io/v1 auth: authType: openshift clientID: console clientSecretFile: /var/oauth-config/clientSecret oauthEndpointCAFile: /var/oauth-serving-cert/ca-bundle.crt clusterInfo: consoleBaseAddress: https://xxxxx controlPlaneTopology: HighlyAvailable masterPublicURL: https://xxxxx:6443 nodeArchitectures: - amd64 nodeOperatingSystems: - linux releaseVersion: 4.16.0-0.test-2024-03-18-024238-ci-ln-0q7bq2t-latest customization: branding: ocp documentationBaseURL: https://access.redhat.com/documentation/en-us/openshift_container_platform/4.16/ kind: ConsoleConfig monitoringInfo: alertmanagerTenancyHost: alertmanager-main.openshift-monitoring.svc:9092 alertmanagerUserWorkloadHost: alertmanager-main.openshift-monitoring.svc:9094 plugins: monitoring-plugin: https://monitoring-plugin.openshift-monitoring.svc.cluster.local:9443/ providers: {} servingInfo: bindAddress: https://[::]:8443 certFile: /var/serving-cert/tls.crt keyFile: /var/serving-cert/tls.key session: {} telemetry: telemeterClientDisabled: "true" kind: ConfigMap metadata: creationTimestamp: "2024-03-19T01:20:23Z" labels: app: console name: console-config namespace: openshift-console resourceVersion: "27723" uid: 2f9282c3-1c4a-4400-9908-4e70025afc33
Actual results:
in cm/console-config, telemeterClientDisabled is set with 'true'
Expected results:
The telemeterClientDisabled property should reflect the real status of the telemeter client. The telemeter client is not disabled because: 1. the telemeter-client pod is running; 2. the user did not disable the telemeter client manually, since the 'cluster-monitoring-config' configmap doesn't exist.
Additional info:
This is a clone of issue OCPBUGS-42200. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-41920. The following is the description of the original issue:
—
Description of problem:
When we move one node from one custom MCP to another custom MCP, the MCPs are reporting a wrong number of nodes. For example, we reach this situation (worker-perf MCP is not reporting the right number of nodes) $ oc get mcp,nodes NAME CONFIG UPDATED UPDATING DEGRADED MACHINECOUNT READYMACHINECOUNT UPDATEDMACHINECOUNT DEGRADEDMACHINECOUNT AGE machineconfigpool.machineconfiguration.openshift.io/master rendered-master-c8d23b071e1ccf6cf85c7f1b31c0def6 True False False 3 3 3 0 142m machineconfigpool.machineconfiguration.openshift.io/worker rendered-worker-36ee1fdc485685ac9c324769889c3348 True False False 1 1 1 0 142m machineconfigpool.machineconfiguration.openshift.io/worker-perf rendered-worker-perf-6b5fbffac62c3d437e307e849c44b556 True False False 2 2 2 0 24m machineconfigpool.machineconfiguration.openshift.io/worker-perf-canary rendered-worker-perf-canary-6b5fbffac62c3d437e307e849c44b556 True False False 1 1 1 0 7m52s NAME STATUS ROLES AGE VERSION node/ip-10-0-13-228.us-east-2.compute.internal Ready worker,worker-perf-canary 138m v1.30.4 node/ip-10-0-2-250.us-east-2.compute.internal Ready control-plane,master 145m v1.30.4 node/ip-10-0-34-223.us-east-2.compute.internal Ready control-plane,master 144m v1.30.4 node/ip-10-0-35-61.us-east-2.compute.internal Ready worker,worker-perf 136m v1.30.4 node/ip-10-0-79-232.us-east-2.compute.internal Ready control-plane,master 144m v1.30.4 node/ip-10-0-86-124.us-east-2.compute.internal Ready worker 139m v1.30.4 After 20 minutes or half an hour the MCPs start reporting the right number of nodes
Version-Release number of selected component (if applicable):
IPI on AWS version:
$ oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.17.0-0.nightly-2024-09-13-040101 True False 124m Cluster version is 4.17.0-0.nightly-2024-09-13-040101
How reproducible:
Always
Steps to Reproduce:
1. Create a MCP oc create -f - << EOF apiVersion: machineconfiguration.openshift.io/v1 kind: MachineConfigPool metadata: name: worker-perf spec: machineConfigSelector: matchExpressions: - { key: machineconfiguration.openshift.io/role, operator: In, values: [worker,worker-perf] } nodeSelector: matchLabels: node-role.kubernetes.io/worker-perf: "" EOF 2. Add 2 nodes to the MCP $ oc label node $(oc get nodes -l node-role.kubernetes.io/worker -ojsonpath="{.items[0].metadata.name}") node-role.kubernetes.io/worker-perf= $ oc label node $(oc get nodes -l node-role.kubernetes.io/worker -ojsonpath="{.items[1].metadata.name}") node-role.kubernetes.io/worker-perf= 3. Create another MCP oc create -f - << EOF apiVersion: machineconfiguration.openshift.io/v1 kind: MachineConfigPool metadata: name: worker-perf-canary spec: machineConfigSelector: matchExpressions: - { key: machineconfiguration.openshift.io/role, operator: In, values: [worker,worker-perf,worker-perf-canary] } nodeSelector: matchLabels: node-role.kubernetes.io/worker-perf-canary: "" EOF 3. Move one node from the MCP created in step 1 to the MCP created in step 3 $ oc label node $(oc get nodes -l node-role.kubernetes.io/worker -ojsonpath="{.items[0].metadata.name}") node-role.kubernetes.io/worker-perf-canary= $ oc label node $(oc get nodes -l node-role.kubernetes.io/worker -ojsonpath="{.items[0].metadata.name}") node-role.kubernetes.io/worker-perf-
Actual results:
The worker-perf pool is not reporting the right number of nodes. It continues reporting 2 nodes even though one of them was moved to the worker-perf-canary MCP. $ oc get mcp,nodes NAME CONFIG UPDATED UPDATING DEGRADED MACHINECOUNT READYMACHINECOUNT UPDATEDMACHINECOUNT DEGRADEDMACHINECOUNT AGE machineconfigpool.machineconfiguration.openshift.io/master rendered-master-c8d23b071e1ccf6cf85c7f1b31c0def6 True False False 3 3 3 0 142m machineconfigpool.machineconfiguration.openshift.io/worker rendered-worker-36ee1fdc485685ac9c324769889c3348 True False False 1 1 1 0 142m machineconfigpool.machineconfiguration.openshift.io/worker-perf rendered-worker-perf-6b5fbffac62c3d437e307e849c44b556 True False False 2 2 2 0 24m machineconfigpool.machineconfiguration.openshift.io/worker-perf-canary rendered-worker-perf-canary-6b5fbffac62c3d437e307e849c44b556 True False False 1 1 1 0 7m52s NAME STATUS ROLES AGE VERSION node/ip-10-0-13-228.us-east-2.compute.internal Ready worker,worker-perf-canary 138m v1.30.4 node/ip-10-0-2-250.us-east-2.compute.internal Ready control-plane,master 145m v1.30.4 node/ip-10-0-34-223.us-east-2.compute.internal Ready control-plane,master 144m v1.30.4 node/ip-10-0-35-61.us-east-2.compute.internal Ready worker,worker-perf 136m v1.30.4 node/ip-10-0-79-232.us-east-2.compute.internal Ready control-plane,master 144m v1.30.4 node/ip-10-0-86-124.us-east-2.compute.internal Ready worker 139m v1.30.4
Expected results:
MCPs should always report the right number of nodes
Additional info:
It is very similar to this other issue https://bugzilla.redhat.com/show_bug.cgi?id=2090436 That was discussed in this slack conversation https://redhat-internal.slack.com/archives/C02CZNQHGN8/p1653479831004619
This is a clone of issue OCPBUGS-38941. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-38925. The following is the description of the original issue:
—
Description of problem:
periodics are failing due to a change in coreos.
Version-Release number of selected component (if applicable):
4.15,4.16,4.17,4.18
How reproducible:
100%
Steps to Reproduce:
1. Check any periodic conformance jobs 2. 3.
Actual results:
periodic conformance fails with hostedcluster creation
Expected results:
periodic conformance test succeeds
Additional info:
This is a clone of issue OCPBUGS-34354. The following is the description of the original issue:
—
Description of problem:
Update the PowerVS CAPI provider to v0.8.0
Description of problem:
Bootstrap process fails. When attempting to gather logs, the process fails. The SSH connection was refused.
Version-Release number of selected component (if applicable):
How reproducible:
Always, when the bootstrap process fails
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
This is a clone of issue OCPBUGS-44047. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-42660. The following is the description of the original issue:
—
There were remaining issues from the original issue. A new bug has been opened to address this. This is a clone of issue OCPBUGS-32947. The following is the description of the original issue:
—
Description of problem:
[vSphere] network.devices, template and workspace will be cleared when deleting the controlplanemachineset, updating these fields will not trigger an update
Version-Release number of selected component (if applicable):
4.16.0-0.nightly-2024-04-23-032717
How reproducible:
Always
Steps to Reproduce:
1.Install a vSphere 4.16 cluster, we use automated template: ipi-on-vsphere/versioned-installer liuhuali@Lius-MacBook-Pro huali-test % oc get clusterversion NAME VERSION AVAILABLE PROGRESSING SINCE STATUS version 4.16.0-0.nightly-2024-04-23-032717 True False 24m Cluster version is 4.16.0-0.nightly-2024-04-23-032717 2.Check the controlplanemachineset, you can see network.devices, template and workspace have value. liuhuali@Lius-MacBook-Pro huali-test % oc get controlplanemachineset NAME DESIRED CURRENT READY UPDATED UNAVAILABLE STATE AGE cluster 3 3 3 3 Active 51m liuhuali@Lius-MacBook-Pro huali-test % oc get controlplanemachineset cluster -oyaml apiVersion: machine.openshift.io/v1 kind: ControlPlaneMachineSet metadata: creationTimestamp: "2024-04-25T02:52:11Z" finalizers: - controlplanemachineset.machine.openshift.io generation: 1 labels: machine.openshift.io/cluster-api-cluster: huliu-vs425c-f5tfl name: cluster namespace: openshift-machine-api resourceVersion: "18273" uid: f340d9b4-cf57-4122-b4d4-0f45f20e4d79 spec: replicas: 3 selector: matchLabels: machine.openshift.io/cluster-api-cluster: huliu-vs425c-f5tfl machine.openshift.io/cluster-api-machine-role: master machine.openshift.io/cluster-api-machine-type: master state: Active strategy: type: RollingUpdate template: machineType: machines_v1beta1_machine_openshift_io machines_v1beta1_machine_openshift_io: failureDomains: platform: VSphere vsphere: - name: generated-failure-domain metadata: labels: machine.openshift.io/cluster-api-cluster: huliu-vs425c-f5tfl machine.openshift.io/cluster-api-machine-role: master machine.openshift.io/cluster-api-machine-type: master spec: lifecycleHooks: {} metadata: {} providerSpec: value: apiVersion: machine.openshift.io/v1beta1 credentialsSecret: name: vsphere-cloud-credentials diskGiB: 120 kind: VSphereMachineProviderSpec memoryMiB: 16384 metadata: creationTimestamp: null network: devices: - networkName: devqe-segment-221 numCPUs: 4 numCoresPerSocket: 4 snapshot: "" template: huliu-vs425c-f5tfl-rhcos-generated-region-generated-zone userDataSecret: name: master-user-data workspace: datacenter: DEVQEdatacenter datastore: /DEVQEdatacenter/datastore/vsanDatastore folder: /DEVQEdatacenter/vm/huliu-vs425c-f5tfl resourcePool: /DEVQEdatacenter/host/DEVQEcluster/Resources server: vcenter.devqe.ibmc.devcluster.openshift.com status: conditions: - lastTransitionTime: "2024-04-25T02:59:37Z" message: "" observedGeneration: 1 reason: AsExpected status: "False" type: Error - lastTransitionTime: "2024-04-25T03:03:45Z" message: "" observedGeneration: 1 reason: AllReplicasAvailable status: "True" type: Available - lastTransitionTime: "2024-04-25T03:03:45Z" message: "" observedGeneration: 1 reason: AsExpected status: "False" type: Degraded - lastTransitionTime: "2024-04-25T03:01:04Z" message: "" observedGeneration: 1 reason: AllReplicasUpdated status: "False" type: Progressing observedGeneration: 1 readyReplicas: 3 replicas: 3 updatedReplicas: 3 3.Delete the controlplanemachineset, it will recreate a new one, but those three fields that had values before are now cleared. 
liuhuali@Lius-MacBook-Pro huali-test % oc delete controlplanemachineset cluster controlplanemachineset.machine.openshift.io "cluster" deleted liuhuali@Lius-MacBook-Pro huali-test % oc get controlplanemachineset NAME DESIRED CURRENT READY UPDATED UNAVAILABLE STATE AGE cluster 3 3 3 3 Inactive 6s liuhuali@Lius-MacBook-Pro huali-test % oc get controlplanemachineset cluster -oyaml apiVersion: machine.openshift.io/v1 kind: ControlPlaneMachineSet metadata: creationTimestamp: "2024-04-25T03:45:51Z" finalizers: - controlplanemachineset.machine.openshift.io generation: 1 name: cluster namespace: openshift-machine-api resourceVersion: "46172" uid: 45d966c9-ec95-42e1-b8b0-c4945ea58566 spec: replicas: 3 selector: matchLabels: machine.openshift.io/cluster-api-cluster: huliu-vs425c-f5tfl machine.openshift.io/cluster-api-machine-role: master machine.openshift.io/cluster-api-machine-type: master state: Inactive strategy: type: RollingUpdate template: machineType: machines_v1beta1_machine_openshift_io machines_v1beta1_machine_openshift_io: failureDomains: platform: VSphere vsphere: - name: generated-failure-domain metadata: labels: machine.openshift.io/cluster-api-cluster: huliu-vs425c-f5tfl machine.openshift.io/cluster-api-machine-role: master machine.openshift.io/cluster-api-machine-type: master spec: lifecycleHooks: {} metadata: {} providerSpec: value: apiVersion: machine.openshift.io/v1beta1 credentialsSecret: name: vsphere-cloud-credentials diskGiB: 120 kind: VSphereMachineProviderSpec memoryMiB: 16384 metadata: creationTimestamp: null network: devices: null numCPUs: 4 numCoresPerSocket: 4 snapshot: "" template: "" userDataSecret: name: master-user-data workspace: {} status: conditions: - lastTransitionTime: "2024-04-25T03:45:51Z" message: "" observedGeneration: 1 reason: AsExpected status: "False" type: Error - lastTransitionTime: "2024-04-25T03:45:51Z" message: "" observedGeneration: 1 reason: AllReplicasAvailable status: "True" type: Available - lastTransitionTime: "2024-04-25T03:45:51Z" message: "" observedGeneration: 1 reason: AsExpected status: "False" type: Degraded - lastTransitionTime: "2024-04-25T03:45:51Z" message: "" observedGeneration: 1 reason: AllReplicasUpdated status: "False" type: Progressing observedGeneration: 1 readyReplicas: 3 replicas: 3 updatedReplicas: 3 4.I active the controlplanemachineset and it does not trigger an update, I continue to add these field values back and it does not trigger an update, I continue to edit these fields to add a second network device and it still does not trigger an update. network: devices: - networkName: devqe-segment-221 - networkName: devqe-segment-222 By the way, I can create worker machines with other network device or two network devices. huliu-vs425c-f5tfl-worker-0a-ldbkh Running 81m huliu-vs425c-f5tfl-worker-0aa-r8q4d Running 70m
Actual results:
network.devices, template and workspace will be cleared when deleting the controlplanemachineset, updating these fields will not trigger an update
Expected results:
The field values should not be changed when deleting the controlplanemachineset, and updating these fields should trigger an update; alternatively, if these fields are not meant to be modified, then modifying them on the controlplanemachineset should not take effect, as the current inconsistency is confusing.
Additional info:
Must gather: https://drive.google.com/file/d/1mHR31m8gaNohVMSFqYovkkY__t8-E30s/view?usp=sharing
Sometimes user manifests could be the source of problems, and right now they're not included in the logs archive downloaded for an Assisted cluster from the UI.
Currently we can only see the feature usage "Custom manifest" in metadata.json, but that only tells us the user has custom manifests, not which manifests they are or what their values are.
CNO-managed component (network-node-identity) to conform to the HyperShift control plane expectation that all secrets should be mounted without global read permission: change the mode from 420 (0644) to 416 (0640).
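A minimal sketch of what that change looks like in the rendered pod spec, assuming the secret is mounted via a standard volume (names are illustrative): setting DefaultMode to 0640 (decimal 416) removes the world-readable bit that the default 0644 (decimal 420) grants.

package hcpvolumes

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/utils/ptr"
)

// secretVolume mounts the named secret with 0640 (416) permissions instead of
// the default 0644 (420), so the files are not globally readable.
func secretVolume(name, secretName string) corev1.Volume {
	return corev1.Volume{
		Name: name,
		VolumeSource: corev1.VolumeSource{
			Secret: &corev1.SecretVolumeSource{
				SecretName:  secretName,
				DefaultMode: ptr.To[int32](0o640),
			},
		},
	}
}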
revert: https://github.com/openshift/origin/pull/28603
output looks like:
INFO[2024-02-16T00:25:04Z] time="2024-02-16T00:20:41Z" level=error msg="disruption sample failed: error running request: 403 Forbidden: http://www.w3.org/TR/html4/strict.dtd\">\n\n\n\n\n\n\n \n ERROR \n The requested URL could not be retrieved \n \n \n\n \n The following error was encountered while trying to retrieve the URL: http://35.212.33.188/health\">http://35.212.33.188/health \n\n \n Access Denied. \n \n\n Access control configuration prevents your request from being allowed at this time. Please contact your service provider if you feel this is incorrect. \n\n Your cache administrator is root. \n \n \n\n \n \n Generated Fri, 16 Feb 2024 00:20:41 GMT by ofcir-3329d9226457452fb2040e269776e3a5 (squid/5.2) \n\n \n\n" auditID=facdfd31-51e5-4812-a356-6e4b0e30cd38 backend=gcp-network-liveness-reused-connections this-instance="{Disruption map[backend-disruption-name:gcp-network-liveness-reused-connections connection:reused disruption:openshift-tests]}" type=reused time="2024-02-16T00:20:41Z" level=error msg="disruption sample failed: error running request: 403 Forbidden: http://www.w3.org/TR/html4/strict.dtd\">\n\n\n\n\n\n\n \n ERROR
Description of problem:
gatewayConfig.ipForwarding accepts arbitrary invalid values, but it should enforce "", "Restricted" or "Global"
You can currently even do really funky stuff with that:
oc edit network.operator/cluster (...) 15 spec: 16 clusterNetwork: 17 - cidr: 10.128.0.0/14 18 hostPrefix: 23 19 - cidr: fd01::/48 20 hostPrefix: 64 21 defaultNetwork: 22 ovnKubernetesConfig: 23 egressIPConfig: {} 24 gatewayConfig: 25 ipForwarding: $(echo 'Im injected'; lscpu)
$ oc get pods -n openshift-ovn-kubernetes ovnkube-node-24628 -o yaml | grep sysctl -C5 fi # If IP Forwarding mode is global set it in the host here. ip_forwarding_flag= if [ "$(echo 'Im injected'; lscpu)" == "Global" ]; then sysctl -w net.ipv4.ip_forward=1 sysctl -w net.ipv6.conf.all.forwarding=1 else ip_forwarding_flag="--disable-forwarding" fi NETWORK_NODE_IDENTITY_ENABLE=
$ oc logs -n openshift-ovn-kubernetes ovnkube-node-24628 -c ovnkube-controller | grep inje -A5 ++ echo 'Im injected' ++ lscpu + '[' 'Im injected Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Address sizes: 46 bits physical, 57 bits virtual Byte Order: Little Endian CPU(s): 112
I wouldn't consider this a security issue, because I have to be the admin to do that, and as the admin I can also simply modify the pod, but it's not very elegant to allow for some sort of code injection, even by the admin
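A hedged sketch of how the value could be constrained at the API level with a kubebuilder enum marker, assuming the field lives on a Go API type; the type and field names illustrate the pattern and are not the exact operator API source:

package v1sketch

// IPForwardingMode is restricted to the supported values; anything else is
// rejected by the API server instead of being passed into shell scripts.
// +kubebuilder:validation:Enum="";Restricted;Global
type IPForwardingMode string

const (
	IPForwardingRestricted IPForwardingMode = "Restricted"
	IPForwardingGlobal     IPForwardingMode = "Global"
)

// GatewayConfig shows where the validated field would sit.
type GatewayConfig struct {
	// +optional
	IPForwarding IPForwardingMode `json:"ipForwarding,omitempty"`
}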
Description of problem:
[csi-snapshot-controller-operator] does not create suitable role and roleBinding for csi-snapshot-webhook
Version-Release number of selected component (if applicable):
$ oc version Client Version: 4.14.0-rc.0 Kustomize Version: v5.0.1 Server Version: 4.14.0-0.nightly-2024-03-28-004801 Kubernetes Version: v1.27.11+749fe1d
How reproducible:
Always
Steps to Reproduce:
1. Create an OpenShift cluster on AWS; 2. Check the csi-snapshot-webhook logs with no errors.
Actual results:
In step 2: $ oc logs csi-snapshot-webhook-76bf9bd758-cxr7g I0328 08:02:58.016020 1 certwatcher.go:129] Updated current TLS certificate W0328 08:02:58.029464 1 reflector.go:424] github.com/kubernetes-csi/external-snapshotter/client/v6/informers/externalversions/factory.go:117: failed to list *v1.VolumeSnapshotClass: volumesnapshotclasses.snapshot.storage.k8s.io is forbidden: User "system:serviceaccount:openshift-cluster-storage-operator:default" cannot list resource "volumesnapshotclasses" in API group "snapshot.storage.k8s.io" at the cluster scope E0328 08:02:58.029512 1 reflector.go:140] github.com/kubernetes-csi/external-snapshotter/client/v6/informers/externalversions/factory.go:117: Failed to watch *v1.VolumeSnapshotClass: failed to list *v1.VolumeSnapshotClass: volumesnapshotclasses.snapshot.storage.k8s.io is forbidden: User "system:serviceaccount:openshift-cluster-storage-operator:default" cannot list resource "volumesnapshotclasses" in API group "snapshot.storage.k8s.io" at the cluster scope W0328 08:02:58.888397 1 reflector.go:424] github.com/kubernetes-csi/external-snapshotter/client/v6/informers/externalversions/factory.go:117: failed to list *v1.VolumeSnapshotClass: volumesnapshotclasses.snapshot.storage.k8s.io is forbidden: User "system:serviceaccount:openshift-cluster-storage-operator:default" cannot list resource "volumesnapshotclasses" in API group "snapshot.storage.k8s.io" at the cluster scope
Expected results:
In step2 the csi-snapshot-webhook logs should have no cannot list resource errors
Additional info:
The issue exist on 4.15 and 4.16 as well, in addition since 4.15+ the webhook needs additional "VolumeGroupSnapshotClass" list permissions $ oc logs csi-snapshot-webhook-794b7b54d7-b8vl9 ... E0328 12:12:06.509158 1 reflector.go:147] github.com/kubernetes-csi/external-snapshotter/client/v6/informers/externalversions/factory.go:133: Failed to watch *v1alpha1.VolumeGroupSnapshotClass: failed to list *v1alpha1.VolumeGroupSnapshotClass: volumegroupsnapshotclasses.groupsnapshot.storage.k8s.io is forbidden: User "system:serviceaccount:openshift-cluster-storage-operator:default" cannot list resource "volumegroupsnapshotclasses" in API group "groupsnapshot.storage.k8s.io" at the cluster scope W0328 12:12:50.836582 1 reflector.go:535] github.com/kubernetes-csi/external-snapshotter/client/v6/informers/externalversions/factory.go:133: failed to list *v1alpha1.VolumeGroupSnapshotClass: volumegroupsnapshotclasses.groupsnapshot.storage.k8s.io is forbidden: User "system:serviceaccount:openshift-cluster-storage-operator:default" cannot list resource "volumegroupsnapshotclasses" in API group "groupsnapshot.storage.k8s.io" at the cluster scope ...
Component Readiness has found a potential regression in [sig-arch] events should not repeat pathologically for ns/openshift-monitoring.
Probability of significant regression: 100.00%
Sample (being evaluated) Release: 4.15
Start Time: 2024-01-04T00:00:00Z
End Time: 2024-01-10T23:59:59Z
Success Rate: 42.31%
Successes: 11
Failures: 15
Flakes: 0
Base (historical) Release: 4.14
Start Time: 2023-10-04T00:00:00Z
End Time: 2023-10-31T23:59:59Z
Success Rate: 100.00%
Successes: 151
Failures: 0
Flakes: 0
This is a clone of issue OCPBUGS-34870. The following is the description of the original issue:
—
In OCPBUGS-30951, we modified a check used in the Cinder CSI Driver Operator to relax the requirements for enabling topology support. Unfortunately, in doing this we introduced a bug: we now attempt to access the volume AZ for each compute AZ, which isn't valid if there are more compute AZs than volume AZs. This needs to be addressed.
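To make the failure mode concrete, a minimal sketch of a guarded lookup (illustrative names, not the operator's actual code), assuming compute and volume AZs are held in plain slices; indexing the volume slice unconditionally with the compute-AZ index is the out-of-range bug described above.

package topologysketch

// volumeAZForCompute returns the volume AZ matching the i-th compute AZ, or
// false when there are more compute AZs than volume AZs.
func volumeAZForCompute(volumeAZs []string, i int) (string, bool) {
	if i < 0 || i >= len(volumeAZs) {
		return "", false // no 1:1 mapping for this compute AZ
	}
	return volumeAZs[i], true
}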
This affects 4.14 through to master (unreleased 4.17).
Always.
1. Deploy OCP-on-OSP on a cluster with fewer storage AZs than compute AZs
Operator fails due to out-of-range error.
Operator should not fail.
None.
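A quick way to confirm the precondition on the underlying OpenStack cloud (illustrative commands using the standard openstack CLI; names and output depend on the environment):
openstack availability zone list --compute
openstack availability zone list --volume
If the first list is longer than the second, the operator runs into the out-of-range access described above.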
This is a clone of issue OCPBUGS-37718. The following is the description of the original issue:
—
Description of problem:
The cluster-api-provider-openstack revision used for e2e testing in cluster-capi-operator is not pinned to a specific branch. As such, the Go version used in the two projects goes out of sync, causing the test to fail to start.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Please review the following PR: https://github.com/openshift/cloud-credential-operator/pull/689
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
RHTAP builds are failing because, with the addition of the request serving node scheduler [1], we use the max() [2] builtin, which requires Go 1.21; however, the Containerfile [3] used for building the HO binary in RHTAP is still using Go 1.20.
[1] https://issues.redhat.com/browse/HOSTEDCP-1478
[2] https://github.com/openshift/hypershift/pull/3776/files#diff-a7f22add63b0067c0a7c9813255519d1432821f431f6eea0c3373d0646d1a855R489
[3] https://github.com/openshift/hypershift/blob/main/Containerfile.operator#L1
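To confirm the mismatch locally, two illustrative commands (file names are taken from the links above; the exact builder image string is not reproduced here):
grep '^go ' go.mod               # the module now requires Go >= 1.21 because of the max() builtin
head -n 1 Containerfile.operator # the FROM line still references a Go 1.20 builder image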
Version-Release number of selected component (if applicable):
4.16
How reproducible:
100%
Steps to Reproduce:
1.Run rhtap on main branch 2. 3.
Actual results:
Fail
Expected results:
Pass
Additional info:
This is a clone of issue OCPBUGS-35069. The following is the description of the original issue:
—
Description of problem:
Reviewing https://sippy.dptools.openshift.org/sippy-ng/component_readiness/test_details?arch=amd64&arch=amd64&baseEndTime=2024-02-28%2023%3A59%3A59&baseRelease=4.15&baseStartTime=2024-02-01%2000%3A00%3A00&capability=operator-conditions&component=Cloud%20Compute%20%2F%20Other%20Provider&confidence=95&environment=ovn%20no-upgrade%20amd64%20azure%20standard&excludeArches=arm64%2Cheterogeneous%2Cppc64le%2Cs390x&excludeClouds=openstack%2Cibmcloud%2Clibvirt%2Covirt%2Cunknown&excludeVariants=hypershift%2Cosd%2Cmicroshift%2Ctechpreview%2Csingle-node%2Cassisted%2Ccompact&groupBy=cloud%2Carch%2Cnetwork&ignoreDisruption=true&ignoreMissing=false&minFail=3&network=ovn&network=ovn&pity=5&platform=azure&platform=azure&sampleEndTime=2024-06-05%2023%3A59%3A59&sampleRelease=4.15&sampleStartTime=2024-05-30%2000%3A00%3A00&testId=Operator%20results%3A6d9ee55972f66121016367d07d52f0a9&testName=operator%20conditions%20control-plane-machine-set&upgrade=no-upgrade&upgrade=no-upgrade&variant=standard&variant=standard, it appears that the Azure tests are failing frequently with "Told to stop trying". Check failed before until passed. Reviewing this, it appears that the rollout happened as expected, but the until function got a non-retryable error and exited, while the check saw that the Deletion timestamp was set and the Machine went into Running, which caused it to fail. We should investigate why the until failed in this case as it should have seen the same machines and therefore should have seen a Running machine and passed.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Please review the following PR: https://github.com/openshift/sdn/pull/596
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
Before: Warning FailedCreatePodSandBox 8s kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "82187d55b1379aad1e6c02b3394df7a8a0c84cc90902af413c1e0d9d56ddafb0": plugin type="multus" name="multus-cni-network" failed (add): [default/netshoot-deployment-59898b5dd9-hhvfn/89e6349b-9797-4e03-8828-ebafe224dfaf:whereaboutsexample]: error adding container to network "whereaboutsexample": error at storage engine: Could not allocate IP in range: ip: 2000::1 / - 2000::ffff:ffff:ffff:fffe / range: net.IPNet{IP:net.IP{0x20, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, Mask:net.IPMask{0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}} After: Type Reason Age From Message ---- ------ ---- ---- ------- Normal Scheduled 6s default-scheduler Successfully assigned default/netshoot-deployment-59898b5dd9-kk2zm to whereabouts-worker Normal AddedInterface 6s multus Add eth0 [10.244.2.2/24] from kindnet Warning FailedCreatePodSandBox 6s kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "23dd45e714db09380150b5df74be37801bf3caf73a5262329427a5029ef44db1": plugin type="multus" name="multus-cni-network" failed (add): [default/netshoot-deployment-59898b5dd9-kk2zm/142de5eb-9f8a-4818-8c5c-6c7c85fe575e:whereaboutsexample]: error adding container to network "whereaboutsexample": error at storage engine: Could not allocate IP in range: ip: 2000::1 / - 2000::ffff:ffff:ffff:fffe / range: 2000::/64 / excludeRanges: [2000::/32]
Fixed upstream in #366 https://github.com/k8snetworkplumbingwg/whereabouts/pull/366
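For context, a hedged sketch of a NetworkAttachmentDefinition using the whereabouts IPAM plugin with an exclude range like the one reported in the error above; the CNI type and master interface are assumptions, not the exact test fixture:
cat <<'EOF' | oc apply -f -
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: whereaboutsexample
spec:
  config: '{
    "cniVersion": "0.3.1",
    "type": "macvlan",
    "master": "eth0",
    "ipam": {
      "type": "whereabouts",
      "range": "2000::/64",
      "exclude": ["2000::/32"]
    }
  }'
EOF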
Please review the following PR: https://github.com/openshift/gcp-pd-csi-driver-operator/pull/99
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
This is a clone of issue OCPBUGS-35315. The following is the description of the original issue:
—
Description of problem:
If infrastructure or machine provisioning is slow, the installer may wait several minutes before declaring provisioning successful due to the exponential backoff. For instance, if DNS resolution from load balancers is slow to propagate and we
Version-Release number of selected component (if applicable):
How reproducible:
Sometimes, it depends on provisioning being slow.
Steps to Reproduce:
1. Provision a cluster in an environment that has slow dns resolution (unclear how to set this up) 2. 3.
Actual results:
The installer will only check for infrastructure or machine readiness at intervals of several minutes after a certain threshold (say 10 minutes).
Expected results:
Installer should just check regularly, e.g. every 15 seconds.
Additional info:
It may not be possible to definitively test this. We may want to just check ci logs for an improvement in provisioning time and check for lack of regressions.
Description of problem:
In a UPI cluster there are no MachineSet and Machine resources, so when a user visits the Machines or MachineSets list page we only see the simple text 'Not found'
Version-Release number of selected component (if applicable):
4.15.0-0.nightly-2024-01-16-113018
How reproducible:
Always
Steps to Reproduce:
1. setup UPI cluster 2. goes to MachineSets and Machines list page, check the empty state message
Actual results:
2. We just show the plain 'Not found' text
Expected results:
2. for other resources, we show richer text 'No <resourcekind> found', so we should also show 'No Machines found' and 'No MachineSets found' for these pages
Additional info:
This is a clone of issue OCPBUGS-33561. The following is the description of the original issue:
—
Description of problem:
The valid values for installconfig.platform.vsphere.diskType are thin, thick, and eagerZeroedThick. But no matter whether diskType is set to thick or eagerZeroedThick, the actual disk ends up as thin:
govc vm.info --json /DEVQEdatacenter/vm/wwei-511d-gtbqd/wwei-511d-gtbqd-master-1 | jq -r .VirtualMachines[].Layout.Disk[].DiskFile[]
[vsanDatastore] e7323f66-86ef-9947-a2b9-507c6f3b795c/wwei-511d-gtbqd-master-1.vmdk
[fedora@preserve-wwei ~]$ govc datastore.disk.info -ds /DEVQEdatacenter/datastore/vsanDatastore e7323f66-86ef-9947-a2b9-507c6f3b795c/wwei-511d-gtbqd-master-1.vmdk |grep Type
Type: thin
Version-Release number of selected component (if applicable):
4.16.0-0.nightly-2024-05-07-025557
How reproducible:
Setting installconfig.platform.vsphere.diskType to thick or eagerZeroedThick and continuing the installation.
Steps to Reproduce:
1. Set installconfig.platform.vsphere.diskType to thick or eagerZeroedThick 2. Continue the installation
Actual results:
The disk type is thin even when the install-config sets diskType: thick or eagerZeroedThick
Expected results:
The check result for disk info should match the setting in install-config
Additional info:
Description of problem:
When a customer certificate and an SRE certificate are configured and approved, revoking the customer certificate causes access to the cluster using a kubeconfig with the SRE certificate to be denied
Version-Release number of selected component (if applicable):
How reproducible:
always
Steps to Reproduce:
1. Create a cluster 2. Configure a customer cert and a sre cert, they are approved 3. Revoke a customer cert, access to the cluster using kubeconfig with sre cert gets denied
Actual results:
Revoke a customer cert, access to the cluster using kubeconfig with sre cert gets denied
Expected results:
Revoke a customer cert, access to the cluster using kubeconfig with sre cert succeeds
Additional info:
As an OCP user, I want storage operators to restart quickly, with the newly started operator beginning to lead immediately, without a ~3 minute wait.
This means that the old operator should release its leadership after it receives SIGTERM and before it exits. Right now, storage operators fail to release the leadership in ~50% of cases.
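One way to observe whether leadership is actually handed over promptly is to watch the operator's Lease object while the pod restarts (namespace and lease names vary per operator, so treat this as a placeholder):
oc -n openshift-cluster-storage-operator get lease -w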
Steps to reproduce:
This is a hack'n'hustle "work", not tied to any Epic; I'm using it just to get proper QE and to track which operators are being updated (see linked GitHub PRs).
Currently, the plugin template gives you instructions for running the console using a container image, which is a lightweight way to do development and avoids the need to build the console source code from scratch. The image we reference uses a production version of React, however. This means that you aren't able to use the React browser plugin to debug your application.
We should look at alternatives that allow you to use React Developer Tools. Perhaps we can publish a different image that uses a development build. Or at least we need to better document building console locally instead of using an image to allow development builds.
Please review the following PR: https://github.com/openshift/cluster-api-provider-azure/pull/293
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/image-registry/pull/390
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
https://github.com/openshift/installer/pull/7778 introduced a bug where an error is always returned while retrieving a marketplace image.
Version-Release number of selected component (if applicable):
How reproducible:
always
Steps to Reproduce:
1. Configure marketplace image in the install-config 2. openshift-install create manifests 3.
Actual results:
$ ./openshift-install create manifests --dir ipi1 --log-level debug DEBUG OpenShift Installer 4.16.0-0.test-2023-12-12-020559-ci-ln-xkqmlqk-latest DEBUG Built from commit 456ae720a83e39dffd9918c5a71388ad873b6a38 DEBUG Fetching Master Machines... DEBUG Loading Master Machines... DEBUG Loading Cluster ID... DEBUG Loading Install Config... DEBUG Loading SSH Key... DEBUG Loading Base Domain... DEBUG Loading Platform... DEBUG Loading Cluster Name... DEBUG Loading Base Domain... DEBUG Loading Platform... DEBUG Loading Pull Secret... DEBUG Loading Platform... INFO Credentials loaded from file "/home/fedora/.azure/osServicePrincipal.json" ERROR failed to fetch Master Machines: failed to load asset "Install Config": failed to create install config: [controlPlane.platform.azure.osImage: Invalid value: azure.OSImage{Plan:"", Publisher:"redhat", Offer:"rh-ocp-worker", SKU:"rh-ocp-worker", Version:"413.92.2023101700"}: could not get marketplace image: %!w(<nil>), compute[0].platform.azure.osImage: Invalid value: azure.OSImage{Plan:"", Publisher:"redhat", Offer:"rh-ocp-worker", SKU:"rh-ocp-worker", Version:"413.92.2023101700"}: could not get marketplace image: %!w(<nil>)]
Expected results:
Success
Additional info:
When `errors.Wrap(err, ...)` was replaced by `fmt.Errorf(...)`, there is a slight difference in behavior: `errors.Wrap` returns `nil` if `err` is `nil`, but `fmt.Errorf` always returns an error.
Description of problem:
Since OCP 4.15 we see an issue where an OLM-deployed operator is unable to operate in multiple watched namespaces. It works fine with a single watched namespace (subscription). The same test also passes if we deploy the operator from files instead of via OLM. Based on the operator log it looks like a permission issue. The same test works fine on OCP 4.14 and older.
Version-Release number of selected component (if applicable):
Server Version: 4.15.0-ec.3 Kubernetes Version: v1.28.3+20a5764
How reproducible:
Always
Steps to Reproduce:
0. oc login OCP4.15 1. git clone https://gitlab.cee.redhat.com/amq-broker/claire 2. make -f Makefile.downstream build ARTEMIS_VERSION=7.11.4 RELEASE_TYPE=released 3. make -f Makefile.downstream operator_test OLM_IIB=registry-proxy.engineering.redhat.com/rh-osbs/iib:636350 OLM_CHANNEL=7.11.x TESTS=ClusteredOperatorSmokeTests TEST_LOG_LEVEL=debug DISABLE_RANDOM_NAMESPACES=true
Actual results:
Can't deploy artemis broker custom resource in given namespace (permission issue - see details below)
Expected results:
Successfully deployed broker on watched namespaces
Additional info:
Log from AMQ Broker operator - seems like some permission issues since 4.15
E0103 10:04:54.425202 1 reflector.go:138] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:250: Failed to watch *v1beta1.ActiveMQArtemis: failed to list *v1beta1.ActiveMQArtemis: activemqartemises.broker.amq.io is forbidden: User "system:serviceaccount:cluster-tests:amq-broker-controller-manager" cannot list resource "activemqartemises" in API group "broker.amq.io" in the namespace "cluster-testsa" E0103 10:04:54.425207 1 reflector.go:138] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:250: Failed to watch *v1beta1.ActiveMQArtemisSecurity: failed to list *v1beta1.ActiveMQArtemisSecurity: activemqartemissecurities.broker.amq.io is forbidden: User "system:serviceaccount:cluster-tests:amq-broker-controller-manager" cannot list resource "activemqartemissecurities" in API group "broker.amq.io" in the namespace "cluster-testsa" E0103 10:04:54.425221 1 reflector.go:138] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:250: Failed to watch *v1.Pod: failed to list *v1.Pod: pods is forbidden: User "system:serviceaccount:cluster-tests:amq-broker-controller-manager" cannot list resource "pods" in API group "" in the namespace "cluster-testsa" W0103 10:04:54.425296 1 reflector.go:324] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:250: failed to list *v1beta1.ActiveMQArtemisScaledown: activemqartemisscaledowns.broker.amq.io is forbidden: User "system:serviceaccount:cluster-tests:amq-broker-controller-manager" cannot list resource "activemqartemisscaledowns" in API group "broker.amq.io" in the namespace "cluster-testsa"
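For context, a hedged sketch of the OperatorGroup shape involved when the operator is expected to watch several namespaces; the namespace names are taken from the log above, but this is illustrative rather than the exact test fixture:
cat <<'EOF' | oc apply -f -
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: amq-broker-og        # hypothetical name
  namespace: cluster-tests
spec:
  targetNamespaces:
  - cluster-tests
  - cluster-testsa
EOF
With this shape, OLM is expected to generate RBAC that lets the operator's service account list and watch its CRs and pods in both target namespaces, which is exactly what the errors above show is missing.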
Please review the following PR: https://github.com/openshift/ibm-vpc-block-csi-driver-operator/pull/109
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
The Monitoring topic is used by the console by reviewing this file.
Description of problem:
With `oc` version 4.15 on OCP 4.15, the following command fails:
$ ~/openshift-client-linux-4.15.6/oc version Client Version: 4.15.6 Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3 Kubernetes Version: v1.28.7+f1b5f6c $ ~/openshift-client-linux-4.15.6/oc create job manual-skrenger-from-oc-415 --from=cronjob/pi error: failed to create job: jobs.batch "manual-skrenger-from-oc-415" is forbidden: cannot set blockOwnerDeletion if an ownerReference refers to a resource you can't set finalizers on: , <nil>
With older versions of `oc`, this command executes as expected:
$ ~/openshift-client-linux-4.14.19/oc version Client Version: 4.14.19 Kustomize Version: v5.0.1 Kubernetes Version: v1.28.7+f1b5f6c $ ~/openshift-client-linux-4.14.19/oc create job manual-skrenger-with-oc-414 --from=cronjob/pi job.batch/manual-skrenger-with-oc-414 created
Version-Release number of selected component (if applicable):
$ ~/openshift-client-linux-4.15.6/oc version Client Version: 4.15.6 Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3 Kubernetes Version: v1.28.7+f1b5f6c
How reproducible:
Always
Steps to Reproduce:
1. Set up a cluster using OCP 4.15 and set up IDP
2. Ensure a 4.15 version of `oc` client is used by executing "oc version"
3. Log in with a regular user, NOT cluster-admin (this is important)
4. Create a new project using "oc new-project example"
5. Create a Cronjob using the instructions in the documentation: https://docs.openshift.com/container-platform/4.15/nodes/jobs/nodes-nodes-jobs.html#nodes-nodes-jobs-creating-cron_nodes-nodes-jobs
6. Execute the following command to manually create a job from this cronjob: "oc create job manual-example --from=cronjob/pi"
Actual results:
Creating the job fails with:
$ ~/openshift-client-linux-4.15.6/oc create job manual-skrenger-from-oc-415 --from=cronjob/pi error: failed to create job: jobs.batch "manual-skrenger-from-oc-415" is forbidden: cannot set blockOwnerDeletion if an ownerReference refers to a resource you can't set finalizers on: , <nil>
This is likely due to the missing permission on "cronjobs/finalizers". We would expect the "admin" role to have these permissions (see comments below).
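A minimal sketch, assuming the fix takes the form of an aggregated rule rather than this exact manifest, of a ClusterRole that would give project admins the permission the error points at (the role name is hypothetical; the aggregation label is the standard mechanism for extending the admin role):
cat <<'EOF' | oc apply -f -
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: cronjob-finalizers-admin   # hypothetical name
  labels:
    rbac.authorization.k8s.io/aggregate-to-admin: "true"
rules:
- apiGroups: ["batch"]
  resources: ["cronjobs/finalizers"]
  verbs: ["update"]
EOF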
Expected results:
Job is created as expected
Additional info:
Please review the following PR: https://github.com/openshift/operator-framework-olm/pull/631
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
This is a clone of issue OCPBUGS-43564. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-43508. The following is the description of the original issue:
—
Description of problem:
These two tests have been flaking more often lately. The TestLeaderElection flake is partially (but not solely) connected to OCPBUGS-41903. TestOperandProxyConfiguration seems to fail in the teardown while waiting for other cluster operators to become available. Although these flakes aren't customer facing, they considerably slow development cycles (due to retests) and also consume more resources than they should (every retest runs on a new cluster), so we want to backport the fixes.
Version-Release number of selected component (if applicable):
4.18, 4.17, 4.16, 4.15, 4.14
How reproducible:
Sometimes
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
hypershift#1614 gave us the router Deployment (descended from the private-router Deployment), but it lacks PDB coverage. For example:
$ git --no-pager log -1 --oneline origin/main f3f421bc7 (origin/release-4.16, origin/release-4.15, origin/main, origin/HEAD) Merge pull request #3183 from muraee/azure-kms $ git --no-pager grep 'func [^(]*\(Deployment\|PodDisruptionBudget\)' f3f421bc7 -- control-plane-operator/controllers/hostedcontrolplane/{ingress,kas} f3f421bc7:control-plane-operator/controllers/hostedcontrolplane/ingress/router.go:func ReconcileRouterDeployment(deployment *appsv1.Deployment, ownerRef config.OwnerRef, deploymentConfig config.DeploymentConfig, image string, config *corev1.ConfigMap) error { f3f421bc7:control-plane-operator/controllers/hostedcontrolplane/kas/deployment.go:func ReconcileKubeAPIServerDeployment(deployment *appsv1.Deployment, f3f421bc7:control-plane-operator/controllers/hostedcontrolplane/kas/pdb.go:func ReconcilePodDisruptionBudget(pdb *policyv1.PodDisruptionBudget, p *KubeAPIServerParams) error {
Both the ingress and kas packages have Reconcile*Deployment methods. Only kas has a ReconcilePodDisruptionBudget method.
This bug is asking for router to get a covering PDB too, because being able to evict all router-* pods simultaneously (for the cluster flavors that have replicas > 1 on that Deployment) can make the incoming traffic unreachable. And some of that Route traffic looks like stuff that folks would want to be reliably reachable:
$ git --no-pager grep 'func Reconcile[^(]*Route(' f3f421bc7 -- control-plane-operator/controllers/hostedcontrolplane/{ingress,kas} f3f421bc7:control-plane-operator/controllers/hostedcontrolplane/kas/service.go:func ReconcileExternalPublicRoute(route *routev1.Route, owner *metav1.OwnerReference, hostname string) error { f3f421bc7:control-plane-operator/controllers/hostedcontrolplane/kas/service.go:func ReconcileExternalPrivateRoute(route *routev1.Route, owner *metav1.OwnerReference, hostname string) error { f3f421bc7:control-plane-operator/controllers/hostedcontrolplane/kas/service.go:func ReconcileInternalRoute(route *routev1.Route, owner *metav1.OwnerReference) error { f3f421bc7:control-plane-operator/controllers/hostedcontrolplane/kas/service.go:func ReconcileKonnectivityExternalRoute(route *routev1.Route, ownerRef config.OwnerRef, hostname string, defaultIngressDomain string) error { f3f421bc7:control-plane-operator/controllers/hostedcontrolplane/kas/service.go:func ReconcileKonnectivityInternalRoute(route *routev1.Route, ownerRef config.OwnerRef) error {
Test plan:
1. Install a hosted cluster.
2. Log into the managment cluster, and find the namespace of the hosted cluster $NAMESPACE.
3. Evict both router pods (using a raw create, because there isn't more convenient syntax yet):
oc -n "${NAMESPACE}" get -l app=private-router -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}' pods | while read NAME do oc create -f - <<EOF --raw "/api/v1/namespaces/${NAMESPACE}/pods/${NAME}/eviction" {"apiVersion": "policy/v1", "kind": "Eviction", "metadata": {"name": "${NAME}"}} EOF done
If that clears out both router pods right after the other, ingress will probably hiccup. And with the PDB in place, I'd expect the second eviction to fail.
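For reference, a minimal PodDisruptionBudget sketch that would cover those pods; the selector label is taken from the eviction loop above, the name and maxUnavailable value are assumptions, and the real fix belongs in the control-plane-operator's reconcile code rather than a hand-applied manifest:
cat <<'EOF' | oc -n "${NAMESPACE}" apply -f -
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: router
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app: private-router
EOF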
Please review the following PR: https://github.com/openshift/installer/pull/7816
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/csi-operator/pull/78
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
As this shows tls: bad certificate from kube-apiserver operator, for example, https://reportportal-openshift.apps.ocp-c1.prod.psi.redhat.com/ui/#prow/launches/all/470214, checked its must-gather: https://gcsweb-qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/qe-private-deck/logs/periodic-ci-openshift-openshift-tests-private-release-4.15-amd64-nightly-aws-ipi-imdsv2-fips-f14/1726036030588456960/artifacts/aws-ipi-imdsv2-fips-f14/gather-must-gather/artifacts/
MacBook-Pro:~ jianzhang$ omg logs prometheus-operator-admission-webhook-6bbdbc47df-jd5mb | grep "TLS handshake" 2023-11-27 10:11:50.687 | WARNING | omg.utils.load_yaml:<module>:10 - yaml.CSafeLoader failed to load, using SafeLoader 2023-11-19T00:57:08.318983249Z ts=2023-11-19T00:57:08.318923708Z caller=stdlib.go:105 caller=server.go:3215 msg="http: TLS handshake error from 10.129.0.35:48334: remote error: tls: bad certificate" 2023-11-19T00:57:10.336569986Z ts=2023-11-19T00:57:10.336505695Z caller=stdlib.go:105 caller=server.go:3215 msg="http: TLS handshake error from 10.129.0.35:48342: remote error: tls: bad certificate" ... MacBook-Pro:~ jianzhang$ omg get pods -A -o wide | grep "10.129.0.35" 2023-11-27 10:12:16.382 | WARNING | omg.utils.load_yaml:<module>:10 - yaml.CSafeLoader failed to load, using SafeLoader openshift-kube-apiserver-operator kube-apiserver-operator-f78c754f9-rbhw9 1/1 Running 2 5h27m 10.129.0.35 ip-10-0-107-238.ec2.internal
for more information slack - https://redhat-internal.slack.com/archives/CC3CZCQHM/p1700473278471309
Running the command coreos-installer iso kargs show no longer works with the 4.13 Agent ISO. Instead we get this error:
$ coreos-installer iso kargs show agent.x86_64.iso Writing manifest to image destination Storing signatures Error: No karg embed areas found; old or corrupted CoreOS ISO image.
This is almost certainly due to the way we repack the ISO as part of embedding the agent-tui binary in it.
It worked fine in 4.12. I have tested both with every version of coreos-installer from 0.14 to 0.17
Description of problem:
Deleting the node with the Ingress VIP using oc delete node causes a keepalived split-brain
Version-Release number of selected component (if applicable):
4.12, 4.14
How reproducible:
100%
Steps to Reproduce:
1. In an OpenShift cluster installed via vSphere IPI, check the node with the Ingress VIP.
2. Delete the node.
3. Check the discrepancy between machines objects and nodes. There will be more machines than nodes.
4. SSH to the deleted node, and check the VIP is still mounted and keepalived pods are running.
5. Check the VIP is also mounted in another worker.
6. SSH to the node and check the VIP is still present.
Actual results:
The deleted node still has the VIP present and the ingress fails sometimes
Expected results:
The deleted node should not have the VIP present and the ingress should not fail.
Additional info:
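For comparison, the usual way to retire an IPI node without leaving a stale VIP holder behind is to drain it and then delete the corresponding Machine (names below are placeholders), so the machine controller actually deprovisions the VM instead of only the Node object being removed:
oc adm cordon <node-name>
oc adm drain <node-name> --ignore-daemonsets --delete-emptydir-data
oc -n openshift-machine-api delete machine <machine-name>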
Description of problem:
make verify uses the latest version of setup-envtest, regardless of what go version the repo is currently on
How reproducible:
100%
Steps to Reproduce:
Running `make verify` without a local copy of setup-envtest should cause the issue
Actual results:
go: sigs.k8s.io/controller-runtime/tools/setup-envtest@latest: sigs.k8s.io/controller-runtime/tools/setup-envtest@v0.0.0-20240323114127-e08b286e313e requires go >= 1.22.0 (running go 1.21.7; GOTOOLCHAIN=local) Go compliance shim [5685] [rhel-8-golang-1.21][openshift-golang-builder]: Exited with: 1
Expected results:
make verify should be able to run without build errors
Additional info:
We merged this ART PR which bumps base images. And then the bumper reverted the changes here: https://github.com/openshift/operator-framework-operator-controller/pull/88/files.
I still see the ART bump commit in main, but there is "Add OpenShift specific files" commit on top of it with older images. Actually now we have two "Add OpenShift specific files" commits in main:
And every UPSTREAM: <carry>-prefixed commit seems to be duplicated on top of synced changes.
Expected result:
This is a clone of issue OCPBUGS-34389. The following is the description of the original issue:
—
Description of problem:
When publish: internal is set, the bootstrap SSH rules are still open to the public internet (0.0.0.0/0) instead of being restricted to the machine CIDR.
Version-Release number of selected component (if applicable):
How reproducible:
all private clusters
Steps to Reproduce:
1. set publish: internal in installconfig 2. inspect ssh rule 3.
Actual results:
ssh is open to public internet
Expected results:
should be restricted to machine network
Additional info:
Bump prometheus-operator to 0.73.2
Description of problem:
Pull image from gcp artifact registry failed
Version-Release number of selected component (if applicable):
4.17
How reproducible:
100%
Steps to Reproduce:
1. Create repo for gcp artifact registry: zhsun-repo1 2. Login to registry gcloud auth login gcloud auth configure-docker us-central1-docker.pkg.dev 3. Push image to registry $ docker pull openshift/hello-openshift $ docker tag openshift/hello-openshift:latest us-central1-docker.pkg.dev/openshift-qe/zhsun-repo1/hello-gcr:latest $ docker push us-central1-docker.pkg.dev/openshift-qe/zhsun-repo1/hello-gcr:latest 4. Create pod $ oc new-project hello-gcr $ oc new-app --name hello-gcr --allow-missing-images \ --image us-central1-docker.pkg.dev/openshift-qe/zhsun-repo1/hello-gcr:latest 5. Check pod status
Actual results:
Pull image failed. must-gather: https://drive.google.com/file/d/1o9cyJB53vQtHNmL5EV_hIx9I_LzMTB0K/view?usp=sharing kubelet log: https://drive.google.com/file/d/1tL7HGc4fEOjH5_v6howBpx2NuhjGKsTp/view?usp=sharing $ oc get po NAME READY STATUS RESTARTS AGE hello-gcr-658f7f9869-76ssg 0/1 ImagePullBackOff 0 3h24m $ oc describe po hello-gcr-658f7f9869-76ssg Warning Failed 14s (x2 over 15s) kubelet Error: ImagePullBackOff Normal Pulling 2s (x2 over 16s) kubelet Pulling image "us-central1-docker.pkg.dev/openshift-qe/zhsun-repo1/hello-gcr:latest" Warning Failed 1s (x2 over 16s) kubelet Failed to pull image "us-central1-docker.pkg.dev/openshift-qe/zhsun-repo1/hello-gcr:latest": rpc error: code = Unknown desc = Requesting bearer token: invalid status code from registry 403 (Forbidden)
Expected results:
Can pull image from artifact registry succeed
Additional info:
gcr.io works as expected. us-central1-docker.pkg.dev/openshift-qe/zhsun-repo1/hello-gcr:latest doesn't work. $ oc get po -n hello-gcr NAME READY STATUS RESTARTS AGE hello-gcr-658f7f9869-76ssg 0/1 ImagePullBackOff 0 156m hello-gcr2-6d98c475ff-vjkt5 1/1 Running 0 163m $ oc get po -n hello-gcr -o yaml | grep image - image: us-central1-docker.pkg.dev/openshift-qe/zhsun-repo1/hello-gcr:latest - image: gcr.io/openshift-qe/hello-gcr:latest
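A possible workaround sketch, not verified as the actual fix for this bug: provide Artifact Registry credentials explicitly through a pull secret linked to the namespace's default service account, using a GCP service-account key file (sa-key.json is a placeholder):
oc create secret docker-registry gar-pull -n hello-gcr \
  --docker-server=us-central1-docker.pkg.dev \
  --docker-username=_json_key \
  --docker-password="$(cat sa-key.json)"
oc -n hello-gcr secrets link default gar-pull --for=pull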
[sig-arch][Late] operators should not create watch channels very often [apigroup:apiserver.openshift.io] [Suite:openshift/conformance/parallel]
{ fail [github.com/openshift/origin/test/extended/apiserver/api_requests.go:360]: Expected <[]string | len:1, cap:1>: [ "Operator \"cluster-node-tuning-operator\" produces more watch requests than expected: watchrequestcount=209, upperbound=184, ratio=1.14", ] to be empty Ginkgo exit error 1: exit with code 1}
This is a clone of issue OCPBUGS-35798. The following is the description of the original issue:
—
Description of problem:
In PowerVS, when I try and deploy a 4.17 cluster, I see the following ProbeError event: Liveness probe error: Get "https://192.168.169.11:10258/healthz": dial tcp 192.168.169.11:10258: connect: connection refused
Version-Release number of selected component (if applicable):
release-ppc64le:4.17.0-0.nightly-ppc64le-2024-06-14-211304
How reproducible:
Always
Steps to Reproduce:
1. Create a cluster
Description of the problem:
Non-Nutanix node was successfully added to Nutanix day1 cluster
How reproducible:
100%
Steps to reproduce:
1. Deploy Nutanix day1 cluster
4. Try to add non-Nutanix day-2 node to Nutanix cluster
Actual results:
Day-2 node installation started and host installed
Expected results:
Day-2 node doesn't pass pre-installation checks
Description of problem:
There are built-in cluster roles to provide access to the default OpenShift SCCs. The "hostmount-anyuid" SCC does not have a functioning built-in cluster role, as it appears to have a typo in the name.
Version-Release number of selected component (if applicable):
How reproducible:
Consistent
Steps to Reproduce:
1. Attempt to use "system:openshift:scc:hostmount" cluster role 2. 3.
Actual results:
No access is provided because the SCC name referenced by the role is misspelled
Expected results:
Access provided to use the SCC
Additional info:
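A quick way to see the typo on a live cluster (illustrative read-only commands):
oc get clusterroles | grep 'system:openshift:scc:host'
oc get clusterrole system:openshift:scc:hostmount -o yaml   # note which SCC name the 'use' rule actually references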
This is a clone of issue OCPBUGS-35908. The following is the description of the original issue:
—
ConsoleYAMLSample CRD redirect to home ensure perspective switcher is set to Administrator 1) creates, displays, tests and deletes a new ConsoleYAMLSample instance 0 passing (2m) 1 failing 1) ConsoleYAMLSample CRD creates, displays, tests and deletes a new ConsoleYAMLSample instance: AssertionError: Timed out retrying after 30000ms: Expected to find element: `[data-test-action="View instances"]:not([disabled])`, but never found it. at Context.eval (webpack:///./support/selectors.ts:47:5)
When upgrading a HC from 4.13 to 4.14, after admin-acking the API deprecation check, the upgrade is still blocked by the ClusterVersionUpgradeble condition on the HC being Unknown. This is because the CVO in the guest cluster does not have an Upgradeable condition anymore.
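A hedged way to observe the mismatch from the management cluster (the namespace, HostedCluster name, and guest kubeconfig path are placeholders):
oc -n clusters get hostedcluster example -o jsonpath='{.status.conditions[?(@.type=="ClusterVersionUpgradeable")].status}{"\n"}'
oc --kubeconfig guest.kubeconfig get clusterversion version -o jsonpath='{.status.conditions[?(@.type=="Upgradeable")].status}{"\n"}'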
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
This is a clone of issue OCPBUGS-34590. The following is the description of the original issue:
—
Description of problem:
Storage degraded by VSphereProblemDetectorStarterStaticControllerDegraded during upgrade to 4.16.0-0.nightly
Version-Release number of selected component (if applicable):
How reproducible:
Once
Steps to Reproduce:
1.Run prow ci job: https://qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gs/qe-private-deck/logs/periodic-ci-openshift-openshift-tests-private-release-4.16-amd64-nightly-4.16-upgrade-from-stable-4.15-vsphere-ipi-disk-encryption-tang-fips-f28/1790991142867701760 2.Storage degraded by VSphereProblemDetectorStarterStaticControllerDegraded during uprading to 4.16.0-0.nightly from 4.15.13: Last Transition Time: 2024-05-16T09:35:05Z Message: VSphereProblemDetectorStarterStaticControllerDegraded: "vsphere_problem_detector/04_clusterrole.yaml" (string): client rate limiter Wait returned an error: context canceled VSphereProblemDetectorStarterStaticControllerDegraded: "vsphere_problem_detector/05_clusterrolebinding.yaml" (string): client rate limiter Wait returned an error: context canceled VSphereProblemDetectorStarterStaticControllerDegraded: "vsphere_problem_detector/10_service.yaml" (string): client rate limiter Wait returned an error: context canceled VSphereProblemDetectorStarterStaticControllerDegraded: Reason: VSphereProblemDetectorStarterStaticController_SyncError Status: True Type: Degraded 3.must-gather is available: https://gcsweb-qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/qe-private-deck/logs/periodic-ci-openshift-openshift-tests-private-release-4.16-amd64-nightly-4.16-upgrade-from-stable-4.15-vsphere-ipi-disk-encryption-tang-fips-f28/1790991142867701760/artifacts/vsphere-ipi-disk-encryption-tang-fips-f28/gather-must-gather/
Actual results:
Storage degraded by VSphereProblemDetectorStarterStaticControllerDegraded during upgrade to 4.16.0-0.nightly from 4.15.13
Expected results:
Upgrade should be successful
Additional info:
Please review the following PR: https://github.com/openshift/csi-driver-shared-resource/pull/177
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
This is a clone of issue OCPBUGS-36678. The following is the description of the original issue:
—
Description of problem:
Quick starts are no longer working in the kubevirt-plugin while in dev mode. The cause appears to be the use of different instances of QuickStartContext.
Version-Release number of selected component (if applicable):
4.16
How reproducible:
100%
Steps to Reproduce:
1. Run kubevirt-plugin in dev mode using the instructions in the plugin's README 2. Navigate to Virtualization > Overview 3. Click on a quick start in the QuickStarts section of the Getting Started card
Actual results:
The following error is thrown: setActiveQuickStart is not a function TypeError: setActiveQuickStart is not a function at onClick (http://localhost:9000/api/plugins/kubevirt-plugin/exposed-ClusterOverviewPage-chunk.js:8938:21) at onClick (http://localhost:9000/api/plugins/kubevirt-plugin/node_modules_patternfly_react-core_dist_esm_components_SimpleList_index_js-_c4a10-chunk.js:142:25) at Object.qe (http://localhost:9000/static/vendors~main-chunk-371c85e9324f56231546.min.js:117:16073) at Je (http://localhost:9000/static/vendors~main-chunk-371c85e9324f56231546.min.js:117:16227) at http://localhost:9000/static/vendors~main-chunk-371c85e9324f56231546.min.js:117:34214 at Sn (http://localhost:9000/static/vendors~main-chunk-371c85e9324f56231546.min.js:117:34308) at An (http://localhost:9000/static/vendors~main-chunk-371c85e9324f56231546.min.js:117:34722) at http://localhost:9000/static/vendors~main-chunk-371c85e9324f56231546.min.js:117:40370 at Re (http://localhost:9000/static/vendors~main-chunk-371c85e9324f56231546.min.js:117:116041) at http://localhost:9000/static/vendors~main-chunk-371c85e9324f56231546.min.js:117:36181
Expected results:
The quick start opens
Additional info:
Description of the problem:
Prepare the cluster for installation and add an applied configuration from API code.
When we try to install the cluster it goes back to ready as expected, but after we fix the configuration and try to install again it ALWAYS fails on the first install attempt (not related to time).
It only works on the second install attempt, without changing the configuration.
How reproducible:
Always
Steps to reproduce:
1.Prepare cluster for installation
2.Create invalid applied configuration to verify that preparing-for-installation returns to ready state as expected.
invalid_override_config = {
"capabilities":
}
3. Start installation , back to ready as expected.
4. fix applied configuration and try to install again -> Fails
Actual results:
On the first attempt after the config change the installation fails.
It works only on the second try.
Expected results:
This is a clone of issue OCPBUGS-41498. The following is the description of the original issue:
—
Description of problem:
The e2e test "upgrade CRD with deprecated version" in the test/e2e/installplan_e2e_test.go suite is flaking
Version-Release number of selected component (if applicable):
How reproducible:
Hard to reproduce, could be related to other tests running at the same time, or any number of things.
Steps to Reproduce:
It might be worthwhile trying to re-run the test multiple times against a ClusterBot, or OpenShift Local, cluster
Actual results:
Expected results:
Additional info:
Please review the following PR: https://github.com/openshift/csi-driver-nfs/pull/136
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
This is a clone of issue OCPBUGS-34656. The following is the description of the original issue:
—
knative-ci.feature test is failing with:
Logging in as kubeadmin Installing operator: "Red Hat OpenShift Serverless" Operator Red Hat OpenShift Serverless was not yet installed. Performing Serverless post installation steps User has selected namespace knative-serving 1) "before all" hook for "Create knative workload using Container image with extrenal registry on Add page: KN-05-TC05 (example #1)" 0 passing (3m) 1 failing 1) Perform actions on knative service and revision "before all" hook for "Create knative workload using Container image with extrenal registry on Add page: KN-05-TC05 (example #1)": AssertionError: Timed out retrying after 40000ms: Expected to find element: `[title="knativeservings.operator.knative.dev"]`, but never found it. Because this error occurred during a `before all` hook we are skipping all of the remaining tests. Although you have test retries enabled, we do not retry tests when `before all` or `after all` hooks fail at createKnativeServing (webpack:////go/src/github.com/openshift/console/frontend/packages/dev-console/integration-tests/support/pages/functions/knativeSubscriptions.ts:15:5) at performPostInstallationSteps (webpack:////go/src/github.com/openshift/console/frontend/packages/dev-console/integration-tests/support/pages/functions/installOperatorOnCluster.ts:176:26) at verifyAndInstallOperator (webpack:////go/src/github.com/openshift/console/frontend/packages/dev-console/integration-tests/support/pages/functions/installOperatorOnCluster.ts:221:2) at verifyAndInstallKnativeOperator (webpack:////go/src/github.com/openshift/console/frontend/packages/dev-console/integration-tests/support/pages/functions/installOperatorOnCluster.ts:231:27) at Context.eval (webpack:///./support/commands/hooks.ts:7:33) [mochawesome] Report JSON saved to /go/src/github.com/openshift/console/frontend/gui_test_screenshots/cypress_report_knative.json (Results) ┌────────────────────────────────────────────────────────────────────────────────────────────────┐ │ Tests: 16 │ │ Passing: 0 │ │ Failing: 1 │ │ Pending: 0 │ │ Skipped: 15 │ │ Screenshots: 1 │ │ Video: true │ │ Duration: 3 minutes, 8 seconds │ │ Spec Ran: knative-ci.feature │ └────────────────────────────────────────────────────────────────────────────────────────────────┘ (Screenshots) - /go/src/github.com/openshift/console/frontend/gui_test_screenshots/cypress/scree (1280x720) nshots/knative-ci.feature/Create knative workload using Container image with ext renal registry on Add page KN-05-TC05 (example #1) -- before all hook (failed).p ng
This is a clone of issue OCPBUGS-35397. The following is the description of the original issue:
—
Description of problem:
The runbook was added in https://issues.redhat.com/browse/MON-3862. The alert is more likely to fire in >=4.16.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
We have ClusterOperatorDown and ClusterOperatorDegraded in this space for ClusterOperator conditions. We should wire that up for ClusterVersion as well.
Description of problem:
After an upgrade from 4.13.x to 4.14.10, the workload images that the customer stored inside the internal registry are lost, leaving the application pods in a "Back-off pulling image" error. Even when pulling manually with podman, it fails with "manifest unknown" because the image can no longer be found in the registry. - This behavior was found and reproduced 100% on ARO clusters, where the internal registry is by default backed by the Storage Account created by the ARO RP service principal (the Containers blob service). - I do not know whether the same behavior occurs on non-managed Azure clusters or any other architecture.
Version-Release number of selected component (if applicable):
4.14.10
How reproducible:
100% with an ARO cluster (Managed cluster)
Steps to Reproduce: Attached.
The workaround found so far is to rebuild the apps or re-import the images. But those tasks are lengthy and costly, especially if it is a production cluster.
Description of problem:
oc-mirror looks up the wrong index.json path and fails when the ImageSetConfig contains an OCI FBC catalog
Version-Release number of selected component (if applicable):
oc-mirror version WARNING: This version information is deprecated and will be replaced with the output from --short. Use --output=yaml|json to get the full version. Client Version: version.Info{Major:"", Minor:"", GitVersion:"4.16.0-202403070215.p0.gc4f8295.assembly.stream.el9-c4f8295", GitCommit:"c4f829512107f7d0f52a057cd429de2030b9b3b3", GitTreeState:"clean", BuildDate:"2024-03-07T03:46:24Z", GoVersion:"go1.21.7 (Red Hat 1.21.7-1.el9) X:strictfipsruntime", Compiler:"gc", Platform:"linux/amd64"}
How reproducible:
always
Steps to Reproduce:
1) Copy the operator as OCI format to localhost: `skopeo copy docker://registry.redhat.io/redhat/redhat-operator-index:v4.12 oci:///app1/noo/redhat-operator-index --remove-signatures` 2) Use following imagesetconfigure for mirror: cat config-oci.yaml apiVersion: mirror.openshift.io/v1alpha2 kind: ImageSetConfiguration storageConfig: registry: imageURL: registryhost:5000/metadata:latest mirror: additionalImages: - name: quay.io/openshifttest/bench-army-knife@sha256:078db36d45ce0ece589e58e8de97ac1188695ac155bc668345558a8dd77059f6 platform: channels: - name: stable-4.12 type: ocp graph: true operators: - catalog: registry.redhat.io/redhat/redhat-operator-index:v4.12 packages: - name: elasticsearch-operator - catalog: oci:///app1/noo/redhat-operator-index packages: - name: cluster-kube-descheduler-operator - name: odf-operator `oc-mirror --config config-oci.yaml file://outoci --v2`
Actual results:
2) In the configuration we are use oci:///app1/noo/redhat-operator-index, so should not to check index.json under outoci/working-dir/operator-images/redhat-operator-index/index.json oc-mirror --config config-oci.yaml file://outoci --v2 --v2 flag identified, flow redirected to the oc-mirror v2 version. PLEASE DO NOT USE that. V2 is still under development and it is not ready to be used. 2024/03/25 06:23:06 [INFO] : mode mirrorToDisk 2024/03/25 06:23:06 [INFO] : local storage registry will log to /app1/0321/outoci/working-dir/logs/registry.log 2024/03/25 06:23:06 [INFO] : starting local storage on localhost:55000 2024/03/25 06:23:06 [INFO] : detected minimum version as 4.12.53 2024/03/25 06:23:06 [INFO] : detected minimum version as 4.12.53 2024/03/25 06:23:07 [INFO] : Found update 4.12.53 2024/03/25 06:23:07 [INFO] : signature b584f5458fb946115b0cf0f1793dc9224c5e6a4567e74018f0590805a03eb523 2024/03/25 06:23:07 [WARN] : signature for b584f5458fb946115b0cf0f1793dc9224c5e6a4567e74018f0590805a03eb523 not in cache 2024/03/25 06:23:07 [INFO] : content {"critical": {"image": {"docker-manifest-digest": "sha256:b584f5458fb946115b0cf0f1793dc9224c5e6a4567e74018f0590805a03eb523"}, "type": "atomic container signature", "identity": {"docker-reference": "quay.io/openshift-release-dev/ocp-release:4.12.53-x86_64"}}, "optional": {"creator": "Red Hat OpenShift Signing Authority 0.0.1"}} 2024/03/25 06:23:07 [INFO] : image found : quay.io/openshift-release-dev/ocp-release:4.12.53-x86_64 2024/03/25 06:23:07 [INFO] : public Key : 567E347AD0044ADE55BA8A5F199E2F91FD431D51 2024/03/25 06:23:07 [INFO] : copying quay.io/openshift-release-dev/ocp-release:4.12.53-x86_64 2024/03/25 06:23:12 [INFO] : copying cincinnati response to outoci/working-dir/release-filters 2024/03/25 06:23:12 [INFO] : creating graph data image 2024/03/25 06:23:15 [INFO] : graph image created and pushed to cache. 2024/03/25 06:23:15 [INFO] : total release images to copy 185 2024/03/25 06:23:15 [INFO] : copying operator image registry.redhat.io/redhat/redhat-operator-index:v4.12 2024/03/25 06:23:18 [INFO] : manifest 7b9891532a76194c1b18698518abad9be4aca7f1152ac73f450aa8bfadef538f 2024/03/25 06:23:18 [INFO] : label /configs 2024/03/25 06:23:36 [INFO] : copying operator image oci:///app1/noo/redhat-operator-index error closing log file registry.log: close outoci/working-dir/logs/registry.log: file already closed 2024/03/25 06:23:36 [ERROR] : open outoci/working-dir/operator-images/redhat-operator-index/index.json: no such file or directory
Expected results:
2) oc-mirror should find the correct path for index.json and not fail
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1.
2.
3.
Actual results:
Expected results:
Additional info:
Please review the following PR: https://github.com/openshift/openshift-apiserver/pull/415
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
The CVO-managed manifests that CMO ships lack capability annotations as defined in https://github.com/openshift/enhancements/blob/master/enhancements/installer/component-selection.md#manifest-annotations.
The dashboards should be tied to the console capability so that when CMO deploys on a cluster without the Console capability, CVO doesn't deploy the dashboards configmap.
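A minimal sketch of what the annotation could look like on one of the shipped dashboard manifests; the file and ConfigMap names are hypothetical, while the annotation key and the Console capability name come from the enhancement linked above:
cat <<'EOF' > manifests/example-dashboard-configmap.yaml   # hypothetical manifest path
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-dashboard-example          # hypothetical name
  namespace: openshift-config-managed
  annotations:
    capability.openshift.io/name: Console  # ties the manifest to the Console capability
data: {}
EOF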
Description of problem:
The agent-based installer does not support the TechPreviewNoUpgrade featureSet, and by extension nor does it support any of the features gated by it. Because of this, there is no warning about one of these features being specified - we expect the TechPreviewNoUpgrade feature gate to error out when any of them are used.
However, we don't warn about TechPreviewNoUpgrade itself being ignored, so if the user does specify it then they can use some of these non-supported features without being warned that their configuration is ignored.
We should fail with an error when TechPreviewNoUpgrade is specified, until such time as AGENT-554 is implemented.
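For illustration, this is the sort of configuration that should now make the agent-based installer fail fast instead of silently ignoring the setting (the install-config path and working directory are placeholders):
cat >> install-config.yaml <<'EOF'
featureSet: TechPreviewNoUpgrade
EOF
openshift-install agent create image --dir .   # expected, per this issue, to exit with an explicit error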
Description of problem:
HyperShift-managed components use the default RevisionHistoryLimit of 10. This significantly impacts etcd load and scalability on the management cluster.
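An illustrative patch showing the kind of change this implies; the deployment name, namespace variable, and the value 2 are assumptions, and the real fix is to set the field in the manifests HyperShift renders rather than patching live Deployments:
oc -n "${HCP_NAMESPACE}" patch deployment kube-apiserver --type merge -p '{"spec":{"revisionHistoryLimit":2}}'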
Version-Release number of selected component (if applicable):
4.9, 4.10, 4.11, 4.12, 4.13, 4.14, 4.15, 4.16
How reproducible:
100% (may vary depending on resource availablility on management cluster)
Steps to Reproduce:
1. Create 375+ HostedCluster 2. Observe etcd performance on management cluster 3.
Actual results:
etcd hitting storage space limits
Expected results:
Able to manage HyperShift control planes at scale (375+ HostedClusters)
Additional info:
Description of problem:
In the OCP console, if we edit a VMware-related parameter, add the same value back again, and click Save, the nodes are rebooted
Version-Release number of selected component (if applicable):
4.14
How reproducible:
Steps to Reproduce:
1. On any 4.14+ cluster go to ocp console page 2. Click on the vmware plugin 3. Edit any parameter and add the same value again. 4. Click on save
Actual results:
The nodes reboot to pick up the change
Expected results:
nodes should not reboot if the same values are entered
Additional info:
Description of problem:
When mirroring content with oc-mirror v2, some required images for OpenShift installation are missing from the registry
Version-Release number of selected component (if applicable):
OpenShift installer version: v4.15.17 [admin@registry ~]$ oc-mirror version WARNING: This version information is deprecated and will be replaced with the output from --short. Use --output=yaml|json to get the full version. Client Version: version.Info{Major:"", Minor:"", GitVersion:"4.16.0-202406131906.p0.g7c0889f.assembly.stream.el9-7c0889f", GitCommit:"7c0889f4bd343ccaaba5f33b7b861db29b1e5e49", GitTreeState:"clean", BuildDate:"2024-06-13T22:07:44Z", GoVersion:"go1.21.9 (Red Hat 1.21.9-1.el9_4) X:strictfipsruntime", Compiler:"gc", Platform:"linux/amd64"}
How reproducible:
Use oc-mirror v2 to mirror content. $ cat imageset-config-ocmirrorv2-v4.15.yaml kind: ImageSetConfiguration apiVersion: mirror.openshift.io/v2alpha1 mirror: platform: channels: - name: stable-4.15 minVersion: 4.15.17 type: ocp operators: - catalog: registry.redhat.io/redhat/redhat-operator-index:v4.15 full: false packages: - name: ansible-automation-platform-operator - name: cluster-logging - name: datagrid - name: devworkspace-operator - name: multicluster-engine - name: multicluster-global-hub-operator-rh - name: odf-operator - name: quay-operator - name: rhbk-operator - name: skupper-operator - name: servicemeshoperator - name: submariner - name: lvms-operator - name: odf-lvm-operator - catalog: registry.redhat.io/redhat/certified-operator-index:v4.15 full: false packages: - name: crunchy-postgres-operator - name: nginx-ingress-operator - catalog: registry.redhat.io/redhat/community-operator-index:v4.15 full: false packages: - name: argocd-operator - name: cockroachdb - name: infinispan - name: keycloak-operator - name: mariadb-operator - name: nfs-provisioner-operator - name: postgresql - name: skupper-operator additionalImages: - name: registry.redhat.io/ubi8/ubi:latest - name: registry.access.redhat.com/ubi8/nodejs-18 - name: registry.redhat.io/openshift4/ose-prometheus:v4.14.0 - name: registry.redhat.io/service-interconnect/skupper-router-rhel9:2.4.3 - name: registry.redhat.io/service-interconnect/skupper-config-sync-rhel9:1.4.4 - name: registry.redhat.io/service-interconnect/skupper-service-controller-rhel9:1.4.4 - name: registry.redhat.io/service-interconnect/skupper-flow-collector-rhel9:1.4.4 helm: {} Run oc-mirror using the command: oc-mirror --v2 \ -c imageset-config-ocmirrorv2-v4.15.yaml \ --workspace file:////data/oc-mirror/workdir/ \ docker://registry.local.momolab.io:8443/mirror
Steps to Reproduce:
1. Install Red Hat Quay mirror registry 2. Mirror using oc-mirror v2 command and steps above 3. Install OpenShift
Actual results:
Installation fails
Expected results:
Installation succeeds
Additional info:
## Check logs on coreos: [core@sno1 ~]$ journalctl -b -f -u release-image.service -u bootkube.service Jul 02 03:46:22 sno1.local.momolab.io bootkube.sh[13486]: Error: initializing source docker://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:f36e139f75b179ffe40f5a234a0cef3f0a051cc38cbde4b262fb2d96606acc06: (Mirrors also failed: [registry.local.momolab.io:8443/mirror/openshift/release@sha256:f36e139f75b179ffe40f5a234a0cef3f0a051cc38cbde4b262fb2d96606acc06: reading manifest sha256:f36e139f75b179ffe40f5a234a0cef3f0a051cc38cbde4b262fb2d96606acc06 in registry.local.momolab.io:8443/mirror/openshift/release: name unknown: repository not found]): quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:f36e139f75b179ffe40f5a234a0cef3f0a051cc38cbde4b262fb2d96606acc06: reading manifest sha256:f36e139f75b179ffe40f5a234a0cef3f0a051cc38cbde4b262fb2d96606acc06 in quay.io/openshift-release-dev/ocp-v4.0-art-dev: unauthorized: access to the requested resource is not authorized ## Check if that image was pulled: [admin@registry ~]$ cat /data/oc-mirror/workdir/working-dir/dry-run/mapping.txt | grep -i f36e139f75b179ffe40f5a234a0cef3f0a051cc38cbde4b262fb2d96606acc06 docker://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:f36e139f75b179ffe40f5a234a0cef3f0a051cc38cbde4b262fb2d96606acc06=docker://registry.local.momolab.io:8443/mirror/openshift-release-dev/ocp-v4.0-art-dev@sha256:f36e139f75b179ffe40f5a234a0cef3f0a051cc38cbde4b262fb2d96606acc06 ## Problem is, it doesn't exist on the registry (also via UI): [admin@registry ~]$ podman pull registry.local.momolab.io:8443/mirror/openshift-release-dev/ocp-v4.0-art-dev@sha256:f36e139f75b179ffe40f5a234a0cef3f0a051cc38cbde4b262fb2d96606acc06 Trying to pull registry.local.momolab.io:8443/mirror/openshift-release-dev/ocp-v4.0-art-dev@sha256:f36e139f75b179ffe40f5a234a0cef3f0a051cc38cbde4b262fb2d96606acc06... Error: initializing source docker://registry.local.momolab.io:8443/mirror/openshift-release-dev/ocp-v4.0-art-dev@sha256:f36e139f75b179ffe40f5a234a0cef3f0a051cc38cbde4b262fb2d96606acc06: reading manifest sha256:f36e139f75b179ffe40f5a234a0cef3f0a051cc38cbde4b262fb2d96606acc06 in registry.local.momolab.io:8443/mirror/openshift-release-dev/ocp-v4.0-art-dev: manifest unknown
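To narrow down where the image went missing, it can help to check whether the failing digest is part of the release payload and whether it exists in the mirror. A hedged sketch (the release pullspec tag format is an assumption; the registry hostname is taken from the output above):
oc adm release info quay.io/openshift-release-dev/ocp-release:4.15.17-x86_64 --pullspecs | grep f36e139f75b179
skopeo inspect --raw docker://registry.local.momolab.io:8443/mirror/openshift/release@sha256:f36e139f75b179ffe40f5a234a0cef3f0a051cc38cbde4b262fb2d96606acc06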
Description of problem:
When I create a pod with empty security context as a user that has access to all SCCs, the SCC annotation shows "privileged"
Version-Release number of selected component (if applicable):
4.12
How reproducible:
100%
Steps to Reproduce:
1. create a bare pod with an empty security context 2. look at the "openshift.io/scc" annotation
Actual results:
privileged
Expected results:
anyuid
Additional info:
kind: Pod
apiVersion: v1
metadata:
  name: mypod
spec:
  restartPolicy: Never
  containers:
  - name: fedora
    image: fedora:latest
    command:
    - sleep
    args:
    - "infinity"
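A quick way to reproduce the check, assuming the manifest above is saved as mypod.yaml (a sketch; names match the manifest):
oc create -f mypod.yaml
oc get pod mypod -o jsonpath='{.metadata.annotations.openshift\.io/scc}{"\n"}'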
Description of problem:
Checked with 4.15.0-0.nightly-2023-12-11-033133: there are no PodMetrics/NodeMetrics resources on the server
$ oc get clusterversion NAME VERSION AVAILABLE PROGRESSING SINCE STATUS version 4.15.0-0.nightly-2023-12-11-033133 True False 122m Cluster version is 4.15.0-0.nightly-2023-12-11-033133 $ oc api-resources | grep -i metrics nodes metrics.k8s.io/v1beta1 false NodeMetrics pods metrics.k8s.io/v1beta1 true PodMetrics $ oc explain PodMetrics the server doesn't have a resource type "PodMetrics" $ oc explain NodeMetrics the server doesn't have a resource type "NodeMetrics" $ oc get NodeMetrics error: the server doesn't have a resource type "NodeMetrics" $ oc get PodMetrics -A error: the server doesn't have a resource type "PodMetrics"
no issue with 4.14.0-0.nightly-2023-12-11-135902
$ oc get clusterversion NAME VERSION AVAILABLE PROGRESSING SINCE STATUS version 4.14.0-0.nightly-2023-12-11-135902 True False 88m Cluster version is 4.14.0-0.nightly-2023-12-11-135902 $ oc api-resources | grep -i metrics nodes metrics.k8s.io/v1beta1 false NodeMetrics pods metrics.k8s.io/v1beta1 true PodMetrics $ oc explain PodMetrics GROUP: metrics.k8s.io KIND: PodMetrics VERSION: v1beta1DESCRIPTION: PodMetrics sets resource usage metrics of a pod. ... $ oc explain NodeMetrics GROUP: metrics.k8s.io KIND: NodeMetrics VERSION: v1beta1DESCRIPTION: NodeMetrics sets resource usage metrics of a node. ... $ oc get PodMetrics -A NAMESPACE NAME CPU MEMORY WINDOW openshift-apiserver apiserver-65f777466-4m8nj 9m 297512Ki 5m0s openshift-apiserver apiserver-65f777466-g7n72 10m 313308Ki 5m0s openshift-apiserver apiserver-65f777466-xzd8l 12m 293008Ki 5m0s openshift-apiserver-operator openshift-apiserver-operator-54945b8bbd-bxkcj 3m 119264Ki 5m0s ... $ oc get NodeMetrics NAME CPU MEMORY WINDOW ip-10-0-20-163.us-east-2.compute.internal 765m 8349848Ki 5m0s ip-10-0-22-189.us-east-2.compute.internal 388m 5363132Ki 5m0s ip-10-0-41-231.us-east-2.compute.internal 1274m 7243548Ki 5m0s ...
Version-Release number of selected component (if applicable):
4.15.0-0.nightly-2023-12-11-033133
How reproducible:
always
Steps to Reproduce:
1. see the description
Actual results:
4.15 server does not have PodMetrics/NodeMetrics
Expected results:
The server should have the PodMetrics/NodeMetrics resources
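A useful first check on the 4.15 cluster is whether the metrics.k8s.io APIService is registered and Available, since the PodMetrics/NodeMetrics resources are served by it (a hedged sketch):
oc get apiservice v1beta1.metrics.k8s.io
oc get apiservice v1beta1.metrics.k8s.io -o jsonpath='{.status.conditions[?(@.type=="Available")].status}{"\n"}'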
Description of problem:
The customer is trying to create an RHOCP cluster with the domain name 123mydomain.com. In the Red Hat Hybrid Cloud Console the customer is getting the error below: ~~~ Failed to update the cluster DNS format mismatch: 123mydomain.com domain name is not valid ~~~ *** As per the regex check described in KCS https://access.redhat.com/solutions/5517531, a domain name starting with a numeric character is valid, e.g. 123mydomain.com. The regex used there to check domain name validity is: [a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)* *** From the validations the assisted installer does, as per https://github.com/openshift/assisted-service/blob/master/pkg/validations/validations.go, the below regexps are applied: baseDomainRegex = `^[a-z]([\-]*[a-z\d]+)+$` dnsNameRegex = `^([a-z]([\-]*[a-z\d]+)*\.)+[a-z\d]+([\-]*[a-z\d]+)+$` wildCardDomainRegex = `^(validateNoWildcardDNS\.).+\.?$` hostnameRegex = `^[a-z0-9][a-z0-9\-\.]{0,61}[a-z0-9]$` installerArgsValuesRegex = `^[A-Za-z0-9@!#$%*()_+-=//.,";':{}\[\]]+$` This means the domain name must start with a letter [a-z].
Version-Release number of selected component (if applicable):
How reproducible:
100%
Steps to Reproduce:
1. Open RedHat Hybrid Cloud Console 2. Go to Clusters 3. Create Cluster 4. Go to Datacenter 5. Under Assisted Installer -> Create Cluster 6. Enter Cluster Name mytestcluster and enter Domain Name 123mydomain.com 7. Click on Next
Actual results:
A domain name with a numeric character first and then letters, e.g. 123mydomain.com, is shown as invalid in the Red Hat Hybrid Cloud Console Assisted Installer, throwing the error: Failed to create new cluster DNS format mismatch: 123mydomain.com domain name is not valid
Expected results:
A domain name with a numeric character first and then letters, e.g. 123mydomain.com, should be accepted as valid in the OpenShift Red Hat Hybrid Cloud Console Assisted Installer
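A quick local illustration of the dnsNameRegex behavior described above (a sketch using GNU grep -P; this is not the assisted-service code itself):
re='^([a-z]([\-]*[a-z\d]+)*\.)+[a-z\d]+([\-]*[a-z\d]+)+$'
echo "123mydomain.com" | grep -Pq "$re" && echo accepted || echo rejected   # rejected
echo "mydomain123.com" | grep -Pq "$re" && echo accepted || echo rejected   # accepted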
This fix contains the following changes coming from updated version of kubernetes up to v1.29.5:
Changelog:
v1.29.5: https://github.com/kubernetes/kubernetes/blob/release-1.29/CHANGELOG/CHANGELOG-1.29.md#changelog-since-v1294
This is a clone of issue OCPBUGS-36619. The following is the description of the original issue:
—
The labels added by PAC have been deprecated and added to PLR annotations. So, use annotations to get the value in the repository list page, repository PLRs list page, and on the PLR details page.
Description of problem:
When running the 4.15 installer full-function test, the arm64 instance family below was detected and verified; it needs to be appended to the installer doc[1]: - standardBpsv2Family [1] https://github.com/openshift/installer/blob/master/docs/user/azure/tested_instance_types_aarch64.md
Version-Release number of selected component (if applicable):
4.15
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Since approximately 12 April, all FIPS CI is broken, with the authentication operator failing to come up.
The oauth-openshift containers are failing with the message:
Copying system trust bundle FIPS mode is enabled, but the required OpenSSL backend is unavailable
This is due to https://github.com/openshift/oauth-server/commit/8a6f3a11a4b25e3e22152252720490b9f355ce53 changing the base image to RHEL 9 while leaving the builder image as RHEL 8. When the binary starts, it cannot find the RHEL 8 OpenSSL it was linked against.
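One way to confirm this kind of builder/base mismatch is to check the OS release of the image shipped in the payload; a hedged sketch (the payload tag name oauth-server and the RELEASE_PULLSPEC variable are assumptions):
IMG=$(oc adm release info --image-for=oauth-server "$RELEASE_PULLSPEC")
podman run --rm --entrypoint cat "$IMG" /etc/os-release | grep -E '^(ID|VERSION_ID)='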
This is a clone of OCPBUGS-35335.
Description of problem:
The user.openshift.io and oauth.openshift.io APIs are not available in an external OIDC cluster, which causes all the common pull/push of blobs from/to the image registry to fail.
Version-Release number of selected component (if applicable):
4.15.15
How reproducible:
always
Steps to Reproduce:
1.Create a ROSA HCP cluster which configured external oidc users 2.Push data to image registry under a project oc new-project wxj1 oc new-build httpd~https://github.com/openshift/httpd-ex.git 3.
Actual results:
$ oc logs -f build/httpd-ex-1 Cloning "https://github.com/openshift/httpd-ex.git" ... Commit: 1edee8f58c0889616304cf34659f074fda33678c (Update httpd.json) Author: Petr Hracek <phracek@redhat.com> Date: Wed Jun 5 13:00:09 2024 +0200time="2024-06-12T09:55:13Z" level=info msg="Not using native diff for overlay, this may cause degraded performance for building images: kernel has CONFIG_OVERLAY_FS_REDIRECT_DIR enabled"I0612 09:55:13.306937 1 defaults.go:112] Defaulting to storage driver "overlay" with options [mountopt=metacopy=on].Caching blobs under "/var/cache/blobs".Trying to pull image-registry.openshift-image-registry.svc:5000/openshift/httpd@sha256:765aa645587f34e310e49db7cdc97e82d34122adb0b604eea891e0f98050aa77...Warning: Pull failed, retrying in 5s ...Trying to pull image-registry.openshift-image-registry.svc:5000/openshift/httpd@sha256:765aa645587f34e310e49db7cdc97e82d34122adb0b604eea891e0f98050aa77...Warning: Pull failed, retrying in 5s ...Trying to pull image-registry.openshift-image-registry.svc:5000/openshift/httpd@sha256:765aa645587f34e310e49db7cdc97e82d34122adb0b604eea891e0f98050aa77...Warning: Pull failed, retrying in 5s ...error: build error: After retrying 2 times, Pull image still failed due to error: unauthorized: unable to validate token: NotFound oc logs -f deploy/image-registry -n openshift-image-registry time="2024-06-12T09:55:13.36003996Z" level=error msg="invalid token: the server could not find the requested resource (get users.user.openshift.io ~)" go.version="go1.20.12 X:strictfipsruntime" http.request.host="image-registry.openshift-image-registry.svc:5000" http.request.id=0c380b81-99d4-4118-8de3-407706e8767c http.request.method=GET http.request.remoteaddr="10.130.0.35:50550" http.request.uri="/openshift/token?account=serviceaccount&scope=repository%3Aopenshift%2Fhttpd%3Apull" http.request.useragent="containers/5.28.0 (github.com/containers/image)"
Expected results:
Pulling/pushing blobs from/to the image registry should work on an external OIDC cluster
Additional info:
The new-in-4.15 ClusterVersion spec.signatureStores should implement the ca property.
4.15 and 4.15.
Every time, for TechPreviewNoUpgrade clusters where signatureStores exists.
1. Install a TechPreviewNoUpgrade cluster.
2. Set up a signature store in the cluster behind the self-signed ingress/router CA:
FIXME
3. Patch ClusterVersion to ask the CVO to use that store.
FIXME
4. Ask the cluster to update to a release whose signature is in the custom store:
FIXME
FIXME
The update is accepted and begins rolling out, as shown by oc adm upgrade. Whether the update successfully completes or not is not relevant.
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
This is a clone of issue OCPBUGS-35874. The following is the description of the original issue:
—
Description of problem:
The ovnkube-sbdb route removal is missing a management cluster capabilities check and thus fails on a Kubernetes based management cluster.
Version-Release number of selected component (if applicable):
4.15.z, 4.16.0, 4.17.0
How reproducible:
Always
Steps to Reproduce:
Deploy an OpenShift version 4.16.0-rc.6 cluster control plane using HyperShift on a Kubernetes based management cluster.
Actual results:
Cluster control plane deployment fails because the cluster-network-operator pod is stuck in Init state due to the following error: {"level":"error","ts":"2024-06-19T20:51:37Z","msg":"Reconciler error","controller":"hostedcontrolplane","controllerGroup":"hypershift.openshift.io","controllerKind":"HostedControlPlane","HostedControlPlane":{"name":"cppjslm10715curja3qg","namespace":"master-cppjslm10715curja3qg"},"namespace":"master-cppjslm10715curja3qg","name":"cppjslm10715curja3qg","reconcileID":"037842e8-82ea-4f6e-bf28-deb63abc9f22","error":"failed to update control plane: failed to reconcile cluster network operator: failed to clean up ovnkube-sbdb route: error getting *v1.Route: no matches for kind \"Route\" in version \"route.openshift.io/v1\"","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/hypershift/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:329\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/hypershift/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/hypershift/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:227"}
Expected results:
Cluster control plane deployment succeeds.
Additional info:
https://ibm-argonauts.slack.com/archives/C01C8502FMM/p1718832205747529
Any upgrade up to 4.15.{current-z}
Any non-Microshift cluster with an operator installed via OLM before upgrade to 4.15. After upgrading to 4.15, re-installing a previously uninstalled operator may also cause this issue.
OLM Operators can't be upgraded and may incorrectly report failed status.
Delete the resources associated with the OLM installation related to the failure message in the olm-operator.
A failure message similar to this may appear on the CSV:
InstallComponentFailed install strategy failed: rolebindings.rbac.authorization.k8s.io "openshift-gitops-operator-controller-manager-service-auth-reader" already exists
The following resource types have been observed to encounter this issue and should be safe to delete:
Under no circumstances should a user delete a CustomResourceDefinition (CRD) if the same error occurs and names such a resource, as data loss may occur. Note that we have not seen this type of resource named in the error from any of our users so far.
Labeling the problematic resources with olm.managed: "true" then restarting the olm-operator pod in the openshift-operator-lifecycle-manager namespace may also resolve the issue if the resource appears risky to delete.
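A hedged sketch of that workaround for the rolebinding named in the example failure message above (the resource namespace is a placeholder, and the app=olm-operator pod label is an assumption):
oc -n <namespace-of-the-resource> label rolebinding openshift-gitops-operator-controller-manager-service-auth-reader olm.managed=true --overwrite
oc -n openshift-operator-lifecycle-manager delete pod -l app=olm-operator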
Yes, functionality which worked in 4.14 may break after upgrading to 4.15. Not a regression; this is a new issue related to performance improvements added to OLM in 4.15.
https://issues.redhat.com/browse/OCPBUGS-24009
Description of problem:
After the migration work completes, the "pod-identity-webhook" deployment is not present in the "openshift-cloud-credential-operator" namespace.
Version-Release number of selected component (if applicable):
4.16
How reproducible:
Always
Steps to Reproduce:
1.Prepare an Azure OpenShift cluster. 2.Migration to Azure AD workload Identity using procedure https://github.com/openshift/cloud-credential-operator/blob/master/docs/azure_workload_identity.md#steps-to-in-place-migrate-an-openshift-cluster-to-azure-ad-workload-identity. 3.
Actual results:
Azure pod identity webhook is not being created. [hmx@fedora CCO]$ oc get po -n openshift-cloud-credential-operator NAME READY STATUS RESTARTS AGE cloud-credential-operator-78b94ffb4-587rh 2/2 Running 0 3h7m
Expected results:
Additional info:
Tested migration to Azure AD workload Identity on following Azure cluster type: 1. Default public Azure cluster. 2. Single-node cluster. 3. Azure private cluster. 4. Disconnected Azure cluster. This issue exists in all of the above cluster types.
Description of problem:
Currently the console frontend and backend use the OpenShift-centric UserKind type. In order for the console to work without the OAuth server, i.e. with external OIDC, it needs to use the Kubernetes UserInfo type, which is retrieved by querying the SelfSubjectReview API.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Console is not working with external OIDC provider
Expected results:
Console will be working with external OIDC provider
Additional info:
This is mainly an API change.
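For reference, the SelfSubjectReview API is what recent oc/kubectl versions use for `oc auth whoami`; a minimal sketch of exercising it directly (assuming a recent client):
oc auth whoami
# or POST a SelfSubjectReview directly:
oc create --raw /apis/authentication.k8s.io/v1/selfsubjectreviews -f <(echo '{"apiVersion":"authentication.k8s.io/v1","kind":"SelfSubjectReview"}')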
This is a clone of issue OCPBUGS-33717. The following is the description of the original issue:
—
Description of problem:
Starting with OCP 4.14, we have decided to start using OCP's own "bridge" CNI build instead of our "cnv-bridge" rebuild. To make sure that current users of "cnv-bridge" don't have to change their configuration, we kept "cnv-bridge" as a symlink to "bridge". While the old name still functions, we should make an effort to move users to "bridge". To do that, we can start by changing UI so it generates NADs of the type "bridge" instead of "cnv-bridge".
Version-Release number of selected component (if applicable):
4.14
How reproducible:
Always
Steps to Reproduce:
1. Use the NetworkAttachmentDefinition dialog to create a network of type bridge 2. Read the generated yaml
Actual results:
It has "type": "cnv-bridge"
Expected results:
It should have "type": "bridge"
Additional info:
The same should be done to any instance of "cnv-tuning" by changing it to "tuning".
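A hedged example of the NAD the dialog should generate after the change (names and the bridge device are illustrative):
cat <<'EOF' | oc apply -f -
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: example-bridge
  namespace: default
spec:
  config: '{ "cniVersion": "0.3.1", "name": "example-bridge", "type": "bridge", "bridge": "br1", "ipam": {} }'
EOF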
This is a clone of issue OCPBUGS-23758. The following is the description of the original issue:
—
When switching from ipForwarding: Global to Restricted, sysctl settings are not adjusted
Switch from:
# oc edit network.operator/cluster apiVersion: operator.openshift.io/v1 kind: Network metadata: annotations: networkoperator.openshift.io/ovn-cluster-initiator: 10.19.1.66 creationTimestamp: "2023-11-22T12:14:46Z" generation: 207 name: cluster resourceVersion: "235152" uid: 225d404d-4e26-41bf-8e77-4fc44948f239 spec: clusterNetwork: - cidr: 10.128.0.0/14 hostPrefix: 23 defaultNetwork: ovnKubernetesConfig: egressIPConfig: {} gatewayConfig: ipForwarding: Global (...)
To:
# oc edit network.operator/cluster apiVersion: operator.openshift.io/v1 kind: Network metadata: annotations: networkoperator.openshift.io/ovn-cluster-initiator: 10.19.1.66 creationTimestamp: "2023-11-22T12:14:46Z" generation: 207 name: cluster resourceVersion: "235152" uid: 225d404d-4e26-41bf-8e77-4fc44948f239 spec: clusterNetwork: - cidr: 10.128.0.0/14 hostPrefix: 23 defaultNetwork: ovnKubernetesConfig: egressIPConfig: {} gatewayConfig: ipForwarding: Restricted
You'll see that the pods are updated:
# oc get pods -o yaml -n openshift-ovn-kubernetes ovnkube-node-fnl9z | grep sysctl -C10 fi admin_network_policy_enabled_flag= if [[ "false" == "true" ]]; then admin_network_policy_enabled_flag="--enable-admin-network-policy" fi # If IP Forwarding mode is global set it in the host here. ip_forwarding_flag= if [ "Restricted" == "Global" ]; then sysctl -w net.ipv4.ip_forward=1 sysctl -w net.ipv6.conf.all.forwarding=1 else ip_forwarding_flag="--disable-forwarding" fi NETWORK_NODE_IDENTITY_ENABLE= if [[ "true" == "true" ]]; then NETWORK_NODE_IDENTITY_ENABLE=" --bootstrap-kubeconfig=/var/lib/kubelet/kubeconfig --cert-dir=/etc/ovn/ovnkube-node-certs --cert-duration=24h
And that ovnkube correctly takes the settings:
# ps aux | grep disable-for root 74963 0.3 0.0 8085828 153464 ? Ssl Nov22 3:38 /usr/bin/ovnkube --init-ovnkube-controller master1.site1.r450.org --init-node master1.site1.r450.org --config-file=/run/ovnkube-config/ovnkube.conf --ovn-empty-lb-events --loglevel 4 --inactivity-probe=180000 --gateway-mode shared --gateway-interface br-ex --metrics-bind-address 127.0.0.1:29103 --ovn-metrics-bind-address 127.0.0.1:29105 --metrics-enable-pprof --metrics-enable-config-duration --export-ovs-metrics --disable-snat-multiple-gws --enable-multi-network --enable-multicast --zone master1.site1.r450.org --enable-interconnect --acl-logging-rate-limit 20 --enable-multi-external-gateway=true --disable-forwarding --bootstrap-kubeconfig=/var/lib/kubelet/kubeconfig --cert-dir=/etc/ovn/ovnkube-node-certs --cert-duration=24h root 2096007 0.0 0.0 3880 2144 pts/0 S+ 10:07 0:00 grep --color=auto disable-for
But sysctls are never restricted:
[root@master1 ~]# sysctl -a | grep forward net.ipv4.conf.0eca9d9e7fd3231.bc_forwarding = 0 net.ipv4.conf.0eca9d9e7fd3231.forwarding = 1 net.ipv4.conf.0eca9d9e7fd3231.mc_forwarding = 0 net.ipv4.conf.21a32cf76c3bcdf.bc_forwarding = 0 net.ipv4.conf.21a32cf76c3bcdf.forwarding = 1 net.ipv4.conf.21a32cf76c3bcdf.mc_forwarding = 0 net.ipv4.conf.22f9bca61beeaba.bc_forwarding = 0 net.ipv4.conf.22f9bca61beeaba.forwarding = 1 net.ipv4.conf.22f9bca61beeaba.mc_forwarding = 0 net.ipv4.conf.2ee438a7201c1f7.bc_forwarding = 0 net.ipv4.conf.2ee438a7201c1f7.forwarding = 1 net.ipv4.conf.2ee438a7201c1f7.mc_forwarding = 0 net.ipv4.conf.3560ce219f7b591.bc_forwarding = 0 net.ipv4.conf.3560ce219f7b591.forwarding = 1 net.ipv4.conf.3560ce219f7b591.mc_forwarding = 0 net.ipv4.conf.507c81eb9944c2e.bc_forwarding = 0 net.ipv4.conf.507c81eb9944c2e.forwarding = 1 net.ipv4.conf.507c81eb9944c2e.mc_forwarding = 0 net.ipv4.conf.6278633ca74482f.bc_forwarding = 0 net.ipv4.conf.6278633ca74482f.forwarding = 1 net.ipv4.conf.6278633ca74482f.mc_forwarding = 0 net.ipv4.conf.68b572ce18f3b82.bc_forwarding = 0 net.ipv4.conf.68b572ce18f3b82.forwarding = 1 net.ipv4.conf.68b572ce18f3b82.mc_forwarding = 0 net.ipv4.conf.7291c80dd47a6f3.bc_forwarding = 0 net.ipv4.conf.7291c80dd47a6f3.forwarding = 1 net.ipv4.conf.7291c80dd47a6f3.mc_forwarding = 0 net.ipv4.conf.76abdac44c6aee7.bc_forwarding = 0 net.ipv4.conf.76abdac44c6aee7.forwarding = 1 net.ipv4.conf.76abdac44c6aee7.mc_forwarding = 0 net.ipv4.conf.7f9abb486611f68.bc_forwarding = 0 net.ipv4.conf.7f9abb486611f68.forwarding = 1 net.ipv4.conf.7f9abb486611f68.mc_forwarding = 0 net.ipv4.conf.8cd86bfb8ea635f.bc_forwarding = 0 net.ipv4.conf.8cd86bfb8ea635f.forwarding = 1 net.ipv4.conf.8cd86bfb8ea635f.mc_forwarding = 0 net.ipv4.conf.8e87bd3f6ddc9f8.bc_forwarding = 0 net.ipv4.conf.8e87bd3f6ddc9f8.forwarding = 1 net.ipv4.conf.8e87bd3f6ddc9f8.mc_forwarding = 0 net.ipv4.conf.91079c8f5c1630f.bc_forwarding = 0 net.ipv4.conf.91079c8f5c1630f.forwarding = 1 net.ipv4.conf.91079c8f5c1630f.mc_forwarding = 0 net.ipv4.conf.92e754a12836f63.bc_forwarding = 0 net.ipv4.conf.92e754a12836f63.forwarding = 1 net.ipv4.conf.92e754a12836f63.mc_forwarding = 0 net.ipv4.conf.a5c01549a6070ab.bc_forwarding = 0 net.ipv4.conf.a5c01549a6070ab.forwarding = 1 net.ipv4.conf.a5c01549a6070ab.mc_forwarding = 0 net.ipv4.conf.a621d1234f0f25a.bc_forwarding = 0 net.ipv4.conf.a621d1234f0f25a.forwarding = 1 net.ipv4.conf.a621d1234f0f25a.mc_forwarding = 0 net.ipv4.conf.all.bc_forwarding = 0 net.ipv4.conf.all.forwarding = 1 net.ipv4.conf.all.mc_forwarding = 0 net.ipv4.conf.br-ex.bc_forwarding = 0 net.ipv4.conf.br-ex.forwarding = 1 net.ipv4.conf.br-ex.mc_forwarding = 0 net.ipv4.conf.br-int.bc_forwarding = 0 net.ipv4.conf.br-int.forwarding = 1 net.ipv4.conf.br-int.mc_forwarding = 0 net.ipv4.conf.c3f3da187245cf6.bc_forwarding = 0 net.ipv4.conf.c3f3da187245cf6.forwarding = 1 net.ipv4.conf.c3f3da187245cf6.mc_forwarding = 0 net.ipv4.conf.c7e518fff8ff973.bc_forwarding = 0 net.ipv4.conf.c7e518fff8ff973.forwarding = 1 net.ipv4.conf.c7e518fff8ff973.mc_forwarding = 0 net.ipv4.conf.d17c6fb6d3dd021.bc_forwarding = 0 net.ipv4.conf.d17c6fb6d3dd021.forwarding = 1 net.ipv4.conf.d17c6fb6d3dd021.mc_forwarding = 0 net.ipv4.conf.default.bc_forwarding = 0 net.ipv4.conf.default.forwarding = 1 net.ipv4.conf.default.mc_forwarding = 0 net.ipv4.conf.eno8303.bc_forwarding = 0 net.ipv4.conf.eno8303.forwarding = 1 net.ipv4.conf.eno8303.mc_forwarding = 0 net.ipv4.conf.eno8403.bc_forwarding = 0 net.ipv4.conf.eno8403.forwarding = 1 net.ipv4.conf.eno8403.mc_forwarding = 0 
net.ipv4.conf.ens1f0.bc_forwarding = 0 net.ipv4.conf.ens1f0.forwarding = 1 net.ipv4.conf.ens1f0.mc_forwarding = 0 net.ipv4.conf.ens1f0/3516.bc_forwarding = 0 net.ipv4.conf.ens1f0/3516.forwarding = 1 net.ipv4.conf.ens1f0/3516.mc_forwarding = 0 net.ipv4.conf.ens1f0/3517.bc_forwarding = 0 net.ipv4.conf.ens1f0/3517.forwarding = 1 net.ipv4.conf.ens1f0/3517.mc_forwarding = 0 net.ipv4.conf.ens1f0/3518.bc_forwarding = 0 net.ipv4.conf.ens1f0/3518.forwarding = 1 net.ipv4.conf.ens1f0/3518.mc_forwarding = 0 net.ipv4.conf.ens1f1.bc_forwarding = 0 net.ipv4.conf.ens1f1.forwarding = 1 net.ipv4.conf.ens1f1.mc_forwarding = 0 net.ipv4.conf.ens3f0.bc_forwarding = 0 net.ipv4.conf.ens3f0.forwarding = 1 net.ipv4.conf.ens3f0.mc_forwarding = 0 net.ipv4.conf.ens3f1.bc_forwarding = 0 net.ipv4.conf.ens3f1.forwarding = 1 net.ipv4.conf.ens3f1.mc_forwarding = 0 net.ipv4.conf.fcb6e9468a65d70.bc_forwarding = 0 net.ipv4.conf.fcb6e9468a65d70.forwarding = 1 net.ipv4.conf.fcb6e9468a65d70.mc_forwarding = 0 net.ipv4.conf.fcd96084b7f5a9a.bc_forwarding = 0 net.ipv4.conf.fcd96084b7f5a9a.forwarding = 1 net.ipv4.conf.fcd96084b7f5a9a.mc_forwarding = 0 net.ipv4.conf.genev_sys_6081.bc_forwarding = 0 net.ipv4.conf.genev_sys_6081.forwarding = 1 net.ipv4.conf.genev_sys_6081.mc_forwarding = 0 net.ipv4.conf.lo.bc_forwarding = 0 net.ipv4.conf.lo.forwarding = 1 net.ipv4.conf.lo.mc_forwarding = 0 net.ipv4.conf.ovn-k8s-mp0.bc_forwarding = 0 net.ipv4.conf.ovn-k8s-mp0.forwarding = 1 net.ipv4.conf.ovn-k8s-mp0.mc_forwarding = 0 net.ipv4.conf.ovs-system.bc_forwarding = 0 net.ipv4.conf.ovs-system.forwarding = 1 net.ipv4.conf.ovs-system.mc_forwarding = 0 net.ipv4.ip_forward = 1 net.ipv4.ip_forward_update_priority = 1 net.ipv4.ip_forward_use_pmtu = 0 net.ipv6.conf.0eca9d9e7fd3231.forwarding = 1 net.ipv6.conf.0eca9d9e7fd3231.mc_forwarding = 0 net.ipv6.conf.21a32cf76c3bcdf.forwarding = 1 net.ipv6.conf.21a32cf76c3bcdf.mc_forwarding = 0 net.ipv6.conf.22f9bca61beeaba.forwarding = 1 net.ipv6.conf.22f9bca61beeaba.mc_forwarding = 0 net.ipv6.conf.2ee438a7201c1f7.forwarding = 1 net.ipv6.conf.2ee438a7201c1f7.mc_forwarding = 0 net.ipv6.conf.3560ce219f7b591.forwarding = 1 net.ipv6.conf.3560ce219f7b591.mc_forwarding = 0 net.ipv6.conf.507c81eb9944c2e.forwarding = 1 net.ipv6.conf.507c81eb9944c2e.mc_forwarding = 0 net.ipv6.conf.6278633ca74482f.forwarding = 1 net.ipv6.conf.6278633ca74482f.mc_forwarding = 0 net.ipv6.conf.68b572ce18f3b82.forwarding = 1 net.ipv6.conf.68b572ce18f3b82.mc_forwarding = 0 net.ipv6.conf.7291c80dd47a6f3.forwarding = 1 net.ipv6.conf.7291c80dd47a6f3.mc_forwarding = 0 net.ipv6.conf.76abdac44c6aee7.forwarding = 1 net.ipv6.conf.76abdac44c6aee7.mc_forwarding = 0 net.ipv6.conf.7f9abb486611f68.forwarding = 1 net.ipv6.conf.7f9abb486611f68.mc_forwarding = 0 net.ipv6.conf.8cd86bfb8ea635f.forwarding = 1 net.ipv6.conf.8cd86bfb8ea635f.mc_forwarding = 0 net.ipv6.conf.8e87bd3f6ddc9f8.forwarding = 1 net.ipv6.conf.8e87bd3f6ddc9f8.mc_forwarding = 0 net.ipv6.conf.91079c8f5c1630f.forwarding = 1 net.ipv6.conf.91079c8f5c1630f.mc_forwarding = 0 net.ipv6.conf.92e754a12836f63.forwarding = 1 net.ipv6.conf.92e754a12836f63.mc_forwarding = 0 net.ipv6.conf.a5c01549a6070ab.forwarding = 1 net.ipv6.conf.a5c01549a6070ab.mc_forwarding = 0 net.ipv6.conf.a621d1234f0f25a.forwarding = 1 net.ipv6.conf.a621d1234f0f25a.mc_forwarding = 0 net.ipv6.conf.all.forwarding = 1 net.ipv6.conf.all.mc_forwarding = 0 net.ipv6.conf.br-ex.forwarding = 1 net.ipv6.conf.br-ex.mc_forwarding = 0 net.ipv6.conf.br-int.forwarding = 1 net.ipv6.conf.br-int.mc_forwarding = 0 
net.ipv6.conf.c3f3da187245cf6.forwarding = 1 net.ipv6.conf.c3f3da187245cf6.mc_forwarding = 0 net.ipv6.conf.c7e518fff8ff973.forwarding = 1 net.ipv6.conf.c7e518fff8ff973.mc_forwarding = 0 net.ipv6.conf.d17c6fb6d3dd021.forwarding = 1 net.ipv6.conf.d17c6fb6d3dd021.mc_forwarding = 0 net.ipv6.conf.default.forwarding = 1 net.ipv6.conf.default.mc_forwarding = 0 net.ipv6.conf.eno8303.forwarding = 1 net.ipv6.conf.eno8303.mc_forwarding = 0 net.ipv6.conf.eno8403.forwarding = 1 net.ipv6.conf.eno8403.mc_forwarding = 0 net.ipv6.conf.ens1f0.forwarding = 1 net.ipv6.conf.ens1f0.mc_forwarding = 0 net.ipv6.conf.ens1f0/3516.forwarding = 0 net.ipv6.conf.ens1f0/3516.mc_forwarding = 0 net.ipv6.conf.ens1f0/3517.forwarding = 0 net.ipv6.conf.ens1f0/3517.mc_forwarding = 0 net.ipv6.conf.ens1f0/3518.forwarding = 0 net.ipv6.conf.ens1f0/3518.mc_forwarding = 0 net.ipv6.conf.ens1f1.forwarding = 1 net.ipv6.conf.ens1f1.mc_forwarding = 0 net.ipv6.conf.ens3f0.forwarding = 1 net.ipv6.conf.ens3f0.mc_forwarding = 0 net.ipv6.conf.ens3f1.forwarding = 1 net.ipv6.conf.ens3f1.mc_forwarding = 0 net.ipv6.conf.fcb6e9468a65d70.forwarding = 1 net.ipv6.conf.fcb6e9468a65d70.mc_forwarding = 0 net.ipv6.conf.fcd96084b7f5a9a.forwarding = 1 net.ipv6.conf.fcd96084b7f5a9a.mc_forwarding = 0 net.ipv6.conf.genev_sys_6081.forwarding = 1 net.ipv6.conf.genev_sys_6081.mc_forwarding = 0 net.ipv6.conf.lo.forwarding = 1 net.ipv6.conf.lo.mc_forwarding = 0 net.ipv6.conf.ovn-k8s-mp0.forwarding = 1 net.ipv6.conf.ovn-k8s-mp0.mc_forwarding = 0 net.ipv6.conf.ovs-system.forwarding = 1 net.ipv6.conf.ovs-system.mc_forwarding = 0
It's logical that this is happening, because nowhere in the code is there a mechanism to tune the global sysctl back to 0 when the mode is switched from `Global` to `Restricted`. There's also no mechanism to sequentially reboot the nodes so that they'd reboot back to their defaults (= sysctl ip forward off).
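For reference, the global sysctls that remain enabled after the switch can be checked with (a sketch):
sysctl net.ipv4.ip_forward net.ipv6.conf.all.forwarding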
Description of problem:
The package that we use for Power VS has recently been revealed to be unmaintained. We should remove it in favor of maintained solutions.
Version-Release number of selected component (if applicable):
4.13.0 onward
How reproducible:
It's always used
Steps to Reproduce:
1. Deploy with IPI on Power VS 2. Use bluemix-go 3.
Actual results:
bluemix-go is used
Expected results:
bluemix-go should be avoided
Additional info:
This is a clone of issue OCPBUGS-35504. The following is the description of the original issue:
—
Description of problem:
The BYO Public IPv4 feature[1] for AWS added on Terraform version[2] was merged on capi upstream/CAPA[3] after branch cut. The installer PR supporting CAPA provisioning BYO IPv4 was also merged[4] in the active branch (4.17). The feature is exercised by CI tests[5][6], the step[7] is running by default on CI runs to consume from existing CI IPv4 Pool when using terraform version. [1] https://issues.redhat.com/browse/OCPSTRAT-1154 [2] https://issues.redhat.com/browse/SPLAT-1432 https://github.com/openshift/installer/pull/7983 [3] https://github.com/kubernetes-sigs/cluster-api-provider-aws/pull/4905 [4] https://github.com/openshift/installer/pull/7983 [5] https://github.com/openshift/release/pull/48467 [6] https://github.com/openshift/release/pull/50653 [7] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/openshift_installer/8592/pull-ci-openshift-installer-master-e2e-aws-ovn/1801525554881499136/artifacts/e2e-aws-ovn/ipi-conf-aws-byo-ipv4-pool-public/build-log.txt
Version-Release number of selected component (if applicable):
4.16
How reproducible:
always
Steps to Reproduce:
1. create a cluster with setting platform.aws.publicIpv4Pool on install-config.yaml. 2. create a cluster with CAPA on 4.16
Actual results:
the field will be ignored
Expected results:
installer provision resources claiming public IPv4 IPs from custom pools provided by AWS.
Additional info:
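For reference, a hedged install-config.yaml fragment for the BYO public IPv4 pool feature (the pool ID and region are placeholders):
platform:
  aws:
    region: us-east-1
    publicIpv4Pool: ipv4pool-ec2-0123456789abcdef0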
Description of problem:
Automate E2E tests of Dynamic OVS Pinning. This bug is created for merging
https://github.com/openshift/cluster-node-tuning-operator/pull/746
Version-Release number of selected component (if applicable):
4.15.0
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
This is a clone of issue OCPBUGS-33651. The following is the description of the original issue:
—
Description of problem:
oc command cannot be used with RHEL 8 based bastion
Version-Release number of selected component (if applicable):
4.16.0-rc.1
How reproducible:
Very
Steps to Reproduce:
1. Have a bastion for z/VM installation at Red Hat Enterprise Linux release 8.9 (Ootpa) 2. Download and install the 4.16.0-rc.1 client on the bastion 3. Attempt to use the oc command
Actual results:
oc get nodes oc: /lib64/libc.so.6: version `GLIBC_2.33' not found (required by oc) oc: /lib64/libc.so.6: version `GLIBC_2.34' not found (required by oc) oc: /lib64/libc.so.6: version `GLIBC_2.32' not found (required by oc)
Expected results:
oc command returns without error
Additional info:
This was introduced in 4.16.0-rc.1 - 4.16.0-rc.0 works fine
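To confirm the mismatch on the bastion, compare the host glibc with the versions the binary requires (a sketch; Red Hat also publishes a separate RHEL 8 build of the client for recent releases, if available for your version):
ldd --version | head -1
rpm -q glibc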
This is a clone of issue OCPBUGS-41941. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-41936. The following is the description of the original issue:
—
Description of problem:
IBM Cloud CCM was reconfigured to use loopback as the bind address in 4.16. However, the liveness probe was not configured to use loopback too, so the CCM constantly fails the liveness probe and restarts continuously.
Version-Release number of selected component (if applicable):
4.17
How reproducible:
100%
Steps to Reproduce:
1. Create an IPI cluster on IBM Cloud 2. Watch the IBM Cloud CCM pod; restarts increase every 5 minutes (liveness probe timeout)
Actual results:
# oc --kubeconfig cluster-deploys/eu-de-4.17-rc2-3/auth/kubeconfig get po -n openshift-cloud-controller-manager NAME READY STATUS RESTARTS AGE ibm-cloud-controller-manager-58f7747d75-j82z8 0/1 CrashLoopBackOff 262 (39s ago) 23h ibm-cloud-controller-manager-58f7747d75-l7mpk 0/1 CrashLoopBackOff 261 (2m30s ago) 23h Normal Killing 34m (x2 over 40m) kubelet Container cloud-controller-manager failed liveness probe, will be restarted Normal Pulled 34m (x2 over 40m) kubelet Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:5ac9fb24a0e051aba6b16a1f9b4b3f9d2dd98f33554844953dd4d1e504fb301e" already present on machine Normal Created 34m (x3 over 45m) kubelet Created container cloud-controller-manager Normal Started 34m (x3 over 45m) kubelet Started container cloud-controller-manager Warning Unhealthy 29m (x8 over 40m) kubelet Liveness probe failed: Get "https://10.242.129.4:10258/healthz": dial tcp 10.242.129.4:10258: connect: connection refused Warning ProbeError 3m4s (x22 over 40m) kubelet Liveness probe error: Get "https://10.242.129.4:10258/healthz": dial tcp 10.242.129.4:10258: connect: connection refused body:
Expected results:
CCM runs continuously, as it does on 4.15 # oc --kubeconfig cluster-deploys/eu-de-4.15.10-1/auth/kubeconfig get po -n openshift-cloud-controller-manager NAME READY STATUS RESTARTS AGE ibm-cloud-controller-manager-66d4779cb8-gv8d4 1/1 Running 0 63m ibm-cloud-controller-manager-66d4779cb8-pxdrs 1/1 Running 0 63m
Additional info:
IBM Cloud have a PR open to fix the liveness probe. https://github.com/openshift/cluster-cloud-controller-manager-operator/pull/360
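For reference, a hedged sketch of a liveness probe aligned with the loopback bind address (port and scheme taken from the failure message; this is not necessarily the exact change in the linked PR):
livenessProbe:
  httpGet:
    host: 127.0.0.1
    path: /healthz
    port: 10258
    scheme: HTTPS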
To facilitate testing manifest generation, extract OpenStack API calls from the function body.
Please review the following PR: https://github.com/openshift/vmware-vsphere-csi-driver-operator/pull/197
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
We would like to include the CEL IP and CIDR validations in 4.16. They have been merged upstream and can be backported into OpenShift to improve our validation downstream. Upstream PR: https://github.com/kubernetes/kubernetes/pull/121912
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
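A hedged sketch of how the new CEL functions could be used in a CRD validation rule once available (the field names and schema fragment are illustrative):
x-kubernetes-validations:
  - rule: "isIP(self.address)"
    message: "address must be a valid IP address"
  - rule: "isCIDR(self.range)"
    message: "range must be a valid CIDR"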
Please review the following PR: https://github.com/openshift/sdn/pull/599
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
A regression was identified creating LoadBalancer services in ARO in new 4.14 clusters (handled for new installations in OCPBUGS-24191). The same regression has also been confirmed in ARO clusters upgraded to 4.14.
Version-Release number of selected component (if applicable):
4.14.z
How reproducible:
On any ARO cluster upgraded to 4.14.z
Steps to Reproduce:
1. Install an ARO cluster 2. Upgrade to 4.14 from fast channel 3. oc create svc loadbalancer test-lb -n default --tcp 80:8080
Actual results:
# External-IP stuck in Pending $ oc get svc test-lb -n default NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE test-lb LoadBalancer 172.30.104.200 <pending> 80:30062/TCP 15m # Errors in cloud-controller-manager being unable to map VM to nodes $ oc logs -l infrastructure.openshift.io/cloud-controller-manager=Azure -n openshift-cloud-controller-manager I1215 19:34:51.843715 1 azure_loadbalancer.go:1533] reconcileLoadBalancer for service(default/test-lb) - wantLb(true): started I1215 19:34:51.844474 1 event.go:307] "Event occurred" object="default/test-lb" fieldPath="" kind="Service" apiVersion="v1" type="Normal" reason="EnsuringLoadBalancer" message="Ensuring load balancer" I1215 19:34:52.253569 1 azure_loadbalancer_repo.go:73] LoadBalancerClient.List(aro-r5iks3dh) success I1215 19:34:52.253632 1 azure_loadbalancer.go:1557] reconcileLoadBalancer for service(default/test-lb): lb(aro-r5iks3dh/mabad-test-74km6) wantLb(true) resolved load balancer name I1215 19:34:52.528579 1 azure_vmssflex_cache.go:162] Could not find node () in the existing cache. Forcely freshing the cache to check again... E1215 19:34:52.714678 1 azure_vmssflex.go:379] fs.GetNodeNameByIPConfigurationID(/subscriptions/fe16a035-e540-4ab7-80d9-373fa9a3d6ae/resourceGroups/aro-r5iks3dh/providers/Microsoft.Network/networkInterfaces/mabad-test-74km6-master0-nic/ipConfigurations/pipConfig) failed. Error: failed to map VM Name to NodeName: VM Name mabad-test-74km6-master-0 E1215 19:34:52.714888 1 azure_loadbalancer.go:126] reconcileLoadBalancer(default/test-lb) failed: failed to map VM Name to NodeName: VM Name mabad-test-74km6-master-0 I1215 19:34:52.714956 1 azure_metrics.go:115] "Observed Request Latency" latency_seconds=0.871261893 request="services_ensure_loadbalancer" resource_group="aro-r5iks3dh" subscription_id="fe16a035-e540-4ab7-80d9-373fa9a3d6ae" source="default/test-lb" result_code="failed_ensure_loadbalancer" E1215 19:34:52.715005 1 controller.go:291] error processing service default/test-lb (will retry): failed to ensure load balancer: failed to map VM Name to NodeName: VM Name mabad-test-74km6-master-0
Expected results:
# The LoadBalancer gets an External-IP assigned $ oc get svc test-lb -n default NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE test-lb LoadBalancer 172.30.193.159 20.242.180.199 80:31475/TCP 14s
Additional info:
In cloud-provider-config cm in openshift-config namespace, vmType="" When vmType gets changed to "standard" explicitly, the provisioning of the LoadBalancer completes and an ExternalIP gets assigned without errors.
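A hedged way to inspect the current vmType in the cloud provider config (assuming the standard Azure layout where the JSON lives under the "config" key):
oc -n openshift-config get configmap cloud-provider-config -o jsonpath='{.data.config}' | jq -r '.vmType'
# workaround observed above: set vmType explicitly to "standard" by editing the same key
oc -n openshift-config edit configmap cloud-provider-config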
This is a clone of issue OCPBUGS-42362. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-42106. The following is the description of the original issue:
—
Description of problem:
Test Platform has detected a large increase in the amount of time spent waiting for pull secrets to be initialized. Monitoring the audit log, we can see nearly continuous updates to the SA pull secrets in the cluster (~2 per minute for every SA pull secret in the cluster). Controller manager is filled with entries like: - "Internal registry pull secret auth data does not contain the correct number of entries" ns="ci-op-tpd3xnbx" name="deployer-dockercfg-p9j54" expected=5 actual=4" - "Observed image registry urls" urls=["172.30.228.83:5000","image-registry.openshift-image-registry.svc.cluster.local:5000","image-registry.openshift-image-registry.svc:5000","registry.build01.ci.openshift.org","registry.build01.ci.openshift.org" In this "Observed image registry urls" log line, notice the duplicate entries for "registry.build01.ci.openshift.org" . We are not sure what is causing this but it leads to duplicate entry, but when actualized in a pull secret map, the double entry is reduced to one. So the controller-manager finds the cardinality mismatch on the next check. The duplication is evident in OpenShiftControllerManager/cluster: dockerPullSecret: internalRegistryHostname: image-registry.openshift-image-registry.svc:5000 registryURLs: - registry.build01.ci.openshift.org - registry.build01.ci.openshift.org But there is only one hostname in config.imageregistry.operator.openshift.io/cluster: routes: - hostname: registry.build01.ci.openshift.org name: public-routes secretName: public-route-tls
Version-Release number of selected component (if applicable):
4.17.0-rc.3
How reproducible:
Constant on build01 but not on other build farms
Steps to Reproduce:
1. Something ends up creating duplicate entries in the observed configuration of the openshift-controller-manager. 2. 3.
Actual results:
- Approximately 400K secret patches an hour on build01 vs ~40K on other build farms. Initialization times have increased by two orders of magnitude in new ci-operator namespaces. - The openshift-controller-manager is hot looping and experiencing client throttling.
Expected results:
1. Initialization of pull secrets in a namespace should take < 1 seconds. On build01, it can take over 1.5 minutes. 2. openshift-controller-manager should not possess duplicate entries. 3. If duplicate entries are a configuration error, openshift-controller-manager should de-dupe the entries. 4. There should be alerting when the openshift-controller-manager experiences client-side throttling / pathological behavior.
Additional info:
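A hedged check for the duplicated observed registry URLs described above (resource and field names are taken from the report):
oc get openshiftcontrollermanager.operator.openshift.io cluster -o jsonpath='{.spec.observedConfig.dockerPullSecret.registryURLs}{"\n"}'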
Please review the following PR: https://github.com/openshift/cluster-api-provider-aws/pull/491
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
A table in a dashboard relies on the order of the metric labels to merge results
Create a dashboard with a table including this query:
label_replace(sort_desc(sum(sum_over_time(ALERTS{alertstate="firing"}[24h])) by ( alertstate, alertname)), "aaa", "$1", "alertstate", "(.+)")
A single row will be displayed as the query is simulating that the first label `aaa` has a single value.
Expected result:
The table should not rely on a single metric label to merge results but consider all the labels so the expected rows are displayed.
I was identifying what remains with:
cat e2e-events_20231204-183144.json | jq '.items[] | select(has("tempSource") | not)'
I think I've cleared all the difficult ones, hopefully these are just simple stragglers.
The agent-based installer and assisted-installer create a Deployment named assisted-installer-controller in the assisted-installer namespace. This deployment is responsible for running the assisted-installer-controller to finalise the installation, mainly by updating the status of the Nodes in the assisted-service API. It's also required to be able to install platform:vsphere without credentials in 4.13 and above.
We want the logs for this pod to be included in the must-gather file, so that we can easily debug any installation issues caused by this process. Currently it is not.
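Until must-gather collects this, the logs can be gathered manually with something like the following (a sketch using the names given above):
oc -n assisted-installer logs deployment/assisted-installer-controller --all-containers --timestamps > assisted-installer-controller.log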
kdump crash logs are not written to the SSH remote target when OVN is configured.
See https://issues.redhat.com/browse/OCPBUGS-28239
Please review the following PR: https://github.com/openshift/cluster-kube-scheduler-operator/pull/515
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
Issue: profiles are degraded [1] even after being applied, due to the error below [2]:
[1]
$oc get profile -A NAMESPACE NAME TUNED APPLIED DEGRADED AGE openshift-cluster-node-tuning-operator master0 rdpmc-patch-master True True 5d openshift-cluster-node-tuning-operator master1 rdpmc-patch-master True True 5d openshift-cluster-node-tuning-operator master2 rdpmc-patch-master True True 5d openshift-cluster-node-tuning-operator worker0 rdpmc-patch-worker True True 5d openshift-cluster-node-tuning-operator worker1 rdpmc-patch-worker True True 5d openshift-cluster-node-tuning-operator worker10 rdpmc-patch-worker True True 5d openshift-cluster-node-tuning-operator worker11 rdpmc-patch-worker True True 5d openshift-cluster-node-tuning-operator worker12 rdpmc-patch-worker True True 5d openshift-cluster-node-tuning-operator worker13 rdpmc-patch-worker True True 5d openshift-cluster-node-tuning-operator worker14 rdpmc-patch-worker True True 5d openshift-cluster-node-tuning-operator worker15 rdpmc-patch-worker True True 5d openshift-cluster-node-tuning-operator worker2 rdpmc-patch-worker True True 5d openshift-cluster-node-tuning-operator worker3 rdpmc-patch-worker True True 5d openshift-cluster-node-tuning-operator worker4 rdpmc-patch-worker True True 5d openshift-cluster-node-tuning-operator worker5 rdpmc-patch-worker True True 5d openshift-cluster-node-tuning-operator worker6 rdpmc-patch-worker True True 5d openshift-cluster-node-tuning-operator worker7 rdpmc-patch-worker True True 5d openshift-cluster-node-tuning-operator worker8 rdpmc-patch-worker True True 5d openshift-cluster-node-tuning-operator worker9 rdpmc-patch-worker True True 5d
[2]
lastTransitionTime: "2023-12-05T22:43:12Z" message: TuneD daemon issued one or more sysctl override message(s) during profile application. Use reapply_sysctl=true or remove conflicting sysctl net.core.rps_default_mask reason: TunedSysctlOverride status: "True"
If we see in rdpmc-patch-master tuned:
NAMESPACE NAME TUNED APPLIED DEGRADED AGE openshift-cluster-node-tuning-operator master0 rdpmc-patch-master True True 5d openshift-cluster-node-tuning-operator master1 rdpmc-patch-master True True 5d openshift-cluster-node-tuning-operator master2 rdpmc-patch-master True True 5d
We are configuring below in rdpmc-patch-master tuned:
$ oc get tuned rdpmc-patch-master -n openshift-cluster-node-tuning-operator -oyaml |less
spec:
profile:
- data: |
[main]
include=performance-patch-master
[sysfs]
/sys/devices/cpu/rdpmc = 2
name: rdpmc-patch-master
recommend:
Below is performance-patch-master, which is included in the above Tuned profile:
spec: profile: - data: | [main] summary=Custom tuned profile to adjust performance include=openshift-node-performance-master-profile [bootloader] cmdline_removeKernelArgs=-nohz_full=${isolated_cores}
Below (which appears in the error) is in openshift-node-performance-master-profile, which is included in the above Tuned profile:
net.core.rps_default_mask=${not_isolated_cpumask}
RHEL BUg has been raised for the same https://issues.redhat.com/browse/RHEL-18972
Version-Release number of selected component (if applicable):
4.14
Description of problem:
The current api version used by the registry operator does not include the recently added "ChunkSizeMiB" feature gate. We need to bump the openshift/api to latest so that this feature gate becomes available for use.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of the problem:
In .../openshift-versions?only_latest=true, the multi-arch release images are not returned as well.
How reproducible:
Always
Steps to reproduce:
1. Run master assisted-service
2. curl ".../openshift-versions?only_latest=true"
Actual results:
{ "4.10.67": { "cpu_architectures": [ "x86_64" ], "display_name": "4.10.67", "support_level": "production" }, "4.11.58": { "cpu_architectures": [ "x86_64" ], "display_name": "4.11.58", "support_level": "production" }, "4.12.53": { "cpu_architectures": [ "x86_64" ], "display_name": "4.12.53", "support_level": "production" }, "4.13.38": { "cpu_architectures": [ "x86_64" ], "display_name": "4.13.38", "support_level": "production" }, "4.14.18": { "cpu_architectures": [ "x86_64" ], "display_name": "4.14.18", "support_level": "production" }, "4.15.3": { "cpu_architectures": [ "x86_64" ], "default": true, "display_name": "4.15.3", "support_level": "production" }, "4.16.0-ec.4": { "cpu_architectures": [ "x86_64" ], "display_name": "4.16.0-ec.4", "support_level": "beta" }, "4.9.59": { "cpu_architectures": [ "x86_64" ], "display_name": "4.9.59", "support_level": "production" } }
Expected results:
{ "4.10.67": { "cpu_architectures": [ "x86_64", "arm64" ], "display_name": "4.10.67", "support_level": "production" }, "4.11.0-multi": { "cpu_architectures": [ "x86_64", "arm64", "ppc64le", "s390x" ], "display_name": "4.11.0-multi", "support_level": "production" }, "4.11.58": { "cpu_architectures": [ "x86_64", "arm64" ], "display_name": "4.11.58", "support_level": "production" }, "4.12.53": { "cpu_architectures": [ "x86_64", "arm64" ], "display_name": "4.12.53", "support_level": "production" }, "4.12.53-multi": { "cpu_architectures": [ "x86_64", "arm64", "ppc64le", "s390x" ], "display_name": "4.12.53-multi", "support_level": "production" }, "4.13.38": { "cpu_architectures": [ "x86_64", "arm64" ], "display_name": "4.13.38", "support_level": "production" }, "4.13.38-multi": { "cpu_architectures": [ "x86_64", "arm64", "ppc64le", "s390x" ], "display_name": "4.13.38-multi", "support_level": "production" }, "4.14.18": { "cpu_architectures": [ "x86_64", "arm64" ], "display_name": "4.14.18", "support_level": "production" }, "4.14.18-multi": { "cpu_architectures": [ "x86_64", "arm64", "ppc64le", "s390x" ], "display_name": "4.14.18-multi", "support_level": "production" }, "4.15.3": { "cpu_architectures": [ "x86_64", "arm64" ], "default": true, "display_name": "4.15.3", "support_level": "production" }, "4.15.3-multi": { "cpu_architectures": [ "x86_64", "arm64", "ppc64le", "s390x" ], "display_name": "4.15.3-multi", "support_level": "production" }, "4.16.0-ec.4": { "cpu_architectures": [ "x86_64", "arm64" ], "display_name": "4.16.0-ec.4", "support_level": "beta" }, "4.16.0-ec.4-multi": { "cpu_architectures": [ "x86_64", "arm64", "ppc64le", "s390x" ], "display_name": "4.16.0-ec.4-multi", "support_level": "beta" }, "4.9.59": { "cpu_architectures": [ "x86_64" ], "display_name": "4.9.59", "support_level": "production" } }
Seeing CI jobs with
> level=error msg=ReadyIngressNodesAvailable: Authentication requires functional ingress which requires at least one schedulable and ready node. Got 0 worker nodes, 3 master nodes, 0 custom target nodes (none are schedulable or ready for ingress pods).
search shows 65 hits in the last 7 days
This is a clone of issue OCPBUGS-35293. The following is the description of the original issue:
—
Description of problem:
4.16 installs fail for ROSA STS installations: time="2024-06-11T14:05:48Z" level=debug msg="\t[failed to apply security groups to load balancer \"jamesh-sts-52g29-int\": AccessDenied: User: arn:aws:sts::476950216884:assumed-role/ManagedOpenShift-Installer-Role/1718114695748673685 is not authorized to perform: elasticloadbalancing:SetSecurityGroups on resource: arn:aws:elasticloadbalancing:us-east-1:476950216884:loadbalancer/net/jamesh-sts-52g29-int/bf7ef748daa739ce because no identity-based policy allows the elasticloadbalancing:SetSecurityGroups action"
Version-Release number of selected component (if applicable):
4.16+
How reproducible:
Every time
Steps to Reproduce:
1. Create an installer policy with the permissions listed in the installer [here|https://github.com/openshift/installer/blob/master/pkg/asset/installconfig/aws/permissions.go] 2. Run a install in AWS IPI
Actual results:
The installer fails to install a cluster in AWS. The installer log shows AccessDenied messages for the IAM action elasticloadbalancing:SetSecurityGroups. The installer shows the error message "failed to apply security groups to load balancer".
Expected results:
Install completes successfully
Additional info:
Managed OpenShift (ROSA) installs STS clusters with [this|https://github.com/openshift/managed-cluster-config/blob/master/resources/sts/4.16/sts_installer_permission_policy.json] permission policy for the installer which should be what is required from the installer [policy|https://github.com/openshift/installer/blob/master/pkg/asset/installconfig/aws/permissions.go] plus permissions needed for OCM to do pre install validation.
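For reference, a minimal illustrative IAM policy statement that grants the missing action (scope the Resource down as appropriate for your account):
{
  "Effect": "Allow",
  "Action": ["elasticloadbalancing:SetSecurityGroups"],
  "Resource": "*"
}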
Description of problem:
While upgrading a loaded 250-node ROSA cluster from 4.13.13 to 4.14.rc2, the cluster failed to upgrade and was stuck while the network operator was trying to upgrade.
Around 20 multus pods were in CrashLoopBackOff state with the log:
oc logs multus-4px8t 2023-10-10T00:54:34+00:00 [cnibincopy] Successfully copied files in /usr/src/multus-cni/rhel9/bin/ to /host/opt/cni/bin/upgrade_6dcb644a-4164-42a5-8f1e-4ae2c04dc315 2023-10-10T00:54:34+00:00 [cnibincopy] Successfully moved files in /host/opt/cni/bin/upgrade_6dcb644a-4164-42a5-8f1e-4ae2c04dc315 to /host/opt/cni/bin/ 2023-10-10T00:54:34Z [verbose] multus-daemon started 2023-10-10T00:54:34Z [verbose] Readiness Indicator file check 2023-10-10T00:55:19Z [error] have you checked that your default network is ready? still waiting for readinessindicatorfile @ /host/run/multus/cni/net.d/10-ovn-kubernetes.conf. pollimmediate error: timed out waiting for the condition
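A hedged way to check the readiness indicator file multus is waiting for on an affected node (the node name is a placeholder):
oc debug node/<node-name> -- chroot /host ls -l /run/multus/cni/net.d/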
Please review the following PR: https://github.com/openshift/azure-disk-csi-driver/pull/65
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
Operation cannot be fulfilled on networks.operator.openshift.io during OVN live migration
Version-Release number of selected component (if applicable):
How reproducible:
Not always
Steps to Reproduce:
1. Enable the egressfirewall, externalIP, multicast, multus, network-policy, and service-idle features. 2. Start migrating the cluster from SDN to OVN.
Actual results:
[weliang@weliang ~]$ oc delete validatingwebhookconfigurations.admissionregistration.k8s.io/sre-techpreviewnoupgrade-validation validatingwebhookconfiguration.admissionregistration.k8s.io "sre-techpreviewnoupgrade-validation" deleted [weliang@weliang ~]$ oc edit featuregate cluster featuregate.config.openshift.io/cluster edited [weliang@weliang ~]$ oc get node NAME STATUS ROLES AGE VERSION ip-10-0-20-154.ec2.internal Ready control-plane,master 86m v1.28.5+9605db4 ip-10-0-45-93.ec2.internal Ready worker 80m v1.28.5+9605db4 ip-10-0-49-245.ec2.internal Ready worker 74m v1.28.5+9605db4 ip-10-0-57-37.ec2.internal Ready infra,worker 60m v1.28.5+9605db4 ip-10-0-60-0.ec2.internal Ready infra,worker 60m v1.28.5+9605db4 ip-10-0-62-121.ec2.internal Ready control-plane,master 86m v1.28.5+9605db4 ip-10-0-62-56.ec2.internal Ready control-plane,master 86m v1.28.5+9605db4 [weliang@weliang ~]$ for f in $(oc get nodes -o jsonpath='{.items[*].metadata.name}') ; do oc debug node/"${f}" -- chroot /host cat /etc/kubernetes/kubelet.conf | grep NetworkLiveMigration ; done Starting pod/ip-10-0-20-154ec2internal-debug-9wvd8 ... To use host binaries, run `chroot /host`Removing debug pod ... "NetworkLiveMigration": true, Starting pod/ip-10-0-45-93ec2internal-debug-rwvls ... To use host binaries, run `chroot /host` "NetworkLiveMigration": true,Removing debug pod ... Starting pod/ip-10-0-49-245ec2internal-debug-rp9dt ... To use host binaries, run `chroot /host`Removing debug pod ... "NetworkLiveMigration": true, Starting pod/ip-10-0-57-37ec2internal-debug-q5thk ... To use host binaries, run `chroot /host`Removing debug pod ... "NetworkLiveMigration": true, Starting pod/ip-10-0-60-0ec2internal-debug-zp78h ... To use host binaries, run `chroot /host`Removing debug pod ... "NetworkLiveMigration": true, Starting pod/ip-10-0-62-121ec2internal-debug-42k2g ... To use host binaries, run `chroot /host`Removing debug pod ... "NetworkLiveMigration": true, Starting pod/ip-10-0-62-56ec2internal-debug-s99ls ... To use host binaries, run `chroot /host`Removing debug pod ... 
"NetworkLiveMigration": true, [weliang@weliang ~]$ oc patch Network.config.openshift.io cluster --type='merge' --patch '{"metadata":{"annotations":{"network.openshift.io/live-migration":""}},"spec":{"networkType":"OVNKubernetes"}}' network.config.openshift.io/cluster patched [weliang@weliang ~]$ [weliang@weliang ~]$ oc get co network NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE network 4.15.0-0.nightly-2024-01-06-062415 True False True 4h1m Internal error while updating operator configuration: could not apply (/, Kind=) /cluster, err: failed to apply / update (operator.openshift.io/v1, Kind=Network) /cluster: Operation cannot be fulfilled on networks.operator.openshift.io "cluster": the object has been modified; please apply your changes to the latest version and try again [weliang@weliang ~]$ oc get node NAME STATUS ROLES AGE VERSION ip-10-0-2-52.ec2.internal Ready worker 3h54m v1.28.5+9605db4 ip-10-0-26-16.ec2.internal Ready control-plane,master 4h2m v1.28.5+9605db4 ip-10-0-32-116.ec2.internal Ready worker 3h54m v1.28.5+9605db4 ip-10-0-32-67.ec2.internal Ready infra,worker 3h38m v1.28.5+9605db4 ip-10-0-35-11.ec2.internal Ready infra,worker 3h39m v1.28.5+9605db4 ip-10-0-39-125.ec2.internal Ready control-plane,master 4h2m v1.28.5+9605db4 ip-10-0-6-117.ec2.internal Ready control-plane,master 4h2m v1.28.5+9605db4 [weliang@weliang ~]$ oc get Network.operator.openshift.io/cluster -o json { "apiVersion": "operator.openshift.io/v1", "kind": "Network", "metadata": { "creationTimestamp": "2024-01-08T13:28:07Z", "generation": 417, "name": "cluster", "resourceVersion": "236888", "uid": "37fb36f0-c13c-476d-aea1-6ebc1c87abe8" }, "spec": { "clusterNetwork": [ { "cidr": "10.128.0.0/14", "hostPrefix": 23 } ], "defaultNetwork": { "openshiftSDNConfig": { "enableUnidling": true, "mode": "NetworkPolicy", "mtu": 8951, "vxlanPort": 4789 }, "ovnKubernetesConfig": { "egressIPConfig": {}, "gatewayConfig": { "ipv4": {}, "ipv6": {}, "routingViaHost": false }, "genevePort": 6081, "mtu": 8901, "policyAuditConfig": { "destination": "null", "maxFileSize": 50, "maxLogFiles": 5, "rateLimit": 20, "syslogFacility": "local0" } }, "type": "OVNKubernetes" }, "deployKubeProxy": false, "disableMultiNetwork": false, "disableNetworkDiagnostics": false, "kubeProxyConfig": { "bindAddress": "0.0.0.0" }, "logLevel": "Normal", "managementState": "Managed", "migration": { "mode": "Live", "networkType": "OVNKubernetes" }, "observedConfig": null, "operatorLogLevel": "Normal", "serviceNetwork": [ "172.30.0.0/16" ], "unsupportedConfigOverrides": null, "useMultiNetworkPolicy": false }, "status": { "conditions": [ { "lastTransitionTime": "2024-01-08T13:28:07Z", "status": "False", "type": "ManagementStateDegraded" }, { "lastTransitionTime": "2024-01-08T17:29:52Z", "status": "False", "type": "Degraded" }, { "lastTransitionTime": "2024-01-08T13:28:07Z", "status": "True", "type": "Upgradeable" }, { "lastTransitionTime": "2024-01-08T17:26:38Z", "status": "False", "type": "Progressing" }, { "lastTransitionTime": "2024-01-08T13:28:20Z", "status": "True", "type": "Available" } ], "readyReplicas": 0, "version": "4.15.0-0.nightly-2024-01-06-062415" } } [weliang@weliang ~]$
Expected results:
OVN live migration passes
Additional info:
must-gather: https://people.redhat.com/~weliang/must-gather1.tar.gz
This is a clone of issue OCPBUGS-43389. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-43157. The following is the description of the original issue:
—
Description of problem:
When running the `make fmt` target in the repository, the command can fail due to a version mismatch between the Go toolchain and the goimports dependency.
Version-Release number of selected component (if applicable):
4.16.z
How reproducible:
always
Steps to Reproduce:
1. Check out the release-4.16 branch 2. Run `make fmt`
Actual results:
INFO[2024-10-01T14:41:15Z] make fmt make[1]: Entering directory '/go/src/github.com/openshift/cluster-cloud-controller-manager-operator' hack/goimports.sh go: downloading golang.org/x/tools v0.25.0 go: golang.org/x/tools/cmd/goimports@latest: golang.org/x/tools@v0.25.0 requires go >= 1.22.0 (running go 1.21.11; GOTOOLCHAIN=local)
Expected results:
successful completion of `make fmt`
Additional info:
Our goimports.sh script references `goimports@latest`, which means this problem will most likely affect older branches as well; we will need to pin a specific version of the goimports package for those branches. Given that the CCCMO includes golangci-lint and uses it for a test, we should include goimports through golangci-lint, which will solve this problem without needing special versions of goimports. A minimal pinning sketch is shown below.
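As an interim measure for the older branches, the script could pin a goimports version that matches the branch's Go toolchain; a minimal sketch, where the exact version is an assumption chosen for illustration only:
```
#!/bin/bash
# Hypothetical hack/goimports.sh sketch: pin goimports instead of using @latest,
# so the tool stays compatible with the branch's Go toolchain.
set -euo pipefail

# Assumption: an x/tools release that still builds with the branch's Go version.
GOIMPORTS_VERSION="v0.20.0"

# Run the pinned goimports over the repository, excluding vendor/.
go run "golang.org/x/tools/cmd/goimports@${GOIMPORTS_VERSION}" -w \
  $(find . -name '*.go' -not -path './vendor/*')
```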
Description of problem:
When HyperShift runs a large number of hosted clusters (> 370), the management cluster etcd fills up and HyperShift begins to fail. One way to reduce the etcd size while improving its performance is to reduce the number of stored objects, such as config-maps. Currently, if a hosted cluster's NodePool needs to reference multiple MachineConfig objects, each of those MachineConfigs has to be in its own config-map (referenced in NodePool spec.config). To reduce the number of config-maps, HyperShift needs the ability to extract multiple MachineConfig objects from a single config-map. Currently, if multiple MachineConfig objects are placed into a config-map, only the first one is recognized by the NodePool controller and all others are ignored. A NodePool controller code fix is required to support multiple MachineConfig objects in ignition-config config-maps.
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. Create Hypershift-hosted cluster 2. Patch the cluster's default NodePool's spec.config to reference a single config-map that has multiple MachineConfig yamls inside it. 3. Obtain ignition data from the ignition server.
Actual results:
The ignition contains data from the first MachineConfig object inside the config-map, but the other MachineConfig objects are missing.
Expected results:
The ignition should contain all MachineConfig objects inside the config-map.
Additional info:
Example of NodePool:
```
apiVersion: hypershift.openshift.io/v1beta1
kind: NodePool
...
spec:
  arch: amd64
  clusterName: cg319sf10ghnddkvo8j0
  config:
  - name: ignition-config-98-ibm-machineconfig-cg319sf10ghnddkvo8j0
  ...
```
The ignition-config-98-ibm-machineconfig-cg319sf10ghnddkvo8j0 config-map has multiple MachineConfig yamls inside it separated by "---":
```
apiVersion: v1
data:
  config: |+
    ---
    apiVersion: machineconfiguration.openshift.io/v1
    kind: MachineConfig
    metadata:
      labels:
        machineconfiguration.openshift.io/role: worker
      name: 97-ibm-machineconfig-base
    spec:
      config:
        ignition:
          version: 2.2.0
        storage:
          files:
          - contents:
    ...
    ---
    apiVersion: machineconfiguration.openshift.io/v1
    kind: MachineConfig
    metadata:
      labels:
        machineconfiguration.openshift.io/role: worker
      name: 98-ibm-machineconfig-satellite
    spec:
      config:
        ignition:
          version: 2.2.0
        storage:
    ...
```
Currently only the first MachineConfig "97-ibm-machineconfig-base" is processed, the other one "98-ibm-machineconfig-satellite" is skipped.
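For step 2 of the reproduction, the NodePool reference can be patched to point at a single config-map; a minimal sketch that reuses the names from the example above, where the "clusters" namespace is an assumption:
```
# Point the NodePool at a single config-map that contains several MachineConfig documents (names illustrative).
oc patch nodepool cg319sf10ghnddkvo8j0 -n clusters --type merge \
  -p '{"spec":{"config":[{"name":"ignition-config-98-ibm-machineconfig-cg319sf10ghnddkvo8j0"}]}}'
```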
This is a clone of issue OCPBUGS-41580. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-39133. The following is the description of the original issue:
—
Description of problem:
Debugging https://issues.redhat.com/browse/OCPBUGS-36808 (the Metrics API failing some of the disruption checks) and taking https://prow.ci.openshift.org/view/gs/test-platform-results/logs/openshift-cluster-monitoring-operator-2439-ci-4.17-upgrade-from-stable-4.16-e2e-aws-ovn-upgrade/1824454734052855808 as a reproducer of the issue, I think the Kube-aggregator is behind the problem. According to the disruption checks which forward some relevant errors from the apiserver in the logs, looking at one of the new-connections check failures (from https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/openshift-cluster-monitoring-operator-2439-ci-4.17-upgrade-from-stable-4.16-e2e-aws-ovn-upgrade/1824454734052855808/artifacts/e2e-aws-ovn-upgrade-2/openshift-e2e-test/artifacts/junit/backend-disruption_20240816-155051.json) > "Aug 16 *16:43:17.672* - 2s E backend-disruption-name/metrics-api-new-connections connection/new disruption/openshift-tests reason/DisruptionBegan request-audit-id/c62b7d32-856f-49de-86f5-1daed55326b2 backend-disruption-name/metrics-api-new-connections connection/new disruption/openshift-tests stopped responding to GET requests over new connections: error running request: 503 Service Unavailable: error trying to reach service: dial tcp 10.128.2.31:10250: connect: connection refused" The "error trying to reach service" part comes from: https://github.com/kubernetes/kubernetes/blob/b3c725627b15bb69fca01b70848f3427aca4c3ef/staging/src/k8s.io/apimachinery/pkg/util/proxy/transport.go#L105, the apiserver failing to reach the metrics-server Pod, the problem is that the IP "10.128.2.31" corresponds to a Pod that was deleted some milliseconds before (as part of a node update/draining), as we can see in: > 2024-08-16T16:19:43.087Z|00195|binding|INFO|openshift-monitoring_metrics-server-7b9d8c5ddb-dtsmr: Claiming 0a:58:0a:80:02:1f 10.128.2.31 ... I0816 *16:43:17.650083* 2240 kubelet.go:2453] "SyncLoop DELETE" source="api" pods=["openshift-monitoring/metrics-server-7b9d8c5ddb-dtsmr"] ... The apiserver was using a stale IP to reach a Pod that no longer exists, even though a new Pod that had already replaced the other Pod (Metrics API backend runs on 2 Pods), some minutes before, was available. According to OVN, a fresher IP 10.131.0.12 of that Pod was already in the endpoints at that time: > I0816 16:40:24.711048 4651 lb_config.go:1018] Cluster endpoints for openshift-monitoring/metrics-server are: map[TCP/https:{10250 [10.128.2.31 10.131.0.12] []}] *I think, when "10.128.2.31" failed, the apiserver should have fallen back to "10.131.0.12", maybe it waits for some time/retries before doing so, or maybe it wasn't even aware of "10.131.0.12"* AFAIU, we have "--enable-aggregator-routing" set by default https://github.com/openshift/cluster-kube-apiserver-operator/blob/37df1b1f80d3be6036b9e31975ac42fcb21b6447/bindata/assets/config/defaultconfig.yaml#L101-L103 on the apiservers, so instead of forwarding to the metrics-server's service, apiserver directly reaches the Pods. For that it keeps track of the relevant services and endpoints https://github.com/kubernetes/kubernetes/blob/ad8a5f5994c0949b5da4240006d938e533834987/staging/src/k8s.io/kube-aggregator/pkg/apiserver/resolvers.go#L40 bad decisions may be made if the if the services and/or endpoints cache are stale. Looking at the metrics-server (the Metrics API backend) endpoints changes in the apiserver audit logs: > $ grep -hr Event . 
| grep "endpoints/metrics-server" | jq -c 'select( .verb | match("watch|update"))' | jq -r '[.requestReceivedTimestamp,.user.username,.verb] | @tsv' | sort 2024-08-16T15:39:57.575468Z system:serviceaccount:kube-system:endpoint-controller update 2024-08-16T15:40:02.005051Z system:serviceaccount:kube-system:endpoint-controller update 2024-08-16T15:40:35.085330Z system:serviceaccount:kube-system:endpoint-controller update 2024-08-16T15:40:35.128519Z system:serviceaccount:kube-system:endpoint-controller update 2024-08-16T16:19:41.148148Z system:serviceaccount:kube-system:endpoint-controller update 2024-08-16T16:19:47.797420Z system:serviceaccount:kube-system:endpoint-controller update 2024-08-16T16:20:23.051594Z system:serviceaccount:kube-system:endpoint-controller update 2024-08-16T16:20:23.100761Z system:serviceaccount:kube-system:endpoint-controller update 2024-08-16T16:20:23.938927Z system:serviceaccount:kube-system:endpoint-controller update 2024-08-16T16:21:01.699722Z system:serviceaccount:kube-system:endpoint-controller update 2024-08-16T16:39:00.328312Z system:serviceaccount:kube-system:endpoint-controller update ==> At around 16:39:XX the first Pod was rolled out 2024-08-16T16:39:07.260823Z system:serviceaccount:kube-system:endpoint-controller update 2024-08-16T16:39:41.124449Z system:serviceaccount:kube-system:endpoint-controller update 2024-08-16T16:43:23.701015Z system:serviceaccount:kube-system:endpoint-controller update ==> At around 16:43:23, the new Pod that replaced the second one was created 2024-08-16T16:43:28.639793Z system:serviceaccount:kube-system:endpoint-controller update 2024-08-16T16:43:47.108903Z system:serviceaccount:kube-system:endpoint-controller update We can see that just before the new-connections checks succeeded again at around "2024-08-16T16:43:23.", an UPDATE was received/treated which may have helped the apiserver sync its endpoints cache or/and chose a healthy Pod Also, no update was triggered when the second Pod was deleted at "16:43:17" which may explain the stale 10.128.2.31 endpoints entry on apiserver side. To summarize, I can see two problems here (maybe one is the consequence of the other): A Pod was deleted and an Endpoint pointing to it wasn't updated. Apparently the Endpoints controller had/has some sync issues https://github.com/kubernetes/kubernetes/issues/125638 The apiserver resolver had a endpoints cache with one stale and one fresh entry but it kept 4-5 times in a row trying to reach the stale entry OR The endpoints was updated "At around 16:39:XX the first Pod was rolled out, see above", but the apiserver resolver cache missed that and ended up with 2 stale entries in the cache, and had to wait until "At around 16:43:23, the new Pod that replaced the second one was created, see above" to sync and replace them with 2 fresh entries.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. See "Description of problem" 2. 3.
Actual results:
Expected results:
The kube-aggregator should detect stale APIService endpoints.
Additional info:
The kube-aggregator proxies requests to a stale Endpoints entry/Pod, which makes Metrics API requests fail spuriously.
Description of problem:
https://issues.redhat.com/browse/MGMT-15691 introduced code restructuring related to the external platform and oci via PR https://github.com/openshift/assisted-service/pull/5787. Assisted-service needs to be re-vendored in the installer in the 4.16 and 4.17 releases to make sure the assisted-service dependencies are consistent. The master branch (4.18) does not need this re-vendoring, as it was recently re-vendored via https://github.com/openshift/installer/pull/9058.
Version-Release number of selected component (if applicable):
4.17, 4.16
How reproducible:
Always
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
This is a clone of issue OCPBUGS-33615. The following is the description of the original issue:
—
Description of problem:
The coresPerSocket value set in install-config does not match the actual result. When setting controlPlane.platform.vsphere.cpus to 16 and controlPlane.platform.vsphere.coresPerSocket to 8, the actual result I checked was: "NumCPU": 16, "NumCoresPerSocket": 16. NumCoresPerSocket should match the setting in install-config instead of NumCPU. Checking the setting in VSphereMachine-openshift-cluster-api-guests-wwei1215a-42n48-master-0.yaml, the numcorespersocket is 0: numcpus: 16, numcorespersocket: 0
Version-Release number of selected component (if applicable):
4.16.0-0.nightly-2024-05-08-222442
How reproducible:
See description
Steps to Reproduce:
1. Set coresPerSocket for the control plane in install-config; cpus must be a multiple of coresPerSocket. 2. Install the cluster.
Actual results:
The NumCoresPerSocket is equal to NumCPU. In file VSphereMachine-openshift-cluster-api-guests-xxxx-xxxx-master-0.yaml, the numcorespersocket is 0. and in vm setting: "NumCoresPerSocket": 8.
Expected results:
The NumCoresPerSocket should match the setting in install-config.
Additional info:
installconfig setting:
controlPlane:
  architecture: amd64
  hyperthreading: Enabled
  name: master
  platform:
    vsphere:
      cpus: 16
      coresPerSocket: 8
check result:
"Hardware": { "NumCPU": 16, "NumCoresPerSocket": 16,
the check result for compute node is expected.
installconfig setting:
compute:
- architecture: amd64
  hyperthreading: Enabled
  name: worker
  platform:
    vsphere:
      cpus: 8
      coresPerSocket: 4
check result:
"Hardware": { "NumCPU": 8, "NumCoresPerSocket": 4,
This is a clone of issue OCPBUGS-42060. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-41228. The following is the description of the original issue:
—
Description of problem:
The console crashes when the user selects SSH as the Authentication type for the git server under add secret in the start pipeline form
Version-Release number of selected component (if applicable):
How reproducible:
Every time. Only in the Developer perspective and if the Pipelines dynamic plugin is enabled.
Steps to Reproduce:
1. Create a pipeline through add flow and open start pipeline page 2. Under show credentials select add secret 3. In the secret form select `Access to ` as Git server and `Authentication type` as SSH key
Actual results:
Console crashes
Expected results:
UI should work as expected
Additional info:
Attaching console log screenshot
https://drive.google.com/file/d/1bGndbq_WLQ-4XxG5ylU7VuZWZU15ywTI/view?usp=sharing
Steps to Reproduce:
1. Install a cluster using Azure Workload Identity 2. Check the value of the cco_credentials_mode metric
Actual results:
mode = manual
Expected results:
mode = manualpodidentity
Additional info:
The cco_credentials_mode metric reports manualpodidentity mode for an AWS STS cluster.
PR adding the metric in OCP 4.17: https://github.com/openshift/cluster-monitoring-operator/pull/2291
This will need backports to existing OCP versions.
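For verification, the metric can be queried directly from the in-cluster monitoring stack; a minimal sketch, assuming the default thanos-querier route and the prometheus-k8s service account are available:
```
# Query the cco_credentials_mode metric via the Thanos querier route (sketch; adjust auth to your environment).
TOKEN=$(oc create token prometheus-k8s -n openshift-monitoring)
HOST=$(oc get route thanos-querier -n openshift-monitoring -o jsonpath='{.spec.host}')
curl -sk -H "Authorization: Bearer ${TOKEN}" \
  "https://${HOST}/api/v1/query" --data-urlencode 'query=cco_credentials_mode'
# Expected after the fix: a series with mode="manualpodidentity" on Azure Workload Identity clusters.
```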
This is a clone of issue OCPBUGS-33539. The following is the description of the original issue:
—
Description of problem:
The VirtualizedTable component in the console dynamic plugin SDK does not have a default sorting column. We need a default sorting column for list pages.
https://github.com/openshift/console/blob/master/frontend/packages/console-dynamic-plugin-sdk/docs/api.md#virtualizedtable
Description of problem:
The network resource provisioning playbook for 4.15 dualstack UPI contains a task for adding an IPv6 subnet to the existing external router [1]. This task fails with:
- ansible-2.9.27-1.el8ae.noarch & ansible-collections-openstack-1.8.0-2.20220513065417.5bb8312.el8ost.noarch in OSP 16 env (RHEL 8.5), or
- openstack-ansible-core-2.14.2-4.1.el9ost.x86_64 & ansible-collections-openstack-1.9.1-17.1.20230621074746.0e9a6f2.el9ost.noarch in OSP 17 env (RHEL 9.2)
Version-Release number of selected component (if applicable):
4.15.0-0.nightly-2024-01-22-160236
How reproducible:
Always
Steps to Reproduce:
1. Set the os_subnet6 in the inventory file for setting dualstack 2. Run the 4.15 network.yaml playbook
Actual results:
Playbook fails: TASK [Add IPv6 subnet to the external router] ********************************** fatal: [localhost]: FAILED! => {"changed": false, "extra_data": {"data": null, "details": "Invalid input for external_gateway_info. Reason: Validation of dictionary's keys failed. Expected keys: {'network_id'} Provided keys: {'external_fixed_ips'}.", "response": "{\"NeutronError\": {\"type\": \"HTTPBadRequest\", \"message\": \"Invalid input for external_gateway_info. Reason: Validation of dictionary's keys failed. Expected keys: {'network_id'} Provided keys: {'external_fixed_ips'}.\", \"detail\": \"\"}}"}, "msg": "Error updating router 8352c9c0-dc39-46ed-94ed-c038f6987cad: Client Error for url: https://10.46.43.81:13696/v2.0/routers/8352c9c0-dc39-46ed-94ed-c038f6987cad, Invalid input for external_gateway_info. Reason: Validation of dictionary's keys failed. Expected keys: {'network_id'} Provided keys: {'external_fixed_ips'}."}
Expected results:
Successful playbook execution
Additional info:
The router can be created in two different tasks; the playbook [2] worked for me.
[1] https://github.com/openshift/installer/blob/1349161e2bb8606574696bf1e3bc20ae054e60f8/upi/openstack/network.yaml#L43
[2] https://file.rdu.redhat.com/juriarte/upi/network.yaml
Please review the following PR: https://github.com/openshift/cluster-olm-operator/pull/47
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
It would make debugging easier if we included the namespace in the message of these alerts: https://github.com/openshift/cluster-ingress-operator/blob/master/manifests/0000_90_ingress-operator_03_prometheusrules.yaml#L69
Version-Release number of selected component (if applicable):
4.12.x
How reproducible:
Always
Steps to Reproduce:
1. 2. 3.
Actual results:
No namespace in the alert message
Expected results:
Additional info:
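For illustration, the change could template the namespace label into the alert annotation; a hypothetical fragment (the alert name and expression below are placeholders, not necessarily the operator's actual rules):
```
# Hypothetical PrometheusRule fragment showing the namespace templated into the alert message.
cat <<'EOF' > ingress-alert-message-example.yaml
- alert: HAProxyDown
  expr: haproxy_up == 0
  labels:
    severity: critical
  annotations:
    message: "HAProxy is reporting down for pod {{ $labels.namespace }}/{{ $labels.pod }}"
EOF
```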
Please review the following PR: https://github.com/openshift/cluster-dns-operator/pull/397
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
When I used the file to create the CatalogSource, the creation failed and hit an error:
[root@preserve-fedora36 cluster-resources]# oc create -f cs-redhat-operator-index-v4-15.yaml
The CatalogSource "cs-redhat-operator-index-v4-15" is invalid:
* spec.icon.base64data: Required value
* spec.icon.mediatype: Required value
[root@preserve-fedora36 cluster-resources]# cat cs-redhat-operator-index-v4-15.yaml
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  creationTimestamp: null
  name: cs-redhat-operator-index-v4-15
  namespace: openshift-marketplace
spec:
  icon: {}
  image: ec2-3-144-93-237.us-east-2.compute.amazonaws.com:5000/redhat/redhat-operator-index:v4.15
  sourceType: grpc
status: {}
Version-Release number of selected component (if applicable):
oc-mirror version WARNING: This version information is deprecated and will be replaced with the output from --short. Use --output=yaml|json to get the full version. Client Version: version.Info{Major:"", Minor:"", GitVersion:"4.16.0-202403070215.p0.gc4f8295.assembly.stream.el9-c4f8295", GitCommit:"c4f829512107f7d0f52a057cd429de2030b9b3b3", GitTreeState:"clean", BuildDate:"2024-03-07T03:46:24Z", GoVersion:"go1.21.7 (Red Hat 1.21.7-1.el9) X:strictfipsruntime", Compiler:"gc", Platform:"linux/amd64"}
How reproducible:
always
Steps to Reproduce:
1. Use the following ImageSetConfiguration to mirror to localhost:
cat config.yaml
kind: ImageSetConfiguration
apiVersion: mirror.openshift.io/v1alpha2
#archiveSize: 8
storageConfig:
  local:
    path: /app1/ocmirror/offline
mirror:
  platform:
    channels:
    - name: stable-4.12
      type: ocp
      minVersion: '4.12.46'
      maxVersion: '4.12.46'
      shortestPath: true
    graph: true
  operators:
  - catalog: registry.redhat.io/redhat/redhat-operator-index:v4.15
    packages:
    - name: advanced-cluster-management
      channels:
      - name: release-2.9
    - name: compliance-operator
      channels:
      - name: stable
    - name: multicluster-engine
      channels:
      - name: stable-2.4
      - name: stable-2.5
  additionalImages:
  - name: registry.redhat.io/ubi8/ubi:latest
  - name: registry.redhat.io/rhel8/support-tools:latest
  - name: registry.access.redhat.com/ubi8/nginx-120:latest
  - name: registry.k8s.io/sig-storage/csi-node-driver-registrar:v2.8.0
  - name: registry.k8s.io/sig-storage/csi-resizer:v1.8.0
`oc-mirror --config config.yaml file://operatortest --v2`
2. Mirror to the registry:
`oc-mirror --config config.yaml --from file://operatortest docker://ec2-3-144-93-237.us-east-2.compute.amazonaws.com:5000 --v2`
3. Create the CatalogSource with the created file:
cat cs-redhat-operator-index-v4-15.yaml
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  creationTimestamp: null
  name: cs-redhat-operator-index-v4-15
  namespace: openshift-marketplace
spec:
  icon: {}
  image: ec2-3-144-93-237.us-east-2.compute.amazonaws.com:5000/redhat/redhat-operator-index:v4.15
  sourceType: grpc
status: {}
oc create -f cs-redhat-operator-index-v4-15.yaml
The CatalogSource "cs-redhat-operator-index-v4-15" is invalid:
* spec.icon.base64data: Required value
* spec.icon.mediatype: Required value
Actual results:
Failed to create catalogsource by the created file.
Expected results:
No error.
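As a workaround sketch (assumption: the validation error is triggered by the empty icon stanza that oc-mirror emits, so dropping it lets the resource apply), the generated file can be trimmed before creating it:
```
# Recreate the CatalogSource without the empty icon/status stanzas (workaround sketch).
cat <<'EOF' | oc create -f -
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: cs-redhat-operator-index-v4-15
  namespace: openshift-marketplace
spec:
  image: ec2-3-144-93-237.us-east-2.compute.amazonaws.com:5000/redhat/redhat-operator-index:v4.15
  sourceType: grpc
EOF
```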
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Please review the following PR: https://github.com/openshift/prometheus-operator/pull/266
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Recently a user was attempting to change the Virtual Machine Folder for a cluster installed on vSphere. The user used the configuration panel "vSphere Connection Configuration" to complete this process. Upon updating the path and clicking "Save Configuration" cluster wide issues emerged including nodes not coming back online after a reboot.
OpenShift nodes eventually crashed with an error resulting from an incorrectly parsed folder path, due to the missing string literal " " characters.
While this was exhibited on OCP 4.13, other versions may be affected.
This is a clone of issue OCPBUGS-31250. The following is the description of the original issue:
—
Description of problem:
1. For the Linux nodes, the container runtime is CRI-O and port 9537 has a crio process listening on it. Windows nodes, however, do not have the CRI-O container runtime.
2. Prometheus is trying to connect to the /metrics endpoint on the Windows nodes on port 9537, which does not have any process listening on it.
3. TargetDown is alerting for the crio job since it cannot reach the endpoint http://windows-node-ip:9537/metrics.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. Install a 4.13 cluster with the Windows operator 2. In the Prometheus UI, go to Status > Targets to see which targets are down.
Actual results:
The TargetDown alert fires.
Expected results:
No such alert should fire.
Additional info:
Description of problem:
When the console with a custom route is disabled before a cluster upgrade and re-enabled after the upgrade, the console cannot be accessed successfully.
Version-Release number of selected component (if applicable):
4.13.0-0.nightly-2023-02-15-111607
How reproducible:
Always
Steps to Reproduce:
1. Launch a cluster with available update. 2. Create custom route for console in ingress configuration: # oc edit ingresses.config.openshift.io cluster spec: componentRoutes: - hostname: console-openshift-custom.apps.qe-413-0216.qe.devcluster.openshift.com name: console namespace: openshift-console - hostname: openshift-downloads-custom.apps.qe-413-0216.qe.devcluster.openshift.com name: downloads namespace: openshift-console domain: apps.qe-413-0216.qe.devcluster.openshift.com 3. After custom route is created, access console with custom route. 4. Remove console by setting managementState as Removed in console operator: # oc edit consoles.operator.openshift.io cluster spec: logLevel: Normal managementState: Removed operatorLogLevel: Normal 5. Upgrade cluster to a target version. 6. Enable console by setting managementState as Managed in console operator: # oc edit consoles.operator.openshift.io cluster spec: logLevel: Normal managementState: Managed operatorLogLevel: Normal 7. After console resources are created, access console url.
Actual results:
3. Console could be accessed through custom route. 4. Console resources are removed. And all cluster operators are in normal status # oc get all -n openshift-console No resources found in openshift-console namespace.
5. Upgrade succeeds, all cluster operators are in normal status
6. Console resources are created:
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/console ClusterIP 172.30.226.112 <none> 443/TCP 3m50s
service/console-redirect ClusterIP 172.30.147.151 <none> 8444/TCP 3m50s
service/downloads ClusterIP 172.30.251.248 <none> 80/TCP 3m50s
NAME READY UP-TO-DATE AVAILABLE AGE
deployment.apps/console 2/2 2 2 3m47s
deployment.apps/downloads 2/2 2 2 3m50s
NAME DESIRED CURRENT READY AGE
replicaset.apps/console-69d88985b 2 2 2 3m42s
replicaset.apps/console-6dbdd487d 0 0 0 3m47s
replicaset.apps/downloads-6b6b555d8d 2 2 2 3m50s
NAME HOST/PORT PATH SERVICES PORT TERMINATION WILDCARD
route.route.openshift.io/console console-openshift-console.apps.qe-413-0216.qe.devcluster.openshift.com console-redirect custom-route-redirect edge/Redirect None
route.route.openshift.io/console-custom console-openshift-custom.apps.qe-413-0216.qe.devcluster.openshift.com console https reencrypt/Redirect None
route.route.openshift.io/downloads downloads-openshift-console.apps.qe-413-0216.qe.devcluster.openshift.com downloads http edge/Redirect None
route.route.openshift.io/downloads-custom openshift-downloads-custom.apps.qe-413-0216.qe.devcluster.openshift.com downloads http edge/Redirect None
7. Could not open the console URL successfully. There is error info for the console operator:
Expected results:
7. Should be able to access console successfully.
Additional info:
Description of problem:
A 4.15 control plane can't create a 4.14 node pool due to an issue with the payload.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. Create an Hosted Cluster in 4.15 2. Create a Node Pool in 4.14 3. Node pool stuck in provisioning
Actual results:
No node pool is created
Expected results:
The node pool is created, as we support N-2 versions there.
Additional info:
Possibly linked to OCPBUGS-26757
This is a clone of issue OCPBUGS-33756. The following is the description of the original issue:
—
Description of problem:
The "Auth Token GCP" filter in OperatorHub is displayed all the time, but in stead it should be rendered only for GPC cluster that have Manual creadential mode. When an GCP WIF capable operator is installed and the cluster is in GCP WIF mode, the Console should require the user to enter the necessary information about the GCP project, account, service account etc, which is in turn to be injected the operator's deployment via subscription.config (exactly how Azure and AWS STS got implemented in Console)
Version-Release number of selected component (if applicable):
4.15
How reproducible:
Steps to Reproduce:
1. On a non-GCP cluster, navigate to OperatorHub 2. check available filters 3.
Actual results:
"Auth Token GCP" filter is available in OperatorHub
Expected results:
"Auth Token GCP" filter should not be available in OperatorHub for a non-GCP cluster. When selecting an operator that supports "Auth token GCP" as indicated by the annotation features.operators.openshift.io/token-auth-gcp: "true" the console needs to, aligned with how it works AWS/Azure auth capable operators, force the user to input the required information to auth against GCP via WIF in the form of env vars that are set up using subscription.config on the operator. The exact names need to come out of https://issues.redhat.com/browse/CCO-574
Additional info:
Azure PR - https://github.com/openshift/console/pull/13082 AWS PR - https://github.com/openshift/console/pull/12778
UI Screen Design can be taken from the existing implementation of the Console support short-lived token setup flow for AWS and Azure described here: https://docs.google.com/document/d/1iFNpyycby_rOY1wUew-yl3uPWlE00krTgr9XHDZOTNo/edit
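For illustration, the console-driven flow would ultimately translate the collected values into env vars on the operator's Subscription, the same OLM mechanism used for the AWS/Azure flows; a hypothetical sketch (the env var names are assumptions pending https://issues.redhat.com/browse/CCO-574):
```
# Hypothetical Subscription with GCP WIF details injected via spec.config.env (operator and var names are placeholders).
cat <<'EOF' | oc apply -f -
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: example-operator
  namespace: openshift-operators
spec:
  channel: stable
  name: example-operator
  source: redhat-operators
  sourceNamespace: openshift-marketplace
  config:
    env:
    - name: GCP_PROJECT_ID
      value: my-project
    - name: GCP_SERVICE_ACCOUNT_EMAIL
      value: operator-sa@my-project.iam.gserviceaccount.com
EOF
```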
Description of problem:
When destroying an HCP KubeVirt cluster using the CLI with --destroy-cloud-resources, PVCs are not cleaned up within the guest cluster because the CLI does not properly honor the --destroy-cloud-resources option for KubeVirt.
Version-Release number of selected component (if applicable):
4.16
How reproducible:
100%
Steps to Reproduce:
1. destroy an hcp kubevirt cluster using cli and --destroy-cloud-resources 2. 3.
Actual results:
the hosted cluster does not get the hypershift.openshift.io/cleanup-cloud-resources: "true" annotation added, which is what ensures the hosted cluster config controller cleans up PVCs
Expected results:
the hypershift.openshift.io/cleanup-cloud-resources: "true" annotation should be added to the hosted cluster during teardown when the --destroy-cloud-resources CLI option is used
Additional info:
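Until the CLI honors the flag, the annotation named above could be applied by hand before running the destroy; a minimal sketch, assuming the HostedCluster lives in the conventional "clusters" namespace and the name is a placeholder:
```
# Manually add the cleanup annotation before destroying the hosted cluster (workaround sketch).
oc annotate hostedcluster my-guest -n clusters \
  hypershift.openshift.io/cleanup-cloud-resources="true" --overwrite
```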
Please review the following PR: https://github.com/openshift/ironic-rhcos-downloader/pull/96
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
This is a clone of issue OCPBUGS-41896. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-41776. The following is the description of the original issue:
—
Description of problem:
The section is: https://docs.openshift.com/container-platform/4.16/installing/installing_aws/ipi/installing-aws-vpc.html#installation-aws-arm-tested-machine-types_installing-aws-vpc. All tested ARM instances for 4.14+: c6g.*, c7g.*, m6g.*, m7g.*, r8g.*. We need to ensure that every "Tested instance types for AWS on 64-bit ARM infrastructures" section has been updated for 4.14+.
Additional info:
Please review the following PR: https://github.com/openshift/cluster-csi-snapshot-controller-operator/pull/179
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
Client side throttling observed when running the metrics controller.
Steps to Reproduce:
1. Install an AWS cluster in mint mode 2. Enable the debug log by editing cloudcredential/cluster (see the sketch below) 3. Wait for the metrics loop to run a few times 4. Check the CCO logs
Actual results:
// 7s consumed by metrics loop which is caused by client-side throttling time="2024-01-20T19:43:56Z" level=info msg="calculating metrics for all CredentialsRequests" controller=metrics I0120 19:43:56.251278 1 request.go:629] Waited for 176.161298ms due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/api/v1/namespaces/openshift-cloud-network-config-controller/secrets/cloud-credentials I0120 19:43:56.451311 1 request.go:629] Waited for 197.182213ms due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/api/v1/namespaces/openshift-cloud-network-config-controller/secrets/cloud-credentials I0120 19:43:56.651313 1 request.go:629] Waited for 197.171082ms due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/api/v1/namespaces/openshift-cloud-network-config-controller/secrets/cloud-credentials I0120 19:43:56.850631 1 request.go:629] Waited for 195.251487ms due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/api/v1/namespaces/openshift-cloud-network-config-controller/secrets/cloud-credentials ... time="2024-01-20T19:44:03Z" level=info msg="reconcile complete" controller=metrics elapsed=7.231061324s
Expected results:
No client-side throttling when running the metrics controller.
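For step 2, the operator's log level can be raised through its operator config; a minimal sketch, assuming the standard operator logLevel field and the usual deployment name:
```
# Enable debug logging on the cloud-credential operator config (sketch).
oc patch cloudcredential cluster --type merge -p '{"spec":{"logLevel":"Debug"}}'
# Then watch the metrics controller in the CCO logs (container name may need adjusting):
oc logs -n openshift-cloud-credential-operator deployment/cloud-credential-operator -f | grep -i "controller=metrics"
```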
Description of problem:
The HyperShift operator is applying control-plane-pki-operator RBAC resources regardless of whether PKI reconciliation is disabled for the HostedCluster.
Version-Release number of selected component (if applicable):
4.15
How reproducible:
100%
Steps to Reproduce:
1. Create a 4.15 HostedCluster with PKI reconciliation disabled 2. Unused RBAC resources for the control-plane-pki-operator are created
Actual results:
Unused RBAC resources for the control-plane-pki-operator are created
Expected results:
RBAC resources for the control-plane-pki-operator should not be created if the deployment for the control-plane-pki-operator itself is not created.
Additional info:
This is a clone of issue OCPBUGS-37541. The following is the description of the original issue:
—
Description of problem:
Apps exposed via NodePort do not return responses to client requests if the client's ephemeral port is 22623 or 22624.
When testing with curl command specifying the local port as shown below, a response is returned if the ephemeral port is 22622 or 22626, but it times out if the ephemeral port is 22623 or 22624.
[root@bastion ~]# for i in {22622..22626}; do echo localport:${i}; curl -m 10 -I 10.0.0.20:32325 --local-port ${i}; done localport:22622 HTTP/1.1 200 OK Server: nginx/1.22.1 Date: Thu, 25 Jul 2024 07:44:22 GMT Content-Type: text/html Content-Length: 37451 Last-Modified: Wed, 24 Jul 2024 12:20:19 GMT Connection: keep-alive ETag: "66a0f183-924b" Accept-Ranges: bytes localport:22623 curl: (28) Connection timed out after 10001 milliseconds localport:22624 curl: (28) Connection timed out after 10000 milliseconds localport:22625 HTTP/1.1 200 OK Server: nginx/1.22.1 Date: Thu, 25 Jul 2024 07:44:42 GMT Content-Type: text/html Content-Length: 37451 Last-Modified: Wed, 24 Jul 2024 12:20:19 GMT Connection: keep-alive ETag: "66a0f183-924b" Accept-Ranges: bytes localport:22626 HTTP/1.1 200 OK Server: nginx/1.22.1 Date: Thu, 25 Jul 2024 07:44:42 GMT Content-Type: text/html Content-Length: 37451 Last-Modified: Wed, 24 Jul 2024 12:20:19 GMT Connection: keep-alive ETag: "66a0f183-924b" Accept-Ranges: bytes
This issue has been occurring since upgrading to version 4.16. Confirmed that it does not occur in versions 4.14 and 4.12.
Version-Release number of selected component (if applicable):
OCP 4.16
How reproducible:
100%
Steps to Reproduce:
1. Prepare a 4.16 cluster.
2. Launch any web app pod (nginx, httpd, etc.).
3. Expose the application externally using NodePort.
4. Access the URL using curl --local-port option to specify 22623 or 22624.
Actual results:
No response is returned from the exposed application when the ephemeral port is 22623 or 22624.
Expected results:
A response is returned regardless of the ephemeral port.
Additional info:
This issue started occurring from version 4.16, so it is possible that this is due to changes in RHEL 9.4, particularly those related to nftables.
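To narrow this down on a 4.16 node, one can dump the host's nftables ruleset and look for rules matching those ports; a minimal sketch (the node name is a placeholder, and the assumption is that MCS-protection rules are what intercept traffic whose source port is 22623/22624):
```
# Inspect the node's nftables rules for the MCS ports 22623/22624 (sketch).
oc debug node/<worker-node> -- chroot /host sh -c 'nft list ruleset | grep -nE "22623|22624"'
```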
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of the problem:
Currently, assisted-service validates the user pull secret during actions like registering / updating a cluster / infraenv, checking that the pull secret contains the tokens for all release images. Therefore, we can't add nightly release images to the stage environment, as it blocks users without the right token from installing clusters with different release images.
How reproducible:
Always
Steps to reproduce:
1. Add nightly image to stage. (not recommended)
Actual results:
Expected results:
The cluster registers successfully.
This fix contains the following changes coming from updated version of kubernetes up to v1.29.8:
Changelog:
v1.29.8: https://github.com/kubernetes/kubernetes/blob/release-1.29/CHANGELOG/CHANGELOG-1.29.md#changelog-since-v1297
Description of problem:
Invalid memory address or nil pointer dereference in Cloud Network Config Controller
Version-Release number of selected component (if applicable):
4.12
How reproducible:
sometimes
Steps to Reproduce:
1. Happens by itself sometimes 2. 3.
Actual results:
Panic and pod restarts
Expected results:
Panics due to Invalid memory address or nil pointer dereference should not occur
Additional info:
E0118 07:54:18.703891 1 runtime.go:79] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference) goroutine 93 [running]: k8s.io/apimachinery/pkg/util/runtime.logPanic({0x203c8c0?, 0x3a27b20}) /go/src/github.com/openshift/cloud-network-config-controller/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:75 +0x99 k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0xc0003bd090?}) /go/src/github.com/openshift/cloud-network-config-controller/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:49 +0x75 panic({0x203c8c0, 0x3a27b20}) /usr/lib/golang/src/runtime/panic.go:884 +0x212 github.com/openshift/cloud-network-config-controller/pkg/cloudprovider.(*Azure).AssignPrivateIP(0xc0001ce700, {0xc000696540, 0x10, 0x10}, 0xc000818ec0) /go/src/github.com/openshift/cloud-network-config-controller/pkg/cloudprovider/azure.go:146 +0xcf0 github.com/openshift/cloud-network-config-controller/pkg/controller/cloudprivateipconfig.(*CloudPrivateIPConfigController).SyncHandler(0xc000986000, {0xc000896a90, 0xe}) /go/src/github.com/openshift/cloud-network-config-controller/pkg/controller/cloudprivateipconfig/cloudprivateipconfig_controller.go:327 +0x1013 github.com/openshift/cloud-network-config-controller/pkg/controller.(*CloudNetworkConfigController).processNextWorkItem.func1(0xc000720d80, {0x1e640c0?, 0xc0003bd090?}) /go/src/github.com/openshift/cloud-network-config-controller/pkg/controller/controller.go:152 +0x11c github.com/openshift/cloud-network-config-controller/pkg/controller.(*CloudNetworkConfigController).processNextWorkItem(0xc000720d80) /go/src/github.com/openshift/cloud-network-config-controller/pkg/controller/controller.go:162 +0x46 github.com/openshift/cloud-network-config-controller/pkg/controller.(*CloudNetworkConfigController).runWorker(0xc000504ea0?) /go/src/github.com/openshift/cloud-network-config-controller/pkg/controller/controller.go:113 +0x25 k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0x0?) /go/src/github.com/openshift/cloud-network-config-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:157 +0x3e k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0x0?, {0x27b3220, 0xc000894480}, 0x1, 0xc0000aa540) /go/src/github.com/openshift/cloud-network-config-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:158 +0xb6 k8s.io/apimachinery/pkg/util/wait.JitterUntil(0x0?, 0x3b9aca00, 0x0, 0x0?, 0x0?) /go/src/github.com/openshift/cloud-network-config-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:135 +0x89 k8s.io/apimachinery/pkg/util/wait.Until(0x0?, 0x0?, 0x0?) 
/go/src/github.com/openshift/cloud-network-config-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:92 +0x25 created by github.com/openshift/cloud-network-config-controller/pkg/controller.(*CloudNetworkConfigController).Run /go/src/github.com/openshift/cloud-network-config-controller/pkg/controller/controller.go:99 +0x3aa panic: runtime error: invalid memory address or nil pointer dereference [recovered] panic: runtime error: invalid memory address or nil pointer dereference [signal SIGSEGV: segmentation violation code=0x1 addr=0x8 pc=0x1a40b30] goroutine 93 [running]: k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0xc0003bd090?}) /go/src/github.com/openshift/cloud-network-config-controller/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:56 +0xd7 panic({0x203c8c0, 0x3a27b20}) /usr/lib/golang/src/runtime/panic.go:884 +0x212 github.com/openshift/cloud-network-config-controller/pkg/cloudprovider.(*Azure).AssignPrivateIP(0xc0001ce700, {0xc000696540, 0x10, 0x10}, 0xc000818ec0) /go/src/github.com/openshift/cloud-network-config-controller/pkg/cloudprovider/azure.go:146 +0xcf0 github.com/openshift/cloud-network-config-controller/pkg/controller/cloudprivateipconfig.(*CloudPrivateIPConfigController).SyncHandler(0xc000986000, {0xc000896a90, 0xe}) /go/src/github.com/openshift/cloud-network-config-controller/pkg/controller/cloudprivateipconfig/cloudprivateipconfig_controller.go:327 +0x1013 github.com/openshift/cloud-network-config-controller/pkg/controller.(*CloudNetworkConfigController).processNextWorkItem.func1(0xc000720d80, {0x1e640c0?, 0xc0003bd090?}) /go/src/github.com/openshift/cloud-network-config-controller/pkg/controller/controller.go:152 +0x11c github.com/openshift/cloud-network-config-controller/pkg/controller.(*CloudNetworkConfigController).processNextWorkItem(0xc000720d80) /go/src/github.com/openshift/cloud-network-config-controller/pkg/controller/controller.go:162 +0x46 github.com/openshift/cloud-network-config-controller/pkg/controller.(*CloudNetworkConfigController).runWorker(0xc000504ea0?) /go/src/github.com/openshift/cloud-network-config-controller/pkg/controller/controller.go:113 +0x25 k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0x0?) /go/src/github.com/openshift/cloud-network-config-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:157 +0x3e k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0x0?, {0x27b3220, 0xc000894480}, 0x1, 0xc0000aa540) /go/src/github.com/openshift/cloud-network-config-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:158 +0xb6 k8s.io/apimachinery/pkg/util/wait.JitterUntil(0x0?, 0x3b9aca00, 0x0, 0x0?, 0x0?) /go/src/github.com/openshift/cloud-network-config-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:135 +0x89 k8s.io/apimachinery/pkg/util/wait.Until(0x0?, 0x0?, 0x0?) /go/src/github.com/openshift/cloud-network-config-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:92 +0x25 created by github.com/openshift/cloud-network-config-controller/pkg/controller.(*CloudNetworkConfigController).Run /go/src/github.com/openshift/cloud-network-config-controller/pkg/controller/controller.go:99 +0x3aa
This is a clone of issue OCPBUGS-43917. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-42987. The following is the description of the original issue:
—
It has been observed that the esp_offload kernel module might be loaded by libreswan even if bond ESP offloads have been correctly turned off.
This might be because the ipsec service and configure-ovs run at the same time, so it is possible that the ipsec service starts when bond offloads are not yet turned off, which tricks libreswan into thinking they should be used.
The potential fix would be to run the ipsec service after configure-ovs.
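One way to express that ordering would be a systemd drop-in on the nodes; a hypothetical sketch, assuming the units involved are ipsec.service and ovs-configuration.service (the unit that runs configure-ovs):
```
# Hypothetical drop-in forcing ipsec.service to start only after configure-ovs has run.
mkdir -p /etc/systemd/system/ipsec.service.d
cat <<'EOF' > /etc/systemd/system/ipsec.service.d/10-after-configure-ovs.conf
[Unit]
After=ovs-configuration.service
EOF
systemctl daemon-reload
```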
We had this CI job failing because clusteroperator/storage kept flip-flopping between progressing=True and progressing=False
: [sig-arch] events should not repeat pathologically for ns/openshift-cluster-storage-operator expand_less 0s { 1 events happened too frequently event happened 21 times, something is wrong: namespace/openshift-cluster-storage-operator deployment/cluster-storage-operator hmsg/cfc7e5cdbe - reason/OperatorStatusChanged Status for clusteroperator/storage changed: Progressing changed from True to False ("AWSEBSCSIDriverOperatorCRProgressing: All is well\nSHARESCSIDriverOperatorCRProgressing: All is well") From: 14:13:20Z To: 14:13:21Z result=reject }
This exposed OCPBUGS-24027 which is now fixed.
However, there are still an excessive number of progressing events from this job.
$ grep 'clusteroperator/storage changed: Progressing' events.txt > progressing.txt
$ wc -l progressing.txt
28 progressing.txt
A small subset of those actually change between True and False
$ grep 'clusteroperator/storage changed: Progressing' events.txt | grep True openshift-cluster-storage-operator 143m Normal OperatorStatusChanged deployment/cluster-storage-operator Status for clusteroperator/storage changed: Progressing changed from Unknown to False ("All is well"),Available changed from Unknown to True ("DefaultStorageClassControllerAvailable: StorageClass provided by supplied CSI Driver instead of the cluster-storage-operator") openshift-cluster-storage-operator 143m Normal OperatorStatusChanged deployment/cluster-storage-operator Status for clusteroperator/storage changed: Progressing changed from False to True ("AWSEBSProgressing: Waiting for Deployment to act on changes") openshift-cluster-storage-operator 143m Normal OperatorStatusChanged deployment/cluster-storage-operator Status for clusteroperator/storage changed: Progressing message changed from "AWSEBSProgressing: Waiting for Deployment to deploy pods" to "AWSEBSCSIDriverOperatorCRProgressing: Waiting for AWSEBS operator to report status\nAWSEBSProgressing: Waiting for Deployment to deploy pods",Available changed from True to False ("AWSEBSCSIDriverOperatorCRAvailable: Waiting for AWSEBS operator to report status"),Upgradeable changed from Unknown to True ("All is well") openshift-cluster-storage-operator 136m Normal OperatorStatusChanged deployment/cluster-storage-operator Status for clusteroperator/storage changed: Progressing changed from True to False ("AWSEBSCSIDriverOperatorCRProgressing: All is well\nSHARESCSIDriverOperatorCRProgressing: All is well") openshift-cluster-storage-operator 45m Normal OperatorStatusChanged deployment/cluster-storage-operator Status for clusteroperator/storage changed: Progressing changed from False to True ("AWSEBSCSIDriverOperatorCRProgressing: AWSEBSDriverNodeServiceControllerProgressing: Waiting for DaemonSet to deploy node pods") openshift-cluster-storage-operator 2m11s Normal OperatorStatusChanged deployment/cluster-storage-operator Status for clusteroperator/storage changed: Progressing changed from True to False ("AWSEBSCSIDriverOperatorCRProgressing: All is well\nSHARESCSIDriverOperatorCRProgressing: All is well") openshift-cluster-storage-operator 8m6s Normal OperatorStatusChanged deployment/cluster-storage-operator Status for clusteroperator/storage changed: Progressing changed from False to True ("SHARESCSIDriverOperatorCRProgressing: SharedResourcesDriverNodeServiceControllerProgressing: Waiting for DaemonSet to deploy node pods") openshift-cluster-storage-operator 2m12s Normal OperatorStatusChanged deployment/cluster-storage-operator Status for clusteroperator/storage changed: Progressing changed from False to True ("SHARESProgressing: Waiting for Deployment to deploy pods")
But then we end up with events like this for example, where CSO has just appended the status message with more noise between competing controllers:
openshift-cluster-storage-operator 142m Normal OperatorStatusChanged deployment/cluster-storage-operator Status for clusteroperator/storage changed: Progressing message changed from "AWSEBSCSIDriverOperatorCRProgressing: AWSEBSDriverControllerServiceControllerProgressing: Waiting for Deployment to act on changes\nAWSEBSCSIDriverOperatorCRProgressing: AWSEBSDriverNodeServiceControllerProgressing: Waiting for DaemonSet to deploy node pods\nSHARESCSIDriverOperatorCRProgressing: SharedResourceCSIDriverWebhookControllerProgressing: Waiting for Deployment to deploy pods" to "AWSEBSCSIDriverOperatorCRProgressing: AWSEBSDriverControllerServiceControllerProgressing: Waiting for Deployment to deploy pods\nAWSEBSCSIDriverOperatorCRProgressing: AWSEBSDriverNodeServiceControllerProgressing: Waiting for DaemonSet to deploy node pods\nSHARESCSIDriverOperatorCRProgressing: SharedResourceCSIDriverWebhookControllerProgressing: Waiting for Deployment to deploy pods",Available message changed from "AWSEBSCSIDriverOperatorCRAvailable: AWSEBSDriverControllerServiceControllerAvailable: Waiting for Deployment\nAWSEBSCSIDriverOperatorCRAvailable: AWSEBSDriverNodeServiceControllerAvailable: Waiting for the DaemonSet to deploy the CSI Node Service\nSHARESCSIDriverOperatorCRAvailable: SharedResourceCSIDriverWebhookControllerAvailable: Waiting for Deployment\nSHARESCSIDriverOperatorCRAvailable: SharedResourcesDriverNodeServiceControllerAvailable: Waiting for the DaemonSet to deploy the CSI Node Service" to "AWSEBSCSIDriverOperatorCRAvailable: AWSEBSDriverControllerServiceControllerAvailable: Waiting for Deployment\nSHARESCSIDriverOperatorCRAvailable: SharedResourceCSIDriverWebhookControllerAvailable: Waiting for Deployment\nSHARESCSIDriverOperatorCRAvailable: SharedResourcesDriverNodeServiceControllerAvailable: Waiting for the DaemonSet to deploy the CSI Node Service"
There are multiple controllers for multiple operators updating the progressing condition, which generates an excessive number of events. This would be (at least) annoying on a live cluster, but it also leaves CSO susceptible to `events should not repeat pathologically` test flakes in CI.
This is a clone of issue OCPBUGS-35020. The following is the description of the original issue:
—
Description of problem:
When investigating https://issues.redhat.com/browse/OCPBUGS-34819 we encountered an issue with the LB creation but also noticed that masters are using an S3 stub ignition even though they don't have to. Although that can be harmless, we are adding an extra, useless hop that we don't need.
Version-Release number of selected component (if applicable):
4.16+
How reproducible:
always
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Change the AWSMachineTemplate ignition.storageType to UnencryptedUserData
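A sketch of what that change could look like on the machine template, assuming the Cluster API Provider AWS ignition.storageType field and its UnencryptedUserData value:
```
# Illustrative AWSMachineTemplate fragment; field path assumed from CAPA's ignition settings.
cat <<'EOF' > awsmachinetemplate-ignition-fragment.yaml
spec:
  template:
    spec:
      ignition:
        storageType: UnencryptedUserData
EOF
```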
Please review the following PR: https://github.com/openshift/csi-driver-manila-operator/pull/213
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/ironic-rhcos-downloader/pull/94
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
On https://prow.ci.openshift.org/view/gs/test-platform-results/logs/aggregated-gcp-ovn-upgrade-4.15-micro-release-openshift-release-analysis-aggregator/1756168710529224704 we failed three netpol tests on just one result, a failure, with no successes. However, the other 9 jobs seemed to run fine.
[sig-network] Netpol NetworkPolicy between server and client should allow ingress access from namespace on one named port [Feature:NetworkPolicy] [Skipped:Network/OVNKubernetes] [Skipped:Network/OpenShiftSDN/Multitenant] [Skipped:Network/OpenShiftSDN] [Suite:openshift/conformance/parallel] [Suite:k8s]
[sig-network] Netpol NetworkPolicy between server and client should allow egress access on one named port [Feature:NetworkPolicy] [Skipped:Network/OVNKubernetes] [Skipped:Network/OpenShiftSDN/Multitenant] [Skipped:Network/OpenShiftSDN] [Suite:openshift/conformance/parallel] [Suite:k8s]
[sig-network] Netpol NetworkPolicy between server and client should allow ingress access on one named port [Feature:NetworkPolicy] [Skipped:Network/OVNKubernetes] [Skipped:Network/OpenShiftSDN/Multitenant] [Skipped:Network/OpenShiftSDN] [Suite:openshift/conformance/parallel] [Suite:k8s]
Something seems off with these tests: they barely run and don't seem to consistently report results. 4.15 has just 3 runs in the last two weeks, while 4.16 has just 1, which passed.
Whatever's going on, it's capable of taking out a payload, though it's not happening 100% of the time. https://amd64.ocp.releases.ci.openshift.org/releasestream/4.15.0-0.ci/release/4.15.0-0.ci-2024-02-10-040857
Please review the following PR: https://github.com/openshift/kubernetes-autoscaler/pull/277
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Align with the version-less-ness of `rhel-coreos` and `fedora-coreos` and shorten the overall tag.
Both tags are currently aliases, `centos-stream-coreos-9` will be removed in the future.
Description of problem:
After enabling user-defined monitoring on an HyperShift hosted cluster, PrometheusOperatorRejectedResources starts firing.
Version-Release number of selected component (if applicable):
4.14
How reproducible:
Always
Steps to Reproduce:
1. Start a HyperShift-hosted cluster with cluster-bot 2. Enable user-defined monitoring 3.
Actual results:
PrometheusOperatorRejectedResources alert becomes firing
Expected results:
No alert firing
Additional info:
Need to reach out to the HyperShift folks as the fix should probably be in their code base.
In TRT-1476, we created a VM that served as an endpoint where we can test connectivity in gcp.
We want one for AWS.
In TRT-1477, we created some code in origin to send HTTP GETs to that endpoint as a test to ensure connectivity remains working. Do this also for AWS.
TRT members already have an AWS account so we don't need to request one.
Description of problem:
As of 4.16, we do not expect the Kube Controller Manager Operator to set any cloud-provider-related flags, e.g. `--cloud-provider`, `--cloud-config` or `--external-cloud-volume-plugin`. In 4.17 setting these flags will prevent the KCM from starting.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Please review the following PR: https://github.com/openshift/cluster-olm-operator/pull/42
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
This is a clone of issue OCPBUGS-33890. The following is the description of the original issue:
—
Description of problem: Terraform should no longer be available for vSphere.
Description of problem:
Azure-File volume mount failed, it happens on arm cluster with multi payload $ oc describe pod Warning FailedMount 6m28s (x2 over 95m) kubelet MountVolume.MountDevice failed for volume "pvc-102ad3bf-3480-410b-a4db-73c64daeb3e2" : rpc error: code = InvalidArgument desc = GetAccountInfo(wduan-0319b-bkp2k-rg#clusterjzrlh#pvc-102ad3bf-3480-410b-a4db-73c64daeb3e2###wduan) failed with error: Retriable: true, RetryAfter: 0s, HTTPStatusCode: -1, RawError: azure.BearerAuthorizer#WithAuthorization: Failed to refresh the Token for request to https://management.azure.com/subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/wduan-0319b-bkp2k-rg/providers/Microsoft.Storage/storageAccounts/clusterjzrlh/listKeys?api-version=2021-02-01: StatusCode=0 -- Original Error: adal: Failed to execute the refresh request. Error = 'Post "https://login.microsoftonline.com/6047c7e9-b2ad-488d-a54e-dc3f6be6a7ee/oauth2/token": dial tcp 20.190.190.193:443: i/o timeout'
The node log reports: W0319 09:41:30.745936 1 azurefile.go:806] GetStorageAccountFromSecret(azure-storage-account-clusterjzrlh-secret, wduan) failed with error: could not get secret(azure-storage-account-clusterjzrlh-secret): secrets "azure-storage-account-clusterjzrlh-secret" is forbidden: User "system:serviceaccount:openshift-cluster-csi-drivers:azure-file-csi-driver-node-sa" cannot get resource "secrets" in API group "" in the namespace "wduan"
Checked the role looks good, at least the same as previous: $ oc get clusterrole azure-file-privileged-role -o yaml ... rules: - apiGroups: - security.openshift.io resourceNames: - privileged resources: - securitycontextconstraints verbs: - use
Version-Release number of selected component (if applicable):
4.16.0-0.nightly-multi-2024-03-13-031451
How reproducible:
2/2
Steps to Reproduce:
1. Checked in CI, azure-file cases failed due to this 2. Create one cluster with the same config and payload, create azure-file pvc and pod 3.
Actual results:
Pod could not be running
Expected results:
Pod should be running
Additional info:
Description of problem:
The image quay.io/centos7/httpd-24-centos7 used in TestMTLSWithCRLs and TestCRLUpdate is no longer being rebuilt, and has had its 'latest' tag removed. Containers using this image fail to start, and cause the tests to fail.
Version-Release number of selected component (if applicable):
How reproducible:
100%
Steps to Reproduce:
Run 'TEST="(TestMTLSWithCRLs|TestCRLUpdate)" make test-e2e' from the cluster-ingress-operator repo
Actual results:
Both tests and all their subtests fail
Expected results:
Tests pass
Additional info:
Problem Description:
Installed the Red Hat Quay Container Security Operator on the 4.13.25 cluster.
Below are my test results:
```
[sasakshi@sasakshi ~]$ oc version
Client Version: 4.12.7
Kustomize Version: v4.5.7
Server Version: 4.13.25
Kubernetes Version: v1.26.9+aa37255
[sasakshi@sasakshi ~]$ oc get csv -A | grep -i "quay" | tail -1
openshift container-security-operator.v3.10.2 Red Hat Quay Container Security Operator 3.10.2 container-security-operator.v3.10.1 Succeeded
[sasakshi@sasakshi ~]$ oc get subs -A
NAMESPACE NAME PACKAGE SOURCE CHANNEL
openshift-operators container-security-operator container-security-operator redhat-operators stable-3.10
[sasakshi@sasakshi ~]$ oc get imagemanifestvuln -A | wc -l
82
[sasakshi@sasakshi ~]$ oc get vuln --all-namespaces | wc -l
82
Console -> Administration -> Image Vulnerabilities: 82
Home -> Overview -> Status -> Image Vulnerabilities: 66
```
Observations from my testing:
Kindly refer to the attached screenshots for reference.
Documentation link referred:
Description of problem:
A few of the CI tests are continuously failing for the Jenkins Pipelines Build Strategy. As this strategy has been deprecated since 4.10, we should skip these tests to unblock the PRs.
Version-Release number of selected component (if applicable):
4.16.0
How reproducible:
Almost every time
Actual results:
Tests failing
Expected results:
All the tests should pass.
Additional info:
Observed in: https://github.com/openshift/cluster-openshift-controller-manager-operator/pull/343#issuecomment-2057851847
Description of problem:
CNO doesn't set the operConfig status while deploying the IPsec machine configs and IPsec daemonset; it must set the Progressing condition to true. When one of the components fails to deploy, it must report the Degraded condition set to true.
For more details, see the discussion here:
https://github.com/openshift/release/pull/50740#issuecomment-2076698580
Add an e2e check for sa token not being mounted on management side workloads unless necessary. This could be an extension of the needManamgementKasAccess check
Description of problem:
When a StatusItem lacks a timestamp, a <div> element is still output to the DOM, resulting in a spacing issue above the message and a vertical misalignment with the icon on the left. This is the result of the fact that timestamp is an optional prop, but the <div> surrounding the optional prop output is not optional.
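A minimal sketch of the kind of fix, assuming a React component with an optional timestamp prop (component, prop and class names here are illustrative, not the console's actual code):

```
import * as React from 'react';

type StatusItemProps = {
  icon: React.ReactNode;
  message: string;
  timestamp?: string; // optional, as in the bug description
};

// Only render the timestamp wrapper when a timestamp is actually provided,
// so no empty element pushes the message out of alignment with the icon.
export const StatusItem: React.FC<StatusItemProps> = ({ icon, message, timestamp }) => (
  <div className="status-item">
    {icon}
    <div>
      {timestamp && <div className="status-item__timestamp">{timestamp}</div>}
      <div className="status-item__message">{message}</div>
    </div>
  </div>
);
```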
Description of problem:
When the installer gathers a log bundle after failure (either automatically or with gather bootstrap), the installer fails to return serial console logs if an SSH connection to the bootstrap node is refused. Even if the serial console logs were collected, the installer exits on error if ssh connection is refused: time="2024-03-09T20:59:26Z" level=info msg="Pulling VM console logs" time="2024-03-09T20:59:26Z" level=debug msg="Search for matching instances by tag in us-west-1 matching aws.Filter{\"kubernetes.io/cluster/ci-op-4ygffz3q-be93e-jnn92\":\"owned\"}" time="2024-03-09T20:59:26Z" level=debug msg="Search for matching instances by tag in us-west-1 matching aws.Filter{\"openshiftClusterID\":\"2f9d8822-46fd-4fcd-9462-90c766c3d158\"}" time="2024-03-09T20:59:27Z" level=debug msg="Attemping to download console logs for ci-op-4ygffz3q-be93e-jnn92-bootstrap" Instance=i-0413f793ffabe9339 time="2024-03-09T20:59:27Z" level=debug msg="Download complete" Instance=i-0413f793ffabe9339 time="2024-03-09T20:59:27Z" level=debug msg="Attemping to download console logs for ci-op-4ygffz3q-be93e-jnn92-master-0" Instance=i-0ab5f920818366bb8 time="2024-03-09T20:59:27Z" level=debug msg="Download complete" Instance=i-0ab5f920818366bb8 time="2024-03-09T20:59:27Z" level=debug msg="Attemping to download console logs for ci-op-4ygffz3q-be93e-jnn92-master-2" Instance=i-0b93963476818535d time="2024-03-09T20:59:27Z" level=debug msg="Download complete" Instance=i-0b93963476818535d time="2024-03-09T20:59:28Z" level=debug msg="Attemping to download console logs for ci-op-4ygffz3q-be93e-jnn92-master-1" Instance=i-0797728e092bfbeef time="2024-03-09T20:59:28Z" level=debug msg="Download complete" Instance=i-0797728e092bfbeef time="2024-03-09T20:59:28Z" level=info msg="Pulling debug logs from the bootstrap machine" time="2024-03-09T20:59:28Z" level=debug msg="Added /tmp/bootstrap-ssh3643557583 to installer's internal agent" time="2024-03-09T20:59:28Z" level=debug msg="Added /tmp/.ssh/ssh-privatekey to installer's internal agent" time="2024-03-09T21:01:39Z" level=error msg="Attempted to gather debug logs after installation failure: failed to connect to the bootstrap machine: dial tcp 13.57.212.80:22: connect: connection timed out" from: https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_api/1788/pull-ci-openshift-api-master-e2e-aws-ovn/1766560949898055680 We can see the console logs were downloaded, they should be saved in the log bundle.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. Failed install where SSH to bootstrap node fails. https://github.com/openshift/installer/pull/8137 provides a potential reproducer 2. 3.
Actual results:
Expected results:
Additional info:
Error handling needs to be reworked here: https://github.com/openshift/installer/blob/master/cmd/openshift-install/gather.go#L160-L190
Description of problem:
After upgrading the dynamic console plugin to PF5, the modal is not rendered correctly. The header and footer of the modal are not displayed.
Version-Release number of selected component (if applicable):
"@openshift-console/dynamic-plugin-sdk": "^1.1.0", "@patternfly/patternfly": "^5.2.0", "@patternfly/react-charts": "^7.2.0", "@patternfly/react-core": "^5.2.0", "@patternfly/react-icons": "^5.2.0", "@patternfly/react-table": "^5.2.0", "@patternfly/react-topology": "^5.2.0",
Steps to Reproduce:
1. Include Modal component in a PF5 dynamic console plugin 2. Render the modal component
Actual results:
The header and the footer of the modal are not displayed
Expected results:
The modal is rendered correctly
Additional info:
This issue is related to the new PF modal component (currently in beta) introduced in PF version 5.2.0. As a temporary workaround, downgrading the PF library to version 5.1.x fixes the issue.
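A sketch of the temporary workaround in a plugin's package.json, pinning the PatternFly packages to the 5.1.x line (the exact version specifiers are illustrative):

```
{
  "dependencies": {
    "@patternfly/patternfly": "~5.1.0",
    "@patternfly/react-core": "~5.1.0",
    "@patternfly/react-table": "~5.1.0"
  }
}
```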
This is a clone of issue OCPBUGS-34413. The following is the description of the original issue:
—
Description of problem:
Cluster-ingress-operator logs an update when one didn't happen.
% grep -e 'successfully updated Infra CR with Ingress Load Balancer IPs' -m 1 -- ingress-operator.log 2024-05-17T14:46:01.434Z INFO operator.ingress_controller ingress/controller.go:326 successfully updated Infra CR with Ingress Load Balancer IPs % grep -e 'successfully updated Infra CR with Ingress Load Balancer IPs' -c -- ingress-operator.log 142
https://github.com/openshift/cluster-ingress-operator/pull/1016 has a logic error, which causes the operator to log this message even when it didn't do an update:
// If the lbService exists for the "default" IngressController, then update Infra CR's PlatformStatus with the Ingress LB IPs.
if haveLB && ci.Name == manifests.DefaultIngressControllerName {
    if updated, err := computeUpdatedInfraFromService(lbService, infraConfig); err != nil {
        errs = append(errs, fmt.Errorf("failed to update Infrastructure PlatformStatus: %w", err))
    } else if updated {
        if err := r.client.Status().Update(context.TODO(), infraConfig); err != nil {
            errs = append(errs, fmt.Errorf("failed to update Infrastructure CR after updating Ingress LB IPs: %w", err))
        }
    }
    log.Info("successfully updated Infra CR with Ingress Load Balancer IPs")
}
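A minimal sketch of the likely fix (not necessarily the exact patch that was merged): move the log statement into the branch that actually performed the status update, so it is only emitted when an update was written.

```
if haveLB && ci.Name == manifests.DefaultIngressControllerName {
    if updated, err := computeUpdatedInfraFromService(lbService, infraConfig); err != nil {
        errs = append(errs, fmt.Errorf("failed to update Infrastructure PlatformStatus: %w", err))
    } else if updated {
        if err := r.client.Status().Update(context.TODO(), infraConfig); err != nil {
            errs = append(errs, fmt.Errorf("failed to update Infrastructure CR after updating Ingress LB IPs: %w", err))
        } else {
            // Only log when an update was actually written.
            log.Info("successfully updated Infra CR with Ingress Load Balancer IPs")
        }
    }
}
```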
Version-Release number of selected component (if applicable):
4.17
How reproducible:
100%
Steps to Reproduce:
1. Create a LB service for the default Ingress Operator 2. Watch ingress operator logs for the search strings mentioned above
Actual results:
Lots of these log entries will be seen even though no further updates are made to the default ingress operator: 2024-05-17T14:46:01.434Z INFO operator.ingress_controller ingress/controller.go:326 successfully updated Infra CR with Ingress Load Balancer IPs
Expected results:
Only see this log entry when an update to Infra CR is made. Perhaps just one the first time you add a LB service to the default ingress operator.
Additional info:
https://github.com/openshift/cluster-ingress-operator/pull/1016 was backported to 4.15, so it would be nice to fix it and backport the fix to 4.15. It is rather noisy, and it's trivial to fix.
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Currently importing the HyperShift API packages in other projects brings also the dependencies and conflicts of the rest of the non-API packages. This is a request to create Go submodule containing only the API packages.
Once we cut beta and the API is stable we should move it into its own repo
Description of problem:
The kube-apiserver has a container called audit-logs that keeps audit records in the container logs (it just prints them to stdout). We would like the ability to disable this container whenever the None audit policy is used on the cluster. As of today, this consumes about 1 GB of storage for each apiserver pod on the system. As you scale up, that 1 GB per master adds up. https://github.com/openshift/hypershift/issues/3764
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
NetworkAttachmentDefinition always gets created in the default namespace when using the form method from the console. It also doesn't honor the selected name and creates the NAD object with a different name (the selected name plus a random suffix).
Version-Release number of selected component (if applicable):
OCP 4.15.5
How reproducible:
From the console, under the Networking section, select NetworkAttachmentDefinitions and create a NAD using the Form method and not the YAML one.
Actual results:
The NAD gets created in the wrong namespace (always ends up in the default namespace) and with the wrong name.
Expected results:
The NAD resource gets created in the currently selected namespace with the chosen name
Description of problem:
The ValidatingAdmissionPolicy admission plugin is set in OpenShift 4.14+ kube-apiserver config, but is missing from the HyperShift config. It should be set.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
4.15: https://github.com/openshift/hypershift/blob/release-4.15/control-plane-operator/controllers/hostedcontrolplane/kas/config.go#L293-L341 4.14: https://github.com/openshift/hypershift/blob/release-4.14/control-plane-operator/controllers/hostedcontrolplane/kas/config.go#L283-L331
Expected results:
Expect to see ValidatingAdmissionPolicy
Additional info:
The customer points out that the memory max value for scale-up is interpreted in GB, while the min memory value for scale-down is interpreted in GiB.
So setting both values in GB makes scale-down fail:
Skipping ocgc4preplatgt-98fwh-worker-c-sk2sz - minimal limit exceeded for [memory]
While setting both values in GiB makes the scale-up fail.
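For reference, the limits in question live in the ClusterAutoscaler resource; a minimal sketch with illustrative values is below. Per this report, max currently behaves as GB on scale-up while min behaves as GiB on scale-down, so until that is fixed the two bounds are not interpreted consistently:

```
apiVersion: autoscaling.openshift.io/v1
kind: ClusterAutoscaler
metadata:
  name: default
spec:
  resourceLimits:
    memory:
      min: 8     # intended as GiB (used for scale-down)
      max: 256   # per this report, currently treated as GB on scale-up
```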
Description of problem:
There is an 'Unhealthy Conditions' table on the MachineHealthCheck details page; currently the first column is 'Status', but users care more about Type than Status.
Version-Release number of selected component (if applicable):
4.15.0-0.nightly-2024-01-16-113018
How reproducible:
Always
Steps to Reproduce:
1. go to any MHC details page and check 'Unhealthy Conditions' table 2. 3.
Actual results:
in the table, 'Type' is the last column
Expected results:
we should put 'Type' as the first column since this is the most important factor users care about for comparison; see the 'Conditions' table on the ClusterOperators details page, where the order is Type -> Status -> other info, which is very user friendly
Additional info:
Description of problem:
I've noticed that 'agent-cluster-install.yaml' and 'journal.export' from the agent gather process contain passwords. It's important not to expose password information in any of these generated files.
Version-Release number of selected component (if applicable):
4.15
How reproducible:
Always
Steps to Reproduce:
1. Generate an agent ISO by utilising agent-config and install-config, including platform credentials 2. Boot the ISO that was created 3. Run the agent-gather command on the node 0 machine to generate files.
Actual results:
The 'agent-cluster-install.yaml' and 'journal.export' contain password information.
Expected results:
Password should be redacted.
Additional info:
When using an autoscaling MachinePool with OpenStack, setting minReplicas=0 results in a nil pointer panic.
See HIVE-2415 for context.
As a user, I want to be able to impersonate Groups from the Groups list page kebab menu or the Group details page Actions menu dropdown so that I can more easily impersonate a Group without having to find the corresponding RoleBinding.
AC:
Please review the following PR: https://github.com/openshift/cluster-api-provider-vsphere/pull/29
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
Sometimes the prometheus-operator's informer will be stuck because it receives objects that can't be converted to *v1.PartialObjectMetadata.
Version-Release number of selected component (if applicable):
4.15
How reproducible:
Not always
Steps to Reproduce:
1. Unknown 2. 3.
Actual results:
prometheus-operator logs show errors like 2024-02-09T08:29:35.478550608Z level=warn ts=2024-02-09T08:29:35.478491797Z caller=klog.go:108 component=k8s_client_runtime func=Warningf msg="github.com/coreos/prometheus-operator/pkg/informers/informers.go:110: failed to list *v1.PartialObjectMetadata: Get \"https://172.30.0.1:443/api/v1/secrets?resourceVersion=29022\": dial tcp 172.30.0.1:443: connect: connection refused" 2024-02-09T08:29:35.478592909Z level=error ts=2024-02-09T08:29:35.478541608Z caller=klog.go:116 component=k8s_client_runtime func=ErrorDepth msg="github.com/coreos/prometheus-operator/pkg/informers/informers.go:110: Failed to watch *v1.PartialObjectMetadata: failed to list *v1.PartialObjectMetadata: Get \"https://172.30.0.1:443/api/v1/secrets?resourceVersion=29022\": dial tcp 172.30.0.1:443: connect: connection refused"
Expected results:
No error
Additional info:
The bug has been introduced in v0.70.0 by https://github.com/prometheus-operator/prometheus-operator/pull/5993 so it only affects 4.16 and 4.15.
Description of problem:
While mirroring with the following command[1], it is observed that the command fails with error[2] as shown below: ~~~ [1] oc mirror --config=imageSet-config.yaml docker://<registry_url>:<Port>/<repository> ~~~ ~~~ [2] error: error rebuilding catalog images from file-based catalogs: error regenerating the cache for <registry_url>:<Port>/<repository>/community-operator-index:v4.15: exit status 1 ~~~
Version-Release number of selected component (if applicable):
How reproducible:
100%
Steps to Reproduce:
1. Download `oc mirror` v:4.15.0 binary 2. Create ImageSet-config.yaml 3. Use the following command: ~~~ oc mirror --config=imageSet-config.yaml docker://<registry_url>:<Port>/<repository> ~~~ 4. Observe the mentioned error
Actual results:
Command failed to complete with the mentioned error.
Expected results:
ICSP and mapping.txt file should be created.
Additional info:
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
Only customers have a break-glass certificate signer.
Version-Release number of selected component (if applicable):
4.16.0
How reproducible:
Always
Steps to Reproduce:
1.create CSR with any other signer chosen 2.does not work 3.
Actual results:
does not work
Expected results:
should work
Additional info:
Description of problem:
The Clear button in the Upload JAR form is not working; the user needs to close the form in order to remove the previously selected JAR file.
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. Open Upload Jar File form from Add Page 2. Upload a JAR file 3. Remove the JAR the file by using clear button
Actual results:
The selected JAR file is not removed even after using the "Clear" button
Expected results:
The "Clear" button should remove the selected file from the form.
Additional info:
This is a clone of issue OCPBUGS-33762. The following is the description of the original issue:
—
Description of problem:
the newly available TP upgrade status command has a formatting issue when expanding update health using the --details flag: a bare plural "s:<resource>" is displayed. According to the developers, the plural "s" is supposed to be appended to group.kind, but only the plural itself is displayed instead, e.g. "Resources: s: version"
Version-Release number of selected component (if applicable):
4.16.0-0.nightly-2024-05-08-222442
How reproducible:
100%
Steps to Reproduce:
oc adm upgrade status --details=all while there is any health issue with the cluster
Actual results:
Resources: s: ip-10-0-76-83.us-east-2.compute.internal Description: Node is unavailable
Resources: s: version Description: Cluster operator control-plane-machine-set is not available
Resources: s: ip-10-0-58-8.us-east-2.compute.internal Description: failed to set annotations on node: unable to update node "&Node{ObjectMeta:{ 0 0001-01-01 00:00:00 +0000 UTC <nil> <nil> map[] map[] [] [] []},Spec:NodeSpec{PodCIDR:,DoNotUseExternalID:,ProviderID:,Unschedulable:false,Taints:[]Taint{},ConfigSource:nil,PodCIDRs:[],},Status:NodeStatus{Capacity:ResourceList{},Allocatable:ResourceList{},Phase:,Conditions:[]NodeCondition{},Addresses:[]NodeAddress{},DaemonEndpoints:NodeDaemonEndpoints{KubeletEndpoint:DaemonEndpoint{Port:0,},},NodeInfo:NodeSystemInfo{MachineID:,SystemUUID:,BootID:,KernelVersion:,OSImage:,ContainerRuntimeVersion:,KubeletVersion:,KubeProxyVersion:,OperatingSystem:,Architecture:,},Images:[]ContainerImage{},VolumesInUse:[],VolumesAttached:[]AttachedVolume{},Config:nil,},}": Patch "https://api-int.evakhoni-1514.qe.devcluster.openshift.com:6443/api/v1/nodes/ip-10-0-58-8.us-east-2.compute.internal": read tcp 10.0.58.8:48328->10.0.27.41:6443: read: connection reset by peer
Expected results:
It should show the correct <group.kind>s:<resource>.
Additional info:
OTA-1246
slack thread
Description of problem:
Trying to define multiple receivers in a single user-defined AlertmanagerConfig
Version-Release number of selected component (if applicable):
How reproducible:
always
Steps to Reproduce:
#### Monitoring for user-defined projects is enabled ``` oc -n openshift-monitoring get configmap cluster-monitoring-config -o yaml | head -4 ``` ``` apiVersion: v1 data: config.yaml: | enableUserWorkload: true ``` #### separate Alertmanager instance for user-defined alert routing is Enabled and Configured ``` oc -n openshift-user-workload-monitoring get configmap user-workload-monitoring-config -o yaml | head -6 ``` ``` apiVersion: v1 data: config.yaml: | alertmanager: enabled: true enableAlertmanagerConfig: true ``` create testing namespace oc new-project libor-alertmanager-testing ``` ## TESTING - MULTIPLE RECEIVERS IN ALERTMANAGERCONFIG Single AlertmanagerConfig `alertmanager_config_webhook_and_email_rootDefault.yaml` ``` apiVersion: monitoring.coreos.com/v1beta1 kind: AlertmanagerConfig metadata: name: libor-alertmanager-testing-email-webhook namespace: libor-alertmanager-testing spec: receivers: - name: 'libor-alertmanager-testing-webhook' webhookConfigs: - url: 'http://prometheus-msteams.internal-monitoring.svc:2000/occ-alerts' - name: 'libor-alertmanager-testing-email' emailConfigs: - to: USER@USER.CO requireTLS: false sendResolved: true - name: Default route: groupBy: - namespace receiver: Default groupInterval: 60s groupWait: 60s repeatInterval: 12h routes: - matchers: - name: severity value: critical matchType: '=' continue: true receiver: 'libor-alertmanager-testing-webhook' - matchers: - name: severity value: critical matchType: '=' receiver: 'libor-alertmanager-testing-email' ``` Once saved the continue statement is removed from the object. ``` the configuration applied to alertmanager contains continue false statements ``` oc exec -n openshift-user-workload-monitoring alertmanager-user-workload-0 -- amtool config show --alertmanager.url http://localhost:9093 ``` route: receiver: Default group_by: - namespace continue: false routes: - receiver: libor-alertmanager-testing/libor-alertmanager-testing-email-webhook/Default group_by: - namespace matchers: - namespace="libor-alertmanager-testing" continue: true routes: - receiver: libor-alertmanager-testing/libor-alertmanager-testing-email-webhook/libor-alertmanager-testing-webhook matchers: - severity="critical" continue: false <---- - receiver: libor-alertmanager-testing/libor-alertmanager-testing-email-webhook/libor-alertmanager-testing-email matchers: - severity="critical" continue: false <----- ``` If I update the statements to read `continue: true` and test here: https://prometheus.io/webtools/alerting/routing-tree-editor/ then I get the desired results workaround is to use 2 separate files - the continue statement is being added.
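A sketch of the workaround mentioned above (two separate AlertmanagerConfig resources, one per receiver), so the operator generates a separate namespace sub-route per object, which is what makes the continue behavior work. Names and matchers mirror the example above; this is illustrative, not a verified manifest:

```
apiVersion: monitoring.coreos.com/v1beta1
kind: AlertmanagerConfig
metadata:
  name: testing-webhook
  namespace: libor-alertmanager-testing
spec:
  receivers:
  - name: 'libor-alertmanager-testing-webhook'
    webhookConfigs:
    - url: 'http://prometheus-msteams.internal-monitoring.svc:2000/occ-alerts'
  route:
    receiver: 'libor-alertmanager-testing-webhook'
    matchers:
    - name: severity
      value: critical
      matchType: '='
---
apiVersion: monitoring.coreos.com/v1beta1
kind: AlertmanagerConfig
metadata:
  name: testing-email
  namespace: libor-alertmanager-testing
spec:
  receivers:
  - name: 'libor-alertmanager-testing-email'
    emailConfigs:
    - to: USER@USER.CO
      requireTLS: false
      sendResolved: true
  route:
    receiver: 'libor-alertmanager-testing-email'
    matchers:
    - name: severity
      value: critical
      matchType: '='
```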
Actual results:
Once saved the continue statement is removed from the object.
Expected results:
continue true statement is retain and applied to alertmanager
Additional info:
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Please review the following PR: https://github.com/openshift/cluster-api-provider-alibaba/pull/46
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
The etcd team has introduced an e2e test that exercises a full etcd backup and restore cycle in OCP [1]. We run those tests as part of our PR builds and since 4.15 [2] (also 4.16 [3]), we have failed runs with the catalogd-controller-manager crash looping: 1 events happened too frequently event [namespace/openshift-catalogd node/ip-10-0-25-29.us-west-2.compute.internal pod/catalogd-controller-manager-768bb57cdb-nwbhr hmsg/47b381d71b - Back-off restarting failed container manager in pod catalogd-controller-manager-768bb57cdb-nwbhr_openshift-catalogd(aa38d084-ecb7-4588-bd75-f95adb4f5636)] happened 44 times} I assume something in that controller doesn't really deal gracefully with the restoration process of etcd, or the apiserver being down for some time. [1] https://github.com/openshift/origin/blob/master/test/extended/dr/recovery.go#L97 [2] https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_cluster-etcd-operator/1205/pull-ci-openshift-cluster-etcd-operator-master-e2e-aws-etcd-recovery/1757443629380538368 [3] https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_cluster-etcd-operator/1191/pull-ci-openshift-cluster-etcd-operator-release-4.15-e2e-aws-etcd-recovery/1752293248543494144
Version-Release number of selected component (if applicable):
> 4.15
How reproducible:
always by running the test
Steps to Reproduce:
Run the test: [sig-etcd][Feature:DisasterRecovery][Suite:openshift/etcd/recovery][Timeout:2h] [Feature:EtcdRecovery][Disruptive] Recover with snapshot with two unhealthy nodes and lost quorum [Serial] and observe the event invariant failing on it crash looping
Actual results:
catalogd-controller-manager crash loops and causes our CI jobs to fail
Expected results:
our e2e job is green again and catalogd-controller-manager doesn't crash loop
Additional info:
Description of problem:
Mirror failed due to "manifest unknown" errors on certain images when using the v2 format
Version-Release number of selected component (if applicable):
oc-mirror version WARNING: This version information is deprecated and will be replaced with the output from --short. Use --output=yaml|json to get the full version. Client Version: version.Info{Major:"", Minor:"", GitVersion:"4.16.0-202403251146.p0.g03ce0ca.assembly.stream.el9-03ce0ca", GitCommit:"03ce0ca797e73b6762fd3e24100ce043199519e9", GitTreeState:"clean", BuildDate:"2024-03-25T16:34:33Z", GoVersion:"go1.21.7 (Red Hat 1.21.7-1.el9) X:strictfipsruntime", Compiler:"gc", Platform:"linux/amd64"}
How reproducible:
always
Steps to Reproduce:
1) Test full==true with following imagesetconfig: cat config-full.yaml apiVersion: mirror.openshift.io/v1alpha2 kind: ImageSetConfiguration mirror: operators: - catalog: registry.redhat.io/redhat/redhat-operator-index:v4.16 full: true `oc-mirror --config config-full.yaml file://out-full --v2`
Actual results:
Mirror command always failed with errors: 2024/04/08 02:50:52 [ERROR] : [Worker] errArray initializing source docker://registry.redhat.io/3scale-mas/zync-rhel8@sha256:8a108677b0b4100a3d58d924b2c7a47425292492df3dc6a2ebff33c58ca4e9e8: reading manifest sha256:8a108677b0b4100a3d58d924b2c7a47425292492df3dc6a2ebff33c58ca4e9e8 in registry.redhat.io/3scale-mas/zync-rhel8: manifest unknown 2024/04/08 09:12:55 [ERROR] : [Worker] errArray initializing source docker://registry.redhat.io/integration/camel-k-rhel8-operator@sha256:4796985f3efcd37b057dea0a35b526c02759f8ea63327921cdd2e504c575d3c0: reading manifest sha256:4796985f3efcd37b057dea0a35b526c02759f8ea63327921cdd2e504c575d3c0 in registry.redhat.io/integration/camel-k-rhel8-operator: manifest unknown 2024/04/08 09:12:55 [ERROR] : [Worker] errArray initializing source docker://registry.redhat.io/integration/camel-k-rhel8-operator@sha256:4796985f3efcd37b057dea0a35b526c02759f8ea63327921cdd2e504c575d3c0: reading manifest sha256:4796985f3efcd37b057dea0a35b526c02759f8ea63327921cdd2e504c575d3c0 in registry.redhat.io/integration/camel-k-rhel8-operator: manifest unknown
Expected results:
No error
The ci/prow/test.local pipeline is currently broken because the build04 cluster assigned to it in the build farm is a bit slow, making github.com/thanos-io/thanos/pkg/store go over the default 900s timeout.
panic: test timed out after 15m0s
running tests:
 TestTSDBStoreSeries (4m3s)
 TestTSDBStoreSeries/1SeriesWith10000000Samples (2m58s)
Extending it makes the test pass
ok github.com/thanos-io/thanos/pkg/store 984.344s
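For local reproduction, the timeout can be extended when invoking the package tests directly from the repository root (a sketch; the CI pipeline sets its timeout elsewhere):

```
go test -timeout 30m ./pkg/store/...
```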
We'll be addressing this alongside a follow-up issue to address this with an env var in upstream Thanos.
Please review the following PR: https://github.com/openshift/router/pull/547
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
MachineAutoscaler resources with a minimum replica size of zero will display a hyphen ("-") instead of zero on the list and detail pages.
Version-Release number of selected component (if applicable):
4.14.7
How reproducible:
always
Steps to Reproduce:
1.create a MachineAutoscaler with "min: 0" field 2.save record 3.navigate to MachineAutoscalers page under the Compute tab
Actual results:
the min replicas indicates "-"
Expected results:
min replicas indicates "0"
Additional info:
attaching a screenshot
After NM introduced the dns-change event, we create an infinite loop of on-prem-resolv-prepender.service runs. This is because our prepender script ALWAYS runs `nmcli general reload dns-rc`, whether or not changes are actually needed.
Because of this, we have the following loop:
1) NM changes DNS
2) dispatcher script appends a server to /etc/resolv.conf
3) dispatcher is invoked again on a new dns-change event
4) dispatcher checks again and creates a new /etc/resolv.conf, identical to the old one
5) NM changes DNS, a dns-change event is emitted
6) goto 3
As a fix, the prepender script should check whether the newly generated file differs from the existing /etc/resolv.conf and only apply the change (and reload) if needed.
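A minimal sketch of that check in shell, assuming the script renders the candidate file to a temporary path before installing it (paths, variable names and the generate helper are illustrative):

```
#!/bin/bash
# Render the candidate resolv.conf to a temporary file first.
NEW_RESOLV=$(mktemp)
generate_resolv_conf > "$NEW_RESOLV"   # hypothetical helper that builds the new content

# Only install the file and reload DNS when the content actually changed,
# which breaks the dispatcher -> dns-change -> dispatcher loop.
if ! cmp -s "$NEW_RESOLV" /etc/resolv.conf; then
    cp "$NEW_RESOLV" /etc/resolv.conf
    nmcli general reload dns-rc
fi
rm -f "$NEW_RESOLV"
```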
Enable French and Spanish in the OCP Console
A.C.
This is a clone of issue OCPBUGS-34618. The following is the description of the original issue:
—
Description of problem:
When performing a UPI installation, the installer fails with: time="2024-05-29T14:38:59-04:00" level=fatal msg="failed to fetch Cluster API Machine Manifests: failed to generate asset \"Cluster API Machine Manifests\": unable to generate CAPI machines for vSphere unable to get network inventory path: unable to find network ci-vlan-896 in resource pool /cidatacenter/host/cicluster/Resources/ci-op-yrhjini6-9ef4a" If I pre-create the resource pool(s), the installation proceeds.
Version-Release number of selected component (if applicable):
4.16 nightly
How reproducible:
consistently
Steps to Reproduce:
1. Follow documentation to perform a UPI installation 2. Installation will fail during manifest creation 3.
Actual results:
Installation fails
Expected results:
Installation should proceed
Additional info:
https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_release/51894/rehearse-51894-periodic-ci-openshift-release-master-nightly-4.16-e2e-vsphere-upi-zones/1795883271666536448
Please review the following PR: https://github.com/openshift/azure-file-csi-driver/pull/47
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
4.17.0-0.nightly-2024-05-16-195932 and 4.16.0-0.nightly-2024-05-17-031643 both have resource quota issues like
failed to create iam: LimitExceeded: Cannot exceed quota for OpenIdConnectProvidersPerAccount: 100 status code: 409, request id: f69bf82c-9617-408a-b281-92c1ef0ec974
failed to create infra: failed to create VPC: VpcLimitExceeded: The maximum number of VPCs has been reached. status code: 400, request id: f90dcc5b-7e66-4a14-aa22-cec9f602fa8e
Seth has indicated he is working to clean things up in https://redhat-internal.slack.com/archives/C01CQA76KMX/p1715913603117349?thread_ts=1715557887.529169&cid=C01CQA76KMX
Please review the following PR: https://github.com/openshift/origin/pull/28452
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
[sig-network][endpoints] admission [apigroup:config.openshift.io] [It] blocks manual creation of EndpointSlices pointing to the cluster or service network [Suite:openshift/conformance/parallel] github.com/openshift/origin/test/extended/networking/endpoint_admission.go:81 [FAILED] error getting endpoint controller service account
Description of problem:
When the IPI installer creates a service instance for the user, PowerVS will now report the type as composite_instance rather than service_instance. Fix up cluster destroy to account for this change.
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. Create cluster 2. Destroy cluster 3.
Actual results:
The newly created service instance is not deleted.
Expected results:
Additional info:
Description of problem:
After build02 is upgraded to 4.16.0-ec.4 from 4.16.0-ec.3, the CSRs are not auto-approved. As a result, provisioned machines cannot become nodes of the cluster.
Version-Release number of selected component (if applicable):
oc --context build02 get clusterversion version NAME VERSION AVAILABLE PROGRESSING SINCE STATUS version 4.16.0-ec.4 True False 4h28m
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Michael McCune feels the group "system:serviceaccounts" was missing in the CSR.
https://redhat-internal.slack.com/archives/CBZHF4DHC/p1710875084740869?thread_ts=1710861842.471739&cid=CBZHF4DHC
An inspection of the namespace openshift-cluster-machine-approver:
https://redhat-internal.slack.com/archives/CBZHF4DHC/p1710863462860809?thread_ts=1710861842.471739&cid=CBZHF4DHC
A workaround to approve the CSRs manually on b02:
https://github.com/openshift/release/pull/50016
Component Readiness has found a potential regression in [Unknown][invariant] alert/KubePodNotReady should not be at or above info in all the other namespaces.
Probability of significant regression: 100.00%
Sample (being evaluated) Release: 4.16
Start Time: 2024-04-18T00:00:00Z
End Time: 2024-04-24T23:59:59Z
Success Rate: 93.14%
Successes: 95
Failures: 7
Flakes: 0
Base (historical) Release: 4.15
Start Time: 2024-02-01T00:00:00Z
End Time: 2024-02-28T23:59:59Z
Success Rate: 100.00%
Successes: 482
Failures: 0
Flakes: 0
Description of problem:
When executing oc mirror using an oci path, you can end up in an error state when the destination is a file://<path> destination (i.e., mirror to disk).
Version-Release number of selected component (if applicable):
4.14.2
How reproducible:
always
Steps to Reproduce:
At IBM we use the ibm-pak tool to generate a OCI catalog, but this bug is reproducible using a simple skopeo copy. Once you've copied the image locally you can move it around using file system copy commands to test this in different ways. 1. Make a directory structure like this to simulate how ibm-pak creates its own catalogs. The problem seems to be related to the path you use, so this represents the failure case: mkdir -p /root/.ibm-pak/data/publish/latest/catalog-oci/manifest-list 2. make a location where the local storage will live: mkdir -p /root/.ibm-pak/oc-mirror-storage 3. Next, copy the image locally using skopeo: skopeo copy docker://icr.io/cpopen/ibm-zcon-zosconnect-catalog@sha256:8d28189637b53feb648baa6d7e3dd71935656a41fd8673292163dd750ef91eec oci:///root/.ibm-pak/data/publish/latest/catalog-oci/manifest-list --all --format v2s2 4. You can copy the OCI catalog content to a location where things will work properly so you can see a working example: cp -r /root/.ibm-pak/data/publish/latest/catalog-oci/manifest-list /root/ibm-zcon-zosconnect-catalog 5. You'll need an ISC... I've included both the oci references in the example (the commented out one works, but the oci:///root/.ibm-pak/data/publish/latest/catalog-oci/manifest-list reference fails). kind: ImageSetConfiguration apiVersion: mirror.openshift.io/v1alpha2 mirror: operators: - catalog: oci:///root/.ibm-pak/data/publish/latest/catalog-oci/manifest-list #- catalog: oci:///root/ibm-zcon-zosconnect-catalog packages: - name: ibm-zcon-zosconnect channels: - name: v1.0 full: true targetTag: 27ba8e targetCatalog: ibm-catalog storageConfig: local: path: /root/.ibm-pak/oc-mirror-storage 6. run oc mirror (remember the ISC has oci refs for good and bad scenarios). You may want to change your working directory to different locations between running the good/bad examples. oc mirror --config /root/.ibm-pak/data/publish/latest/image-set-config.yaml "file://zcon --dest-skip-tls --max-per-registry=6
Actual results:
Logging to .oc-mirror.log Found: zcon/oc-mirror-workspace/src/publish Found: zcon/oc-mirror-workspace/src/v2 Found: zcon/oc-mirror-workspace/src/charts Found: zcon/oc-mirror-workspace/src/release-signatures error: ".ibm-pak/data/publish/latest/catalog-oci/manifest-list/kubebuilder/kube-rbac-proxy@sha256:db06cc4c084dd0253134f156dddaaf53ef1c3fb3cc809e5d81711baa4029ea4c" is not a valid image reference: invalid reference format
Expected results:
Simple example where things were working with the oci:///root/ibm-zcon-zosconnect-catalog reference (this was executed in the same workspace so no new images were detected). Logging to .oc-mirror.log Found: zcon/oc-mirror-workspace/src/publish Found: zcon/oc-mirror-workspace/src/v2 Found: zcon/oc-mirror-workspace/src/charts Found: zcon/oc-mirror-workspace/src/release-signatures 3 related images processed in 668.063974ms Writing image mapping to zcon/oc-mirror-workspace/operators.1700092336/manifests-ibm-zcon-zosconnect-catalog/mapping.txt No new images detected, process stopping
Additional info:
I debugged the error that happened and captured one of the instances where the ParseReference call fails. This is only for reference to help narrow down the issue. github.com/openshift/oc/pkg/cli/image/imagesource.ParseReference (/root/go/src/openshift/oc-mirror/vendor/github.com/openshift/oc/pkg/cli/image/imagesource/reference.go:111) github.com/openshift/oc-mirror/pkg/image.ParseReference (/root/go/src/openshift/oc-mirror/pkg/image/image.go:79) github.com/openshift/oc-mirror/pkg/cli/mirror.(*MirrorOptions).addRelatedImageToMapping (/root/go/src/openshift/oc-mirror/pkg/cli/mirror/fbc_operators.go:194) github.com/openshift/oc-mirror/pkg/cli/mirror.(*OperatorOptions).plan.func3 (/root/go/src/openshift/oc-mirror/pkg/cli/mirror/operator.go:575) golang.org/x/sync/errgroup.(*Group).Go.func1 (/root/go/src/openshift/oc-mirror/vendor/golang.org/x/sync/errgroup/errgroup.go:75) runtime.goexit (/usr/local/go/src/runtime/asm_amd64.s:1594) Also, I wanted to point out that because we use a period in the path (i.e. .ibm-pak) I wonder if that's causing the issue? This is just a guess and something to consider. *FOLLOWUP* ... I just removed the period from ".ibm-pak" and that seemed to make the error go away.
Please review the following PR: https://github.com/openshift/csi-external-snapshotter/pull/127
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
This is a clone of issue OCPBUGS-30950. The following is the description of the original issue:
—
Description of problem: ovnkube-node and multus DaemonSets have hostPath volumes which prevent clean unmount of CSI volumes because of a missing "mountPropagation: HostToContainer" parameter in the volumeMount
Version-Release number of selected component (if applicable): OpenShift 4.14
How reproducible: Always
Steps to Reproduce:
1. on a node mount a file system underneath /var/lib/kubelet/ simulating the mount of a CSI driver PersistentVolume
2. restart the ovnkube-node pod running on that node
3. unmount the filesystem from 1. The mount will then be removed from the host list of mounted devices however a copy of the mount is still active in the mount namespace of the ovnkube-node pod.
This is blocking some CSI drivers relying on multipath to properly delete a block device, since mounts are still registered on the block device.
Actual results:
CSI Volume Mount uncleanly unmounted (a copy of the mount remains active in the pod's mount namespace).
Expected results:
CSI Volume Mount cleanly unmounted.
Additional info:
The mountPropagation parameter is already implememted in the volumeMount for the host rootFS:
- name: host-slash
readOnly: true
mountPath: /host
mountPropagation: HostToContainer
However the same parameter is missing for the volumeMount of /var/lib/kubelet
It is possible to workaround the issue with a kubectl patch command like this:
$ kubectl patch daemonset ovnkube-node --type='json' -p='[
{
"op": "replace",
"path": "/spec/template/spec/containers/7/volumeMounts/1",
"value": {
"name": "host-kubelet",
"mountPath": "/var/lib/kubelet",
"mountPropagation": "HostToContainer",
"readOnly": true
}
}
]'
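For reference, the desired end state mirrors the existing host-slash mount shown above; a sketch of how the /var/lib/kubelet volumeMount would look in the DaemonSet spec (the volume name and container index may differ):

```
- name: host-kubelet
  mountPath: /var/lib/kubelet
  mountPropagation: HostToContainer
  readOnly: true
```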
Affected Platforms: Platform Agnostic UPI
This is a clone of issue OCPBUGS-42812. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-42514. The following is the description of the original issue:
—
Description of problem:
When configuring the OpenShift image registry to use a custom Azure storage account in a different resource group, following the official documentation [1], the image-registry CO degrades and the upgrade from version 4.14.x to 4.15.x fails. The image registry operator reports misconfiguration errors related to Azure storage credentials, preventing the upgrade and causing instability in the control plane.
[1] Configuring registry storage in Azure user infrastructure
Version-Release number of selected component (if applicable):
4.14.33, 4.15.33
How reproducible:
Steps to Reproduce:
We got the error
NodeCADaemonProgressing: The daemon set node-ca is deployed Progressing: Unable to apply resources: unable to sync storage configuration: client misconfigured, missing 'TenantID', 'ClientID', 'ClientSecret', 'FederatedTokenFile', 'Creds', 'SubscriptionID' option(s)
The operator will also generate a new secret image-registry-private-configuration with the same content as image-registry-private-configuration-user
$ oc get secret image-registry-private-configuration -o yaml apiVersion: v1 data: REGISTRY_STORAGE_AZURE_ACCOUNTKEY: xxxxxxxxxxxxxxxxx kind: Secret metadata: annotations: imageregistry.operator.openshift.io/checksum: sha256:524fab8dd71302f1a9ade9b152b3f9576edb2b670752e1bae1cb49b4de992eee creationTimestamp: "2024-09-26T19:52:17Z" name: image-registry-private-configuration namespace: openshift-image-registry resourceVersion: "126426" uid: e2064353-2511-4666-bd43-29dd020573fe type: Opaque
2. then we delete the secret image-registry-private-configuration-user
now the secret image-registry-private-configuration will still exist with the same content, but the image-registry CO reports a new error
NodeCADaemonProgressing: The daemon set node-ca is deployed Progressing: Unable to apply resources: unable to sync storage configuration: failed to get keys for the storage account arojudesa: storage.AccountsClient#ListKeys: Failure responding to request: StatusCode=404 -- Original Error: autorest/azure: Service returned an error. Status=404 Code="ResourceNotFound" Message="The Resource 'Microsoft.Storage/storageAccounts/arojudesa' under resource group 'aro-ufjvmbl1' was not found. For more details please go to https://aka.ms/ARMResourceNotFoundFix"
3. Apply the workaround by manually changing the installer-cloud-credentials secret's azure_resourcegroup key to the custom storage account's resource group
$ oc get secret installer-cloud-credentials -o yaml apiVersion: v1 data: azure_client_id: xxxxxxxxxxxxxxxxx azure_client_secret: xxxxxxxxxxxxxxxxx azure_region: xxxxxxxxxxxxxxxxx azure_resource_prefix: xxxxxxxxxxxxxxxxx azure_resourcegroup: xxxxxxxxxxxxxxxxx <<<<<-----THIS azure_subscription_id: xxxxxxxxxxxxxxxxx azure_tenant_id: xxxxxxxxxxxxxxxxx kind: Secret metadata: annotations: cloudcredential.openshift.io/credentials-request: openshift-cloud-credential-operator/openshift-image-registry-azure creationTimestamp: "2024-09-26T16:49:57Z" labels: cloudcredential.openshift.io/credentials-request: "true" name: installer-cloud-credentials namespace: openshift-image-registry resourceVersion: "133921" uid: d1268e2c-1825-49f0-aa44-d0e1cbcda383 type: Opaque
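A sketch of applying that workaround with oc (the resource group name is illustrative; the value must be base64-encoded because it lives in a Secret):

```
CUSTOM_RG="custom-storage-rg"   # hypothetical resource group of the custom storage account
oc -n openshift-image-registry patch secret installer-cloud-credentials \
  --type merge \
  -p "{\"data\":{\"azure_resourcegroup\":\"$(echo -n "$CUSTOM_RG" | base64 -w0)\"}}"
```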
The image-registry CO then reports healthy, and this allows the upgrade to continue.
Actual results:
The image registry still seems to use the service principal method for Azure storage account authentication.
Expected results:
We expect REGISTRY_STORAGE_AZURE_ACCOUNTKEY to be the only thing the image registry operator needs for storage account authentication if the customer provides it.
Additional info:
Slack : https://redhat-internal.slack.com/archives/CCV9YF9PD/p1727379313014789
With 4.15, the resource watch completes whenever it is interrupted. But for 4.16 jobs, it does not complete until the 1h grace period kicks in and the job is terminated by ci-operator. This means:
This was discovered when investigating this slack thread: https://redhat-internal.slack.com/archives/C01CQA76KMX/p1705578623724259
This update is compatible with recent kube-openapi changes so that we can drop our replace k8s.io/kube-openapi => k8s.io/kube-openapi v0.0.0-20230928195430-ce36a0c3bb67 introduced in 53b387f4f54c8426526478afd0fd3e2b4e7aec66.
The cluster-dns-operator repository vendors k8s.io/* v0.28.3 and controller-runtime v0.16.3. OpenShift 4.16 is based on Kubernetes 1.29.
4.16.
Always.
Check https://github.com/openshift/cluster-dns-operator/blob/release-4.16/go.mod.
The k8s.io/* packages are at v0.28.3, and the sigs.k8s.io/controller-runtime package is at v0.16.3.
The k8s.io/* packages are at v0.29.0 or newer, and the sigs.k8s.io/controller-runtime package is at v0.17.0 or newer.
The controller-runtime v0.17 release includes some breaking changes, such as the removal of apiutil.NewDiscoveryRESTMapper; see the release notes at https://github.com/kubernetes-sigs/controller-runtime/releases/tag/v0.17.0. The k8s.io/api v0.29 release drops flowcontrol/v1alpha1, which means we also need to bump openshift/api in order to get https://github.com/openshift/api/pull/1647. https://github.com/openshift/cluster-dns-operator/pull/394 will include a openshift/api bump that includes the removal of flowcontrol/v1alpha1 from openshift/api, so better to merge #394 first, and then bump k8s.io/api and controller-runtime after that.
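A sketch of the dependency bump using the usual Go module workflow (exact versions may differ once openshift/api is bumped first, as noted above):

```
go get k8s.io/api@v0.29.0 k8s.io/apimachinery@v0.29.0 k8s.io/client-go@v0.29.0
go get sigs.k8s.io/controller-runtime@v0.17.0
go mod tidy && go mod vendor
```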
Description of problem:
For high scalability, we need an option to disable unused machine management control plane components.
Version-Release number of selected component (if applicable):
How reproducible:
100%
Steps to Reproduce:
1. Create HostedCluster/HostedControlPlane 2. 3.
Actual results:
Machine management components (cluster-api, machine-approver, auto-scaler, etc) are deployed
Expected results:
Should have option to disable as some use cases they provide no utility.
Additional info:
Please review the following PR: https://github.com/openshift/azure-workload-identity/pull/16
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
[Azuredisk-csi-driver] allocatable volumes count incorrect in csinode for Standard_B4as_v2 instance types
Version-Release number of selected component (if applicable):
4.14.0-0.nightly-2023-09-02-132842
How reproducible:
Always
Steps to Reproduce:
1. Install Azure OpenShift cluster use the Standard_B4as_v2 instance type 2. Check the csinode object allocatable volumes count 3. Create a pod with the max allocatable volumes count pvcs(provision by azuredisk-csi-driver)
Actual results:
In step 2 the allocatable volumes count is 16. $ oc get csinode pewang-0908s-r6lwd-worker-southcentralus3-tvwwr -ojsonpath='{.spec.drivers[?(@.name=="disk.csi.azure.com")].allocatable.count}' 16 In step 3 the pod stuck at containerCreating that caused by attach volume failed of 09-07 22:38:28.758 "message": "The maximum number of data disks allowed to be attached to a VM of this size is 8.",\r
Expected results:
In step 2 the allocatable volumes count should be 8. In step 3 the pod should be Running well and all volumes could be read and written data
Additional info:
$ az vm list-skus -l eastus --query "[?name=='Standard_B4as_v2']"| jq -r '.[0].capabilities[] | select(.name =="MaxDataDiskCount")' { "name": "MaxDataDiskCount", "value": "8" } Currently in 4.14 we use the v1.28.1 driver, I checked the upstream issues and PRs, the issue fixed in v1.28.2 https://github.com/kubernetes-sigs/azuredisk-csi-driver/releases/tag/v1.28.2
This is a clone of issue OCPBUGS-35494. The following is the description of the original issue:
—
Description of problem:
ROSA Cluster creation goes into error status sometimes with version 4.16.0-0.nightly-2024-06-14-072943
Version-Release number of selected component (if applicable):
How reproducible:
60%
Steps to Reproduce:
1. Prepare VPC 2. Create a rosa sts cluster cluster with subnets 3. Wait for cluster ready
Actual results:
Cluster goes into error status
Expected results:
Cluster get ready
Additional info:
The failure happens by CI job triggering Here are the Jobs:
This is a clone of issue OCPBUGS-34533. The following is the description of the original issue:
—
Description of problem:
I have a customer who reported that they are not able to edit the "Until" option from the Developer perspective.
Version-Release number of selected component (if applicable):
OCP v4.15.11
Screenshot
https://redhat-internal.slack.com/archives/C04BSV48DJS/p1716889816419439
The API documentation for the status.componentRoutes.currentHostnames field in the ingress config API has developer notes from the Go definition.
OpenShift 4.11 and all subsequent versions of OpenShift so far.
100%.
1. Read the documentation for the API field: oc explain ingresses.status.componentRoutes.currentHostnames --api-version=config.openshift.io/v1
The ingresses.config.openshift.io CRD has developer notes in the description of the status.componentRoutes.currentHostnames field:
% oc explain ingresses.status.componentRoutes.currentHostnames --api-version=config.openshift.io/v1 KIND: Ingress VERSION: config.openshift.io/v1 FIELD: currentHostnames <[]string> DESCRIPTION: currentHostnames is the list of current names used by the route. Typically, this list should consist of a single hostname, but if multiple hostnames are supported by the route the operator may write multiple entries to this list. Hostname is an alias for hostname string validation. The left operand of the | is the original kubebuilder hostname validation format, which is incorrect because it allows upper case letters, disallows hyphen or number in the TLD, and allows labels to start/end in non-alphanumeric characters. See https://bugzilla.redhat.com/show_bug.cgi?id=2039256. ^([a-zA-Z0-9\p{S}\p{L}]((-?[a-zA-Z0-9\p{S}\p{L}]{0,62})?)|([a-zA-Z0-9\p{S}\p{L}](([a-zA-Z0-9-\p{S}\p{L}]{0,61}[a-zA-Z0-9\p{S}\p{L}])?)(\.)){1,}([a-zA-Z\p{L}]){2,63})$ The right operand of the | is a new pattern that mimics the current API route admission validation on hostname, except that it allows hostnames longer than the maximum length: ^(([a-z0-9][-a-z0-9]{0,61}[a-z0-9]|[a-z0-9]{1,63})[\.]){0,}([a-z0-9][-a-z0-9]{0,61}[a-z0-9]|[a-z0-9]{1,63})$ Both operand patterns are made available so that modifications on ingress spec can still happen after an invalid hostname was saved via validation by the incorrect left operand of the | operator.
The second paragraph should be omitted from the CRD:
% oc explain ingresses.status.componentRoutes.currentHostnames --api-version=config.openshift.io/v1 KIND: Ingress VERSION: config.openshift.io/v1 FIELD: currentHostnames <[]string> DESCRIPTION: currentHostnames is the list of current names used by the route. Typically, this list should consist of a single hostname, but if multiple hostnames are supported by the route the operator may write multiple entries to this list.
The API field was introduced in OpenShift 4.8: https://github.com/openshift/api/pull/852/commits/c53c57f3d465f28b27ee4fad48763f049228486e
The developer note was added in OpenShift 4.11: https://github.com/openshift/api/pull/1120/commits/1fec415423985530a8925a5fd8c87e1741d8c2fb
Description of problem:
'kubeadmin' user unable to log out when logged in with the 'kube:admin' IDP; clicking on 'Log out' does nothing
Version-Release number of selected component (if applicable):
4.16.0-0.nightly-2024-04-06-020637
How reproducible:
Always
Steps to Reproduce:
1. Login to console with 'kube:admin' IDP, type username 'kubeadmin' and its password 2. Try to Log out from console
Actual results:
2. unable to log out successfully
Expected results:
2. any user should be able to log out successfully
Additional info:
A runbook for the TargetDown alert would be useful for OpenShift users.
This runbook should explain:
1. how to identify which targets are down (a query sketch follows this list)
2. how to investigate the reason why the target goes offline
3. resolution of common causes bringing down the target
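As a starting point for item 1, a minimal hedged sketch for listing down targets from the CLI; the prometheus-k8s route name and query endpoint are assumptions (on newer clusters the thanos-querier route may be the right one):
# Query the in-cluster Prometheus for scrape targets that are currently down (up == 0).
TOKEN=$(oc whoami -t)
PROM_HOST=$(oc -n openshift-monitoring get route prometheus-k8s -o jsonpath='{.spec.host}')
curl -sk -H "Authorization: Bearer ${TOKEN}" "https://${PROM_HOST}/api/v1/query" --data-urlencode 'query=up == 0'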
Related cases:
This is a clone of issue OCPBUGS-43788. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-18007. The following is the description of the original issue:
—
Description of problem:
When the TelemeterClientFailures alert fires, there's no runbook link explaining the meaning of the alert and what to do about it.
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. Check the TelemeterClientFailures alerting rule's annotations 2. 3.
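A hedged way to perform step 1 from the CLI (no particular PrometheusRule object name is assumed; all rules in openshift-monitoring are grepped):
# Print the TelemeterClientFailures rule and its surrounding lines; a runbook_url annotation should show up here if present.
oc -n openshift-monitoring get prometheusrules -o yaml | grep -A 15 'alert: TelemeterClientFailures'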
Actual results:
No runbook_url annotation.
Expected results:
runbook_url annotation is present.
Additional info:
This is a consequence of a telemeter server outage that triggered questions from customers about the alert: https://issues.redhat.com/browse/OHSS-25947 https://issues.redhat.com/browse/OCPBUGS-17966 Also in relation to https://issues.redhat.com/browse/OCPBUGS-17797
Description of problem:
OpenShift Console shows "Info alert: Non-printable file detected. File contains non-printable characters. Preview is not available." while editing XML-file-based configmaps.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. Create configmap from file: # oc create cm test-cm --from-file=server.xml=server.xml configmap/test-cm created 2. If we try to edit the configmap in the OCP console we see the following error: Info alert:Non-printable file detected. File contains non-printable characters. Preview is not available.
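A hedged way to double-check whether the stored value really contains non-printable characters; the jsonpath key escaping and grep pattern are illustrative, and GNU grep with -P support is assumed:
# Flag any bytes outside printable ASCII plus tab/CR/LF in the ConfigMap value.
oc get cm test-cm -o jsonpath='{.data.server\.xml}' | grep -nP '[^\x20-\x7E\t\r\n]' || echo "no non-printable characters found"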
Actual results:
Expected results:
Additional info:
Please review the following PR: https://github.com/openshift/cluster-baremetal-operator/pull/393
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
Port 22 is added to the worker node security group in TF install [1]: resource "aws_security_group_rule" "worker_ingress_ssh" { type = "ingress" security_group_id = aws_security_group.worker.id description = local.description protocol = "tcp" cidr_blocks = var.cidr_blocks from_port = 22 to_port = 22 } But it's missing in SDK install [2] [1] https://github.com/openshift/installer/blob/master/data/data/aws/cluster/vpc/sg-worker.tf#L39-L48 [2] https://github.com/openshift/installer/pull/7676/files#diff-c89a0152f7d51be6e3830081d1c166d9333628982773c154d8fc9a071c8ff765R272
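As a hedged workaround sketch (not the installer fix itself), the missing rule could be added manually with the AWS CLI; the security group ID and CIDR are placeholders:
# Allow SSH (tcp/22) from the machine network into the worker security group.
aws ec2 authorize-security-group-ingress \
  --group-id <worker-security-group-id> \
  --protocol tcp --port 22 \
  --cidr <machine-network-cidr>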
Version-Release number of selected component (if applicable):
4.16.0-0.nightly-2024-03-31-180021
How reproducible:
Always
Steps to Reproduce:
1. Create a cluster using SDK installation method 2. 3.
Actual results:
See description.
Expected results:
Port 22 is added to worker node's security group.
Additional info:
Description of problem:
CPMS is supported in 4.15 on the vSphere platform when TechPreviewNoUpgrade is enabled, but after building the cluster with no failure domains (or a single failure domain) set in install-config, there were three duplicated failure domains.
Version-Release number of selected component (if applicable):
4.15.0-0.nightly-2023-12-11-033133
How reproducible:
Install a cluster with TP enabled and don't set a failure domain (or set a single failure domain) in install-config.
Steps to Reproduce:
1. Do not configure a failure domain in install-config (or set a single failure domain). 2. Install a cluster with TP enabled 3. Check the CPMS with the command: oc get controlplanemachineset -oyaml
Actual results:
duplicated failure domains.
failureDomains:
  platform: VSphere
  vsphere:
  - name: generated-failure-domain
  - name: generated-failure-domain
  - name: generated-failure-domain
metadata:
  labels:
Expected results:
The failure domain should not be duplicated when setting a single failure domain in install-config. The failure domain should not exist when not setting any failure domain in install-config.
Additional info:
Description of problem:
If GloballyDisableIrqLoadBalancing is disabled in the performance profile, then IRQs should be balanced across all CPUs minus the CPUs that are explicitly removed by CRI-O via the pod annotation irq-load-balancing.crio.io: "disable". There's an issue where the scheduler plugin in tuned will attempt to affine all IRQs to the non-isolated cores. Isolated here means non-reserved, not truly isolated cores. This is directly at odds with the user intent. So now we have tuned fighting with crio/irqbalance, both trying to do different things. Scenarios:
- If a pod gets launched with the annotation after tuned has started, at runtime or after a reboot - ok
- On a reboot, if tuned recovers after the guaranteed pod has been launched - broken
- If tuned restarts at runtime for any reason - broken
A sketch of a pod using the annotation follows below.
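A minimal sketch of a pod using the annotation, assuming a guaranteed-QoS pod scheduled onto a node covered by the performance profile; the pod name, image and the performance-example-profile runtime class are illustrative assumptions:
# Hedged example: a guaranteed pod asking CRI-O to remove its CPUs from IRQ balancing.
cat <<'EOF' | oc apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: irq-isolated-app                          # illustrative name
  annotations:
    irq-load-balancing.crio.io: "disable"
spec:
  runtimeClassName: performance-example-profile   # assumption: runtime class created for the performance profile
  containers:
  - name: app
    image: registry.example.com/app:latest        # placeholder image
    resources:
      requests:
        cpu: "2"
        memory: 1Gi
      limits:
        cpu: "2"
        memory: 1Gi
EOF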
Version-Release number of selected component (if applicable):
4.14 and likely earlier
How reproducible:
See description
Steps to Reproduce:
1.See description 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem: OCP doesn't resume from "hibernation" (shutdown/restart of cloud instances).
NB: This is not related to certs.
Version-Release number of selected component (if applicable): 4.16 nightlies, at least 4.16.0-0.nightly-2024-05-14-095225 through 4.16.0-0.nightly-2024-05-21-043355
How reproducible: 100%
Steps to Reproduce:
1. Install 4.16 nightly on AWS. (Other platforms may be affected, don't know.)
2. Shut down all instances. (I've done this via hive hibernation; Vadim Rutkovsky has done it via cloud console.)
3. Start instances. (Ditto.)
Actual results: OCP doesn't start. Per Vadim:
"kubelet says host IP unknown; known addresses: [] so etcd can't start."
Expected results: OCP starts normally.
Additional info: We originally thought this was related to OCPBUGS-30860, but reproduced with nightlies containing the updated AMIs.
In https://issues.redhat.com/browse/OCPBUGS-24195 Lukasz is working on a solution to a problem both the auth and apiserver operators have where a large number of identical kube events can be emitted. The kube apiserver was granted an exception here, but the linked bug was never fixed.
These OpenShiftAPICheckFailed events are reportedly originating during bootstrap, and if bootstrap takes too long many can be emitted, which can trip a test that watches for this sort of thing.
Ideally the problem should be fixed and it sounds like Lukasz is on the path to one which we hope could be used for the apiserver operator as well. (start a controller monitoring the aggregated API only after the bootstrap is complete)
Fix here would hopefully be to leverage what comes out of OCPBUGS-24195, apply it for the apiserver operator, and then remove the exception linked above in origin.
This is a clone of issue OCPBUGS-23922. The following is the description of the original issue:
—
Description of problem:
In https://issues.redhat.com//browse/STOR-1453: TLSSecurityProfile feature, storage clustercsidriver.spec.observedConfig will get the value from APIServer.spec.tlsSecurityProfile to set cipherSuites and minTLSVersion in all corresponding csi driver, but it doesn't work well in hypershift cluster when only setting different value in the hostedclusters.spec.configuration.apiServer.tlsSecurityProfile in management cluster, the APIServer.spec in hosted cluster is not synced and CSI driver doesn't get the updated value as well.
Version-Release number of selected component (if applicable):
Pre-merge test with openshift/csi-operator#69,openshift/csi-operator#71
How reproducible:
Always
Steps to Reproduce:
1. Have a hypershift cluster, the clustercsidriver get the default value like "minTLSVersion": "VersionTLS12" $ oc get clustercsidriver ebs.csi.aws.com -ojson | jq .spec.observedConfig.targetcsiconfig.servingInfo { "cipherSuites": [ "TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256", "TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256", "TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384", "TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384", "TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305_SHA256", "TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305_SHA256" ], "minTLSVersion": "VersionTLS12" } 2. set the tlsSecurityProfile in hostedclusters.spec.configuration.apiServer in mgmtcluster, like the "minTLSVersion": "VersionTLS11": $ oc -n clusters get hostedclusters hypershift-ci-14206 -o json | jq .spec.configuration { "apiServer": { "audit": { "profile": "Default" }, "tlsSecurityProfile": { "custom": { "ciphers": [ "ECDHE-ECDSA-CHACHA20-POLY1305", "ECDHE-RSA-CHACHA20-POLY1305", "ECDHE-RSA-AES128-GCM-SHA256", "ECDHE-ECDSA-AES128-GCM-SHA256" ], "minTLSVersion": "VersionTLS11" }, "type": "Custom" } } } 3. This doesn't pass to apiserver in hosted cluster oc get apiserver cluster -ojson | jq .spec { "audit": { "profile": "Default" } } 4. CSI Driver still use the default value which is different from mgmtcluster.hostedclusters.spec.configuration.apiServer $ oc get clustercsidriver ebs.csi.aws.com -ojson | jq .spec.observedConfig.targetcsiconfig.servingInfo { "cipherSuites": [ "TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256", "TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256", "TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384", "TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384", "TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305_SHA256", "TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305_SHA256" ], "minTLSVersion": "VersionTLS12" }
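For step 2, a hedged sketch of how the profile might be set on the HostedCluster from the management cluster; the cluster name matches the output above and the TLS values are only illustrative:
# Merge-patch the HostedCluster with a custom tlsSecurityProfile.
oc -n clusters patch hostedcluster hypershift-ci-14206 --type merge -p '{"spec":{"configuration":{"apiServer":{"tlsSecurityProfile":{"type":"Custom","custom":{"minTLSVersion":"VersionTLS11","ciphers":["ECDHE-ECDSA-CHACHA20-POLY1305","ECDHE-RSA-CHACHA20-POLY1305","ECDHE-RSA-AES128-GCM-SHA256","ECDHE-ECDSA-AES128-GCM-SHA256"]}}}}}}'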
Actual results:
The tlsSecurityProfile doesn't get synced
Expected results:
The tlsSecurityProfile should get synced
Additional info:
As a developer I want to remove the NoUpgrade annotation from the CAPI IPAM CRDs so that I can promote them to General Availability
The SPLAT team is planning to have the CAPI IPAM CRDs promoted to GA because they need them in a component they are promoting to GA.
Description of problem:
To operate HyperShift at high scale, we need an option to disable dedicated request serving isolation, if not used.
Version-Release number of selected component (if applicable):
4.16, 4.15, 4.14, 4.13
How reproducible:
100%
Steps to Reproduce:
1. Install hypershift operator for versions 4.16, 4.15, 4.14, or 4.13 2. Observe start-up logs 3. Dedicated request serving isolation controllers are started
Actual results:
Dedicated request serving isolation controllers are started
Expected results:
Dedicated request serving isolation controllers to not start, if unneeded
Additional info:
This is a clone of issue OCPBUGS-25758. The following is the description of the original issue:
—
Description of problem:
router pod is in CrashLoopBackup after y-stream upgrade from 4.13->4.14
Version-Release number of selected component (if applicable):
How reproducible:
always
Steps to Reproduce:
1. create a cluster with 4.13 2. upgrade HC to 4.14 3.
Actual results:
router pod in CrashLoopBackoff
Expected results:
router pod is running after upgrade HC from 4.13->4.14
Additional info:
images: ====== HO image: 4.15 upgrade HC from 4.13.0-0.nightly-2023-12-19-114348 to 4.14.0-0.nightly-2023-12-19-120138 router pod log: ============== jiezhao-mac:hypershift jiezhao$ oc get pods router-9cfd8b89-plvtc -n clusters-jie-test NAME READY STATUS RESTARTS AGE router-9cfd8b89-plvtc 0/1 CrashLoopBackOff 11 (45s ago) 32m jiezhao-mac:hypershift jiezhao$ Events: Type Reason Age From Message ---- ------ ---- ---- ------- Normal Scheduled 27m default-scheduler Successfully assigned clusters-jie-test/router-9cfd8b89-plvtc to ip-10-0-42-36.us-east-2.compute.internal Normal AddedInterface 27m multus Add eth0 [10.129.2.82/23] from ovn-kubernetes Normal Pulling 27m kubelet Pulling image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:3d2acba15f69ea3648b3c789111db34ff06d9230a4371c5949ebe3c6218e6ea3" Normal Pulled 27m kubelet Successfully pulled image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:3d2acba15f69ea3648b3c789111db34ff06d9230a4371c5949ebe3c6218e6ea3" in 14.309s (14.309s including waiting) Normal Created 26m (x3 over 27m) kubelet Created container private-router Normal Started 26m (x3 over 27m) kubelet Started container private-router Warning BackOff 26m (x5 over 27m) kubelet Back-off restarting failed container private-router in pod router-9cfd8b89-plvtc_clusters-jie-test(e6cf40ad-32cd-438c-8298-62d565cf6c6a) Normal Pulled 26m (x3 over 27m) kubelet Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:3d2acba15f69ea3648b3c789111db34ff06d9230a4371c5949ebe3c6218e6ea3" already present on machine Warning FailedToRetrieveImagePullSecret 2m38s (x131 over 27m) kubelet Unable to retrieve some image pull secrets (router-dockercfg-q768b); attempting to pull the image may not succeed. jiezhao-mac:hypershift jiezhao$ jiezhao-mac:hypershift jiezhao$ oc logs router-9cfd8b89-plvtc -n clusters-jie-test [NOTICE] (1) : haproxy version is 2.6.13-234aa6d [NOTICE] (1) : path to executable is /usr/sbin/haproxy [ALERT] (1) : config : [/usr/local/etc/haproxy/haproxy.cfg:52] : 'server ovnkube_sbdb/ovnkube_sbdb' : could not resolve address 'None'. [ALERT] (1) : config : Failed to initialize server(s) addr. jiezhao-mac:hypershift jiezhao$ notes: ===== not sure if it has the same root cause as https://issues.redhat.com/browse/OCPBUGS-24627
Please review the following PR: https://github.com/openshift/ibm-powervs-block-csi-driver-operator/pull/66
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
To bump some dependencies for CVE fixes, we added `replace` directives in the go.mod file. These dependencies have since moved way past the pinned version. We should drop the replaces before we run into problems from having deps pinned to versions that are too old. For example, I've seen PRs with the following diff: # golang.org/x/net v0.23.0 => golang.org/x/net v0.5.0 which is not really what we want.
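A hedged sketch of how a stale pin could be dropped; golang.org/x/net is just the example module from the diff above, and the remaining replace directives would need the same review:
# Drop the stale replace directive and re-resolve (and re-vendor) dependencies.
go mod edit -dropreplace=golang.org/x/net
go mod tidy
go mod vendor   # only if the repository vendors its dependencies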
Version-Release number of selected component (if applicable):
4.16
How reproducible:
always
Steps to Reproduce:
1. 2. 3.
Actual results:
Some dependencies are not upgraded because they are pinned.
Expected results:
Additional info:
When we apply a machine config with additional SSH key info, this action only needs to uncordon the node. While the uncordon is happening, the condition is Cordoned = True, which will confuse the user. Maybe we can refine this design to show the status of cordon and uncordon separately.
lastTransitionTime: '2023-11-28T16:53:58Z'
message: 'Action during previous iteration: (Un)Cordoned node. The node is reporting Unschedulable = false'
reason: UpdateCompleteCordoned
status: 'False'
type: Cordoned
Description of problem:
Cluster install failed on ibm cloud and machine-api-controllers stucks in CrashLoopBackOff
Version-Release number of selected component (if applicable):
from 4.16.0-0.nightly-2024-02-02-224339
How reproducible:
Always
Steps to Reproduce:
1. Install cluster on IBMCloud 2. 3.
Actual results:
Cluster install failed
$ oc get node
NAME STATUS ROLES AGE VERSION
maxu-16-gp2vp-master-0 Ready control-plane,master 7h11m v1.29.1+2f773e8
maxu-16-gp2vp-master-1 Ready control-plane,master 7h11m v1.29.1+2f773e8
maxu-16-gp2vp-master-2 Ready control-plane,master 7h11m v1.29.1+2f773e8
$ oc get machine -n openshift-machine-api
NAME PHASE TYPE REGION ZONE AGE
maxu-16-gp2vp-master-0 7h15m
maxu-16-gp2vp-master-1 7h15m
maxu-16-gp2vp-master-2 7h15m
maxu-16-gp2vp-worker-1-xfvqq 7h5m
maxu-16-gp2vp-worker-2-5hn7c 7h5m
maxu-16-gp2vp-worker-3-z74z2 7h5m
openshift-machine-api machine-api-controllers-6cb7fcdcdb-k6sv2 6/7 CrashLoopBackOff 92 (31s ago) 7h1m
$ oc logs -n openshift-machine-api -c machine-controller machine-api-controllers-6cb7fcdcdb-k6sv2
I0204 10:53:34.336338 1 main.go:120] Watching machine-api objects only in namespace "openshift-machine-api" for reconciliation.
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x285fe72]
goroutine 25 [running]:
k8s.io/klog/v2/textlogger.(*tlogger).Enabled(0x0?, 0x0?)
  /go/src/github.com/openshift/machine-api-provider-ibmcloud/vendor/k8s.io/klog/v2/textlogger/textlogger.go:81 +0x12
sigs.k8s.io/controller-runtime/pkg/log.(*delegatingLogSink).Enabled(0xc000438100, 0x0?)
  /go/src/github.com/openshift/machine-api-provider-ibmcloud/vendor/sigs.k8s.io/controller-runtime/pkg/log/deleg.go:114 +0x92
github.com/go-logr/logr.Logger.Info({{0x3232210?, 0xc000438100?}, 0x0?}, {0x2ec78f3, 0x17}, {0x0, 0x0, 0x0})
  /go/src/github.com/openshift/machine-api-provider-ibmcloud/vendor/github.com/go-logr/logr/logr.go:276 +0x72
sigs.k8s.io/controller-runtime/pkg/metrics/server.(*defaultServer).Start(0xc0003bd2c0, {0x322e350?, 0xc00058a140})
  /go/src/github.com/openshift/machine-api-provider-ibmcloud/vendor/sigs.k8s.io/controller-runtime/pkg/metrics/server/server.go:185 +0x75
sigs.k8s.io/controller-runtime/pkg/manager.(*runnableGroup).reconcile.func1(0xc0002c4540)
  /go/src/github.com/openshift/machine-api-provider-ibmcloud/vendor/sigs.k8s.io/controller-runtime/pkg/manager/runnable_group.go:223 +0xc8
created by sigs.k8s.io/controller-runtime/pkg/manager.(*runnableGroup).reconcile in goroutine 24
  /go/src/github.com/openshift/machine-api-provider-ibmcloud/vendor/sigs.k8s.io/controller-runtime/pkg/manager/runnable_group.go:207 +0x19d
Expected results:
Cluster install succeed
Additional info:
may be related to this PR: https://github.com/openshift/machine-api-provider-ibmcloud/pull/34
Our OCP Dockerfile currently uses the build target. However, this now also builds the React frontend and in turn requires npm. We already build the frontend during mirroring.
Switch our Dockerfile to use the common-build target. This should enable the bump to 0.27, tracked through this issue as well.
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
This is a clone of issue OCPBUGS-38412. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-32773. The following is the description of the original issue:
—
Description of problem:
In the OpenShift WebConsole, when using the Instantiate Template screen, the values entered into the form are automatically cleared. This issue occurs for users with developer roles who do not have administrator privileges, but does not occur for users with the cluster-admin cluster role. Additionally, using the developer tools of the web browser, I observed the following console logs when the values were cleared: https://console-openshift-console.apps.mmatsuta-blue.apac.aws.cee.support/api/prometheus/api/v1/rules 403 (Forbidden) https://console-openshift-console.apps.mmatsuta-blue.apac.aws.cee.support/api/alertmanager/api/v2/silences 403 (Forbidden) It appears that a script attempting to fetch information periodically from PrometheusRule and Alertmanager's silences encounters a 403 error due to insufficient permissions, which causes the script to halt and the values in the form to be reset and cleared. This bug prevents users from successfully creating instances from templates in the WebConsole.
Version-Release number of selected component (if applicable):
4.15 4.14
How reproducible:
YES
Steps to Reproduce:
1. Log in with a non-administrator account. 2. Select a template from the developer catalog and click on Instantiate Template. 3. Enter values into the initially empty form. 4. Wait for several seconds, and the entered values will disappear.
Actual results:
Entered values disappear
Expected results:
Entered values remain
Additional info:
I could not find the appropriate component to report this issue. I reluctantly chose Dev Console, but please adjust it to the correct component.
Description of problem:
CAPI manifests have the TechPreviewNoUpgrade annotation but are missing the CustomNoUpgrade annotation
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
Results of -hypershift-aws-e2e-external CI jobs do not contain an obvious reason why a test failed. For example, this TestCreateCluster is listed as failed, but all failures in TestCreateCluster look like errors dumping the cluster after the failure.
It should show that "storage operator did not become Available=true", or even tell that "pod cluster-storage-operator-6f6d69bf89-fx2d2 in the hosted control plane XYZ is in CrashloopBackoff".
The PR under test had a simple typo leading to a crashloop, and it should be more obvious what went wrong.
Version-Release number of selected component (if applicable):
4.15.0-0.ci.test-2023-10-03-040803
Description of the problem:
In PSI, BE master ~ 2.30 - a massive amount of the following message appears in cluster events: "Cluster was updated with api-vip <IP ADDRESS>, ingress-vip <IP ADDRESS>".
This message repeats itself every minute, times 5 (I guess related to the number of hosts?).
The installation was started but was aborted due to network connection issues.
I've tried to reproduce in staging, but couldn't.
How reproducible:
Still checking
Steps to reproduce:
1.
2.
3.
Actual results:
Expected results:
This message should not be shown more than once
Description of problem:
A recent [PR](https://github.com/openshift/hypershift/commit/c030ab66d897815e16d15c987456deab8d0d6da0) updated the kube-apiserver service port to `6443`. That change causes a small outage when upgrading from a 4.13 cluster in IBMCloud. We need to keep the service port as 2040 for IBM Cloud Provider to avoid the outage.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
We've moved to using BigQuery for this; stop pushing to Loki and free up some Loki cycles.
This is done with a command in origin: https://github.com/openshift/origin/blob/24e011ba3adf2767b88619351895bb878de3d62a/pkg/cmd/openshift-tests/dev/dev.go#L211
So all this code and probably some libraries could be removed.
But first, remove the invocation of this command in the release repo. (upload-intervals)
Description of problem:
The pod of a CatalogSource without registryPoll wasn't recreated during node failure.
jiazha-mac:~ jiazha$ oc get pods
NAME                                    READY   STATUS        RESTARTS       AGE
certified-operators-rcs64              1/1     Running       0              123m
community-operators-8mxh6              1/1     Running       0              123m
marketplace-operator-769fbb9898-czsfn  1/1     Running       4 (117m ago)   136m
qe-app-registry-5jxlx                  1/1     Running       0              106m
redhat-marketplace-4bgv9               1/1     Running       0              123m
redhat-operators-ww5tb                 1/1     Running       0              123m
test-2xvt8                              1/1     Terminating   0              12m
jiazha-mac:~ jiazha$ oc get pods test-2xvt8 -o wide
NAME         READY   STATUS    RESTARTS   AGE    IP            NODE                                          NOMINATED NODE   READINESS GATES
test-2xvt8   1/1     Running   0          7m6s   10.129.2.26   qe-daily-417-0708-cv2p6-worker-westus-gcrrc   <none>           <none>
jiazha-mac:~ jiazha$ oc get node qe-daily-417-0708-cv2p6-worker-westus-gcrrc
NAME                                          STATUS     ROLES    AGE    VERSION
qe-daily-417-0708-cv2p6-worker-westus-gcrrc   NotReady   worker   116m   v1.30.2+421e90e
Version-Release number of selected component (if applicable):
Cluster version is 4.17.0-0.nightly-2024-07-07-131215
How reproducible:
always
Steps to Reproduce:
1. create a catalogsource without the registryPoll configure. jiazha-mac:~ jiazha$ cat cs-32183.yaml apiVersion: operators.coreos.com/v1alpha1 kind: CatalogSource metadata: name: test namespace: openshift-marketplace spec: displayName: Test Operators image: registry.redhat.io/redhat/redhat-operator-index:v4.16 publisher: OpenShift QE sourceType: grpc jiazha-mac:~ jiazha$ oc create -f cs-32183.yaml catalogsource.operators.coreos.com/test created jiazha-mac:~ jiazha$ oc get pods test-2xvt8 -o wide NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES test-2xvt8 1/1 Running 0 3m18s 10.129.2.26 qe-daily-417-0708-cv2p6-worker-westus-gcrrc <none> <none> 2. Stop the node jiazha-mac:~ jiazha$ oc debug node/qe-daily-417-0708-cv2p6-worker-westus-gcrrc Temporary namespace openshift-debug-q4d5k is created for debugging node... Starting pod/qe-daily-417-0708-cv2p6-worker-westus-gcrrc-debug-v665f ... To use host binaries, run `chroot /host` Pod IP: 10.0.128.5 If you don't see a command prompt, try pressing enter. sh-5.1# chroot /host sh-5.1# systemctl stop kubelet; sleep 600; systemctl start kubelet Removing debug pod ... Temporary namespace openshift-debug-q4d5k was removed. jiazha-mac:~ jiazha$ oc get node qe-daily-417-0708-cv2p6-worker-westus-gcrrc NAME STATUS ROLES AGE VERSION qe-daily-417-0708-cv2p6-worker-westus-gcrrc NotReady worker 115m v1.30.2+421e90e 3. check it this catalogsource's pod recreated.
Actual results:
No new pod was generated.
jiazha-mac:~ jiazha$ oc get pods
NAME                                    READY   STATUS        RESTARTS       AGE
certified-operators-rcs64              1/1     Running       0              123m
community-operators-8mxh6              1/1     Running       0              123m
marketplace-operator-769fbb9898-czsfn  1/1     Running       4 (117m ago)   136m
qe-app-registry-5jxlx                  1/1     Running       0              106m
redhat-marketplace-4bgv9               1/1     Running       0              123m
redhat-operators-ww5tb                 1/1     Running       0              123m
test-2xvt8                              1/1     Terminating   0              12m
once node recovery, a new pod was generated.
jiazha-mac:~ jiazha$ oc get node qe-daily-417-0708-cv2p6-worker-westus-gcrrc
NAME STATUS ROLES AGE VERSION
qe-daily-417-0708-cv2p6-worker-westus-gcrrc Ready worker 127m v1.30.2+421e90e
jiazha-mac:~ jiazha$ oc get pods
NAME READY STATUS RESTARTS AGE
certified-operators-rcs64 1/1 Running 0 127m
community-operators-8mxh6 1/1 Running 0 127m
marketplace-operator-769fbb9898-czsfn 1/1 Running 4 (121m ago) 140m
qe-app-registry-5jxlx 1/1 Running 0 109m
redhat-marketplace-4bgv9 1/1 Running 0 127m
redhat-operators-ww5tb 1/1 Running 0 127m
test-wqxvg 1/1 Running 0 27s
Expected results:
During the node failure, a new catalog source pod should be generated.
Additional info:
Hi Team,
After some more investigation of the operator-lifecycle-manager source code, we figured out the reason.
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: redhat-operator-index
  namespace: openshift-marketplace
spec:
  image: quay-server.bastion.tokyo.com:5000/redhat/redhat-operator-index-logging:logging-vstable-5.8-v5.8.5
  sourceType: grpc
And we verified that the catalog pod can be recreated on another node if we add the registryPoll configuration to the CatalogSource as follows (the lines marked with <==).
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: redhat-operator-index
  namespace: openshift-marketplace
spec:
  image: quay-server.bastion.tokyo.com:5000/redhat/redhat-operator-index-logging:logging-vstable-5.8-v5.8.5
  sourceType: grpc
  updateStrategy:      <==
    registryPoll:      <==
      interval: 10m    <==
The registryPoll setting is NOT mandatory for a CatalogSource.
So the commit [1] that tries to fix the issue in EnsureRegistryServer() is not the proper fix.
[1] https://github.com/operator-framework/operator-lifecycle-manager/pull/3201/files
[2] https://github.com/joelanford/operator-lifecycle-manager/blob/82f499723e52e85f28653af0610b6e7feff096cf/pkg/controller/registry/reconciler/grpc.go#L290
[3] https://github.com/operator-framework/operator-lifecycle-manager/blob/master/pkg/controller/operators/catalog/operator.go#L1009
[4] https://docs.openshift.com/container-platform/4.16/operators/admin/olm-managing-custom-catalogs.html
These tests look to have been mistakenly deleted in 4.14 during the big monitortest refactor. We could use them right now to identify gcp jobs with spatter disruption.
[sig-network] there should be nearly zero single second disruptions for _
[sig-network] there should be reasonably few single second disruptions for _
Find out what happened and get them restored. Code is there but it looks like there are assumptions about extracting the backend name that may have been broken somewhere.
Please review the following PR: https://github.com/openshift/csi-external-snapshotter/pull/135
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
This is a clone of issue OCPBUGS-34631. The following is the description of the original issue:
—
Description of problem:
The managedFields section in the YAML editor is not collapsed by default, which is incorrect. Since OCP 4.7, the field should be collapsed by default.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. Navigate to Workloads > Pods and click on any pod to display its details page. 2. Click on the YAML tab and scroll down to see the managedFields 3.
Actual results:
The field is not collapsed
Expected results:
The field should be collapsed by default
Additional info:
This is a clone of issue OCPBUGS-35547. The following is the description of the original issue:
—
Description of problem:
When creating IPI cluster, following unexpected traceback appears in terminal occasionally, it won't cause any failure and install succeed finally. # ./openshift-install create cluster --dir cluster --log-level debug ... INFO Importing OVA sgao-nest-ktqck-rhcos-generated-region-generated-zone into failure domain generated-failure-domain. [controller-runtime] log.SetLogger(...) was never called; logs will not be displayed. Detected at: > goroutine 131 [running]: > runtime/debug.Stack() > /usr/lib/golang/src/runtime/debug/stack.go:24 +0x5e > sigs.k8s.io/controller-runtime/pkg/log.eventuallyFulfillRoot() > /go/src/github.com/openshift/installer/vendor/sigs.k8s.io/controller-runtime/pkg/log/log.go:60 +0xcd > sigs.k8s.io/controller-runtime/pkg/log.(*delegatingLogSink).Error(0xc000e37200, {0x26fd23c0, 0xc0016b4270}, {0x77d22d3, 0x3d}, {0x0, 0x0, 0x0}) > /go/src/github.com/openshift/installer/vendor/sigs.k8s.io/controller-runtime/pkg/log/deleg.go:139 +0x5d > github.com/go-logr/logr.Logger.Error({{0x270398d8?, 0xc000e37200?}, 0x0?}, {0x26fd23c0, 0xc0016b4270}, {0x77d22d3, 0x3d}, {0x0, 0x0, 0x0}) > /go/src/github.com/openshift/installer/vendor/github.com/go-logr/logr/logr.go:301 +0xda > sigs.k8s.io/cluster-api-provider-vsphere/pkg/session.newClient.func1({0x26fd6c40?, 0xc0021f0160?}) > /go/src/github.com/openshift/installer/vendor/sigs.k8s.io/cluster-api-provider-vsphere/pkg/session/session.go:265 +0xda > sigs.k8s.io/cluster-api-provider-vsphere/pkg/session.newClient.KeepAliveHandler.func2() > /go/src/github.com/openshift/installer/vendor/github.com/vmware/govmomi/session/keep_alive.go:36 +0x22 > github.com/vmware/govmomi/session/keepalive.(*handler).Start.func1() > /go/src/github.com/openshift/installer/vendor/github.com/vmware/govmomi/session/keepalive/handler.go:124 +0x98 > created by github.com/vmware/govmomi/session/keepalive.(*handler).Start in goroutine 1 > /go/src/github.com/openshift/installer/vendor/github.com/vmware/govmomi/session/keepalive/handler.go:116 +0x116
Version-Release number of selected component (if applicable):
4.16.0-0.nightly-2024-06-13-213831
How reproducible:
sometimes
Steps to Reproduce:
1. Create IPI cluster on vSphere multiple times 2, Check output in terminal
Actual results:
unexpected log traceback appears in terminal
Expected results:
unexpected log traceback should not appear in terminal
Additional info:
Please review the following PR: https://github.com/openshift/k8s-prometheus-adapter/pull/98
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Work to setup the endpoint will be handled in another card.
For this one we want to setup a new disruption backend similar to the cluster-network-liveness-probe. We'll poll, submit request ids and possibly job identifiers.
This approach us gets us free disruption intervals in bigquery, charting per job, and graphing capabilities from the disruption dashboard.
The Route API documentation states that the default value for the spec.tls.insecureEdgeTerminationPolicy field is "Allow". However, the observable default behavior is that of "None".
OpenShift 3.11 and earlier and OpenShift 4.1 through 4.16.
100%.
1. Check the documentation: oc explain routes.spec.tls.insecureEdgeTerminationPolicy
2. Create an example application and edge-terminated route without specifying insecureEdgeTerminationPolicy, and try to connect to the route using HTTP:
oc adm new-project hello-openshift
oc -n hello-openshift create -f https://raw.githubusercontent.com/openshift/origin/56867df5e362aab0d2d8fa8c225e6761c7469781/examples/hello-openshift/hello-pod.json
oc -n hello-openshift expose pod hello-openshift
oc -n hello-openshift create route edge --service=hello-openshift
curl -k https://hello-openshift-hello-openshift.apps.<cluster domain>
curl -I http://hello-openshift-hello-openshift.apps.<cluster domain>
The documentation states that "Allow" is the default:
% oc explain routes.spec.tls.insecureEdgeTerminationPolicy KIND: Route VERSION: route.openshift.io/v1 FIELD: insecureEdgeTerminationPolicy <string> DESCRIPTION: insecureEdgeTerminationPolicy indicates the desired behavior for insecure connections to a route. While each router may make its own decisions on which ports to expose, this is normally port 80. * Allow - traffic is sent to the server on the insecure port (edge/reencrypt terminations only) (default). * None - no traffic is allowed on the insecure port. * Redirect - clients are redirected to the secure port.
However, in practice, the default seems to be "None":
% oc adm new-project hello-openshift Created project hello-openshift % oc -n hello-openshift create -f https://raw.githubusercontent.com/openshift/origin/56867df5e362aab0d2d8fa8c225e6761c7469781/examples/hello-openshift/hello-pod.json pod/hello-openshift created % oc -n hello-openshift expose pod hello-openshift service/hello-openshift exposed % oc -n hello-openshift create route edge --service=hello-openshift route.route.openshift.io/hello-openshift created % oc -n hello-openshift get routes/hello-openshift -o yaml apiVersion: route.openshift.io/v1 kind: Route metadata: annotations: openshift.io/host.generated: "true" creationTimestamp: "2024-04-02T22:59:32Z" labels: name: hello-openshift name: hello-openshift namespace: hello-openshift resourceVersion: "27147" uid: 50029f66-a089-4ec0-be04-91f176883e2b spec: host: hello-openshift-hello-openshift.apps.8fbd3fa1605eb7f8632a.hypershift.aws-2.ci.openshift.org tls: termination: edge to: kind: Service name: hello-openshift weight: 100 wildcardPolicy: None status: ingress: - conditions: - lastTransitionTime: "2024-04-02T22:59:32Z" status: "True" type: Admitted host: hello-openshift-hello-openshift.apps.8fbd3fa1605eb7f8632a.hypershift.aws-2.ci.openshift.org routerCanonicalHostname: router-default.apps.8fbd3fa1605eb7f8632a.hypershift.aws-2.ci.openshift.org routerName: default wildcardPolicy: None - conditions: - lastTransitionTime: "2024-04-02T22:59:32Z" status: "True" type: Admitted host: hello-openshift-hello-openshift.apps.8fbd3fa1605eb7f8632a.hypershift.aws-2.ci.openshift.org routerCanonicalHostname: router-custom.custom.8fbd3fa1605eb7f8632a.hypershift.aws-2.ci.openshift.org routerName: custom wildcardPolicy: None % curl -k https://hello-openshift-hello-openshift.apps.8fbd3fa1605eb7f8632a.hypershift.aws-2.ci.openshift.org Hello OpenShift! % curl -I http://hello-openshift-hello-openshift.apps.8fbd3fa1605eb7f8632a.hypershift.aws-2.ci.openshift.org HTTP/1.0 503 Service Unavailable pragma: no-cache cache-control: private, max-age=0, no-cache, no-store content-type: text/html
Given the API documentation, I would maybe expect to see insecureEdgeTerminationPolicy: Allow in the route definition, and I would definitely expect the curl http:// command to succeed.
Alternatively, I would expect the API documentation to state that the default for insecureEdgeTerminationPolicy is "None", based on the observed behavior.
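For comparison, a hedged sketch of getting the documented "Allow" behavior today by setting the policy explicitly on the example route:
# Explicitly allow insecure (HTTP) traffic on the edge route; the plain-HTTP curl above should then succeed.
oc -n hello-openshift patch route hello-openshift --type=merge -p '{"spec":{"tls":{"insecureEdgeTerminationPolicy":"Allow"}}}'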
The current "(default)" text was added in https://github.com/openshift/origin/pull/10983/commits/dc1aecd4bcdae7525536180bab2a0a0083aaa0f4.
This is a clone of issue OCPBUGS-38692. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-38114. The following is the description of the original issue:
—
Description of problem:
Starting from version 4.16, the installer no longer supports creating a cluster in AWS with the OPENSHIFT_INSTALL_AWS_PUBLIC_ONLY=true flag enabled.
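For reference, a hedged sketch of the invocation this refers to; the install directory is a placeholder:
# Public-subnets-only install attempt, which reportedly fails starting with 4.16.
export OPENSHIFT_INSTALL_AWS_PUBLIC_ONLY=true
openshift-install create cluster --dir <install-dir> --log-level debug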
Version-Release number of selected component (if applicable):
How reproducible:
The installation procedure fails systematically when using a predefined VPC.
Steps to Reproduce:
1. Follow the procedure at https://docs.openshift.com/container-platform/4.16/installing/installing_aws/ipi/installing-aws-vpc.html#installation-aws-config-yaml_installing-aws-vpc to prepare an install-config.yaml in order to install a cluster with a custom VPC 2. Run `openshift-install create cluster ...' 3. The procedure fails: `failed to create load balancer`
Actual results:
The installation procedure fails.
Expected results:
An OCP cluster to be provisioned in AWS, with public subnets only.
Additional info:
Description of problem:
The customer has a custom apiserver certificate.
This error can be found while trying to uninstall any operator by console:
openshift-console/pods/console-56494b7977-d7r76/console/console/logs/current.log:
2023-10-24T14:13:21.797447921+07:00 E1024 07:13:21.797400 1 operands_handler.go:67] Failed to get new client for listing operands: Get "https://api.<cluster>.<domain>:6443/api?timeout=32s": x509: certificate signed by unknown authority
When trying the same request from the console pod, we see no issue.
We see the root CA that signs the apiserver certificate, and this CA is trusted in the pod.
It seems the code that provokes this issue is:
https://github.com/openshift/console/blob/master/pkg/server/operands_handler.go#L62-L70
Please review the following PR: https://github.com/openshift/ovn-kubernetes/pull/1978
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
Reviewing 4.15 Install failures (install should succeed: overall) there are a number of variants impacted by recent install failures.
search.ci: Cluster operator console is not available
Jobs like periodic-ci-openshift-release-master-nightly-4.15-e2e-gcp-sdn-serial show failures that appear to start with 4.15.0-0.nightly-2023-12-07-225558 and have installation failures due to console-operator:
ConsoleOperator reconciliation failed: Operation cannot be fulfilled on consoles.operator.openshift.io "cluster": the object has been modified; please apply your changes to the latest version and try again
4.15.0-0.nightly-2023-12-07-225558 contains console-operator/pull/814, noting in case it is related
Version-Release number of selected component (if applicable):
4.15
How reproducible:
Steps to Reproduce:
1. Review link to install failures above 2. 3.
Actual results:
Expected results:
Additional info:
periodic-ci-openshift-release-master-ci-4.15-e2e-gcp-sdn
periodic-ci-openshift-release-master-nightly-4.15-e2e-aws-sdn-upgrade
periodic-ci-openshift-release-master-ci-4.15-e2e-gcp-ovn-upgrade
Description of problem:
The port security setting has been overridden, although it was set to false in the worker MachineSet configuration.
Version-Release number of selected component (if applicable):
OCP=4.14.14 RHOSP=17.1
How reproducible:
NFV Perf lab ShiftonStack Deployment mode = IPI
Steps to Reproduce:
1.Network configuration resources for Worker node $ oc get machinesets.machine.openshift.io -n openshift-machine-api | grep worker 5kqfbl3y0rhocpnfv-wj2jj-worker-0 1 1 1 1 5d23h $ oc describe machinesets.machine.openshift.io -n openshift-machine-api 5kqfbl3y0rhocpnfv-wj2jj-worker-0 Name: 5kqfbl3y0rhocpnfv-wj2jj-worker-0 Namespace: openshift-machine-api Labels: machine.openshift.io/cluster-api-cluster=5kqfbl3y0rhocpnfv-wj2jj machine.openshift.io/cluster-api-machine-role=worker machine.openshift.io/cluster-api-machine-type=worker Annotations: machine.openshift.io/memoryMb: 47104 machine.openshift.io/vCPU: 26 API Version: machine.openshift.io/v1beta1 Kind: MachineSet Metadata: Creation Timestamp: 2024-03-07T05:24:07Z Generation: 3 Resource Version: 226098 UID: 8cb06872-9b62-4c2c-b66b-bf91a03efa2d Spec: Replicas: 1 Selector: Match Labels: machine.openshift.io/cluster-api-cluster: 5kqfbl3y0rhocpnfv-wj2jj machine.openshift.io/cluster-api-machineset: 5kqfbl3y0rhocpnfv-wj2jj-worker-0 Template: Metadata: Labels: machine.openshift.io/cluster-api-cluster: 5kqfbl3y0rhocpnfv-wj2jj machine.openshift.io/cluster-api-machine-role: worker machine.openshift.io/cluster-api-machine-type: worker machine.openshift.io/cluster-api-machineset: 5kqfbl3y0rhocpnfv-wj2jj-worker-0 Spec: Lifecycle Hooks: Metadata: Provider Spec: Value: API Version: machine.openshift.io/v1alpha1 Availability Zone: worker Cloud Name: openstack Clouds Secret: Name: openstack-cloud-credentials Namespace: openshift-machine-api Config Drive: true Flavor: sos-worker Image: 5kqfbl3y0rhocpnfv-wj2jj-rhcos Kind: OpenstackProviderSpec Metadata: Networks: Filter: Subnets: Filter: Id: 7fb7d2d6-325d-49e1-b3f8-b4dbb1197e34 Ports: Fixed I Ps: Subnet ID: 1a892dcf-bf93-46ef-bf37-bda6cf923471 Name Suffix: provider3p1 Network ID: 50a557b5-34c2-4c47-b539-963688f7167c Port Security: false Tags: sriov Trunk: false Vnic Type: direct Fixed I Ps: Subnet ID: 76430b9e-302f-428d-916a-77482d9cfb19 Name Suffix: provider4p1 Network ID: e2106b16-8f83-4e2e-bdbd-20e2c12ec279 Port Security: false Tags: sriov Trunk: false Vnic Type: direct Fixed I Ps: Subnet ID: 1a892dcf-bf93-46ef-bf37-bda6cf923471 Name Suffix: provider3p2 Network ID: 50a557b5-34c2-4c47-b539-963688f7167c Port Security: false Tags: sriov Trunk: false Vnic Type: direct Fixed I Ps: Subnet ID: 76430b9e-302f-428d-916a-77482d9cfb19 Name Suffix: provider4p2 Network ID: e2106b16-8f83-4e2e-bdbd-20e2c12ec279 Port Security: false Tags: sriov Trunk: false Vnic Type: direct Fixed I Ps: Subnet ID: 1a892dcf-bf93-46ef-bf37-bda6cf923471 Name Suffix: provider3p3 Network ID: 50a557b5-34c2-4c47-b539-963688f7167c Port Security: false Tags: sriov Trunk: false Vnic Type: direct Fixed I Ps: Subnet ID: 76430b9e-302f-428d-916a-77482d9cfb19 Name Suffix: provider4p3 Network ID: e2106b16-8f83-4e2e-bdbd-20e2c12ec279 Port Security: false Tags: sriov Trunk: false Vnic Type: direct Fixed I Ps: Subnet ID: 1a892dcf-bf93-46ef-bf37-bda6cf923471 Name Suffix: provider3p4 Network ID: 50a557b5-34c2-4c47-b539-963688f7167c Port Security: false Tags: sriov Trunk: false Vnic Type: direct Fixed I Ps: Subnet ID: 76430b9e-302f-428d-916a-77482d9cfb19 Name Suffix: provider4p4 Network ID: e2106b16-8f83-4e2e-bdbd-20e2c12ec279 Port Security: false Tags: sriov Trunk: false Vnic Type: direct Primary Subnet: 7fb7d2d6-325d-49e1-b3f8-b4dbb1197e34 Security Groups: Filter: Name: 5kqfbl3y0rhocpnfv-wj2jj-worker Server Group Name: 5kqfbl3y0rhocpnfv-wj2jj-worker-worker Server Metadata: Name: 5kqfbl3y0rhocpnfv-wj2jj-worker Openshift Cluster ID: 5kqfbl3y0rhocpnfv-wj2jj 
Tags: openshiftClusterID=5kqfbl3y0rhocpnfv-wj2jj Trunk: true User Data Secret: Name: worker-user-data Status: Available Replicas: 1 Fully Labeled Replicas: 1 Observed Generation: 3 Ready Replicas: 1 Replicas: 1 Events: <none> $ oc get nodes NAME STATUS ROLES AGE VERSION 5kqfbl3y0rhocpnfv-wj2jj-master-0 Ready control-plane,master 5d23h v1.27.10+28ed2d7 5kqfbl3y0rhocpnfv-wj2jj-master-1 Ready control-plane,master 5d23h v1.27.10+28ed2d7 5kqfbl3y0rhocpnfv-wj2jj-master-2 Ready control-plane,master 5d23h v1.27.10+28ed2d7 5kqfbl3y0rhocpnfv-wj2jj-worker-0-n6crr Ready worker 5d22h v1.27.10+28ed2d7 $ oc describe nodes 5kqfbl3y0rhocpnfv-wj2jj-worker-0-n6crr Name: 5kqfbl3y0rhocpnfv-wj2jj-worker-0-n6crr Roles: worker Labels: beta.kubernetes.io/arch=amd64 beta.kubernetes.io/instance-type=sos-worker beta.kubernetes.io/os=linux failure-domain.beta.kubernetes.io/region=regionOne failure-domain.beta.kubernetes.io/zone=worker feature.node.kubernetes.io/network-sriov.capable=true kubernetes.io/arch=amd64 kubernetes.io/hostname=5kqfbl3y0rhocpnfv-wj2jj-worker-0-n6crr kubernetes.io/os=linux node-role.kubernetes.io/worker= node.kubernetes.io/instance-type=sos-worker node.openshift.io/os_id=rhcos topology.cinder.csi.openstack.org/zone=worker topology.kubernetes.io/region=regionOne topology.kubernetes.io/zone=worker Annotations: alpha.kubernetes.io/provided-node-ip: 192.168.0.91 csi.volume.kubernetes.io/nodeid: {"cinder.csi.openstack.org":"aa5cfdcb-eb46-46d8-8ac2-5bb6f0c0d879"} machine.openshift.io/machine: openshift-machine-api/5kqfbl3y0rhocpnfv-wj2jj-worker-0-n6crr machineconfiguration.openshift.io/controlPlaneTopology: HighlyAvailable machineconfiguration.openshift.io/currentConfig: rendered-worker-8c613531f97974a9561f8b0ada0c2cd0 machineconfiguration.openshift.io/desiredConfig: rendered-worker-8c613531f97974a9561f8b0ada0c2cd0 machineconfiguration.openshift.io/desiredDrain: uncordon-rendered-worker-8c613531f97974a9561f8b0ada0c2cd0 machineconfiguration.openshift.io/lastAppliedDrain: uncordon-rendered-worker-8c613531f97974a9561f8b0ada0c2cd0 machineconfiguration.openshift.io/lastSyncedControllerConfigResourceVersion: 505735 machineconfiguration.openshift.io/reason: machineconfiguration.openshift.io/state: Done sriovnetwork.openshift.io/state: Idle tuned.openshift.io/bootcmdline: skew_tick=1 tsc=reliable rcupdate.rcu_normal_after_boot=1 nohz=on rcu_nocbs=10-25 tuned.non_isolcpus=000003ff systemd.cpu_affinity=0,1,2,3... 
volumes.kubernetes.io/controller-managed-attach-detach: true CreationTimestamp: Thu, 07 Mar 2024 06:09:31 +0000 Taints: <none> Unschedulable: false Lease: HolderIdentity: 5kqfbl3y0rhocpnfv-wj2jj-worker-0-n6crr AcquireTime: <unset> RenewTime: Wed, 13 Mar 2024 04:55:28 +0000 Conditions: Type Status LastHeartbeatTime LastTransitionTime Reason Message ---- ------ ----------------- ------------------ ------ ------- MemoryPressure False Wed, 13 Mar 2024 04:55:33 +0000 Thu, 07 Mar 2024 15:18:00 +0000 KubeletHasSufficientMemory kubelet has sufficient memory available DiskPressure False Wed, 13 Mar 2024 04:55:33 +0000 Thu, 07 Mar 2024 15:18:00 +0000 KubeletHasNoDiskPressure kubelet has no disk pressure PIDPressure False Wed, 13 Mar 2024 04:55:33 +0000 Thu, 07 Mar 2024 15:18:00 +0000 KubeletHasSufficientPID kubelet has sufficient PID available Ready True Wed, 13 Mar 2024 04:55:33 +0000 Thu, 07 Mar 2024 15:18:05 +0000 KubeletReady kubelet is posting ready status Addresses: InternalIP: 192.168.0.91 Hostname: 5kqfbl3y0rhocpnfv-wj2jj-worker-0-n6crr Capacity: cpu: 26 ephemeral-storage: 104266732Ki hugepages-1Gi: 20Gi hugepages-2Mi: 0 memory: 47264764Ki openshift.io/intl_provider3: 4 openshift.io/intl_provider4: 4 pods: 250 Allocatable: cpu: 16 ephemeral-storage: 95018478229 hugepages-1Gi: 20Gi hugepages-2Mi: 0 memory: 25166844Ki openshift.io/intl_provider3: 4 openshift.io/intl_provider4: 4 pods: 250 System Info: Machine ID: aa5cfdcbeb4646d88ac25bb6f0c0d879 System UUID: aa5cfdcb-eb46-46d8-8ac2-5bb6f0c0d879 Boot ID: 77573755-0d27-4717-80fe-4579692d9c2c Kernel Version: 5.14.0-284.54.1.el9_2.x86_64 OS Image: Red Hat Enterprise Linux CoreOS 414.92.202402201520-0 (Plow) Operating System: linux Architecture: amd64 Container Runtime Version: cri-o://1.27.3-6.rhaos4.14.git7eb2281.el9 Kubelet Version: v1.27.10+28ed2d7 Kube-Proxy Version: v1.27.10+28ed2d7 ProviderID: openstack:///aa5cfdcb-eb46-46d8-8ac2-5bb6f0c0d879 Non-terminated Pods: (19 in total) Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits Age --------- ---- ------------ ---------- --------------- ------------- --- crucible-rickshaw testpmd-host-device-e810-sriov 10 (62%) 10 (62%) 10000Mi (40%) 10000Mi (40%) 3d13h openshift-cluster-csi-drivers openstack-cinder-csi-driver-node-hnv49 30m (0%) 0 (0%) 150Mi (0%) 0 (0%) 5d22h openshift-cluster-node-tuning-operator tuned-fcjfp 10m (0%) 0 (0%) 50Mi (0%) 0 (0%) 5d22h openshift-dns dns-default-v7s59 60m (0%) 0 (0%) 110Mi (0%) 0 (0%) 5d22h openshift-dns node-resolver-gkz8b 5m (0%) 0 (0%) 21Mi (0%) 0 (0%) 5d22h openshift-image-registry node-ca-p5dn5 10m (0%) 0 (0%) 10Mi (0%) 0 (0%) 5d22h openshift-ingress-canary ingress-canary-fk59t 10m (0%) 0 (0%) 20Mi (0%) 0 (0%) 5d22h openshift-machine-config-operator machine-config-daemon-9qw8z 40m (0%) 0 (0%) 100Mi (0%) 0 (0%) 5d22h openshift-monitoring node-exporter-czcmj 9m (0%) 0 (0%) 47Mi (0%) 0 (0%) 5d22h openshift-monitoring prometheus-adapter-7696787779-vj5wk 1m (0%) 0 (0%) 40Mi (0%) 0 (0%) 5d4h openshift-multus multus-additional-cni-plugins-l7rpv 10m (0%) 0 (0%) 10Mi (0%) 0 (0%) 5d22h openshift-multus multus-nxr6k 10m (0%) 0 (0%) 65Mi (0%) 0 (0%) 5d22h openshift-multus network-metrics-daemon-tb7sq 20m (0%) 0 (0%) 120Mi (0%) 0 (0%) 5d22h openshift-network-diagnostics network-check-target-pqtp9 10m (0%) 0 (0%) 15Mi (0%) 0 (0%) 5d22h openshift-openstack-infra coredns-5kqfbl3y0rhocpnfv-wj2jj-worker-0-n6crr 200m (1%) 0 (0%) 400Mi (1%) 0 (0%) 5d22h openshift-openstack-infra keepalived-5kqfbl3y0rhocpnfv-wj2jj-worker-0-n6crr 200m (1%) 0 (0%) 400Mi (1%) 0 (0%) 
5d22h openshift-sdn sdn-9mdnb 110m (0%) 0 (0%) 220Mi (0%) 0 (0%) 5d22h openshift-sriov-network-operator sriov-device-plugin-tr68w 10m (0%) 0 (0%) 50Mi (0%) 0 (0%) 5d13h openshift-sriov-network-operator sriov-network-config-daemon-dtf95 100m (0%) 0 (0%) 100Mi (0%) 0 (0%) 5d22h Allocated resources: (Total limits may be over 100 percent, i.e., overcommitted.) Resource Requests Limits -------- -------- ------ cpu 10845m (67%) 10 (62%) memory 11928Mi (48%) 10000Mi (40%) ephemeral-storage 0 (0%) 0 (0%) hugepages-1Gi 8Gi (40%) 8Gi (40%) hugepages-2Mi 0 (0%) 0 (0%) openshift.io/intl_provider3 4 4 openshift.io/intl_provider4 4 4 Events: <none> 2. OpenStack Network resource for Worker node $ openstack server list --all --fit-width +--------------------------------------+----------------------------------------+--------+----------------------------------------------------------------------------------------------------------------------+-------------------------------+------------+ | ID | Name | Status | Networks | Image | Flavor | +--------------------------------------+----------------------------------------+--------+----------------------------------------------------------------------------------------------------------------------+-------------------------------+------------+ | aa5cfdcb-eb46-46d8-8ac2-5bb6f0c0d879 | 5kqfbl3y0rhocpnfv-wj2jj-worker-0-n6crr | ACTIVE | management=192.168.0.91; provider-3=192.168.177.197, 192.168.177.59, 192.168.177.66, 192.168.177.83; | 5kqfbl3y0rhocpnfv-wj2jj-rhcos | sos-worker | | | | | provider-4=192.168.178.108, 192.168.178.121, 192.168.178.144, 192.168.178.18 | | | | 1a24baf3-acde-49a0-ab8e-4f4afcc9d3cc | 5kqfbl3y0rhocpnfv-wj2jj-master-2 | ACTIVE | management=192.168.0.62 | 5kqfbl3y0rhocpnfv-wj2jj-rhcos | sos-master | | 3e545ab5-6e28-4189-8d94-9272dfa1cd05 | 5kqfbl3y0rhocpnfv-wj2jj-master-1 | ACTIVE | management=192.168.0.78 | 5kqfbl3y0rhocpnfv-wj2jj-rhcos | sos-master | | 97e5c382-0fb0-4a70-b58e-0469d3869a4e | 5kqfbl3y0rhocpnfv-wj2jj-master-0 | ACTIVE | management=192.168.0.93 | 5kqfbl3y0rhocpnfv-wj2jj-rhcos | sos-master | +--------------------------------------+----------------------------------------+--------+----------------------------------------------------------------------------------------------------------------------+-------------------------------+------------+$ openstack port list --server aa5cfdcb-eb46-46d8-8ac2-5bb6f0c0d879 +--------------------------------------+----------------------------------------------------+-------------------+--------------------------------------------------------------------------------+--------+ | ID | Name | MAC Address | Fixed IP Addresses | Status | +--------------------------------------+----------------------------------------------------+-------------------+--------------------------------------------------------------------------------+--------+ | 0a562c29-4ddc-41c4-82e8-13934d3ee273 | 5kqfbl3y0rhocpnfv-wj2jj-worker-0-n6crr-0 | fa:16:3e:16:9a:c3 | ip_address='192.168.0.91', subnet_id='7fb7d2d6-325d-49e1-b3f8-b4dbb1197e34' | ACTIVE | | 0c1814db-cd4f-4f6a-a0c6-4f8e569b6767 | 5kqfbl3y0rhocpnfv-wj2jj-worker-0-n6crr-provider4p4 | fa:16:3e:15:88:d7 | ip_address='192.168.178.108', subnet_id='76430b9e-302f-428d-916a-77482d9cfb19' | ACTIVE | | 1778cb62-5fbf-42be-8847-53a7b092bdf5 | 5kqfbl3y0rhocpnfv-wj2jj-worker-0-n6crr-provider3p2 | fa:16:3e:2a:64:e4 | ip_address='192.168.177.197', subnet_id='1a892dcf-bf93-46ef-bf37-bda6cf923471' | ACTIVE | | 557f205b-2674-4f6e-91a2-643fe1702be2 | 
5kqfbl3y0rhocpnfv-wj2jj-worker-0-n6crr-provider3p1 | fa:16:3e:56:a3:48 | ip_address='192.168.177.83', subnet_id='1a892dcf-bf93-46ef-bf37-bda6cf923471' | ACTIVE | | 721b5f15-2dc9-4509-a4ba-09f364ae8771 | 5kqfbl3y0rhocpnfv-wj2jj-worker-0-n6crr-provider3p3 | fa:16:3e:dd:c3:28 | ip_address='192.168.177.59', subnet_id='1a892dcf-bf93-46ef-bf37-bda6cf923471' | ACTIVE | | 9da4b1be-27d7-4428-a194-9eb4b02f6ac5 | 5kqfbl3y0rhocpnfv-wj2jj-worker-0-n6crr-provider4p3 | fa:16:3e:fb:06:1b | ip_address='192.168.178.144', subnet_id='76430b9e-302f-428d-916a-77482d9cfb19' | ACTIVE | | a72fcbd2-83d3-4fa9-be3d-e9fbde27d4bf | 5kqfbl3y0rhocpnfv-wj2jj-worker-0-n6crr-provider3p4 | fa:16:3e:a9:28:0e | ip_address='192.168.177.66', subnet_id='1a892dcf-bf93-46ef-bf37-bda6cf923471' | ACTIVE | | ba5cd10f-c6bc-4bed-b978-3b8a3560ad5c | 5kqfbl3y0rhocpnfv-wj2jj-worker-0-n6crr-provider4p1 | fa:16:3e:33:e4:c4 | ip_address='192.168.178.18', subnet_id='76430b9e-302f-428d-916a-77482d9cfb19' | ACTIVE | | bf2ce123-76fc-4e5c-9e4f-0473febbdeac | 5kqfbl3y0rhocpnfv-wj2jj-worker-0-n6crr-provider4p2 | fa:16:3e:ce:91:10 | ip_address='192.168.178.121', subnet_id='76430b9e-302f-428d-916a-77482d9cfb19' | ACTIVE | +--------------------------------------+----------------------------------------------------+-------------------+--------------------------------------------------------------------------------+--------+ $ openstack port show --fit-width 5kqfbl3y0rhocpnfv-wj2jj-worker-0-n6crr-provider4p4 +-------------------------+---------------------------------------------------------------------------------------------------------------------------+ | Field | Value | +-------------------------+---------------------------------------------------------------------------------------------------------------------------+ | admin_state_up | UP | | allowed_address_pairs | | | binding_host_id | nfv-intel-11.perflab.com | | binding_profile | pci_slot='0000:b1:11.2', pci_vendor_info='8086:1889', physical_network='provider4' | | binding_vif_details | connectivity='l2', port_filter='False', vlan='178' | | binding_vif_type | hw_veb | | binding_vnic_type | direct | | created_at | 2024-03-07T06:03:43Z | | data_plane_status | None | | description | Created by cluster-api-provider-openstack cluster openshift-machine-api-5kqfbl3y0rhocpnfv-wj2jj | | device_id | aa5cfdcb-eb46-46d8-8ac2-5bb6f0c0d879 | | device_owner | compute:worker | | device_profile | None | | dns_assignment | fqdn='host-192-168-178-108.openstacklocal.', hostname='host-192-168-178-108', ip_address='192.168.178.108' | | dns_domain | | | dns_name | | | extra_dhcp_opts | | | fixed_ips | ip_address='192.168.178.108', subnet_id='76430b9e-302f-428d-916a-77482d9cfb19' | | id | 0c1814db-cd4f-4f6a-a0c6-4f8e569b6767 | | ip_allocation | None | | mac_address | fa:16:3e:15:88:d7 | | name | 5kqfbl3y0rhocpnfv-wj2jj-worker-0-n6crr-provider4p4 | | network_id | e2106b16-8f83-4e2e-bdbd-20e2c12ec279 | | numa_affinity_policy | None | | port_security_enabled | True | | project_id | 927450d0f06647a99d86214acd822679 | | propagate_uplink_status | None | | qos_network_policy_id | None | | qos_policy_id | None | | resource_request | None | | revision_number | 6 | | security_group_ids | f0df9265-c7fd-4f47-875f-d346e5cb5074 | | status | ACTIVE | | tags | cluster-api-provider-openstack, openshift-machine-api-5kqfbl3y0rhocpnfv-wj2jj, openshiftClusterID=5kqfbl3y0rhocpnfv-wj2jj | | trunk_details | None | | updated_at | 2024-03-07T06:04:10Z | 
+-------------------------+---------------------------------------------------------------------------------------------------------------------------+$ openstack port show --fit-width 5kqfbl3y0rhocpnfv-wj2jj-worker-0-n6crr-provider3p2 +-------------------------+---------------------------------------------------------------------------------------------------------------------------+ | Field | Value | +-------------------------+---------------------------------------------------------------------------------------------------------------------------+ | admin_state_up | UP | | allowed_address_pairs | | | binding_host_id | nfv-intel-11.perflab.com | | binding_profile | pci_slot='0000:b1:01.1', pci_vendor_info='8086:1889', physical_network='provider3' | | binding_vif_details | connectivity='l2', port_filter='False', vlan='177' | | binding_vif_type | hw_veb | | binding_vnic_type | direct | | created_at | 2024-03-07T06:03:41Z | | data_plane_status | None | | description | Created by cluster-api-provider-openstack cluster openshift-machine-api-5kqfbl3y0rhocpnfv-wj2jj | | device_id | aa5cfdcb-eb46-46d8-8ac2-5bb6f0c0d879 | | device_owner | compute:worker | | device_profile | None | | dns_assignment | fqdn='host-192-168-177-197.openstacklocal.', hostname='host-192-168-177-197', ip_address='192.168.177.197' | | dns_domain | | | dns_name | | | extra_dhcp_opts | | | fixed_ips | ip_address='192.168.177.197', subnet_id='1a892dcf-bf93-46ef-bf37-bda6cf923471' | | id | 1778cb62-5fbf-42be-8847-53a7b092bdf5 | | ip_allocation | None | | mac_address | fa:16:3e:2a:64:e4 | | name | 5kqfbl3y0rhocpnfv-wj2jj-worker-0-n6crr-provider3p2 | | network_id | 50a557b5-34c2-4c47-b539-963688f7167c | | numa_affinity_policy | None | | port_security_enabled | True | | project_id | 927450d0f06647a99d86214acd822679 | | propagate_uplink_status | None | | qos_network_policy_id | None | | qos_policy_id | None | | resource_request | None | | revision_number | 9 | | security_group_ids | f0df9265-c7fd-4f47-875f-d346e5cb5074 | | status | ACTIVE | | tags | cluster-api-provider-openstack, openshift-machine-api-5kqfbl3y0rhocpnfv-wj2jj, openshiftClusterID=5kqfbl3y0rhocpnfv-wj2jj | | trunk_details | None | | updated_at | 2024-03-07T06:10:42Z | +-------------------------+---------------------------------------------------------------------------------------------------------------------------+ $ openstack network list +--------------------------------------+-------------+--------------------------------------+ | ID | Name | Subnets | +--------------------------------------+-------------+--------------------------------------+ | 50a557b5-34c2-4c47-b539-963688f7167c | provider-3 | 1a892dcf-bf93-46ef-bf37-bda6cf923471 | | e2106b16-8f83-4e2e-bdbd-20e2c12ec279 | provider-4 | 76430b9e-302f-428d-916a-77482d9cfb19 | | 5fdddf1c-3a71-4752-94bd-bdb5b9674500 | management | 7fb7d2d6-325d-49e1-b3f8-b4dbb1197e34 | +--------------------------------------+-------------+--------------------------------------+$ openstack network show provider-3 +---------------------------+--------------------------------------+ | Field | Value | +---------------------------+--------------------------------------+ | admin_state_up | UP | | availability_zone_hints | | | availability_zones | | | created_at | 2024-03-01T16:45:48Z | | description | | | dns_domain | | | id | 50a557b5-34c2-4c47-b539-963688f7167c | | ipv4_address_scope | None | | ipv6_address_scope | None | | is_default | None | | is_vlan_transparent | None | | mtu | 9216 | | name | provider-3 | | 
port_security_enabled | True | | project_id | ad4b9a972ac64bd9916ad7ee80288353 | | provider:network_type | vlan | | provider:physical_network | provider3 | | provider:segmentation_id | 177 | | qos_policy_id | None | | revision_number | 2 | | router:external | Internal | | segments | None | | shared | True | | status | ACTIVE | | subnets | 1a892dcf-bf93-46ef-bf37-bda6cf923471 | | tags | | | updated_at | 2024-03-01T16:45:52Z | +---------------------------+--------------------------------------+$ openstack network show provider-4 +---------------------------+--------------------------------------+ | Field | Value | +---------------------------+--------------------------------------+ | admin_state_up | UP | | availability_zone_hints | | | availability_zones | | | created_at | 2024-03-01T16:45:57Z | | description | | | dns_domain | | | id | e2106b16-8f83-4e2e-bdbd-20e2c12ec279 | | ipv4_address_scope | None | | ipv6_address_scope | None | | is_default | None | | is_vlan_transparent | None | | mtu | 9216 | | name | provider-4 | | port_security_enabled | True | | project_id | ad4b9a972ac64bd9916ad7ee80288353 | | provider:network_type | vlan | | provider:physical_network | provider4 | | provider:segmentation_id | 178 | | qos_policy_id | None | | revision_number | 2 | | router:external | Internal | | segments | None | | shared | True | | status | ACTIVE | | subnets | 76430b9e-302f-428d-916a-77482d9cfb19 | | tags | | | updated_at | 2024-03-01T16:46:01Z | +---------------------------+--------------------------------------+ 3.
Actual results:
Expected results:
Additional info:
This is a clone of issue OCPBUGS-36833. The following is the description of the original issue:
—
Description of problem:
In 4.16, OCP starts to place an annotation on service accounts when it creates a dockercfg secret. Some operators/reconciliation loops will then (incorrectly) set the annotations on the SA back to exactly what they wanted. OCP annotates again and creates a new secret, the operator sets it back without the annotation, and the cycle repeats. Eventually etcd gets completely overloaded with secrets, starts to OOM, and the entire cluster comes down.
It is believed that at least the otel, tempo, ACM, ODF/OCS, Strimzi, and Elasticsearch operators (and possibly others) reconcile the annotations on the SA by setting them back exactly how they want them set.
These seem to be related (but not a complete list):
https://issues.redhat.com/browse/LOG-5776
https://issues.redhat.com/browse/ENTMQST-6129
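A minimal sketch of the pattern that avoids this loop, assuming a controller-runtime client and a hypothetical annotation key (the affected operators use their own frameworks and keys): patch only your own annotation instead of overwriting the whole annotation map, so the dockercfg annotation OCP added is left untouched.
```go
package example

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// ensureAnnotation sets only the operator's own annotation on the ServiceAccount.
// client.MergeFrom produces a patch containing just the diff (the one key below),
// so annotations written by other controllers survive and no new dockercfg
// secret churn is triggered.
func ensureAnnotation(ctx context.Context, c client.Client, sa *corev1.ServiceAccount, key, value string) error {
	if sa.Annotations[key] == value {
		return nil // already correct, do not touch the object at all
	}
	patch := client.MergeFrom(sa.DeepCopy())
	if sa.Annotations == nil {
		sa.Annotations = map[string]string{}
	}
	sa.Annotations[key] = value
	return c.Patch(ctx, sa, patch)
}
```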
Description of the problem:
Setting up OCP with ODF (compact mode) using AI (stage). I have 3 hosts, each with an installation disk (120GB) and a data disk (500GB, disk type: Multipath). Even though each host has a non-bootable data disk (500GB), the host status is "Insufficient", and we cannot proceed because the "Next" button is disabled.
Steps to reproduce:
1. Create a new cluster
2. Select "Install OpenShift Data Foundation" in Operators page
3. Take 3 hosts with 1 installation disk and 1 non-installation disk on each.
4. Add hosts by booting hosts with downloaded iso
Actual results:
Status of hosts is "Insufficient" and "Next" button is disabled
Expected results:
Status of hosts should be "Ready" and "Next" button should be enabled to proceed with installation
Description of problem:
When creating an ImageDigestMirrorSet with a conflicting mirrorSourcePolicy, no error is reported.
Version-Release number of selected component (if applicable):
% oc get clusterversion NAME VERSION AVAILABLE PROGRESSING SINCE STATUS version 4.15.0-0.nightly-2024-01-14-100410 True False 27m Cluster version is 4.15.0-0.nightly-2024-01-14-100410
How reproducible:
always
Steps to Reproduce:
1. Create an ImageContentSourcePolicy.
ImageContentSourcePolicy.yaml:
apiVersion: operator.openshift.io/v1alpha1
kind: ImageContentSourcePolicy
metadata:
  name: ubi8repo
spec:
  repositoryDigestMirrors:
  - mirrors:
    - example.io/example/ubi-minimal
    - example.com/example/ubi-minimal
    source: registry.access.redhat.com/ubi6/ubi-minimal
  - mirrors:
    - mirror.example.net
    source: registry.example.com/example
2. After the MCP finishes updating, check that /etc/containers/registries.conf is updated as expected.
3. Create an ImageDigestMirrorSet with a conflicting mirrorSourcePolicy for the same source "registry.example.com/example".
ImageDigestMirrorSet-conflict.yaml:
apiVersion: config.openshift.io/v1
kind: ImageDigestMirrorSet
metadata:
  name: digest-mirror
spec:
  imageDigestMirrors:
  - mirrors:
    - example.io/example/ubi-minimal
    - example.com/example/ubi-minimal
    source: registry.access.redhat.com/ubi8/ubi-minimal
    mirrorSourcePolicy: AllowContactingSource
  - mirrors:
    - mirror.example.net
    source: registry.example.com/example
    mirrorSourcePolicy: NeverContactSource
Actual results:
3. The resource is created successfully, but the MCP does not get updated and no relevant MC is generated. The machine-config-controller log shows: I0116 02:34:03.897335 1 container_runtime_config_controller.go:417] Error syncing image config openshift-config: could not Create/Update MachineConfig: could not update registries config with new changes: conflicting mirrorSourcePolicy is set for the same source "registry.example.com/example" in imagedigestmirrorsets and/or imagetagmirrorsets
Expected results:
3. It should report that a conflicting mirrorSourcePolicy exists for the same source "registry.example.com/example" in the ICSP.
Additional info:
Description of problem:
If a cluster is installed using a proxy and the username used for connecting to the proxy contains the characters "%40" (the encoding of "@" when a domain is provided), the installation fails. The failure happens because the proxy variables set in the file "/etc/systemd/system.conf.d/10-default-env.conf" on the bootstrap node are ignored by systemd. This issue seems to have already been fixed in MCO (BZ 1882674, fixed in RHOCP 4.7), but it looks like it affects the bootstrap process in 4.13 and 4.14, causing the installation to not start at all.
Version-Release number of selected component (if applicable):
4.14, 4.13
How reproducible:
100% always
Steps to Reproduce:
1. Create an install-config.yaml file with "%40" in the middle of the username used for the proxy. 2. Start the cluster installation. 3. Bootstrap will fail because the proxy variables are not used.
Actual results:
Installation fails because systemd fails to load the proxy variables if "%" is present in the username.
Expected results:
The installation should succeed using a username containing "%40" for the proxy.
Additional info:
File "/etc/systemd/system.conf.d/10-default-env.conf" for the bootstrap should be generated in a way accepted by systemd.
Tracker issue for bootimage bump in 4.16. This issue should block issues which need a bootimage bump to fix.
The previous bump was OCPBUGS-36324.
Description of problem:
Fast DataPath released a new major version of Open vSwitch, 3.3. This version is going to be a new LTS and contains performance improvements and features required for future releases of OVN. Since OCP 4.16 is planned to have a longer support time frame, it should use this version of OVS. Moving to newer versions of OVS will also gradually allow FDP to drop support for older streams not used by any layered products. The most notable relevant improvements over OVS 3.1 are: - Improved performance of database operations, most notably the initial read of the database file and the database schema conversion on updates. The plan is to also update the main ovs-vswitchd at the OS level in a separate issue; this will provide support for flushing CT entries by marks and labels, which is needed for future versions of OVN, and it is better to keep the versions on the host and inside the container in sync. The change was discussed with FDP and OVS-QE.
Please review the following PR: https://github.com/openshift/machine-api-provider-azure/pull/88
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
https://github.com/openshift/release/pull/48835
The scatter chart is very slow to load for the last week. While there are a few hits here and there over the last y days, it looks like this got a lot more common yesterday around noon and has continued ever since.
Suspicious PR: https://github.com/openshift/origin/pull/28587
Please review the following PR: https://github.com/openshift/cluster-api/pull/190
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
The PipelineRun list view contains a Task status column, which shows the overall task status of the PipelineRun. In order to render this column we fetch all the TaskRuns of that PipelineRun. Every PipelineRun row therefore has to load all the related TaskRun information, which is causing a performance issue in the PipelineRun list view.
The customer is facing UI slowness and rendering problems for a large number of PipelineRuns, with and without results enabled. In both cases, significant slowness is observed, which is hampering their daily operations.
How reproducible:
Always
Steps to Reproduce:
1. Create few pipelineruns 2. Navigate to pipelineruns list view
Actual results:
All the TaskRuns are fetched, and the PipelineRun list view renders this column asynchronously with a loading indicator.
Expected results:
TaskRuns should not be fetched at all; instead the UI needs to parse the `` string to render this column.
Additional info:
The PipelineRun status message gets updated on every task completion.
pipelinerun.status.conditions:
- lastTransitionTime: '2023-11-15T07:51:42Z'
  message: 'Tasks Completed: 3 (Failed: 0, Cancelled 0), Skipped: 0'
  reason: Succeeded
  status: 'True'
  type: Succeeded
We can parse the above information to derive the following object and use it for rendering the column; this will improve the performance of this page hugely.
{
completed: 3, // 3 (total count) - 0 (failed count) - 0 (cancelled count),
failed: 0,
cancelled: 0,
skipped: 0,
pending: 0
}
Slack thread for more details - thread
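As a rough sketch of the parsing idea (written in Go purely for illustration; the console itself is not Go), the condition message shown above can be reduced to those counts with a single regular expression:
```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// Matches condition messages such as:
//   "Tasks Completed: 3 (Failed: 0, Cancelled 0), Skipped: 0"
var taskStatusRe = regexp.MustCompile(`Tasks Completed: (\d+) \(Failed: (\d+), Cancelled (\d+)\), Skipped: (\d+)`)

func parseTaskStatus(message string) (completed, failed, cancelled, skipped int, ok bool) {
	m := taskStatusRe.FindStringSubmatch(message)
	if m == nil {
		return 0, 0, 0, 0, false
	}
	total, _ := strconv.Atoi(m[1])
	failed, _ = strconv.Atoi(m[2])
	cancelled, _ = strconv.Atoi(m[3])
	skipped, _ = strconv.Atoi(m[4])
	// "completed" in the derived object above is total minus failed and cancelled.
	completed = total - failed - cancelled
	return completed, failed, cancelled, skipped, true
}

func main() {
	fmt.Println(parseTaskStatus("Tasks Completed: 3 (Failed: 0, Cancelled 0), Skipped: 0"))
}
```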
Description of problem:
When navigating to the Pipelines list page from the Search menu in the Developer perspective, the Pipelines list page crashes.
Version-Release number of selected component (if applicable):
4.15
How reproducible:
Always
Steps to Reproduce:
1. Install the Pipelines Operator 2. Go to the Developer perspective 3. Go to the Search menu and select Pipeline
Actual results:
The page crashes
Expected results:
The page should not crash and should show the Pipelines list page
Additional info:
The story is to track i18n upload/download routine tasks which are perform every sprint.
A.C.
- Upload strings to Memsource at the start of the sprint and reach out to the localization team
- Download translated strings from Memsource when it is ready
- Review the translated strings and open a pull request
- Open a followup story for next sprint
Update Ironic code to remove the cinderclient and glanceclient images from the ironic container
This is a clone of issue OCPBUGS-34540. The following is the description of the original issue:
—
ControlePlaneReleaseProvider is modifying the cached release image directly, which means the userReleaseProvider is still picking up and using the registry overrides for data-plane components.
Description of problem:
Private HC provision failed on AWS.
How reproducible:
Always.
Steps to Reproduce:
Create a private HC on AWS following the steps in https://hypershift-docs.netlify.app/how-to/aws/deploy-aws-private-clusters/:
RELEASE_IMAGE=registry.ci.openshift.org/ocp/release:4.17.0-0.nightly-2024-06-20-005211
HO_IMAGE=quay.io/hypershift/hypershift-operator:latest
BUCKET_NAME=fxie-hcp-bucket
REGION=us-east-2
AWS_CREDS="$HOME/.aws/credentials"
CLUSTER_NAME=fxie-hcp-1
BASE_DOMAIN=qe.devcluster.openshift.com
EXT_DNS_DOMAIN=hypershift-ext.qe.devcluster.openshift.com
PULL_SECRET="/Users/fxie/Projects/hypershift/.dockerconfigjson"
hypershift install --oidc-storage-provider-s3-bucket-name $BUCKET_NAME --oidc-storage-provider-s3-credentials $AWS_CREDS --oidc-storage-provider-s3-region $REGION --private-platform AWS --aws-private-creds $AWS_CREDS --aws-private-region=$REGION --wait-until-available --hypershift-image $HO_IMAGE
hypershift create cluster aws --pull-secret=$PULL_SECRET --aws-creds=$AWS_CREDS --name=$CLUSTER_NAME --base-domain=$BASE_DOMAIN --node-pool-replicas=2 --region=$REGION --endpoint-access=Private --release-image=$RELEASE_IMAGE --generate-ssh
Additional info:
From the MC:
$ for k in $(oc get secret -n clusters-fxie-hcp-1 | grep -i kubeconfig | awk '{print $1}'); do echo $k; oc extract secret/$k -n clusters-fxie-hcp-1 --to - 2>/dev/null | grep -i 'server:'; done
admin-kubeconfig
    server: https://a621f63c3c65f4e459f2044b9521b5e9-082a734ef867f25a.elb.us-east-2.amazonaws.com:6443
aws-pod-identity-webhook-kubeconfig
    server: https://kube-apiserver:6443
bootstrap-kubeconfig
    server: https://api.fxie-hcp-1.hypershift.local:443
cloud-credential-operator-kubeconfig
    server: https://kube-apiserver:6443
dns-operator-kubeconfig
    server: https://kube-apiserver:6443
fxie-hcp-1-2bsct-kubeconfig
    server: https://kube-apiserver:6443
ingress-operator-kubeconfig
    server: https://kube-apiserver:6443
kube-controller-manager-kubeconfig
    server: https://kube-apiserver:6443
kube-scheduler-kubeconfig
    server: https://kube-apiserver:6443
localhost-kubeconfig
    server: https://localhost:6443
service-network-admin-kubeconfig
    server: https://kube-apiserver:6443
The bootstrap-kubeconfig uses an incorrect KAS port (should be 6443 since the KAS is exposed through LB), causing kubelet on each HC node to use the same incorrect port. As a result AWS VMs are provisioned but cannot join the HC as nodes.
From a bastion:
[ec2-user@ip-10-0-5-182 ~]$ nc -zv api.fxie-hcp-1.hypershift.local 443
Ncat: Version 7.50 ( https://nmap.org/ncat )
Ncat: Connection timed out.
[ec2-user@ip-10-0-5-182 ~]$ nc -zv api.fxie-hcp-1.hypershift.local 6443
Ncat: Version 7.50 ( https://nmap.org/ncat )
Ncat: Connected to 10.0.143.91:6443.
Ncat: 0 bytes sent, 0 bytes received in 0.01 seconds.
Besides, the CNO also passes the wrong KAS port to Network components on the HC.
Same for HA proxy configuration on the VMs:
frontend local_apiserver
bind 172.20.0.1:6443
log global
mode tcp
option tcplog
default_backend remote_apiserver
backend remote_apiserver
mode tcp
log global
option httpchk GET /version
option log-health-checks
default-server inter 10s fall 3 rise 3
server controlplane api.fxie-hcp-1.hypershift.local:443
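Presumably (this is inferred from the port-connectivity test above, not a verified fix), the generated bootstrap-kubeconfig and this HA proxy backend would need to target the port the KAS load balancer actually listens on, i.e. something like:
backend remote_apiserver
    mode tcp
    server controlplane api.fxie-hcp-1.hypershift.local:6443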
spec:
  configuration:
    featureGate:
      featureSet: TechPreviewNoUpgrade
$ oc get pod NAME READY STATUS RESTARTS AGE capi-provider-bd4858c47-sf5d5 0/2 Init:0/1 0 9m33s cluster-api-85f69c8484-5n9ql 1/1 Running 0 9m33s control-plane-operator-78c9478584-xnjmd 2/2 Running 0 9m33s etcd-0 3/3 Running 0 9m10s kube-apiserver-55bb575754-g4694 4/5 CrashLoopBackOff 6 (81s ago) 8m30s $ oc logs kube-apiserver-55bb575754-g4694 -c kube-apiserver --tail=5 E0105 16:49:54.411837 1 controller.go:145] while syncing ConfigMap "kube-system/kube-apiserver-legacy-service-account-token-tracking", err: namespaces "kube-system" not found I0105 16:49:54.415074 1 trace.go:236] Trace[236726897]: "Create" accept:application/vnd.kubernetes.protobuf, */*,audit-id:71496035-d1fe-4ee1-bc12-3b24022ea39c,client:::1,api-group:scheduling.k8s.io,api-version:v1,name:,subresource:,namespace:,protocol:HTTP/2.0,resource:priorityclasses,scope:resource,url:/apis/scheduling.k8s.io/v1/priorityclasses,user-agent:kube-apiserver/v1.29.0 (linux/amd64) kubernetes/9368fcd,verb:POST (05-Jan-2024 16:49:44.413) (total time: 10001ms): Trace[236726897]: ---"Write to database call failed" len:174,err:priorityclasses.scheduling.k8s.io "system-node-critical" is forbidden: not yet ready to handle request 10001ms (16:49:54.415) Trace[236726897]: [10.001615835s] [10.001615835s] END F0105 16:49:54.415382 1 hooks.go:203] PostStartHook "scheduling/bootstrap-system-priority-classes" failed: unable to add default system priority classes: priorityclasses.scheduling.k8s.io "system-node-critical" is forbidden: not yet ready to handle request
Component Readiness has found a potential regression in [bz-networking][invariant] alert/OVNKubernetesResourceRetryFailure should not be at or above info.
Probability of significant regression: 96.30%
Sample (being evaluated) Release: 4.16
Start Time: 2024-04-29T00:00:00Z
End Time: 2024-05-06T23:59:59Z
Success Rate: 72.73%
Successes: 32
Failures: 12
Flakes: 0
Base (historical) Release: 4.15
Start Time: 2024-02-01T00:00:00Z
End Time: 2024-05-06T23:59:59Z
Success Rate: 85.20%
Successes: 236
Failures: 41
Flakes: 0
Description of problem:
The job [sig-node] [Conformance] Prevent openshift node labeling on update by the node TestOpenshiftNodeLabeling [Suite:openshift/conformance/parallel/minimal] uses the `oc debug` command [1]. Occasionally we find that the command fails to run, which ends up failing the test.
CI search results - https://search.ci.openshift.org/?search=TestOpenshiftNodeLabeling&maxAge=48h&context=1&type=bug%2Bissue%2Bjunit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job
Version-Release number of selected component (if applicable):
How reproducible:
Observed in CI jobs mentioned above.
Steps to Reproduce:
1. 2. 3.
Actual results:
oc debug command occasionally exits with error
Expected results:
oc debug command should not occasionally exit with error
Additional info:
Description of problem:
DRA plugins can be installed, but do not really work because the required scheduler plugin DynamicResources isn't enabled.
Version-Release number of selected component (if applicable):
4.14.3
How reproducible:
Always
Steps to Reproduce:
1. Install an OpenShift cluster and enable the TechPreviewNoUpgrade feature set either during installation or post-install. The feature set includes the DynamicResourceAllocation feature gate. 2. Install a DRA plugin by any vendor, e.g. by NVIDIA (requires at least one GPU worker with NVIDIA GPU drivers installed on the node, and a few tweaks to allow the plugin to run on OpenShift). 3. Create a resource claim. 4. Create a pod that consumes the resource claim.
Actual results:
The pod remains in ContainerCreating state, the claim in WaitingForFirstConsumer state forever, without any meaningful event or error message.
Expected results:
A resource is allocated according to the resource claim, and assigned to the pod.
Additional info:
The problem is caused by the DynamicResources scheduler plugin not being automatically enabled when the feature flag is turned on. This lets DRA plugins run without issues (the right APIs are available), but they do nothing.
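For illustration, in upstream Kubernetes the plugin is switched on through a scheduler configuration along the lines of the sketch below (field names follow the upstream KubeSchedulerConfiguration API; how OpenShift's scheduler operator would wire this in when the feature gate is enabled is not shown here and is an assumption):
```yaml
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: default-scheduler
  plugins:
    multiPoint:
      enabled:
      - name: DynamicResources
```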
This is a clone of issue OCPBUGS-38258. The following is the description of the original issue:
—
The issue we're trying to address is that nodes go NotReady for a few seconds.
See slack thread https://redhat-external.slack.com/archives/C01C8502FMM/p1717767390381249
This is a clone of issue OCPBUGS-41233. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-39531. The following is the description of the original issue:
—
-> While upgrading the cluster from 4.13.38 -> 4.14.18, it is stuck on CCO, clusterversion is complaining about
"Working towards 4.14.18: 690 of 860 done (80% complete), waiting on cloud-credential".
While checking further we see that the CCO deployment is yet to roll out.
-> ClusterOperator status.versions[name=operator] isn't a narrow "CCO Deployment is updated", it's "the CCO asserts the whole CC component is updated", which requires (among other things) a functional CCO Deployment. Seems like you don't have a functional CCO Deployment, because logs have it stuck talking about asking for a leader lease. You don't have Kube API audit logs to say if it's stuck generating the Lease request, or waiting for a response from the Kube API server.
Please review the following PR: https://github.com/openshift/cloud-credential-operator/pull/642
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
https://github.com/openshift/origin/pull/28522 removes two tests related to http2 testing with the default certificate that are now known to fail with HAProxy 2.8. We are reworking the tests as part of NE-1444 (HAProxy 2.8 bump).
This bug is a reminder that come OCP 4.16 GA we need to have reworked the tests so that they now pass with HAProxy 2.8 or, if not fixed, revert https://github.com/openshift/origin/pull/28522 which is why I'm marking this bug as a blocker. We do not want to ship 4.16 without reinstating the two tests.
The goal of removing the two tests in https://github.com/openshift/origin/pull/28522 is to allow us to make additional progress in https://github.com/openshift/router/pull/551 (which is our HAProxy 2.8 bump). With all tests passing in router#551 we can continue our assessment of HAProxy 2.8 by a) running the payload tests and b) creating an HAProxy 2.8 image that QE can use with their reliability test suite.
This is a clone of issue OCPBUGS-33486. The following is the description of the original issue:
—
Description of problem:
Build tests in OCP 4.14 reference Ruby images that are now EOL. The related code in our sample ruby build was deleted.
Version-Release number of selected component (if applicable):
4.14
How reproducible:
Always
Steps to Reproduce:
1. Run the build suite for OCP 4.14 against a 4.14 cluster
Actual results:
Test [sig-builds][Feature:Builds][Slow] builds with a context directory s2i context directory build should s2i build an application using a context directory [apigroup:build.openshift.io] fails 2024-05-08T11:11:57.558298778Z I0508 11:11:57.558273 1 builder.go:400] Powered by buildah v1.31.0 2024-05-08T11:11:57.581578795Z I0508 11:11:57.581509 1 builder.go:473] effective capabilities: [audit_control=true audit_read=true audit_write=true block_suspend=true bpf=true checkpoint_restore=true chown=true dac_override=true dac_read_search=true fowner=true fsetid=true ipc_lock=true ipc_owner=true kill=true lease=true linux_immutable=true mac_admin=true mac_override=true mknod=true net_admin=true net_bind_service=true net_broadcast=true net_raw=true perfmon=true setfcap=true setgid=true setpcap=true setuid=true sys_admin=true sys_boot=true sys_chroot=true sys_module=true sys_nice=true sys_pacct=true sys_ptrace=true sys_rawio=true sys_resource=true sys_time=true sys_tty_config=true syslog=true wake_alarm=true] 2024-05-08T11:11:57.583755245Z I0508 11:11:57.583715 1 builder.go:401] redacted build: {"kind":"Build","apiVersion":"build.openshift.io/v1","metadata":{"name":"s2icontext-1","namespace":"e2e-test-contextdir-wpphk","uid":"c2db2893-06e5-4274-96ae-d8cd635a1f8d","resourceVersion":"51882","generation":1,"creationTimestamp":"2024-05-08T11:11:55Z","labels":{"buildconfig":"s2icontext","openshift.io/build-config.name":"s2icontext","openshift.io/build.start-policy":"Serial"},"annotations":{"openshift.io/build-config.name":"s2icontext","openshift.io/build.number":"1"},"ownerReferences":[{"apiVersion":"build.openshift.io/v1","kind":"BuildConfig","name":"s2icontext","uid":"b7dbb52b-ae66-4465-babc-728ae3ceed9a","controller":true}],"managedFields":[{"manager":"openshift-apiserver","operation":"Update","apiVersion":"build.openshift.io/v1","time":"2024-05-08T11:11:55Z","fieldsType":"FieldsV1","fieldsV1":{"f:metadata":{"f:annotations":{".":{},"f:openshift.io/build-config.name":{},"f:openshift.io/build.number":{}},"f:labels":{".":{},"f:buildconfig":{},"f:openshift.io/build-config.name":{},"f:openshift.io/build.start-policy":{}},"f:ownerReferences":{".":{},"k:{\"uid\":\"b7dbb52b-ae66-4465-babc-728ae3ceed9a\"}":{}}},"f:spec":{"f:output":{"f:to":{}},"f:serviceAccount":{},"f:source":{"f:contextDir":{},"f:git":{".":{},"f:uri":{}},"f:type":{}},"f:strategy":{"f:sourceStrategy":{".":{},"f:env":{},"f:from":{},"f:pullSecret":{}},"f:type":{}},"f:triggeredBy":{}},"f:status":{"f:conditions":{".":{},"k:{\"type\":\"New\"}":{".":{},"f:lastTransitionTime":{},"f:lastUpdateTime":{},"f:status":{},"f:type":{}}},"f:config":{},"f:phase":{}}}}]},"spec":{"serviceAccount":"builder","source":{"type":"Git","git":{"uri":"https://github.com/sclorg/s2i-ruby-container"},"contextDir":"2.7/test/puma-test-app"},"strategy":{"type":"Source","sourceStrategy":{"from":{"kind":"DockerImage","name":"image-registry.openshift-image-registry.svc:5000/openshift/ruby:2.7-ubi8"},"pullSecret":{"name":"builder-dockercfg-v9xk2"},"env":[{"name":"BUILD_LOGLEVEL","value":"5"}]}},"output":{"to":{"kind":"DockerImage","name":"image-registry.openshift-image-registry.svc:5000/e2e-test-contextdir-wpphk/test:latest"},"pushSecret":{"name":"builder-dockercfg-v9xk2"}},"resources":{},"postCommit":{},"nodeSelector":null,"triggeredBy":[{"message":"Manually 
triggered"}]},"status":{"phase":"New","outputDockerImageReference":"image-registry.openshift-image-registry.svc:5000/e2e-test-contextdir-wpphk/test:latest","config":{"kind":"BuildConfig","namespace":"e2e-test-contextdir-wpphk","name":"s2icontext"},"output":{},"conditions":[{"type":"New","status":"True","lastUpdateTime":"2024-05-08T11:11:55Z","lastTransitionTime":"2024-05-08T11:11:55Z"}]}} 2024-05-08T11:11:57.584949442Z Cloning "https://github.com/sclorg/s2i-ruby-container" ... 2024-05-08T11:11:57.585044449Z I0508 11:11:57.585030 1 source.go:237] git ls-remote --heads https://github.com/sclorg/s2i-ruby-container 2024-05-08T11:11:57.585081852Z I0508 11:11:57.585072 1 repository.go:450] Executing git ls-remote --heads https://github.com/sclorg/s2i-ruby-container 2024-05-08T11:11:57.840621917Z I0508 11:11:57.840572 1 source.go:237] 663daf43b2abb5662504638d017c7175a6cff59d refs/heads/3.2-experimental 2024-05-08T11:11:57.840621917Z 88b4e684576b3fe0e06c82bd43265e41a8129c5d refs/heads/add_test_latest_imagestreams 2024-05-08T11:11:57.840621917Z 12a863ab4b050a1365d6d59970dddc6743e8bc8c refs/heads/master 2024-05-08T11:11:57.840730405Z I0508 11:11:57.840714 1 source.go:69] Cloning source from https://github.com/sclorg/s2i-ruby-container 2024-05-08T11:11:57.840793509Z I0508 11:11:57.840781 1 repository.go:450] Executing git clone --recursive --depth=1 https://github.com/sclorg/s2i-ruby-container /tmp/build/inputs 2024-05-08T11:11:59.073229755Z I0508 11:11:59.073183 1 repository.go:450] Executing git rev-parse --abbrev-ref HEAD 2024-05-08T11:11:59.080132731Z I0508 11:11:59.080079 1 repository.go:450] Executing git rev-parse --verify HEAD 2024-05-08T11:11:59.083626287Z I0508 11:11:59.083586 1 repository.go:450] Executing git --no-pager show -s --format=%an HEAD 2024-05-08T11:11:59.115407368Z I0508 11:11:59.115361 1 repository.go:450] Executing git --no-pager show -s --format=%ae HEAD 2024-05-08T11:11:59.195276873Z I0508 11:11:59.195231 1 repository.go:450] Executing git --no-pager show -s --format=%cn HEAD 2024-05-08T11:11:59.198916080Z I0508 11:11:59.198879 1 repository.go:450] Executing git --no-pager show -s --format=%ce HEAD 2024-05-08T11:11:59.204712375Z I0508 11:11:59.204663 1 repository.go:450] Executing git --no-pager show -s --format=%ad HEAD 2024-05-08T11:11:59.211098793Z I0508 11:11:59.211051 1 repository.go:450] Executing git --no-pager show -s --format=%<(80,trunc)%s HEAD 2024-05-08T11:11:59.216192627Z I0508 11:11:59.216149 1 repository.go:450] Executing git config --get remote.origin.url 2024-05-08T11:11:59.218615714Z Commit: 12a863ab4b050a1365d6d59970dddc6743e8bc8c (Bump common from `1f774c8` to `a957816` (#537)) 2024-05-08T11:11:59.218661988Z Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> 2024-05-08T11:11:59.218683019Z Date: Tue Apr 9 15:24:11 2024 +0200 2024-05-08T11:11:59.218722882Z I0508 11:11:59.218711 1 repository.go:450] Executing git rev-parse --abbrev-ref HEAD 2024-05-08T11:11:59.234411732Z I0508 11:11:59.234366 1 repository.go:450] Executing git rev-parse --verify HEAD 2024-05-08T11:11:59.237729596Z I0508 11:11:59.237698 1 repository.go:450] Executing git --no-pager show -s --format=%an HEAD 2024-05-08T11:11:59.255304604Z I0508 11:11:59.255269 1 repository.go:450] Executing git --no-pager show -s --format=%ae HEAD 2024-05-08T11:11:59.261113560Z I0508 11:11:59.261074 1 repository.go:450] Executing git --no-pager show -s --format=%cn HEAD 2024-05-08T11:11:59.270006232Z I0508 11:11:59.269961 1 repository.go:450] Executing git --no-pager show -s 
--format=%ce HEAD 2024-05-08T11:11:59.278485984Z I0508 11:11:59.278443 1 repository.go:450] Executing git --no-pager show -s --format=%ad HEAD 2024-05-08T11:11:59.281940527Z I0508 11:11:59.281906 1 repository.go:450] Executing git --no-pager show -s --format=%<(80,trunc)%s HEAD 2024-05-08T11:11:59.299465312Z I0508 11:11:59.299423 1 repository.go:450] Executing git config --get remote.origin.url 2024-05-08T11:11:59.374652834Z error: provided context directory does not exist: 2.7/test/puma-test-app
Expected results:
Tests succeed
Additional info:
Ruby 2.7 is EOL and not searchable in the Red Hat container catalog. Failing test: https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_cluster-openshift-controller-manager-operator/344/pull-ci-openshift-cluster-openshift-controller-manager-operator-release-4.14-openshift-e2e-aws-builds-techpreview/1788152058105303040
This is a clone of issue OCPBUGS-36185. The following is the description of the original issue:
—
Description of problem:
The MAPI for IBM Cloud currently only checks the first group of subnets (50) when searching for Subnet details by name. It should provide pagination support to search all subnets.
Version-Release number of selected component (if applicable):
4.17
How reproducible:
100%, dependent on the order of subnets returned by the IBM Cloud APIs, however
Steps to Reproduce:
1. Create 50+ IBM Cloud VPC Subnets 2. Create a new IPI cluster (with or without BYON) 3. MAPI will attempt to find Subnet details by name, likely failing as it only checks the first group (50)...depending on order returned by IBM Cloud API
Actual results:
MAPI fails to find Subnet ID, thus cannot create/manage cluster nodes.
Expected results:
Successful IPI deployment.
Additional info:
IBM Cloud is working on a patch to MAPI to handle the ListSubnets API call and pagination results.
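A rough sketch of the kind of pagination loop such a patch needs, using a hypothetical listSubnets helper (the real IBM Cloud VPC SDK calls and field names are not reproduced here):
```go
package main

import "fmt"

// subnet is a simplified stand-in for the IBM Cloud VPC subnet resource.
type subnet struct {
	ID   string
	Name string
}

// listSubnets is a hypothetical helper returning one page of subnets and the
// token for the next page ("" when there are no more pages). In the real MAPI
// code this would wrap the IBM Cloud VPC ListSubnets API call.
func listSubnets(start string) (page []subnet, next string, err error) {
	return nil, "", nil // stub for the sketch
}

// findSubnetByName walks every page instead of only the first group of 50,
// which is the behaviour this bug asks for.
func findSubnetByName(name string) (*subnet, error) {
	start := ""
	for {
		page, next, err := listSubnets(start)
		if err != nil {
			return nil, err
		}
		for i := range page {
			if page[i].Name == name {
				return &page[i], nil
			}
		}
		if next == "" {
			return nil, fmt.Errorf("subnet %q not found in any page", name)
		}
		start = next
	}
}

func main() {
	_, err := findSubnetByName("example-subnet")
	fmt.Println(err)
}
```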
This is a clone of issue OCPBUGS-41371. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-38349. The following is the description of the original issue:
—
Description of problem:
When configuring an OpenID idp that can only be accessed via the data plane, if the hostname of the provider can only be resolved by the data plane, reconciliation of the idp fails.
Version-Release number of selected component (if applicable):
4.16
How reproducible:
always
Steps to Reproduce:
1. Configure an OpenID idp on a HostedCluster with a URL that points to a service in the dataplane (like https://keycloak.keycloak.svc)
Actual results:
The oauth server fails to be reconciled
Expected results:
The oauth server reconciles and functions properly
Additional info:
Follow up to OCPBUGS-37753
This is a clone of issue OCPBUGS-28974. The following is the description of the original issue:
—
Description of problem:
Machine stuck in Provisioned when the cluster is upgraded from 4.1 to 4.15
Version-Release number of selected component (if applicable):
Upgrade from 4.1 to 4.15 4.1.41-x86_64, 4.2.36-x86_64, 4.3.40-x86_64, 4.4.33-x86_64, 4.5.41-x86_64, 4.6.62-x86_64, 4.7.60-x86_64, 4.8.57-x86_64, 4.9.59-x86_64, 4.10.67-x86_64, 4.11 nightly, 4.12 nightly, 4.13 nightly, 4.14 nightly, 4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest
How reproducible:
Seems always; the issue was found in our Prow CI, and I also reproduced it.
Steps to Reproduce:
1.Create an aws IPI 4.1 cluster, then upgrade it one by one to 4.14 liuhuali@Lius-MacBook-Pro huali-test % oc get clusterversion NAME VERSION AVAILABLE PROGRESSING SINCE STATUS version 4.11.0-0.nightly-2024-01-19-110702 True True 26m Working towards 4.12.0-0.nightly-2024-02-04-062856: 654 of 830 done (78% complete), waiting on authentication, openshift-apiserver, openshift-controller-manager liuhuali@Lius-MacBook-Pro huali-test % oc get clusterversion NAME VERSION AVAILABLE PROGRESSING SINCE STATUS version 4.12.0-0.nightly-2024-02-04-062856 True False 5m12s Cluster version is 4.12.0-0.nightly-2024-02-04-062856 liuhuali@Lius-MacBook-Pro huali-test % oc get clusterversion NAME VERSION AVAILABLE PROGRESSING SINCE STATUS version 4.12.0-0.nightly-2024-02-04-062856 True True 61m Working towards 4.13.0-0.nightly-2024-02-04-042638: 713 of 841 done (84% complete), waiting up to 40 minutes on machine-config liuhuali@Lius-MacBook-Pro huali-test % oc get clusterversion NAME VERSION AVAILABLE PROGRESSING SINCE STATUS version 4.13.0-0.nightly-2024-02-04-042638 True False 10m Cluster version is 4.13.0-0.nightly-2024-02-04-042638 liuhuali@Lius-MacBook-Pro huali-test % oc get clusterversion NAME VERSION AVAILABLE PROGRESSING SINCE STATUS version 4.13.0-0.nightly-2024-02-04-042638 True True 17m Working towards 4.14.0-0.nightly-2024-02-02-173828: 233 of 860 done (27% complete), waiting on control-plane-machine-set, machine-api liuhuali@Lius-MacBook-Pro huali-test % oc get clusterversion NAME VERSION AVAILABLE PROGRESSING SINCE STATUS version 4.14.0-0.nightly-2024-02-02-173828 True False 18m Cluster version is 4.14.0-0.nightly-2024-02-02-173828 2.When it upgrade to 4.14, check the machine scale successfully liuhuali@Lius-MacBook-Pro huali-test % oc create -f ms1.yaml machineset.machine.openshift.io/ci-op-trzci0vq-8a8c4-dq95h-worker-us-east-1aa created liuhuali@Lius-MacBook-Pro huali-test % oc get machineset NAME DESIRED CURRENT READY AVAILABLE AGE ci-op-trzci0vq-8a8c4-dq95h-worker-us-east-1a 1 1 1 1 14h ci-op-trzci0vq-8a8c4-dq95h-worker-us-east-1aa 0 0 3s ci-op-trzci0vq-8a8c4-dq95h-worker-us-east-1f 2 2 2 2 14h liuhuali@Lius-MacBook-Pro huali-test % oc scale machineset ci-op-trzci0vq-8a8c4-dq95h-worker-us-east-1aa --replicas=1 machineset.machine.openshift.io/ci-op-trzci0vq-8a8c4-dq95h-worker-us-east-1aa scaled liuhuali@Lius-MacBook-Pro huali-test % oc get machine NAME PHASE TYPE REGION ZONE AGE ci-op-trzci0vq-8a8c4-dq95h-master-0 Running m6a.xlarge us-east-1 us-east-1f 15h ci-op-trzci0vq-8a8c4-dq95h-master-1 Running m6a.xlarge us-east-1 us-east-1a 15h ci-op-trzci0vq-8a8c4-dq95h-master-2 Running m6a.xlarge us-east-1 us-east-1f 15h ci-op-trzci0vq-8a8c4-dq95h-worker-us-east-1a-pqnqt Running m6a.xlarge us-east-1 us-east-1a 15h ci-op-trzci0vq-8a8c4-dq95h-worker-us-east-1aa-mt9kh Running m6a.xlarge us-east-1 us-east-1a 15m ci-op-trzci0vq-8a8c4-dq95h-worker-us-east-1f-h2f9k Running m6a.xlarge us-east-1 us-east-1f 15h ci-op-trzci0vq-8a8c4-dq95h-worker-us-east-1f-lgmjb Running m6a.xlarge us-east-1 us-east-1f 15h liuhuali@Lius-MacBook-Pro huali-test % oc get node NAME STATUS ROLES AGE VERSION ip-10-0-128-51.ec2.internal Ready master 15h v1.27.10+28ed2d7 ip-10-0-143-198.ec2.internal Ready worker 14h v1.27.10+28ed2d7 ip-10-0-143-64.ec2.internal Ready worker 14h v1.27.10+28ed2d7 ip-10-0-143-80.ec2.internal Ready master 15h v1.27.10+28ed2d7 ip-10-0-144-123.ec2.internal Ready master 15h v1.27.10+28ed2d7 ip-10-0-147-94.ec2.internal Ready worker 14h v1.27.10+28ed2d7 ip-10-0-158-61.ec2.internal Ready worker 3m40s 
v1.27.10+28ed2d7 liuhuali@Lius-MacBook-Pro huali-test % oc scale machineset ci-op-trzci0vq-8a8c4-dq95h-worker-us-east-1aa --replicas=0 machineset.machine.openshift.io/ci-op-trzci0vq-8a8c4-dq95h-worker-us-east-1aa scaled liuhuali@Lius-MacBook-Pro huali-test % oc get node NAME STATUS ROLES AGE VERSION ip-10-0-128-51.ec2.internal Ready master 15h v1.27.10+28ed2d7 ip-10-0-143-198.ec2.internal Ready worker 15h v1.27.10+28ed2d7 ip-10-0-143-64.ec2.internal Ready worker 15h v1.27.10+28ed2d7 ip-10-0-143-80.ec2.internal Ready master 15h v1.27.10+28ed2d7 ip-10-0-144-123.ec2.internal Ready master 15h v1.27.10+28ed2d7 ip-10-0-147-94.ec2.internal Ready worker 15h v1.27.10+28ed2d7 liuhuali@Lius-MacBook-Pro huali-test % oc get machine NAME PHASE TYPE REGION ZONE AGE ci-op-trzci0vq-8a8c4-dq95h-master-0 Running m6a.xlarge us-east-1 us-east-1f 15h ci-op-trzci0vq-8a8c4-dq95h-master-1 Running m6a.xlarge us-east-1 us-east-1a 15h ci-op-trzci0vq-8a8c4-dq95h-master-2 Running m6a.xlarge us-east-1 us-east-1f 15h ci-op-trzci0vq-8a8c4-dq95h-worker-us-east-1a-pqnqt Running m6a.xlarge us-east-1 us-east-1a 15h ci-op-trzci0vq-8a8c4-dq95h-worker-us-east-1f-h2f9k Running m6a.xlarge us-east-1 us-east-1f 15h ci-op-trzci0vq-8a8c4-dq95h-worker-us-east-1f-lgmjb Running m6a.xlarge us-east-1 us-east-1f 15h liuhuali@Lius-MacBook-Pro huali-test % oc delete machineset ci-op-trzci0vq-8a8c4-dq95h-worker-us-east-1aa machineset.machine.openshift.io "ci-op-trzci0vq-8a8c4-dq95h-worker-us-east-1aa" deleted liuhuali@Lius-MacBook-Pro huali-test % oc get clusterversion NAME VERSION AVAILABLE PROGRESSING SINCE STATUS version 4.14.0-0.nightly-2024-02-02-173828 True False 43m Cluster version is 4.14.0-0.nightly-2024-02-02-173828 3.Upgrade to 4.15 As upgrade to 4.15 nightly stuck on operator-lifecycle-manager-packageserver which is a bug https://issues.redhat.com/browse/OCPBUGS-28744 so I build image with the fix pr (job build openshift/operator-framework-olm#679 succeeded) and upgrade to the image, upgrade successfully liuhuali@Lius-MacBook-Pro huali-test % oc get clusterversion NAME VERSION AVAILABLE PROGRESSING SINCE STATUS version 4.14.0-0.nightly-2024-02-02-173828 True True 7s Working towards 4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest: 10 of 875 done (1% complete) liuhuali@Lius-MacBook-Pro huali-test % oc get clusterversion NAME VERSION AVAILABLE PROGRESSING SINCE STATUS version 4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest True False 23m Cluster version is 4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest liuhuali@Lius-MacBook-Pro huali-test % oc get co NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE authentication 4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest True False False 9h baremetal 4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest True False False 11h cloud-controller-manager 4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest True False False 8h cloud-credential 4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest True False False 16h cluster-autoscaler 4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest True False False 16h config-operator 4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest True False False 13h console 4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest True False False 3h19m control-plane-machine-set 4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest True False False 5h csi-snapshot-controller 4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest True False False 7h10m dns 4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest True 
False False 9h etcd 4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest True False False 14h image-registry 4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest True False False 33m ingress 4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest True False False 9h insights 4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest True False False 16h kube-apiserver 4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest True False False 14h kube-controller-manager 4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest True False False 14h kube-scheduler 4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest True False False 14h kube-storage-version-migrator 4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest True False False 34m machine-api 4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest True False False 16h machine-approver 4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest True False False 13h machine-config 4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest True False False 10h marketplace 4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest True False False 10h monitoring 4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest True False False 9h network 4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest True False False 16h node-tuning 4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest True False False 56m openshift-apiserver 4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest True False False 9h openshift-controller-manager 4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest True False False 4h56m openshift-samples 4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest True False False 58m operator-lifecycle-manager 4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest True False False 16h operator-lifecycle-manager-catalog 4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest True False False 16h operator-lifecycle-manager-packageserver 4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest True False False 57m service-ca 4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest True False False 16h storage 4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest True False False 9h liuhuali@Lius-MacBook-Pro huali-test % oc get machine NAME PHASE TYPE REGION ZONE AGE ci-op-trzci0vq-8a8c4-dq95h-master-0 Running m6a.xlarge us-east-1 us-east-1f 16h ci-op-trzci0vq-8a8c4-dq95h-master-1 Running m6a.xlarge us-east-1 us-east-1a 16h ci-op-trzci0vq-8a8c4-dq95h-master-2 Running m6a.xlarge us-east-1 us-east-1f 16h ci-op-trzci0vq-8a8c4-dq95h-worker-us-east-1a-pqnqt Running m6a.xlarge us-east-1 us-east-1a 16h ci-op-trzci0vq-8a8c4-dq95h-worker-us-east-1f-h2f9k Running m6a.xlarge us-east-1 us-east-1f 16h ci-op-trzci0vq-8a8c4-dq95h-worker-us-east-1f-lgmjb Running m6a.xlarge us-east-1 us-east-1f 16h 4.Check machine scale stuck in Provisioned, no csr pending liuhuali@Lius-MacBook-Pro huali-test % oc create -f ms1.yaml machineset.machine.openshift.io/ci-op-trzci0vq-8a8c4-dq95h-worker-us-east-1a1 created liuhuali@Lius-MacBook-Pro huali-test % oc get machineset NAME DESIRED CURRENT READY AVAILABLE AGE ci-op-trzci0vq-8a8c4-dq95h-worker-us-east-1a 1 1 1 1 16h ci-op-trzci0vq-8a8c4-dq95h-worker-us-east-1a1 0 0 6s ci-op-trzci0vq-8a8c4-dq95h-worker-us-east-1f 2 2 2 2 16h liuhuali@Lius-MacBook-Pro huali-test % oc scale machineset ci-op-trzci0vq-8a8c4-dq95h-worker-us-east-1a1 --replicas=1 machineset.machine.openshift.io/ci-op-trzci0vq-8a8c4-dq95h-worker-us-east-1a1 scaled liuhuali@Lius-MacBook-Pro huali-test % oc get machine NAME PHASE TYPE REGION ZONE AGE 
ci-op-trzci0vq-8a8c4-dq95h-master-0 Running m6a.xlarge us-east-1 us-east-1f 16h ci-op-trzci0vq-8a8c4-dq95h-master-1 Running m6a.xlarge us-east-1 us-east-1a 16h ci-op-trzci0vq-8a8c4-dq95h-master-2 Running m6a.xlarge us-east-1 us-east-1f 16h ci-op-trzci0vq-8a8c4-dq95h-worker-us-east-1a-pqnqt Running m6a.xlarge us-east-1 us-east-1a 16h ci-op-trzci0vq-8a8c4-dq95h-worker-us-east-1a1-5g877 Provisioning m6a.xlarge us-east-1 us-east-1a 4s ci-op-trzci0vq-8a8c4-dq95h-worker-us-east-1f-h2f9k Running m6a.xlarge us-east-1 us-east-1f 16h ci-op-trzci0vq-8a8c4-dq95h-worker-us-east-1f-lgmjb Running m6a.xlarge us-east-1 us-east-1f 16h liuhuali@Lius-MacBook-Pro huali-test % oc get machine NAME PHASE TYPE REGION ZONE AGE ci-op-trzci0vq-8a8c4-dq95h-master-0 Running m6a.xlarge us-east-1 us-east-1f 18h ci-op-trzci0vq-8a8c4-dq95h-master-1 Running m6a.xlarge us-east-1 us-east-1a 18h ci-op-trzci0vq-8a8c4-dq95h-master-2 Running m6a.xlarge us-east-1 us-east-1f 18h ci-op-trzci0vq-8a8c4-dq95h-worker-us-east-1a-pqnqt Running m6a.xlarge us-east-1 us-east-1a 18h ci-op-trzci0vq-8a8c4-dq95h-worker-us-east-1a1-5g877 Provisioned m6a.xlarge us-east-1 us-east-1a 97m ci-op-trzci0vq-8a8c4-dq95h-worker-us-east-1f-h2f9k Running m6a.xlarge us-east-1 us-east-1f 18h ci-op-trzci0vq-8a8c4-dq95h-worker-us-east-1f-lgmjb Running m6a.xlarge us-east-1 us-east-1f 18h ci-op-trzci0vq-8a8c4-dq95h-worker-us-east-1f1-4ln47 Provisioned m6a.xlarge us-east-1 us-east-1f 50m liuhuali@Lius-MacBook-Pro huali-test % oc get node NAME STATUS ROLES AGE VERSION ip-10-0-128-51.ec2.internal Ready master 18h v1.28.6+a373c1b ip-10-0-143-198.ec2.internal Ready worker 18h v1.28.6+a373c1b ip-10-0-143-64.ec2.internal Ready worker 18h v1.28.6+a373c1b ip-10-0-143-80.ec2.internal Ready master 18h v1.28.6+a373c1b ip-10-0-144-123.ec2.internal Ready master 18h v1.28.6+a373c1b ip-10-0-147-94.ec2.internal Ready worker 18h v1.28.6+a373c1b liuhuali@Lius-MacBook-Pro huali-test % oc get csr NAME AGE SIGNERNAME REQUESTOR REQUESTEDDURATION CONDITION csr-596n7 21m kubernetes.io/kube-apiserver-client-kubelet system:node:ip-10-0-147-94.ec2.internal <none> Approved,Issued csr-7nr9m 42m kubernetes.io/kubelet-serving system:node:ip-10-0-147-94.ec2.internal <none> Approved,Issued csr-bc9n7 16m kubernetes.io/kube-apiserver-client-kubelet system:node:ip-10-0-128-51.ec2.internal <none> Approved,Issued csr-dmk27 18m kubernetes.io/kubelet-serving system:node:ip-10-0-128-51.ec2.internal <none> Approved,Issued csr-ggkgd 64m kubernetes.io/kube-apiserver-client-kubelet system:node:ip-10-0-143-198.ec2.internal <none> Approved,Issued csr-rs9cz 70m kubernetes.io/kubelet-serving system:node:ip-10-0-143-80.ec2.internal <none> Approved,Issued liuhuali@Lius-MacBook-Pro huali-test %
Actual results:
Machine stuck in Provisioned
Expected results:
Machine should get Running
Additional info:
Must gather: https://drive.google.com/file/d/1TrZ_mb-cHKmrNMsuFl9qTdYo_eNPuF_l/view?usp=sharing I can see the provisioned machine on AWS console: https://drive.google.com/file/d/1-OcsmvfzU4JBeGh5cil8P2Hoe5DQsmqF/view?usp=sharing System log of ci-op-trzci0vq-8a8c4-dq95h-worker-us-east-1a1-5g877: https://drive.google.com/file/d/1spVT_o0S4eqeQxE5ivttbAazCCuSzj1e/view?usp=sharing Some log on the instance: https://drive.google.com/file/d/1zjxPxm61h4L6WVHYv-w7nRsSz5Fku26w/view?usp=sharing
While debugging a problem, I noticed some containers lack FallbackToLogsOnError. This is important for debugging via the API. Found via https://github.com/openshift/origin/pull/28547
Description of problem:
The following tests fail consistently in 4.14 PowerVS runs
Issue 1 analysis:
Error Description:
{ failed during setup error waiting for replicaset: failed waiting for pods to be running: timeout waiting for 2 pods to be ready}
Some Observations:
While creating a TCP service service-test with type=LoadBalancer for starting SimultaneousPodIPController, it fails to get load balancer details from the cloud, which results in the error before starting data collection for the e2e test and leads to the failure of the test case "[Jira:"NetworkEdge"] XXXitor test service-type-load-balancer-availability setup".
Please review the following PR: https://github.com/openshift/csi-operator/pull/87
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/cluster-api-provider-libvirt/pull/274
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
==== This Jira covers only the baremetal-runtimecfg component with respect to node IP detection ====
Description of problem:
Pods running in the namespace openshift-vsphere-infra are very verbose, printing as INFO messages that should be DEBUG. This excess of verbosity has an impact on CRI-O, on the node, and also on the Logging system. For instance, with 71 nodes, the number of log entries coming from this namespace in 1 month was 450,000,000, meaning 1TB of logs written to disk on the node by CRI-O, read by the Red Hat log collector, and stored in the Log Store. In addition to the performance impact, it has a financial impact because of the storage needed. Examples of logs that would fit better as DEBUG than as INFO:
```
/// For the keepalived pods, 4 messages are printed per node every 10 seconds; in this example the number of nodes is 71, which means 284 log entries every 10 seconds, or 1704 log entries per minute per keepalived pod
$ oc logs keepalived-master.example-0 -c keepalived-monitor |grep master.example-0|grep 2024-02-15T08:20:21 |wc -l
$ oc logs keepalived-master-example-0 -c keepalived-monitor |grep worker-example-0|grep 2024-02-15T08:20:21
2024-02-15T08:20:21.671390814Z time="2024-02-15T08:20:21Z" level=info msg="Searching for Node IP of worker-example-0. Using 'x.x.x.x/24' as machine network. Filtering out VIPs '[x.x.x.x x.x.x.x]'."
2024-02-15T08:20:21.671390814Z time="2024-02-15T08:20:21Z" level=info msg="For node worker-example-0 selected peer address x.x.x.x using NodeInternalIP"
2024-02-15T08:20:21.733399279Z time="2024-02-15T08:20:21Z" level=info msg="Searching for Node IP of worker-example-0. Using 'x.x.x.x' as machine network. Filtering out VIPs '[x.x.x.x x.x.x.x]'."
2024-02-15T08:20:21.733421398Z time="2024-02-15T08:20:21Z" level=info msg="For node worker-example-0 selected peer address x.x.x.x using NodeInternalIP"

/// For the haproxy pods, 2 log lines are printed every 6 seconds for each master; with three masters this means 6 messages in the same second, about 60 messages/minute per pod
$ oc logs haproxy-master-0-example -c haproxy-monitor
...
2024-02-15T08:20:00.517159455Z time="2024-02-15T08:20:00Z" level=info msg="Searching for Node IP of master-example-0. Using 'x.x.x.x/24' as machine network. Filtering out VIPs '[x.x.x.x]'."
2024-02-15T08:20:00.517159455Z time="2024-02-15T08:20:00Z" level=info msg="For node master-example-0 selected peer address x.x.x.x using NodeInternalIP"
```
Version-Release number of selected component (if applicable):
OpenShift 4.14 VSphere IPI installation
How reproducible:
Always
Steps to Reproduce:
1. Install an OpenShift 4.14 vSphere IPI environment 2. Review the logs of the haproxy and keepalived pods running in the namespace `openshift-vsphere-infra`
Actual results:
The haproxy-* and keepalived-* pods are very verbose, printing as INFO messages that should be DEBUG. Some of the messages are shown in the Description of problem of this bug.
Expected results:
Only relevant messages are printed as INFO, reducing the verbosity of the pods running in the namespace `openshift-vsphere-infra`
Additional info:
[sig-cluster-lifecycle][Feature:Machines] Managed cluster should [sig-scheduling][Early] control plane machine set operator should not have any events [Suite:openshift/conformance/parallel]
Looks like this test is permafailing on 4.16 and 4.15 AWS UPI jobs - does this need to be skipped on UPI?
{ fail [github.com/openshift/origin/test/extended/machines/machines.go:191]: Unexpected error:
    <*errors.StatusError | 0xc0031b8f00>:
    controlplanemachinesets.machine.openshift.io "cluster" not found
    {
      ErrStatus:
        code: 404
        details:
          group: machine.openshift.io
          kind: controlplanemachinesets
          name: cluster
        message: controlplanemachinesets.machine.openshift.io "cluster" not found
        metadata: {}
        reason: NotFound
        status: Failure,
    }
occurred
Ginkgo exit error 1: exit with code 1}
This is a clone of issue OCPBUGS-34953. The following is the description of the original issue:
—
Description of problem: When the bootstrap times out, the installer tries to download the logs from the bootstrap VM and gives an analysis of what happened. On OpenStack platform, we're currently failing to download the bootstrap logs (tracked in OCPBUGS-34950), which causes the analysis to always return an erroneous message:
time="2024-06-05T08:34:45-04:00" level=error msg="Bootstrap failed to complete: timed out waiting for the condition" time="2024-06-05T08:34:45-04:00" level=error msg="Failed to wait for bootstrapping to complete. This error usually happens when there is a problem with control plane hosts that prevents the control plane operators from creating the control plane." time="2024-06-05T08:34:45-04:00" level=error msg="The bootstrap machine did not execute the release-image.service systemd unit"
The claim that the bootstrap machine did not execute the release-image.service systemd unit is wrong, as I can confirm by SSH'ing to the bootstrap node:
systemctl status release-image.service
● release-image.service - Download the OpenShift Release Image
     Loaded: loaded (/etc/systemd/system/release-image.service; static)
     Active: active (exited) since Wed 2024-06-05 11:57:33 UTC; 1h 16min ago
    Process: 2159 ExecStart=/usr/local/bin/release-image-download.sh (code=exited, status=0/SUCCESS)
   Main PID: 2159 (code=exited, status=0/SUCCESS)
        CPU: 47.364s
Jun 05 11:57:05 mandre-tnvc8bootstrap systemd[1]: Starting Download the OpenShift Release Image...
Jun 05 11:57:06 mandre-tnvc8bootstrap podman[2184]: 2024-06-05 11:57:06.895418265 +0000 UTC m=+0.811028632 system refresh
Jun 05 11:57:06 mandre-tnvc8bootstrap release-image-download.sh[2159]: Pulling quay.io/openshift-release-dev/ocp-release@sha256:31cdf34b1957996d5c79c48466abab2fcfb9d9843>
Jun 05 11:57:32 mandre-tnvc8bootstrap release-image-download.sh[2269]: 079f5c86b015ddaf9c41349ba292d7a5487be91dd48e48852d10e64dd0ec125d
Jun 05 11:57:32 mandre-tnvc8bootstrap podman[2269]: 2024-06-05 11:57:32.82473216 +0000 UTC m=+25.848290388 image pull 079f5c86b015ddaf9c41349ba292d7a5487be91dd48e48852d1>
Jun 05 11:57:33 mandre-tnvc8bootstrap systemd[1]: Finished Download the OpenShift Release Image.
The installer was just unable to retrieve the bootstrap logs. Earlier, buried in the installer logs, we can see:
time="2024-06-05T08:34:42-04:00" level=info msg="Failed to gather bootstrap logs: failed to connect to the bootstrap machine: dial tcp 10.196.2.10:22: connect: connection
timed out"
This is what should be reported by the analyzer.
This is a clone of issue OCPBUGS-37667. The following is the description of the original issue:
—
Description of problem:
After successfully mirroring the ibm-ftm-operator via the latest oc-mirror command to an internal registry and applying the newly generated IBM CatalogSource YAML file, the created catalog pod in the openshift-marketplace namespace enters CrashLoopBackOff. The customer is trying to mirror operators; the command to list the catalog has no issues, but the catalog pod is crashing with the following error:
~~~
time="2024-07-10T13:43:07Z" level=info msg="starting pprof endpoint" address="localhost:6060"
time="2024-07-10T13:43:08Z" level=fatal msg="cache requires rebuild: cache reports digest as \"e891bfd5a4cb5702\", but computed digest is \"1922475dc0ee190c\""
~~~
Version-Release number of selected component (if applicable):
oc-mirror 4.16 OCP 4.14.z
How reproducible:
Steps to Reproduce:
1. Create catalog image with the following imagesetconfiguration:
~~~
kind: ImageSetConfiguration
apiVersion: mirror.openshift.io/v1alpha2
archiveSize: 4
storageConfig:
  registry:
    imageURL: <internal-registry>:Port/oc-mirror-metadata/12july24
    skipTLS: false
mirror:
  platform:
    architectures:
    - "amd64"
    channels:
    - name: stable-4.14
      minVersion: 4.14.11
      maxVersion: 4.14.30
      type: ocp
      shortestPath: true
    graph: true
  operators:
  - catalog: icr.io/cpopen/ibm-operator-catalog:v1.22
    packages:
    - name: ibm-ftm-operator
      channels:
      - name: v4.4
~~~
2. Run the following command:
~~~
/oc-mirror --config=./imageset-config.yaml docker://Internal-registry:Port --rebuild-catalogs
~~~
3. Create the catalogsource pod under the openshift-marketplace namespace:
~~~
cat oc-mirror-workspace/results-1721222945/catalogSource-cs-ibm-operator-catalog.yaml
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: cs-ibm-operator-catalog
  namespace: openshift-marketplace
spec:
  image: Internal-registry:Port/cpopen/ibm-operator-catalog:v1.22
  sourceType: grpc
~~~
Actual results:
catalog pod is crashing with the following error: ~~~ time="2024-07-10T13:43:07Z" level=info msg="starting pprof endpoint" address="localhost:6060" time="2024-07-10T13:43:08Z" level=fatal msg="cache requires rebuild: cache reports digest as \"e891bfd5a4cb5702\", but computed digest is \"1922475dc0ee190c\"" ~~~
Expected results:
The pod should run without any issue.
Additional info:
1. The issue is reproducible with OCP 4.14.14 and OCP 4.14.29
2. Customer is already using oc-mirror 4.16:
~~~
./oc-mirror version
WARNING: This version information is deprecated and will be replaced with the output from --short. Use --output=yaml|json to get the full version.
Client Version: version.Info{Major:"", Minor:"", GitVersion:"4.16.0-202407030803.p0.g394b1f8.assembly.stream.el9-394b1f8", GitCommit:"394b1f814f794f4f01f473212c9a7695726020bf", GitTreeState:"clean", BuildDate:"2024-07-03T10:18:49Z", GoVersion:"go1.21.11 (Red Hat 1.21.11-1.module+el8.10.0+21986+2112108a) X:strictfipsruntime", Compiler:"gc", Platform:"linux/amd64"}
~~~
3. Customer tried the workaround described in the KB[1]: https://access.redhat.com/solutions/7006771 but no luck
4. Customer also tried to set OPM_BINARY, but it didn't work. They downloaded OPM with the respective arch from https://github.com/operator-framework/operator-registry/releases, renamed the downloaded binary to opm, and set the below variable before executing oc-mirror: OPM_BINARY=/path/to/opm
Description of the problem:
BE master ~2.30 - in the feature support API, VIP_AUTO_ALLOC is dev_preview for 4.15 - it should be unavailable
How reproducible:
100%
Steps to reproduce:
1. GET https://<SERVICE_ADDRESS>/api/assisted-install/v2/support-levels/features?openshift_version=4.15&cpu_architecture=x86_64
2. BE response support level
3.
Actual results:
VIP_AUTO_ALLOC is dev_preview for 4.15
Expected results:
VIP_AUTO_ALLOC should be unavailable
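To re-check the reported support level after a fix, the same endpoint from step 1 can be queried directly; a minimal sketch, assuming the assisted-service address placeholder used above (jq is only used here to pretty-print the response):
```
# Query the feature support levels for 4.15 on x86_64 and pretty-print the JSON response.
curl -s "https://<SERVICE_ADDRESS>/api/assisted-install/v2/support-levels/features?openshift_version=4.15&cpu_architecture=x86_64" | jq .
```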
Description of problem:
A customer is facing an issue while deploying a Nutanix IPI cluster 4.16.x with DHCP. ENV DETAILS: Nutanix versions: AOS: 6.5.4, NCC: 4.6.6.3, PC: pc.2023.4.0.2, LCM: 3.0.0.1. During the installation process, after the bootstrap node and control planes are created, the IP addresses of the nodes shown in the Nutanix Dashboard conflict, even when infinite DHCP leases are set. The installation only works successfully when using the Nutanix IPAM. The 4.14 and 4.15 releases also install successfully. The IPs of master0 and master2 are conflicting; please check the attachment. Sos-reports of master0 and master1: https://drive.google.com/drive/folders/140ATq1zbRfqd1Vbew-L_7N4-C5ijMao3?usp=sharing The issue was reported via the Slack thread: https://redhat-internal.slack.com/archives/C02A3BM5DGS/p1721837567181699
Version-Release number of selected component (if applicable):
How reproducible:
Use the OCP 4.16.z installer to create an OCP cluster with Nutanix using DHCP network. The installation will fail. Always reproducible.
Steps to Reproduce:
1. 2. 3.
Actual results:
The installation will fail.
Expected results:
The installation succeeds to create a Nutanix OCP cluster with the DHCP network.
Additional info:
In order to use hostPath volumes, containers in Kubernetes must be started with the privileged flag set. This is because this flag toggles an SELinux boolean that cannot be toggled by enabling any particular capability. (Empirical testing shows the same restriction does not apply to emptyDir volumes.)
Since the baremetal components rely on hostPath volumes for a number of purposes, this prevents many of them from running unprivileged.
However, there are a number of containers that do not use any hostPath volumes and need only an added capability, if anything. These should be specified explicitly instead of just setting privileged mode to enable everything.
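As a rough sketch of the intended outcome for such containers (the pod name, image, and capability below are illustrative and not taken from any specific baremetal component), an explicit capability can be requested instead of full privileged mode:
```
apiVersion: v1
kind: Pod
metadata:
  name: capability-example            # hypothetical pod, for illustration only
spec:
  containers:
  - name: monitor
    image: registry.example.com/monitor:latest
    securityContext:
      # No hostPath volume is used, so privileged: true is unnecessary;
      # request only the single capability the process actually needs.
      capabilities:
        add:
        - NET_ADMIN
```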
This is a clone of issue OCPBUGS-33060. The following is the description of the original issue:
—
Description of problem:
HCP has audit log configuration for Kube API server, OpenShift API server, OAuth API server (like OCP), but does not have audit for oauth-openshift (OAuth server). Discussed with Standa in https://redhat-internal.slack.com/archives/CS05TR7BK/p1714124297376299 , oauth-openshift needs audit too in HCP.
Version-Release number of selected component (if applicable):
4.11 ~ 4.16
How reproducible:
Always
Steps to Reproduce:
1. Launch HCP env. 2. Check audit log configuration: $ oc get deployment -n clusters-hypershift-ci-279389 kube-apiserver openshift-apiserver openshift-oauth-apiserver oauth-openshift -o yaml | grep -e '^ name:' -e 'audit\.log'
Actual results:
2. The output shows that oauth-openshift (OAuth server) has no audit:
  name: kube-apiserver
        - /var/log/kube-apiserver/audit.log
  name: openshift-apiserver
        - /var/log/openshift-apiserver/audit.log
  name: openshift-oauth-apiserver
        - --audit-log-path=/var/log/openshift-oauth-apiserver/audit.log
        - /var/log/openshift-oauth-apiserver/audit.log
  name: oauth-openshift
Expected results:
2. oauth-openshift (OAuth server) needs to have audit too.
Additional info:
OCP has audit for OAuth server since 4.11 AUTH-6 https://docs.openshift.com/container-platform/4.11/security/audit-log-view.html saying "You can view the logs for the OpenShift API server, Kubernetes API server, OpenShift OAuth API server, and OpenShift OAuth server".
Description of problem:
Rule ocp4-cis-file-permissions-cni-conf returned false negative result
From the CIS benchmark v1.4.0, it is using below command to check the multus config on nodes:
$ for i in $(oc get pods -n openshift-multus -l app=multus -oname); do oc exec -n openshift-multus $i -- /bin/bash -c "stat -c \"%a %n\" /host/etc/cni/net.d/*.conf"; done 600 /host/etc/cni/net.d/00-multus.conf 600 /host/etc/cni/net.d/00-multus.conf 600 /host/etc/cni/net.d/00-multus.conf 600 /host/etc/cni/net.d/00-multus.conf 600 /host/etc/cni/net.d/00-multus.conf 600 /host/etc/cni/net.d/00-multus.conf
Per the rule instructions, it is checking /etc/cni/net.d/ on the node.
However, the multus config on nodes is in path /etc/kubernetes/cni/net.d/, not /etc/cni/net.d/:
$ oc debug node/hongli-az-8pzqq-master-0 -- chroot /host ls -ltr /etc/cni/net.d/ Starting pod/hongli-az-8pzqq-master-0-debug ... To use host binaries, run `chroot /host` total 8 -rw-r--r--. 1 root root 129 Nov 7 02:18 200-loopback.conflist -rw-r--r--. 1 root root 469 Nov 7 02:18 100-crio-bridge.conflist Removing debug pod ... $ oc debug node/hongli-az-8pzqq-master-0 -- chroot /host ls -ltr /etc/kubernetes/cni/net.d/ Starting pod/hongli-az-8pzqq-master-0-debug ... To use host binaries, run `chroot /host` total 4 drwxr-xr-x. 2 root root 60 Nov 7 02:23 whereabouts.d -rw-------. 1 root root 352 Nov 7 02:23 00-multus.conf Removing debug pod ...
$ for node in `oc get node --no-headers|awk '{print $1}'`; do oc debug node/$node -- chroot /host ls -l /etc/kubernetes/cni/net.d/; done Starting pod/hongli-az-8pzqq-master-0-debug ... To use host binaries, run `chroot /host` total 4 -rw-------. 1 root root 352 Nov 7 02:23 00-multus.conf drwxr-xr-x. 2 root root 60 Nov 7 02:23 whereabouts.d Removing debug pod ... Starting pod/hongli-az-8pzqq-master-1-debug ... To use host binaries, run `chroot /host` total 4 -rw-------. 1 root root 352 Nov 7 02:23 00-multus.conf drwxr-xr-x. 2 root root 60 Nov 7 02:23 whereabouts.d Removing debug pod ... Starting pod/hongli-az-8pzqq-master-2-debug ... To use host binaries, run `chroot /host` total 4 -rw-------. 1 root root 352 Nov 7 02:23 00-multus.conf drwxr-xr-x. 2 root root 60 Nov 7 02:23 whereabouts.d Removing debug pod ... Starting pod/hongli-az-8pzqq-worker-westus-2mx6t-debug ... To use host binaries, run `chroot /host` total 4 -rw-------. 1 root root 352 Nov 7 02:38 00-multus.conf drwxr-xr-x. 2 root root 60 Nov 7 02:38 whereabouts.d Removing debug pod ... Starting pod/hongli-az-8pzqq-worker-westus-9qhf5-debug ... To use host binaries, run `chroot /host` total 4 -rw-------. 1 root root 352 Nov 7 02:38 00-multus.conf drwxr-xr-x. 2 root root 60 Nov 7 02:38 whereabouts.d Removing debug pod ... Starting pod/hongli-az-8pzqq-worker-westus-bcdpd-debug ... To use host binaries, run `chroot /host` total 4 -rw-------. 1 root root 352 Nov 7 02:38 00-multus.conf drwxr-xr-x. 2 root root 60 Nov 7 02:38 whereabouts.d Removing debug pod ...
Version-Release number of selected component (if applicable):
4.14.0-0.nightly-2023-11-05-194730
How reproducible:
Always
Steps to Reproduce:
1. $ for i in $(oc get pods -n openshift-multus -l app=multus -oname); do oc exec -n openshift-multus $i -- /bin/bash -c "stat -c \"%a %n\" /host/etc/cni/net.d/*.conf"; done $for node in `oc get node --no-headers|awk '{print $1}'`; do oc debug node/$node -- chroot /host ls -l /etc/kubernetes/cni/net.d/; done
Actual results:
The rule checks the wrong path (/etc/cni/net.d/) and returns FAIL
Expected results:
The rule should check the right path and return PASS
Additional info:
This is applicable to both SDN and OVN
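While verifying a fix, the same permission check can be pointed at the path where the multus config actually lives; a sketch reusing the commands shown above:
```
# Check permissions of the multus CNI config at the path actually used on the nodes.
for node in $(oc get node --no-headers | awk '{print $1}'); do
  oc debug node/$node -- chroot /host /bin/bash -c 'stat -c "%a %n" /etc/kubernetes/cni/net.d/*.conf'
done
```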
Please review the following PR: https://github.com/openshift/prometheus-alertmanager/pull/85
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/csi-driver-shared-resource/pull/159
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
The wait-for-ceo cmd is used during bootstrap to wait until the bootstrap completion conditions are met, i.e. etcd has scaled up to 3 members plus the bootstrap member.
https://github.com/openshift/installer/blob/d08c982cdbb7f66b810f71aa9608bf51cce8c38c/data/data/bootstrap/files/usr/local/bin/bootkube.sh.template#L569-L576
Currently this cmd won't return errors in the following two places:
Backport to 4.16 of AUTH-482 specifically for the oc node debug pods.
Description of problem:
The installer doesn't precheck whether the node architecture and VM type are consistent for AWS and GCP; the check works on Azure
Version-Release number of selected component (if applicable):
4.15.0-0.nightly-multi-2023-12-06-195439
How reproducible:
Always
Steps to Reproduce:
1. Configure the compute architecture field to arm64 but choose an amd64 instance type in install-config 2. Create the cluster 3. Check the installation
Actual results:
Azure will precheck whether the architecture is consistent with the instance type when creating manifests, like:
12-07 11:18:24.452 [INFO] Generating manifests files.....12-07 11:18:24.452 level=info msg=Credentials loaded from file "/home/jenkins/ws/workspace/ocp-common/Flexy-install/flexy/workdir/azurecreds20231207-285-jd7gpj"
12-07 11:18:56.474 level=error msg=failed to fetch Master Machines: failed to load asset "Install Config": failed to create install config: controlPlane.platform.azure.type: Invalid value: "Standard_D4ps_v5": instance type architecture 'Arm64' does not match install config architecture amd64
But AWS and GCP don't have such a precheck; the installation fails partway through, after many resources have already been created. This case is more likely to happen in multiarch clusters.
Expected results:
The installer should do a precheck for architecture and VM type, especially for platforms that support heterogeneous clusters (AWS, GCP, Azure)
Additional info:
At 17:26:09, the cluster is happily upgrading nodes:
An update is in progress for 57m58s: Working towards 4.14.1: 734 of 859 done (85% complete), waiting on machine-config
At 17:26:54, the upgrade starts to reboot master nodes and COs get noisy (this one specifically is OCPBUGS-20061)
An update is in progress for 58m50s: Unable to apply 4.14.1: the cluster operator control-plane-machine-set is not available
~Two minutes later, at 17:29:07, CVO starts to shout about waiting on operators for over 40 minutes despite not indicating anything was wrong earlier:
An update is in progress for 1h1m2s: Unable to apply 4.14.1: wait has exceeded 40 minutes for these operators: etcd, kube-apiserver
This is only because these operators go briefly degraded during master reboot (which they shouldn't, but that is a different story). CVO computes its 40-minute threshold against the time when it first started to upgrade the given operator, so it:
1. Upgrades etcd / KAS very early in the upgrade, noting the time when it started to do that
2. These two COs upgrade successfully and the upgrade proceeds
3. Eventually cluster starts rebooting masters and etcd/KAS go degraded
4. CVO compares the current time against the noted time, discovers it is more than 40 minutes, and starts warning about it.
all
Not entirely deterministic:
1. the upgrade must go for 40m+ between upgrading etcd and upgrading nodes
2. the upgrade must reboot a master that is not running CVO (otherwise there will be a new CVO instance without the saved times, they are only saved in memory)
1. Watch oc adm upgrade during the upgrade
Spurious "waiting for over 40m" message pops out of the blue
CVO simply says "waiting up to 40m on" and this eventually goes away as the node goes up and etcd goes out of degraded.
Storage operators typically run only one replica, and they can flip-flop between Progressing:True and Progressing:False, as well as Available:False and Available:True, during upgrade.
Usually this settles down, but this is causing our CI to report poor signal on vSphere platform - https://prow.ci.openshift.org/job-history/gs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.16-upgrade-from-stable-4.15-e2e-vsphere-ovn-upgrade
Please review the following PR: https://github.com/openshift/cloud-provider-alibaba-cloud/pull/40
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/azure-file-csi-driver/pull/59
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
This is a clone of issue OCPBUGS-34274. The following is the description of the original issue:
—
Description of problem:
AWS VPCs support a primary CIDR range and multiple secondary CIDR ranges: https://aws.amazon.com/about-aws/whats-new/2017/08/amazon-virtual-private-cloud-vpc-now-allows-customers-to-expand-their-existing-vpcs/
Let's pretend a VPC exists with:
and a hostedcontrolplane object like:
networking:
  ...
  machineNetwork:
  - cidr: 10.1.0.0/24
  ...
olmCatalogPlacement: management
platform:
  aws:
    cloudProviderConfig:
      subnet:
        id: subnet-b
      vpc: vpc-069a93c6654464f03
Even though all EC2 instances will be spun up in subnet-b (10.1.0.0/24), CPO will detect the CIDR range of the VPC as 10.0.0.0/24 (https://github.com/openshift/hypershift/blob/0d10c822912ed1af924e58ccb8577d2bb1fd68be/control-plane-operator/controllers/hostedcontrolplane/hostedcontrolplane_controller.go#L4755-L4765) and create security group rules only allowing inbound traffic from 10.0.0.0/24. This specifically prevents these EC2 instances from communicating with the VPC Endpoint created by the awsendpointservice CR and reaching the hosted control plane pods.
Version-Release number of selected component (if applicable):
Reproduced on a 4.14.20 ROSA HCP cluster, but the version should not matter
How reproducible:
100%
Steps to Reproduce:
1. Create a VPC with at least one secondary CIDR block 2. Install a ROSA HCP cluster providing the secondary CIDR block as the machine CIDR range and selecting the appropriate subnets within the secondary CIDR range
Actual results:
* Observe that the default security group contains inbound security group rules allowing traffic from the VPC's primary CIDR block (not a CIDR range containing the cluster's worker nodes)
* As a result, the EC2 instances (worker nodes) fail to reach the ignition-server
Expected results:
The EC2 instances are able to reach the ignition-server and HCP pods
Additional info:
This bug seems like it could be fixed by using the machine CIDR range for the security group instead of the VPC CIDR range. Alternatively, we could duplicate rules for every secondary CIDR block, but the default AWS quota is 60 inbound security group rules/security group, so it's another failure condition to keep in mind if we go that route.
aws ec2 describe-vpcs output for a VPC with secondary CIDR blocks: ❯ aws ec2 describe-vpcs --region us-east-2 --vpc-id vpc-069a93c6654464f03 { "Vpcs": [ { "CidrBlock": "10.0.0.0/24", "DhcpOptionsId": "dopt-0d1f92b25d3efea4f", "State": "available", "VpcId": "vpc-069a93c6654464f03", "OwnerId": "429297027867", "InstanceTenancy": "default", "CidrBlockAssociationSet": [ { "AssociationId": "vpc-cidr-assoc-0abbc75ac8154b645", "CidrBlock": "10.0.0.0/24", "CidrBlockState": { "State": "associated" } }, { "AssociationId": "vpc-cidr-assoc-098fbccc85aa24acf", "CidrBlock": "10.1.0.0/24", "CidrBlockState": { "State": "associated" } } ], "IsDefault": false, "Tags": [ { "Key": "Name", "Value": "test" } ] } ] }
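For step 1 of the reproduction, a secondary CIDR block can be attached to an existing VPC with the AWS CLI; a minimal sketch using the VPC ID and CIDR values from the example above:
```
# Associate a secondary CIDR block with the VPC; the cluster's machine CIDR is then
# taken from this range while the VPC's primary CIDR remains 10.0.0.0/24.
aws ec2 associate-vpc-cidr-block \
  --region us-east-2 \
  --vpc-id vpc-069a93c6654464f03 \
  --cidr-block 10.1.0.0/24
```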
Description of problem:
When trying to delete a ClusterResourceQuota resource using the foreground cascading deletion strategy, the deletion gets stuck and never completes. When the background cascading deletion strategy is used, the resource is immediately removed. Given that OpenShift GitOps uses the foreground cascading deletion strategy by default, this exposes some challenges when managing ClusterResourceQuota resources using OpenShift GitOps.
Version-Release number of selected component (if applicable):
4.14.0-0.nightly-2023-10-23-223425 but also previous version of OpenShift Container Platform 4 are affected
How reproducible:
Always
Steps to Reproduce:
1. Install OpenShift Container Platform 4 2. Create the ClusterResourceQuota as shown below $ bat -p /tmp/crq.yaml apiVersion: quota.openshift.io/v1 kind: ClusterResourceQuota metadata: creationTimestamp: null name: blue spec: quota: hard: pods: "10" secrets: "20" selector: annotations: null labels: matchLabels: color: nocolor 3. Delete the ClusterResourceQuota using "oc delete --cascade=foreground clusterresourcequota blue"
Actual results:
$ oc delete --cascade=foreground clusterresourcequota blue clusterresourcequota.quota.openshift.io "blue" deleted Is stuck and won't finish, the resource looks as shown below. $ oc get clusterresourcequota blue -o yaml apiVersion: quota.openshift.io/v1 kind: ClusterResourceQuota metadata: annotations: kubectl.kubernetes.io/last-applied-configuration: | {"apiVersion":"quota.openshift.io/v1","kind":"ClusterResourceQuota","metadata":{"annotations":{},"creationTimestamp":null,"name":"blue"},"spec":{"quota":{"hard":{"pods":"10","secrets":"20"}},"selector":{"annotations":null,"labels":{"matchLabels":{"color":"nocolor"}}}}} creationTimestamp: "2023-10-24T07:37:48Z" deletionGracePeriodSeconds: 0 deletionTimestamp: "2023-10-24T07:59:47Z" finalizers: - foregroundDeletion generation: 2 name: blue resourceVersion: "60554" uid: c18dd92c-afeb-47f4-a944-8b55be4037d7 spec: quota: hard: pods: "10" secrets: "20" selector: annotations: null labels: matchLabels: color: nocolor
Expected results:
The ClusterResourceQuota should be deleted using the foreground cascading deletion strategy without getting stuck, as there does not appear to be any OwnerReference still around blocking the removal
Additional info:
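Until the root cause is fixed, two workarounds follow from the behavior described above; the second one, clearing the stuck finalizer by hand, is a common but more invasive pattern and only a sketch:
```
# Background cascading deletion completes immediately, as noted in the description.
oc delete --cascade=background clusterresourcequota blue

# If a foreground deletion is already stuck, removing the foregroundDeletion finalizer
# lets the object be garbage collected.
oc patch clusterresourcequota blue --type=merge -p '{"metadata":{"finalizers":[]}}'
```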
Description of problem:
Snyk is failing on some deps
Version-Release number of selected component (if applicable):
At least master/4.17 and 4.16
How reproducible:
100%
Steps to Reproduce:
Open a PR against the master or release-4.16 branch; Snyk will fail. Recent history shows that the test is simply being overridden. We should stop overriding the test and either fix the deps or justify excluding them from Snyk.
Actual results:
This is a clone of issue OCPBUGS-34713. The following is the description of the original issue:
—
Description of problem:
[AWS] securityGroups and subnet are not consistent between the machine YAML and the AWS console. There is no securityGroup huliu-aws531d-vlzbw-master-sg for masters on the AWS console, but it shows in the master machines' YAML. There is no securityGroup huliu-aws531d-vlzbw-worker-sg for workers on the AWS console, but it shows in the worker machines' YAML. There is no subnet huliu-aws531d-vlzbw-private-us-east-2a for masters and workers on the AWS console, but it shows in the master and worker machines' YAML.
Version-Release number of selected component (if applicable):
4.16.0-0.nightly-2024-05-30-130713 This happens in the latest 4.16(CAPI) AWS cluster
How reproducible:
Always
Steps to Reproduce:
1. Install a AWS 4.16 cluster liuhuali@Lius-MacBook-Pro huali-test % oc get clusterversion NAME VERSION AVAILABLE PROGRESSING SINCE STATUS version 4.16.0-0.nightly-2024-05-30-130713 True False 46m Cluster version is 4.16.0-0.nightly-2024-05-30-130713 liuhuali@Lius-MacBook-Pro huali-test % oc get machine NAME PHASE TYPE REGION ZONE AGE huliu-aws531d-vlzbw-master-0 Running m6i.xlarge us-east-2 us-east-2a 65m huliu-aws531d-vlzbw-master-1 Running m6i.xlarge us-east-2 us-east-2b 65m huliu-aws531d-vlzbw-master-2 Running m6i.xlarge us-east-2 us-east-2c 65m huliu-aws531d-vlzbw-worker-us-east-2a-swwmk Running m6i.xlarge us-east-2 us-east-2a 62m huliu-aws531d-vlzbw-worker-us-east-2b-f2gw9 Running m6i.xlarge us-east-2 us-east-2b 62m huliu-aws531d-vlzbw-worker-us-east-2c-x6gbz Running m6i.xlarge us-east-2 us-east-2c 62m 2.Check the machines yaml, there are 4 securityGroups and 2 subnet value for master machines, 3 securityGroups and 2 subnet value for worker machines. But check on aws console, only 3 securityGroups and 1 subnet value for masters, 2 securityGroups and 1 subnet value for workers. liuhuali@Lius-MacBook-Pro huali-test % oc get machine huliu-aws531d-vlzbw-master-0 -oyaml … securityGroups: - filters: - name: tag:Name values: - huliu-aws531d-vlzbw-master-sg - filters: - name: tag:Name values: - huliu-aws531d-vlzbw-node - filters: - name: tag:Name values: - huliu-aws531d-vlzbw-lb - filters: - name: tag:Name values: - huliu-aws531d-vlzbw-controlplane subnet: filters: - name: tag:Name values: - huliu-aws531d-vlzbw-private-us-east-2a - huliu-aws531d-vlzbw-subnet-private-us-east-2a … https://drive.google.com/file/d/1YyPQjSCXOm-1gbD3cwktDQQJter6Lnk4/view?usp=sharing https://drive.google.com/file/d/1MhRIm8qIZWXdL9-cDZiyu0TOTFLKCAB6/view?usp=sharing https://drive.google.com/file/d/1Qo32mgBerWp5z6BAVNqBxbuH5_4sRuBv/view?usp=sharing https://drive.google.com/file/d/1seqwluMsPEFmwFL6pTROHYyJ_qPc0cCd/view?usp=sharing liuhuali@Lius-MacBook-Pro huali-test % oc get machine huliu-aws531d-vlzbw-worker-us-east-2a-swwmk -oyaml … securityGroups: - filters: - name: tag:Name values: - huliu-aws531d-vlzbw-worker-sg - filters: - name: tag:Name values: - huliu-aws531d-vlzbw-node - filters: - name: tag:Name values: - huliu-aws531d-vlzbw-lb subnet: filters: - name: tag:Name values: - huliu-aws531d-vlzbw-private-us-east-2a - huliu-aws531d-vlzbw-subnet-private-us-east-2a … https://drive.google.com/file/d/1FM7dxfSK0CGnm81dQbpWuVz1ciw9hgpq/view?usp=sharing https://drive.google.com/file/d/1QClWivHeGGhxK7FdBUJnGu-vHylqeg5I/view?usp=sharing https://drive.google.com/file/d/12jgyFfyP8fTzQu5wRoEa6RrXbYt_Gxm1/view?usp=sharing
Actual results:
securityGroups and subnet are not consistent between the machine YAML and the AWS console
Expected results:
securityGroups and subnet should be consistent between the machine YAML and the AWS console
Additional info:
Description of problem:
The kube-apiserver operator is trying to delete a Prometheus rule that does not exist, leading to a huge amount of unwanted audit logs. With the introduction of the change made as part of BUG-2004585, the kube-apiserver SLO rules are split into two groups, kube-apiserver-slos-basic and kube-apiserver-slos-extended, but kube-apiserver-operator keeps trying to delete /apis/monitoring.coreos.com/v1/namespaces/openshift-kube-apiserver/prometheusrules/kube-apiserver-slos, which no longer exists in the cluster
Version-Release number of selected component (if applicable):
4.12 4.13 4.14
How reproducible:
It's easy to reproduce
Steps to Reproduce:
1. Install a cluster with 4.12
2. Enable cluster logging
3. Forward the audit log to an internal or external log store using the below config
apiVersion: logging.openshift.io/v1
kind: ClusterLogForwarder
metadata:
  name: instance
  namespace: openshift-logging
spec:
  pipelines:
  - name: all-to-default
    inputRefs:
    - infrastructure
    - application
    - audit
    outputRefs:
    - default
4. Check the audit logs in Kibana; they will show entries like the below image
Actual results:
The kube-apiserver-operator is trying to delete a Prometheus rule that does not exist in the cluster
Expected results:
If the rule is not present in the cluster, the operator should not attempt to delete it
Additional info:
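When verifying, it may help to confirm which SLO PrometheusRule objects actually exist before and after the fix; a minimal check based on the group names from the description:
```
# Only kube-apiserver-slos-basic and kube-apiserver-slos-extended are expected;
# the old kube-apiserver-slos object should be absent.
oc get prometheusrules -n openshift-kube-apiserver | grep kube-apiserver-slos
```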
Aggregator claims these tests only ran 4 times out of what looks like 10 jobs that ran to normal completion:
[sig-network-edge] Application behind service load balancer with PDB remains available using new connections
[sig-network-edge] Application behind service load balancer with PDB remains available using reused connections
However looking at one of the jobs not in the list of passes, we can see these tests ran:
Why is the aggregator missing this result somehow?
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
Original issue reported here: https://issues.redhat.com/browse/ACM-6189 reported by QE and customer.
Using ACM/Hive, customers can deploy OpenShift on vSphere. In the upcoming release of ACM 2.9, we support customers on OCP 4.12 - 4.15. The ACM UI updates the install config as users add configuration details.
This has worked for several releases over the last few years. However, in OCP 4.13+ the format has changed and there is now additional validation to check whether the datastore is a full path.
As per https://issues.redhat.com/browse/SPLAT-1093, removal of the legacy fields should not happen until later, so any legacy configurations such as relative paths should still work.
Version-Release number of selected component (if applicable):
ACM 2.9.0-DOWNSTREAM-2023-10-24-01-06-09 OpenShift 4.14.0-rc.7 OpenShift 4.13.18 OpenShift 4.12.39
How reproducible:
Always
Steps to Reproduce:
1. Deploy OCP 4.12 on vSphere using legacy field and relative path without folder (e.g. platform.vsphere.defaultDatastore: WORKLOAD-DS 2. Installer passes. 3. Deploy OCP 4.12 on vSphere using legacy field and relative path WITH folder (e.g. platform.vsphere.defaultDatastore: WORKLOAD-DS-Folder/WORKLOAD-DS 4. Installer fails. 5. Deploy OCP 4.12 on vSphere using legacy field and FULL path (e.g. platform.vsphere.defaultDatastore: /Workload Datacenter/datastore/WORKLOAD-DS-Folder/WORKLOAD-DS 6. Installer fails. 7. Deploy OCP 4.13 on vSphere using legacy field and relative path without folder (e.g. platform.vsphere.defaultDatastore: WORKLOAD-DS 8. Installer fails. 9. Deploy OCP 4.13 on vSphere using legacy field and relative path WITH folder (e.g. platform.vsphere.defaultDatastore: WORKLOAD-DS-Folder/WORKLOAD-DS 10. Installer passes. 11. Deploy OCP 4.13 on vSphere using legacy field and FULL path (e.g. platform.vsphere.defaultDatastore: /Workload Datacenter/datastore/WORKLOAD-DS-Folder/WORKLOAD-DS 12. Installer fails.
Actual results:
| Default Datastore Value | OCP 4.12 | OCP 4.13 | OCP 4.14 |
|---|---|---|---|
| /Workload Datacenter/datastore/WORKLOAD-DS-Folder/WORKLOAD-DS | No | Yes | Yes |
| WORKLOAD-DS-Folder/WORKLOAD-DS | No | Yes | Yes |
| WORKLOAD-DS | Yes | No | No |
For OCP 4.12.z managed clusters deployments name-only path is the only one that works as expected.
For OCP 4.13.z+ managed cluster deployments only full name and relative path with folder works as expected.
Expected results:
OCP 4.13.z+ should accept a relative path without specifying the folder, as OCP 4.12.z does.
Additional info:
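For reference, a minimal sketch of the legacy install-config field using the full inventory path, which the table above shows as accepted on 4.13+ (the datacenter, folder, and datastore names are the illustrative ones from the steps):
```
platform:
  vsphere:
    # Legacy field; on 4.13+ the validation currently expects the full path.
    defaultDatastore: /Workload Datacenter/datastore/WORKLOAD-DS-Folder/WORKLOAD-DS
```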
Searching CI turns up runs like this which log:
level=warning msg=The bootstrap machine is unable to resolve API and/or API-Int Server URLs
despite the gathered log-bundle saying resolution was fine:
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.15-e2e-gcp-ovn-upgrade/1721875076346810368/artifacts/e2e-gcp-ovn-upgrade/ipi-install-install-stableinitial/artifacts/log-bundle-20231107134325.tar | tar xOz log-bundle-20231107134325/bootstrap/services/bootkube.json | jq -r '.[] | .timestamp + " " + (.stage // "-") + " " + .phase + " " + (.result // "-")' bootstrap/services/bootkube.json | sort | grep resolve
2023-11-03T10:26:53Z resolve-api-int-url stage end success
2023-11-03T10:26:53Z resolve-api-int-url stage start -
2023-11-03T10:26:53Z resolve-api-url stage end success
2023-11-03T10:26:53Z resolve-api-url stage start -
2023-11-03T10:47:30Z resolve-api-int-url stage end success
2023-11-03T10:47:30Z resolve-api-int-url stage start -
2023-11-03T10:47:30Z resolve-api-url stage end success
2023-11-03T10:47:30Z resolve-api-url stage start -
Definitely 4.15, from that CI run. Likely all releases since the check landed in installer#5816.
Untested, but from inspecting the code, I'd expect fairly reproducible.
1. Feed the installer an impossible manifest (e.g. using a kind that does not exist).
2. Try to install.
3. See the installer gather bootstrap logs and analyze them.
level=warning msg=The bootstrap machine is unable to resolve API and/or API-Int Server URLs
Installer does not complain about API resolution unless API resolution is broken.
The check logic seems to be looking at overall success of the bootkube service. It should be updated to only check the success of the resolve-api-url and resolve-api-int-url steps in that service. Ideally with separate analysis steps, so you don't have to say "and/or" in the logged warning.
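A sketch of that narrower check, reusing the jq query against the same bootkube.json shown earlier but keeping only the resolve stages (field names match the entries above):
```
# Inspect only the resolve-api-url / resolve-api-int-url stage results instead of the
# overall bootkube outcome.
jq -r '.[] | select(.stage == "resolve-api-url" or .stage == "resolve-api-int-url")
  | .timestamp + " " + .stage + " " + .phase + " " + (.result // "-")' \
  log-bundle-20231107134325/bootstrap/services/bootkube.json
```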
Please review the following PR: https://github.com/openshift/csi-external-attacher/pull/69
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
Mirror an OCI catalog with v2. After creating the catalog source, the pod is not present; describing the catalog shows the error info:
oc describe catalogsource cs-redhat-operator-index-8bfb449c24d03d6ddbd05d3de9fe7a7dae4a2ecdb8f84487f28d24d6ca2d175c
Name:         cs-redhat-operator-index-8bfb449c24d03d6ddbd05d3de9fe7a7dae4a2ecdb8f84487f28d24d6ca2d175c
Namespace:    openshift-marketplace
Labels:       <none>
Annotations:  <none>
API Version:  operators.coreos.com/v1alpha1
Kind:         CatalogSource
Metadata:
  Creation Timestamp:  2024-03-29T02:49:47Z
  Generation:          1
  Resource Version:    53264
  UID:                 69a39693-b29b-4fa4-a6da-de31dc3d521c
Spec:
  Image:        ec2-3-139-239-15.us-east-2.compute.amazonaws.com:5000/multi/redhat-operator-index:8bfb449c24d03d6ddbd05d3de9fe7a7dae4a2ecdb8f84487f28d24d6ca2d175c
  Source Type:  grpc
Status:
  Message:  couldn't ensure registry server - error ensuring pod: : error creating new pod: cs-redhat-operator-index-8bfb449c24d03d6ddbd05d3de9fe7a7dae4a2ecdb8f84487f28d24d6ca2d175c-: Pod "cs-redhat-operator-index-8bfb449c24d03d6ddbd05d3de9fe7a7da785sd" is invalid: metadata.labels: Invalid value: "cs-redhat-operator-index-8bfb449c24d03d6ddbd05d3de9fe7a7dae4a2ecdb8f84487f28d24d6ca2d175c": must be no more than 63 characters
  Reason:   RegistryServerError
Events:     <none>
Version-Release number of selected component (if applicable):
oc-mirror version WARNING: This version information is deprecated and will be replaced with the output from --short. Use --output=yaml|json to get the full version. Client Version: version.Info{Major:"", Minor:"", GitVersion:"4.16.0-202403251146.p0.g03ce0ca.assembly.stream.el9-03ce0ca", GitCommit:"03ce0ca797e73b6762fd3e24100ce043199519e9", GitTreeState:"clean", BuildDate:"2024-03-25T16:34:33Z", GoVersion:"go1.21.7 (Red Hat 1.21.7-1.el9) X:strictfipsruntime", Compiler:"gc", Platform:"linux/amd64"}
How reproducible:
always
Steps to Reproduce:
1) Copy the operator as OCI format to localhost: `skopeo copy --all docker://registry.redhat.io/redhat/redhat-operator-index:v4.15 oci:///app1/noo/redhat-operator-index --remove-signatures` 2) Use following imagesetconfigure for mirror: cat config-multi-op.yaml kind: ImageSetConfiguration apiVersion: mirror.openshift.io/v1alpha2 mirror: operators: - catalog: oci:///app1/noo/redhat-operator-index packages: - name: odf-operator `oc-mirror --config config-multi-op.yaml file://outmulitop --v2` 3) Do diskTomirror : `oc-mirror --config config-multi-op.yaml --from file://outmulitop --v2 docker://ec2-3-139-239-15.us-east-2.compute.amazonaws.com:5000/multi` 4) Create cluster resource with file: itms-oc-mirror.yaml `oc create -f cs-redhat-operator-index-8bfb449c24d03d6ddbd05d3de9fe7a7dae4a2ecdb8f84487f28d24d6ca2d175c.yaml`
Actual results:
4) The pod for catalogsource not present oc describe catalogsource cs-redhat-operator-index-8bfb449c24d03d6ddbd05d3de9fe7a7dae4a2ecdb8f84487f28d24d6ca2d175c Name: cs-redhat-operator-index-8bfb449c24d03d6ddbd05d3de9fe7a7dae4a2ecdb8f84487f28d24d6ca2d175c Namespace: openshift-marketplace Labels: <none> Annotations: <none> API Version: operators.coreos.com/v1alpha1 Kind: CatalogSource Metadata: Creation Timestamp: 2024-03-29T02:49:47Z Generation: 1 Resource Version: 53264 UID: 69a39693-b29b-4fa4-a6da-de31dc3d521c Spec: Image: ec2-3-139-239-15.us-east-2.compute.amazonaws.com:5000/multi/redhat-operator-index:8bfb449c24d03d6ddbd05d3de9fe7a7dae4a2ecdb8f84487f28d24d6ca2d175c Source Type: grpc Status: Message: couldn't ensure registry server - error ensuring pod: : error creating new pod: cs-redhat-operator-index-8bfb449c24d03d6ddbd05d3de9fe7a7dae4a2ecdb8f84487f28d24d6ca2d175c-: Pod "cs-redhat-operator-index-8bfb449c24d03d6ddbd05d3de9fe7a7da785sd" is invalid: metadata.labels: Invalid value: "cs-redhat-operator-index-8bfb449c24d03d6ddbd05d3de9fe7a7dae4a2ecdb8f84487f28d24d6ca2d175c": must be no more than 63 characters Reason: RegistryServerError Events: <none> cat cs-redhat-operator-index-8bfb449c24d03d6ddbd05d3de9fe7a7dae4a2ecdb8f84487f28d24d6ca2d175c.yaml apiVersion: operators.coreos.com/v1alpha1 kind: CatalogSource metadata: creationTimestamp: null name: cs-redhat-operator-index-8bfb449c24d03d6ddbd05d3de9fe7a7dae4a2ecdb8f84487f28d24d6ca2d175c namespace: openshift-marketplace spec: image: ec2-3-139-239-15.us-east-2.compute.amazonaws.com:5000/multi/redhat-operator-index:8bfb449c24d03d6ddbd05d3de9fe7a7dae4a2ecdb8f84487f28d24d6ca2d175c sourceType: grpc status: {}
Expected results:
4) The catalog source pod runs well.
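Until the generated names are shortened, a hand-edited CatalogSource whose name stays under the 63-character label limit avoids the invalid pod label; a sketch based on the generated file above (the shortened name is illustrative):
```
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  # Shortened name, well under 63 characters, so the pod's generated label is valid.
  name: cs-redhat-operator-index
  namespace: openshift-marketplace
spec:
  image: ec2-3-139-239-15.us-east-2.compute.amazonaws.com:5000/multi/redhat-operator-index:8bfb449c24d03d6ddbd05d3de9fe7a7dae4a2ecdb8f84487f28d24d6ca2d175c
  sourceType: grpc
```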
Please review the following PR: https://github.com/openshift/csi-livenessprobe/pull/56
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
This is a clone of issue OCPBUGS-34261. The following is the description of the original issue:
—
Description of problem:
The CurrentImagePullSecret field on the MachineOSConfig is not being consumed by the rollout process. This is evident when the designated image registry is private and the only way to pull an image is to present a secret.
How reproducible:
Always
Steps to Reproduce:
Actual results:
The node and MachineConfigPool will degrade because rpm-ostree is unable to pull the newly-built image because it does not have access to the credentials even though the MachineOSConfig has a field for them.
Expected results:
Rolling out the newly-built OS image should succeed.
Additional info:
It looks like we'll need make the getImageRegistrySecrets() function aware of all MachineOSConfigs and pull the secrets from there. Where this could be problematic is where there are two image registries with different secrets. This is because the secrets are merged based on the image registry hostname. Instead, what we may want to do is have the MCD write only the contents of the referenced secret to the nodes' filesystem before calling rpm-ostree to consume it. This could potentially also reduce or eliminate the overall complexity introduced by the getImageRegistrySecrets() while simultaneously resolving the concerns found under https://issues.redhat.com//browse/OCPBUGS-33803.
It is worth mentioning that even though we use a private image registry to test the rollout process in OpenShift CI, the reason it works is that it uses an ImageStream which the machine-os-puller service account and its image pull secret are associated with. This secret is surfaced to all of the cluster nodes by the getImageRegistrySecrets() process. So in effect, it may appear to be working when it does not work as intended. A way to test this would be to create an ImageStream in a separate namespace along with a separate pull secret and then attempt to use that ImageStream and pull secret within a MachineOSConfig.
Finally, to add another wrinkle to this problem: If a cluster admin wants to use a different final image pull secret for each MachineConfigPool, merging those will get more difficult. Assuming the image registry has the same hostname, this would lead to the last secret being merged as the winner. And the last secret that gets merged would be the secret that gets used; which may be the incorrect secret.
Description of problem: In Advanced Cluster Security we rely on OCP cluster creation in our CI and recently observed an increase in cluster creation failures. While we've been advised to retry the failures (and we do so now, see ROX-25416), I'm afraid our use case is not so unique and others are affected as well.
We suggest upgrading Terraform and the provider to the latest version (possibly the last one before the license changes) in openshift-installer for 4.12+. The underlying issue is probably already fixed upstream and released in v5.37.0.
Version-Release number of selected component (if applicable): TBD
How reproducible: TBD
Steps to Reproduce: TBD
Actual results: TBD
Expected results: TBD
The most common error we see in our JIRA issues is shown below, and we could find similar issues with the AWS provider too, e.g. OCPBUGS-4213.
level=error msg=Error: Provider produced inconsistent result after apply .... resource was present, but now absent
Summary of errors from:
3 failed to create cluster: failed to apply Terraform: error(GCPComputeBackendTimeout) from Infrastructure Provider: GCP is experiencing backend service interuptions, the compute instance failed to create in reasonable time." 3 Provider produced inconsistent result after apply\n\nWhen applying changes to\nmodule.master.google_service_account.master-node-sa[0], provider\n\"provider[\\\"openshift/local/google\\\"]\" produced an unexpected new value: Root\nresource was present, but now absent.\n\n 6 Error waiting to create Network: Error waiting for Creating Network: timeout while waiting for state to become 'DONE' (last state: 'RUNNING', timeout: 4m0s)\n\n with module.network.google_compute_network.cluster_network[0],\n on network/network.tf line 1, in resource \"google_compute_network\" \"cluster_network\":\n 1: resource \"google_compute_network\" \"cluster_network\" {\n\n" 9 error applying Terraform configs: failed to apply Terraform: error(GCPComputeBackendTimeout) from Infrastructure Provider: GCP is experiencing backend service interuptions, the compute instance failed to create in reasonable time." 14 Provider produced inconsistent result after apply\n\nWhen applying changes to module.master.google_service_account.master-node-sa,\nprovider \"provider[\\\"openshift/local/google\\\"]\" produced an unexpected new\nvalue: Root resource was present, but now absent. 16 Provider produced inconsistent result after apply\n\nWhen applying changes to google_service_account_key.bootstrap, provider\n\"provider[\\\"openshift/local/google\\\"]\" produced an unexpected new value: Root\nresource was present, but now absent. 18 Provider produced inconsistent result after apply\n\nWhen applying changes to module.iam.google_service_account.worker-node-sa,\nprovider \"provider[\\\"openshift/local/google\\\"]\" produced an unexpected new\nvalue: Root resource was present, but now absent. 34 Error creating service account key: googleapi: Error 404: Service account projects/acs-san-stackroxci/serviceAccounts/XXX@acs-san-stackroxci.iam.gserviceaccount.com does not exist., notFound\n\n with google_service_account_key.bootstrap,\n on main.tf line 38, in resource \"google_service_account_key\" \"bootstrap\":\n 38: resource \"google_service_account_key\" \"bootstrap\" {\n\n" 45 error applying Terraform configs: failed to apply Terraform: exit status 1\n\nError: Provider produced inconsistent result after apply\n\nWhen applying changes to\nmodule.master.google_service_account.master-node-sa[0], provider\n\"provider[\\\"openshift/local/google\\\"]\" produced an unexpected new value: Root\nresource was present, but now absent. 59 error applying Terraform configs: failed to apply Terraform: exit status 1\n\nError: Provider produced inconsistent result after apply\n\nWhen applying changes to module.iam.google_service_account.worker-node-sa,\nprovider \"provider[\\\"openshift/local/google\\\"]\" produced an unexpected new\nvalue: Root resource was present, but now absent. 100 Provider produced inconsistent result after apply\n\nWhen applying changes to google_service_account.bootstrap-node-sa, provider\n\"provider[\\\"openshift/local/google\\\"]\" produced an unexpected new value: Root\nresource was present, but now absent. 103 Provider produced inconsistent result after apply\n\nWhen applying changes to module.iam.google_service_account.worker-node-sa[0],\nprovider \"provider[\\\"openshift/local/google\\\"]\" produced an unexpected new\nvalue: Root resource was present, but now absent. 
116 Provider produced inconsistent result after apply\n\nWhen applying changes to\nmodule.master.google_service_account.master-node-sa[0], provider\n\"provider[\\\"openshift/local/google\\\"]\" produced an unexpected new value: Root\nresource was present, but now absent.
The openshift installer contains a bundled terraform and google-provider
These two tests are permafailing on some metal jobs:
[sig-arch][Late] all tls artifacts must be registered [Suite:openshift/conformance/parallel]
[sig-arch][Late] all registered tls artifacts must have no metadata violation regressions [Suite:openshift/conformance/parallel]
It was previously passing 100%, maybe it's tied to recent changes?
https://github.com/openshift/origin/pull/28444
Additional context here:
This is a clone of issue OCPBUGS-42120. The following is the description of the original issue:
—
Description of problem:
After upgrading OCP and LSO to version 4.14, elasticsearch pods in the openshift-logging deployment are unable to schedule to their respective nodes and remain Pending, even though the LSO managed PVs are bound to the PVCs. A test pod using a newly created test PV managed by the LSO is able to schedule correctly however.
Version-Release number of selected component (if applicable):
4.14
How reproducible:
Consistently
Steps to Reproduce:
1. 2. 3.
Actual results:
Pods consuming previously existing LSO managed PVs are unable to schedule and remain in a Pending state after upgrading OCP and LSO to 4.14.
Expected results:
That pods would be able to consume LSO managed PVs and schedule correctly to nodes.
Additional info:
Description of problem:
Adding automountServiceAccountToken: false to the pod spec removes the SA token in the ovnkube-control-plane pod, which causes it to crash with the following error: F1212 12:18:13.705048 1 ovnkube.go:136] unable to create kubernetes rest config, err: TLS-secured apiservers require token/cert and CA certificate. This error is misleading, as the pod doesn't use the KAS and doesn't need the SA token.
Version-Release number of selected component (if applicable):
4.15
How reproducible:
100%
Steps to Reproduce:
1. Add automountServiceAccountToken: false to pod spec in ovnkube-control-plane deployment 2. Check new pod for error 3.
Actual results:
pod crashes with error: unable to create kubernetes rest config, err: TLS-secured apiservers require token/cert and CA certificate.
Expected results:
pod runs without issues
Additional info:
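For reference, the change from step 1 looks roughly like the fragment below (only the relevant part of the ovnkube-control-plane deployment is shown):
```
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ovnkube-control-plane
  namespace: openshift-ovn-kubernetes
spec:
  template:
    spec:
      # Stops the service account token from being mounted into the pod;
      # per this bug, the pod then crashes even though it does not talk to the KAS.
      automountServiceAccountToken: false
```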
This is a clone of issue OCPBUGS-33136. The following is the description of the original issue:
—
Description of problem:
The Compute nodes table does not display correct filesystem data
Version-Release number of selected component (if applicable):
4.16.0-0.ci-2024-04-29-054754
How reproducible:
Always
Steps to Reproduce:
1. In an Openshift cluster 4.16.0-0.ci-2024-04-29-054754 2. Go to the Compute / Nodes menu 3. Check the Filesystem column
Actual results:
There is no storage data displayed
Expected results:
The query is executed correctly and the storage data is displayed correctly
Additional info:
The query has an error, as it is not concatenating things correctly: https://github.com/openshift/console/blob/master/frontend/packages/console-app/src/components/nodes/NodesPage.tsx#L413
This is a clone of issue OCPBUGS-37334. The following is the description of the original issue:
—
Description of problem:
ci/prow/security is failing: k8s.io/client-go/transport
Version-Release number of selected component (if applicable):
4.16
How reproducible:
always
Steps to Reproduce:
1. trigger ci/prow/security on a pull request 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
All images have been removed from quay.io/centos7, and the oc new-app unit tests rely heavily on these images and have started failing. See https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_oc/1716/pull-ci-openshift-oc-master-unit/1773203483667730432
Version-Release number of selected component (if applicable):
probably all
How reproducible:
Open a PR and see that pre-submit unit test fails
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
[sig-cluster-lifecycle] pathological event should not see excessive Back-off restarting failed containers for ns/openshift-kni-infra
Test is now passing near 0% on metal-ipi.
Started around Feb 10th.
event [namespace/openshift-kni-infra node/master-1.ostest.test.metalkube.org pod/haproxy-master-1.ostest.test.metalkube.org hmsg/64785a22cf - Back-off restarting failed container haproxy in pod haproxy-master-1.ostest.test.metalkube.org_openshift-kni-infra(336080d8c1b455c151170524132c026d)] happened 295 times
Possible relation to haproxy 2.8 merge?
Logs indicate the error is:
/bin/bash: line 47: socat: command not found
Please review the following PR: https://github.com/openshift/cluster-csi-snapshot-controller-operator/pull/183
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Single line execute markdown reference is not working.
The inline code is not getting rendered properly specifically for single line execute syntax.
The inline code should show a code block with a small execute icon to run the commands in web terminal
Description of problem:
When there is a new update for the cluster, clicking "Select a version" on the Cluster Settings page has no effect.
Version-Release number of selected component (if applicable):
4.15.0-0.nightly-2023-12-19-033450
How reproducible:
Always
Steps to Reproduce:
1. Prepare a cluster with an available update. 2. Go to the Cluster Settings page and choose a version by clicking on the "Select a version" button. 3.
Actual results:
2. There is no response when clicking on the button; the user cannot select a version from the page.
Expected results:
2. A modal should show up for the user to select a version after clicking the "Select a version" button
Additional info:
screenshot: https://drive.google.com/file/d/1Kpyu0kUKFEQczc5NVEcQFbf_uly_S60Y/view?usp=sharing
This is a clone of issue OCPBUGS-30841. The following is the description of the original issue:
—
Description of problem:
PAC provides the log link in git to see the log of the PLR, which is broken on 4.15 after this change: https://github.com/openshift/console/pull/13470. That PR changed the log URL after the react-router package upgrade.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
PR https://github.com/openshift/monitoring-plugin/pull/83 was intended to just modify the images built for local testing, but accidentally changed the default Dockerfile, leading to a mismatch between the nginx config and the Dockerfile used in CI. This causes the monitoring-plugin to fail to load in CI builds.
It has been shown that running the conformance test suite on HyperShift hosted clusters with the KubeVirt provider is far more stable than on their metal counterparts. In order to get the conformance suite passing on Azure, we need to skip a single test that sends pings (ICMP) to the Internet, because Azure blocks ICMP.
This is a clone of issue OCPBUGS-34493. The following is the description of the original issue:
—
Description of problem:
Failed to deploy baremetal cluster as cluster nodes are not introspected
Version-Release number of selected component (if applicable):
4.15.15
How reproducible:
periodically
Steps to Reproduce:
1. Deploy a baremetal dual-stack cluster with the provisioning network disabled. 2. 3.
Actual results:
Cluster fails to deploy as ironic.service fails to start on the bootstrap node:
[root@api ~]# systemctl status ironic.service
○ ironic.service - Ironic baremetal deployment service
Loaded: loaded (/etc/containers/systemd/ironic.container; generated)
Active: inactive (dead)
May 27 08:01:05 api.kni-qe-4.lab.eng.rdu2.redhat.com systemd[1]: Dependency failed for Ironic baremetal deployment service.
May 27 08:01:05 api.kni-qe-4.lab.eng.rdu2.redhat.com systemd[1]: ironic.service: Job ironic.service/start failed with result 'dependency'.
Expected results:
ironic.service is started, nodes are introspected and cluster is deployed
Additional info:
Description of problem:
The checkbox should be displayed on a single row, e.g. for 'Deny all ingress traffic' and 'Deny all egress traffic' on the Create NetworkPolicy page, and for 'Secure Route' on the Create Route page.
Version-Release number of selected component (if applicable):
4.15.0-0.nightly-2023-11-14-082209
How reproducible:
Always
Steps to Reproduce:
1. Go to the Networking -> NetworkPolicies page and click the 'Create NetworkPolicy' button. 2. In the Policy type section, check whether the checkboxes for 'Deny all ingress traffic' and 'Deny all egress traffic' are displayed in a single row. 3. Check the same thing on the 'Create route' page.
Actual results:
The checkboxes are not displayed in a single row.
Expected results:
The checkboxes are displayed in a single row.
Additional info:
https://drive.google.com/file/d/1xgEe-CuuRYrY9tBFmIa-7o5Rcn7iCr1e/view?usp=drive_link
This is a clone of issue OCPBUGS-34708. The following is the description of the original issue:
—
Description of problem:
Failed job: https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_cluster-ingress-operator/1023/pull-ci-openshift-cluster-ingress-operator-master-e2e-aws-gatewayapi/1796261717831847936
The following error is seen:
level=error msg=failed to fetch Cluster: failed to generate asset "Cluster": failed to create cluster: error unpacking terraform: could not unpack the directory for the aws provider: open mirror/openshift/local/aws: file does not exist
Version-Release number of selected component (if applicable):
4.16/4.17
How reproducible:
100%
Steps to Reproduce:
1. Create an AWS cluster with the "CustomNoUpgrade" featureSet configured in install-config.yaml:
install-config.yaml
----------------------
featureSet: CustomNoUpgrade
featureGates: [GatewayAPIEnabled=true]
2.
Actual results:
level=error msg=failed to fetch Cluster: failed to generate asset "Cluster": failed to create cluster: error unpacking terraform: could not unpack the directory for the aws provider: open mirror/openshift/local/aws: file does not exist
Expected results:
install should be successful
Additional info:
The workaround is to add ClusterAPIInstallAWS=true to the feature gates as well, e.g. featureSet: CustomNoUpgrade, featureGates: [GatewayAPIEnabled=true,ClusterAPIInstallAWS=true]. See the sketch below.
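For reference, a minimal sketch of the relevant install-config.yaml fragment with the workaround applied (only the featureSet/featureGates fields are shown; everything else in the install-config is unchanged):
    featureSet: CustomNoUpgrade
    featureGates:
    - GatewayAPIEnabled=true
    - ClusterAPIInstallAWS=true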
discussion thread: https://redhat-internal.slack.com/archives/C68TNFWA2/p1716887301410459
Description of problem:
Going to a PVC's "VolumeSnapshots" tab shows the error "Oh no! Something went wrong."
Version-Release number of selected component (if applicable):
4.15.0-0.nightly-2024-01-03-140457
How reproducible:
Always
Steps to Reproduce:
1. Create a PVC in a project and go to the PVC's "VolumeSnapshots" tab. 2. 3.
Actual results:
1. The error "Oh no! Something went wrong." shows up on the page.
Expected results:
1. The VolumeSnapshots related to the PVC should be shown without an error.
Additional info:
screenshot: https://drive.google.com/file/d/1l0i0DCFh_q9mvFHxnftVJL0AM1LaKFOO/view?usp=sharing
Please review the following PR: https://github.com/openshift/cluster-etcd-operator/pull/1187
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/cloud-provider-azure/pull/102
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
NodePools fail to provision if no subnet is specified.
The Subnet field (https://github.com/openshift/hypershift/blob/main/api/hypershift/v1beta1/nodepool_types.go#L760) should be required.
Multi-arch compute clusters have an issue where the cluster version's image ref is single arch, so this change resolves the image ref without spinning up a pod.
Description of problem:
Version shown for `oc-mirror --v2 version` should be similar to `oc-mirror version`
Version-Release number of selected component (if applicable):
oc-mirror version WARNING: This version information is deprecated and will be replaced with the output from --short. Use --output=yaml|json to get the full version. Client Version: version.Info{Major:"", Minor:"", GitVersion:"4.16.0-202403070215.p0.gc4f8295.assembly.stream.el9-c4f8295", GitCommit:"c4f829512107f7d0f52a057cd429de2030b9b3b3", GitTreeState:"clean", BuildDate:"2024-03-07T03:46:24Z", GoVersion:"go1.21.7 (Red Hat 1.21.7-1.el9) X:strictfipsruntime", Compiler:"gc", Platform:"linux/amd64"}
How reproducible:
always
Steps to Reproduce:
1) `oc-mirror --v2 -v`
Actual results:
oc-mirror --v2 -v
--v2 flag identified, flow redirected to the oc-mirror v2 version. PLEASE DO NOT USE that. V2 is still under development and it is not ready to be used.
oc-mirror version v2.0.0-dev-01
Expected results:
oc-mirror version --output=yaml
clientVersion:
  buildDate: "2024-03-07T03:46:24Z"
  compiler: gc
  gitCommit: c4f829512107f7d0f52a057cd429de2030b9b3b3
  gitTreeState: clean
  gitVersion: 4.16.0-202403070215.p0.gc4f8295.assembly.stream.el9-c4f8295
  goVersion: go1.21.7 (Red Hat 1.21.7-1.el9) X:strictfipsruntime
  major: ""
  minor: ""
  platform: linux/amd64
Description of problem:
Cluster with user provisioned image registry storage accounts fails to upgrade to 4.14.20 due to image-registry-operator being degraded. message: "Progressing: The registry is ready\nNodeCADaemonProgressing: The daemon set node-ca is deployed\nAzurePathFixProgressing: Migration failed: panic: AZURE_CLIENT_ID is required for authentication\nAzurePathFixProgressing: \nAzurePathFixProgressing: goroutine 1 [running]:\nAzurePathFixProgressing: main.main()\nAzurePathFixProgressing: \t/go/src/github.com/openshift/cluster-image-registry-operator/cmd/move-blobs/main.go:25 +0x15c\nAzurePathFixProgressing: " cmd/move-blobs was introduced due to https://issues.redhat.com/browse/OCPBUGS-29003.
Version-Release number of selected component (if applicable):
4.14.15+
How reproducible:
I have not reproduced it myself, but I imagine you would hit this every time when upgrading from 4.13 to 4.14.15+ with an Azure UPI image registry.
Steps to Reproduce:
1. Starting on version 4.13, configure the registry for Azure user-provisioned infrastructure - https://docs.openshift.com/container-platform/4.14/registry/configuring_registry_storage/configuring-registry-storage-azure-user-infrastructure.html. 2. Upgrade to 4.14.15+. 3.
Actual results:
The upgrade does not complete successfully:
$ oc get co
....
image-registry 4.14.20 True False True 617d AzurePathFixControllerDegraded: Migration failed: panic: AZURE_CLIENT_ID is required for authentication...
$ oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.13.38 True True 7h41m Unable to apply 4.14.20: wait has exceeded 40 minutes for these operators: image-registry
Expected results:
Upgrade to complete successfully
Additional info:
This is a clone of issue OCPBUGS-32186. The following is the description of the original issue:
—
Description of problem:
The self-managed hypershift cli (hcp) reports an inaccurate OCP supported version. For example, if I have a hypershift-operator deployed which supports OCP v4.14 and I build the hcp cli from the latest source code, when I execute "hcp -v", the cli tool reports the following. $ hcp -v hcp version openshift/hypershift: 02bf7af8789f73c7b5fc8cc0424951ca63441649. Latest supported OCP: 4.16.0 This makes it appear that the hcp cli is capable of deploying OCP v4.16.0, when the backend is actually limited to v4.14.0. The cli needs to indicate what the server is capable of deploying. Otherwise it appears that v4.16.0 would be deployable in this scenario, but the backend would not allow that.
Version-Release number of selected component (if applicable):
4.14
How reproducible:
100%
Steps to Reproduce:
1. download an HCP client that does not match the hypershift-operator backend 2. execute 'hcp -v' 3. the reported "Latest supported OCP" is not representative of the version the hypershift-operator actually supports
Actual results:
Expected results:
hcp cli reports a latest OCP version that is representative of what the deployed hypershift operator is capable of deploying.
Additional info:
Please review the following PR: https://github.com/openshift/openstack-cinder-csi-driver-operator/pull/145
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
Hit errors when generating the pruning plan for the delete phase using --generate for mirror-to-mirror.
Version-Release number of selected component (if applicable):
oc-mirror version WARNING: This version information is deprecated and will be replaced with the output from --short. Use --output=yaml|json to get the full version.Client Version: version.Info{Major:"", Minor:"", GitVersion:"4.16.0-202404191609.p0.g9ac063b.assembly.stream.el9-9ac063b", GitCommit:"9ac063b0b88466183a50287af277c5ed40a8e238", GitTreeState:"clean", BuildDate:"2024-04-19T22:03:51Z", GoVersion:"go1.21.7 (Red Hat 1.21.7-1.el9) X:strictfipsruntime", Compiler:"gc", Platform:"linux/amd64"}
How reproducible:
always
Steps to Reproduce:
1) Use the following ISC to do mirror-to-mirror for v2:
cat config.yaml
kind: ImageSetConfiguration
apiVersion: mirror.openshift.io/v1alpha2
mirror:
  platform:
    channels:
    - name: stable-4.15
  additionalImages:
  - name: registry.redhat.io/ubi8/ubi:latest
  - name: registry.redhat.io/ubi8/ubi-minimal:latest
  operators:
  - catalog: registry.redhat.io/redhat/redhat-operator-index:v4.14
    packages:
    - name: 3scale-operator
  - catalog: oci:///app1/ibm-catalog
    targetTag: "v14"
    targetCatalog: "zhouy/catalog"
  - catalog: oci:///app1/noo/redhat-operator-index
    packages:
    - name: cluster-kube-descheduler-operator
    - name: advanced-cluster-management
`oc-mirror --config config.yaml --v2 docker://xxx.com:5000/m2m --workspace file:///app1/0416/clid20/`
2) Generate the pruning plan for the delete phase using --generate:
cat config-delete.yaml
apiVersion: mirror.openshift.io/v1alpha2
kind: DeleteImageSetConfiguration
delete:
  platform:
    channels:
    - name: stable-4.15
  additionalImages:
  - name: registry.redhat.io/ubi8/ubi:latest
  - name: registry.redhat.io/ubi8/ubi-minimal:latest
  operators:
  - catalog: registry.redhat.io/redhat/redhat-operator-index:v4.14
    packages:
    - name: 3scale-operator
  - catalog: oci:///app1/ibm-catalog
    targetTag: "v14"
    targetCatalog: "zhouy/catalog"
  - catalog: oci:///app1/noo/redhat-operator-index
    packages:
    - name: cluster-kube-descheduler-operator
    - name: advanced-cluster-management
`oc-mirror delete --config config-delete.yaml --workspace file://clid20 --v2 --generate docker://xxx.com:5000/m2m`
Actual results:
2) Many errors for generate command: 2024/04/22 10:02:29 [ERROR] : reading manifest d8e94620237da97e1b65dac4fb616d21d13e2fea08c9385145a02ad3fbd59d88 in localhost:55000/3scale-amp2/3scale-rhel7-operator-metadata: manifest unknown image : map[] 2024/04/22 10:02:29 [ERROR] : [delete-images] reading manifest d8e94620237da97e1b65dac4fb616d21d13e2fea08c9385145a02ad3fbd59d88 in localhost:55000/3scale-amp2/3scale-rhel7-operator-metadata: manifest unknown 2024/04/22 10:02:29 [ERROR] : reading manifest 4f72f049436af1a940833c61b075d84ad5910c7bb2df2a8de99618995c067dfe in localhost:55000/3scale-amp2/3scale-rhel7-operator: manifest unknown image : map[] 2024/04/22 10:02:29 [ERROR] : [delete-images] reading manifest 4f72f049436af1a940833c61b075d84ad5910c7bb2df2a8de99618995c067dfe in localhost:55000/3scale-amp2/3scale-rhel7-operator: manifest unknown 2024/04/22 10:02:29 [ERROR] : reading manifest 4f72f049436af1a940833c61b075d84ad5910c7bb2df2a8de99618995c067dfe in localhost:55000/3scale-amp2/3scale-rhel7-operator: manifest unknown image : map[] 2024/04/22 10:02:29 [ERROR] : [delete-images] reading manifest 4f72f049436af1a940833c61b075d84ad5910c7bb2df2a8de99618995c067dfe in localhost:55000/3scale-amp2/3scale-rhel7-operator: manifest unknown 2024/04/22 10:02:29 [ERROR] : reading manifest 68b310ed3cfd65db893ba015ef1d5442365201c0ced006c1915e90edb99933ea in localhost:55000/3scale-amp2/apicast-gateway-rhel8: manifest unknown image : map[] 2024/04/22 10:02:29 [ERROR] : [delete-images] reading manifest 68b310ed3cfd65db893ba015ef1d5442365201c0ced006c1915e90edb99933ea in localhost:55000/3scale-amp2/apicast-gateway-rhel8: manifest unknown 2024/04/22 10:02:29 [ERROR] : reading manifest 2b8525c55cfbd5b5d66b50868ebd8fe6f468b10715653d047cdae25fa28e5983 in localhost:55000/3scale-amp2/backend-rhel8: manifest unknown image : map[] 2024/04/22 10:02:29 [ERROR] : [delete-images] reading manifest 2b8525c55cfbd5b5d66b50868ebd8fe6f468b10715653d047cdae25fa28e5983 in localhost:55000/3scale-amp2/backend-rhel8: manifest unknown 2024/04/22 10:02:29 [ERROR] : reading manifest 97e6355fcfadf7fea1f30cc8bbf833c8a518655cef4d63051df29fbcde2c1f00 in localhost:55000/3scale-amp2/memcached-rhel7: manifest unknown image : map[]
Expected results:
2) no error
Description of problem:
Trying to create a second cluster using the same cluster name and base domain as the first cluster fails, as expected, because of the DNS record-set conflicts. But deleting the second cluster leaves the first cluster inaccessible, which is unexpected.
Version-Release number of selected component (if applicable):
4.15.0-0.nightly-2024-01-14-100410
How reproducible:
Always
Steps to Reproduce:
1. Create the first cluster and make sure it succeeds. 2. Try to create a second cluster with the same cluster name, base domain, and region, and confirm that it fails. 3. Destroy the second cluster, which failed due to the "Platform Provisioning Check". 4. Check whether the first cluster is still healthy.
Actual results:
The first cluster turns unhealthy because its DNS record-sets are deleted by step 3.
Expected results:
The DNS record-sets of the first cluster should stay untouched during step 3, and the first cluster should stay healthy after step 3.
Additional info:
(1) the first cluster is by Flexy-install job https://mastern-jenkins-csb-openshift-qe.apps.ocp-c1.prod.psi.redhat.com/job/ocp-common/job/Flexy-install/257549/, and it's healthy initially $ oc get clusterversion NAME VERSION AVAILABLE PROGRESSING SINCE STATUS version 4.15.0-0.nightly-2024-01-14-100410 True False 54m Cluster version is 4.15.0-0.nightly-2024-01-14-100410 $ oc get nodes NAME STATUS ROLES AGE VERSION jiwei-0115y-lgns8-master-0.c.openshift-qe.internal Ready control-plane,master 73m v1.28.5+c84a6b8 jiwei-0115y-lgns8-master-1.c.openshift-qe.internal Ready control-plane,master 73m v1.28.5+c84a6b8 jiwei-0115y-lgns8-master-2.c.openshift-qe.internal Ready control-plane,master 74m v1.28.5+c84a6b8 jiwei-0115y-lgns8-worker-a-gqq96.c.openshift-qe.internal Ready worker 62m v1.28.5+c84a6b8 jiwei-0115y-lgns8-worker-b-2h9xd.c.openshift-qe.internal Ready worker 63m v1.28.5+c84a6b8 $ (2) try to create the second cluster and expect failing due to dns record already exists $ openshift-install version openshift-install 4.15.0-0.nightly-2024-01-14-100410 built from commit b6f320ab7eeb491b2ef333a16643c140239de0e5 release image registry.ci.openshift.org/ocp/release@sha256:385d84c803c776b44ce77b80f132c1b6ed10bd590f868c97e3e63993b811cc2d release architecture amd64 $ mkdir test1 $ cp install-config.yaml test1 $ yq-3.3.0 r test1/install-config.yaml baseDomain qe.gcp.devcluster.openshift.com $ yq-3.3.0 r test1/install-config.yaml metadata creationTimestamp: null name: jiwei-0115y $ yq-3.3.0 r test1/install-config.yaml platform gcp: projectID: openshift-qe region: us-central1 $ openshift-install create cluster --dir test1 INFO Credentials loaded from file "/home/fedora/.gcp/osServiceAccount.json" INFO Consuming Install Config from target directory FATAL failed to fetch Terraform Variables: failed to fetch dependency of "Terraform Variables": failed to generate asset "Platform Provisioning Check": metadata.name: Invalid value: "jiwei-0115y": record(s) ["api.jiwei-0115y.qe.gcp.devcluster.openshift.com."] already exists in DNS Zone (openshift-qe/qe) and might be in use by another cluster, please remove it to continue $ (3) delete the second cluster $ openshift-install destroy cluster --dir test1 INFO Credentials loaded from file "/home/fedora/.gcp/osServiceAccount.json" INFO Deleted 2 recordset(s) in zone qe INFO Deleted 3 recordset(s) in zone jiwei-0115y-lgns8-private-zone WARNING Skipping deletion of DNS Zone jiwei-0115y-lgns8-private-zone, not created by installer INFO Time elapsed: 37s INFO Uninstallation complete! $ (4) check the first cluster status and the dns record-sets $ oc get clusterversion Unable to connect to the server: dial tcp: lookup api.jiwei-0115y.qe.gcp.devcluster.openshift.com on 10.11.5.160:53: no such host $ $ gcloud dns managed-zones describe jiwei-0115y-lgns8-private-zone cloudLoggingConfig: kind: dns#managedZoneCloudLoggingConfig creationTime: '2024-01-15T07:22:55.199Z' description: Created By OpenShift Installer dnsName: jiwei-0115y.qe.gcp.devcluster.openshift.com. id: '9193862213315831261' kind: dns#managedZone labels: kubernetes-io-cluster-jiwei-0115y-lgns8: owned name: jiwei-0115y-lgns8-private-zone nameServers: - ns-gcp-private.googledomains.com. 
privateVisibilityConfig: kind: dns#managedZonePrivateVisibilityConfig networks: - kind: dns#managedZonePrivateVisibilityConfigNetwork networkUrl: https://www.googleapis.com/compute/v1/projects/openshift-qe/global/networks/jiwei-0115y-lgns8-network visibility: private $ gcloud dns record-sets list --zone jiwei-0115y-lgns8-private-zone NAME TYPE TTL DATA jiwei-0115y.qe.gcp.devcluster.openshift.com. NS 21600 ns-gcp-private.googledomains.com. jiwei-0115y.qe.gcp.devcluster.openshift.com. SOA 21600 ns-gcp-private.googledomains.com. cloud-dns-hostmaster.google.com. 1 21600 3600 259200 300 $ gcloud dns record-sets list --zone qe --filter='name~jiwei-0115y' Listed 0 items. $
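A hedged precaution, not part of the original report: before destroying a failed duplicate cluster, the shared public zone can be checked for record-sets that may belong to another live cluster (the zone and cluster names below match the example above):
    $ gcloud dns record-sets list --zone qe --filter='name~jiwei-0115y'
    # If api.<cluster>.<baseDomain> records are listed here and another cluster with the
    # same name is still running, "openshift-install destroy cluster" would remove records
    # that the live cluster depends on.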
Description of problem:
When creating a HostedCluster with the 'NodePort' service publishing strategy, the VMs (guest nodes) try to contact HCP services such as ignition and oauth. If these services are colocated on the same infra node as the VM, they can't be reached via NodePort because the 'virt-launcher' NetworkPolicy blocks the traffic. We need to explicitly add access to the oauth and ignition-server-proxy pods so they can be reached from the virtual machines on the same node.
Version-Release number of selected component (if applicable):
4.16.0
How reproducible:
Always, if conditions are met
Steps to Reproduce:
1. As described above 2. 3.
Actual results:
VMs are not joining the cluster as nodes if the ignition server is running on the same infra node as the VM.
Expected results:
All VMs are joining the cluster as nodes, and the HostedCluster is eventually Completed and Available
Additional info:
This is a clone of issue OCPBUGS-14963. The following is the description of the original issue:
—
Description of problem:
When using IPI for IBM Cloud to create a Private BYON cluster, the installer attempts to fetch the VPC resource to verify whether it is already a PermittedNetwork for the DNS Services zone. However, there is currently a new VPC region listed in IBM Cloud, eu-es, which has not yet GA'd. This means that while eu-es is listed among the available VPC regions to search for resources, requests to it fail. Any attempt to use a VPC region that sorts alphabetically after eu-es (the regions appear to be returned in this order) fails because requests are made to eu-es. This includes eu-gb, us-east, and us-south, and causes a golang panic.
Version-Release number of selected component (if applicable):
4.12
How reproducible:
100%
Steps to Reproduce:
1. Create IBM Cloud BYON resources in us-east or us-south 2. Attempt to create a Private BYON based cluster in us-east or us-south
Actual results:
DEBUG Fetching Common Manifests... DEBUG Reusing previously-fetched Common Manifests DEBUG Generating Terraform Variables... panic: runtime error: invalid memory address or nil pointer dereference [signal SIGSEGV: segmentation violation code=0x1 addr=0x10 pc=0x2bdb706] goroutine 1 [running]: github.com/openshift/installer/pkg/asset/installconfig/ibmcloud.(*Metadata).IsVPCPermittedNetwork(0xc000e89b80, {0x1a8b9918, 0xc00007c088}, {0xc0009d8678, 0x8}) /go/src/github.com/openshift/installer/pkg/asset/installconfig/ibmcloud/metadata.go:175 +0x186 github.com/openshift/installer/pkg/asset/cluster.(*TerraformVariables).Generate(0x1dc55040, 0x5?) /go/src/github.com/openshift/installer/pkg/asset/cluster/tfvars.go:606 +0x3a5a github.com/openshift/installer/pkg/asset/store.(*storeImpl).fetch(0xc000ca0d80, {0x1a8ab280, 0x1dc55040}, {0x0, 0x0}) /go/src/github.com/openshift/installer/pkg/asset/store/store.go:227 +0x5fa github.com/openshift/installer/pkg/asset/store.(*storeImpl).Fetch(0x7ffd948754cc?, {0x1a8ab280, 0x1dc55040}, {0x1dc32840, 0x8, 0x8}) /go/src/github.com/openshift/installer/pkg/asset/store/store.go:77 +0x48 main.runTargetCmd.func1({0x7ffd948754cc, 0xb}) /go/src/github.com/openshift/installer/cmd/openshift-install/create.go:261 +0x125 main.runTargetCmd.func2(0x1dc38800?, {0xc000ca0a80?, 0x3?, 0x3?}) /go/src/github.com/openshift/installer/cmd/openshift-install/create.go:291 +0xe7 github.com/spf13/cobra.(*Command).execute(0x1dc38800, {0xc000ca0a20, 0x3, 0x3}) /go/src/github.com/openshift/installer/vendor/github.com/spf13/cobra/command.go:876 +0x67b github.com/spf13/cobra.(*Command).ExecuteC(0xc000bc8000) /go/src/github.com/openshift/installer/vendor/github.com/spf13/cobra/command.go:990 +0x3bd github.com/spf13/cobra.(*Command).Execute(...) /go/src/github.com/openshift/installer/vendor/github.com/spf13/cobra/command.go:918 main.installerMain() /go/src/github.com/openshift/installer/cmd/openshift-install/main.go:61 +0x2b0 main.main() /go/src/github.com/openshift/installer/cmd/openshift-install/main.go:38 +0xff
Expected results:
Successful Private cluster creation using BYON on IBM Cloud
Additional info:
IBM Cloud development has identified the issue and is working on a fix to all affected supported releases (4.12, 4.13, 4.14+)
Description of problem:
The following test fails: "[sig-apps][Feature:DeploymentConfig] deploymentconfigs when tagging images should successfully tag the deployed image [apigroup:apps.openshift.io][apigroup:authorization.openshift.io][apigroup:image.openshift.io] [Skipped:Disconnected] [Suite:openshift/conformance/parallel]". One pod is stuck in Pending because it requests 3G of memory that the node doesn't have (there are 2 pods requesting 3G each).
Version-Release number of selected component (if applicable):
4.14.17
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
After a manual crash of an OCP node, the OSPD VM running on that node is stuck in the Terminating state.
Version-Release number of selected component (if applicable):
OCP 4.12.15 osp-director-operator.v1.3.0 kubevirt-hyperconverged-operator.v4.12.5
How reproducible:
Log in to an OCP 4.12.15 node running a VM and manually crash the master node. After the reboot, the VM stays in the Terminating state.
Steps to Reproduce:
1. ssh core@masterX 2. sudo su 3. echo c > /proc/sysrq-trigger
Actual results:
After the reboot the VM stays in the Terminating state:
$ omc get node|sed -e 's/modl4osp03ctl/model/g' | sed -e 's/telecom.tcnz.net/aaa.bbb.ccc/g'
NAME STATUS ROLES AGE VERSION
model01.aaa.bbb.ccc Ready control-plane,master,worker 91d v1.25.8+37a9a08
model02.aaa.bbb.ccc Ready control-plane,master,worker 91d v1.25.8+37a9a08
model03.aaa.bbb.ccc Ready control-plane,master,worker 91d v1.25.8+37a9a08
$ omc get pod -n openstack
NAME READY STATUS RESTARTS AGE
openstack-provision-server-7b79fcc4bd-x8kkz 2/2 Running 0 8h
openstackclient 1/1 Running 0 7h
osp-director-operator-controller-manager-5896b5766b-sc7vm 2/2 Running 0 8h
osp-director-operator-index-qxxvw 1/1 Running 0 8h
virt-launcher-controller-0-9xpj7 1/1 Running 0 20d
virt-launcher-controller-1-5hj9x 1/1 Running 0 20d
virt-launcher-controller-2-vhd69 0/1 NodeAffinity 0 43d
$ omc describe pod virt-launcher-controller-2-vhd69 |grep Status:
Status: Terminating (lasts 37h)
$ xsos sosreport-xxxx/|grep time
...
Boot time: Wed Nov 22 01:44:11 AM UTC 2023
Uptime: 8:27, 0 users
Expected results:
VM restart automatically OR does not stay in Terminating state
Additional info:
The issue has been seen two times. The first time, a kernel crash occurred and the associated VM on the node was left in the Terminating state. The second time, we tried to reproduce the issue by manually crashing the kernel and got the same result: the VM running on the OCP node stays in the Terminating state.
The MetricsServer feature gate was made GA (https://github.com/openshift/api/pull/1851/files); a few nightly payload jobs showed failures after this.
Description of problem:
Disable:Broken for [sig-builds][Feature:Builds][Slow] can use private repositories as build input build using an HTTP token should be able to clone source code via an HTTP token [apigroup:build.openshift.io]
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
Incorrect help info for loglevel when using --v2 flag
Version-Release number of selected component (if applicable):
oc-mirror version WARNING: This version information is deprecated and will be replaced with the output from --short. Use --output=yaml|json to get the full version. Client Version: version.Info{Major:"", Minor:"", GitVersion:"4.16.0-202403070215.p0.gc4f8295.assembly.stream.el9-c4f8295", GitCommit:"c4f829512107f7d0f52a057cd429de2030b9b3b3", GitTreeState:"clean", BuildDate:"2024-03-07T03:46:24Z", GoVersion:"go1.21.7 (Red Hat 1.21.7-1.el9) X:strictfipsruntime", Compiler:"gc", Platform:"linux/amd64"}
How reproducible:
always
Steps to Reproduce:
1) Check the `oc-mirror --v2 -h` help info. 2) Use the command `oc-mirror --config config.yaml --from file://out docker://xxxxxx.com:5000/test --v2 --max-nested-paths 2 --loglevel trace`
Actual results:
oc-mirror --config config.yaml --from file://out docker://ec2-18-188-118-33.us-east-2.compute.amazonaws.com:5000/test --v2 --max-nested-paths 2 --loglevel trace
--v2 flag identified, flow redirected to the oc-mirror v2 version. PLEASE DO NOT USE that. V2 is still under development and it is not ready to be used.
2024/03/20 07:18:30 [INFO] : mode diskToMirror
2024/03/20 07:18:30 [TRACE] : creating signatures directory out/working-dir/signatures
2024/03/20 07:18:30 [TRACE] : creating release images directory out/working-dir/release-images
2024/03/20 07:18:30 [TRACE] : creating release cache directory out/working-dir/hold-release
2024/03/20 07:18:30 [TRACE] : creating operator cache directory out/working-dir/hold-operator
2024/03/20 07:18:30 [ERROR] : error parsing local storage configuration : invalid loglevel trace Must be one of [error, warn, info, debug]
Expected results:
Show correct help information
Additional info:
The same applies to `--strict-archive archiveSize`; the information does not make clear how to use it.
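A minimal usage sketch, assuming the valid values reported in the error output above ([error, warn, info, debug]); the registry placeholder is reused from the reproduction steps:
    oc-mirror --config config.yaml --from file://out docker://xxxxxx.com:5000/test --v2 --loglevel debug
    # the v2 flow rejects "trace"; per the error above, only error, warn, info and debug are accepted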
Description of problem:
A new feature in Ironic clears the non-OS disks during the BMH installation. It only works for disks with a block size of 512.
Customer says the following:
This is an unlisted new feature (or enhancement) in OCP 4.14; this non-OS disk wiping during BMH installation is not available in 4.12.
Version-Release number of selected component (if applicable):
The following command is generated by Ironic:
"dd bs=512 if=/dev/zero of=/dev/sda count=33 oflag=direct"
It fails with:
ironic-agent[4054]: Command: dd bs=512 if=/dev/zero of=/dev/sdb count=33 oflag=direct
ironic-agent[4054]: Exit code: 1
podman[4014]: 2024-03-27 04:00:48.240 1 ERROR root [-] Unexpected error dispatching erase_devices_metadata to manager <ironic_python_agent.hardware.GenericHardwareManager object at 0x7f050797f2e0>: Error erasing block device: Failed to erase the metadata on the device(s): "/dev/sdb": Unexpected error while running command.
How reproducible:
Repeatable for a BMH whose disk has a block size larger than 512.
Steps to Reproduce:
1. This problem occurs on servers with a disk whose block size is greater than 512. For example, the SAMSUNG drive with p/n KR-05RJND-SSK00-389-02DF-A02 has a block size of 4096. 2. Add a BMH which has a non-OS disk with a block size greater than 512. 2a. The introspection of the BMH will be fine. 2b. When the BMH is added (provisioning phase), the OCP installation will try to wipe the non-OS drive using the command "dd bs=512 if=/dev/zero of=/dev/sda count=33 oflag=direct". This can be monitored on the BMH via "journalctl -f -b". In the case of a disk with a block size of 4096, the above dd command will be rejected. The BMH will be rebooted and will attempt step 2b again. 3. Manual testing: on a BMH server with disks that have a block size greater than 512 (tested with disks that have bs=4096), the command "dd bs=512 if=/dev/zero of=/dev/sdb count=33" fails. The alternate command, which determines the disk's block size for the dd command, works: "dd bs=$(blockdev --getss /dev/sdb) if=/dev/zero of=/dev/sdb count=33 oflag=direct"
Actual results:
Expected results:
The disk is wiped successfully regardless of its block size.
Additional logs:
In OCP 4.14.16, there is a new feature in Ironic to clear the non-OS disks during the BMH installation. The following commands are generated by Ironic:
"dd bs=512 if=/dev/zero of=/dev/sda count=33 oflag=direct"
"dd bs=512 if=/dev/zero of=/dev/sdb count=33 oflag=direct"
The reason for the failure is the alignment restriction; the logical block size differs depending on the disk type.
Perhaps instead of
"dd bs=512 if=/dev/zero of=/dev/sda count=33 oflag=direct"
it could be replaced by
"dd bs=$(blockdev --getss /dev/sda) if=/dev/zero of=/dev/sda count=33 oflag=direct"
which would work for various types of disks (see the sketch below).
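A minimal shell sketch of the suggested approach, querying each device's logical block size before wiping its metadata (the device names are examples only; this is not the actual Ironic implementation):
    for dev in /dev/sda /dev/sdb; do
      bs=$(blockdev --getss "$dev")                      # logical block size, e.g. 512 or 4096
      dd bs="$bs" if=/dev/zero of="$dev" count=33 oflag=direct
    done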
~~~~ THE IRONIC ERROR in OCP 4.14.16 ~~~~
-agent[4054]: 2024-03-27 04:00:48.240 1 ERROR root [-] Unexpected error dispatching erase_devices_metadata to manager <ironic_python_agent.hardware.GenericHardwareManager object at 0x7f050797f2e0>: Error erasing block device: Failed to erase the metadata on the device(s): "/dev/sdb": Unexpected error while running command.
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com ironic-agent[4054]: Command: dd bs=512 if=/dev/zero of=/dev/sdb count=33 oflag=direct
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com ironic-agent[4054]: Exit code: 1
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: 2024-03-27 04:00:48.240 1 ERROR root [-] Unexpected error dispatching erase_devices_metadata to manager <ironic_python_agent.hardware.GenericHardwareManager object at 0x7f050797f2e0>: Error erasing block device: Failed to erase the metadata on the device(s): "/dev/sdb": Unexpected error while running command.
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: Command: dd bs=512 if=/dev/zero of=/dev/sdb count=33 oflag=direct
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: Exit code: 1
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: Stdout: ''
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: Stderr: "dd: error writing '/dev/sdb': Invalid argument\n1+0 records in\n0+0 records out\n0 bytes copied, 3.6049e-05 s, 0.0 kB/s\n"; "/dev/sda": Unexpected error while running command.
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: Command: dd bs=512 if=/dev/zero of=/dev/sda count=33 oflag=direct
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: Exit code: 1
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: Stdout: ''
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: Stderr: "dd: error writing '/dev/sda': Invalid argument\n1+0 records in\n0+0 records out\n0 bytes copied, 3.9605e-05 s, 0.0 kB/s\n": ironic_python_agent.errors.BlockDeviceEraseError: Error erasing block device: Failed to erase the metadata on the device(s): "/dev/sdb": Unexpected error while running command.
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: Command: dd bs=512 if=/dev/zero of=/dev/sdb count=33 oflag=direct
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: Exit code: 1
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: Stdout: ''
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: Stderr: "dd: error writing '/dev/sdb': Invalid argument\n1+0 records in\n0+0 records out\n0 bytes copied, 3.6049e-05 s, 0.0 kB/s\n"; "/dev/sda": Unexpected error while running command.
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: Command: dd bs=512 if=/dev/zero of=/dev/sda count=33 oflag=direct
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: Exit code: 1
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: Stdout: ''
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: Stderr: "dd: error writing '/dev/sda': Invalid argument\n1+0 records in\n0+0 records out\n0 bytes copied, 3.9605e-05 s, 0.0 kB/s\n"
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: 2024-03-27 04:00:48.240 1 ERROR root Traceback (most recent call last):
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: 2024-03-27 04:00:48.240 1 ERROR root File "/usr/lib/python3.9/site-packages/ironic_python_agent/hardware.py", line 3124, in dispatch_to_managers
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: 2024-03-27 04:00:48.240 1 ERROR root return getattr(manager, method)(*args, **kwargs)
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: 2024-03-27 04:00:48.240 1 ERROR root File "/usr/lib/python3.9/site-packages/ironic_python_agent/hardware.py", line 1702, in erase_devices_metadata
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: 2024-03-27 04:00:48.240 1 ERROR root raise errors.BlockDeviceEraseError(excpt_msg)
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: 2024-03-27 04:00:48.240 1 ERROR root ironic_python_agent.errors.BlockDeviceEraseError: Error erasing block device: Failed to erase the metadata on the device(s): "/dev/sdb": Unexpected error while running command.
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: 2024-03-27 04:00:48.240 1 ERROR root Command: dd bs=512 if=/dev/zero of=/dev/sdb count=33 oflag=direct
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: 2024-03-27 04:00:48.240 1 ERROR root Exit code: 1
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: 2024-03-27 04:00:48.240 1 ERROR root Stdout: ''
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: 2024-03-27 04:00:48.240 1 ERROR root Stderr: "dd: error writing '/dev/sdb': Invalid argument\n1+0 records in\n0+0 records out\n0 bytes copied, 3.6049e-05 s, 0.0 kB/s\n"; "/dev/sda": Unexpected error while running command.
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: 2024-03-27 04:00:48.240 1 ERROR root Command: dd bs=512 if=/dev/zero of=/dev/sda count=33 oflag=direct
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: 2024-03-27 04:00:48.240 1 ERROR root Exit code: 1
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: 2024-03-27 04:00:48.240 1 ERROR root Stdout: ''
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: 2024-03-27 04:00:48.240 1 ERROR root Stderr: "dd: error writing '/dev/sda': Invalid argument\n1+0 records in\n0+0 records out\n0 bytes copied, 3.9605e-05 s, 0.0 kB/s\n"
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: 2024-03-27 04:00:48.240 1 ERROR root
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com ironic-agent[4054]: Stdout: ''
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: 2024-03-27 04:00:48.242 1 ERROR root [-] Error performing clean step erase_devices_metadata: ironic_python_agent.errors.BlockDeviceEraseError: Error erasing block device: Failed to erase the metadata on the device(s): "/dev/sdb": Unexpected error while running command.
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: Command: dd bs=512 if=/dev/zero of=/dev/sdb count=33 oflag=direct
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: Exit code: 1
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: Stdout: ''
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: Stderr: "dd: error writing '/dev/sdb': Invalid argument\n1+0 records in\n0+0 records out\n0 bytes copied, 3.6049e-05 s, 0.0 kB/s\n"; "/dev/sda": Unexpected error while running command.
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: Command: dd bs=512 if=/dev/zero of=/dev/sda count=33 oflag=direct
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: Exit code: 1
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: Stdout: ''
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: Stderr: "dd: error writing '/dev/sda': Invalid argument\n1+0 records in\n0+0 records out\n0 bytes copied, 3.9605e-05 s, 0.0 kB/s\n"
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: 2024-03-27 04:00:48.242 1 ERROR root Traceback (most recent call last):
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: 2024-03-27 04:00:48.242 1 ERROR root File "/usr/lib/python3.9/site-packages/ironic_python_agent/extensions/clean.py", line 77, in execute_clean_step
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: 2024-03-27 04:00:48.242 1 ERROR root result = hardware.dispatch_to_managers(step['step'], node, ports,
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: 2024-03-27 04:00:48.242 1 ERROR root File "/usr/lib/python3.9/site-packages/ironic_python_agent/hardware.py", line 3124, in dispatch_to_managers
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: 2024-03-27 04:00:48.242 1 ERROR root return getattr(manager, method)(*args, **kwargs)
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: 2024-03-27 04:00:48.242 1 ERROR root File "/usr/lib/python3.9/site-packages/ironic_python_agent/hardware.py", line 1702, in erase_devices_metadata
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: 2024-03-27 04:00:48.242 1 ERROR root raise errors.BlockDeviceEraseError(excpt_msg)
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: 2024-03-27 04:00:48.242 1 ERROR root ironic_python_agent.errors.BlockDeviceEraseError: Error erasing block device: Failed to erase the metadata on the device(s): "/dev/sdb": Unexpected error while running command.
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: 2024-03-27 04:00:48.242 1 ERROR root Command: dd bs=512 if=/dev/zero of=/dev/sdb count=33 oflag=direct
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: 2024-03-27 04:00:48.242 1 ERROR root Exit code: 1
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: 2024-03-27 04:00:48.242 1 ERROR root Stdout: ''
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: 2024-03-27 04:00:48.242 1 ERROR root Stderr: "dd: error writing '/dev/sdb': Invalid argument\n1+0 records in\n0+0 records out\n0 bytes copied, 3.6049e-05 s, 0.0 kB/s\n"; "/dev/sda": Unexpected error while running command.
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: 2024-03-27 04:00:48.242 1 ERROR root Command: dd bs=512 if=/dev/zero of=/dev/sda count=33 oflag=direct
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: 2024-03-27 04:00:48.242 1 ERROR root Exit code: 1
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: 2024-03-27 04:00:48.242 1 ERROR root Stdout: ''
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: 2024-03-27 04:00:48.242 1 ERROR root Stderr: "dd: error writing '/dev/sda': Invalid argument\n1+0 records in\n0+0 records out\n0 bytes copied, 3.9605e-05 s, 0.0 kB/s\n"
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: 2024-03-27 04:00:48.242 1 ERROR root
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com ironic-agent[4054]: Stderr: "dd: error writing '/dev/sdb': Invalid argument\n1+0 records in\n0+0 records out\n0 bytes copied, 3.6049e-05 s, 0.0 kB/s\n"; "/dev/sda": Unexpected error while running command.
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com ironic-agent[4054]: Command: dd bs=512 if=/dev/zero of=/dev/sda count=33 oflag=direct
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: 2024-03-27 04:00:48.242 1 ERROR root [-] Command failed: execute_clean_step, error: Error erasing block device: Failed to erase the metadata on the device(s): "/dev/sdb": Unexpected error while running command.
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: Command: dd bs=512 if=/dev/zero of=/dev/sdb count=33 oflag=direct
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: Exit code: 1
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: Stdout: ''
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: Stderr: "dd: error writing '/dev/sdb': Invalid argument\n1+0 records in\n0+0 records out\n0 bytes copied, 3.6049e-05 s, 0.0 kB/s\n"; "/dev/sda": Unexpected error while running command.
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: Command: dd bs=512 if=/dev/zero of=/dev/sda count=33 oflag=direct
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: Exit code: 1
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: Stdout: ''
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: Stderr: "dd: error writing '/dev/sda': Invalid argument\n1+0 records in\n0+0 records out\n0 bytes copied, 3.9605e-05 s, 0.0 kB/s\n": ironic_python_agent.errors.BlockDeviceEraseError: Error erasing block device: Failed to erase the metadata on the device(s): "/dev/sdb": Unexpected error while running command.
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: Command: dd bs=512 if=/dev/zero of=/dev/sdb count=33 oflag=direct
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: Exit code: 1
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: Stdout: ''
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: Stderr: "dd: error writing '/dev/sdb': Invalid argument\n1+0 records in\n0+0 records out\n0 bytes copied, 3.6049e-05 s, 0.0 kB/s\n"; "/dev/sda": Unexpected error while running command.
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: Command: dd bs=512 if=/dev/zero of=/dev/sda count=33 oflag=direct
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: Exit code: 1
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: Stdout: ''
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: Stderr: "dd: error writing '/dev/sda': Invalid argument\n1+0 records in\n0+0 records out\n0 bytes copied, 3.9605e-05 s, 0.0 kB/s\n"
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: 2024-03-27 04:00:48.242 1 ERROR root Traceback (most recent call last):
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: 2024-03-27 04:00:48.242 1 ERROR root File "/usr/lib/python3.9/site-packages/ironic_python_agent/extensions/base.py", line 174, in run
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: 2024-03-27 04:00:48.242 1 ERROR root result = self.execute_method(**self.command_params)
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: 2024-03-27 04:00:48.242 1 ERROR root File "/usr/lib/python3.9/site-packages/ironic_python_agent/extensions/clean.py", line 77, in execute_clean_step
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: 2024-03-27 04:00:48.242 1 ERROR root result = hardware.dispatch_to_managers(step['step'], node, ports,
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: 2024-03-27 04:00:48.242 1 ERROR root File "/usr/lib/python3.9/site-packages/ironic_python_agent/hardware.py", line 3124, in dispatch_to_managers
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: 2024-03-27 04:00:48.242 1 ERROR root return getattr(manager, method)(*args, **kwargs)
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: 2024-03-27 04:00:48.242 1 ERROR root File "/usr/lib/python3.9/site-packages/ironic_python_agent/hardware.py", line 1702, in erase_devices_metadata
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: 2024-03-27 04:00:48.242 1 ERROR root raise errors.BlockDeviceEraseError(excpt_msg)
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: 2024-03-27 04:00:48.242 1 ERROR root ironic_python_agent.errors.BlockDeviceEraseError: Error erasing block device: Failed to erase the metadata on the device(s): "/dev/sdb": Unexpected error while running command.
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: 2024-03-27 04:00:48.242 1 ERROR root Command: dd bs=512 if=/dev/zero of=/dev/sdb count=33 oflag=direct
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: 2024-03-27 04:00:48.242 1 ERROR root Exit code: 1
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: 2024-03-27 04:00:48.242 1 ERROR root Stdout: ''
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: 2024-03-27 04:00:48.242 1 ERROR root Stderr: "dd: error writing '/dev/sdb': Invalid argument\n1+0 records in\n0+0 records out\n0 bytes copied, 3.6049e-05 s, 0.0 kB/s\n"; "/dev/sda": Unexpected error while running command.
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: 2024-03-27 04:00:48.242 1 ERROR root Command: dd bs=512 if=/dev/zero of=/dev/sda count=33 oflag=direct
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: 2024-03-27 04:00:48.242 1 ERROR root Exit code: 1
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: 2024-03-27 04:00:48.242 1 ERROR root Stdout: ''
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: 2024-03-27 04:00:48.242 1 ERROR root Stderr: "dd: error writing '/dev/sda': Invalid argument\n1+0 records in\n0+0 records out\n0 bytes copied, 3.9605e-05 s, 0.0 kB/s\n"
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: 2024-03-27 04:00:48.242 1 ERROR root
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com ironic-agent[4054]: Exit code: 1
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com ironic-agent[4054]: Stdout: ''
This is a clone of issue OCPBUGS-34359. The following is the description of the original issue:
—
The issue was observed during testing of the k8s 1.30 rebase in which the webhook client started using http2 for loopback IPs: kubernetes/kubernetes#122558.
It looks like the issue is caused by how an http2 client handles this invalid address. I verified this change by setting up a cluster with openshift/kubernetes#1953 and this PR.
This is a clone of issue OCPBUGS-36390. The following is the description of the original issue:
—
Description of problem:
The Installer still requires permissions to create and delete IAM roles even when the user brings existing roles.
Version-Release number of selected component (if applicable):
4.16+
How reproducible:
always
Steps to Reproduce:
1. Specify existing IAM role in the install-config 2. 3.
Actual results:
The following permissions are required even though they are not used: "iam:CreateRole", "iam:DeleteRole", "iam:DeleteRolePolicy", "iam:PutRolePolicy", "iam:TagInstanceProfile"
Expected results:
Only actually needed permissions are required.
Additional info:
I think this is tech debt from when roles were not tagged. The fix will kind of revert https://github.com/openshift/installer/pull/5286
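For reference, the unused actions listed above expressed as a minimal IAM policy statement (a sketch; the Resource scope is illustrative), i.e. the statement that should no longer be required once existing roles are honored:
    {
      "Effect": "Allow",
      "Action": [
        "iam:CreateRole",
        "iam:DeleteRole",
        "iam:DeleteRolePolicy",
        "iam:PutRolePolicy",
        "iam:TagInstanceProfile"
      ],
      "Resource": "*"
    }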
Description of problem:
When deploying an HCP KubeVirt cluster using the hcp CLI's --node-selector argument, that node selector is not applied to the "kubevirt-cloud-controller-manager" pods within the HCP namespace. This makes it impossible to pin the entire set of HCP pods to specific nodes.
Version-Release number of selected component (if applicable):
4.14
How reproducible:
100%
Steps to Reproduce:
1. Deploy an HCP KubeVirt cluster with the --node-selector CLI option (see the sketch below). 2. 3.
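A hedged sketch of the reproduction command (the cluster name and selector are illustrative; other required flags such as the pull secret are omitted):
    hcp create cluster kubevirt \
      --name my-guest-cluster \
      --node-pool-replicas 2 \
      --node-selector "hcp-nodes=true"
    # Expectation: every pod in the hosted control plane namespace, including
    # kubevirt-cloud-controller-manager, should inherit this node selector.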
Actual results:
The node selector is not applied to the cloud-provider-kubevirt pod.
Expected results:
The node selector should be applied to the cloud-provider-kubevirt pod.
Additional info:
Description of problem:
The azure csi driver operator cannot run in a HyperShift control plane because it has this selector: node-role.kubernetes.io/master: ""
Version-Release number of selected component (if applicable):
4.16 ci latest
How reproducible:
always
Steps to Reproduce:
1. Install hypershift 2. Create azure hosted cluster
Actual results:
azure-disk-csi-driver-operator pod remains in Pending state
Expected results:
all control plane pods run
Additional info:
Description of problem:
This was found when testing OCP-71263 and regression OCP-35770 for 4.15. For GCP in Mint mode, the root credential can be removed after cluster installation. But after removing the root credential, CCO becomes degraded.
Version-Release number of selected component (if applicable):
4.15.0-0.nightly-2024-01-25-051548 4.15.0-rc.3
How reproducible:
Always
Steps to Reproduce:
1.Install a GCP cluster with Mint mode 2.After install, remove the root credential jianpingshu@jshu-mac ~ % oc delete secret -n kube-system gcp-credentials secret "gcp-credentials" deleted 3.Wait some time(about 1/2h to 1h), CCO became degrade jianpingshu@jshu-mac ~ % oc get co cloud-credential NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE cloud-credential 4.15.0-rc.3 True True True 6h45m 6 of 7 credentials requests are failing to sync. jianpingshu@jshu-mac ~ % oc -n openshift-cloud-credential-operator get -o json credentialsrequests | jq -r '.items[] | select(tostring | contains("InfrastructureMismatch") | not) | .metadata.name as $n | .status.conditions // [{type: "NoConditions"}] | .[] | .type + "=" + .status + " " + $n + " " + .reason + ": " + .message' | sort CredentialsProvisionFailure=False openshift-cloud-network-config-controller-gcp CredentialsProvisionSuccess: successfully granted credentials request CredentialsProvisionFailure=True cloud-credential-operator-gcp-ro-creds CredentialsProvisionFailure: failed to grant creds: unable to fetch root cloud cred secret: Secret "gcp-credentials" not found CredentialsProvisionFailure=True openshift-gcp-ccm CredentialsProvisionFailure: failed to grant creds: unable to fetch root cloud cred secret: Secret "gcp-credentials" not found CredentialsProvisionFailure=True openshift-gcp-pd-csi-driver-operator CredentialsProvisionFailure: failed to grant creds: unable to fetch root cloud cred secret: Secret "gcp-credentials" not found CredentialsProvisionFailure=True openshift-image-registry-gcs CredentialsProvisionFailure: failed to grant creds: unable to fetch root cloud cred secret: Secret "gcp-credentials" not found CredentialsProvisionFailure=True openshift-ingress-gcp CredentialsProvisionFailure: failed to grant creds: unable to fetch root cloud cred secret: Secret "gcp-credentials" not found CredentialsProvisionFailure=True openshift-machine-api-gcp CredentialsProvisionFailure: failed to grant creds: unable to fetch root cloud cred secret: Secret "gcp-credentials" not found openshift-cloud-network-config-controller-gcp has no failure because it doesn't has customized role in 4.15.0.rc3
Actual results:
CCO becomes degraded.
Expected results:
CCO should not be degraded; only the "Upgradeable" condition should be updated to reflect the missing root credential.
Additional info:
Tested the same case on 4.14.10, no issue
Description of problem:
We are in a live migration scenario.
If a project has a NetworkPolicy that allows traffic from the host network (more concretely, that allows traffic from the ingress controllers, with the ingress controllers on the host network), traffic doesn't work during the live migration between any ingress controller node (whether migrated or not) and an already migrated application node.
I'll expand later in the description and internal comments, but the TL;DR is that the tun0 IPs of not-yet-migrated source nodes and the ovn-k8s-mp0 IPs of migrated source nodes are not added to the address sets related to the NetworkPolicy ACL on the target OVN-Kubernetes node, so the traffic is not allowed.
Version-Release number of selected component (if applicable):
4.16.13
How reproducible:
Always
Steps to Reproduce:
1. Before the migration: have a project with a NetworkPolicy that allows traffic from the ingress controller, with the ingress controller on the host network. Everything must work properly at this point.
2. Start the migration
3. During the migration, check connectivity from the host network of either a migrated node or a non-migrated node. Both will fail (checking from the same node doesn't fail)
Actual results:
A pod on the worker node is not reachable from the host network of the ingress controller node (unless the pod is on the same node as the ingress controller), which causes the ingress controller routes to return 503 errors.
Expected results:
The pod on the worker node should be reachable from the ingress controller node, even when the ingress controller node has not migrated yet and the application node has.
Additional info:
This is not a duplicate of OCPBUGS-42578. This bug refers to the host-to-pod communication path while the other one doesn't.
This is a customer issue. More details to be included in private comments for privacy.
Workaround: create a NetworkPolicy that explicitly allows traffic from the tun0 and ovn-k8s-mp0 interfaces (a hedged sketch is shown below). However, note that this workaround can be problematic for clusters with hundreds or thousands of projects. Another possible workaround is to temporarily delete all the NetworkPolicies of the projects, but again, this may be problematic (and a security risk).
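A hedged sketch of the first workaround, applied per namespace; the CIDR below is a placeholder that must be replaced with the actual tun0 / ovn-k8s-mp0 addresses of the cluster nodes:
    oc apply -n <project> -f - <<'EOF'
    apiVersion: networking.k8s.io/v1
    kind: NetworkPolicy
    metadata:
      name: allow-from-node-host-interfaces
    spec:
      podSelector: {}
      policyTypes:
      - Ingress
      ingress:
      - from:
        - ipBlock:
            cidr: 10.128.0.0/14   # placeholder: node tun0 / ovn-k8s-mp0 address range
    EOF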
Please review the following PR: https://github.com/openshift/cloud-provider-alibaba-cloud/pull/44
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
After migrating the cluster from SDN to OVN, intermittent failures are seen while accessing a service.
Wed Jul 31 05:28:11 UTC 2024 Wed Jul 31 01:28:11 EDT 2024 Hello OpenShift!
Wed Jul 31 05:28:42 UTC 2024 Wed Jul 31 01:28:42 EDT 2024 curl: (7) Failed to connect to 34.92.142.227 port 27018 after 75006 ms: Couldn't connect to server
Wed Jul 31 05:30:27 UTC 2024 Wed Jul 31 01:30:27 EDT 2024 Hello OpenShift!
Wed Jul 31 05:31:59 UTC 2024 Wed Jul 31 01:31:59 EDT 2024 Hello OpenShift!
Wed Jul 31 05:33:31 UTC 2024 Wed Jul 31 01:33:31 EDT 2024 Hello OpenShift!
Wed Jul 31 05:34:01 UTC 2024 Wed Jul 31 01:34:01 EDT 2024 curl: (52) Empty reply from server
Wed Jul 31 05:38:51 UTC 2024 Wed Jul 31 01:38:51 EDT 2024 Hello OpenShift!
Version-Release number of selected component (if applicable):
$ oc version
Client Version: 4.15.14
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: 4.15.0-0.nightly-2024-07-29-053620
Kubernetes Version: v1.28.11+add48d0
How reproducible:
Steps to Reproduce:
1. Create a 4.14 SDN OSD on GCP cluster
2. Upgrade to 4.15
3. Scale cluster to 24 nodes
4. Add cluster-density-v2 workload
5. Run migration and let if finish
6. Start seeing errors
Actual results: Intermittent failures accessing service.
Expected results: Live migration should not cause disruption to service.
Additional info:
Description of problem:
OTA team wants to rename the `supported but not recommended` update edges to `known issues`
Version-Release number of selected component (if applicable):
openshift-4.16
Expected results:
`supported but not recommended` edges are renamed to `known issues`
Additional info:
https://github.com/openshift/console/blob/962ab7b443f87e45486741444a32053a807d3219/frontend/public/components/modals/cluster-update-modal.tsx#L191 https://github.com/openshift/console/blob/962ab7b443f87e45486741444a32053a807d3219/frontend/public/components/modals/cluster-update-modal.tsx#L216 https://github.com/openshift/console/blob/962ab7b443f87e45486741444a32053a807d3219/frontend/public/components/modals/cluster-update-modal.tsx#L219 https://github.com/openshift/console/blob/962ab7b443f87e45486741444a32053a807d3219/frontend/public/components/modals/cluster-update-modal.tsx#L234 https://github.com/openshift/console/blob/962ab7b443f87e45486741444a32053a807d3219[…]rontend/public/components/cluster-settings/cluster-settings.tsx
Please review the following PR: https://github.com/openshift/cloud-credential-operator/pull/639
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/ovirt-csi-driver-operator/pull/129
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/images/pull/155
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem: Worker CloudFormation Template (Installing a cluster on AWS using CloudFormation templates, OpenShift Container Platform 4.13): https://docs.openshift.com/container-platform/4.13/installing/installing_aws/installing-aws-user-infra.html#installation-cloudformation-worker_installing-aws-user-infra
In the OpenShift documentation, under the manual AWS CloudFormation templates, the CloudFormation template for worker nodes has descriptions for Subnet and WorkerSecurityGroupId that refer to the master nodes. Based on the variable names, the descriptions should refer to worker nodes instead.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
Actual results:
Expected results:
Additional info:
Description of problem:
CAPI E2Es failing to start in some CAPI provider's release branches.
Failing with the following error: `go: errors parsing go.mod:94/tmp/tmp.ssf1LXKrim/go.mod:5: unknown directive: toolchain` https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_cluster-api/199/pull-ci-openshift-cluster-api-master-e2e-aws-capi-techpreview/1765512397532958720#1:build-log.txt%3A91-95
Version-Release number of selected component (if applicable):
4.15
How reproducible:
Always
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
This is because the script launching the e2e is launching it from the `main` branch of the cluster-capi-operator (which has some backward-incompatible go toolchain changes), rather than from the correctly matching release branch.
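A quick local check of the mismatch (a sketch; the checkout path is a placeholder):
go version                                                     # toolchain actually available in the CI image
grep -nE '^(go|toolchain) ' <cluster-capi-operator-checkout>/go.mod   # directives the branch expects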
Description of problem:
I had a version of MTC installed on my cluster when it was running a prior version. I had deleted it some time ago, long before upgrading to 4.15. I upgraded the cluster to 4.15 and needed to reinstall MTC to take a look at something, but found the operator would not install.
I originally tried with 4.15.0, but on failure upgraded to 4.15.3 to see if it would resolve the issue; it did not.
Version-Release number of selected component (if applicable):
$ oc version Client Version: 4.15.3 Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3 Server Version: 4.15.3 Kubernetes Version: v1.28.7+6e2789b
How reproducible:
Always as far as I can tell. I have at least two clusters where I was able to reproduce it.
Steps to Reproduce:
1. Install Migration Toolkit for Containers on OpenShift 4.14 2. Uninstall it 3. Upgrade to 4.15 4. Try to install it again
Actual results:
The operator never installs. UI just shows "Upgrade status: Unkown Failure" Observe the catalog operator logs and note errors like: E0319 21:35:57.350591 1 queueinformer_operator.go:319] sync {"update" "openshift-migration"} failed: bundle unpacking failed with an error: [roles.rbac.authorization.k8s.io "c1572438804f004fb90b6768c203caad96c47331f7ecc4f68c3cf6b43b0acfd" already exists, roles.rbac.authorization.k8s.io "724788f6766aa5ba19b24ef4619b6a8e8e856b8b5fb96e1380f0d3f5b9dcb7a" already exists] If you delete the roles, you'll get the same for rolebindings, then the same for jobs.batch, and then configmaps.
Expected results:
Operator just installs
Additional info:
If you clean up all these resources the operator will install successfully.
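A sketch of that cleanup, assuming the leftover bundle-unpack objects live in the namespace used for unpacking (often openshift-marketplace, which is an assumption here); the hash-named resource is the one quoted in the error message:
oc get roles,rolebindings,jobs,configmaps -n openshift-marketplace | grep c1572438804f004fb90b6768c203caad96c47331f7ecc4f68c3cf6b43b0acfd
oc delete role,rolebinding,job,configmap -n openshift-marketplace c1572438804f004fb90b6768c203caad96c47331f7ecc4f68c3cf6b43b0acfd --ignore-not-found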
Description of problem:
[vSphere] network.devices, template and workspace are cleared when deleting the controlplanemachineset, and updating these fields does not trigger an update
Version-Release number of selected component (if applicable):
4.16.0-0.nightly-2024-04-23-032717
How reproducible:
Always
Steps to Reproduce:
1.Install a vSphere 4.16 cluster, we use automated template: ipi-on-vsphere/versioned-installer liuhuali@Lius-MacBook-Pro huali-test % oc get clusterversion NAME VERSION AVAILABLE PROGRESSING SINCE STATUS version 4.16.0-0.nightly-2024-04-23-032717 True False 24m Cluster version is 4.16.0-0.nightly-2024-04-23-032717 2.Check the controlplanemachineset, you can see network.devices, template and workspace have value. liuhuali@Lius-MacBook-Pro huali-test % oc get controlplanemachineset NAME DESIRED CURRENT READY UPDATED UNAVAILABLE STATE AGE cluster 3 3 3 3 Active 51m liuhuali@Lius-MacBook-Pro huali-test % oc get controlplanemachineset cluster -oyaml apiVersion: machine.openshift.io/v1 kind: ControlPlaneMachineSet metadata: creationTimestamp: "2024-04-25T02:52:11Z" finalizers: - controlplanemachineset.machine.openshift.io generation: 1 labels: machine.openshift.io/cluster-api-cluster: huliu-vs425c-f5tfl name: cluster namespace: openshift-machine-api resourceVersion: "18273" uid: f340d9b4-cf57-4122-b4d4-0f45f20e4d79 spec: replicas: 3 selector: matchLabels: machine.openshift.io/cluster-api-cluster: huliu-vs425c-f5tfl machine.openshift.io/cluster-api-machine-role: master machine.openshift.io/cluster-api-machine-type: master state: Active strategy: type: RollingUpdate template: machineType: machines_v1beta1_machine_openshift_io machines_v1beta1_machine_openshift_io: failureDomains: platform: VSphere vsphere: - name: generated-failure-domain metadata: labels: machine.openshift.io/cluster-api-cluster: huliu-vs425c-f5tfl machine.openshift.io/cluster-api-machine-role: master machine.openshift.io/cluster-api-machine-type: master spec: lifecycleHooks: {} metadata: {} providerSpec: value: apiVersion: machine.openshift.io/v1beta1 credentialsSecret: name: vsphere-cloud-credentials diskGiB: 120 kind: VSphereMachineProviderSpec memoryMiB: 16384 metadata: creationTimestamp: null network: devices: - networkName: devqe-segment-221 numCPUs: 4 numCoresPerSocket: 4 snapshot: "" template: huliu-vs425c-f5tfl-rhcos-generated-region-generated-zone userDataSecret: name: master-user-data workspace: datacenter: DEVQEdatacenter datastore: /DEVQEdatacenter/datastore/vsanDatastore folder: /DEVQEdatacenter/vm/huliu-vs425c-f5tfl resourcePool: /DEVQEdatacenter/host/DEVQEcluster/Resources server: vcenter.devqe.ibmc.devcluster.openshift.com status: conditions: - lastTransitionTime: "2024-04-25T02:59:37Z" message: "" observedGeneration: 1 reason: AsExpected status: "False" type: Error - lastTransitionTime: "2024-04-25T03:03:45Z" message: "" observedGeneration: 1 reason: AllReplicasAvailable status: "True" type: Available - lastTransitionTime: "2024-04-25T03:03:45Z" message: "" observedGeneration: 1 reason: AsExpected status: "False" type: Degraded - lastTransitionTime: "2024-04-25T03:01:04Z" message: "" observedGeneration: 1 reason: AllReplicasUpdated status: "False" type: Progressing observedGeneration: 1 readyReplicas: 3 replicas: 3 updatedReplicas: 3 3.Delete the controlplanemachineset, it will recreate a new one, but those three fields that had values before are now cleared. 
liuhuali@Lius-MacBook-Pro huali-test % oc delete controlplanemachineset cluster controlplanemachineset.machine.openshift.io "cluster" deleted liuhuali@Lius-MacBook-Pro huali-test % oc get controlplanemachineset NAME DESIRED CURRENT READY UPDATED UNAVAILABLE STATE AGE cluster 3 3 3 3 Inactive 6s liuhuali@Lius-MacBook-Pro huali-test % oc get controlplanemachineset cluster -oyaml apiVersion: machine.openshift.io/v1 kind: ControlPlaneMachineSet metadata: creationTimestamp: "2024-04-25T03:45:51Z" finalizers: - controlplanemachineset.machine.openshift.io generation: 1 name: cluster namespace: openshift-machine-api resourceVersion: "46172" uid: 45d966c9-ec95-42e1-b8b0-c4945ea58566 spec: replicas: 3 selector: matchLabels: machine.openshift.io/cluster-api-cluster: huliu-vs425c-f5tfl machine.openshift.io/cluster-api-machine-role: master machine.openshift.io/cluster-api-machine-type: master state: Inactive strategy: type: RollingUpdate template: machineType: machines_v1beta1_machine_openshift_io machines_v1beta1_machine_openshift_io: failureDomains: platform: VSphere vsphere: - name: generated-failure-domain metadata: labels: machine.openshift.io/cluster-api-cluster: huliu-vs425c-f5tfl machine.openshift.io/cluster-api-machine-role: master machine.openshift.io/cluster-api-machine-type: master spec: lifecycleHooks: {} metadata: {} providerSpec: value: apiVersion: machine.openshift.io/v1beta1 credentialsSecret: name: vsphere-cloud-credentials diskGiB: 120 kind: VSphereMachineProviderSpec memoryMiB: 16384 metadata: creationTimestamp: null network: devices: null numCPUs: 4 numCoresPerSocket: 4 snapshot: "" template: "" userDataSecret: name: master-user-data workspace: {} status: conditions: - lastTransitionTime: "2024-04-25T03:45:51Z" message: "" observedGeneration: 1 reason: AsExpected status: "False" type: Error - lastTransitionTime: "2024-04-25T03:45:51Z" message: "" observedGeneration: 1 reason: AllReplicasAvailable status: "True" type: Available - lastTransitionTime: "2024-04-25T03:45:51Z" message: "" observedGeneration: 1 reason: AsExpected status: "False" type: Degraded - lastTransitionTime: "2024-04-25T03:45:51Z" message: "" observedGeneration: 1 reason: AllReplicasUpdated status: "False" type: Progressing observedGeneration: 1 readyReplicas: 3 replicas: 3 updatedReplicas: 3 4.I active the controlplanemachineset and it does not trigger an update, I continue to add these field values back and it does not trigger an update, I continue to edit these fields to add a second network device and it still does not trigger an update. network: devices: - networkName: devqe-segment-221 - networkName: devqe-segment-222 By the way, I can create worker machines with other network device or two network devices. huliu-vs425c-f5tfl-worker-0a-ldbkh Running 81m huliu-vs425c-f5tfl-worker-0aa-r8q4d Running 70m
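A quick way to spot the cleared fields after the CPMS is recreated (a sketch using jsonpath; the paths follow the vSphere providerSpec shown above):
oc -n openshift-machine-api get controlplanemachineset cluster \
  -o jsonpath='{.spec.template.machines_v1beta1_machine_openshift_io.spec.providerSpec.value.template}{"\n"}{.spec.template.machines_v1beta1_machine_openshift_io.spec.providerSpec.value.workspace}{"\n"}{.spec.template.machines_v1beta1_machine_openshift_io.spec.providerSpec.value.network.devices}{"\n"}'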
Actual results:
network.devices, template and workspace are cleared when deleting the controlplanemachineset, and updating these fields does not trigger an update
Expected results:
The field values should not be changed when deleting the controlplanemachineset, and updating these fields should trigger an update. Alternatively, if these fields are not meant to be modified, then modifying them in the controlplanemachineset should have no effect; the current inconsistency is confusing.
Additional info:
Must gather: https://drive.google.com/file/d/1mHR31m8gaNohVMSFqYovkkY__t8-E30s/view?usp=sharing
During jobs that upgrade to 4.16 from 4.15, the testing of unauthenticated build webhook invocation fails (I suspect due to the existing rolebindings from 4.15 surviving the upgrade).
[sig-builds][Feature:Builds][webhook] TestWebhook [apigroup:build.openshift.io][apigroup:image.openshift.io] [Suite:openshift/conformance/parallel] . . . STEP: testing unauthenticated forbidden webhooks @ 05/07/24 20:03:20.024 STEP: executing the webhook to get the build object @ 05/07/24 20:03:20.024 [FAILED] in [It] - github.com/openshift/origin/test/extended/builds/webhook.go:36 @ 05/07/24 20:03:20.148
This is a clone of issue OCPBUGS-18711. The following is the description of the original issue:
—
Description of problem:
secrets-store-csi-driver with AWS provider does not work in HyperShift hosted cluster, pod can't mount the volume successfully.
Version-Release number of selected component (if applicable):
secrets-store-csi-driver-operator.v4.14.0-202308281544 in 4.14.0-0.nightly-2023-09-06-235710 HyperShift hosted cluster.
How reproducible:
Always
Steps to Reproduce:
1. Follow test case OCP-66032 "Setup" part to install secrets-store-csi-driver-operator.v4.14.0-202308281544 , secrets-store-csi-driver and AWS provider successfully: $ oc get po -n openshift-cluster-csi-drivers NAME READY STATUS RESTARTS AGE aws-ebs-csi-driver-node-7xxgr 3/3 Running 0 5h18m aws-ebs-csi-driver-node-fmzwf 3/3 Running 0 5h18m aws-ebs-csi-driver-node-rgrxd 3/3 Running 0 5h18m aws-ebs-csi-driver-node-tpcxq 3/3 Running 0 5h18m csi-secrets-store-provider-aws-2fm6q 1/1 Running 0 5m14s csi-secrets-store-provider-aws-9xtw7 1/1 Running 0 5m15s csi-secrets-store-provider-aws-q5lvb 1/1 Running 0 5m15s csi-secrets-store-provider-aws-q6m65 1/1 Running 0 5m15s secrets-store-csi-driver-node-4wdc8 3/3 Running 0 6m22s secrets-store-csi-driver-node-n7gkj 3/3 Running 0 6m23s secrets-store-csi-driver-node-xqr52 3/3 Running 0 6m22s secrets-store-csi-driver-node-xr24v 3/3 Running 0 6m22s secrets-store-csi-driver-operator-9cb55b76f-7cbvz 1/1 Running 0 7m16s 2. Follow test case OCP-66032 steps to create AWS secret, set up AWS IRSA successfully. 3. Follow test case OCP-66032 steps SecretProviderClass, deployment with the secretProviderClass successfully. Then check pod, pod is stuck in ContainerCreating: $ oc get po NAME READY STATUS RESTARTS AGE hello-openshift-84c76c5b89-p5k4f 0/1 ContainerCreating 0 10m $ oc describe po hello-openshift-84c76c5b89-p5k4f ... Events: Type Reason Age From Message ---- ------ ---- ---- ------- Normal Scheduled 11m default-scheduler Successfully assigned xxia-proj/hello-openshift-84c76c5b89-p5k4f to ip-10-0-136-205.us-east-2.compute.internal Warning FailedMount 11m kubelet MountVolume.SetUp failed for volume "secrets-store-inline" : rpc error: code = Unknown desc = failed to mount secrets store objects for pod xxia-proj/hello-openshift-84c76c5b89-p5k4f, err: rpc error: code = Unknown desc = us-east-2: Failed fetching secret xxiaSecret: WebIdentityErr: failed to retrieve credentials caused by: InvalidIdentityToken: Incorrect token audience status code: 400, request id: 92d1ff5b-36be-4cc5-9b55-b12279edd78e Warning FailedMount 11m kubelet MountVolume.SetUp failed for volume "secrets-store-inline" : rpc error: code = Unknown desc = failed to mount secrets store objects for pod xxia-proj/hello-openshift-84c76c5b89-p5k4f, err: rpc error: code = Unknown desc = us-east-2: Failed fetching secret xxiaSecret: WebIdentityErr: failed to retrieve credentials caused by: InvalidIdentityToken: Incorrect token audience status code: 400, request id: 50907328-70a6-44e0-9f05-80a31acef0b4 Warning FailedMount 11m kubelet MountVolume.SetUp failed for volume "secrets-store-inline" : rpc error: code = Unknown desc = failed to mount secrets store objects for pod xxia-proj/hello-openshift-84c76c5b89-p5k4f, err: rpc error: code = Unknown desc = us-east-2: Failed fetching secret xxiaSecret: WebIdentityErr: failed to retrieve credentials caused by: InvalidIdentityToken: Incorrect token audience status code: 400, request id: 617dc3bc-a5e3-47b0-b37c-825f8dd84920 Warning FailedMount 11m kubelet MountVolume.SetUp failed for volume "secrets-store-inline" : rpc error: code = Unknown desc = failed to mount secrets store objects for pod xxia-proj/hello-openshift-84c76c5b89-p5k4f, err: rpc error: code = Unknown desc = us-east-2: Failed fetching secret xxiaSecret: WebIdentityErr: failed to retrieve credentials caused by: InvalidIdentityToken: Incorrect token audience status code: 400, request id: 8ab5fc2c-00ca-45e2-9a82-7b1765a5df1a Warning FailedMount 11m kubelet MountVolume.SetUp failed for volume 
"secrets-store-inline" : rpc error: code = Unknown desc = failed to mount secrets store objects for pod xxia-proj/hello-openshift-84c76c5b89-p5k4f, err: rpc error: code = Unknown desc = us-east-2: Failed fetching secret xxiaSecret: WebIdentityErr: failed to retrieve credentials caused by: InvalidIdentityToken: Incorrect token audience status code: 400, request id: b76019ca-dc04-4e3e-a305-6db902b0a863 Warning FailedMount 11m kubelet MountVolume.SetUp failed for volume "secrets-store-inline" : rpc error: code = Unknown desc = failed to mount secrets store objects for pod xxia-proj/hello-openshift-84c76c5b89-p5k4f, err: rpc error: code = Unknown desc = us-east-2: Failed fetching secret xxiaSecret: WebIdentityErr: failed to retrieve credentials caused by: InvalidIdentityToken: Incorrect token audience status code: 400, request id: b395e3b2-52a2-4fc2-80c6-9a9722e26375 Warning FailedMount 11m kubelet MountVolume.SetUp failed for volume "secrets-store-inline" : rpc error: code = Unknown desc = failed to mount secrets store objects for pod xxia-proj/hello-openshift-84c76c5b89-p5k4f, err: rpc error: code = Unknown desc = us-east-2: Failed fetching secret xxiaSecret: WebIdentityErr: failed to retrieve credentials caused by: InvalidIdentityToken: Incorrect token audience status code: 400, request id: ec325057-9c0a-4327-80c9-a9b6233a64dd Warning FailedMount 10m kubelet MountVolume.SetUp failed for volume "secrets-store-inline" : rpc error: code = Unknown desc = failed to mount secrets store objects for pod xxia-proj/hello-openshift-84c76c5b89-p5k4f, err: rpc error: code = Unknown desc = us-east-2: Failed fetching secret xxiaSecret: WebIdentityErr: failed to retrieve credentials caused by: InvalidIdentityToken: Incorrect token audience status code: 400, request id: 405492b2-ed52-429b-b253-6a7c098c26cb Warning FailedMount 82s (x5 over 9m35s) kubelet Unable to attach or mount volumes: unmounted volumes=[secrets-store-inline], unattached volumes=[], failed to process volumes=[]: timed out waiting for the condition Warning FailedMount 74s (x5 over 9m25s) kubelet (combined from similar events): MountVolume.SetUp failed for volume "secrets-store-inline" : rpc error: code = Unknown desc = failed to mount secrets store objects for pod xxia-proj/hello-openshift-84c76c5b89-p5k4f, err: rpc error: code = Unknown desc = us-east-2: Failed fetching secret xxiaSecret: WebIdentityErr: failed to retrieve credentials caused by: InvalidIdentityToken: Incorrect token audience status code: 400, request id: c38bbed1-012d-4250-b674-24ab40607920
Actual results:
Hit above stuck issue.
Expected results:
Pod should be Running.
Additional info:
Compared with another operator (cert-manager-operator) that also uses AWS IRSA (OCP-62500): that case works well, so secrets-store-csi-driver-operator has a bug.
Description of problem:
Internal registry Pods will panic while deploying OCP on `ca-west-1` AWS Region
Version-Release number of selected component (if applicable):
4.14.2
How reproducible:
Every time
Steps to Reproduce:
1. Deploy OCP on `ca-west-1` AWS Region
Actual results:
$ oc logs image-registry-85b69cd9fc-b78sb -n openshift-image-registry time="2024-02-08T11:43:09.287006584Z" level=info msg="start registry" distribution_version=v3.0.0+unknown go.version="go1.20.10 X:strictfipsruntime" openshift_version=4.14.0-202311021650.p0.g5e7788a.assembly.stream-5e7788a time="2024-02-08T11:43:09.287365337Z" level=info msg="caching project quota objects with TTL 1m0s" go.version="go1.20.10 X:strictfipsruntime" panic: invalid region provided: ca-west-1goroutine 1 [running]: github.com/distribution/distribution/v3/registry/handlers.NewApp({0x2873f40?, 0xc00005c088?}, 0xc000581800) /go/src/github.com/openshift/image-registry/vendor/github.com/distribution/distribution/v3/registry/handlers/app.go:130 +0x2bf1 github.com/openshift/image-registry/pkg/dockerregistry/server/supermiddleware.NewApp({0x2873f40, 0xc00005c088}, 0x0?, {0x2876820?, 0xc000676cf0}) /go/src/github.com/openshift/image-registry/pkg/dockerregistry/server/supermiddleware/app.go:96 +0xb9 github.com/openshift/image-registry/pkg/dockerregistry/server.NewApp({0x2873f40?, 0xc00005c088}, {0x285ffd0?, 0xc000916070}, 0xc000581800, 0xc00095c000, {0x0?, 0x0}) /go/src/github.com/openshift/image-registry/pkg/dockerregistry/server/app.go:138 +0x485 github.com/openshift/image-registry/pkg/cmd/dockerregistry.NewServer({0x2873f40, 0xc00005c088}, 0xc000581800, 0xc00095c000) /go/src/github.com/openshift/image-registry/pkg/cmd/dockerregistry/dockerregistry.go:212 +0x38a github.com/openshift/image-registry/pkg/cmd/dockerregistry.Execute({0x2858b60, 0xc000916000}) /go/src/github.com/openshift/image-registry/pkg/cmd/dockerregistry/dockerregistry.go:166 +0x86b main.main() /go/src/github.com/openshift/image-registry/cmd/dockerregistry/main.go:93 +0x496
Expected results:
The internal registry is deployed with no issues
Additional info:
This is a new AWS Region we are adding support to. The support will be backported to 4.14.z
Description of problem:
OVN br-int OVS flows do not get updated on other nodes when a node's bond MAC address changes to the other slave interface after a reboot. This causes network traffic coming from the SDN of one node to get dropped when it hits the node that changed MAC addresses on its bond interface.
Version-Release number of selected component (if applicable): 4.12+
How reproducible: 100% of the time after rebooting if the MAC changes. The MAC does not always change.
Steps to Reproduce:
1. Capture bond0 mac before reboot 2. Reboot host 3. Confirm mac change 4. oc run --rm -it test-pod-sdn --image=registry.redhat.io/openshift4/network-tools-rhel8 --overrides='{"spec": {"tolerations": [{"operator": "Exists"}],"nodeSelector":{"kubernetes.io/hostname":"nodeb-not-rebooted"}}}' /bin/bash 5. Ping rebooted node
Actual results:
The ping reaches the rebooted node but is dropped, because the MAC address belongs to the other slave interface and not the one the bond is using.
Expected results:
OVS flows should update on all nodes after a reboot if the MAC changes.
Additional info:
If we restart NetworkManager a couple of times, this triggers the OVS flows to get updated; not sure why. Possible workarounds: https://access.redhat.com/solutions/6972925, or statically set the MAC of bond0 to one of the slave interfaces.
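A sketch of those workarounds, run on the rebooted node (the MAC value is a placeholder, and using ethernet.cloned-mac-address on the bond connection is an assumption about how to pin the MAC with nmcli):
systemctl restart NetworkManager   # repeating this a couple of times refreshed the OVS flows
nmcli connection modify bond0 802-3-ethernet.cloned-mac-address 52:54:00:aa:bb:cc   # pin bond0 to one slave's MAC (placeholder value)
nmcli connection up bond0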
Yesterday a major DPCR and thus Loki outage took the system down entirely. One test would fail as a result:
[sig-instrumentation] Prometheus [apigroup:image.openshift.io] when installed on the cluster shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured [Early][apigroup:config.openshift.io] [Skipped:Disconnected] [Suite:openshift/conformance/parallel]
[ { "metric": { "__name__": "ALERTS", "alertname": "KubeDaemonSetRolloutStuck", "alertstate": "firing", "container": "kube-rbac-proxy-main", "daemonset": "loki-promtail", "endpoint": "https-main", "job": "kube-state-metrics", "namespace": "openshift-e2e-loki", "prometheus": "openshift-monitoring/k8s", "service": "kube-state-metrics", "severity": "warning" }, "value": [ 1709071917.851, "1" ] } ]
The query this test uses should be adapted to omit everything in openshift-e2e-loki.
Ideally, backports would be good here, but we could just fix it going forward also if this is too cumbersome.
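For reference, a sketch of an adapted selector that keeps the existing exceptions and also ignores the e2e Loki namespace (PromQL; the label names come from the alert sample above):
ALERTS{alertstate="firing",alertname!~"Watchdog|AlertmanagerReceiversNotConfigured",namespace!="openshift-e2e-loki"}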
Bump the sigs.k8s dependencies and update dependabot groupings
This is a clone of bug OCPBUGS-35211, so that the fix can be backported to 4.16.
------
Description of problem:
The ACM perf/scale hub OCP has 3 baremetal nodes, each has 480GB for the installation disk. metal3 pod uses too much disk space for logs and make the node has disk presure and start evicting pods. which make the ACM stop provisioning clusters. below is the log size of the metal3 pods: # du -h -d 1 /sysroot/ostree/deploy/rhcos/var/log/pods/openshift-machine-api_metal3-9df7c7576-9t7dd_7c72c6d6-168d-4c8e-a3c3-3ce8c0518b83 4.0K /sysroot/ostree/deploy/rhcos/var/log/pods/openshift-machine-api_metal3-9df7c7576-9t7dd_7c72c6d6-168d-4c8e-a3c3-3ce8c0518b83/machine-os-images 276M /sysroot/ostree/deploy/rhcos/var/log/pods/openshift-machine-api_metal3-9df7c7576-9t7dd_7c72c6d6-168d-4c8e-a3c3-3ce8c0518b83/metal3-httpd 181M /sysroot/ostree/deploy/rhcos/var/log/pods/openshift-machine-api_metal3-9df7c7576-9t7dd_7c72c6d6-168d-4c8e-a3c3-3ce8c0518b83/metal3-ironic 384G /sysroot/ostree/deploy/rhcos/var/log/pods/openshift-machine-api_metal3-9df7c7576-9t7dd_7c72c6d6-168d-4c8e-a3c3-3ce8c0518b83/metal3-ramdisk-logs 77M /sysroot/ostree/deploy/rhcos/var/log/pods/openshift-machine-api_metal3-9df7c7576-9t7dd_7c72c6d6-168d-4c8e-a3c3-3ce8c0518b83/metal3-ironic-inspector 385G /sysroot/ostree/deploy/rhcos/var/log/pods/openshift-machine-api_metal3-9df7c7576-9t7dd_7c72c6d6-168d-4c8e-a3c3-3ce8c0518b83 # ls -l -h /sysroot/ostree/deploy/rhcos/var/log/pods/openshift-machine-api_metal3-9df7c7576-9t7dd_7c72c6d6-168d-4c8e-a3c3-3ce8c0518b83/metal3-ramdisk-logs total 384G -rw-------. 1 root root 203G Jun 10 12:44 0.log -rw-r--r--. 1 root root 6.5G Jun 10 09:05 0.log.20240610-084807.gz -rw-r--r--. 1 root root 8.1G Jun 10 09:27 0.log.20240610-090606.gz -rw-------. 1 root root 167G Jun 10 09:27 0.log.20240610-092755
the logs are too huge to be attached. Please contact me if you need access to the cluster to check.
Version-Release number of selected component (if applicable):
The version with the issue is 4.16.0-rc4; 4.16.0-rc3 does not have the issue.
How reproducible:
Steps to Reproduce:
1.Install latest ACM 2.11.0 build on OCP 4.16.0-rc4 and deploy 3500 SNOs on baremetal hosts 2. 3.
Actual results:
ACM stops deploying the rest of the SNOs after 1913 SNOs are deployed because ACM pods are being evicted.
Expected results:
3500 SNOs are deployed.
Additional info:
This is a clone of issue OCPBUGS-35309. The following is the description of the original issue:
—
Description of problem:
Installation of 4.16 fails with a AWS AccessDenied error trying to attach a bootstrap s3 bucket policy.
Version-Release number of selected component (if applicable):
4.16+
How reproducible:
Every time
Steps to Reproduce:
1. Create an installer policy with the permissions listed in the installer [here|https://github.com/openshift/installer/blob/master/pkg/asset/installconfig/aws/permissions.go] 2. Run a install in AWS IPI
Actual results:
Install fails attempting to attach a policy to the bootstrap S3 bucket:
time="2024-06-11T14:58:15Z" level=debug msg="I0611 14:58:15.485718 132 s3.go:256] \"Created bucket\" controller=\"awscluster\" controllerGroup=\"infrastructure.cluster.x-k8s.io\" controllerKind=\"AWSCluster\" AWSCluster=\"openshift-cluster-api-guests/jamesh-sts-8tl72\" namespace=\"openshift-cluster-api-guests\" name=\"jamesh-sts-8tl72\" reconcileID=\"c390f027-a2ee-4d37-9e5d-b6a11882c46b\" cluster=\"openshift-cluster-api-guests/jamesh-sts-8tl72\" bucket_name=\"openshift-bootstrap-data-jamesh-sts-8tl72\""
time="2024-06-11T14:58:15Z" level=debug msg="E0611 14:58:15.643613 132 controller.go:329] \"Reconciler error\" err=<"
time="2024-06-11T14:58:15Z" level=debug msg="\tfailed to reconcile S3 Bucket for AWSCluster openshift-cluster-api-guests/jamesh-sts-8tl72: ensuring bucket policy: creating S3 bucket policy: AccessDenied: Access Denied"
Expected results:
Install completes successfully
Additional info:
The installer did not attach an S3 bootstrap bucket policy in the past, as far as I can tell (see https://github.com/openshift/installer/blob/release-4.15/data/data/aws/cluster/main.tf#L133-L148); this new permission is required because of new functionality. CAPA places a policy on the bucket that denies non-SSL-encrypted traffic, which shouldn't affect installs. Adding the IAM permission that allows the policy to be attached results in a successful install.
S3 bootstrap bucket policy:
{
  "Statement": [
    {
      "Sid": "ForceSSLOnlyAccess",
      "Principal": {
        "AWS": [ "*" ]
      },
      "Effect": "Deny",
      "Action": [ "s3:*" ],
      "Resource": [ "arn:aws:s3:::openshift-bootstrap-data-jamesh-sts-2r5f7/*" ],
      "Condition": {
        "Bool": { "aws:SecureTransport": false }
      }
    }
  ]
}
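A sketch of the additional installer-policy statement that made the install succeed, assuming the missing action is s3:PutBucketPolicy (the resource pattern is a placeholder):
{
  "Effect": "Allow",
  "Action": [ "s3:PutBucketPolicy" ],
  "Resource": [ "arn:aws:s3:::openshift-bootstrap-data-*" ]
}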
Allow eviction of unhealthy (not ready) pods even if there are no disruptions allowed on a PodDisruptionBudget. This can help drain/maintain a node and recover without manual intervention when multiple instances of nodes or pods are misbehaving.
to prevent possible issues similar to https://issues.redhat.com//browse/OCPBUGS-23796
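This maps to the upstream PodDisruptionBudget unhealthyPodEvictionPolicy field; a minimal sketch (name and selector are illustrative):
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: example-pdb                         # illustrative name
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: example                          # illustrative selector
  unhealthyPodEvictionPolicy: AlwaysAllow   # evict not-ready pods even when no disruptions are allowed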
the name in setup.cfg is incorrectly set as ironic-image
it should be ironic-agent-image
because of the pin in the packages list the ART pipeline is rebuilding packages all the time
unfortunately we need to remove the strong pins and move back to relaxed ones
once that's done we need to merge https://github.com/openshift-eng/ocp-build-data/pull/4097
Description of problem:
When expanding a PVC of unit-less size (e.g., '2147483648'), the Expand PersistentVolumeClaim modal populates the spinner with a unit-less value (e.g., 2147483648) instead of a meaningful value.
Version-Release number of selected component (if applicable):
CNV - 4.14.3
How reproducible:
always
Steps to Reproduce:
1.Create a PVC using the following YAML.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: task-pv-claim
spec:
  storageClassName: gp3-csi
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: "2147483648"

apiVersion: v1
kind: Pod
metadata:
  name: task-pv-pod
spec:
  securityContext:
    runAsNonRoot: true
    seccompProfile:
      type: RuntimeDefault
  volumes:
    - name: task-pv-storage
      persistentVolumeClaim:
        claimName: task-pv-claim
  containers:
    - name: task-pv-container
      image: nginx
      ports:
        - containerPort: 80
          name: "http-server"
      volumeMounts:
        - mountPath: "/usr/share/nginx/html"
          name: task-pv-storage
2. From the newly created PVC details page, Click Actions > Expand PVC. 3. Note the value in the spinner input.
See https://drive.google.com/file/d/1toastX8rCBtUzx5M-83c9Xxe5iPA8fNQ/view for a demo
Please review the following PR: https://github.com/openshift/builder/pull/375
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of the problem:
Looks like the ODF minimum disk size validation is set to 75 GB per node while it used to be ~25 GB.
The validation should apply only when ODF is enabled.
ODFMinDiskSizeGB int64 `envconfig:"ODF_MIN_DISK_SIZE_GB" default:"25"`
insufficient ODF requirements: Insufficient resources to deploy ODF in compact mode. ODF requires a minimum of 3 hosts. Each host must have at least 1 additional disk of 75 GB minimum and an installation disk.
How reproducible:
always
Steps to reproduce:
1.
2.
3.
Actual results:
Expected results:
Creating an OS image without a cpu architecture field is currently allowed.
- openshiftVersion: "4.12" version: "rhcos-412.86.202308081039-0" url: "http://registry.ocp-edge-cluster-assisted-0.qe2.e2e.bos.redhat.com:8080/images/openshift-v4/amd64/dependencies/rhcos/4.12/4.12.30/rhcos-live.x86_64.iso"
This results in invalid InfraEnvs being allowed and assisted-image-service returning an empty ISO file
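For contrast, a complete entry also carries the architecture; a sketch assuming the cpuArchitecture field of the AgentServiceConfig osImages list:
- openshiftVersion: "4.12"
  version: "rhcos-412.86.202308081039-0"
  url: "http://registry.ocp-edge-cluster-assisted-0.qe2.e2e.bos.redhat.com:8080/images/openshift-v4/amd64/dependencies/rhcos/4.12/4.12.30/rhcos-live.x86_64.iso"
  cpuArchitecture: "x86_64"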
Assisted-image-service log (4.12- is missing the architecture):
{"file":"/remote-source/app/pkg/imagestore/imagestore.go:299","func":"github.com/openshift/assisted-image-service/pkg/imagestore.(*rhcosStore).Populate","level":"info","msg":"Finished creating minimal iso for 4.12- (rhcos-412.86.202308081039-0)","time":"2024-04-11T17:04:16Z"}
InfraEnv conditions:
[ { "lastTransitionTime": "2024-04-11T17:04:47Z", "message": "Image has been created", "reason": "ImageCreated", "status": "True", "type": "ImageCreated" } ]
InfraEnv ISODownloadURL:
isoDownloadURL: https://assisted-image-service-multicluster-engine.apps.ocp-edge-cluster-assisted-0.qe2.e2e.bos.redhat.com/byapikey/eyJhbGciOiJFUzI1NiIsInR5cCI6IkpXVCJ9.eyJpbmZyYV9lbnZfaWQiOiI1ZjZhZmZjYy0zMzMwLTQ0NTYtODkxOC1lOThmYTE5ZTU2NGQifQ.T3h-_q6yMr1JvNkWXMspNk_9MFsHOX-CGBlBIlfpgjje9k-Y6RsI_6cWdZgJTPT0nMXRJiEUuvBJZJGPNdK-MQ/4.12/x86_64/minimal.iso
Actually curling for this URL:
$ curl -kI "https://assisted-image-service-multicluster-engine.apps.ocp-edge-cluster-assisted-0.qe2.e2e.bos.redhat.com/byapikey/eyJhbGciOiJFUzI1NiIsInR5cCI6IkpXVCJ9.eyJpbmZyYV9lbnZfaWQiOiI1ZjZhZmZjYy0zMzMwLTQ0NTYtODkxOC1lOThmYTE5ZTU2NGQifQ.T3h-_q6yMr1JvNkWXMspNk_9MFsHOX-CGBlBIlfpgjje9k-Y6RsI_6cWdZgJTPT0nMXRJiEUuvBJZJGPNdK-MQ/4.12/x86_64/minimal.iso"
HTTP/1.1 404 Not Found
content-type: text/plain; charset=utf-8
x-content-type-options: nosniff
date: Thu, 11 Apr 2024 17:09:35 GMT
content-length: 19
set-cookie: 1a4b5ac1ad25c005c048fb541ba389b4=02300906d3489ab71b6417aaeed52390; path=/; HttpOnly; Secure; SameSite=None
This is a clone of issue OCPBUGS-38398. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-38174. The following is the description of the original issue:
—
Description of problem:
The prometheus operator fails to reconcile when proxy settings like no_proxy are set in the Alertmanager configuration secret.
Version-Release number of selected component (if applicable):
4.15.z and later
How reproducible:
Always when AlertmanagerConfig is enabled
Steps to Reproduce:
1. Enable UWM with AlertmanagerConfig:
   enableUserWorkload: true
   alertmanagerMain:
     enableUserAlertmanagerConfig: true
2. Edit the "alertmanager.yaml" key in the alertmanager-main secret (see attached configuration file)
3. Wait for a couple of minutes.
Actual results:
Monitoring ClusterOperator goes Degraded=True.
Expected results:
No error
Additional info:
The Prometheus operator logs show that it doesn't understand the proxy_from_environment field.
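The attached configuration is not reproduced here; a minimal sketch of the kind of Alertmanager receiver setting involved (receiver name and URL are illustrative):
receivers:
  - name: example-webhook
    webhook_configs:
      - url: "http://example.internal/alert"
        http_config:
          proxy_from_environment: true   # the field the operator fails to parse, per the log above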
This is a clone of issue OCPBUGS-38228. The following is the description of the original issue:
—
Description of problem:
On the Overview page's getting started resources card, there is an "OpenShift LightSpeed" link when this operator is available on the cluster; the text should be updated to "OpenShift Lightspeed" to be consistent with the operator name.
Version-Release number of selected component (if applicable):
4.17.0-0.nightly-2024-08-08-013133 4.16.0-0.nightly-2024-08-08-111530
How reproducible:
Always
Steps to Reproduce:
1. Check overview page's getting started resources card, 2. 3.
Actual results:
1. There is "OpenShift LightSpeed" link in "Explore new features and capabilities"
Expected results:
1. The text should be "OpenShift Lightspped" to keep consistent with operator name.
Additional info:
This is a clone of issue OCPBUGS-36140. The following is the description of the original issue:
—
Description of problem:
GCP private cluster with CCO Passthrough mode failed to install due to CCO degraded.
status:
  conditions:
  - lastTransitionTime: "2024-06-24T06:04:39Z"
    message: 1 of 7 credentials requests are failing to sync.
    reason: CredentialsFailing
    status: "True"
    type: Degraded
Version-Release number of selected component (if applicable):
4.13.0-0.nightly-2024-06-21-203120
How reproducible:
Always
Steps to Reproduce:
1.Create GCP private cluster with CCO Passthrough mode, flexy template is private-templates/functionality-testing/aos-4_13/ipi-on-gcp/versioned-installer-xpn-private 2.Wait for cluster installation
Actual results:
jianpingshu@jshu-mac ~ % oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version             False       False         23m     Error while reconciling 4.13.0-0.nightly-2024-06-21-203120: the cluster operator cloud-credential is degraded
status:
  conditions:
  - lastTransitionTime: "2024-06-24T06:04:39Z"
    message: 1 of 7 credentials requests are failing to sync.
    reason: CredentialsFailing
    status: "True"
    type: Degraded
jianpingshu@jshu-mac ~ % oc -n openshift-cloud-credential-operator get -o json credentialsrequests | jq -r '.items[] | select(tostring | contains("InfrastructureMismatch") | not) | .metadata.name as $n | .status.conditions // [{type: "NoConditions"}] | .[] | .type + "=" + .status + " " + $n + " " + .reason + ": " + .message' | sort
CredentialsProvisionFailure=True cloud-credential-operator-gcp-ro-creds CredentialsProvisionFailure: failed to grant creds: error while validating permissions: error testing permissions: googleapi: Error 400: Permission commerceoffercatalog.agreements.list is not valid for this resource., badRequest
NoConditions= openshift-cloud-network-config-controller-gcp :
NoConditions= openshift-gcp-ccm :
NoConditions= openshift-gcp-pd-csi-driver-operator :
NoConditions= openshift-image-registry-gcs :
NoConditions= openshift-ingress-gcp :
NoConditions= openshift-machine-api-gcp :
Expected results:
Cluster installed successfully without degrade
Additional info:
Some problem PROW CI tests: https://qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gs/qe-private-deck/logs/periodic-ci-openshift-openshift-tests-private-release-4.14-multi-nightly-gcp-ipi-user-labels-tags-filestore-csi-tp-arm-f14/1805064266043101184 https://qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gs/qe-private-deck/logs/periodic-ci-openshift-openshift-tests-private-release-4.14-amd64-nightly-4.14-upgrade-from-stable-4.13-gcp-ipi-xpn-fips-f28/1804676149503070208
When platform-specific passwords are included in the install-config.yaml, they are stored in the generated agent-cluster-install.yaml, which is included in the output of the agent-gather command. These passwords should be redacted.
Description of problem:
When cloning a PVC of 60GiB size, the system autofills the remote size to be 8192 PeB. This size cannot be changed in the UI before starting the clone.
Version-Release number of selected component (if applicable):
CNV - 4.14.3
How reproducible:
always
Steps to Reproduce:
1. Create a VM with a PVC of 60 GiB 2. Power off the VM 3. As a cluster admin, clone the 60 GiB PVC (Storage -> PersistentVolumeClaims -> kebab menu next to the PVC)
Actual results:
The system tries to clone the 60 GiB PVC as a 8192 PeB
Expected results:
A new pvc of the 60 GiB
Additional info:
This seems like the closed BZ 2177979. I will upload a screenshot of the UI.
Here is the yaml for the original pvc:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  annotations:
    cdi.kubevirt.io/storage.bind.immediate.requested: "true"
    cdi.kubevirt.io/storage.contentType: kubevirt
    cdi.kubevirt.io/storage.pod.phase: Succeeded
    cdi.kubevirt.io/storage.populator.progress: 100.0%
    cdi.kubevirt.io/storage.preallocation.requested: "false"
    cdi.kubevirt.io/storage.usePopulator: "true"
    pv.kubernetes.io/bind-completed: "yes"
    pv.kubernetes.io/bound-by-controller: "yes"
    volume.beta.kubernetes.io/storage-provisioner: openshift-storage.rbd.csi.ceph.com
    volume.kubernetes.io/storage-provisioner: openshift-storage.rbd.csi.ceph.com
  creationTimestamp: "2023-12-05T17:34:19Z"
  finalizers:
  - kubernetes.io/pvc-protection
  - provisioner.storage.kubernetes.io/cloning-protection
  labels:
    app: containerized-data-importer
    app.kubernetes.io/component: storage
    app.kubernetes.io/managed-by: cdi-controller
    app.kubernetes.io/part-of: hyperconverged-cluster
    app.kubernetes.io/version: 4.14.0
    kubevirt.io/created-by: 60f46f91-2db3-4118-aaba-b1697b29c496
  name: win2k19-base
  namespace: base-images
  ownerReferences:
  - apiVersion: cdi.kubevirt.io/v1beta1
    blockOwnerDeletion: true
    controller: true
    kind: DataVolume
    name: win2k19-base
    uid: 8980e7b7-ce0b-47b4-a7e4-f4c79e984ebe
  resourceVersion: "697047"
  uid: fccb0aa9-8541-4b51-b49e-ddceaa22b68c
spec:
  accessModes:
  - ReadWriteMany
  dataSource:
    apiGroup: cdi.kubevirt.io
    kind: VolumeImportSource
    name: volume-import-source-8980e7b7-ce0b-47b4-a7e4-f4c79e984ebe
  dataSourceRef:
    apiGroup: cdi.kubevirt.io
    kind: VolumeImportSource
    name: volume-import-source-8980e7b7-ce0b-47b4-a7e4-f4c79e984ebe
  resources:
    requests:
      storage: "64424509440"
  storageClassName: ocs-storagecluster-ceph-rbd
  volumeMode: Block
  volumeName: pvc-dbfc9fe9-5677-469d-9402-c2f3a22dab3f
status:
  accessModes:
  - ReadWriteMany
  capacity:
    storage: 60Gi
  phase: Bound
Here is the yaml for the cloning pvc:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  annotations:
    volume.beta.kubernetes.io/storage-provisioner: openshift-storage.rbd.csi.ceph.com
    volume.kubernetes.io/storage-provisioner: openshift-storage.rbd.csi.ceph.com
  creationTimestamp: "2023-12-06T14:24:07Z"
  finalizers:
  - kubernetes.io/pvc-protection
  name: win2k19-base-clone
  namespace: base-images
  resourceVersion: "1551054"
  uid: f72665c3-6408-4129-82a2-e663d8ecc0cc
spec:
  accessModes:
  - ReadWriteMany
  dataSource:
    apiGroup: ""
    kind: PersistentVolumeClaim
    name: win2k19-base
  dataSourceRef:
    apiGroup: ""
    kind: PersistentVolumeClaim
    name: win2k19-base
  resources:
    requests:
      storage: "9223372036854775807"
  storageClassName: ocs-storagecluster-ceph-rbd
  volumeMode: Block
status:
  phase: Pending
We frequently receive inquiries regarding the versions of monitoring components (such as Prometheus, Alertmanager, etc.) that are used in a given OCP version.
Currently, obtaining this information requires several manual steps on our part, e.g.:
What if we automate this?
How about a view that displays the versions of all components for all recent OCP versions.
The image registry operator and ingress operator use the `/metrics` endpoint for liveness/readiness probes which in the case of the former results in a payload of ~100kb. This at scale can be non-performant and is also not best practice. The teams which own these operators should instead introduce health endpoints if these probes are needed.
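For illustration, a generic probe shape pointing at a dedicated health endpoint instead of /metrics (a sketch, not the operators' actual manifests; the /healthz path and port are assumptions):
readinessProbe:
  httpGet:
    path: /healthz        # assumed lightweight health endpoint, instead of the ~100 kB /metrics payload
    port: 8080
  periodSeconds: 10
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 10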
This is a clone of issue OCPBUGS-39239. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-38918. The following is the description of the original issue:
—
Description of problem:
When installing OpenShift 4.16 on vSphere using IPI method with a template it fails with below error: 2024-08-07T09:55:51.4052628Z "level=debug msg= Fetching Image...", 2024-08-07T09:55:51.4054373Z "level=debug msg= Reusing previously-fetched Image", 2024-08-07T09:55:51.4056002Z "level=debug msg= Fetching Common Manifests...", 2024-08-07T09:55:51.4057737Z "level=debug msg= Reusing previously-fetched Common Manifests", 2024-08-07T09:55:51.4059368Z "level=debug msg=Generating Cluster...", 2024-08-07T09:55:51.4060988Z "level=info msg=Creating infrastructure resources...", 2024-08-07T09:55:51.4063254Z "level=debug msg=Obtaining RHCOS image file from 'https://rhcos.mirror.openshift.com/art/storage/prod/streams/4.16-9.4/builds/416.94.202406251923-0/x86_64/rhcos-416.94.202406251923-0-vmware.x86_64.ova?sha256=893a41653b66170c7d7e9b343ad6e188ccd5f33b377f0bd0f9693288ec6b1b73'", 2024-08-07T09:55:51.4065349Z "level=debug msg=image download content length: 12169", 2024-08-07T09:55:51.4066994Z "level=debug msg=image download content length: 12169", 2024-08-07T09:55:51.4068612Z "level=debug msg=image download content length: 12169", 2024-08-07T09:55:51.4070676Z "level=error msg=failed to fetch Cluster: failed to generate asset \"Cluster\": failed to create cluster: failed during pre-provisioning: failed to use cached vsphere image: bad status: 403"
Version-Release number of selected component (if applicable):
4.16
How reproducible:
All the time in user environment
Steps to Reproduce:
1.Try to install disconnected IPI install on vSphere using a template. 2. 3.
Actual results:
No cluster installation
Expected results:
Cluster installed with indicated template
Additional info:
- 4.14 works as expected in customer environment - 4.15 works as expected in customer environment
The Cloud Credential operator was made optional in OCP 4.15, see https://issues.redhat.com/browse/OCPEDGE-69. The CloudCredential cap was added as a new capability.
However, for OCP 4.15 the disablement of CCO is only supported on BareMetal platforms, see https://issues.redhat.com/browse/OCPEDGE-69?focusedId=23595076&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-23595076.
We propose to guard against installations on non-BareMetal platforms without the CloudCredential cap, which could be implemented similar to https://issues.redhat.com/browse/OCPBUGS-15659. 
Description of problem:
The `aws-ebs-csi-driver-node-` pods appear to be failing to deploy far too often in CI recently.
Version-Release number of selected component (if applicable):
4.14
How reproducible:
in a statistically significant pattern
Steps to Reproduce:
1. run OCP test suite many times for it to matter
Actual results:
fail [github.com/openshift/origin/test/extended/authorization/scc.go:76]: 1 pods failed before test on SCC errors Error creating: pods "aws-ebs-csi-driver-node-" is forbidden: unable to validate against any security context constraint: [provider "anyuid": Forbidden: not usable by user or serviceaccount, provider restricted-v2: .spec.securityContext.hostNetwork: Invalid value: true: Host network is not allowed to be used, spec.volumes[0]: Invalid value: "hostPath": hostPath volumes are not allowed to be used, spec.volumes[1]: Invalid value: "hostPath": hostPath volumes are not allowed to be used, spec.volumes[2]: Invalid value: "hostPath": hostPath volumes are not allowed to be used, spec.volumes[3]: Invalid value: "hostPath": hostPath volumes are not allowed to be used, spec.volumes[4]: Invalid value: "hostPath": hostPath volumes are not allowed to be used, spec.volumes[5]: Invalid value: "hostPath": hostPath volumes are not allowed to be used, provider restricted-v2: .containers[0].privileged: Invalid value: true: Privileged containers are not allowed, provider restricted-v2: .containers[0].hostNetwork: Invalid value: true: Host network is not allowed to be used, provider restricted-v2: .containers[0].containers[0].hostPort: Invalid value: 10300: Host ports are not allowed to be used, provider restricted-v2: .containers[1].privileged: Invalid value: true: Privileged containers are not allowed, provider restricted-v2: .containers[1].hostNetwork: Invalid value: true: Host network is not allowed to be used, provider restricted-v2: .containers[1].containers[0].hostPort: Invalid value: 10300: Host ports are not allowed to be used, provider restricted-v2: .containers[2].hostNetwork: Invalid value: true: Host network is not allowed to be used, provider restricted-v2: .containers[2].containers[0].hostPort: Invalid value: 10300: Host ports are not allowed to be used, provider "restricted": Forbidden: not usable by user or serviceaccount, provider "nonroot-v2": Forbidden: not usable by user or serviceaccount, provider "nonroot": Forbidden: not usable by user or serviceaccount, provider "hostmount-anyuid": Forbidden: not usable by user or serviceaccount, provider "machine-api-termination-handler": Forbidden: not usable by user or serviceaccount, provider "hostnetwork-v2": Forbidden: not usable by user or serviceaccount, provider "hostnetwork": Forbidden: not usable by user or serviceaccount, provider "hostaccess": Forbidden: not usable by user or serviceaccount, provider "privileged": Forbidden: not usable by user or serviceaccount] for DaemonSet.apps/v1/aws-ebs-csi-driver-node -n openshift-cluster-csi-drivers happened 4 times
Expected results:
Test pass
Additional info:
[sig-auth][Feature:SCC][Early] should not have pod creation failures during install [Suite:openshift/conformance/parallel]
This is a clone of issue OCPBUGS-34782. The following is the description of the original issue:
—
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
The following test started to fail frequently in the periodic tests: External Storage [Driver: pd.csi.storage.gke.io] [Testpattern: Dynamic PV (block volmode)] provisioning should provision storage with pvc data source in parallel
Version-Release number of selected component (if applicable):
4.15
How reproducible:
Sometimes, but way too often in the CI
Steps to Reproduce:
1. Run the periodic-ci-openshift-release-master-nightly-X.X-e2e-gcp-ovn-csi test
Actual results:
Provisioning of some volumes fails with time="2024-01-05T02:30:07Z" level=info msg="resulting interval message" message="{ProvisioningFailed failed to provision volume with StorageClass \"e2e-provisioning-9385-e2e-scw2z8q\": rpc error: code = Internal desc = CreateVolume failed to create single zonal disk pvc-35b558d6-60f0-40b1-9cb7-c6bdfa9f28e7: failed to insert zonal disk: unknown Insert disk operation error: rpc error: code = Internal desc = operation operation-1704421794626-60e299f9dba08-89033abf-3046917a failed (RESOURCE_OPERATION_RATE_EXCEEDED): Operation rate exceeded for resource 'projects/XXXXXXXXXXXXXXXXXXXXXXXX/zones/us-central1-a/disks/pvc-501347a5-7d6f-4a32-b0e0-cf7a896f316d'. Too frequent operations from the source resource. map[reason:ProvisioningFailed]}"
Expected results:
Test passes
Additional info:
Looks like we're hitting the API quota limits with the test
Failed test run example:
Link to Sippy:
Please review the following PR: https://github.com/openshift/ovn-kubernetes/pull/2027
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Observed in
There was a delay in creating master-0.
Control plane services started on master-2; at this point (as master-0 wasn't yet in a provisioned state) we had two sets of provisioning services provisioning master-0 and presumably stomping on each other.
master-0 never came up.
Please review the following PR: https://github.com/openshift/machine-api-provider-openstack/pull/100
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
The MCO's gcp-e2e-op-single-node job https://prow.ci.openshift.org/job-history/gs/origin-ci-test/pr-logs/directory/pull-ci-openshift-machine-config-operator-master-e2e-gcp-op-single-node has been failing consistently since early Jan. It always fails on TestKernelArguments but that happens to be the first time where it gets the node to reboot, after which the node never comes up, so we don't get must-gather and (for some reason) don't get any console gathers either. This is only 4.16 and only single node. Doing the same test on HA gcp clusters yield no issues. The test itself doesn't seem to matter as the next test would fail the same way if it was skipped. This can be reproduced so far only via a 4.16 clusterbot cluster.
Version-Release number of selected component (if applicable):
4.16
How reproducible:
100%
Steps to Reproduce:
1. install SNO 4.16 cluster 2. run MCO's TestKernelArguments 3.
Actual results:
Node never comes back up
Expected results:
Test passes
Additional info:
The story is to track i18n upload/download routine tasks which are perform every sprint.
A.C.
- Upload strings to Memosource at the start of the sprint and reach out to localization team
- Download translated strings from Memsource when it is ready
- Review the translated strings and open a pull request
- Open a followup story for next sprint
This is a clone of issue OCPBUGS-32550. The following is the description of the original issue:
—
Description of problem:
In the Safari browser, when creating an app with either the pipeline or the build option, the topology shows the status in the left-hand corner of the topology (more details can be checked in the screenshot or video).
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. Create an app 2. Go to topology 3.
Actual results:
UI is distorted with build labels, not in the appropriate position
Expected results:
UI should show labels properly
Additional info:
Safari 17.4.1
This is a clone of issue OCPBUGS-38439. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-38436. The following is the description of the original issue:
—
Description of problem:
e980 is a valid system type for the madrid region but it is not listed as such in the installer.
Version-Release number of selected component (if applicable):
How reproducible:
Easily
Steps to Reproduce:
1. Try to deploy to mad02 with SysType set to e980 2. Fail 3.
Actual results:
Installer exits
Expected results:
Installer should continue as it's a valid system type.
Additional info:
This is a clone of issue OCPBUGS-30218. The following is the description of the original issue:
—
Description of problem:
Pseudolocalization is not working in console.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. Go to any console's page and add '?pseudolocalization=true' suffix to the URL 2. 3.
Actual results:
The page stays set with the same language
Expected results:
The page should be pseudolocalized language
Additional info:
Looks like this is the issue https://github.com/MattBoatman/i18next-pseudo/issues/4
Description of problem:
whereabouts reconciler is responsible for reclaiming dangling IPs and freeing them to be available for allocation to new pods. This is crucial for scenarios where the number of addresses is limited and dangling IPs prevent whereabouts from successfully allocating new IPs to new pods. The reconciliation schedule is currently hard-coded to run once a day, without a user-friendly way to configure it.
Version-Release number of selected component (if applicable):
How reproducible:
Create a Whereabouts reconciler daemon set; there is no way to configure the reconciler schedule.
Steps to Reproduce:
1. Create a Whereabouts reconciler daemonset following the instructions in https://docs.openshift.com/container-platform/4.14/networking/multiple_networks/configuring-additional-network.html#nw-multus-creating-whereabouts-reconciler-daemon-set_configuring-additional-network
2. Run `oc get pods -n openshift-multus | grep whereabouts-reconciler`
3. Run `oc logs whereabouts-reconciler-xxxxx`
Actual results:
You can't configure the cron-schedule of the reconciler.
Expected results:
Be able to modify the reconciler cron schedule.
Additional info:
The fix for this bug is in two places: whereabouts and cluster-network-operator. For this reason, in order to verify correctly, we need to use both fixed components. Please read below for more details about how to apply the new configuration.
How to Verify:
Create a whereabouts-config ConfigMap with a custom value, and check in the whereabouts-reconciler pods' logs that it is updated, and triggering the clean up.
Steps to Verify:
1. Create a Whereabouts reconciler daemonset.
2. Wait for the whereabouts-reconciler pods to be running (it takes time for the daemonset to get created).
3. See in the logs: "[error] could not read file: <nil>, using expression from flatfile: 30 4 * * *". This means it uses the hardcoded default value (because there is no ConfigMap yet).
4. Run: oc create configmap whereabouts-config -n openshift-multus --from-literal=reconciler_cron_expression="*/2 * * * *"
5. Check the logs for: "successfully updated CRON configuration"
6. Check that within the next 2 minutes the reconciler runs: "[verbose] starting reconciler run"
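For reference, this is the ConfigMap manifest equivalent to the oc create command in step 4 (a minimal sketch derived from that command; the cron expression is the example value used above):

apiVersion: v1
kind: ConfigMap
metadata:
  name: whereabouts-config
  namespace: openshift-multus
data:
  reconciler_cron_expression: "*/2 * * * *"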
Description of problem:
When installing a cluster, if the CPMS is created with a template without a path, the ControlPlaneMachineSet operator rejects any modification to, or deletion of, the CPMS CR.
Version-Release number of selected component (if applicable):
4.16
How reproducible:
always
Steps to Reproduce:
1. Install a cluster with a generated FD
2. Once the cluster is installed, attempt to delete the CPMS
Actual results:
Deletion of CPMS is rejected due to invalid template definition
Expected results:
Deletion of CPMS completes without error.
Additional info:
The job "pull-ci-openshift-cluster-control-plane-machine-set-operator-main-e2e-vsphere-operator" is currently failing with: ~~~ control plane machine set should be able to be updated Expected success, but got an error: <*errors.StatusError | 0xc000233c20>: admission webhook "controlplanemachineset.machine.openshift.io" denied the request: spec.template.machines_v1beta1_machine_openshift_io.spec.providerSpec.value.template: Invalid value: "ci-op-7xjyyytp-91aad-zrdm2-rhcos-generated-region-generated-zone": template must be provided as the full path { ErrStatus: { TypeMeta: {Kind: "", APIVersion: ""}, ListMeta: { SelfLink: "", ResourceVersion: "", Continue: "", RemainingItemCount: nil, }, Status: "Failure", Message: "admission webhook \"controlplanemachineset.machine.openshift.io\" denied the request: spec.template.machines_v1beta1_machine_openshift_io.spec.providerSpec.value.template: Invalid value: \"ci-op-7xjyyytp-91aad-zrdm2-rhcos-generated-region-generated-zone\": template must be provided as the full path", Reason: "Forbidden", Details: nil, Code: 403, }, } failed [FAILED] Timed out after 60.000s. ~~~
As an OpenShift developer, I want to remove the image openshift-proxy-pull-test-container from the build, so that we are not affected by possible bugs during the image build.
We requested that the ART team add this image in the ticket https://issues.redhat.com/browse/ART-2961
Please review the following PR: https://github.com/openshift/kubevirt-csi-driver/pull/26
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/prometheus-operator/pull/264
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/cluster-network-operator/pull/2156
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
Usernames can contain all kinds of characters that are not allowed in resource names. Hash the name instead and use the hex representation of the result to get a usable identifier.
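A minimal sketch of the idea using standard shell tooling; the actual hash algorithm and name format used by the console are assumptions here, the example only illustrates why a hex digest is a safe identifier:

# any username, however exotic, maps to a digest containing only [0-9a-f]
$ echo -n 'alice@example.com' | sha256sum | awk '{print $1}'
# the digest can then be embedded in a resource name, e.g. user-settings-<digest>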
Version-Release number of selected component (if applicable):
4.16
How reproducible:
Always
Steps to Reproduce:
1. Log in to the web console configured to use a 3rd-party OIDC provider
2. Go to the User Preferences page / check the logs in the JavaScript console
Actual results:
The User Preferences page shows empty values instead of defaults. The JavaScript console reports things like
```
consoleFetch failed for url /api/kubernetes/api/v1/namespaces/openshift-console-user-settings/configmaps/user-settings-kubeadmin
r: configmaps "user-settings-kubeadmin" not found
```
Expected results:
I am able to persist my user preferences.
Additional info:
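To see which per-user settings ConfigMaps actually exist, the namespace and naming convention from the error message above can be used (a diagnostic sketch, not part of the fix):

$ oc -n openshift-console-user-settings get configmaps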
Please review the following PR: https://github.com/openshift/cluster-baremetal-operator/pull/394
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
https://github.com/openshift/api/pull/1829 needs to be backported to 4.15 and 4.14. The API team asked (https://redhat-internal.slack.com/archives/CE4L0F143/p1715024118699869) to have a test in place before they can review and approve a backport. This bug's goal is to implement an e2e test which would use the connect timeout tuning option.
Version-Release number of selected component (if applicable):
4.17
How reproducible:
Always
Steps to Reproduce:
N/A
Actual results:
Expected results:
Additional info:
The e2e test could have been a part of the initial implementation PR (https://github.com/openshift/cluster-ingress-operator/pull/1035).
The goal is to collect metrics about OpenShift Lightspeed because we want to understand how users are making use of the product (configuration options they enable) as well as the experience they are having when using it (e.g. response times).
Represents the llm provider+model the customer is currently using
Labels
The cardinality of the metric is around 6 currently, may grow somewhat as we add supported providers+models in the future (not all provider + model combinations are valid, so it's not cardinality of models*providers)
Represents all the provider/model combinations the customer has configured in ols (but are not necessarily currently using)
Labels
The cardinality of the metric is around 4 currently since not all provider/model combinations are valid. May grow somewhat as we add supported models in the future.
number of api calls with path + response code
Labels
cardinality is around 12 (paths times number of likely response codes)
Description of problem:
While deploying a cluster with OVNKubernetes or applying a cloud-provider-config change, all OCP nodes got a failing unit on them:
$ oc debug -q node/ostest-h9vbm-master-0 -- chroot /host sudo systemctl list-units --failed
  UNIT                       LOAD   ACTIVE SUB    DESCRIPTION
● afterburn-hostname.service loaded failed failed Afterburn Hostname

LOAD   = Reflects whether the unit definition was properly loaded.
ACTIVE = The high-level unit activation state, i.e. generalization of SUB.
SUB    = The low-level unit activation state, values depend on unit type.
1 loaded units listed.

$ oc debug -q node/ostest-h9vbm-master-0 -- chroot /host sudo systemctl status afterburn-hostname
× afterburn-hostname.service - Afterburn Hostname
     Loaded: loaded (/etc/systemd/system/afterburn-hostname.service; enabled; preset: disabled)
     Active: failed (Result: exit-code) since Tue 2023-04-18 11:48:35 UTC; 2h 26min ago
   Main PID: 1309 (code=exited, status=123)
        CPU: 148ms

Apr 18 11:48:35 ostest-h9vbm-master-0 openstack-afterburn-hostname[1314]: 1: maximum number of retries (10) reached
Apr 18 11:48:35 ostest-h9vbm-master-0 openstack-afterburn-hostname[1314]: 2: failed to fetch
Apr 18 11:48:35 ostest-h9vbm-master-0 openstack-afterburn-hostname[1314]: 3: error sending request for url (http://169.254.169.254/latest/meta-data/hostname): error trying to connect: tcp connect error: Network is unreachable (os error 101)
Apr 18 11:48:35 ostest-h9vbm-master-0 openstack-afterburn-hostname[1314]: 4: error trying to connect: tcp connect error: Network is unreachable (os error 101)
Apr 18 11:48:35 ostest-h9vbm-master-0 openstack-afterburn-hostname[1314]: 5: tcp connect error: Network is unreachable (os error 101)
Apr 18 11:48:35 ostest-h9vbm-master-0 openstack-afterburn-hostname[1314]: 6: Network is unreachable (os error 101)
Apr 18 11:48:35 ostest-h9vbm-master-0 hostnamectl[2494]: Too few arguments.
Apr 18 11:48:35 ostest-h9vbm-master-0 systemd[1]: afterburn-hostname.service: Main process exited, code=exited, status=123/n/a
Apr 18 11:48:35 ostest-h9vbm-master-0 systemd[1]: afterburn-hostname.service: Failed with result 'exit-code'.
Apr 18 11:48:35 ostest-h9vbm-master-0 systemd[1]: Failed to start Afterburn Hostname.

$ oc debug -q node/ostest-h9vbm-worker-0-fkxdr -- chroot /host sudo systemctl list-units --failed
  UNIT                       LOAD   ACTIVE SUB    DESCRIPTION
● afterburn-hostname.service loaded failed failed Afterburn Hostname

LOAD   = Reflects whether the unit definition was properly loaded.
ACTIVE = The high-level unit activation state, i.e. generalization of SUB.
SUB    = The low-level unit activation state, values depend on unit type.
1 loaded units listed.
Once the installation of the config change is done, restarting the service resolves the issue:
$ oc debug -q node/ostest-h9vbm-worker-0-fkxdr -- chroot /host sudo systemctl restart afterburn-hostname
$ oc debug -q node/ostest-h9vbm-worker-0-fkxdr -- chroot /host sudo systemctl status afterburn-hostname
○ afterburn-hostname.service - Afterburn Hostname
     Loaded: loaded (/etc/systemd/system/afterburn-hostname.service; enabled; preset: disabled)
     Active: inactive (dead) since Tue 2023-04-18 14:14:40 UTC; 9s ago
    Process: 171875 ExecStart=/usr/local/bin/openstack-afterburn-hostname (code=exited, status=0/SUCCESS)
   Main PID: 171875 (code=exited, status=0/SUCCESS)
        CPU: 119ms

Apr 18 14:14:32 ostest-h9vbm-worker-0-fkxdr systemd[1]: Starting Afterburn Hostname...
Apr 18 14:14:39 ostest-h9vbm-worker-0-fkxdr openstack-afterburn-hostname[171876]: Apr 18 14:14:39.521 WARN failed to locate config-drive, using the metadata service API instead
Apr 18 14:14:39 ostest-h9vbm-worker-0-fkxdr openstack-afterburn-hostname[171876]: Apr 18 14:14:39.583 INFO Fetching http://169.254.169.254/latest/meta-data/hostname: Attempt #1
Apr 18 14:14:40 ostest-h9vbm-worker-0-fkxdr openstack-afterburn-hostname[171876]: Apr 18 14:14:40.237 INFO Fetch successful
Apr 18 14:14:40 ostest-h9vbm-worker-0-fkxdr openstack-afterburn-hostname[171876]: Apr 18 14:14:40.237 INFO wrote hostname ostest-h9vbm-worker-0-fkxdr to /dev/stdout
Apr 18 14:14:40 ostest-h9vbm-worker-0-fkxdr systemd[1]: afterburn-hostname.service: Deactivated successfully.
Apr 18 14:14:40 ostest-h9vbm-worker-0-fkxdr systemd[1]: Finished Afterburn Hostname.
error: non-zero exit code from debug container

[stack@undercloud-0 ~]$ oc debug -q node/ostest-h9vbm-master-0 -- chroot /host sudo systemctl status afterburn-hostname
× afterburn-hostname.service - Afterburn Hostname
     Loaded: loaded (/etc/systemd/system/afterburn-hostname.service; enabled; preset: disabled)
     Active: failed (Result: exit-code) since Tue 2023-04-18 11:48:35 UTC; 2h 26min ago
   Main PID: 1309 (code=exited, status=123)
        CPU: 148ms
Version-Release number of selected component (if applicable):
Observed on 4.13.0-0.nightly-2023-04-13-171034 and 4.12.13
How reproducible:
Always
Additional info:
Allowing more retries or spreading them out over time could help resolve this. It seems that with OVN-K the network takes time to get ready, so with the current configuration the retries time out before the network is up. A must-gather link is provided in a private comment.
This is a clone of issue OCPBUGS-35727. The following is the description of the original issue:
—
Business required:
We have a recommendation to check the certificate of the default ingress controller, but only after it has expired. From the referenced KCS, it seems that many customers (hundreds) hit this issue, so Oscar Arribas Arribas suggests adding a recommendation that alerts customers before certificate expiration.
Gathering method:
1. Gather all the ingresscontroller objects (we already gathered the default ingresscontroller) with these commands:
oc get ingresscontrollers -n openshift-ingress-operator
2. Gather the operator auto-generated certificate's validity dates with these commands:
$ oc get ingresscontrollers -n openshift-ingress-operator -o yaml | grep -A1 defaultCertificate #### empty output here when certificate created by the operator
$ oc get secret router-ca -n openshift-ingress-operator -o yaml | grep crt | awk '{print $2}' | base64 -d | openssl x509 -noout -dates notBefore=Dec 28 00:00:00 2022 GMT notAfter=Jan 22 23:59:59 2024 GMT
$ oc get secret router-certs-default -n openshift-ingress -o yaml | grep crt | awk '{print $2}' | base64 -d | openssl x509 -noout -dates notBefore=Dec 28 00:00:00 2022 GMT notAfter=Jan 22 23:59:59 2024 GMT
3. Gather custom certificates' validity dates with these commands:
$ oc get ingresscontrollers -n openshift-ingress-operator -o yaml | grep -A1 defaultCertificate defaultCertificate: name: [custom-cert-secret-1]
#### for each [custom-cert-secret] above $ oc get secret [custom-cert-secret-1] -n openshift-ingress -o yaml | grep crt | awk '{print $2}' | base64 -d | openssl x509 -noout -dates notBefore=Dec 28 00:00:00 2022 GMT notAfter=Jan 22 23:59:59 2024 GMT
Other Information:
An RFE to create a cluster alert is under review: https://issues.redhat.com/browse/RFE-4269
This is a clone of issue OCPBUGS-36897. The following is the description of the original issue:
—
Description of problem:
4.16 NodePool CEL validation breaking existing/older NodePools
Version-Release number of selected component (if applicable):
4.16.0
How reproducible:
100%
Steps to Reproduce:
1. Deploy the 4.16 NodePool CRDs
2. Create a NodePool resource without spec.replicas and without spec.autoScaling
Actual results:
The NodePool "22276350-mynodepool" is invalid: spec: Invalid value: "object": One of replicas or autoScaling should be set but not both
Expected results:
NodePool to apply successfully
Additional info:
Breaking change: https://github.com/openshift/hypershift/pull/3786
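For illustration, the CEL rule quoted above accepts either of the following spec fragments but rejects a NodePool that sets both or neither; the autoScaling min/max field names are assumptions about the schema, and all other NodePool fields are omitted:

# fixed size
spec:
  replicas: 2

# or autoscaled
spec:
  autoScaling:
    min: 2
    max: 4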
Please review the following PR: https://github.com/openshift/cluster-machine-approver/pull/223
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
VolumeSnapshots data is not displayed in PVC > VolumeSnapshots tab
Version-Release number of selected component (if applicable):
4.16.0-0.ci-2024-01-05-050911
How reproducible:
Steps to Reproduce:
1. Create a PVC, e.g. "my-pvc"
2. Create a Pod and bind it to "my-pvc"
3. Create a VolumeSnapshot and associate it with "my-pvc"
4. Go to the PVC detail > VolumeSnapshots tab
Actual results:
VolumeSnapshots data is not displayed in PVC > VolumeSnapshots tab
Expected results:
VolumeSnapshots data should be displayed in PVC > VolumeSnapshots tab
Additional info:
[Jira:"Test Framework"] monitor test azure-metrics-collector collection failure in https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/28395/pull-ci-openshift-origin-master-e2e-agnostic-ovn-cmd/1724427658311241728
It looks like Azure is throttling our requests. We should probably add a retry mechanism.
Relevant thread: https://redhat-internal.slack.com/archives/C01CQA76KMX/p1699977299650309
We have detected several bugs in Console dynamic plugin SDK v1 as part of Kubevirt plugin PR #1804
These bugs affect dynamic plugins which target Console 4.15+
ERROR in [entry] [initial] kubevirt-plugin.494371abc020603eb01f.hot-update.js Missing call to loadPluginEntry
LOG from @openshift-console/dynamic-plugin-sdk-webpack/lib/webpack/loaders/dynamic-module-import-loader ../node_modules/ts-loader/index.js??ruleSet[1].rules[0].use[0]!./utils/hooks/useKubevirtWatchResource.ts
<w> Detected parse errors in /home/vszocs/work/kubevirt-plugin/src/utils/hooks/useKubevirtWatchResource.ts
WARNING in shared module @patternfly/react-core No required version specified and unable to automatically determine one. Unable to find required version for "@patternfly/react-core" in description file (/home/vszocs/work/kubevirt-plugin/node_modules/@openshift-console/dynamic-plugin-sdk/package.json). It need to be in dependencies, devDependencies or peerDependencies.
1. git clone Kubevirt plugin repo
2. switch to commit containing changes from PR #1804
3. yarn install && yarn dev to update dependencies and start local dev server
Description of problem:
When no release image is provided on a HostedCluster, the backend HyperShift operator picks the latest OCP release image within the operator's support window. Today this fails because of how the operator selects this default image. For example, the HyperShift operator for 4.14 does not support 4.15, yet 4.15.0-rc.3 is picked as the default release image today. This is a result of not anticipating that a release candidate could be reported as the latest release. The filter used to pick the latest release needs to consider patch-level releases before the next y-stream release.
Version-Release number of selected component (if applicable):
4.14
How reproducible:
100%
Steps to Reproduce:
1. Create a self-managed HCP cluster and do not specify a release image
Actual results:
the hcp will be rejected because the default release image picked does not fall within the support window
Expected results:
hcp should be created with the latest release image in the support window
Additional info:
Please review the following PR: https://github.com/openshift/multus-admission-controller/pull/78
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
This is a clone of issue OCPBUGS-33806. The following is the description of the original issue:
—
I0516 19:40:24.080597       1 controller.go:156] mbooth-psi-ph2q7-worker-0-9z9nn: reconciling Machine
I0516 19:40:24.113866       1 controller.go:200] mbooth-psi-ph2q7-worker-0-9z9nn: reconciling machine triggers delete
I0516 19:40:32.487925       1 controller.go:115] "msg"="Observed a panic in reconciler: runtime error: invalid memory address or nil pointer dereference" "controller"="machine-controller" "name"="mbooth-psi-ph2q7-worker-0-9z9nn" "namespace"="openshift-machine-api" "object"={"name":"mbooth-psi-ph2q7-worker-0-9z9nn","namespace":"openshift-machine-api"} "reconcileID"="f477312c-dd62-49b2-ad08-28f48c506c9a"
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
    panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x242a275]
goroutine 317 [running]:
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile.func1()
    /go/src/sigs.k8s.io/cluster-api-provider-openstack/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:116 +0x1e5
panic({0x29cfb00?, 0x40f1d50?})
    /usr/lib/golang/src/runtime/panic.go:914 +0x21f
sigs.k8s.io/cluster-api-provider-openstack/pkg/cloud/services/compute.(*Service).constructPorts(0x3056b80?, 0xc00074d3d0, 0xc0004fe100)
    /go/src/sigs.k8s.io/cluster-api-provider-openstack/vendor/sigs.k8s.io/cluster-api-provider-openstack/pkg/cloud/services/compute/instance.go:188 +0xb5
sigs.k8s.io/cluster-api-provider-openstack/pkg/cloud/services/compute.(*Service).DeleteInstance(0xc00074d388, 0xc000c61300?, {0x3038ae8, 0xc0008b7440}, 0xc00097e2a0, 0xc0004fe100)
    /go/src/sigs.k8s.io/cluster-api-provider-openstack/vendor/sigs.k8s.io/cluster-api-provider-openstack/pkg/cloud/services/compute/instance.go:678 +0x42d
github.com/openshift/machine-api-provider-openstack/pkg/machine.(*OpenstackClient).Delete(0xc0001f2380, {0x304f708?, 0xc000c6df80?}, 0xc0008b7440)
    /go/src/sigs.k8s.io/cluster-api-provider-openstack/pkg/machine/actuator.go:341 +0x305
github.com/openshift/machine-api-operator/pkg/controller/machine.(*ReconcileMachine).Reconcile(0xc00045de50, {0x304f708, 0xc000c6df80}, {{{0xc00066c7f8?, 0x0?}, {0xc000dce980?, 0xc00074dd48?}}})
    /go/src/sigs.k8s.io/cluster-api-provider-openstack/vendor/github.com/openshift/machine-api-operator/pkg/controller/machine/controller.go:216 +0x1cfe
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile(0x3052e08?, {0x304f708?, 0xc000c6df80?}, {{{0xc00066c7f8?, 0xb?}, {0xc000dce980?, 0x0?}}})
    /go/src/sigs.k8s.io/cluster-api-provider-openstack/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:119 +0xb7
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc0004eb900, {0x304f740, 0xc00045c500}, {0x2ac0340?, 0xc0001480c0?})
    /go/src/sigs.k8s.io/cluster-api-provider-openstack/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:316 +0x3cc
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc0004eb900, {0x304f740, 0xc00045c500})
    /go/src/sigs.k8s.io/cluster-api-provider-openstack/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:266 +0x1c9
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2()
    /go/src/sigs.k8s.io/cluster-api-provider-openstack/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:227 +0x79
created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2 in goroutine 269
    /go/src/sigs.k8s.io/cluster-api-provider-openstack/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:223 +0x565
> kc get clusterversion
NAME      VERSION       AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.16.0-ec.6   True        False         7d3h    Cluster version is 4.16.0-ec.6
> kc -n openshift-machine-api get machines.m mbooth-psi-ph2q7-worker-0-9z9nn -o yaml apiVersion: machine.openshift.io/v1beta1 kind: Machine metadata: annotations: machine.openshift.io/instance-state: ERROR openstack-resourceId: dc08c2a2-cbda-4892-a06b-320d02ec0c6c creationTimestamp: "2024-05-16T16:53:16Z" deletionGracePeriodSeconds: 0 deletionTimestamp: "2024-05-16T19:23:44Z" finalizers: - machine.machine.openshift.io generateName: mbooth-psi-ph2q7-worker-0- generation: 3 labels: machine.openshift.io/cluster-api-cluster: mbooth-psi-ph2q7 machine.openshift.io/cluster-api-machine-role: worker machine.openshift.io/cluster-api-machine-type: worker machine.openshift.io/cluster-api-machineset: mbooth-psi-ph2q7-worker-0 machine.openshift.io/instance-type: ci.m1.xlarge machine.openshift.io/region: regionOne machine.openshift.io/zone: "" name: mbooth-psi-ph2q7-worker-0-9z9nn namespace: openshift-machine-api ownerReferences: - apiVersion: machine.openshift.io/v1beta1 blockOwnerDeletion: true controller: true kind: MachineSet name: mbooth-psi-ph2q7-worker-0 uid: f715dba2-b0b2-4399-9ab6-19daf6407bd7 resourceVersion: "8391649" uid: 6d1ad181-5633-43eb-9b19-7c73c86045c3 spec: lifecycleHooks: {} metadata: {} providerID: openstack:///dc08c2a2-cbda-4892-a06b-320d02ec0c6c providerSpec: value: apiVersion: machine.openshift.io/v1alpha1 cloudName: openstack cloudsSecret: name: openstack-cloud-credentials namespace: openshift-machine-api flavor: ci.m1.xlarge image: "" kind: OpenstackProviderSpec metadata: creationTimestamp: null networks: - filter: {} subnets: - filter: tags: openshiftClusterID=mbooth-psi-ph2q7 rootVolume: diskSize: 50 sourceUUID: rhcos-4.16 volumeType: tripleo securityGroups: - filter: {} name: mbooth-psi-ph2q7-worker serverGroupName: mbooth-psi-ph2q7-worker serverMetadata: Name: mbooth-psi-ph2q7-worker openshiftClusterID: mbooth-psi-ph2q7 tags: - openshiftClusterID=mbooth-psi-ph2q7 trunk: true userDataSecret: name: worker-user-data status: addresses: - address: mbooth-psi-ph2q7-worker-0-9z9nn type: Hostname - address: mbooth-psi-ph2q7-worker-0-9z9nn type: InternalDNS conditions: - lastTransitionTime: "2024-05-16T16:56:05Z" status: "True" type: Drainable - lastTransitionTime: "2024-05-16T19:24:26Z" message: Node drain skipped status: "True" type: Drained - lastTransitionTime: "2024-05-16T17:14:59Z" status: "True" type: InstanceExists - lastTransitionTime: "2024-05-16T16:56:05Z" status: "True" type: Terminable lastUpdated: "2024-05-16T19:23:52Z" phase: Deleting
Previously, in OCPBUGS-32105, we fixed a bug where a race between the assisted-installer and the assisted-installer-controller to mark a Node as Joined would result in 30+ minutes of (unlogged) retries by the former if the latter won. This was indistinguishable from the installation process hanging, and it would eventually time out.
This bug has been fixed, but we were unable to reproduce the circumstances that caused it.
However, a reproduction by the customer reveals another problem: we now correctly retry checking the control plane nodes for readiness if we encounter a conflict with another write from assisted-installer-controller. However, we never reload fresh data from assisted-service - data that would show the host has already been updated and thus prevent us from trying to update it again. Therefore, we continue to get a conflict on every retry. (This is at least now logged, so we can see what is happening.)
This also suggests a potential way to reproduce the problem: whenever one control plane node has booted to the point that the assisted-installer-controller is running before the second control plane node has booted to the point that the Node is marked as ready in the k8s API, there is a possibility of a race. There is in fact no need for the write from assisted-installer-controller to come in the narrow window between when assisted-installer reads vs. writes to the assisted-service API, because assisted-installer is always using a stale read.
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
Installer now errors when attempting to use networkType: OpenShiftSDN; but the message still says "deprecated".
Version-Release number of selected component (if applicable):
4.15+
How reproducible:
100%
Steps to Reproduce:
1. Attempt to install 4.15+ with networkType: OpenShiftSDN
2. Observe the error in the logs:
time="2024-03-01T14:37:25Z" level=error msg="failed to fetch Master Machines: failed to load asset \"Install Config\": failed to create install config: invalid \"install-config.yaml\" file: networking.networkType: Invalid value: \"OpenShiftSDN\": networkType OpenShiftSDN is deprecated, please use OVNKubernetes"
Actual results:
Observe error in logs: time="2024-03-01T14:37:25Z" level=error msg="failed to fetch Master Machines: failed to load asset \"Install Config\": failed to create install config: invalid \"install-config.yaml\" file: networking.networkType: Invalid value: \"OpenShiftSDN\": networkType OpenShiftSDN is deprecated, please use OVNKubernetes"
Expected results:
A message more like:
time="2024-03-01T14:37:25Z" level=error msg="failed to fetch Master Machines: failed to load asset \"Install Config\": failed to create install config: invalid \"install-config.yaml\" file: networking.networkType: Invalid value: \"OpenShiftSDN\": networkType OpenShiftSDN is not supported, please use OVNKubernetes"
Additional info:
See thread
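For reference, the install-config.yaml fragment that avoids the error (field path taken from the error message; all other fields omitted):

networking:
  networkType: OVNKubernetes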
This is a clone of issue OCPBUGS-42784. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-42745. The following is the description of the original issue:
—
flowschemas.v1beta3.flowcontrol.apiserver.k8s.io used in manifests/09_flowschema.yaml
Description of problem:
If there are several ICSP objects in a single file, only the first object in the file is taken into account when converting ICSP to IDMS.
Version-Release number of selected component (if applicable):
> 4.13
How reproducible:
It is reproducible; follow the steps below.
Steps to Reproduce:
1. Create a manifest file with multiple ICSP objects (the ICSP manifest file generated by oc-mirror contains multiple ICSP objects)
2. Convert the ICSP to IDMS using the command: oc adm migrate icsp <file1> --dest-dir=<dest dir>
3. Look at the generated IDMS manifest file in the destination directory. It contains only the first ICSP object present in the file; the rest of the ICSP objects are ignored.
Actual results:
Only the first ICSP object is converted into an IDMS; the rest are ignored.
Expected results:
If there are multiple ICSP entries in a single file, the conversion should convert all of the ICSP objects to IDMS.
Additional info:
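To illustrate the failing input, a single file containing two ICSP documents of the kind oc-mirror produces might look like the sketch below (object names, sources, and mirrors are placeholders); per this bug, running oc adm migrate icsp against such a file converts only the first document:

apiVersion: operator.openshift.io/v1alpha1
kind: ImageContentSourcePolicy
metadata:
  name: icsp-0
spec:
  repositoryDigestMirrors:
  - mirrors:
    - mirror.example.com/ocp/release
    source: quay.io/openshift-release-dev/ocp-release
---
apiVersion: operator.openshift.io/v1alpha1
kind: ImageContentSourcePolicy
metadata:
  name: icsp-1
spec:
  repositoryDigestMirrors:
  - mirrors:
    - mirror.example.com/operators
    source: registry.redhat.io/redhat/redhat-operator-index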
Description of problem:
The HCP CSR flow allows any CN in the incoming CSR.
Version-Release number of selected component (if applicable):
4.16.0
How reproducible:
Using the CSR flow, any name you add to the CN in the CSR becomes your username against the Kubernetes API server - check your username using the SelfSubjectReview API (kubectl auth whoami)
Steps to Reproduce:
1. Create a CSR with CN=whatever
2. Once the CSR is signed, create a kubeconfig
3. Using that kubeconfig, kubectl auth whoami shows whatever CN was used (see the example commands below)
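A minimal sketch of steps 1 and 3 with standard tooling; file names and the example CN are illustrative, and the prefix comes from the Expected results below:

# 1. key + CSR carrying the desired username in the subject CN
$ openssl req -new -newkey rsa:2048 -nodes \
    -keyout client.key -out client.csr \
    -subj "/CN=system:customer-break-glass:alice"

# 3. after the CSR is signed and a kubeconfig is built from the issued certificate
$ kubectl auth whoami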
Actual results:
any CN in CSR is the username against the cluster
Expected results:
we should only allow CNs with some known prefix (system:customer-break-glass:...)
Additional info:
Description of problem:
When creating hypershift cluster in disconnected env, the worker node cannot pass the validation of the assisted service due to an ignition error.
Version-Release number of selected component (if applicable):
4.14.z
How reproducible:
100 %
Steps to Reproduce:
1. Follow the steps to install an HCP cluster as described in the documentation: https://hypershift-docs.netlify.app/labs/dual/mce/agentserviceconfig/#assisted-service-customization
Actual results:
Node addition fails
Expected results:
Node should get added to the cluster
Additional info:
Change "the supported traditional Chinese" to "the supported simplified Chinese"
This is a clone of issue OCPBUGS-37534. The following is the description of the original issue:
—
Description of problem:
Prow jobs upgrading from 4.9 to 4.16 are failing when they upgrade from 4.12 to 4.13. Nodes become NotReady when MCO tries to apply the new 4.13 configuration to the MCPs.

The failing job is: periodic-ci-openshift-openshift-tests-private-release-4.16-amd64-nightly-4.16-upgrade-from-stable-4.9-azure-ipi-f28

We have reproduced the issue and found an ordering cycle error in the journal log:

Wed 2024-07-24 21:12:17 UTC ci-op-g94jvswm-cc71e-998q8-master-2 systemd-journald.service[838]: Runtime Journal (/run/log/journal/960b04f10e4f44d98453ce5faae27e84) is 8.0M, max 641.9M, 633.9M free.
Wed 2024-07-24 21:12:17 UTC ci-op-g94jvswm-cc71e-998q8-master-2 init.scope[1]: machine-config-daemon-pull.service: Found ordering cycle on network-online.target/start
Wed 2024-07-24 21:12:17 UTC ci-op-g94jvswm-cc71e-998q8-master-2 init.scope[1]: machine-config-daemon-pull.service: Found dependency on node-valid-hostname.service/start
Wed 2024-07-24 21:12:17 UTC ci-op-g94jvswm-cc71e-998q8-master-2 init.scope[1]: machine-config-daemon-pull.service: Found dependency on ovs-configuration.service/start
Wed 2024-07-24 21:12:17 UTC ci-op-g94jvswm-cc71e-998q8-master-2 init.scope[1]: machine-config-daemon-pull.service: Found dependency on firstboot-osupdate.target/start
Wed 2024-07-24 21:12:17 UTC ci-op-g94jvswm-cc71e-998q8-master-2 init.scope[1]: machine-config-daemon-pull.service: Found dependency on machine-config-daemon-firstboot.service/start
Wed 2024-07-24 21:12:17 UTC ci-op-g94jvswm-cc71e-998q8-master-2 init.scope[1]: machine-config-daemon-pull.service: Found dependency on machine-config-daemon-pull.service/start
Wed 2024-07-24 21:12:17 UTC ci-op-g94jvswm-cc71e-998q8-master-2 init.scope[1]: machine-config-daemon-pull.service: Job network-online.target/start deleted to break ordering cycle starting with machine-config-daemon-pull.service/start
Wed 2024-07-24 21:12:17 UTC ci-op-g94jvswm-cc71e-998q8-master-2 init.scope[1]: Queued start job for default target Graphical Interface.
Wed 2024-07-24 21:12:17 UTC ci-op-g94jvswm-cc71e-998q8-master-2 init.scope[1]: systemd-journald.service: unit configures an IP firewall, but the local system does not support BPF/cgroup firewalling.
Wed 2024-07-24 21:12:17 UTC ci-op-g94jvswm-cc71e-998q8-master-2 init.scope[1]: (This warning is only shown for the first unit using IP firewalling.)
Wed 2024-07-24 21:12:17 UTC ci-op-g94jvswm-cc71e-998q8-master-2 init.scope[1]: systemd-journald.service: Deactivated successfully.
Version-Release number of selected component (if applicable):
Using IPI on Azure, these are the versions involved in the current issue upgrading from 4.9 to 4.13:
version: 4.13.0-0.nightly-2024-07-23-154444
version: 4.12.0-0.nightly-2024-07-23-230744
version: 4.11.59
version: 4.10.67
version: 4.9.59
How reproducible:
Always
Steps to Reproduce:
1. Upgrade an IPI on Azure cluster from 4.9 to 4.13. Theoretically, upgrading from 4.12 to 4.13 should be enough, but we reproduced it following the whole path.
Actual results:
Nodes become NotReady:
$ oc get nodes
NAME                                                 STATUS                        ROLES    AGE     VERSION
ci-op-g94jvswm-cc71e-998q8-master-0                  Ready                         master   6h14m   v1.25.16+306a47e
ci-op-g94jvswm-cc71e-998q8-master-1                  Ready                         master   6h13m   v1.25.16+306a47e
ci-op-g94jvswm-cc71e-998q8-master-2                  NotReady,SchedulingDisabled   master   6h13m   v1.25.16+306a47e
ci-op-g94jvswm-cc71e-998q8-worker-centralus1-c7ngb   NotReady,SchedulingDisabled   worker   6h2m    v1.25.16+306a47e
ci-op-g94jvswm-cc71e-998q8-worker-centralus2-2ppf6   Ready                         worker   6h4m    v1.25.16+306a47e
ci-op-g94jvswm-cc71e-998q8-worker-centralus3-nqshj   Ready                         worker   6h6m    v1.25.16+306a47e
In the NotReady nodes we can see the ordering cycle error mentioned in the description of this ticket.
Expected results:
No ordering cycle error should happen and the upgrade should be executed without problems.
Additional info:
Please review the following PR: https://github.com/openshift/network-tools/pull/108
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
This is a clone of issue OCPBUGS-35188. The following is the description of the original issue:
—
Description of problem:
Now that capi/aws is the default in 4.16+, the old terraform aws configs won't be maintained since there is no way to use them. Users interested in the configs can still access them in the 4.15 branch where they are still maintained as the installer still uses terraform.
Version-Release number of selected component (if applicable):
4.16+
How reproducible:
always
Steps to Reproduce:
1. 2. 3.
Actual results:
terraform aws configs are left in the repo.
Expected results:
Configs are removed.
Additional info:
Please review the following PR: https://github.com/openshift/baremetal-operator/pull/327
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
This is a clone of issue OCPBUGS-35450. The following is the description of the original issue:
—
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
This is a clone of issue OCPBUGS-12699. The following is the description of the original issue:
—
Description of problem:
Proxy settings in buildDefaults preserved in image
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
I have a customer whose developers need proxy access during builds. For this they have configured buildDefaults on their cluster as described here: https://docs.openshift.com/container-platform/4.10/cicd/builds/build-configuration.html.

The problem is that buildDefaults.defaultProxy sets the proxy environment variables in uppercase. Several Red Hat S2I images use tools that depend on curl, and curl only supports lower-case proxy environment variables, so the defaultProxy settings are not taken into account. To work around this behavior defect, they have configured:
- buildDefaults.env.http_proxy
- buildDefaults.env.https_proxy
- buildDefaults.env.no_proxy

The side effect is that the lowercase environment variables are preserved in the container image. At runtime the proxy settings are therefore still active, and they constantly have to support developers to unset them again (when using non-FQDNs, for example). This is causing frustration for them and their developers.

1. Why can't buildDefaults.defaultProxy set both lower- and uppercase proxy variables?
2. Why are the buildDefaults.env values preserved in the container image, while buildDefaults.defaultProxy is correctly unset/removed from the container image? As the name implies, for us "buildDefaults" should only apply during the build, and the settings should be removed before the image is pushed to the registry.

We also shared the following KCS with them: https://access.redhat.com/solutions/1575513. The customer was not satisfied with that and responded with the following: "The article does not provide a solution to the problem. It describes the same and gives a dirty workaround a developer will have to apply on each individual BuildConfig. This is not wanted. The fact that we set these envs using buildDefaults is the same workaround. But still the core problem remains: the envs are preserved in the container image when using this workaround. This needs to be addressed by engineering so this is fixed properly."
Actual results:
Expected results:
Additional info:
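For context, a sketch of the cluster-wide build configuration the customer describes (proxy URLs and noProxy entries are placeholders); defaultProxy injects the upper-case variables, so the lower-case ones were added via env, and it is those env entries that end up baked into the image:

apiVersion: config.openshift.io/v1
kind: Build
metadata:
  name: cluster
spec:
  buildDefaults:
    defaultProxy:
      httpProxy: http://proxy.example.com:3128
      httpsProxy: http://proxy.example.com:3128
      noProxy: .cluster.local,.svc
    env:
    - name: http_proxy
      value: http://proxy.example.com:3128
    - name: https_proxy
      value: http://proxy.example.com:3128
    - name: no_proxy
      value: .cluster.local,.svc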
Description of problem:
CCO reports credsremoved mode in metrics when the cluster is actually in the default mode. See https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_release/47349/rehearse-47349-pull-ci-openshift-cloud-credential-operator-release-4.16-e2e-aws-qe/1744240905512030208 (OCP-31768).
Version-Release number of selected component (if applicable):
4.16
How reproducible:
Always.
Steps to Reproduce:
1. Creates an AWS cluster with CCO in the default mode (ends up in mint) 2. Get the value of the cco_credentials_mode metric
Actual results:
credsremoved
Expected results:
mint
Root cause:
The controller-runtime client used in metrics calculator (https://github.com/openshift/cloud-credential-operator/blob/77a68ad01e75162bfa04097b22f80d305c192439/pkg/operator/metrics/metrics.go#L77) is unable to GET the root credentials Secret (https://github.com/openshift/cloud-credential-operator/blob/77a68ad01e75162bfa04097b22f80d305c192439/pkg/operator/metrics/metrics.go#L184) since it is backed by a cache which only contains target Secrets requested by other operators (https://github.com/openshift/cloud-credential-operator/blob/77a68ad01e75162bfa04097b22f80d305c192439/pkg/cmd/operator/cmd.go#L164-L168).
Please review the following PR: https://github.com/openshift/csi-node-driver-registrar/pull/62
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
The olm-operator pod has initialization errors in the logs in a HyperShift deployment. It appears that the --writePackageServerStatusName="" argument is being interpreted as the literal string \"\" instead of an empty string.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
$ kubectl -n master-coh67vr100a3so6e7erg logs olm-operator-75474cfd48-w2fp5
Actual results:
Several errors that look like this:
time="2024-04-19T12:41:32Z" level=error msg="initialization error - failed to ensure name=\"\" - ClusterOperator.config.openshift.io \"\\\"\\\"\" is invalid: metadata.name: Invalid value: \"\\\"\\\"\": a lowercase RFC 1123 subdomain must consist of lower case alphanumeric characters, '-' or '.', and must start and end with an alphanumeric character (e.g. 'example.com', regex used for validation is '[a-z0-9]([-a-z0-9]*[a-z0-9])?(\\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*')" monitor=clusteroperator
Expected results:
No errors
Additional info:
This is a clone of issue OCPBUGS-36495. The following is the description of the original issue:
—
Description of problem:
With the release of 4.16, the Prometheus adapter[0] is deprecated and there is a new alert[1], ClusterMonitoringOperatorDeprecatedConfig. There need to be better details on how these alerts can be handled, which will reduce the number of support cases.
[0] https://docs.openshift.com/container-platform/4.16/release_notes/ocp-4-16-release-notes.html#ocp-4-16-prometheus-adapter-removed
[1] https://docs.openshift.com/container-platform/4.16/release_notes/ocp-4-16-release-notes.html#ocp-4-16-monitoring-changes-to-alerting-rules
Version-Release number of selected component (if applicable):
4.16
How reproducible:
NA
Steps to Reproduce:
NA
Actual results:
With the current configuration, the alert is not accompanied by much clarifying information.
Expected results:
More information should be provided on how to fix the alert.
Additional info:
As per the discussion, a runbook will be added which will help in better understanding the alert.
Description of problem:
When we remove the additionalTrustBundle CA of the mirror registry (user-ca-bundle) that was passed via install-config.yaml for an Agent-based installation, MCO does not remove the certificate from the nodes.
$ oc version
Client Version: 4.15.23
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: 4.15.23
Kubernetes Version: v1.28.11+add48d0
$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.15.23   True        False         3h2m    Cluster version is 4.15.23
How reproducible:
Always
Steps to Reproduce:
1. Create a cluster with additionalTrustBundle CA in the install-config
2. Locate the mirror registry CA certificate stored in the node's /etc/pki/ directory
~~~
cd /etc/pki/ca-trust/source/anchors
[root@master1 anchors]# ls -la
total 216
drwxr-xr-x. 2 root root     49 Sep 18 05:23 .
drwxr-xr-x. 4 root root     80 Sep 18 05:20 ..
-rw-------. 1 root root 220593 Sep 18 05:23 openshift-config-user-ca-bundle.crt
~~~
3. Back up and delete the CM (user-ca-bundle)
~~~
$ oc delete configmap/user-ca-bundle -n openshift-config
configmap "user-ca-bundle" deleted
~~~
4. Observe whether any changes happen at the MCO/MCP level as a result.
5. Switch to the node and check the same /etc/pki/ path to see if the CA is present or not
Actual results:
The certificate is still present under /etc/pki/ca-trust/source/anchors on the nodes. No new MC got generated.
# cd /etc/pki/ca-trust/source/anchors
[root@master1 anchors]# ls -la
total 216
drwxr-xr-x. 2 root root     49 Sep 18 05:23 .
drwxr-xr-x. 4 root root     80 Sep 18 05:20 ..
-rw-------. 1 root root 220593 Sep 18 05:23 openshift-config-user-ca-bundle.crt
[root@master1 anchors]# cat openshift-config-user-ca-bundle.crt | grep "MIID2TCCAsGgAwIBAgIUb1e2U0GXeW5qmTlgzE8SSDvht2YwDQYJKoZIhvcNAQEL"
MIID2TCCAsGgAwIBAgIUb1e2U0GXeW5qmTlgzE8SSDvht2YwDQYJKoZIhvcNAQEL
MIID2TCCAsGgAwIBAgIUb1e2U0GXeW5qmTlgzE8SSDvht2YwDQYJKoZIhvcNAQEL
Expected results:
A new MC should be created once user-ca-bundle has been removed, the MC should be rolled out to the nodes, and the certificate should be removed from the nodes.
Additional info:
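For context, the stanza in install-config.yaml that carries the mirror registry CA is roughly the following (a minimal sketch; the certificate content is a placeholder):

additionalTrustBundle: |
  -----BEGIN CERTIFICATE-----
  <mirror registry CA certificate>
  -----END CERTIFICATE-----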
If there is an existing configuration, running `hypershift install` should not overwrite it.
Please review the following PR: https://github.com/openshift/sdn/pull/600
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
This is a duplicate of https://issues.redhat.com/browse/ART-8361; since we are not able to set `target` on ART bugs, we are creating the issue here.
Description of problem:
The button text for VolumeSnapshotContents is incorrect
Version-Release number of selected component (if applicable):
4.16.0-0.nightly-2024-04-02-182836
How reproducible:
always
Steps to Reproduce:
1. Navigate to the Storage -> VolumeSnapshotContents page: /k8s/cluster/snapshot.storage.k8s.io~v1~VolumeSnapshotContent
2. Check the create button text
Actual results:
the text in the button shows 'Create VolumeSnapshot'
Expected results:
the text in the button should be 'Create VolumeSnapshotContents'
Additional info:
4.16 payloads are failing because of multiple issues. One of them is a missing Python module on RHEL 9.
Here is a slack thread on it: https://redhat-internal.slack.com/archives/C01CQA76KMX/p1707744121181479
PR for the fix:
In order to address OCPBUGS-30905
Bump x/net to at least v0.24.0 to mitigate CVE-2023-45288
Description of problem:
Based on the discussion in https://issues.redhat.com/browse/OCPBUGS-24044 and the discussion in this Slack thread (https://redhat-internal.slack.com/archives/CBWMXQJKD/p1700510945375019), we need to update our CI and some of the work done for mutable scope in NE-621.
Specifically, we need to
Version-Release number of selected component (if applicable):
4.15
How reproducible:
100%
Steps to Reproduce:
1. Run the CI test TestUnmanagedDNSToManagedDNSInternalIngressController
2. Observe the failure in unmanaged-migrated-internal
Actual results:
CI tests fail.
Expected results:
CI tests shouldn't fail.
Additional info:
This is a change from past behavior, as reported in https://issues.redhat.com/browse/OCPBUGS-24044. Further discussion revealed that the new behavior is currently expected but the old behavior could be restored in the future. Notes to SRE and release notes are needed for this change in behavior.
Description of problem:
In ROSA/OCP 4.14.z, attaching the AmazonEC2ContainerRegistryReadOnly policy to the worker nodes (in ROSA's case, it was attached to the ManagedOpenShift-Worker-Role, which the installer assigns to all the worker nodes) has no effect on ECR image pulls. The user gets an authentication error. Attaching the policy should ideally remove the need to provide an image pull secret; however, the error is resolved only if the user also provides an image pull secret. This is proven to work correctly in 4.12.z, so it seems something has changed in recent OCP versions.
Version-Release number of selected component (if applicable):
4.14.2 (ROSA)
How reproducible:
The issue is reproducible using the below steps.
Steps to Reproduce:
1. Create a deployment in ROSA or OCP on AWS, pointing at a private ECR repository
2. The image pull fails with "Error: ErrImagePull" and "authentication required" errors
Actual results:
The image pull fails with "Error: ErrImagePull" & "authentication required" errors. However, the image pull is successful only if the user provides an image-pull-secret to the deployment.
Expected results:
The image should be pulled successfully by virtue of the ECR-read-only policy attached to the worker node role; without needing an image-pull-secret.
Additional info:
In other words:
in OCP 4.13 (and below) if a user adds the ECR:* permissions to the worker instance profile, then the user can specify ECR images and authentication of the worker node to ECR is done using the instance profile. In 4.14 this no longer works.
It is not sufficient as an alternative, to provide a pull secret in a deployment because AWS rotates ECR tokens every 12 hours. That is not a viable solution for customers that until OCP 4.13, did not have to rotate pull secrets constantly.
The experience in 4.14 should be the same as in 4.13 with ECR.
The current AWS policy that's used is this one: `arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly`
{ "Version": "2012-10-17", "Statement": [ { "Effect": "Allow", "Action": [ "ecr:GetAuthorizationToken", "ecr:BatchCheckLayerAvailability", "ecr:GetDownloadUrlForLayer", "ecr:GetRepositoryPolicy", "ecr:DescribeRepositories", "ecr:ListImages", "ecr:DescribeImages", "ecr:BatchGetImage", "ecr:GetLifecyclePolicy", "ecr:GetLifecyclePolicyPreview", "ecr:ListTagsForResource", "ecr:DescribeImageScanFindings" ], "Resource": "*" } ] }
Description of problem:
Altinfra build jobs are failing
Version-Release number of selected component (if applicable):
4.16
How reproducible:
Always
Steps to Reproduce:
1. Build the master installer and use the latest nightly 4.16 release image
2. Run the CAPI-enabled installer with FeatureSet CustomNoUpgrade and featureGates: ["ClusterAPIInstall=true"] (see the install-config fragment below)
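A minimal install-config.yaml fragment for step 2 (all other required fields omitted):

featureSet: CustomNoUpgrade
featureGates:
- ClusterAPIInstall=true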
Actual results:
Cluster fails to complete bootstrap
Expected results:
Cluster is able to install completely
Additional info:
This bug is to track investigation into why altinfra e2e jobs were failing for: https://prow.ci.openshift.org/job-history/gs/test-platform-results/pr-logs/directory/pull-ci-openshift-installer-master-altinfra-e2e-vsphere-capi-ovn
Upon looking into it, the etcd operator was not being created. We saw the following:
CVO:
E0402 17:18:59.959209       1 task.go:124] error running apply for etcd "cluster" (108 of 937): failed to get resource type: no matches for kind "Etcd" in version "operator.openshift.io/v1"
E0402 17:19:03.862993       1 task.go:124] error running apply for etcd "cluster" (108 of 937): failed to get resource type: no matches for kind "Etcd" in version "operator.openshift.io/v1"
E0402 17:19:09.157126       1 task.go:124] error running apply for etcd "cluster" (108 of 937): failed to get resource type: no matches for kind "Etcd" in version "operator.openshift.io/v1"
I0402 17:19:20.234944       1 task_graph.go:550] Result of work: [Could not update etcd "cluster" (108 of 937): the server does not recognize this resource, check extension API servers
Cluster operator kube-apiserver is not available
Cluster operator machine-api is not available
Cluster operator authentication is not available
Cluster operator image-registry is not available
Cluster operator ingress is not available
Cluster operator monitoring is not available
Cluster operator openshift-apiserver is not available
Could not update rolebinding "openshift/cluster-samples-operator-openshift-edit" (536 of 937): resource may have been deleted
Could not update oauthclient "console" (597 of 937): the server does not recognize this resource, check extension API servers
Could not update imagestream "openshift/driver-toolkit" (659 of 937): resource may have been deleted
Could not update role "openshift/copied-csv-viewer" (727 of 937): resource may have been deleted
Could not update role "openshift-console-operator/prometheus-k8s" (855 of 937): resource may have been deleted
Could not update role "openshift-console/prometheus-k8s" (859 of 937): resource may have been deleted]
I0402 17:19:20.235037       1 sync_worker.go:1166] Update error 108 of 937: UpdatePayloadResourceTypeMissing Could not update etcd "cluster" (108 of 937): the server does not recognize this resource, check extension API servers (*errors.withStack: failed to get resource type: no matches for kind "Etcd" in version "operator.openshift.io/v1")
* Could not update etcd "cluster" (108 of 937): the server does not recognize this resource, check extension API servers
Description of problem:
After applying an EgressQoS on OCP, the status of the EgressQoS is empty. Checking the ovnkube pod logs shows errors like the following:
I0429 09:39:19.013461    4771 egressqos.go:460] Processing sync for EgressQoS abc/default
I0429 09:39:19.022635    4771 egressqos.go:463] Finished syncing EgressQoS default on namespace abc : 9.174361ms
E0429 09:39:19.028426    4771 egressqos.go:368] failed to update EgressQoS object abc/default with status: Apply failed with 1 conflict: conflict with "ip-10-0-62-24.us-east-2.compute.internal" with subresource "status": .status.conditions
I0429 09:39:19.031526    4771 egressqos.go:460] Processing sync for EgressQoS default/default
I0429 09:39:19.039827    4771 egressqos.go:463] Finished syncing EgressQoS default on namespace default : 8.322774ms
E0429 09:39:19.044060    4771 egressqos.go:368] failed to update EgressQoS object default/default with status: Apply failed with 1 conflict: conflict with "ip-10-0-70-102.us-east-2.compute.internal" with subresource "status": .status.conditions
I0429 09:39:19.052877    4771 egressqos.go:460] Processing sync for EgressQoS abc/default
I0429 09:39:19.055945    4771 egressqos.go:463] Finished syncing EgressQoS default on namespace abc : 3.182828ms
E0429 09:39:19.060563    4771 egressqos.go:368] failed to update EgressQoS object abc/default with status: Apply failed with 1 conflict: conflict with "ip-10-0-62-24.us-east-2.compute.internal" with subresource "status": .status.conditions
I0429 09:39:19.072238    4771 egressqos.go:460] Processing sync for EgressQoS default/default
Version-Release number of selected component (if applicable):
4.16
How reproducible:
always
Steps to Reproduce:
1. create egressqos in ns abc
% cat egress_qos.yaml
kind: EgressQoS
apiVersion: k8s.ovn.org/v1
metadata:
name: default
namespace: abc
spec:
egress:
- dscp: 46
dstCIDR: 3.16.78.227/32
- dscp: 30
dstCIDR: 0.0.0.0/0
2. check egressqos
% oc get egressqos default -o yaml
apiVersion: k8s.ovn.org/v1
kind: EgressQoS
metadata:
  creationTimestamp: "2024-04-29T09:24:55Z"
  generation: 1
  name: default
  namespace: abc
  resourceVersion: "376134"
  uid: f9dfe380-81ee-4edd-845d-49ba2c856e81
spec:
  egress:
  - dscp: 46
    dstCIDR: 3.16.78.227/32
  - dscp: 30
    dstCIDR: 0.0.0.0/0
status: {}
3. check crd egressqos
% oc get crd egressqoses.k8s.ovn.org -o yaml apiVersion: apiextensions.k8s.io/v1 kind: CustomResourceDefinition metadata: annotations: controller-gen.kubebuilder.io/version: v0.8.0 creationTimestamp: "2024-04-29T05:23:12Z" generation: 1 name: egressqoses.k8s.ovn.org ownerReferences: - apiVersion: operator.openshift.io/v1 blockOwnerDeletion: true controller: true kind: Network name: cluster uid: 3bfac7ab-ca29-477f-a97f-27592b7e176d resourceVersion: "3642" uid: 25dabf13-611f-4c29-bf22-4a0b56e4b7f7 spec: conversion: strategy: None group: k8s.ovn.org names: kind: EgressQoS listKind: EgressQoSList plural: egressqoses singular: egressqos scope: Namespaced versions: - name: v1 schema: openAPIV3Schema: description: EgressQoS is a CRD that allows the user to define a DSCP value for pods egress traffic on its namespace to specified CIDRs. Traffic from these pods will be checked against each EgressQoSRule in the namespace's EgressQoS, and if there is a match the traffic is marked with the relevant DSCP value. properties: apiVersion: description: 'APIVersion defines the versioned schema of this representation of an object. Servers should convert recognized schemas to the latest internal value, and may reject unrecognized values. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#resources' type: string kind: description: 'Kind is a string value representing the REST resource this object represents. Servers may infer this from the endpoint the client submits requests to. Cannot be updated. In CamelCase. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#types-kinds' type: string metadata: properties: name: pattern: ^default$ type: string type: object spec: description: EgressQoSSpec defines the desired state of EgressQoS properties: egress: description: a collection of Egress QoS rule objects items: properties: dscp: description: DSCP marking value for matching pods' traffic. maximum: 63 minimum: 0 type: integer dstCIDR: description: DstCIDR specifies the destination's CIDR. Only traffic heading to this CIDR will be marked with the DSCP value. This field is optional, and in case it is not set the rule is applied to all egress traffic regardless of the destination. format: cidr type: string podSelector: description: PodSelector applies the QoS rule only to the pods in the namespace whose label matches this definition. This field is optional, and in case it is not set results in the rule being applied to all pods in the namespace. properties: matchExpressions: description: matchExpressions is a list of label selector requirements. The requirements are ANDed. items: description: A label selector requirement is a selector that contains values, a key, and an operator that relates the key and values. properties: key: description: key is the label key that the selector applies to. type: string operator: description: operator represents a key's relationship to a set of values. Valid operators are In, NotIn, Exists and DoesNotExist. type: string values: description: values is an array of string values. If the operator is In or NotIn, the values array must be non-empty. If the operator is Exists or DoesNotExist, the values array must be empty. This array is replaced during a strategic merge patch. items: type: string type: array required: - key - operator type: object type: array matchLabels: additionalProperties: type: string description: matchLabels is a map of {key,value} pairs. 
A single {key,value} in the matchLabels map is equivalent to an element of matchExpressions, whose key field is "key", the operator is "In", and the values array contains only "value". The requirements are ANDed. type: object type: object required: - dscp type: object type: array required: - egress type: object status: description: EgressQoSStatus defines the observed state of EgressQoS type: object type: object served: true storage: true subresources: status: {} status: acceptedNames: kind: EgressQoS listKind: EgressQoSList plural: egressqoses singular: egressqos conditions: - lastTransitionTime: "2024-04-29T05:23:12Z" message: no conflicts found reason: NoConflicts status: "True" type: NamesAccepted - lastTransitionTime: "2024-04-29T05:23:12Z" message: the initial names have been accepted reason: InitialNamesAccepted status: "True" type: Established storedVersions: - v1
Actual results:
egressqos status is not updated correctly
Expected results:
egressqos status should be updated once applied.
Additional info:
% oc version Client Version: 4.16.0-0.nightly-2024-04-26-145258 Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3 Server Version: 4.16.0-0.nightly-2024-04-26-145258 Kubernetes Version: v1.29.4+d1ec84a
UDP Packets are subject to SNAT in a self-managed OCP 4.13.13 cluster on Azure (OVN-K as CNI) using a Load Balancer Service with `externalTrafficPolicy: Local`. UDP Packets correctly arrive to the Node hosting the Pod but the source IP seen by the Pod is the OVN GW Router of the Node.
I've reproduced the customer scenario with the following steps:
This issue is very critical because it is blocking the customer's business.
This is a clone of issue OCPBUGS-41341. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-39126. The following is the description of the original issue:
—
Description of problem:
It is difficult to tell which component this bug belongs to; the description is the following. Today we can install Red Hat operators into one precise namespace or into all namespaces, which installs the operator in the "openshift-operators" namespace. If such an operator creates a ServiceMonitor that should be scraped by platform Prometheus, that ServiceMonitor has token authentication and security configured in its definition. But if the operator is installed in the "openshift-operators" namespace, it is user workload monitoring that tries to scrape it, because that namespace does not carry the label required for platform monitoring to scrape it, and we do not want to add that label since community operators can also be installed there. The result is that user workload monitoring scrapes this namespace and the ServiceMonitors are skipped, since they are configured with security against platform monitoring, which UWM cannot handle. A possible workaround is: oc label namespace openshift-operators openshift.io/user-monitoring=false, losing functionality since some Red Hat operators will not be monitored if installed in openshift-operators.
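A minimal sketch of applying and verifying the workaround described above; the label and namespace come from the description, the verification command is an assumption:
# Exclude openshift-operators from user workload monitoring (workaround, with the functionality loss noted above):
oc label namespace openshift-operators openshift.io/user-monitoring=false --overwrite
# Confirm the label is present on the namespace:
oc get namespace openshift-operators --show-labels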
Version-Release number of selected component (if applicable):
4.16
This is a clone of issue OCPBUGS-33803. The following is the description of the original issue:
—
The MCO currently lays down a file at /etc/mco/internal-registry-pull-secret.json, which is extracted from the machine-os-puller SA into ControllerConfig. It is then templated down to a MachineConfig. For some reason, this SA is now being refreshed every hour or so, causing a new MachineConfig to be generated every hour. This also causes CI issues as the machineconfigpools will randomly update to a new config in the middle of a test.
More context: https://redhat-internal.slack.com/archives/C02CZNQHGN8/p1715888365021729
Release controller > 4.14.2 > HyperShift conformance run > gathered assets:
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-hypershift-release-4.14-periodics-e2e-aws-ovn-conformance/1722648207965556736/artifacts/e2e-aws-ovn-conformance/dump/artifacts/namespaces/clusters-e8d2a8003773eacb6a8b/core/pods/logs/kube-apiserver-5f47c7b667-42h2f-audit-logs.log | grep -v '/var/log/kube-apiserver/audit.log. has' | jq -r 'select(.user.username == "system:admin" and .verb == "create" and .requestURI == "/apis/operator.openshift.io/v1/storages") | .userAgent' | sort | uniq -c 65 hosted-cluster-config-operator-manager $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-hypershift-release-4.14-periodics-e2e-aws-ovn-conformance/1722648207965556736/artifacts/e2e-aws-ovn-conformance/dump/artifacts/namespaces/clusters-e8d2a8003773eacb6a8b/core/pods/logs/kube-apiserver-5f47c7b667-42h2f-audit-logs.log | grep -v '/var/log/kube-apiserver/audit.log. has' | jq -r 'select(.user.username == "system:admin" and .verb == "create" and .requestURI == "/apis/operator.openshift.io/v1/storages") | .requestReceivedTimestamp + " " + (.responseStatus | (.code | tostring) + " " + .reason)' | head -n5 2023-11-09T17:17:15.130454Z 409 AlreadyExists 2023-11-09T17:17:15.163256Z 409 AlreadyExists 2023-11-09T17:17:15.198908Z 409 AlreadyExists 2023-11-09T17:17:15.230532Z 409 AlreadyExists 2023-11-09T17:17:22.899579Z 409 AlreadyExists
That's banging away pretty hard with creation attempts that keep getting 409ed, presumably because an earlier creation attempt succeeded. If the controller needs very quick latency in re-creation, perhaps an informing watch? If the controller can handle some re-creation latency, perhaps a quieter poll?
4.14.2. I haven't checked other releases.
Likely 100%. I saw similar behavior in an unrelated dump, and confirmed the busy 409s in the first CI run I checked.
1. Dump a hosted cluster.
2. Inspect its audit logs for hosted-cluster-config-operator-manager create activity.
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-hypershift-release-4.14-periodics-e2e-aws-ovn-conformance/1722648207965556736/artifacts/e2e-aws-ovn-conformance/dump/artifacts/namespaces/clusters-e8d2a8003773eacb6a8b/core/pods/logs/kube-apiserver-5f47c7b667-42h2f-audit-logs.log | grep -v '/var/log/kube-apiserver/audit.log. has' | jq -r 'select(.userAgent == "hosted-cluster-config-operator-manager" and .verb == "create") | .verb + " " + (.responseStatus.code | tostring)' | sort | uniq -c 130 create 409
Zero or rare 409 creation request from this user-agent.
The user agent seems to be defined here, so likely the fix will involve changes to that manager.
Please review the following PR: https://github.com/openshift/kubernetes-metrics-server/pull/21
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
Since the golang.org/x/oauth2 package has been upgraded, GCP installs have been failing with level=info msg=Credentials loaded from environment variable "GOOGLE_CLOUD_KEYFILE_JSON", file "/var/run/secrets/ci.openshift.io/cluster-profile/gce.json" level=error msg=failed to fetch Master Machines: failed to load asset "Install Config": failed to create install config: [platform.gcp.project: Internal error: failed to create cloud resource service: Get "http://169.254.169.254/computeMetadata/v1/universe/universe_domain": dial tcp 169.254.169.254:80: connect: connection refused, : Internal error: failed to create compute service: Get "http://169.254.169.254/computeMetadata/v1/universe/universe_domain": dial tcp 169.254.169.254:80: connect: connection refused]
Version-Release number of selected component (if applicable):
4.16/master
How reproducible:
always
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
The bump has been introduced by https://github.com/openshift/installer/pull/8020
This is a clone of issue OCPBUGS-33882. The following is the description of the original issue:
—
As of OpenShift 4.16, CRD management is more complex. This is an artifact of improvements made to feature gates and feature sets. David Eads and I agreed that, to avoid confusion, we should aim to stop having CRDs installed via operator repos, and, if their types live in o/api, install them from there instead.
We started this by moving the ControlPlaneMachineSet back to o/api, which is part of the MachineAPI capability.
Unbeknown to us at the time, the way the installer currently works is that all rendered resources get applied by the cluster-bootstrap tool (roughly here), not by the CVO.
Cluster-bootstrap is not capability aware, so it installed the CPMS CRD, which in turn broke the check in the CSR approver that stops it from crashing on MachineAPI-less clusters.
Options for moving forward include:
I'm not sure presently whether the 2nd or 3rd option is better, nor how I would expect the capabilities to become known to the "renderers"; perhaps the installer can provide them as args in bootkube.sh.template?
Original bug below, description of what's happening above
Description of problem:
After running tests on an SNO with Telco DU profile for a couple of hours kubernetes.io/kubelet-serving CSRs in Pending state start showing up and accumulating in time.
Version-Release number of selected component (if applicable):
4.16.0-rc.1
How reproducible:
once so far
Steps to Reproduce:
1. Deploy SNO with DU profile with disabled capabilities: installConfigOverrides: "{\"capabilities\":{\"baselineCapabilitySet\": \"None\", \"additionalEnabledCapabilities\": [ \"NodeTuning\", \"ImageRegistry\", \"OperatorLifecycleManager\" ] }}" 2. Leave the node running tests overnight for a couple of hours 3. Check for Pending CSRs
Actual results:
oc get csr -A | grep Pending | wc -l 27
Expected results:
No pending CSRs.
Also, oc logs will return a tls internal error:
oc -n openshift-cluster-machine-approver --insecure-skip-tls-verify-backend=true logs machine-approver-866c94c694-7dwks
Defaulted container "kube-rbac-proxy" out of: kube-rbac-proxy, machine-approver-controller
Error from server: Get "https://[2620:52:0:8e6::d0]:10250/containerLogs/openshift-cluster-machine-approver/machine-approver-866c94c694-7dwks/kube-rbac-proxy": remote error: tls: internal error
Additional info:
Checking the machine-approver-controller container logs on the node, we can see the reconciliation is failing because it cannot find the Machine API, which is disabled via the capabilities.
I0514 13:25:09.266546 1 controller.go:120] Reconciling CSR: csr-dw9c8
E0514 13:25:09.275585 1 controller.go:138] csr-dw9c8: Failed to list machines in API group machine.openshift.io/v1beta1: no matches for kind "Machine" in version "machine.openshift.io/v1beta1"
E0514 13:25:09.275665 1 controller.go:329] "Reconciler error" err="Failed to list machines: no matches for kind \"Machine\" in version \"machine.openshift.io/v1beta1\"" controller="certificatesigningrequest" controllerGroup="certificates.k8s.io" controllerKind="CertificateSigningRequest" CertificateSigningRequest="csr-dw9c8" namespace="" name="csr-dw9c8" reconcileID="6f963337-c6f1-46e7-80c4-90494d21653c"
I0514 13:25:43.792140 1 controller.go:120] Reconciling CSR: csr-jvrvt
E0514 13:25:43.798079 1 controller.go:138] csr-jvrvt: Failed to list machines in API group machine.openshift.io/v1beta1: no matches for kind "Machine" in version "machine.openshift.io/v1beta1"
E0514 13:25:43.798128 1 controller.go:329] "Reconciler error" err="Failed to list machines: no matches for kind \"Machine\" in version \"machine.openshift.io/v1beta1\"" controller="certificatesigningrequest" controllerGroup="certificates.k8s.io" controllerKind="CertificateSigningRequest" CertificateSigningRequest="csr-jvrvt" namespace="" name="csr-jvrvt" reconcileID="decbc5d9-fa10-45d1-92f1-1c999df956ff"
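As a hedged stop-gap (not part of the original report), the accumulated kubelet-serving CSRs can be approved by hand until a capability-aware fix lands; the commands below are an assumption:
# Approve every CSR currently in Pending state:
oc get csr | grep Pending | awk '{print $1}' | xargs oc adm certificate approve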
This is a clone of issue OCPBUGS-39453. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-35321. The following is the description of the original issue:
—
Description of problem:
The customer has updated their cluster from 4.11.x to 4.15.11. After updating the cluster to OpenShift 4.15.11, the value for vCenter Cluster in the vSphere connection configuration is missing. It should be observable from the GUI. -> The vCenter cluster name is not displayed in the GUI. -> We have also checked the cloud-config; everything is in place there, but some parameters are missing from the vSphere connection configuration in the OpenShift console. Please find the attached screenshot for reference.
Version-Release number of selected component (if applicable):
How reproducible:
The customer has reproduced the issue; we have yet to do so.
Steps to Reproduce:
[x] -- The customer updated their cluster from 4.11.x to 4.15.11. After the upgrade the cluster looks fine and healthy, but a parameter is missing from the vSphere connection configuration in the OpenShift console, as shown in the attached screenshot.
Expected results:
Additional info:
golangci-lint offers a "--fix" option.
Use it so that lint findings are fixed automatically where possible.
oc-mirror - maxVersion of the imageset config is ignored for operators
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. Create 2 imageset that we are using: _imageset-config-test1-1.yaml:_ ~~~ kind: ImageSetConfiguration apiVersion: mirror.openshift.io/v1alpha2 storageConfig: local: path: /local/oc-mirror/test1/metadata mirror: platform: architectures: - amd64 graph: true channels: - name: stable-4.12 type: ocp minVersion: 4.12.1 maxVersion: 4.12.1 shortestPath: true operators: - catalog: registry.redhat.io/redhat/redhat-operator-index:v4.12 packages: - name: cincinnati-operator channels: - name: v1 minVersion: 5.0.1 maxVersion: 5.0.1 ~~~ _imageset-config-test1-2.yaml:_ ~~~ kind: ImageSetConfiguration apiVersion: mirror.openshift.io/v1alpha2 storageConfig: local: path: /local/oc-mirror/test1/metadata mirror: platform: architectures: - amd64 graph: true channels: - name: stable-4.12 type: ocp minVersion: 4.12.1 maxVersion: 4.12.1 shortestPath: true operators: - catalog: registry.redhat.io/redhat/redhat-operator-index:v4.12 packages: - name: cincinnati-operator channels: - name: v1 minVersion: 5.0.1 maxVersion: 5.0.1 - name: local-storage-operator channels: - name: stable minVersion: 4.12.0-202305262042 maxVersion: 4.12.0-202305262042 - name: odf-operator channels: - name: stable-4.12 minVersion: 4.12.4-rhodf maxVersion: 4.12.4-rhodf - name: rhsso-operator channels: - name: stable minVersion: 7.6.4-opr-002 maxVersion: 7.6.4-opr-002 - catalog: registry.redhat.io/redhat/redhat-marketplace-index:v4.12 packages: - name: k10-kasten-operator-rhmp channels: - name: stable minVersion: 6.0.6 maxVersion: 6.0.6 additionalImages: - name: registry.redhat.io/rhel8/postgresql-13:1-125 ~~~ 2. Generate a first .tar file from the first imageset-config file (imageset-config-test1-1.yaml) oc mirror --config=imageset-config-test1-1.yaml file:///local/oc-mirror/test1 3. Use the first .tar file to populate our registry oc mirror --from=/root/oc-mirror/test1/mirror_seq1_000000.tar docker://registry-url/oc-mirror1 4.Generate a second .tar file from the second imageset-config file (imageset-config-test1-2.yaml) oc mirror --config=imageset-config-test1-2.yaml file:///local/oc-mirror/test1 5. Populate the private registry named `oc-mirror1` with the second .tar file: oc mirror --from=/root/oc-mirror/test1/mirror_seq2_000000.tar docker://registry-url/oc-mirror1 6. Check the catalog index for **odf** and **rhsso** operators [root@test ~]# oc-mirror list operators --package odf-operator --catalog=registry-url/oc-mirror1/redhat/redhat-operator-index:v4.12 --channel stable-4.12 VERSIONS 4.12.7-rhodf 4.12.8-rhodf 4.12.4-rhodf 4.12.5-rhodf 4.12.6-rhodf [root@test ~]# oc-mirror list operators --package rhsso-operator --catalog=registry-url/oc-mirror1/redhat/redhat-operator-index:v4.12 --channel stable VERSIONS 7.6.4-opr-002 7.6.4-opr-003 7.6.5-opr-001 7.6.5-opr-002
Actual results:
Check the catalog index for **odf** and **rhsso** operators. oc-mirror is not respecting the minVersion & maxVersion.
[root@test ~]# oc-mirror list operators --package odf-operator --catalog=registry-url/oc-mirror1/redhat/redhat-operator-index:v4.12 --channel stable-4.12
VERSIONS
4.12.7-rhodf
4.12.8-rhodf
4.12.4-rhodf
4.12.5-rhodf
4.12.6-rhodf
[root@test ~]# oc-mirror list operators --package rhsso-operator --catalog=registry-url/oc-mirror1/redhat/redhat-operator-index:v4.12 --channel stable
VERSIONS
7.6.4-opr-002
7.6.4-opr-003
7.6.5-opr-001
7.6.5-opr-002
Expected results:
oc-mirror should respect the minVersion & maxVersion:
[root@test ~]# oc-mirror list operators --package odf-operator --catalog=registry-url/oc-mirror2/redhat/redhat-operator-index:v4.12 --channel stable-4.12
VERSIONS
4.12.4-rhodf
[root@test ~]# oc-mirror list operators --package rhsso-operator --catalog=registry-url/oc-mirror2/redhat/redhat-operator-index:v4.12 --channel stable
VERSIONS
7.6.4-opr-002
Additional info:
Description of problem:
Cluster install fails on ASH, nodes tainted with node.cloudprovider.kubernetes.io/uninitialized=true:NoSchedule
Version-Release number of selected component (if applicable):
4.16.0-0.nightly-2024-01-24-133352
How reproducible:
Always
Steps to Reproduce:
1. Built a cluster on ASH
$ oc get node
NAME STATUS ROLES AGE VERSION
ropatil-261ash1-x9kcj-master-0 NotReady control-plane,master 7h v1.29.1+0e0d15b
ropatil-261ash1-x9kcj-master-1 NotReady control-plane,master 7h1m v1.29.1+0e0d15b
ropatil-261ash1-x9kcj-master-2 NotReady control-plane,master 7h1m v1.29.1+0e0d15b
$ oc get node -o yaml | grep uninitialized
key: node.cloudprovider.kubernetes.io/uninitialized
key: node.cloudprovider.kubernetes.io/uninitialized
key: node.cloudprovider.kubernetes.io/uninitialized
$ oc get po -n openshift-cloud-controller-manager
NAME READY STATUS RESTARTS AGE
azure-cloud-controller-manager-7b75cbbd64-qzhmm 0/1 CrashLoopBackOff 43 (20s ago) 4h54m
azure-cloud-controller-manager-7b75cbbd64-w5cl8 1/1 Running 70 (2m52s ago) 7h33m
azure-cloud-node-manager-9r8gb 0/1 CrashLoopBackOff 93 (79s ago) 7h33m
azure-cloud-node-manager-jn8lv 0/1 CrashLoopBackOff 93 (82s ago) 7h33m
azure-cloud-node-manager-n4vt4 0/1 CrashLoopBackOff 93 (102s ago) 7h33m
$ oc -n openshift-cloud-controller-manager logs -f azure-cloud-controller-manager-7b75cbbd64-w5cl8 -c cloud-controller-manager
Error from server: no preferred addresses found; known addresses: []
Actual results:
Cluster install failed on ASH
Expected results:
Cluster install succeed on ASH
Additional info:
log-bundle: https://drive.google.com/file/d/1QQwyQ1MxuunZx6AXqOTt6KwYwUk2GW7R/view?usp=sharing
Description of problem:
AWS HyperShift clusters' nodes cannot join cluster with custom domain name in DHCP Option Set
Version-Release number of selected component (if applicable):
Any
How reproducible:
100%
Steps to Reproduce:
1. Create a VPC for a HyperShift/ROSA HCP cluster in AWS 2. Replace the VPC's DHCP Option Set with another with a custom domain name (example.com or really any domain of your choice) 3. Attempt to install a HyperShift/ROSA HCP cluster with a nodepool
Actual results:
All EC2 instances will fail to become nodes. They will generate CSR's based on the default domain name - ec2.internal for us-east-1 or ${region}.compute.internal for other regions (e.g. us-east-2.compute.internal)
Expected results:
Either that they become nodes or that we document that custom domain names in DHCP Option Sets are not allowed with HyperShift at this time. There is currently no pressing need for this feature, though customers do use this in ROSA Classic/OCP successfully.
Additional info:
This is a known gap currently in cluster-api-provider-aws (CAPA) https://github.com/kubernetes-sigs/cluster-api-provider-aws/issues/1691
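For reference, a hedged sketch of how the custom domain name from the reproduction steps is typically attached to the VPC; the domain and resource IDs are illustrative placeholders:
# Create a DHCP option set with a custom domain name and associate it with the cluster VPC:
aws ec2 create-dhcp-options --dhcp-configurations "Key=domain-name,Values=example.com" "Key=domain-name-servers,Values=AmazonProvidedDNS"
aws ec2 associate-dhcp-options --dhcp-options-id dopt-0123456789abcdef0 --vpc-id vpc-0123456789abcdef0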
Description of problem:
https://github.com/openshift/console/blob/master/frontend/packages/console-dynamic-plugin-sdk/docs/api.md#tabledata includes a reference to `pf-c-table__action`, but v1+ of console-dynamic-plugin-sdk requires PatternFly 5, so the reference should be updated to `pf-v5-c-table__action`.
This is a clone of issue OCPBUGS-34819. The following is the description of the original issue:
—
Some AWS installs are failing to bootstrap due to an issue where CAPA may fail to create load balancer resources, but still declare that infrastructure is ready (see upstream issue for more details).
In these cases, load balancers are failing to be created due to either rate limiting:
time="2024-05-25T21:43:07Z" level=debug msg="E0525 21:43:07.975223 356 awscluster_controller.go:280] \"failed to reconcile load balancer\" err=<" time="2024-05-25T21:43:07Z" level=debug msg="\t[failed to modify target group attribute: Throttling: Rate exceeded"
or in some cases another error:
time="2024-06-01T06:43:58Z" level=debug msg="E0601 06:43:58.902534 356 awscluster_controller.go:280] \"failed to reconcile load balancer\" err=<" time="2024-06-01T06:43:58Z" level=debug msg="\t[failed to apply security groups to load balancer \"ci-op-jnqi01di-5feef-92njc-int\": ValidationError: A load balancer ARN must be specified" time="2024-06-01T06:43:58Z" level=debug msg="\t\tstatus code: 400, request id: 77446593-03d2-40e9-93c0-101590d150c6, failed to create target group for load balancer: DuplicateTargetGroupName: A target group with the same name 'apiserver-target-1717224237' exists, but with different settings"
We have an upstream PR in progress to retry the reconcile logic for load balancers.
Original component readiness report below.
=====
Component Readiness has found a potential regression in install should succeed: cluster bootstrap.
There is no significant evidence of regression
Sample (being evaluated) Release: 4.16
Start Time: 2024-05-28T00:00:00Z
End Time: 2024-06-03T23:59:59Z
Success Rate: 96.60%
Successes: 227
Failures: 8
Flakes: 0
Base (historical) Release: 4.15
Start Time: 2024-02-01T00:00:00Z
End Time: 2024-02-28T23:59:59Z
Success Rate: 99.87%
Successes: 767
Failures: 1
Flakes: 0
Description of problem:
It seems the issue is still here. Tested on 4.15.0-0.nightly-2023-12-04-223539: there is a status.message for each zone, but there is no summarized status, so moving this back to Assigned.
apbexternalroute yaml file is:
apiVersion: k8s.ovn.org/v1
kind: AdminPolicyBasedExternalRoute
metadata:
  name: default-route-policy
spec:
  from:
    namespaceSelector:
      matchLabels:
        kubernetes.io/metadata.name: test
  nextHops:
    static:
    - ip: "172.18.0.8"
    - ip: "172.18.0.9"
and Status section as below:
% oc get apbexternalroute
NAME                   LAST UPDATE   STATUS
default-route-policy   12s                    <--- still empty
% oc describe apbexternalroute default-route-policy | tail -n 10
Status:
  Last Transition Time:  2023-12-06T02:12:11Z
  Messages:
    qiowang-120620-gtt85-master-2.c.openshift-qe.internal: configured external gateway IPs: 172.18.0.8,172.18.0.9
    qiowang-120620-gtt85-master-0.c.openshift-qe.internal: configured external gateway IPs: 172.18.0.8,172.18.0.9
    qiowang-120620-gtt85-worker-a-55fzx.c.openshift-qe.internal: configured external gateway IPs: 172.18.0.8,172.18.0.9
    qiowang-120620-gtt85-master-1.c.openshift-qe.internal: configured external gateway IPs: 172.18.0.8,172.18.0.9
    qiowang-120620-gtt85-worker-b-m98ms.c.openshift-qe.internal: configured external gateway IPs: 172.18.0.8,172.18.0.9
    qiowang-120620-gtt85-worker-c-vtl8q.c.openshift-qe.internal: configured external gateway IPs: 172.18.0.8,172.18.0.9
Events:  <none>
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1.
2.
3.
Actual results:
Expected results:
Additional info:
Please fill in the following template while reporting a bug and provide as much relevant information as possible. Doing so will give us the best chance to find a prompt resolution.
Affected Platforms:
Is it an
If it is an internal RedHat testing failure:
If it is a CI failure:
If it is a customer / SD issue:
Description of problem:
Version-Release number of selected component (if applicable):
4.16
How reproducible:
Always
Steps to Reproduce:
1.Creates CredentialsRequest including the spec.providerSpec.stsIAMRoleARN string. 2.Cloud Credential Operator could not populate Secret based on CredentialsRequest. $ oc get secret -A | grep test-mihuang #Secret not found. $ oc get CredentialsRequest -n openshift-cloud-credential-operator NAME AGE ... test-mihuang 44s 3.
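A minimal hedged sketch of the CredentialsRequest from step 1; the statement entries, role ARN, and secret target are illustrative placeholders rather than the exact resource used in the test:
oc apply -f - <<'EOF'
apiVersion: cloudcredential.openshift.io/v1
kind: CredentialsRequest
metadata:
  name: test-mihuang
  namespace: openshift-cloud-credential-operator
spec:
  providerSpec:
    apiVersion: cloudcredential.openshift.io/v1
    kind: AWSProviderSpec
    statementEntries:
    - action:
      - s3:GetObject
      effect: Allow
      resource: '*'
    stsIAMRoleARN: arn:aws:iam::123456789012:role/example-role
  secretRef:
    name: test-mihuang
    namespace: default
  serviceAccountNames:
  - default
EOF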
Actual results:
Secret not create successfully.
Expected results:
Successfully created the secret on the hosted cluster.
Additional info:
This is a clone of issue OCPBUGS-43921. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-43898. The following is the description of the original issue:
—
Description of problem:
OCP 4.17 requires permissions to tag network interfaces (ENIs) on instance creation in support of the Egress IP feature. ROSA HCP uses managed IAM policies, which are reviewed and gated by AWS. The current policy AWS has applied does not allow us to tag ENIs out of band, only ones that have 'red-hat-managed: true`, which are going to be tagged during instance creation. However, in order to support backwards compatibility for existing clusters, we need to roll out a CAPA patch that allows us to call `RunInstances` with or without the ability to tag ENIs. Once we backport this to the Z streams, upgrade clusters and rollout the updated policy with AWS, we can then go back and revert the backport. For more information see https://issues.redhat.com/browse/SDE-4496
Version-Release number of selected component (if applicable):
4.17
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. Run OLM on 4.15 cluster 2. 3.
Actual results:
OLM pod will panic
Expected results:
Should run just fine
Additional info:
This issue is due to a failure to initialize a new map when it is nil.
Please review the following PR: https://github.com/openshift/image-customization-controller/pull/116
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
Node Overview Pane not displaying
Version-Release number of selected component (if applicable):
How reproducible:
In the openshift console, under Compute > Node > Node Details > the Overview tab does not display
Steps to Reproduce:
In the openshift console, under Compute > Node > Node Details > the Overview tab does not display
Actual results:
Overview tab does not display
Expected results:
Overview tab should display
Additional info:
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
‘Oh no! Something went wrong’ will be shown when the user goes to the MultiClusterEngine details -> YAML tab
Version-Release number of selected component (if applicable):
4.14.0-0.nightly-2023-07-20-215234
How reproducible:
Always
Steps to Reproduce:
1. Install 'multicluster engine for Kubernetes' operator in the cluster 2. Use the default value to create a new MultiClusterEngine 3. Navigate to the MultiClusterEngine details -> Yaml Tab
Actual results:
‘Oh no! Something went wrong.’ error will be shown with the details below:
TypeError
Description: Cannot read properties of null (reading 'editor')
Expected results:
no error
Additional info:
This bug fix is in conjunction with https://issues.redhat.com/browse/OCPBUGS-22778
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
The storage team added CSI and ephemeral volumes in 4.12 and 4.13, but the affected SCCs are not being reconciled, leaving these capabilities out of reach of the intended end users.
Version-Release number of selected component (if applicable):
4.13+
How reproducible:
100%
Steps to Reproduce:
1. Check any of the "anyuid", "hostaccess", "hostmount-anyuid", "hostnetwork", "nonroot", or "restricted" SCCs on a cluster upgraded from 4.11 (see the check sketch below).
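A hedged sketch of that check from the CLI; the loop and jsonpath are assumptions, not from the report:
# Print the allowed volume types for the affected SCCs; on a cluster upgraded
# from 4.11, "csi" and "ephemeral" are expected to be missing from the output.
for scc in anyuid hostaccess hostmount-anyuid hostnetwork nonroot restricted; do
  echo "== ${scc}"
  oc get scc "${scc}" -o jsonpath='{.volumes}{"\n"}'
done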
Actual results:
no "csi" and "ephemeral" in .volumes
Expected results:
"csi" and "ephemeral" in .volumes
Additional info:
This is a clone of issue OCPBUGS-29777. The following is the description of the original issue:
—
Description of problem:
The RWOP access mode is a tech preview feature starting from OCP 4.14 and GA in 4.16, but in the OCP console UI there is no option available for creating a PVC with the RWOP access mode.
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. Login to OCP console in Administrator mode (4.14/4.15/4.16) 2. Go to 'Storage -> PersistentVolumeClaim -> Click on Create PersistentVolumeClaim' 3. Check under 'Access Mode*', RWOP option is not present
Actual results:
RWOP accessMode option is not present
Expected results:
RWOP accessMode option is present
Additional info:
Storage feature: https://issues.redhat.com/browse/STOR-1171
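Although the console offers no RWOP option, such a PVC can still be created from the CLI; a minimal hedged sketch (name, size, and storage class are placeholders):
oc apply -f - <<'EOF'
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: rwop-example
spec:
  accessModes:
  - ReadWriteOncePod
  volumeMode: Filesystem
  resources:
    requests:
      storage: 1Gi
  storageClassName: gp3-csi
EOF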
Description of problem:
The apiserver-url.env file is a dependency of all CCM components. These mostly run on the masters; however, on Azure they also run on workers. A recent change in kube (https://github.com/kubernetes/kubernetes/pull/121028) fixes a previous bug, with the result that workers no longer bootstrap, since Kubelet no longer sets an IP address. To resolve this issue, we need the CNM to be able to talk to KAS outside of the CNI; this already works on masters, but the url env file is missing on workers so they get stuck.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Please review the following PR: https://github.com/openshift/cluster-api-provider-alibaba/pull/48
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/whereabouts-cni/pull/223
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
The design doc for ImageDigestMirrorSet states: "ImageContentSourcePolicy CRD will be marked as deprecated and will be supported during all of 4.x. Update and coexistence of ImageDigestMirrorSet/ ImageTagMirrorSet and ImageContentSourcePolicy is supported. We encourage users to move to IDMS while supporting both in the cluster, but will not remove ICSP in OCP 4.x.". see: https://github.com/openshift/machine-config-operator/blob/master/docs/ImageMirrorSetDesign.md#goals see also: https://github.com/openshift/enhancements/blob/master/enhancements/api-review/add-new-CRD-ImageDigestMirrorSet-and-ImageTagMirrorSet-to-config.openshift.io.md#update-the-implementation-for-migration-path for the rationale behind it. but the hypershift-operator is reading ImageContentSourcePolicy only if no ImageDigestMirrorSet exists on the cluster, see: https://github.com/openshift/hypershift/blob/main/support/globalconfig/imagecontentsource.go#L101-L102
Version-Release number of selected component (if applicable):
4.14, 4.15, 4.16
How reproducible:
100%
Steps to Reproduce:
1. Set both an ImageContentSourcePolicy and ImageDigestMirrorSet with different content on the management cluster 2. 3.
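A hedged sketch of step 1; the mirror registries and sources are placeholders, the point is only that the two objects carry different content:
oc apply -f - <<'EOF'
apiVersion: operator.openshift.io/v1alpha1
kind: ImageContentSourcePolicy
metadata:
  name: example-icsp
spec:
  repositoryDigestMirrors:
  - mirrors:
    - mirror-a.example.com/ocp
    source: quay.io/openshift-release-dev/ocp-v4.0-art-dev
---
apiVersion: config.openshift.io/v1
kind: ImageDigestMirrorSet
metadata:
  name: example-idms
spec:
  imageDigestMirrors:
  - mirrors:
    - mirror-b.example.com/ocp
    source: quay.io/openshift-release-dev/ocp-release
EOF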
Actual results:
the hypershift-operator consumes only the ImageDigestMirrorSet content ignoring the ImageContentSourcePolicy one.
Expected results:
since both ImageDigestMirrorSet and ImageContentSourcePolicy (although deprecated) are still supported on the management cluster, the hypershift-operator should align.
Additional info:
currently oc-mirror (v1) is only generating imageContentSourcePolicy.yaml without any imageDigestMirrorSet.yaml equivalent breaking the hypershift disconnected scenario on clusters where an IDMS is already there for other reasons.
In three clusters, I am receiving the alert:
"Multiple default storage classes are marked as default. The storage class is chosen for the PVC is depended on version of the cluster.
Starting with OpenShift 4.13, a persistent volume claim (PVC) requesting the default storage class gets the most recently created default storage class if multiple default storage classes exist."
But the alert clearly shows only one default SC:
"Red Hat recommends to set only one storage class as the default one.
Current default storage classes:
ocs-external-storagecluster-ceph-rbd"
This is confirmed with 'oc get sc'
NAME PROVISIONER RECLAIMPOLICY VOLUMEBINDINGMODE ALLOWVOLUMEEXPANSION AGE
ocs-external-storagecluster-ceph-rbd (default) openshift-storage.rbd.csi.ceph.com Delete Immediate true 351d
ocs-external-storagecluster-ceph-rbd-windows openshift-storage.rbd.csi.ceph.com Delete Immediate true 11d
ocs-external-storagecluster-cephfs openshift-storage.cephfs.csi.ceph.com Delete Immediate true 351d
openshift-storage.noobaa.io openshift-storage.noobaa.io/obc Delete Immediate false 351d
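A hedged way to double-check which storage classes actually carry the default annotation the alert keys on; the jsonpath is an assumption:
# List every StorageClass together with its is-default-class annotation value:
oc get sc -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.metadata.annotations.storageclass\.kubernetes\.io/is-default-class}{"\n"}{end}'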
Description of problem:
ResourceYAMLEditor has no create option. This means that it can be used only for editing objects.
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. Use ResourceYAMLEditor in a different page from the details page 2. 3.
Actual results:
Only a 'Save' button and no samples.
Expected results:
Be able to create the object.
See the samples in the sidebar.
See a 'Create' button instead of 'Save'.
Additional info:
Description of problem:
The go docs in the install-config's platform.aws.lbType is misleading as well as on the ingress object (oc explain ingresses.config.openshift.io.spec.loadBalancer.platform.aws.type). Both say: "When this field is specified, the default ingresscontroller will be created using the specified load-balancer type." That is true, but what is missing is that ALL ingresscontrollers will be created using the specified load-balancer type by default (not just the default ingresscontroller). This missing information can be confusing to users.
Version-Release number of selected component (if applicable):
4.12+
How reproducible:
100%
Steps to Reproduce:
openshift-install explain installconfig.platform.aws.lbType - or - oc explain ingresses.config.openshift.io.spec.loadBalancer.platform.aws.type
Actual results:
./openshift-install explain installconfig.platform.aws.lbType
KIND: InstallConfig
VERSION: v1
RESOURCE: <string>
LBType is an optional field to specify a load balancer type. When this field is specified, the default ingresscontroller will be created using the specified load-balancer type. ... [same with ingress.spec.loadBalancer.platform.aws.type]
Expected results:
My suggestion:
./openshift-install explain installconfig.platform.aws.lbType
KIND: InstallConfig
VERSION: v1
RESOURCE: <string>
LBType is an optional field to specify a load balancer type. When this field is specified, all ingresscontrollers (including the default ingresscontroller) will be created using the specified load-balancer type by default. ... [same with ingress.spec.loadBalancer.platform.aws.type]
Additional info:
Since the change should be the same thing for both the installconfig and ingress object, this bug would handle both.
This is just a minor typo, but since it's in an Info message that will appear on every installation, it should be fixed.
time="2024-05-08T17:30:57-04:00" level=info msg="Waiting up to 15m0s (until 5:45PM EDT) for network infrastructure to become ready..."
time="2024-05-08T17:33:09-04:00" level=info msg="Netork infrastructure is ready" <==
Please review the following PR: https://github.com/openshift/apiserver-network-proxy/pull/47
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of the problem:
The ClusterImageSetRef specified in the SiteConfig CR on the hub cluster mismatches the actual image pulled from quay when trying to install 4.12.x managed SNO clusters. This behavior has not been observed when installing SNO 4.14.x.
How reproducible:
Install 4.12.19 from ACM using clusterImageSetNameRef: "img4.12.19-x86-64-appsub"
Actual results:
Image pulled is actually img4.12.19-multi-x86-64-appsub
# oc adm release info 4.12.19 | more
Name: 4.12.19
Digest: sha256:4db028028aae12cb82b784ae54aaca6888f224b46565406f1a89831a67f42030
Created: 2023-05-24T06:58:32Z
OS/Arch: linux/amd64
Manifests: 647
Metadata files: 1
Pull From: quay.io/openshift-release-dev/ocp-release@sha256:4db028028aae12cb82b784ae54aaca6888f224b46565406f1a89831a67f42030
Metadata:
release.openshift.io/architecture: multi
url: https://access.redhat.com/errata/RHSA-2023:3287
Expected results:
Image pulled is img4.12.19-x86-64-appsub
# oc adm release info 4.12.19 | more
Name: 4.12.19
Digest: sha256:41fd42cc8b9f86fc86cc8763dcf27e976299ff632a336d393b8e643bd8a5f967
Pull From: quay.io/openshift-release-dev/ocp-release@sha256:41fd42cc8b9f86fc86cc8763dcf27e976299ff632a336d393b8e643bd8a5f967
Metadata:
url: https://access.redhat.com/errata/RHSA-2023:3287
Description of problem:
When deploying with a service ID, the installer is unable to query resource groups.
Version-Release number of selected component (if applicable):
4.13-4.16
How reproducible:
Easily
Steps to Reproduce:
1. Create a service ID with seemingly enough permissions to do an IPI install 2. Deploy to power vs with IPI 3. Fail
Actual results:
Fail to deploy a cluster with service ID
Expected results:
cluster create should succeed
Additional info:
This is a clone of issue OCPBUGS-37102. The following is the description of the original issue:
—
Description of problem:
Enabling KMS for IBM Cloud will result in the kube-apiserver failing with the following configuration error: 17:45:45 E0711 17:43:00.264407 1 run.go:74] "command failed" err="error while parsing file: resources[0].providers[0]: Invalid value: config.ProviderConfiguration{AESGCM:(*config.AESConfiguration)(nil), AESCBC:(*config.AESConfiguration)(nil), Secretbox:(*config.SecretboxConfiguration)(nil), Identity:(*config.IdentityConfiguration)(0x89b4c60), KMS:(*config.KMSConfiguration)(0xc000ff1900)}: more than one provider specified in a single element, should split into different list elements"
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
This is a clone of issue OCPBUGS-33383. The following is the description of the original issue:
—
Description of problem:
Admission webhook warning on creation of Route - violates policy 299 - unknown field "metadata.defaultAnnotations"
Admission webhook warning on creation of BuildConfig - violates policy 299 - unknown field "spec.source.git.type"
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. Navigate to Import from git form and create a deployment 2. See the `Admission webhook warning` toast notification
Actual results:
Admission webhook warning - violates policy 299 - unknown field "metadata.defaultAnnotations" shows up on creation of the Route, and Admission webhook warning - violates policy 299 - unknown field "spec.source.git.type" shows up on creation of the BuildConfig.
Expected results:
No Admission webhook warning should show
Additional info:
Description of problem:
Copying BZ: https://bugzilla.redhat.com/show_bug.cgi?id=2250911 on the OCP side (as the fix is needed in the console). [UI] In the openshift-storage-client namespace, an 'RWX' access mode RBD PVC with volume mode 'Filesystem' can be created from the client. However, this is an invalid combination for RBD PVC creation. In the ODF operator UI on other platforms, the volume mode selection is not available when the Ceph RBD storage class and RWX access mode are selected, yet it is visible in the client operator view. The attempt creates the PVC, which gets stuck in the Pending state.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. Deploy a Provider/Client setup. 2. From the UI create a PVC: select storage class ceph-rbd and the RWX access mode, then check the volume mode; in the case of this bug both 'Filesystem' and 'Block' volume modes are visible in the UI. Select volume mode Filesystem and create the PVC (a hedged CLI equivalent follows below).
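For reference, a hedged CLI equivalent of the PVC from step 2; the name and size are placeholders, while the namespace and storage class come from the description:
oc apply -n openshift-storage-client -f - <<'EOF'
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: rwx-rbd-fs-example
spec:
  accessModes:
  - ReadWriteMany
  volumeMode: Filesystem
  resources:
    requests:
      storage: 1Gi
  storageClassName: ocs-storagecluster-ceph-rbd
EOF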
Actual results:
PVC Created and stuck in pending status. PVC event shows error like: Generated from openshift-storage-client.rbd.csi.ceph.com_csi-rbdplugin-provisioner-6d9dcb9fc7-vjj22_2bd4ede5-9418-4c8e-80ae-169b5cb4fa8012 times in the last 13 minutes failed to provision volume with StorageClass "ocs-storagecluster-ceph-rbd": rpc error: code = InvalidArgument desc = multi node access modes are only supported on rbd `block` type volumes
Expected results:
Volumemode should not be visible on page when PVC with RWX access mode and RBD storage class is selected.
Additional info:
Screenshots are attached to the BZ: https://bugzilla.redhat.com/show_bug.cgi?id=2250911 https://bugzilla.redhat.com/show_bug.cgi?id=2250911#c3
Description of problem:
Due to recent changes in tuned (https://github.com/openshift/cluster-node-tuning-operator/pull/1045/), the profiles directory was moved from /etc/tuned/<openshift-node-performance-performance-profile-name> to /var/lib/ocp-tuned/profiles/<openshift-node-performance-performance-profile-name>.
Version-Release number of selected component (if applicable):
4.16.0
How reproducible:
everytime
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
If the authentication.config/cluster Type=="" but the OAuth/User APIs are already missing, the console-operator won't update the authentication.config/cluster status with its own client as it's crashing on being unable to retrieve OAuthClients.
Version-Release number of selected component (if applicable):
4.15.0
How reproducible:
100%
Steps to Reproduce:
1. scale oauth-apiserver to 0 2. set featuregates to TechPreviewNotUpgradable 3. watch the authentication.config/cluster .status.oidcClients
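A hedged CLI sketch of those steps; the deployment name and the exact feature-set spelling (TechPreviewNoUpgrade) are assumptions based on current defaults:
# 1. Scale the oauth-apiserver down:
oc scale deployment/apiserver -n openshift-oauth-apiserver --replicas=0
# 2. Enable the tech-preview feature set (this cannot be undone on a real cluster):
oc patch featuregate cluster --type merge -p '{"spec":{"featureSet":"TechPreviewNoUpgrade"}}'
# 3. Watch the OIDC clients list in the authentication status:
oc get authentication.config.openshift.io cluster -o jsonpath='{.status.oidcClients}{"\n"}' -w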
Actual results:
The client for the console does not appear.
Expected results:
The client for the console should appear.
Additional info:
Description of problem:
Debug into one of the worker nodes on the hosted cluster:
oc debug node/ip-10-1-0-97.ca-central-1.compute.internal
nslookup kubernetes.default.svc.cluster.local
Server: 10.1.0.2
Address: 10.1.0.2#53
** server can't find kubernetes.default.svc.cluster.local: NXDOMAIN
curl -k https://172.30.0.1:443/readyz
curl: (7) Failed to connect to 172.30.0.1 port 443: Connection refused
sh-5.1# curl -k https://172.20.0.1:443/readyz
ok
Version-Release number of selected component (if applicable):
4.15.20
Steps to Reproduce:
Unknown
Actual results:
Pods on a hosted cluster's workers unable to connect to their internal kube apiserver via the service IP.
Expected results:
Pods on a hosted cluster's workers have connectivity to their kube apiserver via the service IP.
Additional info:
Checked the "Konnectivity server" logs on Dynatrace and found the error below occurs repeatedly
E0724 01:02:00.223151 1 server.go:895] "DIAL_RSP contains failure" err="dial tcp 172.30.176.80:8443: i/o timeout" dialID=8375732890105363305 agentID="1eab211f-6ea1-46ea-bc78-14d75d6ba325"
E0724 01:02:00.223482 1 tunnel.go:150] "Received failure on connection" err="read tcp 10.128.17.15:8090->10.128.82.107:52462: use of closed network connection"
Relevant OHSS Ticket: https://issues.redhat.com/browse/OHSS-36053
We shut down the bootstrap node before the control plane hosts are provisioned:
Apr 24 17:30:05 localhost.localdomain master-bmh-update.sh[10498]: openshift-machine-api openshift-4 true 8m24s
Apr 24 17:30:25 localhost.localdomain master-bmh-update.sh[4461]: Waiting for 2 masters to become provisioned
Apr 24 17:30:25 localhost.localdomain master-bmh-update.sh[10602]: NAMESPACE NAME STATE CONSUMER ONLINE ERROR AGE
Apr 24 17:30:25 localhost.localdomain master-bmh-update.sh[10602]: openshift-machine-api openshift-0 provisioning cluster4-59zbh-master-0 true 8m46s
Apr 24 17:30:25 localhost.localdomain master-bmh-update.sh[10602]: openshift-machine-api openshift-1 provisioning cluster4-59zbh-master-1 true 8m45s
Apr 24 17:30:25 localhost.localdomain master-bmh-update.sh[10602]: openshift-machine-api openshift-2 provisioning cluster4-59zbh-master-2 true 8m45s
Apr 24 17:30:25 localhost.localdomain master-bmh-update.sh[10602]: openshift-machine-api openshift-3 true 8m44s
Apr 24 17:30:25 localhost.localdomain master-bmh-update.sh[10602]: openshift-machine-api openshift-4 true 8m44s
Apr 24 17:30:45 localhost.localdomain master-bmh-update.sh[4461]: Stopping provisioning services...
Apr 24 17:30:45 localhost.localdomain master-bmh-update.sh[10708]: deactivating
Apr 24 17:30:45 localhost.localdomain master-bmh-update.sh[4461]: Unpause all baremetal hosts
Apr 24 17:30:45 localhost.localdomain master-bmh-update.sh[10724]: baremetalhost.metal3.io/openshift-0 annotated
Apr 24 17:30:45 localhost.localdomain master-bmh-update.sh[10724]: baremetalhost.metal3.io/openshift-1 annotated
Apr 24 17:30:45 localhost.localdomain master-bmh-update.sh[10724]: baremetalhost.metal3.io/openshift-2 annotated
Apr 24 17:30:45 localhost.localdomain master-bmh-update.sh[10724]: baremetalhost.metal3.io/openshift-3 annotated
Apr 24 17:30:45 localhost.localdomain master-bmh-update.sh[10724]: baremetalhost.metal3.io/openshift-4 annotated
Apr 24 17:30:45 localhost.localdomain systemd[1]: Finished Update master BareMetalHosts with introspection data.
Description of problem:
PipelineRun logs page navigation is broken on navigate through the task on the PiplineRun log tab.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. Navigate to PipelineRuns details page and select the Logs tab. 2. Navigate through the tasks of the PipelineRun tasks
Actual results:
- Details tab gets active on selection of any task
- Logs page gets empty on selection of the Logs tab again
- Last task is not selected for completed PipelineRuns
Expected results:
- Logs tab should remain active while the user is on the Logs tab
- Last task should be selected in the case of completed PipelineRuns
Additional info:
It is a regression after a change in the tab-selection logic in the HorizontalNav component.
Video- https://drive.google.com/file/d/15fx9GWO2dRh4uaibRmZ4VTk4HFxQ7NId/view?usp=sharing
Description of problem:
When adding parameters to a pipeline, there is an error when trying to save. It seems a resource[] section is added; this doesn't happen when using YAML resources and the oc client. Discussed with Vikram Raj.
Version-Release number of selected component (if applicable):
4.14.12
How reproducible:
Always
Steps to Reproduce:
1.Create a pipeline 2.Add a parameter 3.Save the pipeline
Actual results:
Error shown
Expected results:
Save successful
Additional info:
This ticket is to satisfy the bot on GH
This is a clone of issue OCPBUGS-34689. The following is the description of the original issue:
—
Description of problem:
Customer is running Openshift on AHV and their Tenable Security Scan reported the following vulnerability on the Nutanix Cloud Controller Manager Deployment. https://www.tenable.com/plugins/nessus/42873 on port 10258 SSL Medium Strength Cipher Suites Supported (SWEET32) The Nutanix Cloud Controller Manager deployment runs two pods and exposes port 10258 to the outside world. sh-4.4# netstat -ltnp|grep -w '10258' tcp6 0 0 :::10258 :::* LISTEN 10176/nutanix-cloud sh-4.4# ps aux|grep 10176 root 10176 0.0 0.2 1297832 59764 ? Ssl Feb15 4:40 /bin/nutanix-cloud-controller-manager --v=3 --cloud-provider=nutanix --cloud-config=/etc/cloud/nutanix_config.json --controllers=* --configure-cloud-routes=false --cluster-name=trulabs-8qmx4 --use-service-account-credentials=true --leader-elect=true --leader-elect-lease-duration=137s --leader-elect-renew-deadline=107s --leader-elect-retry-period=26s --leader-elect-resource-namespace=openshift-cloud-controller-manager root 1403663 0.0 0.0 9216 1100 pts/0 S+ 14:17 0:00 grep 10176 [centos@provisioner-trulabs-0-230518-065321 ~]$ oc get pods -A -o wide | grep nutanix openshift-cloud-controller-manager nutanix-cloud-controller-manager-5c4cdbb9c-jnv7c 1/1 Running 0 4d18h 172.17.0.249 trulabs-8qmx4-master-1 <none> <none> openshift-cloud-controller-manager nutanix-cloud-controller-manager-5c4cdbb9c-vtrz5 1/1 Running 0 4d18h 172.17.0.121 trulabs-8qmx4-master-0 <none> <none> [centos@provisioner-trulabs-0-230518-065321 ~]$ oc describe pod -n openshift-cloud-controller-manager nutanix-cloud-controller-manager-5c4cdbb9c-jnv7c Name: nutanix-cloud-controller-manager-5c4cdbb9c-jnv7c Namespace: openshift-cloud-controller-manager Priority: 2000000000 Priority Class Name: system-cluster-critical Service Account: cloud-controller-manager Node: trulabs-8qmx4-master-1/172.17.0.249 Start Time: Thu, 15 Feb 2024 19:24:52 +0000 Labels: infrastructure.openshift.io/cloud-controller-manager=Nutanix k8s-app=nutanix-cloud-controller-manager pod-template-hash=5c4cdbb9c Annotations: operator.openshift.io/config-hash: b3e08acdcd983115fe7a2b94df296362b20c35db781c8eec572fbe24c3a7c6aa Status: Running IP: 172.17.0.249 IPs: IP: 172.17.0.249 Controlled By: ReplicaSet/nutanix-cloud-controller-manager-5c4cdbb9c Containers: cloud-controller-manager: Container ID: cri-o://f5c0f39e1907093c9359aa2ac364c5bcd591918b06103f7955b30d350c730a8a Image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:7f3e7b600d94d1ba0be1edb328ae2e32393acba819742ac3be5e6979a3dcbf4c Image ID: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:7f3e7b600d94d1ba0be1edb328ae2e32393acba819742ac3be5e6979a3dcbf4c Port: 10258/TCP Host Port: 10258/TCP Command: /bin/bash -c #!/bin/bash set -o allexport if [[ -f /etc/kubernetes/apiserver-url.env ]]; then source /etc/kubernetes/apiserver-url.env fi exec /bin/nutanix-cloud-controller-manager \ --v=3 \ --cloud-provider=nutanix \ --cloud-config=/etc/cloud/nutanix_config.json \ --controllers=* \ --configure-cloud-routes=false \ --cluster-name=$(OCP_INFRASTRUCTURE_NAME) \ --use-service-account-credentials=true \ --leader-elect=true \ --leader-elect-lease-duration=137s \ --leader-elect-renew-deadline=107s \ --leader-elect-retry-period=26s \ --leader-elect-resource-namespace=openshift-cloud-controller-manager State: Running Started: Thu, 15 Feb 2024 19:24:56 +0000 Ready: True Restart Count: 0 Requests: cpu: 200m memory: 128Mi Environment: OCP_INFRASTRUCTURE_NAME: trulabs-8qmx4 NUTANIX_SECRET_NAMESPACE: openshift-cloud-controller-manager NUTANIX_SECRET_NAME: nutanix-credentials POD_NAMESPACE: 
openshift-cloud-controller-manager (v1:metadata.namespace) Mounts: /etc/cloud from nutanix-config (ro) /etc/kubernetes from host-etc-kube (ro) /etc/pki/ca-trust/extracted/pem from trusted-ca (ro) /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-4ht28 (ro) Conditions: Type Status Initialized True Ready True ContainersReady True PodScheduled True Volumes: nutanix-config: Type: ConfigMap (a volume populated by a ConfigMap) Name: cloud-conf Optional: false trusted-ca: Type: ConfigMap (a volume populated by a ConfigMap) Name: ccm-trusted-ca Optional: false host-etc-kube: Type: HostPath (bare host directory volume) Path: /etc/kubernetes HostPathType: Directory kube-api-access-4ht28: Type: Projected (a volume that contains injected data from multiple sources) TokenExpirationSeconds: 3607 ConfigMapName: kube-root-ca.crt ConfigMapOptional: <nil> DownwardAPI: true ConfigMapName: openshift-service-ca.crt ConfigMapOptional: <nil> QoS Class: Burstable Node-Selectors: node-role.kubernetes.io/master= Tolerations: node-role.kubernetes.io/master:NoSchedule op=Exists node.cloudprovider.kubernetes.io/uninitialized:NoSchedule op=Exists node.kubernetes.io/memory-pressure:NoSchedule op=Exists node.kubernetes.io/not-ready:NoExecute op=Exists for 120s node.kubernetes.io/not-ready:NoSchedule op=Exists node.kubernetes.io/unreachable:NoExecute op=Exists for 120s Events: <none> Medium Strength Ciphers (> 64-bit and < 112-bit key, or 3DES) Name Code KEX Auth Encryption MAC ---------------------- ---------- --- ---- --------------------- --- ECDHE-RSA-DES-CBC3-SHA 0xC0, 0x12 ECDH RSA 3DES-CBC(168) SHA1 DES-CBC3-SHA 0x00, 0x0A RSA RSA 3DES-CBC(168) SHA1 The fields above are : {Tenable ciphername} {Cipher ID code} Kex={key exchange} Auth={authentication} Encrypt={symmetric encryption method} MAC={message authentication code} {export flag} [centos@provisioner-trulabs-0-230518-065321 ~]$ curl -v telnet://172.17.0.2:10258 * About to connect() to 172.17.0.2 port 10258 (#0) * Trying 172.17.0.2... * Connected to 172.17.0.2 (172.17.0.2) port 10258 (#0)
Version-Release number of selected component (if applicable):
How reproducible:
The Nutanix CCM pod running in the OCP cluster does not set the "--tls-cipher-suites" option.
Steps to Reproduce:
Create an OCP Nutanix cluster.
Actual results:
Running the CLI command below returns nothing. $ oc describe pod -n openshift-cloud-controller-manager nutanix-cloud-controller-manager-... | grep "\--tls-cipher-suites"
Expected results:
Expect the Nutanix CCM deployment to set the proper "--tls-cipher-suites" option.
Additional info:
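A minimal sketch of how the flag could be appended to the CCM container arguments; the helper name and the cipher list are illustrative assumptions, not the cloud-controller-manager operator's actual code, and a real fix would derive the ciphers from the cluster's TLS security profile.

package main

import (
	"fmt"
	"strings"
)

// appendTLSCipherSuites adds a --tls-cipher-suites flag to the CCM container
// arguments so the secure port (10258) stops offering 3DES (SWEET32) ciphers.
// The cipher list below is illustrative only.
func appendTLSCipherSuites(args []string) []string {
	ciphers := []string{
		"TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256",
		"TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384",
		"TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256",
		"TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384",
	}
	return append(args, "--tls-cipher-suites="+strings.Join(ciphers, ","))
}

func main() {
	args := []string{"--cloud-provider=nutanix", "--leader-elect=true"}
	fmt.Println(strings.Join(appendTLSCipherSuites(args), " "))
}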
Description of problem:
Unable to run oc commands in FIPS enable OCP cluster on PowerVS
Version-Release number of selected component (if applicable):
4.15.0-ec2
How reproducible:
Deploy OCP cluster with FIPS enabled
Steps to Reproduce:
1. Enable the var in var.tfvars - fips_compliant = true 2. Deploy the cluster 3. run oc commands
Actual results:
[root@rdr-swap-fips-syd05-bastion-0 ~]# oc version FIPS mode is enabled, but the required OpenSSL library is not available [root@rdr-swap-fips-syd05-bastion-0 ~]# oc debug node/syd05-master-0.rdr-swap-fips.ibm.com FIPS mode is enabled, but the required OpenSSL library is not available [root@rdr-swap-fips-syd05-bastion-0 ~]# fips-mode-setup --check FIPS mode is enabled.
Expected results:
# oc debug node/syd05-master-0.rdr-swap-fips1.ibm.com Temporary namespace openshift-debug-dns7d is created for debugging node... Starting pod/syd05-master-0rdr-swap-fips1ibmcom-debug-hs4dr ... To use host binaries, run `chroot /host` Pod IP: 193.168.200.9
Additional info:
Not able to collect must-gather logs due to the same issue. Links: https://access.redhat.com/solutions/7034387
Starting with this 4.16-ci payload, we see these failures (examples shown below). It happens on AWS, Azure, and GCP:
4.16.0-0.ci-2024-03-16-025152 Rejected 38 hours ago 03-16T02:51:52Z https://amd64.ocp.releases.ci.openshift.org/releasestream/4.16.0-0.ci/release/4.16.0-0.ci-2024-03-16-025152
aggregated-aws-ovn-upgrade-4.16-minor Failed
Failed: suite=[openshift-tests], [sig-auth] all workloads in ns/openshift-must-gather-smq72 must set the 'openshift.io/required-scc' annotation
aggregated-azure-sdn-upgrade-4.16-minor Failed
Failed: suite=[openshift-tests], [sig-auth] all workloads in ns/openshift-must-gather-494qg must set the 'openshift.io/required-scc' annotation
This looks like the culprit: https://github.com/openshift/origin/pull/28589 ; revert = https://github.com/openshift/origin/pull/28659.
Description of problem:
During the destroy cluster operation, unexpected results from the IBM Cloud API calls for Disks (missing responses or missing response data) can cause panics, leading to failures during destroy.
Version-Release number of selected component (if applicable):
4.15
How reproducible:
Unknown, dependent on IBM Cloud API responses
Steps to Reproduce:
1. Successfully create IPI cluster on IBM Cloud 2. Attempt to cleanup (destroy) the cluster
Actual results:
Golang panic attempting to parse a HTTP response that is missing or lacking data. level=info msg=Deleted instance "ci-op-97fkzvv2-e6ed7-5n5zg-master-0" E0918 18:03:44.787843 33 runtime.go:79] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference) goroutine 228 [running]: k8s.io/apimachinery/pkg/util/runtime.logPanic({0x6a3d760?, 0x274b5790}) /go/src/github.com/openshift/installer/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:75 +0x99 k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0xfffffffe?}) /go/src/github.com/openshift/installer/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:49 +0x75 panic({0x6a3d760, 0x274b5790}) /usr/lib/golang/src/runtime/panic.go:884 +0x213 github.com/openshift/installer/pkg/destroy/ibmcloud.(*ClusterUninstaller).waitForDiskDeletion.func1() /go/src/github.com/openshift/installer/pkg/destroy/ibmcloud/disk.go:84 +0x12a github.com/openshift/installer/pkg/destroy/ibmcloud.(*ClusterUninstaller).Retry(0xc000791ce0, 0xc000573700) /go/src/github.com/openshift/installer/pkg/destroy/ibmcloud/ibmcloud.go:99 +0x73 github.com/openshift/installer/pkg/destroy/ibmcloud.(*ClusterUninstaller).waitForDiskDeletion(0xc000791ce0, {{0xc00160c060, 0x29}, {0xc00160c090, 0x28}, {0xc0016141f4, 0x9}, {0x82b9f0d, 0x4}, {0xc00160c060, ...}}) /go/src/github.com/openshift/installer/pkg/destroy/ibmcloud/disk.go:78 +0x14f github.com/openshift/installer/pkg/destroy/ibmcloud.(*ClusterUninstaller).destroyDisks(0xc000791ce0) /go/src/github.com/openshift/installer/pkg/destroy/ibmcloud/disk.go:118 +0x485 github.com/openshift/installer/pkg/destroy/ibmcloud.(*ClusterUninstaller).executeStageFunction.func1() /go/src/github.com/openshift/installer/pkg/destroy/ibmcloud/ibmcloud.go:201 +0x3f k8s.io/apimachinery/pkg/util/wait.ConditionFunc.WithContext.func1({0x7f7801e503c8, 0x18}) /go/src/github.com/openshift/installer/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:109 +0x1b k8s.io/apimachinery/pkg/util/wait.runConditionWithCrashProtectionWithContext({0x227a2f78?, 0xc00013c000?}, 0xc000a9b690?) /go/src/github.com/openshift/installer/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:154 +0x57 k8s.io/apimachinery/pkg/util/wait.poll({0x227a2f78, 0xc00013c000}, 0xd0?, 0x146fea5?, 0x7f7801e503c8?) /go/src/github.com/openshift/installer/vendor/k8s.io/apimachinery/pkg/util/wait/poll.go:245 +0x38 k8s.io/apimachinery/pkg/util/wait.PollImmediateInfiniteWithContext({0x227a2f78, 0xc00013c000}, 0x4136e7?, 0x28?) /go/src/github.com/openshift/installer/vendor/k8s.io/apimachinery/pkg/util/wait/poll.go:229 +0x49 k8s.io/apimachinery/pkg/util/wait.PollImmediateInfinite(0x100000000000000?, 0x806f00?) /go/src/github.com/openshift/installer/vendor/k8s.io/apimachinery/pkg/util/wait/poll.go:214 +0x46 github.com/openshift/installer/pkg/destroy/ibmcloud.(*ClusterUninstaller).executeStageFunction(0xc000791ce0, {{0x82bb9a3?, 0xc000a9b7d0?}, 0xc000111de0?}, 0x840366?, 0xc00054e900?) /go/src/github.com/openshift/installer/pkg/destroy/ibmcloud/ibmcloud.go:198 +0x108 created by github.com/openshift/installer/pkg/destroy/ibmcloud.(*ClusterUninstaller).destroyCluster /go/src/github.com/openshift/installer/pkg/destroy/ibmcloud/ibmcloud.go:172 +0xa87 panic: runtime error: invalid memory address or nil pointer dereference [recovered] panic: runtime error: invalid memory address or nil pointer dereference
Expected results:
Destroy IBM Cloud Disks during cluster destroy, or provide a useful error message to follow up on.
Additional info:
The ability to reproduce is relatively low, as it requires the IBM Cloud APIs to return specific data (or the lack thereof); it is currently unknown why the HTTP response and/or data is missing. IBM Cloud already has a PR that attempts to mitigate this issue, as was done with other destroy resource calls. Potentially follow up for additional resources as necessary. https://github.com/openshift/installer/pull/7515
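A minimal sketch of the kind of defensive check that avoids this class of panic; the types and function are illustrative stand-ins rather than the installer's actual IBM Cloud destroy code.

package main

import (
	"errors"
	"fmt"
	"net/http"
)

// disk models the piece of the SDK response the destroy code dereferences.
type disk struct {
	Status *string
}

// diskStatus guards every dereference so a missing response, missing body, or
// missing field surfaces an error the Retry loop can report instead of a panic.
func diskStatus(d *disk, resp *http.Response, err error) (string, error) {
	if err != nil {
		return "", err
	}
	if resp == nil {
		return "", errors.New("no HTTP response returned for disk lookup")
	}
	if resp.StatusCode == http.StatusNotFound {
		return "deleted", nil
	}
	if d == nil || d.Status == nil {
		return "", fmt.Errorf("disk lookup returned status %d with no disk data", resp.StatusCode)
	}
	return *d.Status, nil
}

func main() {
	// Missing disk data now becomes an error, not a nil pointer dereference.
	status, err := diskStatus(nil, &http.Response{StatusCode: 200}, nil)
	fmt.Println(status, err)
}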
This is a clone of issue OCPBUGS-35037. The following is the description of the original issue:
—
Description of problem:
Contrary to terraform, we do not delete the S3 bucket used for ignition during bootstrapping.
Version-Release number of selected component (if applicable):
4.16+
How reproducible:
always
Steps to Reproduce:
1. Deploy cluster 2. Check that openshift-bootstrap-data-$infraID bucket exists and is empty. 3.
Actual results:
Empty bucket left.
Expected results:
Bucket is deleted.
Additional info:
This is a clone of issue OCPBUGS-42386. The following is the description of the original issue:
—
Description of problem:
Usually, providing a cluster with an unaccepted update, such as an unsigned payload without force, results in ReleaseAccepted=False and Progressing=False. However, after scaling the CVO deployment down and up again, Progressing=True is observed, causing oc adm upgrade as well as oc adm upgrade status to display incorrect information, and the ClusterVersion object to display empty capabilities and a history item with version "".
Version-Release number of selected component (if applicable):
4.16.0-rc.4 but observed as well as early as 4.10.67
How reproducible:
100%
Steps to Reproduce:
1. target the cluster at unsigned build without using force ❯ oc adm upgrade --allow-explicit-upgrade --to-image registry.ci.openshift.org/ocp/release@sha256:36cfa8cebb86ded6e1d51c308d31eb7b2c2e7705a0df6f698c690b6fba8b7e7a 2. scale cvo down and up again ❯ oc scale --replicas 0 -n openshift-cluster-version deployments/cluster-version-operator deployment.apps/cluster-version-operator scaled ❯ oc scale --replicas 1 -n openshift-cluster-version deployments/cluster-version-operator deployment.apps/cluster-version-operator scaled
Actual results:
oc adm upgrade displays "info: An upgrade is in progress. Working towards..."
There is also a warning: "Architecture has not been configured".
❯ oc adm upgrade info: An upgrade is in progress. Working towards registry.ci.openshift.org/ocp/release@sha256:36cfa8cebb86ded6e1d51c308d31eb7b2c2e7705a0df6f698c690b6fba8b7e7a ReleaseAccepted=False Reason: RetrievePayload Message: Retrieving payload failed version="" image="registry.ci.openshift.org/ocp/release@sha256:36cfa8cebb86ded6e1d51c308d31eb7b2c2e7705a0df6f698c690b6fba8b7e7a" failure=The update cannot be verified: unable to verify sha256:36cfa8cebb86ded6e1d51c308d31eb7b2c2e7705a0df6f698c690b6fba8b7e7a against keyrings: verifier-public-key-redhat Upstream is unset, so the cluster will use an appropriate default. Channel: stable-4.16 warning: Cannot display available updates: Reason: NoArchitecture Message: Architecture has not been configured.
clusterversion object have Progressing True, "capabilities: {}" as well as a partial history item with version ""
❯ oc get clusterversion version -oyaml apiVersion: config.openshift.io/v1 kind: ClusterVersion metadata: creationTimestamp: "2024-06-10T11:36:51Z" generation: 3 name: version resourceVersion: "70199" uid: 9c80848b-9f3a-4f0d-8472-a2ccce1c4023 spec: channel: stable-4.16 clusterID: e74054ac-e0fe-4cf7-a457-4887ba96cff9 desiredUpdate: architecture: "" force: false image: registry.ci.openshift.org/ocp/release@sha256:36cfa8cebb86ded6e1d51c308d31eb7b2c2e7705a0df6f698c690b6fba8b7e7a version: "" status: availableUpdates: null capabilities: {} conditions: - lastTransitionTime: "2024-06-10T11:37:17Z" message: Architecture has not been configured. reason: NoArchitecture status: "False" type: RetrievedUpdates - lastTransitionTime: "2024-06-10T11:37:17Z" message: Capabilities match configured spec reason: AsExpected status: "False" type: ImplicitlyEnabledCapabilities - lastTransitionTime: "2024-06-10T14:06:42Z" message: 'Retrieving payload failed version="" image="registry.ci.openshift.org/ocp/release@sha256:36cfa8cebb86ded6e1d51c308d31eb7b2c2e7705a0df6f698c690b6fba8b7e7a" failure=The update cannot be verified: unable to verify sha256:36cfa8cebb86ded6e1d51c308d31eb7b2c2e7705a0df6f698c690b6fba8b7e7a against keyrings: verifier-public-key-redhat' reason: RetrievePayload status: "False" type: ReleaseAccepted - lastTransitionTime: "2024-06-10T12:06:31Z" message: Done applying 4.16.0-rc.4 status: "True" type: Available - lastTransitionTime: "2024-06-10T12:06:31Z" status: "False" type: Failing - lastTransitionTime: "2024-06-10T14:07:30Z" message: Working towards registry.ci.openshift.org/ocp/release@sha256:36cfa8cebb86ded6e1d51c308d31eb7b2c2e7705a0df6f698c690b6fba8b7e7a status: "True" type: Progressing desired: image: registry.ci.openshift.org/ocp/release@sha256:36cfa8cebb86ded6e1d51c308d31eb7b2c2e7705a0df6f698c690b6fba8b7e7a version: "" history: - completionTime: null image: registry.ci.openshift.org/ocp/release@sha256:36cfa8cebb86ded6e1d51c308d31eb7b2c2e7705a0df6f698c690b6fba8b7e7a startedTime: "2024-06-10T14:07:30Z" state: Partial verified: false version: "" - completionTime: "2024-06-10T12:06:31Z" image: quay.io/openshift-release-dev/ocp-release@sha256:6c236c400d3bad9b2b54d8a3b247c508f6f13511d37666de1eecca8e43bce0f6 startedTime: "2024-06-10T11:37:17Z" state: Completed verified: false version: 4.16.0-rc.4 observedGeneration: 3 versionHash: AjnKTa_3kbg=
In oc adm upgrade status, the control plane shows Progressing toward an empty target with Completion 0%:
= Control Plane = Assessment: Progressing Target Version: (from 4.16.0-rc.4) Completion: 0% Duration: 2m26.971091165s Operator Status: 33 Healthy
Expected results:
The ClusterVersion stays the same as before the scale toggle:
apiVersion: config.openshift.io/v1 kind: ClusterVersion metadata: creationTimestamp: "2024-06-10T11:36:51Z" generation: 3 name: version resourceVersion: "69881" uid: 9c80848b-9f3a-4f0d-8472-a2ccce1c4023 spec: channel: stable-4.16 clusterID: e74054ac-e0fe-4cf7-a457-4887ba96cff9 desiredUpdate: architecture: "" force: false image: registry.ci.openshift.org/ocp/release@sha256:36cfa8cebb86ded6e1d51c308d31eb7b2c2e7705a0df6f698c690b6fba8b7e7a version: "" status: availableUpdates: null capabilities: enabledCapabilities: - Build - CSISnapshot - CloudControllerManager - CloudCredential - Console - DeploymentConfig - ImageRegistry - Ingress - Insights - MachineAPI - NodeTuning - OperatorLifecycleManager - Storage - baremetal - marketplace - openshift-samples knownCapabilities: - Build - CSISnapshot - CloudControllerManager - CloudCredential - Console - DeploymentConfig - ImageRegistry - Ingress - Insights - MachineAPI - NodeTuning - OperatorLifecycleManager - Storage - baremetal - marketplace - openshift-samples conditions: - lastTransitionTime: "2024-06-10T11:37:17Z" message: 'Unable to retrieve available updates: currently reconciling cluster version 4.16.0-rc.4 not found in the "stable-4.16" channel' reason: VersionNotFound status: "False" type: RetrievedUpdates - lastTransitionTime: "2024-06-10T11:37:17Z" message: Capabilities match configured spec reason: AsExpected status: "False" type: ImplicitlyEnabledCapabilities - lastTransitionTime: "2024-06-10T14:06:42Z" message: 'Retrieving payload failed version="" image="registry.ci.openshift.org/ocp/release@sha256:36cfa8cebb86ded6e1d51c308d31eb7b2c2e7705a0df6f698c690b6fba8b7e7a" failure=The update cannot be verified: unable to verify sha256:36cfa8cebb86ded6e1d51c308d31eb7b2c2e7705a0df6f698c690b6fba8b7e7a against keyrings: verifier-public-key-redhat' reason: RetrievePayload status: "False" type: ReleaseAccepted - lastTransitionTime: "2024-06-10T12:06:31Z" message: Done applying 4.16.0-rc.4 status: "True" type: Available - lastTransitionTime: "2024-06-10T12:06:31Z" status: "False" type: Failing - lastTransitionTime: "2024-06-10T12:06:31Z" message: Cluster version is 4.16.0-rc.4 status: "False" type: Progressing desired: image: quay.io/openshift-release-dev/ocp-release@sha256:6c236c400d3bad9b2b54d8a3b247c508f6f13511d37666de1eecca8e43bce0f6 url: https://access.redhat.com/errata/RHEA-2024:0041 version: 4.16.0-rc.4 history: - completionTime: "2024-06-10T12:06:31Z" image: quay.io/openshift-release-dev/ocp-release@sha256:6c236c400d3bad9b2b54d8a3b247c508f6f13511d37666de1eecca8e43bce0f6 startedTime: "2024-06-10T11:37:17Z" state: Completed verified: false version: 4.16.0-rc.4 observedGeneration: 2 versionHash: AjnKTa_3kbg=
No "upgrade is in progress" message for a release that is not accepted:
❯ oc adm upgrade Cluster version is 4.16.0-rc.4 ReleaseAccepted=False Reason: RetrievePayload Message: Retrieving payload failed version="" image="registry.ci.openshift.org/ocp/release@sha256:36cfa8cebb86ded6e1d51c308d31eb7b2c2e7705a0df6f698c690b6fba8b7e7a" failure=The update cannot be verified: unable to verify sha256:36cfa8cebb86ded6e1d51c308d31eb7b2c2e7705a0df6f698c690b6fba8b7e7a against keyrings: verifier-public-key-redhat Upstream is unset, so the cluster will use an appropriate default. Channel: stable-4.16 warning: Cannot display available updates: Reason: VersionNotFound Message: Unable to retrieve available updates: currently reconciling cluster version 4.16.0-rc.4 not found in the "stable-4.16" channel
Additional info:
It is possible to kick the cluster out of this state by applying --clear, which causes the cluster to briefly progress into its original version, followed by 3 items appearing in the history.
❯ oc adm upgrade --clear Cleared the update field, still at registry.ci.openshift.org/ocp/release@sha256:36cfa8cebb86ded6e1d51c308d31eb7b2c2e7705a0df6f698c690b6fba8b7e7a ❯ oc adm upgrade info: An upgrade is in progress. Working towards 4.16.0-rc.4: 116 of 894 done (12% complete) Upstream is unset, so the cluster will use an appropriate default. Channel: stable-4.16 warning: Cannot display available updates: Reason: VersionNotFound Message: Unable to retrieve available updates: currently reconciling cluster version 4.16.0-rc.4 not found in the "stable-4.16" channel
❯ oc get clusterversion version -oyaml apiVersion: config.openshift.io/v1 kind: ClusterVersion metadata: creationTimestamp: "2024-06-10T11:36:51Z" generation: 4 name: version resourceVersion: "72594" uid: 9c80848b-9f3a-4f0d-8472-a2ccce1c4023 spec: channel: stable-4.16 clusterID: e74054ac-e0fe-4cf7-a457-4887ba96cff9 status: availableUpdates: null capabilities: enabledCapabilities: - Build - CSISnapshot - CloudControllerManager - CloudCredential - Console - DeploymentConfig - ImageRegistry - Ingress - Insights - MachineAPI - NodeTuning - OperatorLifecycleManager - Storage - baremetal - marketplace - openshift-samples knownCapabilities: - Build - CSISnapshot - CloudControllerManager - CloudCredential - Console - DeploymentConfig - ImageRegistry - Ingress - Insights - MachineAPI - NodeTuning - OperatorLifecycleManager - Storage - baremetal - marketplace - openshift-samples conditions: - lastTransitionTime: "2024-06-10T11:37:17Z" message: 'Unable to retrieve available updates: currently reconciling cluster version 4.16.0-rc.4 not found in the "stable-4.16" channel' reason: VersionNotFound status: "False" type: RetrievedUpdates - lastTransitionTime: "2024-06-10T11:37:17Z" message: Capabilities match configured spec reason: AsExpected status: "False" type: ImplicitlyEnabledCapabilities - lastTransitionTime: "2024-06-10T14:13:07Z" message: Payload loaded version="4.16.0-rc.4" image="quay.io/openshift-release-dev/ocp-release@sha256:6c236c400d3bad9b2b54d8a3b247c508f6f13511d37666de1eecca8e43bce0f6" architecture="amd64" reason: PayloadLoaded status: "True" type: ReleaseAccepted - lastTransitionTime: "2024-06-10T12:06:31Z" message: Done applying 4.16.0-rc.4 status: "True" type: Available - lastTransitionTime: "2024-06-10T12:06:31Z" status: "False" type: Failing - lastTransitionTime: "2024-06-10T14:14:00Z" message: Cluster version is 4.16.0-rc.4 status: "False" type: Progressing desired: image: quay.io/openshift-release-dev/ocp-release@sha256:6c236c400d3bad9b2b54d8a3b247c508f6f13511d37666de1eecca8e43bce0f6 url: https://access.redhat.com/errata/RHEA-2024:0041 version: 4.16.0-rc.4 history: - completionTime: "2024-06-10T14:14:00Z" image: quay.io/openshift-release-dev/ocp-release@sha256:6c236c400d3bad9b2b54d8a3b247c508f6f13511d37666de1eecca8e43bce0f6 startedTime: "2024-06-10T14:13:07Z" state: Completed verified: false version: 4.16.0-rc.4 - completionTime: "2024-06-10T14:13:07Z" image: registry.ci.openshift.org/ocp/release@sha256:36cfa8cebb86ded6e1d51c308d31eb7b2c2e7705a0df6f698c690b6fba8b7e7a startedTime: "2024-06-10T14:07:30Z" state: Partial verified: false version: "" - completionTime: "2024-06-10T12:06:31Z" image: quay.io/openshift-release-dev/ocp-release@sha256:6c236c400d3bad9b2b54d8a3b247c508f6f13511d37666de1eecca8e43bce0f6 startedTime: "2024-06-10T11:37:17Z" state: Completed verified: false version: 4.16.0-rc.4 observedGeneration: 4 versionHash: AjnKTa_3kbg=
Also, trying to apply a rollback in this state results in an invalid SemVer error:
❯ OC_ENABLE_CMD_UPGRADE_ROLLBACK=true oc adm upgrade rollback error: previous version "" invalid SemVer: Version string empty
This is a clone of issue OCPBUGS-33644. The following is the description of the original issue:
—
Description of problem:
After running tests on an SNO with the Telco DU profile for a couple of hours, kubernetes.io/kubelet-serving CSRs in Pending state start showing up and accumulating over time.
Version-Release number of selected component (if applicable):
4.16.0-rc.1
How reproducible:
once so far
Steps to Reproduce:
1. Deploy SNO with DU profile with disabled capabilities: installConfigOverrides: "{\"capabilities\":{\"baselineCapabilitySet\": \"None\", \"additionalEnabledCapabilities\": [ \"NodeTuning\", \"ImageRegistry\", \"OperatorLifecycleManager\" ] }}" 2. Leave the node running tests overnight for a couple of hours 3. Check for Pending CSRs
Actual results:
oc get csr -A | grep Pending | wc -l 27
Expected results:
No pending CSRs Also oc logs will return a tls internal error: oc -n openshift-cluster-machine-approver --insecure-skip-tls-verify-backend=true logs machine-approver-866c94c694-7dwks Defaulted container "kube-rbac-proxy" out of: kube-rbac-proxy, machine-approver-controller Error from server: Get "https://[2620:52:0:8e6::d0]:10250/containerLogs/openshift-cluster-machine-approver/machine-approver-866c94c694-7dwks/kube-rbac-proxy": remote error: tls: internal error
Additional info:
Checking the machine-approver-controller container logs on the node, we can see the reconciliation is failing because it cannot find the Machine API, which is disabled in the capabilities. I0514 13:25:09.266546 1 controller.go:120] Reconciling CSR: csr-dw9c8 E0514 13:25:09.275585 1 controller.go:138] csr-dw9c8: Failed to list machines in API group machine.openshift.io/v1beta1: no matches for kind "Machine" in version "machine.openshift.io/v1beta1" E0514 13:25:09.275665 1 controller.go:329] "Reconciler error" err="Failed to list machines: no matches for kind \"Machine\" in version \"machine.openshift.io/v1beta1\"" controller="certificatesigningrequest" controllerGroup="certificates.k8s.io" controllerKind="CertificateSigningRequest" CertificateSigningRequest="csr-dw9c8" namespace="" name="csr-dw9c8" reconcileID="6f963337-c6f1-46e7-80c4-90494d21653c" I0514 13:25:43.792140 1 controller.go:120] Reconciling CSR: csr-jvrvt E0514 13:25:43.798079 1 controller.go:138] csr-jvrvt: Failed to list machines in API group machine.openshift.io/v1beta1: no matches for kind "Machine" in version "machine.openshift.io/v1beta1" E0514 13:25:43.798128 1 controller.go:329] "Reconciler error" err="Failed to list machines: no matches for kind \"Machine\" in version \"machine.openshift.io/v1beta1\"" controller="certificatesigningrequest" controllerGroup="certificates.k8s.io" controllerKind="CertificateSigningRequest" CertificateSigningRequest="csr-jvrvt" namespace="" name="csr-jvrvt" reconcileID="decbc5d9-fa10-45d1-92f1-1c999df956ff"
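A minimal sketch of one way the approver could tolerate a disabled MachineAPI capability instead of failing the reconcile; this illustrates the idea only and is not the cluster-machine-approver's actual code.

package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/api/meta"
)

// listMachinesOrSkip treats "no matches for kind Machine" as an empty machine
// list so kubelet-serving CSRs can still be evaluated through other checks on
// clusters where the MachineAPI capability is disabled.
func listMachinesOrSkip(list func() ([]string, error)) ([]string, error) {
	machines, err := list()
	if meta.IsNoMatchError(err) {
		// Machine API CRDs are not installed; fall back to node-based verification.
		return nil, nil
	}
	return machines, err
}

func main() {
	machines, err := listMachinesOrSkip(func() ([]string, error) {
		return nil, &meta.NoKindMatchError{}
	})
	fmt.Println(machines, err)
}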
This is a clone of issue OCPBUGS-34077. The following is the description of the original issue:
—
In OCP 4.16.0, the default role bindings for image puller, image pusher, and deployer are created, even if the respective capabilities are disabled on the cluster.
This is a clone of issue OCPBUGS-37713. The following is the description of the original issue:
—
Under heavy load(?) crictl can fail and return errors which iptables-alerter does not handle correctly, and as a result, it may accidentally end up checking for iptables rules in hostNetwork pods, and then logging events about it.
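The real iptables-alerter logic is not reproduced here; the following Go sketch only illustrates the error-handling idea being described, with an assumed helper: fail the check for that pod when crictl itself errors, instead of treating empty or partial output as a valid answer.

package main

import (
	"fmt"
	"os/exec"
	"strings"
)

// containerPIDs asks crictl about a pod sandbox and fails hard when crictl
// errors, so the alerter skips the pod this round rather than falling through
// to checking iptables rules for hostNetwork pods and logging bogus events.
func containerPIDs(podID string) ([]string, error) {
	out, err := exec.Command("crictl", "inspectp", podID).Output()
	if err != nil {
		return nil, fmt.Errorf("crictl inspectp %s failed, skipping pod this round: %w", podID, err)
	}
	fields := strings.Fields(string(out))
	if len(fields) == 0 {
		return nil, fmt.Errorf("crictl inspectp %s returned no data", podID)
	}
	return fields, nil
}

func main() {
	pids, err := containerPIDs("example-pod-id")
	fmt.Println(pids, err)
}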
In a CI run of etcd-operator-e2e I've found the following panic in the operator logs:
E0125 11:04:58.158222 1 health.go:135] health check for member (ip-10-0-85-12.us-west-2.compute.internal) failed: err(context deadline exceeded) panic: send on closed channel goroutine 15608 [running]: github.com/openshift/cluster-etcd-operator/pkg/etcdcli.getMemberHealth.func1() github.com/openshift/cluster-etcd-operator/pkg/etcdcli/health.go:58 +0xd2 created by github.com/openshift/cluster-etcd-operator/pkg/etcdcli.getMemberHealth github.com/openshift/cluster-etcd-operator/pkg/etcdcli/health.go:54 +0x2a5
which unfortunately is an incomplete log file. The operator recovered itself by restarting, but we should fix the panic nonetheless.
Job run for reference:
https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_cluster-etcd-operator/1186/pull-ci-openshift-cluster-etcd-operator-master-e2e-operator/1750466468031500288
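The panic is the classic "send on closed channel" race in fan-out code. Below is a minimal sketch of the standard pattern that avoids it (buffer the channel to the number of senders and only close after all senders finish); it illustrates the pattern, not the cluster-etcd-operator's exact fix.

package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

type healthCheck struct {
	Member  string
	Healthy bool
}

// getMemberHealth fans out one goroutine per member. The channel is buffered
// to the number of senders and is only closed after wg.Wait(), so a slow
// health check that outlives the collector can never send on a closed channel.
func getMemberHealth(ctx context.Context, members []string) []healthCheck {
	hch := make(chan healthCheck, len(members))
	var wg sync.WaitGroup
	for _, m := range members {
		wg.Add(1)
		go func(member string) {
			defer wg.Done()
			cctx, cancel := context.WithTimeout(ctx, 2*time.Second)
			defer cancel()
			hch <- healthCheck{Member: member, Healthy: probe(cctx, member)}
		}(m)
	}
	wg.Wait()
	close(hch)

	var out []healthCheck
	for h := range hch {
		out = append(out, h)
	}
	return out
}

// probe stands in for an etcd client health call bounded by ctx.
func probe(ctx context.Context, member string) bool {
	_ = ctx
	return member != ""
}

func main() {
	fmt.Println(getMemberHealth(context.Background(), []string{"m1", "m2"}))
}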
Description of problem:
- Observed that after upgrading to 4.13.30 (from 4.13.24), on all nodes/projects (replicated on two clusters that underwent the same upgrade), traffic routed from host-networked pods (router-default) calling to backends intermittently times out / fails to reach its destination.
apiVersion: networking.k8s.io/v1 kind: NetworkPolicy metadata: name: allow-from-openshift-ingress namespace: testing spec: ingress: - from: - namespaceSelector: matchLabels: policy-group.network.openshift.io/ingress: "" podSelector: {} policyTypes: - Ingress
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. Upgrade cluster to 4.13.30
2. Apply test pod running basic HTTP instance at random port
3. Apply networkpolicy to allow-from-ingress and begin curl loop against target pod directly from ingressnode (or other worker node) at host chroot level (nodeIP).
4. Observe that curls time out intermittently --> the reproducer curl loop is below (note the inclusion of the --connect-timeout flag to allow the loop to continue more rapidly without waiting for the full 2m connect timeout on a typical SYN failure).
$ while true; do curl --connect-timeout 5 --noproxy '*' -k -w "dnslookup: %{time_namelookup} | connect: %{time_connect} | appconnect: %{time_appconnect} | pretransfer: %{time_pretransfer} | starttransfer: %{time_starttransfer} | total: %{time_total} | size: %{size_download} | response: %{response_code}\n" -o /dev/null -s https://<POD>:<PORT>; done
Actual results:
- Traffic to all backends is dropped/degraded as a result of this intermittent failure marking valid/healthy pods as unavailable due to the connection failure to the backends.
Expected results:
- Traffic should not be impeded, especially when the NetworkPolicy allowing said traffic has been applied.
Additional info:
– additional required template details in first comment below.
RCA UPDATE:
So the problem is that the host-network namespace is not labeled by the ingress controller, and if router pods are host-networked, a network policy with the `policy-group.network.openshift.io/ingress: ""` selector won't allow incoming connections. To reproduce, we need to run the ingress controller with `EndpointPublishingStrategy=HostNetwork` https://docs.openshift.com/container-platform/4.14/networking/nw-ingress-controller-endpoint-publishing-strategies.html and then check the host-network namespace labels with
oc get ns openshift-host-network --show-labels
# expected this
kubernetes.io/metadata.name=openshift-host-network,network.openshift.io/policy-group=ingress,policy-group.network.openshift.io/host-network=,policy-group.network.openshift.io/ingress=
# but before the fix you will see
kubernetes.io/metadata.name=openshift-host-network,policy-group.network.openshift.io/host-network=
Another way to verify this is the same problem (disruptive, only recommended for test environments) is to make CNO unmanaged
oc scale deployment cluster-version-operator -n openshift-cluster-version --replicas=0 oc scale deployment network-operator -n openshift-network-operator --replicas=0
and then label openshift-host-network namespace manually based on expected labels ^ and see if the problem disappears
Potentially affected versions (may need to reproduce to confirm)
4.16.0, 4.15.0, 4.14.0 since https://issues.redhat.com//browse/OCPBUGS-8070
4.13.30 https://issues.redhat.com/browse/OCPBUGS-22293
4.12.48 https://issues.redhat.com/browse/OCPBUGS-24039
Mitigation/support KCS:
https://access.redhat.com/solutions/7055050
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. Try to deploy in mad02 or mad04 with powervs 2. Cannot import boot image 3. fail
Actual results:
Fail
Expected results:
Cluster comes up
Additional info:
If the user specifies baselineCapabilitySet: None in the install-config and does not specifically enable the capability baremetal, yet still uses platform: baremetal, then the install will reliably fail.
This failure takes the form of a timeout with the bootkube logs (not easily accessible to the user) full of errors like:
bootkube.sh[46065]: "99_baremetal-provisioning-config.yaml": unable to get REST mapping for "99_baremetal-provisioning-config.yaml": no matches for kind "Provisioning" in version "metal3.io/v1alpha1" bootkube.sh[46065]: "99_openshift-cluster-api_hosts-0.yaml": unable to get REST mapping for "99_openshift-cluster-api_hosts-0.yaml": no matches for kind "BareMetalHost" in version "metal3.io/v1alpha1"
Since the installer can tell when processing the install-config if the baremetal capability is missing, we should detect this and error out immediately to save the user an hour of their life and us a support case.
Although this was found on an agent install, I believe the same will apply to a baremetal IPI install.
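A minimal sketch of the kind of install-config validation being proposed; the types and field names are simplified stand-ins for the installer's actual configuration types.

package main

import (
	"errors"
	"fmt"
)

// installConfig is a simplified stand-in for the installer's InstallConfig.
type installConfig struct {
	PlatformName       string
	BaselineCapability string
	AdditionalEnabled  []string
}

// validateBaremetalCapability fails fast at install-config time instead of
// letting bootkube time out an hour later on missing metal3.io CRDs.
func validateBaremetalCapability(ic installConfig) error {
	if ic.PlatformName != "baremetal" {
		return nil
	}
	if ic.BaselineCapability != "None" {
		return nil
	}
	for _, c := range ic.AdditionalEnabled {
		if c == "baremetal" {
			return nil
		}
	}
	return errors.New("platform baremetal requires the baremetal capability: add it to additionalEnabledCapabilities or choose a baselineCapabilitySet that includes it")
}

func main() {
	err := validateBaremetalCapability(installConfig{PlatformName: "baremetal", BaselineCapability: "None"})
	fmt.Println(err)
}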
Description of problem:
An error occurs while creating manifests: ./openshift-install create manifests --dir openshift-config FATAL failed to fetch Master Machines: failed to generate asset "Master Machines": failed to create master machine objects: failed to create provider: unexpected end of JSON input Using the document below: https://docs.openshift.com/container-platform/4.14/installing/installing_gcp/installing-gcp-vpc.html#installation-gcp-config-yaml_installing-gcp-vpc
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
This is a clone of issue OCPBUGS-34200. The following is the description of the original issue:
—
Description of problem:
The value box in the ConfigMap Form view is no longer resizable. It is resizable as expected in OCP version 4.14.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
OCP Console -> Administrator -> Workloads -> ConfigMaps -> Create ConfigMap -> Form view -> value
Actual results:
The value box is not resizable anymore in 4.15 OpenShift clusters.
Expected results:
The value box should be resizable.
Additional info:
Please review the following PR: https://github.com/openshift/machine-api-provider-powervs/pull/67
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
This fix contains the following changes coming from updated version of kubernetes up to v1.29.6:
Changelog:
v1.29.6: https://github.com/kubernetes/kubernetes/blob/release-1.29/CHANGELOG/CHANGELOG-1.29.md#changelog-since-v1295
Description of problem:
A high number of kube-proxy rules is observed on an OCP cluster installed with version 4.15.19 and the OpenShiftSDN network plugin. The quantity of redundant rules increases continuously and seems mostly related to the rules from the openshift-ingress namespace. In the following example, a cluster node has 157k redundant rules related to the KUBE-MARK-MASQ rule:
$ less iptables-nat_rules.txt | grep openshift-ingress/router-nodeport-<svc-name> | wc -l
157761 <-----
0 0 KUBE-MARK-MASQ all -- !tun0 * 0.0.0.0/0 0.0.0.0/0 /* masquerade traffic for openshift-ingress/router-nodeport-<svc-name>:http external destinations */
NodePort services seem to be more affected by the issue.
After a node reboot, the quantity of rules drops; however, some hours later, the issue reoccurs.
Version-Release number of selected component (if applicable): OCP 4.15.19
How reproducible: Not easily
Actual results: Affected nodes are firing alerts NodeProxyApplySlow and ClusterProxyApplySlow
Expected results: The cluster shouldn't create such a high quantity of redundant rules
Additional info:
Affected Platforms: RHOCP
Description of problem:
ovnkube-master-b5dwz 5/6 CrashLoopBackOff 15 (4m49s ago) 75m ovnkube-master-dm6g5 5/6 CrashLoopBackOff 15 (3m50s ago) 72m ovnkube-master-lzltc 5/6 CrashLoopBackOff 16 (31s ago) 76m
Relevant logs :
1 ovnkube.go:369] failed to start network controller manager: failed to start default network controller: failed to sync address sets on controller init: failed to transact address set sync ops: error in transact with ops [{Op:insert Table:Address_Set Row:map[addresses:{GoSet:[172.21.4.58 172.30.113.119 172.30.113.93 172.30.140.204 172.30.184.23 172.30.20.1 172.30.244.26 172.30.250.254 172.30.29.56 172.30.39.131 172.30.54.87 172.30.54.93 172.30.70.9]} external_ids:{GoMap:map[direction:ingress gress-index:0 ip-family:v4 ...]} log:false match:ip4.src == {$a10011776377603330168, $a10015887742824209439, $a10026019104056290237, $a10029515256826812638, $a5952808452902781817, $a10084011578527782670, $a10086197949337628055, $a10093706521660045086, $a10096260576467608457, $a13012332091214445736, $a10111277808835218114, $a10114713358929465663, $a101155018460287381, $a16191032114896727480, $a14025182946114952022, $a10127722282178953052, $a4829957937622968220, $a10131833063630260035, $a3533891684095375041, $a7785003721317615588, $a10594480726457361847, $a10147006001458235329, $a12372228123457253136, $a10016996505620670018, $a10155660392008449200, $a10155926828030234078, $a15442683337083171453, $a9765064908646909484, $a7550609288882429832, $a11548830526886645428, $a10204075722023637394, $a10211228835433076965, $a5867828639604451547, $a10222049254704513272, $a13856077787103972722, $a11903549070727627659,.... (this is a very long list of ACL)
This is a clone of issue OCPBUGS-32304. The following is the description of the original issue:
—
Description of problem:
One pod of the metal3 operator is in a constant failure state. The cluster was acting as a hub cluster with ACM + GitOps for SNO installation. It was working well for a few days, until this point, after which no other sites could be deployed. oc get pods -A | grep metal3 openshift-machine-api metal3-64cf86fb8b-fg5b9 3/4 CrashLoopBackOff 35 (108s ago) 155m openshift-machine-api metal3-baremetal-operator-84875f859d-6kj9s 1/1 Running 0 155m openshift-machine-api metal3-image-customization-57f8d4fcd4-996hd 1/1 Running 0 5h
Version-Release number of selected component (if applicable):
OCP version: 4.16.ec5
How reproducible:
Once it starts to fail, it does not recover.
Steps to Reproduce:
1. Unclear. Install a hub cluster with ACM+GitOps 2. (Perhaps: update the AgentServiceConfig)
Actual results:
Pod crashing and installation of spoke cluster fails
Expected results:
Pod running and installation of the spoke cluster succeeds.
Additional info:
Logs of metal3-ironic-inspector: `[kni@infra608-1 ~]$ oc logs pods/metal3-64cf86fb8b-fg5b9 -c metal3-ironic-inspector + CONFIG=/etc/ironic-inspector/ironic-inspector.conf + export IRONIC_INSPECTOR_ENABLE_DISCOVERY=false + IRONIC_INSPECTOR_ENABLE_DISCOVERY=false + export INSPECTOR_REVERSE_PROXY_SETUP=true + INSPECTOR_REVERSE_PROXY_SETUP=true + . /bin/tls-common.sh ++ export IRONIC_CERT_FILE=/certs/ironic/tls.crt ++ IRONIC_CERT_FILE=/certs/ironic/tls.crt ++ export IRONIC_KEY_FILE=/certs/ironic/tls.key ++ IRONIC_KEY_FILE=/certs/ironic/tls.key ++ export IRONIC_CACERT_FILE=/certs/ca/ironic/tls.crt ++ IRONIC_CACERT_FILE=/certs/ca/ironic/tls.crt ++ export IRONIC_INSECURE=true ++ IRONIC_INSECURE=true ++ export 'IRONIC_SSL_PROTOCOL=-ALL +TLSv1.2 +TLSv1.3' ++ IRONIC_SSL_PROTOCOL='-ALL +TLSv1.2 +TLSv1.3' ++ export 'IPXE_SSL_PROTOCOL=-ALL +TLSv1.2 +TLSv1.3' ++ IPXE_SSL_PROTOCOL='-ALL +TLSv1.2 +TLSv1.3' ++ export IRONIC_VMEDIA_SSL_PROTOCOL=ALL ++ IRONIC_VMEDIA_SSL_PROTOCOL=ALL ++ export IRONIC_INSPECTOR_CERT_FILE=/certs/ironic-inspector/tls.crt ++ IRONIC_INSPECTOR_CERT_FILE=/certs/ironic-inspector/tls.crt ++ export IRONIC_INSPECTOR_KEY_FILE=/certs/ironic-inspector/tls.key ++ IRONIC_INSPECTOR_KEY_FILE=/certs/ironic-inspector/tls.key ++ export IRONIC_INSPECTOR_CACERT_FILE=/certs/ca/ironic-inspector/tls.crt ++ IRONIC_INSPECTOR_CACERT_FILE=/certs/ca/ironic-inspector/tls.crt ++ export IRONIC_INSPECTOR_INSECURE=true ++ IRONIC_INSPECTOR_INSECURE=true ++ export IRONIC_VMEDIA_CERT_FILE=/certs/vmedia/tls.crt ++ IRONIC_VMEDIA_CERT_FILE=/certs/vmedia/tls.crt ++ export IRONIC_VMEDIA_KEY_FILE=/certs/vmedia/tls.key ++ IRONIC_VMEDIA_KEY_FILE=/certs/vmedia/tls.key ++ export IPXE_CERT_FILE=/certs/ipxe/tls.crt ++ IPXE_CERT_FILE=/certs/ipxe/tls.crt ++ export IPXE_KEY_FILE=/certs/ipxe/tls.key ++ IPXE_KEY_FILE=/certs/ipxe/tls.key ++ export RESTART_CONTAINER_CERTIFICATE_UPDATED=false ++ RESTART_CONTAINER_CERTIFICATE_UPDATED=false ++ export MARIADB_CACERT_FILE=/certs/ca/mariadb/tls.crt ++ MARIADB_CACERT_FILE=/certs/ca/mariadb/tls.crt ++ export IPXE_TLS_PORT=8084 ++ IPXE_TLS_PORT=8084 ++ mkdir -p /certs/ironic ++ mkdir -p /certs/ironic-inspector ++ mkdir -p /certs/ca/ironic mkdir: cannot create directory '/certs/ca/ironic': Permission denied
If the user relies on mirror registries, and the clusterimageset is set to a tagged image (e.g. quay.io/openshift-release-dev/ocp-release:4.15.0-multi), as opposed to a by-digest image (e.g. quay.io/openshift-release-dev/ocp-release@sha256:b86422e972b9c838dfdb8b481a67ae08308437d6489ea6aaf150242b1d30fa1c), then `oc` will fail to pull with:
--icsp-file only applies to images referenced by digest and will be ignored for tags
Instead we should probably block it at the reconcile stage, or give the user clearer CR errors so they don't have to dig in the assisted service logs to figure out what went wrong
The oc error is actually much more confusing - oc ignores the ICSP, tries to pull from quay, and runs into issues because mirror registries are typically used in disconnected environments where quay is unreachable / has a different certificate - so there are a lot of red herrings the user will chase until they realize they should have used a digest.
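A minimal sketch of the reconcile-stage check suggested above: flag a ClusterImageSet release image that is referenced by tag rather than by digest. This is illustrative only; the assisted-service's actual types and validation live elsewhere.

package main

import (
	"fmt"
	"strings"
)

// isByDigest reports whether an image reference pins a digest
// (...@sha256:<hex>), which is required for ICSP/IDMS mirroring to apply.
func isByDigest(image string) bool {
	return strings.Contains(image, "@sha256:")
}

func validateClusterImageSet(releaseImage string) error {
	if !isByDigest(releaseImage) {
		return fmt.Errorf("release image %q is referenced by tag; mirror (ICSP/IDMS) configuration only applies to by-digest references, use the sha256 digest instead", releaseImage)
	}
	return nil
}

func main() {
	fmt.Println(validateClusterImageSet("quay.io/openshift-release-dev/ocp-release:4.15.0-multi"))
	fmt.Println(validateClusterImageSet("quay.io/openshift-release-dev/ocp-release@sha256:b86422e972b9c838dfdb8b481a67ae08308437d6489ea6aaf150242b1d30fa1c"))
}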
Description of problem:
When using a custom CNI plugin in a hostedcluster, multus requires some CSRs to be approved. The component approving these CSRs is the network-node-identity. This component only gets the proper RBAC rules configured when networkType is set to Calico. In the current implementation, there is a condition that will apply the required RBAC if the networkType is set to Calico[1]. When using other CNI plugins, like Cilium, you're supposed to set networkType to Other. With the current implementation, you won't get the required RBAC in place and as such, the required CSRs won't be approved automatically. [1] https://github.com/openshift/hypershift/blob/release-4.14/control-plane-operator/controllers/hostedcontrolplane/cno/clusternetworkoperator.go#L139
Version-Release number of selected component (if applicable):
Latest
How reproducible:
Always
Steps to Reproduce:
1. Set hostedcluster.spec.networking.networkType to Other 2. Wait for the HC to start deploying and for the Nodes to join the cluster 3. The nodes will remain in NotReady. Multus pods will complain about certificates not being ready. 4. If you list CSRs you will find pending CSRs.
Actual results:
RBAC not properly configured when networkType set to Other
Expected results:
RBAC properly configured when networkType set to Other
Additional info:
Slack discussion: https://redhat-internal.slack.com/archives/C01C8502FMM/p1704824277049609
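A minimal sketch of relaxing the condition described above so the extra RBAC is applied for any user-provided CNI rather than only Calico; this mirrors the idea, not the exact hypershift code.

package main

import "fmt"

// networkType values as used in the HostedCluster API.
const (
	openShiftSDN  = "OpenShiftSDN"
	ovnKubernetes = "OVNKubernetes"
	calico        = "Calico"
	other         = "Other"
)

// needsThirdPartyCNIRBAC returns true when the CNI is brought by the user
// (Calico, Other, ...), in which case network-node-identity needs the RBAC
// that lets it approve the multus/kubelet-serving CSRs.
func needsThirdPartyCNIRBAC(networkType string) bool {
	switch networkType {
	case openShiftSDN, ovnKubernetes:
		return false
	default:
		return true
	}
}

func main() {
	fmt.Println(needsThirdPartyCNIRBAC(calico))        // true
	fmt.Println(needsThirdPartyCNIRBAC(other))         // true
	fmt.Println(needsThirdPartyCNIRBAC(ovnKubernetes)) // false
}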
Please review the following PR: https://github.com/openshift/hypershift/pull/3303
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
oc supports parsing multiple IDMS in a single file. This is a prerequisite for feature OCPNODE-2281.
Fallout of https://issues.redhat.com/browse/OCPBUGS-35371
We simply do not have enough visibility into why these kubelet endpoints are going down, outside of a reboot, while kubelet itself stays up.
A big step would be charting them with the intervals. Add a new monitor test to query prometheus at the end of the run looking for when these targets were down.
Prom query:
max by (node, metrics_path) (up{job="kubelet"}) == 0
Then perhaps a test to flake if we see this happen outside of a node reboot. This seems to happen on every gcp-ovn (non-upgrade) job I look at. It does NOT seem to happen on AWS.
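A minimal sketch of running that query at the end of a job with the Prometheus HTTP API client; the address and auth are placeholders, and the real origin monitor test would translate the samples into intervals rather than a single instant query.

package main

import (
	"context"
	"fmt"
	"time"

	"github.com/prometheus/client_golang/api"
	v1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

func main() {
	// Placeholder address; in-cluster this would go through the
	// openshift-monitoring route with a bearer token.
	client, err := api.NewClient(api.Config{Address: "http://localhost:9090"})
	if err != nil {
		panic(err)
	}
	promAPI := v1.NewAPI(client)

	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	// Any series returned here is a kubelet target that was down.
	query := `max by (node, metrics_path) (up{job="kubelet"}) == 0`
	result, warnings, err := promAPI.Query(ctx, query, time.Now())
	if err != nil {
		panic(err)
	}
	if len(warnings) > 0 {
		fmt.Println("warnings:", warnings)
	}
	fmt.Println(result)
}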
Description of the problem:
When trying to create a cluster with the s390x architecture, an error occurs that stops cluster creation. The error is "cannot use Skip MCO reboot because it's not compatible with the s390x architecture on version 4.15.0-ec.3 of OpenShift"
How reproducible:
Always
Steps to reproduce:
Create cluster with architecture s390x
Actual results:
Create failed
Expected results:
Create should succeed
Description of problem:
Various jobs are failing in e2e-gcp-operator due to the LoadBalancer-Type Service not going "ready", which means it is most likely not getting an IP address. Tests so far affected are: - TestUnmanagedDNSToManagedDNSInternalIngressController - TestScopeChange - TestInternalLoadBalancerGlobalAccessGCP - TestInternalLoadBalancer - TestAllowedSourceRanges For example, in TestInternalLoadBalancer, the Load Balancer never comes back ready: operator_test.go:1454: Expected conditions: map[Admitted:True Available:True DNSManaged:True DNSReady:True LoadBalancerManaged:True LoadBalancerReady:True] Current conditions: map[Admitted:True Available:False DNSManaged:True DNSReady:False Degraded:True DeploymentAvailable:True DeploymentReplicasAllAvailable:True DeploymentReplicasMinAvailable:True DeploymentRollingOut:False EvaluationConditionsDetected:False LoadBalancerManaged:True LoadBalancerProgressing:False LoadBalancerReady:False Progressing:False Upgradeable:True] Where DNSReady:False and LoadBalancerReady:False.
Version-Release number of selected component (if applicable):
4.14
How reproducible:
10% of the time
Steps to Reproduce:
1. Run e2e-gcp-operator many times until you see one of these failures
Actual results:
Test Failure
Expected results:
Not failure
Additional info:
Search.CI Links:
TestScopeChange
TestInternalLoadBalancerGlobalAccessGCP & TestInternalLoadBalancer
This does not seem related to https://issues.redhat.com/browse/OCPBUGS-6013. The DNS E2E tests actually pass this same condition check.
Please review the following PR: https://github.com/openshift/route-controller-manager/pull/36
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
The command does not honor Windows path separators.
Related to https://issues.redhat.com//browse/OCPBUGS-28864 (access restricted and not publicly visible). This report serves as a target issue for the fix and its backport to older OCP versions. Please see more details in https://issues.redhat.com//browse/OCPBUGS-28864.
This is a clone of issue OCPBUGS-39029. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-38289. The following is the description of the original issue:
—
Description of problem:
The cluster-wide proxy is automatically injected into the remote-write config in the Prometheus k8s CR in the openshift-monitoring project, but the noProxy URLs are not injected, which would be expected. As a result, if the remote-write endpoint is in the noProxy region, metrics are not transferred.
Version-Release number of selected component (if applicable):
RHOCP 4.16.4
How reproducible:
100%
Steps to Reproduce:
1. Configure proxy custom resource in RHOCP 4.16.4 cluster 2. Create cluster-monitoring-config configmap in openshift-monitoring project 3. Inject remote-write config (without specifically configuring proxy for remote-write) 4. After saving the modification in cluster-monitoring-config configmap, check the remoteWrite config in Prometheus k8s CR. Now it contains the proxyUrl but NOT the noProxy URL(referenced from cluster proxy). Example snippet: ============== apiVersion: monitoring.coreos.com/v1 kind: Prometheus metadata: [...] name: k8s namespace: openshift-monitoring spec: [...] remoteWrite: - proxyUrl: http://proxy.abc.com:8080 <<<<<====== Injected Automatically but there is no noProxy URL. url: http://test-remotewrite.test.svc.cluster.local:9090
Actual results:
The proxy URL from proxy CR is getting injected in Prometheus k8s CR automatically when configuring remoteWrite but it doesn't have noProxy inherited from cluster proxy resource.
Expected results:
The noProxy URL should get injected in Prometheus k8s CR as well.
Additional info:
Please review the following PR: https://github.com/openshift/whereabouts-cni/pull/242
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/cluster-dns-operator/pull/401
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
CNO assumes only the master and worker machine config pools are present on the cluster. While running CI with 24 nodes, it was found that two more pools, infra and workload, are present, so these pools should also be taken into consideration while rolling out the ipsec machine config.
# omg get mcp NAME CONFIG UPDATED UPDATING DEGRADED MACHINECOUNT READYMACHINECOUNT UPDATEDMACHINECOUNT DEGRADEDMACHINECOUNT AGE infra rendered-infra-52f7615d8c841e7570b7ab6cbafecac8 True False False 3 3 3 0 38m master rendered-master-fbb5d8e1337d1244d30291ffe3336e45 True False False 3 3 3 0 1h10m worker rendered-worker-52f7615d8c841e7570b7ab6cbafecac8 False True False 24 12 12 0 1h10m workload rendered-workload-52f7615d8c841e7570b7ab6cbafecac8 True False False 0 0 0 0 38m
Several oc examples are incorrect. These are used in the CLI reference docs, but would also appear in the oc CLI help.
The commands that don't work have been removed manually from the CLI reference docs via this update: https://github.com/openshift/openshift-docs/compare/9907074162999c982a8a97c45665c98913d848c9..441f3419ef460d9863a45e4c2d6914b1c019e1d1
List of commands:
For more information, see the feedback on these PRs:
Description of problem:
The Helm Plugin's index view parses a given chart entry's items into multiple tiles if the individual entry names vary. This is inconsistent with the Helm CLI experience, which treats all items in an index entry (i.e. all versions of a given chart) as part of the same chart.
Version-Release number of selected component (if applicable):
All
How reproducible:
100%
Steps to Reproduce:
1. Open the Developer Console, Helm Plugin 2. Select a namespace and Click to create a helm release 3. Search for the developer-hub chart in the catalog (this is an example demonstrating the problem)
Actual results:
There are two tiles for Developer Hub, but only one index entry in the corresponding index (https://charts.openshift.io)
Expected results:
A single tile should exist for this single index entry.
Additional info:
The cause of this is an expected indexing inconsistency, but the experience should align with the Helm CLI's behavior, and should still represent a single catalog tile per index entry.
This is a clone of issue OCPBUGS-38560. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-37945. The following is the description of the original issue:
—
Description of problem:
openshift-install create cluster leads to error: ERROR failed to fetch Cluster: failed to generate asset "Cluster": failed to create cluster: failed during pre-provisioning: unable to initialize folders and templates: failed to import ova: failed to lease wait: Invalid configuration for device '0'. Vsphere standard port group
Version-Release number of selected component (if applicable):
How reproducible:
always
Steps to Reproduce:
1. openshift-install create cluster 2. Choose Vsphere 3. fill in the blanks 4. Have a standard port group
Actual results:
error
Expected results:
cluster creation
Additional info:
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
HO uses the ICSP/IDMS from the management cluster to extract the OCP release metadata to be used in the HostedCluster. But they are extracted only once in main.go: https://github.com/jparrill/hypershift/blob/9bf1403ae09c0f262ebfe006267e3b442cc70149/hypershift-operator/main.go#L287-L293 before starting the HC and NP controllers, and they are never refreshed when the ICSP/IDMS change on the management cluster, nor when a new HostedCluster is created.
Version-Release number of selected component (if applicable):
4.14 4.15 4.16
How reproducible:
100%
Steps to Reproduce:
1. Ensure that HO is already running 2. Create an ICSP or an IDMS on the management cluster 3. Try to create a hosted cluster
Actual results:
The imageRegistryOverrides setting for the new hosted cluster ignores the ICSP/IDMS created while the HO was already running. Killing the HO operator pod and waiting for it to restart will bring a different result.
Expected results:
HO consistently consumes ICSP/IDMS info at runtime without needing to be restarted
Additional info:
It affects disconnected deployments
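A minimal sketch of the behaviour being asked for: recomputing the registry overrides inside the reconcile loop instead of once in main(); the types and lister functions here are illustrative assumptions, not the hypershift-operator's actual API.

package main

import (
	"context"
	"fmt"
)

// registryOverride maps a source registry to its mirrors.
type registryOverride struct {
	Source  string
	Mirrors []string
}

// imageRegistryOverrides is recomputed on every reconcile, so an ICSP/IDMS
// created after the hypershift-operator started (or just before a new
// HostedCluster) is picked up without restarting the operator pod.
func imageRegistryOverrides(ctx context.Context, listIDMS, listICSP func(context.Context) ([]registryOverride, error)) ([]registryOverride, error) {
	idms, err := listIDMS(ctx)
	if err != nil {
		return nil, fmt.Errorf("listing ImageDigestMirrorSets: %w", err)
	}
	icsp, err := listICSP(ctx)
	if err != nil {
		return nil, fmt.Errorf("listing ImageContentSourcePolicies: %w", err)
	}
	return append(idms, icsp...), nil
}

func main() {
	overrides, err := imageRegistryOverrides(context.Background(),
		func(context.Context) ([]registryOverride, error) {
			// mirror.example.com is a hypothetical placeholder mirror.
			return []registryOverride{{Source: "quay.io/openshift-release-dev/ocp-release", Mirrors: []string{"mirror.example.com/ocp-release"}}}, nil
		},
		func(context.Context) ([]registryOverride, error) { return nil, nil },
	)
	fmt.Println(overrides, err)
}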
Description of problem:
Priority Class override for ignition-server deployment was accidentally ripped out when a new reconcileProxyDeployment() func was introduced.
Version-Release number of selected component (if applicable):
How reproducible:
100%
Steps to Reproduce:
1. Create a cluster with priority class override opted in 2. Override priority class in HC 3. Check ignition server deployment priority class
Actual results:
doesn't override priority class
Expected results:
overridden priority class
Additional info:
Description of problem:
The e2e-gcp-op-layering CI job seems to be continuously and consistently failing during the teardown process. In particular, it appears to be the TestOnClusterBuildRollsOutImage test that is failing whenever it attempts to tear down the node. See: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/4060/pull-ci-openshift-machine-config-operator-master-e2e-gcp-op-layering/1744805949165539328 for an example of a failing job.
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
Open a PR to the GitHub MCO repository.
Actual results:
The teardown portion of the TestOnClusterBuildsRollout test fails thusly: utils.go:1097: Deleting machine ci-op-v5qcditr-46b3f-bh29c-worker-c-fcl9f / node ci-op-v5qcditr-46b3f-bh29c-worker-c-fcl9f utils.go:1098: Error Trace: /go/src/github.com/openshift/machine-config-operator/test/helpers/utils.go:1098 /go/src/github.com/openshift/machine-config-operator/test/e2e-layering/onclusterbuild_test.go:103 /go/src/github.com/openshift/machine-config-operator/test/e2e-layering/helpers_test.go:149 /go/src/github.com/openshift/machine-config-operator/test/helpers/utils.go:79 /usr/lib/golang/src/testing/testing.go:1150 /usr/lib/golang/src/testing/testing.go:1328 /usr/lib/golang/src/testing/testing.go:1570 Error: Received unexpected error: exit status 1 Test: TestOnClusterBuildRollsOutImage utils.go:1097: Deleting machine ci-op-v5qcditr-46b3f-bh29c-worker-c-fcl9f / node ci-op-v5qcditr-46b3f-bh29c-worker-c-fcl9f utils.go:1098: Error Trace: /go/src/github.com/openshift/machine-config-operator/test/helpers/utils.go:1098 /go/src/github.com/openshift/machine-config-operator/test/e2e-layering/onclusterbuild_test.go:103 /go/src/github.com/openshift/machine-config-operator/test/e2e-layering/helpers_test.go:149 /go/src/github.com/openshift/machine-config-operator/test/helpers/utils.go:79 /usr/lib/golang/src/testing/testing.go:1150 /usr/lib/golang/src/testing/testing.go:1328 /usr/lib/golang/src/testing/testing.go:1312 /usr/lib/golang/src/runtime/panic.go:522 /usr/lib/golang/src/testing/testing.go:980 /go/src/github.com/openshift/machine-config-operator/test/helpers/utils.go:1098 /go/src/github.com/openshift/machine-config-operator/test/e2e-layering/onclusterbuild_test.go:103 /go/src/github.com/openshift/machine-config-operator/test/e2e-layering/helpers_test.go:149 /go/src/github.com/openshift/machine-config-operator/test/helpers/utils.go:79 /usr/lib/golang/src/testing/testing.go:1150 /usr/lib/golang/src/testing/testing.go:1328 /usr/lib/golang/src/testing/testing.go:1570 Error: Received unexpected error: exit status 1 Test: TestOnClusterBuildRollsOutImage
Expected results:
This part of the test should pass.
Additional info:
The way the test teardown process currently works is that it shells out to the oc command to delete the underlying Machine and Node. We delete the underlying machine and node so that the cloud provider will provision us a new one due to issues with opting out of on-cluster builds that have yet to be resolved. At the time this test was written, it was implemented in this way to avoid having to vendor the Machine client and API into the MCO codebase which has since happened. I suspect the issue is that oc is failing in some way since we get an exit status 1 from where it is invoked. Now that the Machine client and API are vendored into the MCO codebase, it makes more sense for us to use those directly instead of shelling out to oc in order to do this since we would get more verbose error messages instead.
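A minimal sketch of what that could look like with the vendored clients; the namespace and client wiring are assumptions, and the real helper would live in the MCO test utilities.

package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"

	machineclient "github.com/openshift/client-go/machine/clientset/versioned"
)

// deleteMachineAndNode replaces the `oc delete` shell-out with direct API
// calls, so a failure returns a real error instead of "exit status 1".
func deleteMachineAndNode(ctx context.Context, cfg *rest.Config, machineName, nodeName string) error {
	mc, err := machineclient.NewForConfig(cfg)
	if err != nil {
		return err
	}
	kc, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		return err
	}
	if err := mc.MachineV1beta1().Machines("openshift-machine-api").Delete(ctx, machineName, metav1.DeleteOptions{}); err != nil {
		return fmt.Errorf("deleting machine %s: %w", machineName, err)
	}
	if err := kc.CoreV1().Nodes().Delete(ctx, nodeName, metav1.DeleteOptions{}); err != nil {
		return fmt.Errorf("deleting node %s: %w", nodeName, err)
	}
	return nil
}

func main() {
	fmt.Println("see deleteMachineAndNode; wiring a *rest.Config is left to the test harness")
}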
Description of problem:
When we pin an image while using ImageDigestMirrorSets, we get this failure: E0422 14:22:29.588035 2366 daemon.go:1380] Fatal error from auxiliary tools: failed to get auth config for image example.io/digest-example/mybusy@sha256:0415f56ccc05526f2af5a7ae8654baec97d4a614f24736e8eef41a 4591f08019: no auth found for image: "example.io/digest-example/mybusy@sha256:0415f56ccc05526f2af5a7ae8654baec97d4a614f24736e8eef41a4591f08019"
Version-Release number of selected component (if applicable):
pre-merge
How reproducible:
Always
Steps to Reproduce:
1. Create an ImageDigestMirrorSet to configure mirrors for an image apiVersion: config.openshift.io/v1 kind: ImageDigestMirrorSet metadata: name: digest-mirror spec: imageDigestMirrors: - mirrors: - quay.io/openshifttest/busybox source: example.io/digest-example/mybusy mirrorSourcePolicy: NeverContactSource # do not redirect to the source registry if the pull from the mirror is failed 2. Create a pinnedimageset using the values in the ImageDigestMirrorSet apiVersion: machineconfiguration.openshift.io/v1alpha1 kind: PinnedImageSet metadata: creationTimestamp: "2024-04-22T12:49:51Z" generation: 1 labels: machineconfiguration.openshift.io/role: worker name: my-worker-pinned-images resourceVersion: "78482" uid: f06c94c4-067f-4404-b3c2-11d5aff4e0cb spec: pinnedImages: - name: example.io/digest-example/mybusy@sha256:0415f56ccc05526f2af5a7ae8654baec97d4a614f24736e8eef41a4591f08019
Actual results:
In the MCD logs we can see the following failure $ oc -n openshift-machine-config-operator logs machine-config-daemon-dmtxx ... I0422 14:26:14.117242 2438 pinned_image_set.go:274] Reconciling pinned image set: my-worker-pinned-images-2: generation: 1 E0422 14:26:14.125965 2438 pinned_image_set.go:981] failed to get auth config for image example.io/digest-example/mybusy@sha256:0415f56ccc05526f2af5a7ae8654baec97d4a614f24736e8eef41a4591f08019: no auth found for image: "example.io/digest-example/mybusy@sha256:0415f56ccc05526f2af5a7ae8654baec97d4a614f24736e8eef41a4591f08019" W0422 14:26:14.125990 2438 pinned_image_set.go:983] failed: worker max retries: 15
Expected results:
Since we can pull the image when we debug the node, we should be able to pin it: sh-5.1# crictl pull example.io/digest-example/mybusy@sha256:0415f56ccc05526f2af5a7ae8654baec97d4a614f24736e8eef41a4591f08019 Image is up to date for example.io/digest-example/mybusy@sha256:0415f56ccc05526f2af5a7ae8654baec97d4a614f24736e8eef41a4591f08019
Additional info:
We get a similar error when we try to pin release images while using releases built by clusterbot. For example, we can try to pin the rhel-coreos image: $ oc adm release info --image-for rhel-coreos registry.build03.ci.openshift.org/ci-ln-4cx6v6b/stable@sha256:85a096a567ca287ba9c0fe36642e49c34eb4dd541914f8823750e4b186fce569 If we try to pin it, we get a similar error: E0422 13:03:14.979410 2951 daemon.go:1380] Fatal error from auxiliary tools: failed to get auth config for image registry.build03.ci.openshift.org/ci-ln-4cx6v6b/stable@sha256:85a096a567ca287ba9c0fe36642e49c34eb4dd541914f8823750e4b186fce569: no auth found for image: "registry.build03.ci.openshift.org/ci-ln-4cx6v6b/stable@sha256:85a096a567ca287ba9c0fe36642e49c34eb4dd541914f8823750e4b186fce569" But the image can actually be pulled: sh-5.1# crictl pull registry.build03.ci.openshift.org/ci-ln-4cx6v6b/stable@sha256:85a096a567ca287ba9c0fe36642e49c34eb4dd541914f8823750e4b186fce569 Image is up to date for registry.build03.ci.openshift.org/ci-ln-4cx6v6b/stable@sha256:85a096a567ca287ba9c0fe36642e49c34eb4dd541914f8823750e4b186fce569
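One way to double-check the node-side configuration while debugging this (not part of the original report; the node name and paths are illustrative) is to confirm the ImageDigestMirrorSet was rendered into registries.conf before the MCD attempts the pin:

oc debug node/<node-name> -- chroot /host \
  grep -A3 'example.io/digest-example/mybusy' /etc/containers/registries.conf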
This is a clone of issue OCPBUGS-38119. The following is the description of the original issue:
—
Description of problem:
It would be nice to have each of the e2e test specs shown in the test grid report (https://testgrid.k8s.io/redhat-openshift-olm#periodic-ci-openshift-operator-framework-olm-master-periodics-e2e-gcp-olm&show-stale-tests=). I noticed that the test grid for 4.14 is exhibiting the right behaviour: https://testgrid.k8s.io/redhat-openshift-olm#periodic-ci-openshift-operator-framework-olm-release-4.14-periodics-e2e-gcp-olm&show-stale-tests= So, we should make the junit e2e report look like what it looks like in the 4.14 branch.
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. Open browser of your choice 2. Go to the link in the description section 3. Direct eyeballs to screen
Actual results:
No e2e specs in the test grid table
Expected results:
e2e specs in the test grid table
Additional info:
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Please review the following PR: https://github.com/openshift/cloud-provider-aws/pull/59
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/cloud-provider-vsphere/pull/60
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
Upgrading OCP from 4.14.7 to 4.15.0 nightly build failed on Provider cluster which is part of provider-client setup. Platform: IBM Cloud Bare Metal cluster. Steps done: Step 1. $ oc patch clusterversions/version -p '{"spec":{"channel":"stable-4.15"}}' --type=merge clusterversion.config.openshift.io/version patched Step 2: $ oc adm upgrade --to-image=registry.ci.openshift.org/ocp/release:4.15.0-0.nightly-2024-01-18-050837 --allow-explicit-upgrade --force warning: Using by-tag pull specs is dangerous, and while we still allow it in combination with --force for backward compatibility, it would be much safer to pass a by-digest pull spec instead warning: The requested upgrade image is not one of the available updates.You have used --allow-explicit-upgrade for the update to proceed anyway warning: --force overrides cluster verification of your supplied release image and waives any update precondition failures. Requesting update to release image registry.ci.openshift.org/ocp/release:4.15.0-0.nightly-2024-01-18-050837 The cluster was not upgraded successfully. $ oc get clusteroperator | grep -v "4.15.0-0.nightly-2024-01-18-050837 True False False" NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE authentication 4.15.0-0.nightly-2024-01-18-050837 True False True 111s APIServerDeploymentDegraded: 1 of 3 requested instances are unavailable for apiserver.openshift-oauth-apiserver ()... console 4.15.0-0.nightly-2024-01-18-050837 False False False 111s RouteHealthAvailable: console route is not admitted dns 4.15.0-0.nightly-2024-01-18-050837 True True False 12d DNS "default" reports Progressing=True: "Have 4 available DNS pods, want 5.\nHave 5 available node-resolver pods, want 6." etcd 4.15.0-0.nightly-2024-01-18-050837 True False True 12d EtcdEndpointsDegraded: EtcdEndpointsController can't evaluate whether quorum is safe: etcd cluster has quorum of 2 and 2 healthy members which is not fault tolerant: [{Member:ID:14147288297306253147 name:"baremetal2-06.qe.rh-ocs.com" peerURLs:"https://52.116.161.167:2380" clientURLs:"https://52.116.161.167:2379" Healthy:false Took: Error:create client failure: failed to make etcd client for endpoints [https://52.116.161.167:2379]: context deadline exceeded} {Member:ID:15369339084089827159 name:"baremetal2-03.qe.rh-ocs.com" peerURLs:"https://52.116.161.164:2380" clientURLs:"https://52.116.161.164:2379" Healthy:true Took:9.617293ms Error:<nil>} {Member:ID:17481226479420161008 name:"baremetal2-04.qe.rh-ocs.com" peerURLs:"https://52.116.161.165:2380" clientURLs:"https://52.116.161.165:2379" Healthy:true Took:9.090133ms Error:<nil>}]... image-registry 4.15.0-0.nightly-2024-01-18-050837 True True False 12d Progressing: All registry resources are removed... 
machine-config 4.14.7 True True True 7d22h Unable to apply 4.15.0-0.nightly-2024-01-18-050837: error during syncRequiredMachineConfigPools: [context deadline exceeded, failed to update clusteroperator: [client rate limiter Wait returned an error: context deadline exceeded, MachineConfigPool master has not progressed to latest configuration: controller version mismatch for rendered-master-9b7e02d956d965d0906def1426cb03b5 expected eaab8f3562b864ef0cc7758a6b19cc48c6d09ed8 has 7649b9274cde2fb50a61a579e3891c8ead2d79c5: 0 (ready 0) out of 3 nodes are updating to latest configuration rendered-master-34b4781f1a0fe7119765487c383afbb3, retrying]] monitoring 4.15.0-0.nightly-2024-01-18-050837 False True True 7m54s UpdatingUserWorkloadPrometheus: client rate limiter Wait returned an error: context deadline exceeded, UpdatingUserWorkloadThanosRuler: waiting for ThanosRuler object changes failed: waiting for Thanos Ruler openshift-user-workload-monitoring/user-workload: context deadline exceeded network 4.15.0-0.nightly-2024-01-18-050837 True True False 12d DaemonSet "/openshift-network-diagnostics/network-check-target" is not available (awaiting 2 nodes)... node-tuning 4.15.0-0.nightly-2024-01-18-050837 True True False 98m Working towards "4.15.0-0.nightly-2024-01-18-050837" $ oc get machineconfigpool NAME CONFIG UPDATED UPDATING DEGRADED MACHINECOUNT READYMACHINECOUNT UPDATEDMACHINECOUNT DEGRADEDMACHINECOUNT AGE master rendered-master-9b7e02d956d965d0906def1426cb03b5 False True True 3 0 0 1 12d worker rendered-worker-4f54b43e9f934f0659761929f55201a1 False True True 3 1 1 1 12d $ oc get clusterversion NAME VERSION AVAILABLE PROGRESSING SINCE STATUS version 4.14.7 True True 120m Unable to apply 4.15.0-0.nightly-2024-01-18-050837: an unknown error has occurred: MultipleErrors $ oc get nodes NAME STATUS ROLES AGE VERSION baremetal2-01.qe.rh-ocs.com Ready worker 12d v1.27.8+4fab27b baremetal2-02.qe.rh-ocs.com Ready worker 12d v1.27.8+4fab27b baremetal2-03.qe.rh-ocs.com Ready control-plane,master,worker 12d v1.27.8+4fab27b baremetal2-04.qe.rh-ocs.com Ready control-plane,master,worker 12d v1.27.8+4fab27b baremetal2-05.qe.rh-ocs.com Ready worker 12d v1.28.5+c84a6b8 baremetal2-06.qe.rh-ocs.com Ready,SchedulingDisabled control-plane,master,worker 12d v1.27.8+4fab27b ---------------------------------------------------- During the efforts to bring the cluster back to a good state, these steps were done: The node baremetal2-06.qe.rh-ocs.com was uncordoned. Tried to upgrade to using the command $ oc adm upgrade --to-image=registry.ci.openshift.org/ocp/release:4.15.0-0.nightly-2024-01-22-051500 --allow-explicit-upgrade --force --allow-upgrade-with-warnings=true warning: Using by-tag pull specs is dangerous, and while we still allow it in combination with --force for backward compatibility, it would be much safer to pass a by-digest pull spec instead warning: The requested upgrade image is not one of the available updates.You have used --allow-explicit-upgrade for the update to proceed anyway warning: --force overrides cluster verification of your supplied release image and waives any update precondition failures. 
warning: --allow-upgrade-with-warnings is bypassing: the cluster is already upgrading: Reason: ClusterOperatorsDegraded Message: Unable to apply 4.15.0-0.nightly-2024-01-18-050837: wait has exceeded 40 minutes for these operators: etcd, kube-apiserverRequesting update to release image registry.ci.openshift.org/ocp/release:4.15.0-0.nightly-2024-01-22-051500 Upgrade to 4.15.0-0.nightly-2024-01-22-051500 also was not successful. Node baremetal2-01.qe.rh-ocs.com was drained manually to see if that works. Some clusteroperators stayed on the previous version. Some moved to Degraded state. $ oc get machineconfigpool NAME CONFIG UPDATED UPDATING DEGRADED MACHINECOUNT READYMACHINECOUNT UPDATEDMACHINECOUNT DEGRADEDMACHINECOUNT AGE master rendered-master-9b7e02d956d965d0906def1426cb03b5 False True False 3 1 1 0 13d worker rendered-worker-4f54b43e9f934f0659761929f55201a1 False True True 3 1 1 1 13d $ oc get pdb -n openshift-storage NAME MIN AVAILABLE MAX UNAVAILABLE ALLOWED DISRUPTIONS AGE rook-ceph-mds-ocs-storagecluster-cephfilesystem 1 N/A 1 11d rook-ceph-mon-pdb N/A 1 1 11d rook-ceph-osd N/A 1 1 3h17m $ oc rsh rook-ceph-tools-57fd4d4d68-p2psh ceph osd tree ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF -1 5.23672 root default -5 1.74557 host baremetal2-01-qe-rh-ocs-com 1 ssd 0.87279 osd.1 up 1.00000 1.00000 4 ssd 0.87279 osd.4 up 1.00000 1.00000 -7 1.74557 host baremetal2-02-qe-rh-ocs-com 3 ssd 0.87279 osd.3 up 1.00000 1.00000 5 ssd 0.87279 osd.5 up 1.00000 1.00000 -3 1.74557 host baremetal2-05-qe-rh-ocs-com 0 ssd 0.87279 osd.0 up 1.00000 1.00000 2 ssd 0.87279 osd.2 up 1.00000 1.00000 OCP must-gather logs - http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/hcp414-aaa/hcp414-aaa_20240112T084548/logs/must-gather-ibm-bm2-provider/must-gather.local.1079362865726528648/
Version-Release number of selected component (if applicable):
Initial version: OCP 4.14.7 ODF 4.14.4-5.fusion-hci OpenShift Virtualization: kubevirt-hyperconverged-operator.4.16.0-380 Local Storage: local-storage-operator.v4.14.0-202312132033 OpenShift Data Foundation Client : ocs-client-operator.v4.14.4-5.fusion-hci
How reproducible:
Reporting the first occurrence of the issue.
Steps to Reproduce:
1. On a Provider-client HCI setup , upgrade provider cluster to a nightly build of OCP
Actual results:
OCP upgrade not successful. Some operators become degraded. The worker machineconfigpool has 1 degraded machine.
Expected results:
OCP upgrade from 4.14.7 to the nightly build should succeed.
Additional info:
There are 3 hosted clients present
Description of problem:
Set up a cluster on vSphere with a user-managed ELB, with install-config.yaml apiVIPs: - 10.38.153.2 ingressVIPs: - 10.38.153.3 loadBalancer: type: UserManaged networking: machineNetwork: - cidr: "10.38.153.0/25" featureSet: TechPreviewNoUpgrade After the cluster is started, keepalived was found still running on worker nodes. omc get pod -n openshift-vsphere-infra NAME READY STATUS RESTARTS AGE coredns-ci-op-2kch7ldp-72b07-7l4vs-master-0 2/2 Running 0 1h coredns-ci-op-2kch7ldp-72b07-7l4vs-master-1 2/2 Running 0 59m coredns-ci-op-2kch7ldp-72b07-7l4vs-master-2 2/2 Running 0 59m coredns-ci-op-2kch7ldp-72b07-7l4vs-worker-0-tqc74 2/2 Running 0 39m coredns-ci-op-2kch7ldp-72b07-7l4vs-worker-1-s654k 2/2 Running 0 37m keepalived-ci-op-2kch7ldp-72b07-7l4vs-worker-0-tqc74 2/2 Running 0 39m keepalived-ci-op-2kch7ldp-72b07-7l4vs-worker-1-s654k 2/2 Running 0 37m
Version-Release number of selected component (if applicable):
4.15
How reproducible:
always
Steps to Reproduce:
1. setup vsphere on multi-subnet network with ELB, job https://qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gs/qe-private-deck/pr-logs/pull/openshift_release/46458/rehearse-46458-periodic-ci-openshift-openshift-tests-private-release-4.15-amd64-nightly-vsphere-ipi-zones-multisubnets-external-lb-usermanaged-f28/1732560150155235328 2. 3.
Actual results:
Expected results:
keepalived should not be running on worker node.
Additional info:
must-gather logs: https://gcsweb-qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/qe-private-deck/pr-logs/pull/openshift_release/46458/rehearse-46458-periodic-ci-openshift-openshift-tests-private-release-4.15-amd64-nightly-vsphere-ipi-zones-multisubnets-external-lb-usermanaged-f28/1732560150155235328/artifacts/vsphere-ipi-zones-multisubnets-external-lb-usermanaged-f28/gather-must-gather/artifacts/must-gather.tar
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
In the OCP upgrades from 4.13 to 4.14, the canary route configuration is changed as below:
Canary route configuration in OCP 4.13 $ oc get route -n openshift-ingress-canary canary -oyaml apiVersion: route.openshift.io/v1 kind: Route metadata: labels: ingress.openshift.io/canary: canary_controller name: canary namespace: openshift-ingress-canary spec: host: canary-openshift-ingress-canary.apps.<cluster-domain>.com <---- canary route configured with .spec.host Canary route configuration in OCP 4.14: $ oc get route -n openshift-ingress-canary canary -oyaml apiVersion: route.openshift.io/v1 kind: Route labels: ingress.openshift.io/canary: canary_controller name: canary namespace: openshift-ingress-canary spec: port: targetPort: 8080 subdomain: canary-openshift-ingress-canary <---- canary route configured with .spec.subdomain
After the upgrade, the following messages are printed in the ingress-operator pod:
2024-04-24T13:16:34.637Z ERROR operator.init controller/controller.go:265 Reconciler error {"controller": "canary_controller", "object": {"name":"default","namespace":"openshift-ingress-operator"}, "namespace": "openshift-ingress-operator", "name": "default", "reconcileID": "46290893-d755-4735-bb01-e8b707be4053", "error": "failed to ensure canary route: failed to update canary route openshift-ingress-canary/canary: Route.route.openshift.io \"canary\" is invalid: spec.subdomain: Invalid value: \"canary-openshift-ingress-canary\": field is immutable"}
The issue is resolved when the canary route is deleted.
See below the audit logs from the process:
# The route can't be updated with error 422:
{"kind":"Event","apiVersion":"audit.k8s.io/v1","level":"Metadata","auditID":"4e8bfb36-21cc-422b-9391-ef8ff42970ca","stage":"ResponseComplete","requestURI":"/apis/route.openshift.io/v1/namespaces/openshift-ingress-canary/routes/canary","verb":"update","user":{"username":"system:serviceaccount:openshift-ingress-operator:ingress-operator","groups":["system:serviceaccounts","system:serviceaccounts:openshift-ingress-operator","system:authenticated"],"extra":{"authentication.kubernetes.io/pod-name":["ingress-operator-746cd8598-hq2st"],"authentication.kubernetes.io/pod-uid":["f3ebccdf-f3b3-420d-8ea5-e33d98945403"]}},"sourceIPs":["10.128.0.93","10.128.0.2"],"userAgent":"Go-http-client/2.0","objectRef":{"resource":"routes","namespace":"openshift-ingress-canary","name":"canary","uid":"3e179946-d4e3-45ad-9380-c305baefd14e","apiGroup":"route.openshift.io","apiVersion":"v1","resourceVersion":"297888"},"responseStatus":{"metadata":{},"status":"Failure","message":"Route.route.openshift.io \"canary\" is invalid: spec.subdomain: Invalid value: \"canary-openshift-ingress-canary\": field is immutable","reason":"Invalid","details":{"name":"canary","group":"route.openshift.io","kind":"Route","causes":[{"reason":"FieldValueInvalid","message":"Invalid value: \"canary-openshift-ingress-canary\": field is immutable","field":"spec.subdomain"}]},"code":422},"requestReceivedTimestamp":"2024-04-24T13:16:34.630249Z","stageTimestamp":"2024-04-24T13:16:34.636869Z","annotations":{"authorization.k8s.io/decision":"allow","authorization.k8s.io/reason":"RBAC: allowed by ClusterRoleBinding \"openshift-ingress-operator\" of ClusterRole \"openshift-ingress-operator\" to ServiceAccount \"ingress-operator/openshift-ingress-operator\""}}
# Route is deleted manually
{"kind":"Event","apiVersion":"audit.k8s.io/v1","level":"Metadata","auditID":"70821b58-dabc-4593-ba6d-5e81e5d27d21","stage":"ResponseComplete","requestURI":"/apis/route.openshift.io/v1/namespaces/openshift-ingress-canary/routes/canary","verb":"delete","user":{"username":"system:admin","groups":["system:masters","system:authenticated"]},"sourceIPs":["10.0.91.78","10.128.0.2"],"userAgent":"oc/4.13.0 (linux/amd64) kubernetes/7780c37","objectRef":{"resource":"routes","namespace":"openshift-ingress-canary","name":"canary","apiGroup":"route.openshift.io","apiVersion":"v1"},"responseStatus":{"metadata":{},"status":"Success","details":{"name":"canary","group":"route.openshift.io","kind":"routes","uid":"3e179946-d4e3-45ad-9380-c305baefd14e"},"code":200},"requestReceivedTimestamp":"2024-04-24T13:24:39.558620Z","stageTimestamp":"2024-04-24T13:24:39.561267Z","annotations":{"authorization.k8s.io/decision":"allow","authorization.k8s.io/reason":""}}
# Route is created again
{"kind":"Event","apiVersion":"audit.k8s.io/v1","level":"Metadata","auditID":"92e6132a-aa1d-482d-a1dc-9ce021ae4c37","stage":"ResponseComplete","requestURI":"/apis/route.openshift.io/v1/namespaces/openshift-ingress-canary/routes","verb":"create","user":{"username":"system:serviceaccount:openshift-ingress-operator:ingress-operator","groups":["system:serviceaccounts","system:serviceaccounts:openshift-ingress-operator","system:authenticated"],"extra":{"authentication.kubernetes.io/pod-name":["ingress-operator-746cd8598-hq2st"],"authentication.kubernetes.io/pod-uid":["f3ebccdf-f3b3-420d-8ea5-e33d98945403"]}},"sourceIPs":["10.128.0.93","10.128.0.2"],"userAgent":"Go-http-client/2.0","objectRef":{"resource":"routes","namespace":"openshift-ingress-canary","name":"canary","apiGroup":"route.openshift.io","apiVersion":"v1"},"responseStatus":{"metadata":{},"code":201},"requestReceivedTimestamp":"2024-04-24T13:24:39.577255Z","stageTimestamp":"2024-04-24T13:24:39.584371Z","annotations":{"authorization.k8s.io/decision":"allow","authorization.k8s.io/reason":"RBAC: allowed by ClusterRoleBinding \"openshift-ingress-operator\" of ClusterRole \"openshift-ingress-operator\" to ServiceAccount \"ingress-operator/openshift-ingress-operator\""}}
Version-Release number of selected component (if applicable):
Ocp upgrade between 4.13 and 4.14
How reproducible:
Upgrade the cluster from OCP 4.13 to 4.14 and check the ingress operator pod logs
Steps to Reproduce:
1. Install cluster in OCP 4.13 2. Upgrade to OCP 4.14 3. Check the ingress operator logs
Actual results:
Reported errors above
Expected results:
The ingress canary route should be updated without issues
Additional info:
As a developer, I want to be able to:
so that I can achieve
Description of criteria:
This is a clone of issue OCPBUGS-36296. The following is the description of the original issue:
—
Currently the manifests directory has:
0000_30_cluster-api_00_credentials-request.yaml 0000_30_cluster-api_00_namespace.yaml ...
CredentialsRequests go into the openshift-cloud-credential-operator namespace, so they can come before or after the openshift-cluster-api namespace. But because they ask for Secrets in the openshift-cluster-api namespace, there would be less race and drama if the CredentialsRequest manifests were given a name that sorted them after the namespace. Like 0000_30_cluster-api_01_credentials-request.yaml.
I haven't gone digging in history, it may have been like this since forever.
Every time.
With a release image pullspec like registry.ci.openshift.org/ocp/release:4.17.0-0.nightly-2024-06-27-184535:
$ oc adm release extract --to manifests registry.ci.openshift.org/ocp/release:4.17.0-0.nightly-2024-06-27-184535 $ ls manifests/0000_30_cluster-api_* | grep 'namespace\|credentials-request'
$ ls manifests/0000_30_cluster-api_* | grep 'namespace\|credentials-request' manifests/0000_30_cluster-api_00_credentials-request.yaml manifests/0000_30_cluster-api_00_namespace.yaml
$ ls manifests/0000_30_cluster-api_* | grep 'namespace\|credentials-request' manifests/0000_30_cluster-api_00_namespace.yaml manifests/0000_30_cluster-api_01_credentials-request.yaml
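A minimal sketch of the proposed rename in whichever repository ships these manifests (the path is an assumption for illustration):

git mv manifests/0000_30_cluster-api_00_credentials-request.yaml \
       manifests/0000_30_cluster-api_01_credentials-request.yaml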
This is a clone of issue OCPBUGS-34712. The following is the description of the original issue:
—
Description of problem:
in the doc installing_ibm_cloud_public/installing-ibm-cloud-customizations.html have not the tested instance type list
Version-Release number of selected component (if applicable):
4.15
How reproducible:
Always
Steps to Reproduce:
1.https://docs.openshift.com/container-platform/4.15/installing/installing_ibm_cloud_public/installing-ibm-cloud-customizations.html have not list the tested vm
Actual results:
have not list the tested type
Expected results:
list the tested instance type as https://docs.openshift.com/container-platform/4.15/installing/installing_azure/installing-azure-customizations.html#installation-azure-tested-machine-types_installing-azure-customizations
Additional info:
Seeing failures for SDN periodics running [sig-network][Feature:tuning] sysctl allowlist update should start a pod with custom sysctl only when the sysctl is added to whitelist [Suite:openshift/conformance/parallel] beginning with 4.16.0-0.nightly-2024-01-05-205447
Jan 5 23:14:22.066: INFO: At 2024-01-05 23:14:09 +0000 UTC - event for testpod: {kubelet ip-10-0-54-42.us-west-2.compute.internal} FailedCreatePodSandBox: Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_testpod_e2e-test-tuning-bzspr_2a9ce6e0-726d-47a6-ac64-71d430926574_0(968a55c5afd81e077b1d15a4129084d5f15002ac3ae6aa9fe32648e841940fe2): error adding pod e2e-test-tuning-bzspr_testpod to CNI network "multus-cni-network": plugin type="multus-shim" name="multus-cni-network" failed (add): CmdAdd (shim): timed out waiting for the condition
That payload contains OCPBUGS-26222 ("Adds a wait on unix socket readiness"); not sure that is the cause, but will investigate.
This is a clone of issue OCPBUGS-33645. The following is the description of the original issue:
—
Description of problem:
After enabling separate alertmanager instance for user-defined alert routing, the alertmanager-user-workload pods are initialized but the configmap alertmanager-trusted-ca-bundle is not injected in the pods. [-] https://docs.openshift.com/container-platform/4.15/observability/monitoring/enabling-alert-routing-for-user-defined-projects.html#enabling-a-separate-alertmanager-instance-for-user-defined-alert-routing_enabling-alert-routing-for-user-defined-projects
Version-Release number of selected component (if applicable):
RHOCP 4.13, 4.14 and 4.15
How reproducible:
100%
Steps to Reproduce:
1. Enable user-workload monitoring using[a] 2. Enable separate alertmanager instance for user-defined alert routing using [b] 3. Check if alertmanager-trusted-ca-bundle configmap is injected in alertmanager-user-workload pods which are running in openshift-user-workload-monitoring project. $ oc describe pod alertmanager-user-workload-0 -n openshift-user-workload-monitoring | grep alertmanager-trusted-ca-bundle [a] https://docs.openshift.com/container-platform/4.15/observability/monitoring/enabling-monitoring-for-user-defined-projects.html#enabling-monitoring-for-user-defined-projects_enabling-monitoring-for-user-defined-projects [b] https://docs.openshift.com/container-platform/4.15/observability/monitoring/enabling-alert-routing-for-user-defined-projects.html#enabling-a-separate-alertmanager-instance-for-user-defined-alert-routing_enabling-alert-routing-for-user-defined-projects
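For reference, step 2's configuration looks roughly like the following (field names taken from the linked documentation; treat this as a sketch, not the exact cluster state):

apiVersion: v1
kind: ConfigMap
metadata:
  name: user-workload-monitoring-config
  namespace: openshift-user-workload-monitoring
data:
  config.yaml: |
    alertmanager:
      enabled: true
      enableAlertmanagerConfig: true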
Actual results:
alertmanager-user-workload pods are NOT injected with alertmanager-trusted-ca-bundle configmap.
Expected results:
alertmanager-user-workload pods should be injected with alertmanager-trusted-ca-bundle configmap.
Additional info:
Similar configmap is injected fine in alertmanager-main pods which are running in openshift-monitoring project.
Description of problem:
It was noticed that the openshift-hyperkube RPM, which is primarily, perhaps exclusively, used to install the kubelet in RHCOS or other environments, includes the kube-apiserver, kube-controller-manager, and kube-scheduler binaries. Those binaries are all built and used via container images, which as far as I can tell don't make use of the RPM.
Version-Release number of selected component (if applicable):
4.12 - 4.16
How reproducible:
100%
Steps to Reproduce:
1. rpm -ql openshift-hyperkube on any node 2. 3.
Actual results:
# rpm -ql openshift-hyperkube /usr/bin/hyperkube /usr/bin/kube-apiserver /usr/bin/kube-controller-manager /usr/bin/kube-scheduler /usr/bin/kubelet /usr/bin/kubensenter # ls -lah /usr/bin/kube-apiserver /usr/bin/kube-controller-manager /usr/bin/kube-scheduler /usr/bin/hyperkube /usr/bin/kubensenter /usr/bin/kubelet -rwxr-xr-x. 2 root root 945 Jan 1 1970 /usr/bin/hyperkube -rwxr-xr-x. 2 root root 129M Jan 1 1970 /usr/bin/kube-apiserver -rwxr-xr-x. 2 root root 114M Jan 1 1970 /usr/bin/kube-controller-manager -rwxr-xr-x. 2 root root 54M Jan 1 1970 /usr/bin/kube-scheduler -rwxr-xr-x. 2 root root 105M Jan 1 1970 /usr/bin/kubelet -rwxr-xr-x. 2 root root 3.5K Jan 1 1970 /usr/bin/kubensenter
Expected results:
Just the kubelet and deps on the host OS, that's all that's necessary
Additional info:
My proposed change would be for people who care about keeping this slim to install `openshift-hyperkube-kubelet` instead. A rough sketch of what that split could look like follows.
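A rough sketch of such a split in the spec file; the subpackage name and file list are assumptions, not the actual packaging change:

%package kubelet
Summary: Kubelet and host-side helpers only, without the control-plane binaries

%description kubelet
Installs just the kubelet and kubensenter binaries for hosts that do not need
kube-apiserver, kube-controller-manager, or kube-scheduler.

%files kubelet
/usr/bin/kubelet
/usr/bin/kubensenter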
Not sure which component this bug should be associated with.
I am not even sure if importing respects ImageTagMirrorSet.
We could not figure it out in the Slack conversation.
https://redhat-internal.slack.com/archives/C013VBYBJQH/p1709583648013199
Description of problem:
The expected behaviour of ImageTagMirrorSet (redirecting pulls from the proxy to quay.io) did not work out.
Version-Release number of selected component (if applicable):
oc --context build02 get clusterversion version NAME VERSION AVAILABLE PROGRESSING SINCE STATUS version 4.16.0-ec.3 True False 7d4h
Steps to Reproduce:
oc --context build02 get ImageTagMirrorSet quay-proxy -o yaml apiVersion: config.openshift.io/v1 kind: ImageTagMirrorSet metadata: annotations: kubectl.kubernetes.io/last-applied-configuration: | {"apiVersion":"config.openshift.io/v1","kind":"ImageTagMirrorSet","metadata":{"annotations":{},"name":"quay-proxy"},"spec":{"imageTagMirrors":[{"mirrors":["quay.io/openshift/ci"],"source":"quay-proxy.ci.openshift.org/openshift/ci"}]}} creationTimestamp: "2024-03-05T03:49:59Z" generation: 1 name: quay-proxy resourceVersion: "4895378740" uid: 69fb479e-85bd-4a16-a38f-29b08f2636c3 spec: imageTagMirrors: - mirrors: - quay.io/openshift/ci source: quay-proxy.ci.openshift.org/openshift/ci oc --context build02 tag --source docker quay-proxy.ci.openshift.org/openshift/ci:ci_ci-operator_latest hongkliu-test/proxy-test-2:011 --as system:admin Tag proxy-test-2:011 set to quay-proxy.ci.openshift.org/openshift/ci:ci_ci-operator_latest. oc --context build02 get is proxy-test-2 -o yaml apiVersion: image.openshift.io/v1 kind: ImageStream metadata: annotations: openshift.io/image.dockerRepositoryCheck: "2024-03-05T20:03:02Z" creationTimestamp: "2024-03-05T20:03:02Z" generation: 2 name: proxy-test-2 namespace: hongkliu-test resourceVersion: "4898915153" uid: f60b3142-1f5f-42ae-a936-a9595e794c05 spec: lookupPolicy: local: false tags: - annotations: null from: kind: DockerImage name: quay-proxy.ci.openshift.org/openshift/ci:ci_ci-operator_latest generation: 2 importPolicy: importMode: Legacy name: "011" referencePolicy: type: Source status: dockerImageRepository: image-registry.openshift-image-registry.svc:5000/hongkliu-test/proxy-test-2 publicDockerImageRepository: registry.build02.ci.openshift.org/hongkliu-test/proxy-test-2 tags: - conditions: - generation: 2 lastTransitionTime: "2024-03-05T20:03:02Z" message: 'Internal error occurred: quay-proxy.ci.openshift.org/openshift/ci:ci_ci-operator_latest: Get "https://quay-proxy.ci.openshift.org/v2/": EOF' reason: InternalError status: "False" type: ImportSuccess items: null tag: "011"
Actual results:
The status of the stream shows that it still tries to connect to quay-proxy.
Expected results:
The request goes to quay.io directly.
Additional info:
The proxy has been shut down completely just to simplify the case. If it were on, the access logs would show the proxy receiving the requests for the image. oc scale deployment qci-appci -n ci --replicas 0 deployment.apps/qci-appci scaled I also checked the pull secret in the namespace and it has correct pull credentials for both the proxy and quay.io.
This is a clone of issue OCPBUGS-29510. The following is the description of the original issue:
—
Description of problem:
When a cluster is configured for direct OIDC configuration (authentication.config/cluster .spec.type=OIDC), console pods will be in crashloop until an OIDC client is configured for the console.
Version-Release number of selected component (if applicable):
4.15.0
How reproducible:
100% in Hypershift; 100% in TechPreviewNoUpgrade featureset on standalone OpenShift
Steps to Reproduce:
1. Update authentication.config/cluster so that Type=OIDC
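A hedged illustration of step 1 (a complete external OIDC configuration also needs spec.oidcProviders; this only switches the type):

oc patch authentication.config/cluster --type=merge -p '{"spec":{"type":"OIDC"}}'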
Actual results:
The console operator tries to create a new console rollout, but the pods crashloop. This is because the operator sets the console pods to "disabled". This would normally actually mean a privilege escalation; fortunately, the configuration prevents a successful deploy.
Expected results:
Console pods are healthy; they show a page which says that no authentication is currently configured.
Additional info:
Please review the following PR: https://github.com/openshift/openstack-cinder-csi-driver-operator/pull/160
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
In https://github.com/openshift/release/pull/47618 there are quite a few warnings from snyk in the presubmit rehearsal jobs that have not been reported in the bugs filed against storage. We need to go through each one and either fix it (in the case of legit bugs) or ignore it (false positives / test cases) to avoid having a presubmit job that always fails
Version-Release number of selected component (if applicable):
4.16
How reproducible:
always
Steps to Reproduce:
run 'security' presubmit rehearsal jobs in https://github.com/openshift/release/pull/47618
Actual results:
snyk issues reported
Expected results:
clean test runs
Additional info:
The goal is to collect metrics about the AdminNetworkPolicy and BaselineAdminNetworkPolicy CRDs because it is essential to understand how users are using this feature, and in fact whether they are using it at all. This is required for the 4.16 feature https://issues.redhat.com/browse/SDN-4157, and we are hoping to get approval and PRs merged before the 4.16 code freeze time frame (April 26th 2024)
admin_network_policy_total represents the total number of admin network policies in the cluster
Labels: None
See https://github.com/ovn-org/ovn-kubernetes/pull/4239 for more information
Cardinality of the metric is at most 1.
baseline_admin_network_policy_total represents the total number of baseline admin network policies in the cluster (0 or 1)
Labels: None
See https://github.com/ovn-org/ovn-kubernetes/pull/4239 for more information
Cardinality of the metric is at most 1.
We don't need the above two anymore because we have https://redhat-internal.slack.com/archives/C0VMT03S5/p1712567951869459?thread_ts=1712346681.157809&cid=C0VMT03S5
Instead of that we are adding two other metrics for rule count: (https://issues.redhat.com/browse/MON-3828 )
admin_network_policy_db_objects_total represents the total number of OVN NBDB objects (table_name) owned by AdminNetworkPolicy controller in the cluster
Labels: table_name
See https://github.com/ovn-org/ovn-kubernetes/pull/4254 for more information
Cardinality of the metric is at most 3.
baseline_admin_network_policy_db_objects_total represents the total number of OVN NBDB objects (table_name) owned by BaselineAdminNetworkPolicy controller in the cluster
Labels: table_name
See https://github.com/ovn-org/ovn-kubernetes/pull/4254 for more information
Cardinality of the metric is at most 3.
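Assuming the metric names above land as written and table_name is the only label, queries along these lines could be used to validate the new series:

sum by (table_name) (admin_network_policy_db_objects_total)
sum by (table_name) (baseline_admin_network_policy_db_objects_total)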
After performing an Agent-Based Installation on bare metal, the master node that was initially the rendezvous host does not join the cluster.
Checking the podman containers on this node, we see that the 'assisted-installer' container exited with code 143 after the second master was detected as ready:
2024-04-01T15:21:14.677437000Z time="2024-04-01T15:21:14Z" level=info msg="Found 1 ready master nodes" 2024-04-01T15:21:19.684831000Z time="2024-04-01T15:21:19Z" level=info msg="Found a new ready master node <second-master> with id <master-id>"
podman pods status:
$ podman ps -a CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES 20b338ab8906 localhost/podman-pause:4.4.1-1707368644 16 hours ago Up 16 hours d2b97e733b33-infra 0876c611f655 quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:27c5328e1d9a0d7db874c6e52efae631ab3c29a3d4da50c50b2e783dcb784128 /bin/bash start_d... 16 hours ago Up 16 hours assisted-db a9a116bed3a7 quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:27c5328e1d9a0d7db874c6e52efae631ab3c29a3d4da50c50b2e783dcb784128 /assisted-service 16 hours ago Up 16 hours service 0afbe44c2cf2 quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:27c5328e1d9a0d7db874c6e52efae631ab3c29a3d4da50c50b2e783dcb784128 /usr/local/bin/ag... 16 hours ago Exited (0) 16 hours ago apply-host-config 45da1bdf2440 quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:4b3daca74ad515845d5f8dcf384f0e51d58751a2785414edc3f20969a6fc0403 next_step_runner ... 16 hours ago Up 16 hours next-step-runner 8d1306b0ea3a quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:79e97d8cbd27e2c7402f7e016de97ca2b1f4be27bd52a981a27e7a2132be1ef4 --role bootstrap ... 16 hours ago Exited (143) 15 hours ago assisted-installer 8b0cc08890b4 quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:7f44844c4024dfa35688eac52e5e3d1540311771c4a24fef1ba4a6dccecc0e55 start --node-name... 16 hours ago Exited (0) 16 hours ago hungry_varahamihira 4916c14b9f7e registry.redhat.io/rhel9/support-tools:latest /usr/bin/bash 34 seconds ago Up 34 seconds toolbox-core
crio pods status:
CONTAINER IMAGE CREATED STATE NAME ATTEMPT POD ID POD 03b89032db0bc 98fc664e8c2aa859c10ec8ea740b083c7c85925d75506bcb85c6c9c640945c36 13 seconds ago Exited etcd 182 5d42cdad70890 etcd-bootstrap-member-<failed-master-name>.local 01008c6e32e5a quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:6b38d75b297fa52d1ba29af0715cec2430cd5fda1a608ed0841a09c55c292fb3 16 hours ago Running coredns 0 5f8736b856a0c coredns-<failed-master-name> 5e00e89ebef34 quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:2e119d0d9f8470dd634a62329d2670602c5f169d0d9bbe5ad25cee07e716c94b 16 hours ago Exited render-config 0 5f8736b856a0c coredns-<failed-master-name> f5098d5d27a39 quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:2e119d0d9f8470dd634a62329d2670602c5f169d0d9bbe5ad25cee07e716c94b 16 hours ago Running keepalived-monitor 0 4fb91cefa8a9e keepalived-<failed-master-name> a1e9d4c8cf477 quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:d24879d39e10fcf00a7c28ab23de1d6cf0c433a1234ff34880f12642b75d4512 16 hours ago Running keepalived 0 4fb91cefa8a9e keepalived-<failed-master-name> de21bc99f0d3f quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:8c74c57f91f0f7ed26bb62f58c7b84c55750e51947fd6cc5711fa18f30b9f68c 16 hours ago Running etcdctl 0 5d42cdad70890 etcd-bootstrap-member-<failed-master-name>
This is a clone of issue OCPBUGS-32950. The following is the description of the original issue:
—
Description of problem:
Affects only developers with a local build.
Version-Release number of selected component (if applicable):
4.15
How reproducible:
Always
Steps to Reproduce:
Build and run the console locally.
Actual results:
The user toggle menu isn't shown, so developers cannot access the user preferences, such as the language or theme.
Expected results:
The user toggle should be there.
Description of the problem:
Staging UI 2.31.1, BE 2.31.0 - clicking on "create new cluster" shows a spinner in the UI, but nothing loads and no error can be found.
Edit:
BE v2/openshift-versions response is empty
How reproducible:
100%
Steps to reproduce:
1. Click on create new cluster
2.
3.
Actual results:
Expected results:
Description of problem:
The customer requires multiple domain names in their AWS VPC's DHCP option set, which is needed to allow on-prem DNS (Infoblox) lookups to work. The problem is that the kubelet service is unable to parse the node name properly. ~~~ hyperkube[2562]: Error: failed to run Kubelet: failed to create kubelet: could not initialize volume plugins for KubeletVolumePluginMgr: parse "http://example.compute.internal example.com:9001": invalid character " " in host name ~~~ /etc/systemd/system/kubelet.service.d/20-aws-node-name.conf [Service] Environment="KUBELET_NODE_NAME=ip-x-x-x-x.example.example test.example" ^ space The customer is aware of this KCS article. If the customer follows what the KCS article says, it will break their DNS functionality. Kubelet fails to start on nodes during OCP 4.x IPI installation on AWS - Red Hat Customer Portal https://access.redhat.com/solutions/6978959
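A rough illustration (not the product fix, and the paths are generic) of how a single-valued node name could be derived even when the DHCP option set carries several space-separated domain names:

# Take only the first search domain from resolv.conf so the resulting
# node name contains no spaces.
domain=$(awk '/^search/ {print $2; exit}' /etc/resolv.conf)
echo "KUBELET_NODE_NAME=$(hostname -s).${domain}"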
Version-Release number of selected component (if applicable):
How reproducible:
always
Steps to Reproduce:
1. Create/adding a node with multiple domain names 2. Add base domain to the DHCP option in the VPC setting 3.
Actual results:
kubelet is failing to start
Expected results:
should be able to add a worker node that has multiple domain names
Additional info:
Description of problem:
Version-Release number of selected component (if applicable):
4.14
How reproducible:
1. oc patch svc <svc> --type merge --patch '{"spec":{"sessionAffinity":"ClientIP"}}'
2. curl <svc>:<port>
3. oc scale --replicas=3 deploy/<deploy>
4. oc scale --replicas=0 deploy/<deploy>
5. oc scale --replicas=3 deploy/<deploy>
Actual results:
Hostname: tcp-668655b888-dcwlr, SourceIP: 10.128.1.205, SourcePort: 54850
Hostname: tcp-668655b888-dcwlr, SourceIP: 10.128.1.205, SourcePort: 46668
Hostname: tcp-668655b888-xxcn2, SourceIP: 10.128.1.205, SourcePort: 46682
Hostname: tcp-668655b888-dcwlr, SourceIP: 10.128.1.205, SourcePort: 60144
Hostname: tcp-668655b888-dcwlr, SourceIP: 10.128.1.205, SourcePort: 60150
Hostname: tcp-668655b888-xxcn2, SourceIP: 10.128.1.205, SourcePort: 60160
Hostname: tcp-668655b888-xxcn2, SourceIP: 10.128.1.205, SourcePort: 51720
Expected results:
Hostname: tcp-668655b888-9mmfc, SourceIP: 10.128.1.205, SourcePort: 46914
Hostname: tcp-668655b888-9mmfc, SourceIP: 10.128.1.205, SourcePort: 46928
Hostname: tcp-668655b888-9mmfc, SourceIP: 10.128.1.205, SourcePort: 46944
Hostname: tcp-668655b888-9mmfc, SourceIP: 10.128.1.205, SourcePort: 40510
Hostname: tcp-668655b888-9mmfc, SourceIP: 10.128.1.205, SourcePort: 40520
Additional info:
See the hostname in the server log output for each command.
$ oc patch svc <svc> --type merge --patch '{"spec":{"sessionAffinity":"ClientIP"}}'
Hostname: tcp-668655b888-9mmfc, SourceIP: 10.128.1.205, SourcePort: 46914
Hostname: tcp-668655b888-9mmfc, SourceIP: 10.128.1.205, SourcePort: 46928
Hostname: tcp-668655b888-9mmfc, SourceIP: 10.128.1.205, SourcePort: 46944
Hostname: tcp-668655b888-9mmfc, SourceIP: 10.128.1.205, SourcePort: 40510
Hostname: tcp-668655b888-9mmfc, SourceIP: 10.128.1.205, SourcePort: 40520
$ oc scale --replicas=1 deploy/<deploy>
Hostname: tcp-668655b888-xxcn2, SourceIP: 10.128.1.205, SourcePort: 47082
Hostname: tcp-668655b888-xxcn2, SourceIP: 10.128.1.205, SourcePort: 47088
Hostname: tcp-668655b888-xxcn2, SourceIP: 10.128.1.205, SourcePort: 54832
Hostname: tcp-668655b888-xxcn2, SourceIP: 10.128.1.205, SourcePort: 54848
$ oc scale --replicas=3 deploy/<deploy>
Hostname: tcp-668655b888-dcwlr, SourceIP: 10.128.1.205, SourcePort: 54850
Hostname: tcp-668655b888-dcwlr, SourceIP: 10.128.1.205, SourcePort: 46668
Hostname: tcp-668655b888-xxcn2, SourceIP: 10.128.1.205, SourcePort: 46682
Hostname: tcp-668655b888-dcwlr, SourceIP: 10.128.1.205, SourcePort: 60144
Hostname: tcp-668655b888-dcwlr, SourceIP: 10.128.1.205, SourcePort: 60150
Hostname: tcp-668655b888-xxcn2, SourceIP: 10.128.1.205, SourcePort: 60160
Hostname: tcp-668655b888-xxcn2, SourceIP: 10.128.1.205, SourcePort: 51720
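For reference, the service under test looks roughly like this after step 1 (names and ports are illustrative; the timeout shown is the Kubernetes default):

apiVersion: v1
kind: Service
metadata:
  name: tcp
spec:
  selector:
    app: tcp
  ports:
  - port: 8080
    targetPort: 8080
  sessionAffinity: ClientIP
  sessionAffinityConfig:
    clientIP:
      timeoutSeconds: 10800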
Reported in https://issues.redhat.com/browse/MGMT-14527
resourcepool-path key needs to be added to config ini
Description of problem:
All e2e-ibmcloud-ovn testing is failing due to repeated events of liveness or readiness probes failing during MonitorTests.
Version-Release number of selected component (if applicable):
4.16.0-0.ci.test-2024-02-20-184205-ci-op-lghcpt9x-latest
How reproducible:
Appears to be 100%
Steps to Reproduce:
1. Setup IPI cluster on IBM Cloud 2. Run OCP Conformance w/ MonitorTests (CI does this on IBM Cloud related PR's)
Actual results:
Failed OCP Conformance tests, due to MonitorTests failure: : [sig-arch] events should not repeat pathologically for ns/openshift-cloud-controller-manager expand_less0s{ 2 events happened too frequently event happened 43 times, something is wrong: namespace/openshift-cloud-controller-manager node/ci-op-lghcpt9x-52953-tk4vl-master-2 pod/ibm-cloud-controller-manager-6c5f8594c5-bpnm8 hmsg/d91441a732 - reason/ProbeError Liveness probe error: Get "https://10.241.129.4:10258/healthz": dial tcp 10.241.129.4:10258: connect: connection refused result=reject body: From: 20:25:44Z To: 20:25:45Z event happened 43 times, something is wrong: namespace/openshift-cloud-controller-manager node/ci-op-lghcpt9x-52953-tk4vl-master-1 pod/ibm-cloud-controller-manager-6c5f8594c5-wn4fq hmsg/fda26f2bbf - reason/ProbeError Liveness probe error: Get "https://10.241.64.6:10258/healthz": dial tcp 10.241.64.6:10258: connect: connection refused result=reject body: From: 20:25:54Z To: 20:25:55Z} : [sig-arch] events should not repeat pathologically for ns/openshift-oauth-apiserver expand_less0s{ 1 events happened too frequently event happened 25 times, something is wrong: namespace/openshift-oauth-apiserver node/ci-op-lghcpt9x-52953-tk4vl-master-1 pod/apiserver-c5ff4776b-kqg7c hmsg/c9e932e38d - reason/ProbeError Readiness probe error: HTTP probe failed with statuscode: 500 result=reject body: [+]ping ok [+]log ok [+]etcd ok [-]etcd-readiness failed: reason withheld [+]informer-sync ok [+]poststarthook/generic-apiserver-start-informers ok [+]poststarthook/priority-and-fairness-config-consumer ok [+]poststarthook/priority-and-fairness-filter ok [+]poststarthook/storage-object-count-tracker-hook ok [+]poststarthook/openshift.io-StartUserInformer ok [+]poststarthook/openshift.io-StartOAuthInformer ok [+]poststarthook/openshift.io-StartTokenTimeoutUpdater ok [+]shutdown ok readyz check failed From: 20:25:04Z To: 20:25:05Z}
Expected results:
Passing OCP Conformance (w/ MonitorTests) test
Additional info:
The frequent (perhaps only) failures appear to occur via: [sig-arch] events should not repeat pathologically for ns/openshift-cloud-controller-manager [sig-arch] events should not repeat pathologically for ns/openshift-oauth-apiserver I am unsure of the cause of the liveness/readiness probe failures as yet, and unsure whether the underlying infrastructure is the cause (and if so, which resource).
Please review the following PR: https://github.com/openshift/ironic-image/pull/438
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
This is a clone of issue OCPBUGS-36711. The following is the description of the original issue:
—
Description of problem:
With the changes in https://github.com/openshift/machine-config-operator/pull/4425, RHEL worker nodes fail as follows: [root@ptalgulk-0807c-fq97t-w-a-l-rhel-1 cloud-user]# systemctl --failed UNIT LOAD ACTIVE SUB DESCRIPTION ● disable-mglru.service loaded failed failed Disables MGLRU on Openshfit LOAD = Reflects whether the unit definition was properly loaded. ACTIVE = The high-level unit activation state, i.e. generalization of SUB. SUB = The low-level unit activation state, values depend on unit type. 1 loaded units listed. Pass --all to see loaded but inactive units, too. To show all installed unit files use 'systemctl list-unit-files'. [root@ptalgulk-0807c-fq97t-w-a-l-rhel-1 cloud-user]# journalctl -u disable-mglru.service -- Logs begin at Mon 2024-07-08 06:23:03 UTC, end at Mon 2024-07-08 08:31:35 UTC. -- Jul 08 06:23:14 localhost.localdomain systemd[1]: Starting Disables MGLRU on Openshfit... Jul 08 06:23:14 localhost.localdomain bash[710]: /usr/bin/bash: /sys/kernel/mm/lru_gen/enabled: No such file or directory Jul 08 06:23:14 localhost.localdomain systemd[1]: disable-mglru.service: Main process exited, code=exited, status=1/FAILURE Jul 08 06:23:14 localhost.localdomain systemd[1]: disable-mglru.service: Failed with result 'exit-code'. Jul 08 06:23:14 localhost.localdomain systemd[1]: Failed to start Disables MGLRU on Openshfit. Jul 08 06:23:14 localhost.localdomain systemd[1]: disable-mglru.service: Consumed 4ms CPU time We should only disable mglru if it exists.
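A minimal sketch of guarding the unit so it only runs where the MGLRU sysfs knob exists; the unit body is illustrative, only the ConditionPathExists guard is the point:

[Unit]
Description=Disables MGLRU on OpenShift
# Skip this unit entirely on kernels that do not expose the MGLRU interface.
ConditionPathExists=/sys/kernel/mm/lru_gen/enabled

[Service]
Type=oneshot
# The value written here is illustrative; the existing unit's command can stay as-is.
ExecStart=/usr/bin/bash -c 'echo n > /sys/kernel/mm/lru_gen/enabled'

[Install]
WantedBy=multi-user.target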
Version-Release number of selected component (if applicable):
4.16, 4.17
How reproducible:
Attempt to bring up rhel worker node
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
Logs for PipelineRuns fetched from the Tekton Results API are not loading
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. Navigate to the Log tab of PipelineRun fetched from the Tekton Results 2. 3.
Actual results:
Logs window is empty with a loading indicator
Expected results:
Logs should be shown
Additional info:
Dockerfile.okd is behind compared to Dockerfile
Please review the following PR: https://github.com/openshift/cluster-kube-storage-version-migrator-operator/pull/101
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
ibm-vpc-block-csi-driver deployment is missing sidecar metrics and the kube-rbac-proxy sidecar
Version-Release number of selected component (if applicable):
4.10+
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
Due to HTTP/2 Connection Coalescing (https://daniel.haxx.se/blog/2016/08/18/http2-connection-coalescing/), routes which use the same certificate can present unexplained 503 errors when attempting to access an HTTP/2 enabled ingress controller.
It appears that HAProxy supports the ability to force HTTP 1.1 on a route-by-route basis, but our Ingress Controller does not expose that option.
This is especially problematic for component routes because generally speaking, customers use a wildcard or SAN to deploy custom component routes (console, OAuth, downloads), but with HTTP/2, this does not work properly.
To address this issue, we're proposing the creation of an annotation haproxy.router.openshift.io/http2-disable, which will allow the disabling of HTTP/2 on a route-by-route basis, or smarter logic built into our Ingress operator to handle this situation.
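If the proposed annotation were implemented, usage would look something like the following; note that haproxy.router.openshift.io/http2-disable does not exist today and is shown only to illustrate the proposal:

oc -n openshift-console annotate route console haproxy.router.openshift.io/http2-disable=true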
Version-Release number of selected component (if applicable):
OpenShift 4.14
How reproducible:
Serve routes to applications in OpenShift. Observe the routes through an HTTP/2-enabled client. Notice that HTTP/2 client connections are broken (a 503 is returned on the second connection when the same certificate is used across a mix of re-encrypt and passthrough routes).
Steps to Reproduce:
(see above notes)
Actual results:
503 error
Expected results:
no error
Additional info:
This is a clone of issue OCPBUGS-36768. The following is the description of the original issue:
—
Description of problem:
https://github.com/prometheus/prometheus/pull/14446 is a fix for https://github.com/prometheus/prometheus/issues/14087 (see there for details) This was introduced in Prom 2.51.0 https://github.com/openshift/cluster-monitoring-operator/blob/master/Documentation/deps-versions.md
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of the problem:
On startup, if release image syncing fails, the service fails instead of checking whether it can proceed with stale data.
How reproducible:
Always.
Steps to reproduce:
Actual results:
assisted-service will fail.
Expected results:
assisted-service will continue with stale data
Description of problem:
Must-gather link
long snippet from e2e log
external internet 09/01/23 07:26:09.624 Sep 1 07:26:09.624: INFO: Running 'oc --namespace=e2e-test-egress-firewall-e2e-2vvzx --kubeconfig=/tmp/configfile1049873803 exec egressfirewall -- curl -q -s -I -m1 http://www.google.com:80' STEP: creating an egressfirewall object 09/01/23 07:26:09.903 STEP: calling oc create -f /tmp/fixture-testdata-dir978363556/test/extended/testdata/egress-firewall/ovnk-egressfirewall-test.yaml 09/01/23 07:26:09.903 Sep 1 07:26:09.904: INFO: Running 'oc --namespace=e2e-test-egress-firewall-e2e-2vvzx --kubeconfig=/root/.kube/config create -f /tmp/fixture-testdata-dir978363556/test/extended/testdata/egress-firewall/ovnk-egressfirewall-test.yaml' egressfirewall.k8s.ovn.org/default createdSTEP: sending traffic to control plane nodes should work 09/01/23 07:26:22.122 Sep 1 07:26:22.130: INFO: Running 'oc --namespace=e2e-test-egress-firewall-e2e-2vvzx --kubeconfig=/tmp/configfile1049873803 exec egressfirewall -- curl -q -s -I -m1 -k https://193.168.200.248:6443' Sep 1 07:26:23.358: INFO: Error running /usr/local/bin/oc --namespace=e2e-test-egress-firewall-e2e-2vvzx --kubeconfig=/tmp/configfile1049873803 exec egressfirewall -- curl -q -s -I -m1 -k https://193.168.200.248:6443: StdOut> command terminated with exit code 28 StdErr> command terminated with exit code 28[AfterEach] [sig-network][Feature:EgressFirewall] github.com/openshift/origin/test/extended/util/client.go:180 STEP: Collecting events from namespace "e2e-test-egress-firewall-e2e-2vvzx". 09/01/23 07:26:23.358 STEP: Found 4 events. 09/01/23 07:26:23.361 Sep 1 07:26:23.361: INFO: At 2023-09-01 07:26:08 -0400 EDT - event for egressfirewall: {multus } AddedInterface: Add eth0 [10.131.0.89/23] from ovn-kubernetes Sep 1 07:26:23.361: INFO: At 2023-09-01 07:26:08 -0400 EDT - event for egressfirewall: {kubelet lon06-worker-0.rdr-qe-ocp-upi-7250.redhat.com} Pulled: Container image "quay.io/openshift/community-e2e-images:e2e-quay-io-redhat-developer-nfs-server-1-1-dlXGfzrk5aNo8EjC" already present on machine Sep 1 07:26:23.361: INFO: At 2023-09-01 07:26:08 -0400 EDT - event for egressfirewall: {kubelet lon06-worker-0.rdr-qe-ocp-upi-7250.redhat.com} Created: Created container egressfirewall-container Sep 1 07:26:23.361: INFO: At 2023-09-01 07:26:08 -0400 EDT - event for egressfirewall: {kubelet lon06-worker-0.rdr-qe-ocp-upi-7250.redhat.com} Started: Started container egressfirewall-container Sep 1 07:26:23.363: INFO: POD NODE PHASE GRACE CONDITIONS Sep 1 07:26:23.363: INFO: egressfirewall lon06-worker-0.rdr-qe-ocp-upi-7250.redhat.com Running [{Initialized True 0001-01-01 00:00:00 +0000 UTC 2023-09-01 07:26:07 -0400 EDT } {Ready True 0001-01-01 00:00:00 +0000 UTC 2023-09-01 07:26:09 -0400 EDT } {ContainersReady True 0001-01-01 00:00:00 +0000 UTC 2023-09-01 07:26:09 -0400 EDT } {PodScheduled True 0001-01-01 00:00:00 +0000 UTC 2023-09-01 07:26:07 -0400 EDT }] Sep 1 07:26:23.363: INFO: Sep 1 07:26:23.367: INFO: skipping dumping cluster info - cluster too large Sep 1 07:26:23.383: INFO: Deleted {user.openshift.io/v1, Resource=users e2e-test-egress-firewall-e2e-2vvzx-user}, err: <nil> Sep 1 07:26:23.398: INFO: Deleted {oauth.openshift.io/v1, Resource=oauthclients e2e-client-e2e-test-egress-firewall-e2e-2vvzx}, err: <nil> Sep 1 07:26:23.414: INFO: Deleted {oauth.openshift.io/v1, Resource=oauthaccesstokens sha256~X_2HPGEj3O9hpd-3XKTckrp9bO23s_7zlJ3Tkn7ncBE}, err: <nil> [AfterEach] [sig-network][Feature:EgressFirewall] github.com/openshift/origin/test/extended/util/client.go:180 STEP: Collecting events from namespace 
"e2e-test-no-egress-firewall-e2e-84f48". 09/01/23 07:26:23.414 STEP: Found 0 events. 09/01/23 07:26:23.416 Sep 1 07:26:23.417: INFO: POD NODE PHASE GRACE CONDITIONS Sep 1 07:26:23.417: INFO: Sep 1 07:26:23.421: INFO: skipping dumping cluster info - cluster too large Sep 1 07:26:23.446: INFO: Deleted {user.openshift.io/v1, Resource=users e2e-test-no-egress-firewall-e2e-84f48-user}, err: <nil> Sep 1 07:26:23.451: INFO: Deleted {oauth.openshift.io/v1, Resource=oauthclients e2e-client-e2e-test-no-egress-firewall-e2e-84f48}, err: <nil> Sep 1 07:26:23.457: INFO: Deleted {oauth.openshift.io/v1, Resource=oauthaccesstokens sha256~2Lk8-jWfwpdyo59E9YF7kQFKH2LBUSvnbJdKj7rOzn4}, err: <nil> [DeferCleanup (Each)] [sig-network][Feature:EgressFirewall] dump namespaces | framework.go:196 STEP: dump namespace information after failure 09/01/23 07:26:23.457 [DeferCleanup (Each)] [sig-network][Feature:EgressFirewall] tear down framework | framework.go:193 STEP: Destroying namespace "e2e-test-no-egress-firewall-e2e-84f48" for this suite. 09/01/23 07:26:23.457 [DeferCleanup (Each)] [sig-network][Feature:EgressFirewall] dump namespaces | framework.go:196 STEP: dump namespace information after failure 09/01/23 07:26:23.462 [DeferCleanup (Each)] [sig-network][Feature:EgressFirewall] tear down framework | framework.go:193 STEP: Destroying namespace "e2e-test-egress-firewall-e2e-2vvzx" for this suite. 09/01/23 07:26:23.463 fail [github.com/openshift/origin/test/extended/networking/egress_firewall.go:155]: Unexpected error: <*fmt.wrapError | 0xc001dd50a0>: { msg: "Error running /usr/local/bin/oc --namespace=e2e-test-egress-firewall-e2e-2vvzx --kubeconfig=/tmp/configfile1049873803 exec egressfirewall -- curl -q -s -I -m1 -k https://193.168.200.248:6443:\nStdOut>\ncommand terminated with exit code 28\nStdErr>\ncommand terminated with exit code 28\nexit status 28\n", err: <*exec.ExitError | 0xc001dd5080>{ ProcessState: { pid: 140483, status: 7168, rusage: { Utime: {Sec: 0, Usec: 149480}, Stime: {Sec: 0, Usec: 19930}, Maxrss: 222592, Ixrss: 0, Idrss: 0, Isrss: 0, Minflt: 1536, Majflt: 0, Nswap: 0, Inblock: 0, Oublock: 0, Msgsnd: 0, Msgrcv: 0, Nsignals: 0, Nvcsw: 596, Nivcsw: 173, }, }, Stderr: nil, }, } Error running /usr/local/bin/oc --namespace=e2e-test-egress-firewall-e2e-2vvzx --kubeconfig=/tmp/configfile1049873803 exec egressfirewall -- curl -q -s -I -m1 -k https://193.168.200.248:6443: StdOut> command terminated with exit code 28 StdErr> command terminated with exit code 28 exit status 28 occurred Ginkgo exit error 1: exit with code 1failed: (18.7s) 2023-09-01T11:26:23 "[sig-network][Feature:EgressFirewall] when using openshift ovn-kubernetes should ensure egressfirewall is created [Suite:openshift/conformance/parallel]"
Version-Release number of selected component (if applicable):
4.13.11
How reproducible:
This e2e failure is not consistently reproducible.
Steps to Reproduce:
1. Start a Z stream job via Jenkins 2. Monitor the e2e run
Actual results:
The e2e test fails.
Expected results:
The e2e test should pass.
Additional info:
This is a clone of issue OCPBUGS-33695. The following is the description of the original issue:
—
Description of problem:
Found in the QE CI case failure https://issues.redhat.com/browse/OCPQE-22045: the 4.16 HCP oauth-openshift server panics when curl'ed anonymously (this is not seen in standalone OCP 4.16 or in HCP 4.15).
Version-Release number of selected component (if applicable):
HCP 4.16 4.16.0-0.nightly-2024-05-14-165654
How reproducible:
Always
Steps to Reproduce:
1. $ export KUBECONFIG=HCP.kubeconfig $ oc get --raw=/.well-known/oauth-authorization-server | jq -r .issuer https://oauth-clusters-hypershift-ci-283235.apps.xxxx.com:443 2. Panics when anonymously curl'ed: $ curl -k "https://oauth-clusters-hypershift-ci-283235.apps.xxxx.com:443/oauth/authorize?response_type=token&client_id=openshift-challenging-client" This request caused apiserver to panic. Look in the logs for details. 3. Check logs. $ oc --kubeconfig=/home/xxia/my/env/hypershift-management/mjoseph-hyp-283235-416/kubeconfig -n clusters-hypershift-ci-283235 get pod | grep oauth-openshift oauth-openshift-55c6967667-9bxz9 2/2 Running 0 6h23m oauth-openshift-55c6967667-l55fh 2/2 Running 0 6h22m oauth-openshift-55c6967667-ntc6l 2/2 Running 0 6h23m $ for i in oauth-openshift-55c6967667-9bxz9 oauth-openshift-55c6967667-l55fh oauth-openshift-55c6967667-ntc6l; do oc logs --timestamps --kubeconfig=/home/xxia/my/env/hypershift-management/mjoseph-hyp-283235-416/kubeconfig -n clusters-hypershift-ci-283235 $i > logs/hypershift-management/mjoseph-hyp-283235-416/$i.log; done $ grep -il panic *.log oauth-openshift-55c6967667-ntc6l.log $ cat oauth-openshift-55c6967667-ntc6l.log 2024-05-15T03:43:59.769424528Z I0515 03:43:59.769303 1 secure_serving.go:57] Forcing use of http/1.1 only 2024-05-15T03:43:59.772754182Z I0515 03:43:59.772725 1 requestheader_controller.go:169] Starting RequestHeaderAuthRequestController 2024-05-15T03:43:59.772803132Z I0515 03:43:59.772782 1 configmap_cafile_content.go:202] "Starting controller" name="client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file" 2024-05-15T03:43:59.772841518Z I0515 03:43:59.772834 1 shared_informer.go:311] Waiting for caches to sync for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file 2024-05-15T03:43:59.772870498Z I0515 03:43:59.772787 1 shared_informer.go:311] Waiting for caches to sync for RequestHeaderAuthRequestController 2024-05-15T03:43:59.772982605Z I0515 03:43:59.772736 1 configmap_cafile_content.go:202] "Starting controller" name="client-ca::kube-system::extension-apiserver-authentication::client-ca-file" 2024-05-15T03:43:59.773009678Z I0515 03:43:59.773002 1 shared_informer.go:311] Waiting for caches to sync for client-ca::kube-system::extension-apiserver-authentication::client-ca-file 2024-05-15T03:43:59.773214896Z I0515 03:43:59.773194 1 dynamic_serving_content.go:132] "Starting controller" name="serving-cert::/etc/kubernetes/certs/serving-cert/tls.crt::/etc/kubernetes/certs/serving-cert/tls.key" 2024-05-15T03:43:59.773939655Z I0515 03:43:59.773923 1 secure_serving.go:213] Serving securely on [::]:6443 2024-05-15T03:43:59.773965659Z I0515 03:43:59.773952 1 tlsconfig.go:240] "Starting DynamicServingCertificateController" 2024-05-15T03:43:59.873008524Z I0515 03:43:59.872970 1 shared_informer.go:318] Caches are synced for RequestHeaderAuthRequestController 2024-05-15T03:43:59.873078108Z I0515 03:43:59.873021 1 shared_informer.go:318] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file 2024-05-15T03:43:59.873120163Z I0515 03:43:59.873032 1 shared_informer.go:318] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::client-ca-file 2024-05-15T09:25:25.782066400Z E0515 09:25:25.782026 1 runtime.go:77] Observed a panic: runtime error: invalid memory address or nil pointer dereference 2024-05-15T09:25:25.782066400Z goroutine 8662 [running]: 2024-05-15T09:25:25.782066400Z 
k8s.io/apiserver/pkg/server/filters.(*timeoutHandler).ServeHTTP.func1.1() 2024-05-15T09:25:25.782066400Z k8s.io/apiserver@v0.29.2/pkg/server/filters/timeout.go:110 +0x9c 2024-05-15T09:25:25.782066400Z panic({0x2115f60?, 0x3c45ec0?}) 2024-05-15T09:25:25.782066400Z runtime/panic.go:914 +0x21f 2024-05-15T09:25:25.782066400Z github.com/openshift/oauth-server/pkg/oauth/handlers.(*unionAuthenticationHandler).AuthenticationNeeded(0xc0008a90e0, {0x7f2a74268bd8?, 0xc000607760?}, {0x293c340?, 0xc0007d1ef0}, 0xc0007a3700) 2024-05-15T09:25:25.782066400Z github.com/openshift/oauth-server/pkg/oauth/handlers/default_auth_handler.go:122 +0xce1 2024-05-15T09:25:25.782066400Z github.com/openshift/oauth-server/pkg/oauth/handlers.(*authorizeAuthenticator).HandleAuthorize(0xc0008a9110, 0xc0007b06c0, 0x7?, {0x293c340, 0xc0007d1ef0}) 2024-05-15T09:25:25.782066400Z github.com/openshift/oauth-server/pkg/oauth/handlers/authenticator.go:54 +0x21d 2024-05-15T09:25:25.782066400Z github.com/openshift/oauth-server/pkg/osinserver.AuthorizeHandlers.HandleAuthorize({0xc0008a91a0?, 0x3, 0x772d66?}, 0x22ef8e0?, 0xc0007b2420?, {0x293c340, 0xc0007d1ef0}) 2024-05-15T09:25:25.782066400Z github.com/openshift/oauth-server/pkg/osinserver/interfaces.go:29 +0x95 2024-05-15T09:25:25.782066400Z github.com/openshift/oauth-server/pkg/osinserver.(*osinServer).handleAuthorize(0xc0004a54c0, {0x293c340, 0xc0007d1ef0}, 0xd?) 2024-05-15T09:25:25.782066400Z github.com/openshift/oauth-server/pkg/osinserver/osinserver.go:77 +0x25e 2024-05-15T09:25:25.782066400Z net/http.HandlerFunc.ServeHTTP(0x0?, {0x293c340?, 0xc0007d1ef0?}, 0x410acc?) 2024-05-15T09:25:25.782066400Z net/http/server.go:2136 +0x29 2024-05-15T09:25:25.782066400Z net/http.(*ServeMux).ServeHTTP(0x2390e60?, {0x293c340, 0xc0007d1ef0}, 0xc0007a3700) 2024-05-15T09:25:25.782066400Z net/http/server.go:2514 +0x142 2024-05-15T09:25:25.782066400Z github.com/openshift/oauth-server/pkg/oauthserver.(*OAuthServerConfig).buildHandlerChainForOAuth.WithRestoreOAuthHeaders.func1({0x293c340, 0xc0007d1ef0}, 0xc0007a3700) 2024-05-15T09:25:25.782066400Z github.com/openshift/oauth-server/pkg/server/headers/oauthbasic.go:57 +0x1ca 2024-05-15T09:25:25.782066400Z net/http.HandlerFunc.ServeHTTP(0x235fda0?, {0x293c340?, 0xc0007d1ef0?}, 0x291ef40?) 2024-05-15T09:25:25.782066400Z net/http/server.go:2136 +0x29 2024-05-15T09:25:25.782066400Z k8s.io/apiserver/pkg/server.DefaultBuildHandlerChain.TrackCompleted.trackCompleted.func21({0x293c340?, 0xc0007d1ef0}, 0xc0007a3700) 2024-05-15T09:25:25.782066400Z k8s.io/apiserver@v0.29.2/pkg/endpoints/filterlatency/filterlatency.go:110 +0x177 2024-05-15T09:25:25.782066400Z net/http.HandlerFunc.ServeHTTP(0x29490d0?, {0x293c340?, 0xc0007d1ef0?}, 0x4?) 2024-05-15T09:25:25.782066400Z net/http/server.go:2136 +0x29 2024-05-15T09:25:25.782066400Z k8s.io/apiserver/pkg/endpoints/filters.withAuthorization.func1({0x293c340, 0xc0007d1ef0}, 0xc0007a3700) 2024-05-15T09:25:25.782066400Z k8s.io/apiserver@v0.29.2/pkg/endpoints/filters/authorization.go:78 +0x639 2024-05-15T09:25:25.782066400Z net/http.HandlerFunc.ServeHTTP(0xc1893dc16e2d2585?, {0x293c340?, 0xc0007d1ef0?}, 0xc0007fabb8?) 
2024-05-15T09:25:25.782066400Z net/http/server.go:2136 +0x29 2024-05-15T09:25:25.782066400Z k8s.io/apiserver/pkg/endpoints/filterlatency.trackStarted.func1({0x293c340, 0xc0007d1ef0}, 0xc0007a3700) 2024-05-15T09:25:25.782066400Z k8s.io/apiserver@v0.29.2/pkg/endpoints/filterlatency/filterlatency.go:84 +0x192 2024-05-15T09:25:25.782066400Z net/http.HandlerFunc.ServeHTTP(0x3c5b920?, {0x293c340?, 0xc0007d1ef0?}, 0x3?) 2024-05-15T09:25:25.782066400Z net/http/server.go:2136 +0x29 2024-05-15T09:25:25.782066400Z k8s.io/apiserver/pkg/server/filters.WithMaxInFlightLimit.func1({0x293c340, 0xc0007d1ef0}, 0xc0007a3700) 2024-05-15T09:25:25.782066400Z k8s.io/apiserver@v0.29.2/pkg/server/filters/maxinflight.go:196 +0x262 2024-05-15T09:25:25.782066400Z net/http.HandlerFunc.ServeHTTP(0x235fda0?, {0x293c340?, 0xc0007d1ef0?}, 0x291ef40?) 2024-05-15T09:25:25.782066400Z net/http/server.go:2136 +0x29 2024-05-15T09:25:25.782066400Z k8s.io/apiserver/pkg/server.DefaultBuildHandlerChain.TrackCompleted.trackCompleted.func23({0x293c340?, 0xc0007d1ef0}, 0xc0007a3700) 2024-05-15T09:25:25.782066400Z k8s.io/apiserver@v0.29.2/pkg/endpoints/filterlatency/filterlatency.go:110 +0x177 2024-05-15T09:25:25.782066400Z net/http.HandlerFunc.ServeHTTP(0x7f2a74226390?, {0x293c340?, 0xc0007d1ef0?}, 0xc0007953c8?) 2024-05-15T09:25:25.782066400Z net/http/server.go:2136 +0x29 2024-05-15T09:25:25.782066400Z k8s.io/apiserver/pkg/server.DefaultBuildHandlerChain.WithImpersonation.func4({0x293c340, 0xc0007d1ef0}, 0xc0007a3700) 2024-05-15T09:25:25.782066400Z k8s.io/apiserver@v0.29.2/pkg/endpoints/filters/impersonation.go:50 +0x1c3 2024-05-15T09:25:25.782066400Z net/http.HandlerFunc.ServeHTTP(0xcd1160?, {0x293c340?, 0xc0007d1ef0?}, 0x0?) 2024-05-15T09:25:25.782066400Z net/http/server.go:2136 +0x29 2024-05-15T09:25:25.782066400Z k8s.io/apiserver/pkg/endpoints/filterlatency.trackStarted.func1({0x293c340, 0xc0007d1ef0}, 0xc0007a3700) 2024-05-15T09:25:25.782066400Z k8s.io/apiserver@v0.29.2/pkg/endpoints/filterlatency/filterlatency.go:84 +0x192 2024-05-15T09:25:25.782066400Z net/http.HandlerFunc.ServeHTTP(0x235fda0?, {0x293c340?, 0xc0007d1ef0?}, 0x291ef40?) 2024-05-15T09:25:25.782066400Z net/http/server.go:2136 +0x29 2024-05-15T09:25:25.782066400Z k8s.io/apiserver/pkg/server.DefaultBuildHandlerChain.TrackCompleted.trackCompleted.func24({0x293c340?, 0xc0007d1ef0}, 0xc0007a3700) 2024-05-15T09:25:25.782066400Z k8s.io/apiserver@v0.29.2/pkg/endpoints/filterlatency/filterlatency.go:110 +0x177 2024-05-15T09:25:25.782066400Z net/http.HandlerFunc.ServeHTTP(0xcd1160?, {0x293c340?, 0xc0007d1ef0?}, 0x0?) 2024-05-15T09:25:25.782066400Z net/http/server.go:2136 +0x29 2024-05-15T09:25:25.782066400Z k8s.io/apiserver/pkg/endpoints/filterlatency.trackStarted.func1({0x293c340, 0xc0007d1ef0}, 0xc0007a3700) 2024-05-15T09:25:25.782066400Z k8s.io/apiserver@v0.29.2/pkg/endpoints/filterlatency/filterlatency.go:84 +0x192 2024-05-15T09:25:25.782066400Z net/http.HandlerFunc.ServeHTTP(0x235fda0?, {0x293c340?, 0xc0007d1ef0?}, 0x291ef40?) 2024-05-15T09:25:25.782066400Z net/http/server.go:2136 +0x29 2024-05-15T09:25:25.782066400Z k8s.io/apiserver/pkg/server.DefaultBuildHandlerChain.TrackCompleted.trackCompleted.func26({0x293c340?, 0xc0007d1ef0}, 0xc0007a3700) 2024-05-15T09:25:25.782066400Z k8s.io/apiserver@v0.29.2/pkg/endpoints/filterlatency/filterlatency.go:110 +0x177 2024-05-15T09:25:25.782066400Z net/http.HandlerFunc.ServeHTTP(0x29490d0?, {0x293c340?, 0xc0007d1ef0?}, 0x291a100?) 
2024-05-15T09:25:25.782066400Z net/http/server.go:2136 +0x29 2024-05-15T09:25:25.782066400Z k8s.io/apiserver/pkg/endpoints/filters.withAuthentication.func1({0x293c340, 0xc0007d1ef0}, 0xc0007a3700) 2024-05-15T09:25:25.782066400Z k8s.io/apiserver@v0.29.2/pkg/endpoints/filters/authentication.go:120 +0x7e5 2024-05-15T09:25:25.782066400Z net/http.HandlerFunc.ServeHTTP(0x29490d0?, {0x293c340?, 0xc0007d1ef0?}, 0x291ef40?) 2024-05-15T09:25:25.782066400Z net/http/server.go:2136 +0x29 2024-05-15T09:25:25.782066400Z k8s.io/apiserver/pkg/endpoints/filterlatency.trackStarted.func1({0x293c340, 0xc0007d1ef0}, 0xc0007a3500) 2024-05-15T09:25:25.782066400Z k8s.io/apiserver@v0.29.2/pkg/endpoints/filterlatency/filterlatency.go:94 +0x37a 2024-05-15T09:25:25.782066400Z net/http.HandlerFunc.ServeHTTP(0xc0003e0900?, {0x293c340?, 0xc0007d1ef0?}, 0xc00061af20?) 2024-05-15T09:25:25.782066400Z net/http/server.go:2136 +0x29 2024-05-15T09:25:25.782066400Z k8s.io/apiserver/pkg/server/filters.(*timeoutHandler).ServeHTTP.func1() 2024-05-15T09:25:25.782066400Z k8s.io/apiserver@v0.29.2/pkg/server/filters/timeout.go:115 +0x62 2024-05-15T09:25:25.782066400Z created by k8s.io/apiserver/pkg/server/filters.(*timeoutHandler).ServeHTTP in goroutine 8660 2024-05-15T09:25:25.782066400Z k8s.io/apiserver@v0.29.2/pkg/server/filters/timeout.go:101 +0x1b2 2024-05-15T09:25:25.782066400Z 2024-05-15T09:25:25.782066400Z goroutine 8660 [running]: 2024-05-15T09:25:25.782066400Z k8s.io/apimachinery/pkg/util/runtime.logPanic({0x1fb1a00?, 0xc000810260}) 2024-05-15T09:25:25.782066400Z k8s.io/apimachinery@v0.29.2/pkg/util/runtime/runtime.go:75 +0x85 2024-05-15T09:25:25.782066400Z k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0xc0005aa840, 0x1, 0x1c08865?}) 2024-05-15T09:25:25.782066400Z k8s.io/apimachinery@v0.29.2/pkg/util/runtime/runtime.go:49 +0x6b 2024-05-15T09:25:25.782066400Z panic({0x1fb1a00?, 0xc000810260?}) 2024-05-15T09:25:25.782066400Z runtime/panic.go:914 +0x21f 2024-05-15T09:25:25.782066400Z k8s.io/apiserver/pkg/server/filters.(*timeoutHandler).ServeHTTP(0xc000528cc0, {0x2944dd0, 0xc000476460}, 0xdf8475800?) 2024-05-15T09:25:25.782066400Z k8s.io/apiserver@v0.29.2/pkg/server/filters/timeout.go:121 +0x35c 2024-05-15T09:25:25.782066400Z k8s.io/apiserver/pkg/server.DefaultBuildHandlerChain.WithRequestDeadline.withRequestDeadline.func27({0x2944dd0, 0xc000476460}, 0xc0007a3300) 2024-05-15T09:25:25.782066400Z k8s.io/apiserver@v0.29.2/pkg/endpoints/filters/request_deadline.go:100 +0x237 2024-05-15T09:25:25.782066400Z net/http.HandlerFunc.ServeHTTP(0x29490d0?, {0x2944dd0?, 0xc000476460?}, 0x2459ac0?) 2024-05-15T09:25:25.782066400Z net/http/server.go:2136 +0x29 2024-05-15T09:25:25.782066400Z k8s.io/apiserver/pkg/server.DefaultBuildHandlerChain.WithWaitGroup.withWaitGroup.func28({0x2944dd0, 0xc000476460}, 0xc0004764b0?) 2024-05-15T09:25:25.782066400Z k8s.io/apiserver@v0.29.2/pkg/server/filters/waitgroup.go:86 +0x18c 2024-05-15T09:25:25.782066400Z net/http.HandlerFunc.ServeHTTP(0xc0007a3200?, {0x2944dd0?, 0xc000476460?}, 0xc0004764b0?) 2024-05-15T09:25:25.782066400Z net/http/server.go:2136 +0x29 2024-05-15T09:25:25.782066400Z k8s.io/apiserver/pkg/server.DefaultBuildHandlerChain.WithWarningRecorder.func13({0x2944dd0?, 0xc000476460}, 0xc000476410?) 2024-05-15T09:25:25.782066400Z k8s.io/apiserver@v0.29.2/pkg/endpoints/filters/warning.go:35 +0xc6 2024-05-15T09:25:25.782066400Z net/http.HandlerFunc.ServeHTTP(0x2390e60?, {0x2944dd0?, 0xc000476460?}, 0xd?) 
2024-05-15T09:25:25.782066400Z net/http/server.go:2136 +0x29 2024-05-15T09:25:25.782066400Z k8s.io/apiserver/pkg/server.DefaultBuildHandlerChain.WithCacheControl.func14({0x2944dd0, 0xc000476460}, 0x0?) 2024-05-15T09:25:25.782066400Z k8s.io/apiserver@v0.29.2/pkg/endpoints/filters/cachecontrol.go:31 +0xa7 2024-05-15T09:25:25.782066400Z net/http.HandlerFunc.ServeHTTP(0xc0002a0fa0?, {0x2944dd0?, 0xc000476460?}, 0xc0005aad90?) 2024-05-15T09:25:25.782066400Z net/http/server.go:2136 +0x29 2024-05-15T09:25:25.782066400Z k8s.io/apiserver/pkg/server.DefaultBuildHandlerChain.WithHTTPLogging.WithLogging.withLogging.func34({0x2944dd0, 0xc000476460}, 0x1?) 2024-05-15T09:25:25.782066400Z k8s.io/apiserver@v0.29.2/pkg/server/httplog/httplog.go:111 +0x95 2024-05-15T09:25:25.782066400Z net/http.HandlerFunc.ServeHTTP(0xc0007b0360?, {0x2944dd0?, 0xc000476460?}, 0x0?) 2024-05-15T09:25:25.782066400Z net/http/server.go:2136 +0x29 2024-05-15T09:25:25.782066400Z k8s.io/apiserver/pkg/endpoints/filters.WithTracing.func1({0x2944dd0?, 0xc000476460?}, 0xc0007a3200?) 2024-05-15T09:25:25.782066400Z k8s.io/apiserver@v0.29.2/pkg/endpoints/filters/traces.go:42 +0x222 2024-05-15T09:25:25.782129547Z net/http.HandlerFunc.ServeHTTP(0x29490d0?, {0x2944dd0?, 0xc000476460?}, 0x291ef40?) 2024-05-15T09:25:25.782129547Z net/http/server.go:2136 +0x29 2024-05-15T09:25:25.782129547Z go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp.(*middleware).serveHTTP(0xc000289b80, {0x293c340?, 0xc0007d1bf0}, 0xc0007a3100, {0x2923a40, 0xc000528d68}) 2024-05-15T09:25:25.782129547Z go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp@v0.44.0/handler.go:217 +0x1202 2024-05-15T09:25:25.782129547Z go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp.NewMiddleware.func1.1({0x293c340?, 0xc0007d1bf0?}, 0xc0001fec40?) 2024-05-15T09:25:25.782129547Z go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp@v0.44.0/handler.go:81 +0x35 2024-05-15T09:25:25.782129547Z net/http.HandlerFunc.ServeHTTP(0x2948fb0?, {0x293c340?, 0xc0007d1bf0?}, 0x100?) 2024-05-15T09:25:25.782129547Z net/http/server.go:2136 +0x29 2024-05-15T09:25:25.782129547Z k8s.io/apiserver/pkg/server.DefaultBuildHandlerChain.WithLatencyTrackers.func16({0x29377e0?, 0xc0001fec40}, 0xc000289e40?) 2024-05-15T09:25:25.782129547Z k8s.io/apiserver@v0.29.2/pkg/endpoints/filters/webhook_duration.go:57 +0x14a 2024-05-15T09:25:25.782129547Z net/http.HandlerFunc.ServeHTTP(0xc0007a2f00?, {0x29377e0?, 0xc0001fec40?}, 0x7f2abb853108?) 2024-05-15T09:25:25.782129547Z net/http/server.go:2136 +0x29 2024-05-15T09:25:25.782129547Z k8s.io/apiserver/pkg/server.DefaultBuildHandlerChain.WithRequestInfo.func17({0x29377e0, 0xc0001fec40}, 0x3d02360?) 2024-05-15T09:25:25.782129547Z k8s.io/apiserver@v0.29.2/pkg/endpoints/filters/requestinfo.go:39 +0x118 2024-05-15T09:25:25.782129547Z net/http.HandlerFunc.ServeHTTP(0xc0007a2e00?, {0x29377e0?, 0xc0001fec40?}, 0x12a1dc02246f?) 2024-05-15T09:25:25.782129547Z net/http/server.go:2136 +0x29 2024-05-15T09:25:25.782129547Z k8s.io/apiserver/pkg/server.DefaultBuildHandlerChain.WithRequestReceivedTimestamp.withRequestReceivedTimestampWithClock.func31({0x29377e0, 0xc0001fec40}, 0xc000508b58?) 2024-05-15T09:25:25.782129547Z k8s.io/apiserver@v0.29.2/pkg/endpoints/filters/request_received_time.go:38 +0xaf 2024-05-15T09:25:25.782129547Z net/http.HandlerFunc.ServeHTTP(0x3?, {0x29377e0?, 0xc0001fec40?}, 0xc0005ab818?) 
2024-05-15T09:25:25.782129547Z net/http/server.go:2136 +0x29 2024-05-15T09:25:25.782129547Z k8s.io/apiserver/pkg/server.DefaultBuildHandlerChain.WithMuxAndDiscoveryComplete.func18({0x29377e0?, 0xc0001fec40?}, 0xc0007a2e00?) 2024-05-15T09:25:25.782129547Z k8s.io/apiserver@v0.29.2/pkg/endpoints/filters/mux_discovery_complete.go:52 +0xd5 2024-05-15T09:25:25.782129547Z net/http.HandlerFunc.ServeHTTP(0xc000081800?, {0x29377e0?, 0xc0001fec40?}, 0xc0005ab888?) 2024-05-15T09:25:25.782129547Z net/http/server.go:2136 +0x29 2024-05-15T09:25:25.782129547Z k8s.io/apiserver/pkg/server.DefaultBuildHandlerChain.WithPanicRecovery.withPanicRecovery.func32({0x29377e0?, 0xc0001fec40?}, 0xc0007d18f0?) 2024-05-15T09:25:25.782129547Z k8s.io/apiserver@v0.29.2/pkg/server/filters/wrap.go:74 +0xa6 2024-05-15T09:25:25.782129547Z net/http.HandlerFunc.ServeHTTP(0x29490d0?, {0x29377e0?, 0xc0001fec40?}, 0xc00065eea0?) 2024-05-15T09:25:25.782129547Z net/http/server.go:2136 +0x29 2024-05-15T09:25:25.782129547Z k8s.io/apiserver/pkg/server.DefaultBuildHandlerChain.WithAuditInit.withAuditInit.func33({0x29377e0, 0xc0001fec40}, 0xc00040c580?) 2024-05-15T09:25:25.782129547Z k8s.io/apiserver@v0.29.2/pkg/endpoints/filters/audit_init.go:63 +0x12c 2024-05-15T09:25:25.782129547Z net/http.HandlerFunc.ServeHTTP(0x2390e60?, {0x29377e0?, 0xc0001fec40?}, 0xd?) 2024-05-15T09:25:25.782129547Z net/http/server.go:2136 +0x29 2024-05-15T09:25:25.782129547Z github.com/openshift/oauth-server/pkg/oauthserver.(*OAuthServerConfig).buildHandlerChainForOAuth.WithPreserveOAuthHeaders.func2({0x29377e0, 0xc0001fec40}, 0xc0007a2d00) 2024-05-15T09:25:25.782129547Z github.com/openshift/oauth-server/pkg/server/headers/oauthbasic.go:42 +0x16e 2024-05-15T09:25:25.782129547Z net/http.HandlerFunc.ServeHTTP(0xc0005aba80?, {0x29377e0?, 0xc0001fec40?}, 0x24c95d5?) 2024-05-15T09:25:25.782129547Z net/http/server.go:2136 +0x29 2024-05-15T09:25:25.782129547Z github.com/openshift/oauth-server/pkg/oauthserver.(*OAuthServerConfig).buildHandlerChainForOAuth.WithStandardHeaders.func3({0x29377e0, 0xc0001fec40}, 0xc0005abb18?) 2024-05-15T09:25:25.782129547Z github.com/openshift/oauth-server/pkg/server/headers/headers.go:30 +0xde 2024-05-15T09:25:25.782129547Z net/http.HandlerFunc.ServeHTTP(0xc0005abb68?, {0x29377e0?, 0xc0001fec40?}, 0xc00040c580?) 2024-05-15T09:25:25.782129547Z net/http/server.go:2136 +0x29 2024-05-15T09:25:25.782129547Z k8s.io/apiserver/pkg/server.(*APIServerHandler).ServeHTTP(0x3d33480?, {0x29377e0?, 0xc0001fec40?}, 0xc0005abb50?) 2024-05-15T09:25:25.782129547Z k8s.io/apiserver@v0.29.2/pkg/server/handler.go:189 +0x25 2024-05-15T09:25:25.782129547Z net/http.serverHandler.ServeHTTP({0xc0007d1830?}, {0x29377e0?, 0xc0001fec40?}, 0x6?) 2024-05-15T09:25:25.782129547Z net/http/server.go:2938 +0x8e 2024-05-15T09:25:25.782129547Z net/http.(*conn).serve(0xc0007b02d0, {0x29490d0, 0xc000585e90}) 2024-05-15T09:25:25.782129547Z net/http/server.go:2009 +0x5f4 2024-05-15T09:25:25.782129547Z created by net/http.(*Server).Serve in goroutine 249 2024-05-15T09:25:25.782129547Z net/http/server.go:3086 +0x5cb 2024-05-15T09:25:25.782129547Z http: superfluous response.WriteHeader call from k8s.io/apiserver/pkg/server.DefaultBuildHandlerChain.WithPanicRecovery.func19 (wrap.go:57) 2024-05-15T09:25:25.782129547Z E0515 09:25:25.782066 1 wrap.go:58] "apiserver panic'd" method="GET" URI="/oauth/authorize?response_type=token&client_id=openshift-challenging-client" auditID="ac4795ff-5935-4ff5-bc9e-d84018f29469"
Actual results:
oauth-openshift panics when curl'ed anonymously.
Expected results:
No panic
This is a clone of issue OCPBUGS-35852. The following is the description of the original issue:
—
Description of problem:
When setting ENV OPENSHIFT_INSTALL_PRESERVE_BOOTSTRAP to keep bootstrap, and launch capi-based installation, installer exit with error when collecting applied cluster api manifests..., since local cluster api was already stopped. 06-20 15:26:51.216 level=debug msg=Machine jima417aws-gjrzd-bootstrap is ready. Phase: Provisioned 06-20 15:26:51.216 level=debug msg=Checking that machine jima417aws-gjrzd-master-0 has provisioned... 06-20 15:26:51.217 level=debug msg=Machine jima417aws-gjrzd-master-0 has status: Provisioned 06-20 15:26:51.217 level=debug msg=Checking that IP addresses are populated in the status of machine jima417aws-gjrzd-master-0... 06-20 15:26:51.217 level=debug msg=Checked IP InternalDNS: ip-10-0-50-47.us-east-2.compute.internal 06-20 15:26:51.217 level=debug msg=Found internal IP address: 10.0.50.47 06-20 15:26:51.217 level=debug msg=Machine jima417aws-gjrzd-master-0 is ready. Phase: Provisioned 06-20 15:26:51.217 level=debug msg=Checking that machine jima417aws-gjrzd-master-1 has provisioned... 06-20 15:26:51.217 level=debug msg=Machine jima417aws-gjrzd-master-1 has status: Provisioned 06-20 15:26:51.217 level=debug msg=Checking that IP addresses are populated in the status of machine jima417aws-gjrzd-master-1... 06-20 15:26:51.218 level=debug msg=Checked IP InternalDNS: ip-10-0-75-199.us-east-2.compute.internal 06-20 15:26:51.218 level=debug msg=Found internal IP address: 10.0.75.199 06-20 15:26:51.218 level=debug msg=Machine jima417aws-gjrzd-master-1 is ready. Phase: Provisioned 06-20 15:26:51.218 level=debug msg=Checking that machine jima417aws-gjrzd-master-2 has provisioned... 06-20 15:26:51.218 level=debug msg=Machine jima417aws-gjrzd-master-2 has status: Provisioned 06-20 15:26:51.218 level=debug msg=Checking that IP addresses are populated in the status of machine jima417aws-gjrzd-master-2... 06-20 15:26:51.218 level=debug msg=Checked IP InternalDNS: ip-10-0-60-118.us-east-2.compute.internal 06-20 15:26:51.218 level=debug msg=Found internal IP address: 10.0.60.118 06-20 15:26:51.218 level=debug msg=Machine jima417aws-gjrzd-master-2 is ready. Phase: Provisioned 06-20 15:26:51.218 level=info msg=Control-plane machines are ready 06-20 15:26:51.218 level=info msg=Cluster API resources have been created. Waiting for cluster to become ready... 06-20 15:26:51.219 level=warning msg=OPENSHIFT_INSTALL_PRESERVE_BOOTSTRAP is set, shutting down local control plane. 06-20 15:26:51.219 level=info msg=Shutting down local Cluster API control plane... 06-20 15:26:51.473 level=info msg=Stopped controller: Cluster API 06-20 15:26:51.473 level=info msg=Stopped controller: aws infrastructure provider 06-20 15:26:52.830 level=info msg=Local Cluster API system has completed operations 06-20 15:26:52.830 level=debug msg=Collecting applied cluster api manifests... 
06-20 15:26:52.831 level=error msg=failed to fetch Cluster: failed to generate asset "Cluster": failed to create cluster: [failed to get manifest openshift-cluster-api-guests: Get "https://127.0.0.1:46555/api/v1/namespaces/openshift-cluster-api-guests": dial tcp 127.0.0.1:46555: connect: connection refused, failed to get manifest default: Get "https://127.0.0.1:46555/apis/infrastructure.cluster.x-k8s.io/v1beta2/awsclustercontrolleridentities/default": dial tcp 127.0.0.1:46555: connect: connection refused, failed to get manifest jima417aws-gjrzd: Get "https://127.0.0.1:46555/apis/cluster.x-k8s.io/v1beta1/namespaces/openshift-cluster-api-guests/clusters/jima417aws-gjrzd": dial tcp 127.0.0.1:46555: connect: connection refused, failed to get manifest jima417aws-gjrzd: Get "https://127.0.0.1:46555/apis/infrastructure.cluster.x-k8s.io/v1beta2/namespaces/openshift-cluster-api-guests/awsclusters/jima417aws-gjrzd": dial tcp 127.0.0.1:46555: connect: connection refused, failed to get manifest jima417aws-gjrzd-bootstrap: Get "https://127.0.0.1:46555/apis/infrastructure.cluster.x-k8s.io/v1beta2/namespaces/openshift-cluster-api-guests/awsmachines/jima417aws-gjrzd-bootstrap": dial tcp 127.0.0.1:46555: connect: connection refused, failed to get manifest jima417aws-gjrzd-master-0: Get "https://127.0.0.1:46555/apis/infrastructure.cluster.x-k8s.io/v1beta2/namespaces/openshift-cluster-api-guests/awsmachines/jima417aws-gjrzd-master-0": dial tcp 127.0.0.1:46555: connect: connection refused, failed to get manifest jima417aws-gjrzd-master-1: Get "https://127.0.0.1:46555/apis/infrastructure.cluster.x-k8s.io/v1beta2/namespaces/openshift-cluster-api-guests/awsmachines/jima417aws-gjrzd-master-1": dial tcp 127.0.0.1:46555: connect: connection refused, failed to get manifest jima417aws-gjrzd-master-2: Get "https://127.0.0.1:46555/apis/infrastructure.cluster.x-k8s.io/v1beta2/namespaces/openshift-cluster-api-guests/awsmachines/jima417aws-gjrzd-master-2": dial tcp 127.0.0.1:46555: connect: connection refused, failed to get manifest jima417aws-gjrzd-bootstrap: Get "https://127.0.0.1:46555/apis/cluster.x-k8s.io/v1beta1/namespaces/openshift-cluster-api-guests/machines/jima417aws-gjrzd-bootstrap": dial tcp 127.0.0.1:46555: connect: connection refused, failed to get manifest jima417aws-gjrzd-master-0: Get "https://127.0.0.1:46555/apis/cluster.x-k8s.io/v1beta1/namespaces/openshift-cluster-api-guests/machines/jima417aws-gjrzd-master-0": dial tcp 127.0.0.1:46555: connect: connection refused, failed to get manifest jima417aws-gjrzd-master-1: Get "https://127.0.0.1:46555/apis/cluster.x-k8s.io/v1beta1/namespaces/openshift-cluster-api-guests/machines/jima417aws-gjrzd-master-1": dial tcp 127.0.0.1:46555: connect: connection refused, failed to get manifest jima417aws-gjrzd-master-2: Get "https://127.0.0.1:46555/apis/cluster.x-k8s.io/v1beta1/namespaces/openshift-cluster-api-guests/machines/jima417aws-gjrzd-master-2": dial tcp 127.0.0.1:46555: connect: connection refused, failed to get manifest jima417aws-gjrzd-bootstrap: Get "https://127.0.0.1:46555/api/v1/namespaces/openshift-cluster-api-guests/secrets/jima417aws-gjrzd-bootstrap": dial tcp 127.0.0.1:46555: connect: connection refused, failed to get manifest jima417aws-gjrzd-master: Get "https://127.0.0.1:46555/api/v1/namespaces/openshift-cluster-api-guests/secrets/jima417aws-gjrzd-master": dial tcp 127.0.0.1:46555: connect: connection refused]
Version-Release number of selected component (if applicable):
4.16/4.17 nightly build
How reproducible:
always
Steps to Reproduce:
1. Set ENV OPENSHIFT_INSTALL_PRESERVE_BOOTSTRAP 2. Trigger the capi-based installation 3.
Actual results:
Installer exited when collecting capi manifests.
Expected results:
Installation should be successful.
Additional info:
Description of problem:
In https://github.com/openshift/release/pull/47648, ecr-credentials-provider is built in CI and later included in RHCOS. In order to make it work on OKD, it needs to be included in the payload so that OKD machine-os can extract the RPM and install it on the host.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Ref: OCPBUGS-25662
Please review the following PR: https://github.com/openshift/ironic-static-ip-manager/pull/41
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
When there is only one server CSR pending approval, we still show two records (one for a client CSR requiring approval that is already several hours old, and the other for the server CSR requiring approval).
Version-Release number of selected component (if applicable):
pre-merge testing of https://github.com/openshift/console/pull/13493
How reproducible:
Always
Steps to Reproduce:
1. select one node which is joining the cluster, approve client CSR and do not approve server CSR, wait for some time => we can see only one node is pending on server CSR approval $ oc get csr | grep Pending | grep system:node csr-54sn4 142m kubernetes.io/kubelet-serving system:node:ip-10-0-49-55.us-east-2.compute.internal <none> Pending csr-7nhb9 65m kubernetes.io/kubelet-serving system:node:ip-10-0-49-55.us-east-2.compute.internal <none> Pending csr-9g22f 4m4s kubernetes.io/kubelet-serving system:node:ip-10-0-49-55.us-east-2.compute.internal <none> Pending csr-bgrdq 35m kubernetes.io/kubelet-serving system:node:ip-10-0-49-55.us-east-2.compute.internal <none> Pending csr-chqnf 50m kubernetes.io/kubelet-serving system:node:ip-10-0-49-55.us-east-2.compute.internal <none> Pending csr-f4sbl 127m kubernetes.io/kubelet-serving system:node:ip-10-0-49-55.us-east-2.compute.internal <none> Pending csr-msnml 157m kubernetes.io/kubelet-serving system:node:ip-10-0-49-55.us-east-2.compute.internal <none> Pending csr-p9qrp 19m kubernetes.io/kubelet-serving system:node:ip-10-0-49-55.us-east-2.compute.internal <none> Pending csr-qp2pw 112m kubernetes.io/kubelet-serving system:node:ip-10-0-49-55.us-east-2.compute.internal <none> Pending csr-qrlnv 96m kubernetes.io/kubelet-serving system:node:ip-10-0-49-55.us-east-2.compute.internal <none> Pending csr-tk7j4 81m kubernetes.io/kubelet-serving system:node:ip-10-0-49-55.us-east-2.compute.internal <none> Pending
Actual results:
1. On the Nodes list page, two rows are shown for node ip-10-0-49-55.us-east-2.compute.internal
Expected results:
Since the pending client CSR has been there for several hours and the node is now actually waiting for server CSR approval, we should only show one record/row to indicate to the user that server CSR approval is required. The pending client CSR associated with ip-10-0-49-55.us-east-2.compute.internal is already 3 hours old: $ oc get csr csr-4d628 NAME AGE SIGNERNAME REQUESTOR REQUESTEDDURATION CONDITION csr-4d628 3h kubernetes.io/kube-apiserver-client-kubelet system:serviceaccount:openshift-machine-config-operator:node-bootstrapper <none> Pending
Additional info:
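Purely as an illustration of the expected behavior (a hypothetical Go sketch, not the console's TypeScript implementation): pending kubelet-serving CSRs could be grouped by node, keeping only the newest one, so each joining node shows a single pending-approval row.

package csrview

import (
	"sort"
	"strings"

	certv1 "k8s.io/api/certificates/v1"
)

// newestPendingCSRPerNode is a hypothetical helper (not the console's actual
// implementation): it keeps only the most recently created pending
// kubelet-serving CSR for each node, so the UI can show a single row per node.
func newestPendingCSRPerNode(csrs []certv1.CertificateSigningRequest) map[string]certv1.CertificateSigningRequest {
	// Sort oldest first so newer CSRs overwrite older ones for the same node.
	sort.Slice(csrs, func(i, j int) bool {
		return csrs[i].CreationTimestamp.Before(&csrs[j].CreationTimestamp)
	})
	byNode := map[string]certv1.CertificateSigningRequest{}
	for _, csr := range csrs {
		if csr.Spec.SignerName != certv1.KubeletServingSignerName {
			continue // only serving CSRs carry the node name as the requestor
		}
		if len(csr.Status.Conditions) != 0 {
			continue // already approved, denied, or failed; not pending
		}
		node := strings.TrimPrefix(csr.Spec.Username, "system:node:")
		byNode[node] = csr
	}
	return byNode
}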
Description of problem:
In accounts with a large number of resources, the destroy code fails to list all resources. This has revealed some changes that need to be made to the destroy code to handle these situations.
Version-Release number of selected component (if applicable):
How reproducible:
Difficult - but we have an account where we can reproduce it consistently
Steps to Reproduce:
1. Try to destroy a cluster in an account with a large amount of resources. 2. Fail. 3.
Actual results:
Fail to destroy
Expected results:
Destroy succeeds
Additional info:
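The ticket does not state the root cause; purely as a hedged illustration, if the listing failure comes from unpaginated cloud API calls (an assumption, not confirmed here), the destroy code would need to follow pagination tokens until the provider reports no further pages. The lister interface below is invented for this sketch and is not the installer's API.

package destroy

// page is an illustrative stand-in for one page of a provider list response.
type page struct {
	IDs       []string
	NextToken string // empty when there are no more pages
}

// lister is a hypothetical provider client used only for this sketch.
type lister interface {
	List(token string) (page, error)
}

// listAllResourceIDs accumulates every page instead of relying on a single
// List call, which can silently truncate results in large accounts.
func listAllResourceIDs(l lister) ([]string, error) {
	var all []string
	token := ""
	for {
		p, err := l.List(token)
		if err != nil {
			return nil, err
		}
		all = append(all, p.IDs...)
		if p.NextToken == "" {
			return all, nil
		}
		token = p.NextToken
	}
}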
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Cluster operator status showing `Unavailable`:
ClusterServiceVersion openshift-operator-lifecycle-manager/packageserver observed in phase Failed with reason: APIServiceResourceIssue, message: found the CA cert is not active
The script below is used to check the validity of the certificates and recreate them:
# Check Cluster Existing Certificates :
echo -e "NAMESPACE\tNAME\tEXPIRY" && oc get secrets -A -o go-template='{{range .items}}{{if eq .type "kubernetes.io/tls"}}{{.metadata.namespace}}{{" "}}{{.metadata.name}}{{" "}}{{index .data "tls.crt"}}{{"\n"}}{{end}}{{end}}' | while read namespace name cert; do echo -en "$namespace\t$name\t"; echo $cert | base64 -d | openssl x509 -noout -enddate; done | column -t

# Manually Update Cluster Certificates :
az aro update -n xxxx -g xxxx --refresh-credentials --debug

# Check again Cluster Existing Certificates :
echo -e "NAMESPACE\tNAME\tEXPIRY" && oc get secrets -A -o go-template='{{range .items}}{{if eq .type "kubernetes.io/tls"}}{{.metadata.namespace}}{{" "}}{{.metadata.name}}{{" "}}{{index .data "tls.crt"}}{{"\n"}}{{end}}{{end}}' | while read namespace name cert; do echo -en "$namespace\t$name\t"; echo $cert | base64 -d | openssl x509 -noout -enddate; done | column -t

# Renew Secret/Certificate for OLM :
# Check Secret Expiration :
oc get secret packageserver-service-cert -o json -n openshift-operator-lifecycle-manager | jq -r '.data | .["tls.crt"]' | base64 -d | openssl x509 -noout -dates

# Backup the current secret :
oc get secret packageserver-service-cert -o json -n openshift-operator-lifecycle-manager > packageserver-service-cert.yaml

# Delete the Secret :
oc delete secret packageserver-service-cert -n openshift-operator-lifecycle-manager

# Check Secret Expiration again :
oc get secret packageserver-service-cert -o json -n openshift-operator-lifecycle-manager | jq -r '.data | .["tls.crt"]' | base64 -d | openssl x509 -noout -dates

# Get Cluster Operator :
oc get co
oc get co operator-lifecycle-manager
oc get co operator-lifecycle-manager-catalog
oc get co operator-lifecycle-manager-packageserver

# Go to the kube-system namespace and take the backup of extension-apiserver-authentication configmap:
oc project kube-system
oc get cm extension-apiserver-authentication -oyaml >> extcm_backup.yaml

# Delete the extension-apiserver-authentication configmap :
oc delete cm extension-apiserver-authentication -n kube-system
oc get cm -n kube-system | grep extension-apiserver-authentication
oc get apiservice v1.packages.operators.coreos.com -o jsonpath='{.spec.caBundle}' | base64 -d | openssl x509 -noout -text
We have checked the certificate details as below:
$ oc get apiservice v1.packages.operators.coreos.com -o jsonpath='{.spec.caBundle}' | base64 -d | openssl x509 -text E1213 10:24:41.606151 3802053 memcache.go:255] couldn't get resource list for packages.operators.coreos.com/v1: the server is currently unable to handle the request E1213 10:24:41.639144 3802053 memcache.go:106] couldn't get resource list for packages.operators.coreos.com/v1: the server is currently unable to handle the request E1213 10:24:41.651532 3802053 memcache.go:106] couldn't get resource list for packages.operators.coreos.com/v1: the server is currently unable to handle the request E1213 10:24:41.660851 3802053 memcache.go:106] couldn't get resource list for packages.operators.coreos.com/v1: the server is currently unable to handle the request Certificate: Data: Version: 3 (0x2) Serial Number: 5319897470906267024 (0x49d4129052ddf590) Signature Algorithm: ecdsa-with-SHA256 Issuer: O = "Red Hat, Inc." Validity Not Before: Nov 29 18:41:35 2021 GMT Not After : Nov 29 18:41:35 2023 GMT Subject: O = "Red Hat, Inc." Subject Public Key Info: Public Key Algorithm: id-ecPublicKey Public-Key: (256 bit) pub: 04:ea:c0:af:d3:af:e6:0e:61:82:c8:f4:fe:ec:22: 8d:c5:c1:08:6f:91:92:8b:09:05:e9:72:ca:d4:68: fb:aa:e1:ec:e2:e8:ca:32:4c:1f:e7:fc:3a:eb:61: 0b:df:9c:b4:13:62:f4:67:6c:d2:8f:97:a0:a8:a8: 69:08:22:4d:62 ASN1 OID: prime256v1 NIST CURVE: P-256 X509v3 extensions: X509v3 Key Usage: critical Digital Signature, Certificate Sign X509v3 Extended Key Usage: TLS Web Client Authentication, TLS Web Server Authentication X509v3 Basic Constraints: critical CA:TRUE X509v3 Subject Key Identifier: 53:A4:1D:22:F8:0F:8E:C5:74:8C:C6:F4:90:F0:2D:29:B0:65:89:19 Signature Algorithm: ecdsa-with-SHA256 30:45:02:21:00:f5:32:98:3d:34:b6:fd:65:47:3b:31:0d:88: fc:fe:35:cd:4f:51:75:a0:89:16:1a:9e:56:d5:f7:49:e6:3a: a3:02:20:43:fa:81:78:56:f4:1f:9b:3a:5b:7f:28:7e:a8:5b: b7:7a:3e:0a:99:67:88:0e:66:e4:c9:d5:9d:2f:79:80:3e ----BEGIN CERTIFICATE---- MIIBhzCCAS2gAwIBAgIISdQSkFLd9ZAwCgYIKoZIzj0EAwIwGDEWMBQGA1UEChMN UmVkIEhhdCwgSW5jLjAeFw0yMTExMjkxODQxMzVaFw0yMzExMjkxODQxMzVaMBgx FjAUBgNVBAoTDVJlZCBIYXQsIEluYy4wWTATBgcqhkjOPQIBBggqhkjOPQMBBwNC AATqwK/Tr+YOYYLI9P7sIo3FwQhvkZKLCQXpcsrUaPuq4ezi6MoyTB/n/DrrYQvf nLQTYvRnbNKPl6CoqGkIIk1io2EwXzAOBgNVHQ8BAf8EBAMCAoQwHQYDVR0lBBYw FAYIKwYBBQUHAwIGCCsGAQUFBwMBMA8GA1UdEwEB/wQFMAMBAf8wHQYDVR0OBBYE FFOkHSL4D47FdIzG9JDwLSmwZYkZMAoGCCqGSM49BAMCA0gAMEUCIQD1Mpg9NLb9 ZUc7MQ2I/P41zU9RdaCJFhqeVtX3SeY6owIgQ/qBeFb0H5s6W38ofqhbt3o+Cpln iA5m5MnVnS95gD4=
Description of problem:
When creating an application from a devfile via "Import from Git" in the Developer console using a GitLab repo, the following error blocks the creation. It only happens with GitLab, not GitHub, and the CLI operation based on "oc new-app" works fine; in other words, the issue only affects the Dev console. Could not fetch kubernetes resource "/deploy.yaml" for component "kubernetes-deploy" from Git repository https://{gitlaburl}.
Version-Release number of selected component (if applicable):
4.15.z
How reproducible:
Always
Steps to Reproduce:
You can always reproduce it with the following procedure. a. Switch to "Developer" mode in the web console. b. Go to "+Add", then click "Import from Git" in the "Git Repository" section of the page. c. Enter "https://<GITLAB HOSTNAME>/XXXX/devfile-sample-go-basic.git" in the "Git Repo URL" text box. d. Select "GitLab" in the "Git type" drop-down. e. The error message below appears.
Actual results:
The "/deploy.yaml" file path evaluated as invalid one with 400 response status during the process as follows. Look at the URL, "/%2Fdeploy.yaml" shows us leading slash was duplicated there. Request URL: https://<GITLAB HOSTNAME>/api/v4/projects/yyyy/repository/files/%2Fdeploy.yaml/raw?ref=main Response: {"error":"file_path should be a valid file path"}
Expected results:
The request URL for the "deploy.yaml" file should have the duplicated leading slash removed and use the correct file path. Request URL: https://<GITLAB HOSTNAME>/api/v4/projects/yyyy/repository/files/deploy.yaml/raw?ref=main Response: "deploy.yaml" contents.
Additional info:
I submitted a pull request to fix this here: https://github.com/openshift/console/pull/13812
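The linked PR contains the actual fix; as an illustration of the idea only (a Go sketch, not the console code, which is mostly TypeScript), building the GitLab raw-file URL requires stripping the leading slash before percent-encoding the path:

package main

import (
	"fmt"
	"net/url"
	"strings"
)

// gitlabRawFileURL builds the GitLab API v4 raw-file URL for a repository
// file. Stripping the leading "/" avoids the duplicated-slash form
// ("/%2Fdeploy.yaml") that GitLab rejects with "file_path should be a valid
// file path".
func gitlabRawFileURL(host, projectID, filePath, ref string) string {
	cleaned := strings.TrimPrefix(filePath, "/")
	return fmt.Sprintf("https://%s/api/v4/projects/%s/repository/files/%s/raw?ref=%s",
		host, projectID, url.PathEscape(cleaned), url.QueryEscape(ref))
}

func main() {
	// With "/deploy.yaml" as input, the encoded path is "deploy.yaml",
	// not "%2Fdeploy.yaml".
	fmt.Println(gitlabRawFileURL("gitlab.example.com", "yyyy", "/deploy.yaml", "main"))
}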
Please review the following PR: https://github.com/openshift/images/pull/156
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
While testing oc adm upgrade status against b02, I noticed some COs do not have any annotations, while I expected them to have the include/exclude.release.openshift.io/* ones (to recognize COs that come from the payload).
$ b02 get clusteroperator etcd -o jsonpath={.metadata.annotations} $ ota-stage get clusteroperator etcd -o jsonpath={.metadata.annotations} {"exclude.release.openshift.io/internal-openshift-hosted":"true","include.release.openshift.io/self-managed-high-availability":"true","include.release.openshift.io/single-node-developer":"true"}
CVO does not reconcile CO resources once they exist; it only precreates them and does not touch them afterwards. Build02 does not have COs with reconciled metadata because it was born as 4.2, which (AFAIK) is before OCP started to use the exclude/include annotations.
4.16 (development branch)
deterministic
1. delete an annotation on a ClusterOperator resource
The annotation won't be recreated.
The annotation should be recreated
This section includes Jira cards that are not linked to either an Epic or a Feature. These tickets were not completed when this image was assembled
Description of problem:
OCP cluster upgrade is stuck with image registry pod in degraded state. The image registry co shows the below error message. - lastTransitionTime: "2024-09-13T03:15:05Z" message: "Progressing: All registry resources are removed\nNodeCADaemonProgressing: The daemon set node-ca is deployed\nAzurePathFixProgressing: Migration failed: I0912 18:18:02.117077 1 main.go:233] Azure Stack Hub environment variables not present in current environment, skipping setup...\nAzurePathFixProgressing: panic: Get \"https://xxxxximageregistry.blob.core.windows.net/xxxxcontainer?comp=list&prefix=docker&restype=container\": dial tcp: lookup xxxximageregistry.blob.core.windows.net on 192.168.xx.xx. no such host\nAzurePathFixProgressing: \nAzurePathFixProgressing: goroutine 1 [running]:\nAzurePathFixProgressing: main.main()\nAzurePathFixProgressing: \t/go/src/github.com/openshift/cluster-image-registry-operator/cmd/move-blobs/main.go:53 +0x12a\nAzurePathFixProgressing: " reason: AzurePathFixFailed::Removed status: "False" type: Progressing
Version-Release number of selected component (if applicable):
4.14.33
How reproducible:
Steps to Reproduce:
1. configure azure storage in configs.imageregistry.operator.openshift.io/cluster 2. then mark the managementState as Removed 3. check the operator status
Actual results:
The image-registry CO remains in a degraded state.
Expected results:
Operator should not be in degraded state
Additional info:
Description of problem:
2024-05-07 17:21:59 level=debug msg=baremetal: getting master addresses 2024-05-07 17:21:59 level=warning msg=Failed to extract host addresses: open ocp/ostest/.masters.json: no such file or directory
Description of problem:
When trying to delete a machine whose instance is in the ERROR state, the machine is stuck deleting.
Version-Release number of selected component (if applicable):
OSP RHOS-17.1-RHEL-9-20240516.n.1 OCP 4.17.0-0.ci-2024-06-01-234742
How reproducible:
Steps to Reproduce:
1. openstack server stop <node instance> 2. openstack server set --state error <node instance> 3. oc delete <machine> 4. oc get machines -A 5. Verify the machine is stuck in deleting
Actual results:
Expected results:
Additional info:
When collecting onprem events, we want to be able to distinguish among the various onprem deployments:
We should also make sure to forward this info when collecting events.
We should also define a human-friendly version for each
Slack thread about the supported deployment types
https://redhat-internal.slack.com/archives/CUPJTHQ5P/p1706209886659329
Environment variables for setting the deployment type (one of: podman, operator, ACM, MCE, ABI) and for setting the release version (if applicable)
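A hedged sketch of what such environment variables might look like; the variable names (DEPLOYMENT_TYPE, RELEASE_VERSION) and the human-friendly strings are assumptions for illustration, not the agreed interface:

package events

import (
	"fmt"
	"os"
)

// Hypothetical variable names; the actual names are to be defined by this work.
const (
	deploymentTypeEnv = "DEPLOYMENT_TYPE" // one of: podman, operator, ACM, MCE, ABI
	releaseVersionEnv = "RELEASE_VERSION" // optional
)

// Example machine value -> human-friendly version mapping.
var validDeploymentTypes = map[string]string{
	"podman":   "Podman",
	"operator": "Operator",
	"ACM":      "Advanced Cluster Management",
	"MCE":      "MultiCluster Engine",
	"ABI":      "Agent-based installer",
}

// DeploymentInfo reads the deployment type and optional release version so
// they can be attached to collected events.
func DeploymentInfo() (deploymentType, humanFriendly, releaseVersion string, err error) {
	deploymentType = os.Getenv(deploymentTypeEnv)
	humanFriendly, ok := validDeploymentTypes[deploymentType]
	if !ok {
		return "", "", "", fmt.Errorf("unknown deployment type %q", deploymentType)
	}
	return deploymentType, humanFriendly, os.Getenv(releaseVersionEnv), nil
}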
As a HyperShift Engineer, I want to be able to:
so that I can achieve
As a HyperShift Engineer, I want to be able to:
so that I can achieve
As a HyperShift Engineer, I want to be able to:
so that I can achieve
Description of criteria:
This does not require a design proposal.
This does not require a feature gate.
Description of problem:
In OCP 4.17, kube-apiserver no longer gets a valid cloud config. Therefore the PersistentVolumeLabel admission plugin rejects in-tree GCE PD PVs that do not have the correct topology with `persistentvolumes \"gce-\" is forbidden: error querying GCE PD volume e2e-4d8656c6-d1d4-4245-9527-33e5ed18dd31: disk is not found`
In 4.16, kube-apiserver will not get a valid cloud config after it updates library-go with this PR.
How reproducible:
always
Steps to Reproduce:
1. Run e2e test "Multi-AZ Cluster Volumes should schedule pods in the same zones as statically provisioned PVs"
The `oc adm release` commands that use git currently do full clones of every repo in the releases specified. This causes the command to take a long time and use a lot of disk space (approximately 31GB to generate a 4.14.0->4.15.0 changelog). This can be optimized to significantly reduce the disk space and time required to run these commands.
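A hedged sketch of the kind of optimization that is possible (not necessarily the approach the command will take): blobless partial clones plus a single-branch, no-checkout fetch avoid downloading full history and every file blob up front, which is usually enough for commit-message based changelogs.

package main

import (
	"fmt"
	"os"
	"os/exec"
)

// partialClone clones a repository without blobs and with only the requested
// branch; git fetches individual blobs lazily if they are actually needed.
func partialClone(repoURL, branch, dir string) error {
	cmd := exec.Command("git", "clone",
		"--filter=blob:none", // partial clone: skip file contents
		"--single-branch", "--branch", branch,
		"--no-checkout", // changelog generation only needs commit metadata
		repoURL, dir)
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	return cmd.Run()
}

func main() {
	if err := partialClone("https://github.com/openshift/origin", "release-4.15", os.TempDir()+"/origin"); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}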
While debugging a problem, I noticed some containers lack FallbackToLogsOnError. This is important for debugging via the API. Found via https://github.com/openshift/origin/pull/28547
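For reference, a minimal sketch of what setting the policy looks like on a container spec (the container shown is illustrative, not one of the affected components):

package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

func main() {
	c := corev1.Container{
		Name:  "example",
		Image: "quay.io/example/example:latest",
		// With FallbackToLogsOnError, the last lines of the container log are
		// used as the termination message when the container exits with an
		// error and writes nothing to the termination message file, which
		// makes the failure visible via the API (pod status) without pulling logs.
		TerminationMessagePolicy: corev1.TerminationMessageFallbackToLogsOnError,
	}
	fmt.Println(c.TerminationMessagePolicy)
}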
Description of problem:
Looking at the telemetry data for Nutanix I noticed that the “host_type” for clusters installed with platform nutanix shows as “virt-unknown”. Do you know what needs to happen in the code to tell telemetry about host type being Nutanix? The problem is that we can’t track those installations with platform none, just IPI. Refer to the slack thread https://redhat-internal.slack.com/archives/C0211848DBN/p1687864857228739.
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
Create an OCP Nutanix cluster
Actual results:
The telemetry data for Nutanix shows the “host_type” for the nutanix cluster as “virt-unknown”.
Expected results:
The telemetry data for Nutanix shows the “host_type” for the nutanix cluster as "nutanix".
Additional info:
Add `madhu-pillai` to the `coreos-approvers` and `coreos-reviewers` lists.
Even if the --namespace arg is specified to hypershift install render, the openshift-config-managed-trusted-ca-bundle configmap's namespace is always set to "hypershift".
Description of problem: Not all net.* per-interface sysctls declared safe by TELCOSTRAT-10 / CNF-3642 are implemented in the default cni-sysctl-allowlist.
E.g.
net.ipv6.conf.IFNAME.disable_ipv6,
net.ipv6.conf.IFNAME.disable_policy,
net.ipv4.conf.IFNAME.rp_filter,
net.ipv4.conf.IFNAME.forwarding,
and possibly others.
Version-Release number of selected component (if applicable): 4.14
How reproducible: Always
Steps to Reproduce:
1. Compare the list of per-interface sysctls declared safe in TELCOSTRAT-10 / CNF-3642 / CNF-4093 Google Doc and Jira comments to the default cni-sysctl-allowlist in the code
Actual results: List of per-interface sysctls declared safe in the default cni-sysctl-allowlist in the code does not match the list in TELCOSTRAT-10 / CNF-3642 / CNF-4093
Expected results: List of per-interface sysctls declared safe in the default cni-sysctl-allowlist in the code should match the list in TELCOSTRAT-10 / CNF-3642 / CNF-4093
Additional info: None
David mention this issue here: https://redhat-internal.slack.com/archives/C01CQA76KMX/p1702312628947029
duplicated_event_patterns: I think it's creating a blackout range (events are OK during time X) and then checking the time range itself, but it doesn't appear to exclude the change in counts?
The count of the last event within the allowed range should be subtracted from the first event that is outside of the allowed time range for pathological event test calculation.
David has a demonstration of the count here: https://github.com/openshift/origin/pull/28456, but to fix it you have to invert testDuplicatedEvents to iterate through the event registry, not the events.
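A minimal sketch of the count adjustment being described (names and types are illustrative, not origin's actual implementation): the cumulative count observed just inside the allowed interval is subtracted from the first count observed outside it, so only the growth that happened outside the blackout range counts toward the pathological-event threshold.

package main

import (
	"fmt"
	"time"
)

// observation is an illustrative stand-in for an entry in the event registry:
// the cumulative count of a duplicated event at a point in time.
type observation struct {
	at    time.Time
	count int
}

// countOutsideAllowedRange subtracts the last cumulative count seen inside the
// allowed interval from the first cumulative count seen after it.
func countOutsideAllowedRange(obs []observation, allowedStart, allowedEnd time.Time) int {
	lastInside := 0
	for _, o := range obs {
		inAllowed := !o.at.Before(allowedStart) && o.at.Before(allowedEnd)
		if inAllowed {
			lastInside = o.count
			continue
		}
		if o.at.Before(allowedStart) {
			continue // before the allowed range; not relevant to this sketch
		}
		// First observation after the allowed range: only the growth since the
		// last in-range observation should count toward the limit.
		return o.count - lastInside
	}
	return 0
}

func main() {
	start := time.Now()
	obs := []observation{
		{start.Add(1 * time.Minute), 5},  // inside allowed range
		{start.Add(2 * time.Minute), 12}, // last count inside allowed range
		{start.Add(6 * time.Minute), 30}, // first count outside allowed range
	}
	// 30 total, but 12 happened while the events were allowed: only 18 count.
	fmt.Println(countOutsideAllowedRange(obs, start, start.Add(5*time.Minute)))
}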
Description of problem:
With the implementation of bug https://issues.redhat.com/browse/MGMT-14527, we see that the vSphere plugin is degraded and it shows a pop-up to fill in the details of the vCenter configuration, so that the configuration is then stored in cloud-provider-config. There are requirements on how those details should be entered in the pop-up, but there are no details about the format in which the customer should fill them in. Requirements from the bug: 1. The UI should display the format in which the data is to be entered. 2. A warning that if the configuration is saved, a new MachineConfig will be rolled out, which will lead to node reboots.
Version-Release number of selected component (if applicable):
4.13+
How reproducible:
Steps to reproduce unavailable
Steps to Reproduce:
1. 2. 3.
Actual results:
vSphere plugin is degraded
Expected results:
Plugin should not be degraded
Additional info:
This would typically be seen in situations where clusters are upgraded from a lower version to a higher version.
Component Readiness has found a potential regression in [sig-apps] Daemon set [Serial] should surge pods onto nodes when spec was updated and update strategy is RollingUpdate [Suite:openshift/conformance/serial] [Suite:k8s].
Probability of significant regression: 98.27%
Sample (being evaluated) Release: 4.16
Start Time: 2024-04-18T00:00:00Z
End Time: 2024-04-24T23:59:59Z
Success Rate: 90.00%
Successes: 27
Failures: 3
Flakes: 0
Base (historical) Release: 4.15
Start Time: 2024-02-01T00:00:00Z
End Time: 2024-02-28T23:59:59Z
Success Rate: 100.00%
Successes: 83
Failures: 0
Flakes: 0
Looking at this job as an example, test failed with this error message:
invariants were violated during daemonset update: An old pod with UID f44d840a-4430-4666-addd-cc3fae7a1e8a has been running alongside a newer version for longer than 1m0s { s: "invariants were violated during daemonset update:\nAn old pod with UID f44d840a-4430-4666-addd-cc3fae7a1e8a has been running alongside a newer version for longer than 1m0s", }
The last query from the log blob https://prow.ci.openshift.org/bbb02cd3-e004-48be-aef8-8ce6ef7acc47 shows the conflicting pods at 18:46:47.812:
Apr 23 18:46:47.812: INFO: Node Version Name UID Deleted Ready
Apr 23 18:46:47.812: INFO: ip-10-0-22-60.us-west-1.compute.internal 1 daemon-set-42bhb f44d840a-4430-4666-addd-cc3fae7a1e8a false true
Apr 23 18:46:47.812: INFO: ip-10-0-22-60.us-west-1.compute.internal 2 daemon-set-c5p92 02f9a3ba-cc0c-4954-bc29-65b4b954b93b false true
Apr 23 18:46:47.812: INFO: ip-10-0-28-200.us-west-1.compute.internal 1 daemon-set-7p7hs 4ba591c7-623a-4397-b99d-3b2616b5a787 false true
Apr 23 18:46:47.812: INFO: ip-10-0-95-178.us-west-1.compute.internal 2 daemon-set-chhhl 53d07493-c0f4-46f1-9365-f67be6ac993b false true
Yet if you look at journal from ip-10-0-22-60.us-west-1.compute.internal, it starts deleting pod daemon-set-42bhb at 18:46:48.140434
Apr 23 18:46:48.140434 ip-10-0-22-60 kubenswrapper[1432]: I0423 18:46:48.140408 1432 kubelet.go:2445] "SyncLoop DELETE" source="api" pods=["e2e-daemonsets-9316/daemon-set-42bhb"]
Should the test be waiting longer, or is there a legit problem with the delay?
https://issues.redhat.com/browse/RHEL-1671 introduces "dns-changed" event that resolv-prepender should act on. So now instead of a bunch of "-change" and "up" and "whatnot" events we have the one that clearly indicates that the DNS has been changed.
By embedding this into our logic, we will heavily reduce the number of times our scripts are called.
It is important to check when exactly this is going to be shipped so that we synchronize our change with upstream NM.
The following components do not preserve their container resource requests/limits on reconciliation when modified by an external source:
The original change to add this resource preservation support doesn't appear to have accomplished the desired behavior for these specific components.
Description of problem:
1. The vSphere connection configuration modal stays unresponsive for a long time after the user updates the 'Virtual Machine Folder' value. 2. The user is not able to update the configuration again while the changes are being applied.
Version-Release number of selected component (if applicable):
4.16.0-0.nightly-2024-02-22-021321
How reproducible:
Always
Steps to Reproduce:
1. setup cluster landed in vSphere 2. try to update 'Virtual Machine Folder' value in vSphere connection configuration modal 3. click 'Save'
Actual results:
3. The vSphere connection configuration modal stays in Saving status for a very long time; the user cannot Close nor Cancel the changes. When we check the backend, the changes are already in place in cm/cloud-provider-config. Also, the user is not able to update the configuration again while the changes are being applied.
Expected results:
3. The user should be able to continue updating the values or to close the modal. Since the modal is only exposed so the user can update the value, the user doesn't need to wait until everything is finished; the changes already take effect in the backend.
Additional info:
Remove the GitHub handle of Haoyu Sun from the OWNERS files of monitoring components.
This is a clone of issue OCPBUGS-38497. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-37736. The following is the description of the original issue:
—
Modify the import to strip or change the bootOptions.efiSecureBootEnabled
https://redhat-internal.slack.com/archives/CLKF3H5RS/p1722368792144319
archive := &importx.ArchiveFlag{Archive: &importx.TapeArchive{Path: cachedImage}}

ovfDescriptor, err := archive.ReadOvf("*.ovf")
if err != nil {
	// Open the corrupt OVA file
	f, ferr := os.Open(cachedImage)
	if ferr != nil {
		// (assumed handling for the elided branch)
		return ferr
	}
	defer f.Close()

	// Get a sha256 on the corrupt OVA file
	// and the size of the file
	h := sha256.New()
	written, cerr := io.Copy(h, f)
	if cerr != nil {
		// (assumed handling for the elided branch)
		return cerr
	}
	return fmt.Errorf("ova %s has a sha256 of %x and a size of %d bytes, failed to read the ovf descriptor %w", cachedImage, h.Sum(nil), written, err)
}

ovfEnvelope, err := archive.ReadEnvelope(ovfDescriptor)
if err != nil {
	// (assumed handling for the elided branch)
	return err
}
Description of problem:
Bootstrap process failed due to API_URL and API_INT_URL are not resolvable: Feb 06 06:41:49 yunjiang-dn16d-657jf-bootstrap systemd[1]: bootkube.service: Main process exited, code=exited, status=1/FAILURE Feb 06 06:41:49 yunjiang-dn16d-657jf-bootstrap systemd[1]: bootkube.service: Failed with result 'exit-code'. Feb 06 06:41:49 yunjiang-dn16d-657jf-bootstrap systemd[1]: bootkube.service: Consumed 1min 457ms CPU time. Feb 06 06:41:54 yunjiang-dn16d-657jf-bootstrap systemd[1]: bootkube.service: Scheduled restart job, restart counter is at 1. Feb 06 06:41:54 yunjiang-dn16d-657jf-bootstrap systemd[1]: Stopped Bootstrap a Kubernetes cluster. Feb 06 06:41:54 yunjiang-dn16d-657jf-bootstrap systemd[1]: bootkube.service: Consumed 1min 457ms CPU time. Feb 06 06:41:54 yunjiang-dn16d-657jf-bootstrap systemd[1]: Started Bootstrap a Kubernetes cluster. Feb 06 06:41:58 yunjiang-dn16d-657jf-bootstrap bootkube.sh[7781]: Check if API and API-Int URLs are resolvable during bootstrap Feb 06 06:41:58 yunjiang-dn16d-657jf-bootstrap bootkube.sh[7781]: Checking if api.yunjiang-dn16d.qe.gcp.devcluster.openshift.com of type API_URL is resolvable Feb 06 06:41:58 yunjiang-dn16d-657jf-bootstrap bootkube.sh[7781]: Starting stage resolve-api-url Feb 06 06:41:58 yunjiang-dn16d-657jf-bootstrap bootkube.sh[7781]: Unable to resolve API_URL api.yunjiang-dn16d.qe.gcp.devcluster.openshift.com Feb 06 06:41:58 yunjiang-dn16d-657jf-bootstrap bootkube.sh[7781]: Checking if api-int.yunjiang-dn16d.qe.gcp.devcluster.openshift.com of type API_INT_URL is resolvable Feb 06 06:41:58 yunjiang-dn16d-657jf-bootstrap bootkube.sh[7781]: Starting stage resolve-api-int-url Feb 06 06:41:58 yunjiang-dn16d-657jf-bootstrap bootkube.sh[7781]: Unable to resolve API_INT_URL api-int.yunjiang-dn16d.qe.gcp.devcluster.openshift.com Feb 06 06:41:58 yunjiang-dn16d-657jf-bootstrap bootkube.sh[8905]: https://localhost:2379 is healthy: successfully committed proposal: took = 7.880477ms Feb 06 06:41:58 yunjiang-dn16d-657jf-bootstrap bootkube.sh[7781]: Starting cluster-bootstrap... Feb 06 06:41:59 yunjiang-dn16d-657jf-bootstrap bootkube.sh[8989]: Starting temporary bootstrap control plane... Feb 06 06:41:59 yunjiang-dn16d-657jf-bootstrap bootkube.sh[8989]: Waiting up to 20m0s for the Kubernetes API Feb 06 06:42:00 yunjiang-dn16d-657jf-bootstrap bootkube.sh[8989]: API is up install logs: ... time="2024-02-06T06:54:28Z" level=debug msg="Unable to connect to the server: dial tcp: lookup api-int.yunjiang-dn16d.qe.gcp.devcluster.openshift.com on 169.254.169.254:53: no such host" time="2024-02-06T06:54:28Z" level=debug msg="Log bundle written to /var/home/core/log-bundle-20240206065419.tar.gz" time="2024-02-06T06:54:29Z" level=error msg="Bootstrap failed to complete: timed out waiting for the condition" time="2024-02-06T06:54:29Z" level=error msg="Failed to wait for bootstrapping to complete. This error usually happens when there is a problem with control plane hosts that prevents the control plane operators from creating the control plane." ...
Version-Release number of selected component (if applicable):
4.16.0-0.nightly-2024-02-05-184957,openshift/machine-config-operator#4165
How reproducible:
Always.
Steps to Reproduce:
1. Enable custom DNS on gcp: platform.gcp.userProvisionedDNS:Enabled and featureSet:TechPreviewNoUpgrade 2. Create cluster 3.
Actual results:
Failed to complete bootstrap process.
Expected results:
See description.
Additional info:
I believe 4.15 is affected as well once https://github.com/openshift/machine-config-operator/pull/4165 is backported to 4.15; currently, it fails at an earlier phase, see https://issues.redhat.com/browse/OCPBUGS-28969
primary_ipv4_address is deprecated in favor of primary_ip[*].address. Replace it with the new attribute.
Description of problem:
Hypershift management clusters use a network load balancer to route to their own openshift-ingress router pods for cluster ingress. These NLBs are provisioned by https://github.com/openshift/cloud-provider-aws. The cloud-provider-aws uses the cluster tag on the subnets to select the correct subnets for the NLB and the SecurityGroup adjustments. On management clusters *all* subnets are tagged with the MC's cluster-id. This can lead to the cloud-provider-aws possibly selecting the incorrect subnet, because conflicts between multiple subnets in an AZ are broken using lexicographical comparisons: https://github.com/openshift/cloud-provider-aws/blob/master/pkg/providers/v1/aws.go#L3626 This can lead to a situation where a SecurityGroup will only allow ingress from a subnet that is not actually part of the NLB - in this case the TargetGroup will not be able to correctly perform a HealthCheck in that AZ. In certain cases this can lead to all targets reporting unhealthy, as the nodes hosting the ingress pods have the incorrect SecurityGroup rules. In that case, routing to nodes that are part of the target group can select nodes that should not be chosen as they are not ready yet/anymore, leading to problems when attempting to access management cluster services (e.g. the console).
Version-Release number of selected component (if applicable):
4.14.z & 4.15.z
How reproducible:
Most MCs that are using NLBs will have some of the SecurityGroups misconfigured.
Steps to Reproduce:
1. Have the cloud-provider-aws update the NLB while there are subnets in an AZ with lexicographically smaller names than the MCs default subnets - this can lead to the other subnets being chosen instead. 2. Check the securitygroups to see if the source CIDRs are incorrect.
Actual results:
SecurityGroups can have incorrect source CIDRs used for the MCs own NLB.
Expected results:
The MC should only tag its own subnets with the cluster-id of the MC, so that subnet selection by the cloud-provider-aws is not affected by HCP subnets in the same availability zones.
Additional info:
Related OHSS ticket from SREP: https://issues.redhat.com/browse/OSD-20289
For ingress controllers that are exposed via LBs, there are considerations for external and internal publishing scope. Requesting support for providing the ability to specify the LB scope on the HostedCluster at initial create time.
Document HostedCluster and HostedControlPlane manifests used by IBM Cloud.