Note: this page shows the Feature-Based Change Log for a release
These features were completed when this image was assembled
Problem:
Certain Insights Advisor features differ between the RHEL and OCP advisor.
Goal:
Address top priority UI misalignments between RHEL and OCP advisor. Address UI features dropped from Insights Advisor for OCP GA.
Scope:
Specific tasks and priority of them tracked in https://issues.redhat.com/browse/CCXDEV-7432
This contains all the Insights Advisor widget deliverables for the OCP release 4.11.
Scope
It covers only minor bug fixes and improvements:
Show the error message (mocked in CCXDEV-5868) if the Prometheus metric `cluster_operator_conditions{name="insights"}` contains two true conditions at the same time: UploadDegraded and Degraded. This state occurs if there was an IO archive upload error, i.e. problems with the pipeline.
Expected for 4.11 OCP release.
Scenario: Check if the Insights Advisor widget in the OCP WebConsole UI shows the time of the last data analysis
Given: OCP WebConsole UI and the cluster dashboard is accessible
And: CCX external data pipeline is in a working state
And: administrator A1 has access to his cluster's dashboard
And: Insights Operator for this cluster is sending archives
When: administrator A1 clicks on the Insights Advisor widget
Then: the results of the last analysis are shown in the Insights Advisor widget
And: the time of the last analysis is shown in the Insights Advisor widget
Acceptance criteria:
max_over_time(timestamp(changes(insightsclient_request_send_total{status_code="202"}[1m]) > 0)[24h:1m])
Cloning the existing rule should end up with a new rule in the same namespace.
Modifications can now be done to the new rule.
(Optional) You can silence the existing rule.
Create a new PrometheusRule object inside the namespace that includes the metrics you need to form the alerting rule.
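For illustration, a minimal sketch of such a PrometheusRule object (all names and the expression are placeholders, not taken from any specific ticket):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: example-alerts          # illustrative name
  namespace: my-namespace       # the namespace that owns the alerting rule
spec:
  groups:
  - name: example.rules
    rules:
    - alert: ExampleHighErrorRate              # hypothetical alert
      expr: rate(http_requests_total{code=~"5.."}[5m]) > 0.1
      for: 10m
      labels:
        severity: warning
```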
CMO should reconcile the platform Prometheus configuration with the alert-relabel-config resources.
DoD
CMO should reconcile the platform Prometheus configuration with the AlertingRule resources.
DoD
Managing PVs at scale for a fleet creates difficulties where "one size does not fit all". The ability for SRE to deploy Prometheus with PVs and have retention based on a desired size would enable easier management of these volumes across the fleet.
The prometheus-operator exposes retentionSize.
Field | Description |
---|---|
retentionSize | Maximum amount of disk space used by blocks. Supported units: B, KB, MB, GB, TB, PB, EB. Ex: 512MB. |
This is a feature request to enable this configuration option via CMO cluster-monitoring-config ConfigMap.
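As a hedged sketch, the option could be surfaced in the cluster-monitoring-config ConfigMap roughly like this (the field name mirrors the prometheus-operator field; the exact placement and value are illustrative):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    prometheusK8s:
      retentionSize: 50GB   # maximum disk space used by blocks; units B, KB, MB, GB, TB, PB, EB
```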
Today, all configuration (for example, individual routing configuration) is done via a single configuration file that only admins have access to. If an environment uses multiple tenants and each tenant, for example, has different systems that they use to notify teams in case of an issue, then someone needs to file a request with an admin to add the required settings.
That can be bothersome for individual teams, since requests like that usually disappear in the backlog of an administrator. At the same time, administrators might get tons of requests that they have to look at and prioritize, which takes them away from more crucial work.
We would like to introduce a more self-service approach where individual teams can create their own configuration for their needs without the administrator's involvement.
Last but not least, since Monitoring is deployed as a core service of OpenShift, there are multiple restrictions that the SRE team has to apply to all OSD and ROSA clusters. One restriction concerns the ability for customers to use the central Alertmanager that is owned and managed by the SRE team: the SRE team can't give users access to the centrally managed secret (due to security concerns) so that users could add their own routing information.
Provide a new API (based on the Operator CRD approach) as part of the Prometheus Operator that allows creating a subset of the Alertmanager configuration without touching the central Alertmanager configuration file.
Please note that we do not plan to support additional individual webhooks with this work. Customers will need to deploy their own version of the third party webhooks.
Team A wants to send all their important notifications to a specific Slack channel.
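For example, a minimal sketch of what Team A's use case could look like with the proposed namespaced AlertmanagerConfig object (resource names, namespace, and channel are illustrative):

```yaml
apiVersion: monitoring.coreos.com/v1beta1
kind: AlertmanagerConfig
metadata:
  name: team-a-notifications
  namespace: team-a
spec:
  route:
    receiver: team-a-slack
  receivers:
  - name: team-a-slack
    slackConfigs:
    - channel: '#team-a-important'
      apiURL:                       # Secret in the same namespace holding the webhook URL
        name: team-a-slack-webhook
        key: url
```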
* CI - CI is running, tests are automated and merged.
* Release Enablement <link to Feature Enablement Presentation>
* DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
* DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
* DEV - Downstream build attached to advisory: <link to errata>
* QE - Test plans in Polarion: <link or reference to Polarion>
* QE - Automated tests merged: <link or reference to automated tests>
* DOC - Downstream documentation merged: <link to meaningful PR>
Now that upstream supports AlertmanagerConfig v1beta1 (see MON-2290 and https://github.com/prometheus-operator/prometheus-operator/pull/4709), it should be deployed by CMO.
DoD:
Copy/paste from https://github.com/openshift-cs/managed-openshift/issues/60
Which service is this feature request for?
OpenShift Dedicated and Red Hat OpenShift Service on AWS
What are you trying to do?
Allow ROSA/OSD to integrate with AWS Managed Prometheus.
Describe the solution you'd like
Remote-write of metrics is supported in OpenShift but it does not work with AWS Managed Prometheus since AWS Managed Prometheus requires AWS SigV4 auth.
Describe alternatives you've considered
There is the workaround to use the "AWS SigV4 Proxy" but I'd think this is not properly supported by RH.
https://mobb.ninja/docs/rosa/cluster-metrics-to-aws-prometheus/
Additional context
The customer wants to use an open and portable solution to centralize metrics storage and analysis. If they also deploy to other clouds, they don't want to have to re-configure. Since most clouds offer a Prometheus service (or it's easy to self-manage Prometheus), app migration should be simplified.
The cluster monitoring operator should allow OpenShift customers to configure remote write with all authentication methods supported by upstream Prometheus.
We will extend CMO's configuration API to support the following authentication methods with remote write:
Customers want to send metrics to AWS Managed Prometheus that require sigv4 authentication (see https://docs.aws.amazon.com/prometheus/latest/userguide/AMP-secure-metric-ingestion.html#AMP-secure-auth).
Prometheus and the Prometheus operator already support custom authorization for remote write. It should be possible to configure the same in the CMO configuration:
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    prometheusK8s:
      remoteWrite:
      - url: "https://remote-write.endpoint"
        authorization:
          type: Bearer
          credentials:
            name: credentials
            key: token
DoD:
Prometheus and the Prometheus operator already support sigv4 authentication for remote write. It should be possible to configure the same in the CMO configuration:
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    prometheusK8s:
      remoteWrite:
      - url: "https://remote-write.endpoint"
        sigv4:
          accessKey:
            name: aws-credentials
            key: access
          secretKey:
            name: aws-credentials
            key: secret
          profile: "SomeProfile"
          roleArn: "SomeRoleArn"
DoD:
As a WMCO user, I want to make sure containerd logging information has been updated in documentation and scripts.
Configure audit logging to capture login, logout and login failure details
TODO(PM): update this
A customer needs login, logout, and login-failure details inside OpenShift Container Platform.
I have checked for this on my test cluster, but the audit logs do not contain any user name specifying login or logout details. For successful logins or logouts, on the CLI as well as the OpenShift console, we only see 'Login successful' or 'Invalid credentials'.
Expected results: Login, logout and login failures should be captured in audit logging.
The apiserver pods today have `/var/log/<kube|oauth|openshift>-apiserver` mounted from the host and create audit files there using the upstream audit event format (JSON lines following https://github.com/kubernetes/apiserver/blob/92392ef22153d75b3645b0ae339f89c12767fb52/pkg/apis/audit/v1/types.go#L72). These events are apiserver specific, but as oauth authentication flow events are also requests, we can use the apiserver event format to log logins, login failures and logouts. Hence, we propose to make oauth-server create /var/log/oauth-server/audit.log files on the master nodes using that format.
When the login flow does not finish within a certain time (e.g. 10min), we can artificially create an event to show a login failure in the audit logs.
Right now there's no way to generate audit logs from this.
Let the Cluster Authentication Operator deliver the policy to OAuthServer.
In order to know if authn events should be logged, OAuthServer needs to be aware of it.
* Stanislav Láznička: Create an observer to deliver the audit policy to the oauth server
Make the authentication-operator react to the new audit field in the oauth.config/cluster object. Write an observer watching this field; it will translate the top-level configuration into oauth-server config and add it to the rest of the observed config.
Right now there's no way to generate audit logs from this.
Early customer feedback is that they see SNO as a great solution covering smaller-footprint deployments, but they are wondering what evolution story OpenShift will provide when more capacity or high availability is needed in the future.
While migration tooling (moving workload/config to a new cluster) could be a mid-term solution, customers do not want extra hardware to be involved in this process.
For Telecommunications Providers, at the Far Edge they intend to start small and then grow. Many of these operators will start with a SNO-based DU deployment as an initial investment, but as DUs evolve, different segments of the radio spectrum are added, various radio hardware is provisioned and features delivered to the Far Edge, the Telecommunication Providers desire the ability for their Far Edge deployments to scale up from 1 node to 2 nodes to n nodes. On the opposite side of the spectrum from SNO is MMIMO where there is a robust cluster and workloads use HPA.
Requirement | Notes | isMvp? |
---|---|---|
CI - MUST be running successfully with test automation | This is a requirement for ALL features. | YES |
Release Technical Enablement | Provide necessary release enablement details and documents. | YES |
This is a ticket meant to track all the OCP PRs that are involved in the implementation of the SNO + workers enhancement.
Rebase openshift/builder to k8s 1.24
4.11 MVP Requirements
Out of scope use cases (that are part of the Kubeframe/factory project):
Questions to be addressed:
As a deployer, I want to be able to:
so that I can achieve
Currently the Assisted Service generates the credentials by running the ignition generation step of the openshift-installer. This is why the credentials are only retrievable from the REST API towards the end of the installation.
In the BILLI usage, which takes down the assisted service before the installation is complete, there is no obvious point at which to alert the user that they should retrieve the credentials. This means that we either need to:
This requires/does not require a design proposal.
This requires/does not require a feature gate.
The AWS-specific code added in OCPPLAN-6006 needs to become GA and with this we want to introduce a couple of Day2 improvements.
Currently the AWS tags are defined and applied at installation time only and saved in the infrastructure CRD's status field for further operator use, which in turn just adds the tags during creation.
Saving in the status field means it's not included in Velero backups, which is a crucial feature for customers and Day2.
Thus the status.resourceTags field should be deprecated in favour of a newly created spec.resourceTags with the same content. The installer should only populate the spec, consumers of the infrastructure CRD must favour the spec over the status definition if both are supplied, otherwise the status should be honored and a warning shall be issued.
Being part of the spec, the behaviour should also tag existing resources that do not have the tags yet, and once the tags in the infrastructure CRD are changed, all the AWS resources should be updated accordingly.
On AWS this can be done without re-creating any resources (the behaviour is basically an upsert by tag key) and is possible without service interruption as it is a metadata operation.
Tag deletes continue to be out of scope, as the customer can still have custom tags applied to the resources that we do not want to delete.
Due to the ongoing intree/out of tree split on the cloud and CSI providers, this should not apply to clusters with intree providers (!= "external").
Once we are confident that all components are updated, we should introduce an end-to-end test that makes sure we never create resources that are untagged.
After that, we can remove the experimental flag and make this a GA feature.
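As a hedged sketch, the proposed spec field might look roughly like this (shown under platformSpec.aws by analogy with the existing status field; the exact schema would be defined during API review):

```yaml
apiVersion: config.openshift.io/v1
kind: Infrastructure
metadata:
  name: cluster
spec:
  platformSpec:
    type: AWS
    aws:
      resourceTags:            # proposed spec equivalent of status.resourceTags; updatable on day 2
      - key: cost-center
        value: "1234"
```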
Requirement | Notes | isMvp? |
---|---|---|
CI - MUST be running successfully with test automation | This is a requirement for ALL features. | YES |
Release Technical Enablement | Provide necessary release enablement details and documents. | YES |
List any affected packages or components.
RFE-1101 described user-defined tags for AWS resources provisioned by an OCP cluster. Currently users can define tags which are added to the resources during creation. These tags cannot be updated subsequently. The propagation of the tags is controlled using an experimental flag. Before this feature goes GA, we should define and implement a mechanism to exclude any experimental flags. Day2 operations and deletion of tags are not in the scope.
RFE-2012 aims to make the user-defined resource tags feature GA. This means that user defined tags should be updatable.
Currently the user-defined tags during install are passed directly as parameters of the Machine and Machineset resources for the master and worker. As a result these tags cannot be updated by consulting the Infrastructure resource of the cluster where the user defined tags are written.
The MCO should be changed such that during provisioning the MCO looks up the values of the tags in the Infrastructure resource and adds the tags during creation of the EC2 resources. The MCO should also watch the infrastructure resource for changes and when the resource tags are updated it should update the tags on the EC2 instances without restarts.
Acceptance Criteria:
Much like core OpenShift operators, a standardized flow exists for OLM-managed operators to interact with the cluster in a specific way to leverage AWS STS authorization when using AWS APIs, as opposed to insecure static, long-lived credentials. OLM-managed operators can implement integration with the CloudCredentialOperator in a well-defined way to support this flow.
Enable customers to easily leverage OpenShift's capabilities around AWS STS with layered products, for an increased security posture. Enable OLM-managed operators to implement support for this in a well-defined pattern.
See Operators & STS slide deck.
The CloudCredentialOperator already provides a powerful API for OpenShift's cluster core operators to request credentials and acquire them via short-lived tokens. This capability should be expanded to OLM-managed operators, specifically to Red Hat layered products that interact with AWS APIs. The process today ranges from cumbersome to non-existent depending on the operator in question, and is seen as an adoption blocker of OpenShift on AWS.
This is particularly important for ROSA customers. Customers are expected to be asked to pre-create the required IAM roles outside of OpenShift, which is deemed acceptable.
As an engineer I want the capability to implement CI test cases that run at different intervals, be it daily or weekly, to ensure that downstream operators depending on certain capabilities are not negatively impacted if the systems CCO interacts with change behavior.
Acceptance Criteria:
Create a stubbed-out e2e test path in CCO and matching e2e calling code in release such that there exists a path to tests that verify a working AWS STS workflow.
Customers are asking for improvements to the upgrade experience (both over-the-air and disconnected). This feature tracks the epics required to get that work done.
Goal
Add the ability to choose between a full cluster upgrade (which exists today) or control plane upgrade (which will pause all worker pools) in the console.
Background
Currently in the console, users only have the ability to complete a full cluster upgrade. For many customers, upgrades take longer than what their maintenance window allows. Users need the ability to upgrade the control plane independently of the other worker nodes.
Ex. Upgrades of huge clusters may take too long so admins may do the control plane this weekend, worker-pool-A next weekend, worker-pool-B the weekend after, etc. It is all at a pool level, they will not be able to choose specific hosts.
Requirements
Design deliverables:
Goal
Improve the UX on the machine config pool page to reflect the new enhancements on the cluster settings that allows users to select the ability to update the control plane only.
Background
Currently in the console, users only have the ability to complete a full cluster upgrade. For many customers, upgrades take longer than what their maintenance window allows. Users need the ability to upgrade the control plane independently of the other worker nodes.
Ex. Upgrades of huge clusters may take too long so admins may do the control plane this weekend, worker-pool-A next weekend, worker-pool-B the weekend after, etc. It is all at a pool level, they will not be able to choose specific hosts.
Requirements
Design deliverables:
Running the OPCT with the latest version (v0.1.0) on OCP 4.11.0, openshift-tests is reporting an incorrect counter for the "total" field.
In the example below, after the 1127th test, the total follows the same counter as executed. I would also assume that the total is incorrect before that point, since continued execution increases both counters.
openshift-tests output format: [failed/executed/total]
started: (0/1126/1127) "[sig-storage] PersistentVolumes-expansion loopback local block volume should support online expansion on node [Suite:openshift/conformance/parallel] [Suite:k8s]"
passed: (38s) 2022-08-09T17:12:21 "[sig-storage] In-tree Volumes [Driver: nfs] [Testpattern: Dynamic PV (default fs)] provisioning should provision storage with mount options [Suite:openshift/conformance/parallel] [Suite:k8s]"
started: (0/1127/1127) "[sig-storage] In-tree Volumes [Driver: local][LocalVolumeType: tmpfs] [Testpattern: Generic Ephemeral-volume (block volmode) (late-binding)] ephemeral should support two pods which have the same volume definition [Suite:openshift/conformance/parallel] [Suite:k8s]"
passed: (6.6s) 2022-08-09T17:12:21 "[sig-storage] Downward API volume should provide container's memory request [NodeConformance] [Conformance] [Suite:openshift/conformance/parallel/minimal] [Suite:k8s]"
started: (0/1128/1128) "[sig-storage] In-tree Volumes [Driver: cinder] [Testpattern: Dynamic PV (immediate binding)] topology should fail to schedule a pod which has topologies that conflict with AllowedTopologies [Suite:openshift/conformance/parallel] [Suite:k8s]"
skip [k8s.io/kubernetes@v1.24.0/test/e2e/storage/framework/testsuite.go:116]: Driver local doesn't support GenericEphemeralVolume -- skipping
Ginkgo exit error 3: exit with code 3
skipped: (400ms) 2022-08-09T17:12:21 "[sig-storage] In-tree Volumes [Driver: local][LocalVolumeType: tmpfs] [Testpattern: Generic Ephemeral-volume (block volmode) (late-binding)] ephemeral should support two pods which have the same volume definition [Suite:openshift/conformance/parallel] [Suite:k8s]"
started: (0/1129/1129) "[sig-storage] In-tree Volumes [Driver: emptydir] [Testpattern: Dynamic PV (default fs)] capacity provides storage capacity information [Suite:openshift/conformance/parallel] [Suite:k8s]"
OPCT output format [executed/total (failed failures)]
Tue, 09 Aug 2022 14:12:13 -03> Global Status: running
JOB_NAME | STATUS | RESULTS | PROGRESS | MESSAGE
openshift-conformance-validated | running | | 1112/1127 (0 failures) | status=running
openshift-kube-conformance | complete | | 352/352 (0 failures) | waiting for post-processor...
Tue, 09 Aug 2022 14:12:23 -03> Global Status: running
JOB_NAME | STATUS | RESULTS | PROGRESS | MESSAGE
openshift-conformance-validated | running | | 1120/1127 (0 failures) | status=running
openshift-kube-conformance | complete | | 352/352 (0 failures) | waiting for post-processor...
Tue, 09 Aug 2022 14:12:33 -03> Global Status: running
JOB_NAME | STATUS | RESULTS | PROGRESS | MESSAGE
openshift-conformance-validated | running | | 1139/1139 (0 failures) | status=running
openshift-kube-conformance | complete | | 352/352 (0 failures) | waiting for post-processor...
Tue, 09 Aug 2022 14:12:43 -03> Global Status: running
JOB_NAME | STATUS | RESULTS | PROGRESS | MESSAGE
openshift-conformance-validated | running | | 1185/1185 (0 failures) | status=running
openshift-kube-conformance | complete | | 352/352 (0 failures) | waiting for post-processor...
Tue, 09 Aug 2022 14:12:53 -03> Global Status: running
JOB_NAME | STATUS | RESULTS | PROGRESS | MESSAGE
openshift-conformance-validated | running | | 1188/1188 (0 failures) | status=running
openshift-kube-conformance | complete | | 352/352 (0 failures) | waiting for post-processor...
When this image was assembled, these features were not yet completed. Therefore, only the Jira Cards included here are part of this release.
We drive OpenShift cross-market customer success and new customer adoption with constant improvements and feature additions to the existing capabilities of our OpenShift Core Networking (SDN and Network Edge). This feature captures that natural progression of the product.
There are definitely grey areas, but in general:
Questions to be addressed:
Create a PR in openshift/cluster-ingress-operator to implement configurable router probe timeouts.
The PR should include the following:
User Story: As a customer in a highly regulated environment, I need the ability to secure DNS traffic when forwarding requests to upstream resolvers so that I can ensure additional DNS traffic security and data privacy.
tldr: three basic claims, the rest is explanation and one example
While bugs are an important metric, fixing bugs is different from investing in maintainability and debuggability. Investing in fixing bugs will help alleviate immediate problems, but doesn't improve the ability to address future problems. You (may) get a code base with fewer bugs, but when you add a new feature, it will still be hard to debug problems and interactions. This pushes a code base towards stagnation where it gets harder and harder to add features.
One alternative is to ask teams to produce ideas for how they would improve future maintainability and debuggability instead of focusing on immediate bugs. This would produce designs that make problem determination, bug resolution, and future feature additions faster over time.
I have a concrete example of one such outcome of focusing on bugs vs quality. We have resolved many bugs about communication failures with ingress by finding problems with point-to-point network communication. We have fixed the individual bugs, but have not improved the code for future debugging. In doing so, we chase many hard-to-diagnose problems across the stack. The alternative is to create a point-to-point network connectivity capability. This would immediately improve bug resolution and stability (detection) for kuryr, ovs, legacy sdn, network-edge, kube-apiserver, openshift-apiserver, authentication, and console. Bug fixing does not produce the same impact.
We need more investment in our future selves. Saying, "teams should reserve this" doesn't seem to be universally effective. Perhaps an approach that directly asks for designs and impacts and then follows up by placing the items directly in planning and prioritizing against PM feature requests would give teams the confidence to invest in these areas and give broad exposure to systemic problems.
Relevant links:
In OCP 4.8 the router was changed to use the "random" balancing algorithm for non-passthrough routes by default. It was previously "leastconn".
Bug https://bugzilla.redhat.com/show_bug.cgi?id=2007581 shows that using "random" by default incurs significant memory overhead for each backend that uses it.
PR https://github.com/openshift/cluster-ingress-operator/pull/663 reverted the change and made "leastconn" the default again (OCP 4.8 onwards).
The analysis in https://bugzilla.redhat.com/show_bug.cgi?id=2007581#c40 shows that the default haproxy behaviour is to multiply the weight (specified in the route CR) by 16 as it builds its data structures for each backend. If no weight is specified then openshift-router sets the weight to 256. If you have many, many thousands of routes then this balloons quickly and leads to a significant increase in memory usage, as highlighted by customer cases attached to BZ#2007581.
The purpose of this issue is to both explore changing the openshift-router default weight (i.e., 256) to something smaller, or indeed unset (assuming no explicit weight has been requested), and to measure the memory usage within the context of the existing perf&scale tests that we use for vetting new haproxy releases.
It may be that the low-hanging change is to not default to weight=256 for backends that only have one pod replica (i.e., if no value specified, and there is only 1 pod replica, then don't default to 256 for that single server entry).
Outcome: determine whether changing the [default] weight value makes it feasible to switch back to "random" as the default balancing algorithm for a future OCP release.
Revert router to using "random" once again in 4.11 once analysis is done on impact of weight and static memory allocation.
Per the 4.6.30 Monitoring DNS Post Mortem, we should add E2E tests to openshift/cluster-dns-operator to reduce the risk that changes to our CoreDNS configuration break DNS resolution for clients.
To begin with, we add E2E DNS testing for 2 or 3 client libraries to establish a framework for testing DNS resolvers; the work of adding additional client libraries to this framework can be left for follow-up stories. Two common libraries are Go's resolver and glibc's resolver. A somewhat common library that is known to have quirks is musl libc's resolver, which uses a shorter timeout value than glibc's resolver and reportedly has issues with the EDNS0 protocol extension. It would also make sense to test Java or other popular languages or runtimes that have their own resolvers.
Additionally, as talked about in our DNS Issue Retro & Testing Coverage meeting on Feb 28th 2024, we also decided to add a test for testing a non-EDNS0 query for a larger than 512 byte record, as once was an issue in bug OCPBUGS-27397.
The ultimate goal is that the test will inform us when a change to OpenShift's DNS or networking has an effect that may impact end-user applications.
Requirement | Notes | isMvp? |
---|---|---|
CI - MUST be running successfully with test automation | This is a requirement for ALL features. | YES |
Release Technical Enablement | Provide necessary release enablement details and documents. | YES |
When viewing the Installed Operators list set to 'All projects' and then selecting an operator that is available in 'All namespaces' (globally installed), upon clicking the operator to view its details the user is taken into the details of that operator in the installed namespace (the project selector will switch to the install namespace).
This can then be disorienting, because the lists of custom resource instances all appear blank: the lists show instances only in the currently selected project (the install namespace) and not across all namespaces the operator is available in.
It is likely that making use of the new Operator resource will improve this experience (CONSOLE-2240), though that may still be some releases away. It should be considered whether a "short term" fix is worthwhile in the meantime.
Note: The informational alert was not implemented. It was decided that since "All namespaces" is displayed in the radio button, the alert was not needed.
During master node upgrades, when nodes are getting drained, there's currently no protection from two or more operands going down. If your component is required to be available during upgrades or other voluntary disruptions, please consider deploying a PDB to protect your operands.
The effort is tracked in https://issues.redhat.com/browse/WRKLDS-293.
Example:
Acceptance Criteria:
1. Create PDB controller in console-operator for both console and downloads pods
2. Add e2e tests for PDB in single node and multi node cluster
Note: We should consider backporting this to 4.10.
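For illustration, a minimal sketch of the kind of PDB the controller could create for the console pods (labels and values are illustrative):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: console
  namespace: openshift-console
spec:
  minAvailable: 1               # keep at least one console pod during voluntary disruptions
  selector:
    matchLabels:
      app: console
```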
Goal
Add support for PDB (Pod Disruption Budget) to the console.
Requirements:
Designs:
Enable sharing ConfigMap and Secret across namespaces
Requirement | Notes | isMvp? |
---|---|---|
Secrets and ConfigMaps can get shared across namespaces | | YES |
NA
NA
Consumption of RHEL entitlements has been a challenge on OCP 4 since it moved to a cluster-based entitlement model, compared to the node-based (RHEL subscription manager) entitlement model. In order to provide a sufficiently similar experience to OCP 3, the entitlement certificates that are made available on the cluster (OCPBU-93) should be shared across namespaces, in order to prevent the need for cluster admins to copy these entitlements into each namespace, which leads to additional operational challenges for updating and refreshing them.
Questions to be addressed:
* What educational or reference material (docs) is required to support this product feature? For users/admins? Other functions (security officers, etc)?
* Does this feature have doc impact?
* New Content, Updates to existing content, Release Note, or No Doc Impact
* If unsure and no Technical Writer is available, please contact Content Strategy.
* What concepts do customers need to understand to be successful in [action]?
* How do we expect customers will use the feature? For what purpose(s)?
* What reference material might a customer want/need to complete [action]?
* Is there source material that can be used as reference for the Technical Writer in writing the content? If yes, please link if available.
* What is the doc impact (New Content, Updates to existing content, or Release Note)?
As an OpenShift engineer
I want the shared resource CSI Driver webhook to be installed with the cluster storage operator
So that the webhook is deployed when the CSI driver is deployed
None - no new functional capabilities will be added
None - we can verify in CI that we are deploying the webhook correctly.
None - no new functional capabilities will be added
The scope of this story is to just deploy the "hello world" webhook with the Cluster Storage Operator.
Adding the live ValidatingWebhook configuration and service will be done in a separate story.
As an OpenShift engineer,
I want to initialize a validating admission webhook for the shared resource CSI driver
So that I can eventually require readOnly: true to be set on all pods that use the Shared Resource CSI Driver
None.
None.
None.
This is a prerequisite for implementing the validating admission webhook.
We need to have ART build the container image downstream so that we can add the correct image references for the CVO.
If we reference images in the CVO manifests which do not have downstream counterparts, we break the downstream build for the payload.
CI is capable of producing multiple images for a GitHub repository. For example, github.com/openshift/oc produces 4-5 images with various capabilities.
We did similar work in BUILD-234 - some of these steps are not required.
See also:
Tasks:
As a developer using SharedSecrets and ConfigMaps
I want to ensure all pods set readOnly: true on admission
So that I don't have pods stuck in the "Pending" state because of a bad volume mount
QE will need to verify the new Pod Admission behavior
Docs will need to state that readOnly: true is required and must be set to true (see the sketch after the Actions list below).
None.
QE testing/verification of the feature - require readOnly to be true
Actions:
1. Create smoke test and submit to GitHub
2. Run script to integrate smoke test with Polarion
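For reference, a hedged sketch of a pod volume that would pass the admission check described in this story (the driver name matches the Shared Resource CSI Driver; resource names are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example-pod
spec:
  containers:
  - name: app
    image: registry.example.com/app:latest
    volumeMounts:
    - name: shared-creds
      mountPath: /etc/shared-creds
      readOnly: true
  volumes:
  - name: shared-creds
    csi:
      driver: csi.sharedresource.openshift.io
      readOnly: true            # required; the webhook would reject pods without it
      volumeAttributes:
        sharedSecret: my-shared-secret
```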
This section includes Jira cards that are linked to an Epic, but the Epic itself is not linked to any Feature. These epics were completed when this image was assembled
https://issues.redhat.com/browse/AUTH-2 revealed that, in principle, Pod Security Admission can be integrated into OpenShift while retaining SCC functionality.
This epic is about the concrete steps to enable Pod Security Admission by default in OpenShift
Enhancement - https://github.com/openshift/enhancements/pull/1010
ingress-operator must comply with pod security. The current audit warning is:
{ "objectRef": "openshift-ingress-operator/deployments/ingress-operator", "pod-security.kubernetes.io/audit-violations": "would violate PodSecurity \"restricted:latest\": allowPrivilegeEscalation != false (containers \"ingress-operator\", \"kube-rbac-proxy\" must set securityContext.allowPrivilegeEscalation=false), unrestricted capabilities (containers \"ingress-operator\", \"kube-rbac-proxy\" must set securityContext.capabilities.drop=[\"ALL\"]), runAsNonRoot != true (pod or containers \"ingress-operator\", \"kube-rbac-proxy\" must set securityContext.run AsNonRoot=true), seccompProfile (pod or containers \"ingress-operator\", \"kube-rbac-proxy\" must set securityContext.seccompProfile.type to \"RuntimeDefault\" or \"Localhost\")" }
dns-operator must comply with the restricted pod security level. The current audit warning is:
{ "objectRef": "openshift-dns-operator/deployments/dns-operator", "pod-security.kubernetes.io/audit-violations": "would violate PodSecurity \"restricted:latest\": allowPrivilegeEscalation != false (containers \"dns-operator\", \"kube-rbac-proxy\" must set securityContext.allowPrivilegeEscalation=false), unre stricted capabilities (containers \"dns-operator\", \"kube-rbac-proxy\" must set securityContext.capabilities.drop=[\"ALL\"]), runAsNonRoot != true (pod or containers \"dns-operator\", \"kube-rbac-proxy\" must set securityContext.runAsNonRoot=tr ue), seccompProfile (pod or containers \"dns-operator\", \"kube-rbac-proxy\" must set securityContext.seccompProfile.type to \"RuntimeDefault\" or \"Localhost\")" }HyperShift provisions OpenShift clusters with externally managed control-planes. It follows a slightly different process for provisioning clusters. For example, HyperShift uses cluster API as a backend and moves all the machine management bits to the management cluster.
Showing machine management/cluster auto-scaling tabs in the console is likely to confuse users and cause unnecessary side effects.
See Design Doc: https://docs.google.com/document/d/1k76JtRRHBdCCEjHPqKcYvbNVsuaGmRhWDLESWIm0mbo/edit#
The SERVER_FLAG controlPlaneTopology being set to External is really the driving factor here; this can be done in one of two ways:
To test work related to cluster upgrade process, use a 4.10.3 cluster set on the candidate-4.10 upgrade channel using 4.11 frontend code.
If the Infrastructure.Status.ControlPlaneTopology is set to 'External', the console-operator will pass this information via the console-config.yaml to the console. The console pod will get re-deployed and will store the topology mode information as a SERVER_FLAG. Based on that value we need to remove the ability to “Add identity providers” under “Set up your Cluster”. In addition to the getting started card, we should remove the ability to update a cluster on the details card when applicable (anything that changes a cluster version should be read-only).
Summary of changes to the overview page:
Check section 03 for more info: https://docs.google.com/document/d/1k76JtRRHBdCCEjHPqKcYvbNVsuaGmRhWDLESWIm0mbo/edit#
If the Infrastructure.Status.ControlPlaneTopology is set to 'External', the console-operator will pass this information via the console-config.yaml to the console. The console pod will get re-deployed and will store the topology mode information as a SERVER_FLAG. Based on that value we need to surface a message that the control plane is externally managed and make the following changes:
In general, anything that changes a cluster version should be read only.
Check section 02 for more info: https://docs.google.com/document/d/1k76JtRRHBdCCEjHPqKcYvbNVsuaGmRhWDLESWIm0mbo/edit#
If the Infrastructure.Status.ControlPlaneTopology is set to 'External', the console-operator will pass this information via the console-config.yaml to the console. The console pod will get re-deployed and will store the topology mode information as a SERVER_FLAG. Based on that value we need to suspend the kubeadmin notifier in the global notifications, since it contains a link for updating the cluster OAuth configuration (see attachment).
Based on Cesar's comment we should be removing the `Control Plane` section if infrastructure.status.controlPlaneTopology is "External".
If the Infrastructure.Status.ControlPlaneTopology is set to 'External', the console-operator will pass this information via the console-config.yaml to the console. Console pod will get re-deployed and will store the topology mode information as a SERVER_FLAG. Based on that value we need to suspend these notifications:
For these we will need to check `ControlPlaneTopology`: if it's set to 'External', also check whether the user can edit the cluster version (either by creating a hook or an RBAC call, e.g. `canEditClusterVersion`).
Check section 05 for more info: https://docs.google.com/document/d/1k76JtRRHBdCCEjHPqKcYvbNVsuaGmRhWDLESWIm0mbo/edit#
PatternFly Dark Theme Handbook: https://docs.google.com/document/d/1mRYEfUoOjTsSt7hiqjbeplqhfo3_rVDO0QqMj2p67pw/edit
Admin Console -> Workloads & Pods
Dev Console -> Gotcha pages: Observe Dashboard and Metrics, Add, Pipelines: builder, list, log, and run
As a developer, I want to be able to scope the changes needed to enable dark mode for the admin console. As such, I need to investigate how much of the console will display dark mode using PF variables and also define a list of gotcha pages/components which will need special casing above and beyond PF variable settings.
Acceptance criteria:
As a developer, I want to be able to fix the remaining issues from the spreadsheet of issues generated after the initial pass and spike of adding dark theme to the console. As such, I need to make sure to either complete all remaining issues from the spreadsheet, or create a bug or future story for any remaining issues in these two documents.
Acceptance criteria:
An epic we can duplicate for each release to ensure we have a place to catch things we ought to be doing regularly but can tend to fall by the wayside.
The Cluster Dashboard Details Card Protractor integration test was failing at a high rate, and despite multiple attempts to fix it, was never fully resolved, so it was disabled as a way to fix https://bugzilla.redhat.com/show_bug.cgi?id=2068594. Migrating this entire file to Cypress should give us better debugging capability, which is what was done to fix a similarly problematic project dashboard Protractor test.
Currently, you need to navigate to
Cluster Settings ->
Global configuration ->
Console (operator) config ->
Console plugins
to see and manage plugins. This takes a lot of clicks and is not discoverable. We should look at surfacing plugin details where they're easier to find, perhaps on the Cluster Settings page, or at least provide a more convenient link somewhere in the UI.
AC: Add the Dynamic Plugins section to the Status Card in the overview that will contain:
Currently, enabled plugins can fail to load for a variety of reasons. For instance, plugins don't load if the plugin name in the manifest doesn't match the ConsolePlugin name or the plugin has an invalid codeRef. There is no indication in the UI that something has gone wrong. We should explore ways to report this problem in the UI to cluster admins. Depending on the nature of the issue, an admin might be able to resolve the issue or at least report a bug against the plugin.
The message about failing could appear in the notification drawer and/or console plugins tab on the operator config. We could also explore creating an alert if a plugin is failing.
AC:
We need to provide a base for running integration tests using the dynamic plugins. The tests should initially
Once the basic framework is in place, we can update the demo plugin and add new integration tests when we add new extension points.
https://github.com/openshift/console/tree/master/frontend/dynamic-demo-plugin
https://github.com/openshift/enhancements/blob/master/enhancements/console/dynamic-plugins.md
https://github.com/openshift/console/tree/master/frontend/packages/console-plugin-sdk
In the 4.11 release, a console.openshift.io/default-i18next-namespace annotation is being introduced. The annotation indicates whether the ConsolePlugin contains localization resources. If the annotation is set to "true", the localization resources from the i18n namespace named after the dynamic plugin (e.g. plugin__kubevirt), are loaded. If the annotation is set to any other value or is missing on the ConsolePlugin resource, localization resources are not loaded.
In case these resources are not present in the dynamic plugin, the initial console load will be slowed down. For more info check BZ#2015654
AC:
Follow up of https://issues.redhat.com/browse/CONSOLE-3159
We have a Timestamp component for consistent display of dates and times that we should expose through the SDK. We might also consider a hook that formats dates and times for places where you don't want or can't use the component, e.g. times on a chart.
This will become important when we add a user preference for dates so that plugins show consistent dates and times as console. If I set my user preference to UTC dates, console should show UTC dates everywhere.
AC:
Goal
Background
RFE: for 4.10, Cincinnati and the cluster-version operator are adding conditional updates (a.k.a. targeted edge blocking): https://issues.redhat.com/browse/OTA-267
High-level plans in https://github.com/openshift/enhancements/blob/master/enhancements/update/targeted-update-edge-blocking.md#update-client-support-for-the-enhanced-schema
Example of what the oc adm upgrade UX will be in https://github.com/openshift/enhancements/blob/master/enhancements/update/targeted-update-edge-blocking.md#cluster-administrator.
The oc implementation landed via https://github.com/openshift/oc/pull/961.
Design
See design doc: https://docs.google.com/document/d/1Nja4whdsI5dKmQNS_rXyN8IGtRXDJ8gXuU_eSxBLMIY/edit#
See marvel: https://marvelapp.com/prototype/h3ehaa4/screen/86077932
The "Update Version" modal on the cluster settings page should be updated to give users information about recommended, not recommended, and blocked update versions.
Update the cluster settings page to inform the user when the latest available update is supported but not recommended. Add an informational popover to the latest version in update path visualization.
Story: As an administrator I want to rely on a default configuration that spreads image registry pods across topology zones so that I don't suffer from a long recovery time (>6 mins) in case of a complete zone failure if all pods are impacted.
Background: The image registry currently uses affinity/anti-affinity rules to spread registry pods across different hosts. However, this might cause situations in which all pods end up on hosts of a single zone, leading to a long recovery time of the registry if that zone is lost entirely. Due to problems in the past with the preferred setting for anti-affinity rule adherence, the configuration was instead forced with required, and the rules became hard constraints. With zones as hard constraints, the internal registry would no longer deploy in environments with a single zone, e.g. the internal CI environment. Pod topology spread constraints are a newer API supported in OCP which can also relax constraints in case they cannot be satisfied. Details here: https://docs.openshift.com/container-platform/4.7/nodes/scheduling/nodes-scheduler-pod-topology-spread-constraints.html
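A minimal sketch of the kind of spread constraint the registry deployment could use (label selector illustrative); ScheduleAnyway relaxes the constraint instead of blocking scheduling in single-zone environments:

```yaml
topologySpreadConstraints:
- maxSkew: 1
  topologyKey: topology.kubernetes.io/zone
  whenUnsatisfiable: ScheduleAnyway   # soft constraint: still schedulable with a single zone
  labelSelector:
    matchLabels:
      docker-registry: default        # assumed registry pod label; illustrative
```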
Acceptance criteria:
Open Questions:
As an OpenShift administrator
I want to provide the registry operator with a custom certificate authority for S3 storage
so that I can use a third-party S3 storage provider.
Remove Jenkins from the OCP Payload.
See epic linking - need alternative non payload image available to provide relatively seamless migration
Also, the EP for this is approved and merged at https://github.com/openshift/enhancements/blob/master/enhancements/builds/remove-jenkins-payload.md
PARTIAL ANSWER ^^: confirmed with Ben Parees in https://coreos.slack.com/archives/C014MHHKUSF/p1646683621293839 that EP merging is currently sufficient OCP "technical leadership" approval.
assuming none
As maintainers of the OpenShift Jenkins component, we need to run Jenkins CI for PR testing against openshift/jenkins, openshift/jenkins-sync-plugin, openshift/jenkins-client-plugin, and openshift/jenkins-openshift-login-plugin, using images built in the CI pipeline but not injected into CI test clusters via the samples operator overriding the jenkins sample imagestream with the jenkins payload image.
As maintainers of the OpenShift Jenkins component, we need Jenkins periodics for the client and sync plugins to run against the latest non payload, CPaas image, promoted to CI's image locations on quay.io, for the current release in development.
As maintainers of the OpenShift Jenkins component, we need Jenkins-related tests outside of very basic Jenkins Pipeline Strategy Build Config verification removed from openshift-tests in OpenShift Origin, using a non-payload CPaas image pertinent to the branch in question.
High Level, we ideally want to vet the new CPaas image via CI and periodics BEFORE we start changing the samples operator so that it does not manipulate the jenkins imagestream (our tests will override the samples operator override)
NONE ... QE should wait until JNKS-254
NONE
NONE
Dependencies identified
Blockers noted and expected delivery timelines set
Design is implementable
Acceptance criteria agreed upon
Story estimated
Possible staging
1) Before CPaas is available, we can validate images generated by PRs to openshift/jenkins, openshift/jenkins-sync-plugin, and openshift/jenkins-client-plugin by taking the image built from the PR (where the info needed to get the right image from the CI registry is in the IMAGE_FORMAT env var) and then doing an `oc tag --source=docker <PR image ref> openshift/jenkins:2` to replace the use of the payload image in the jenkins imagestream in the openshift namespace with the PR's image
2) insert 1) in https://github.com/openshift/release/blob/master/ci-operator/step-registry/jenkins/sync-plugin/e2e/jenkins-sync-plugin-e2e-commands.sh and https://github.com/openshift/release/blob/master/ci-operator/step-registry/jenkins/client-plugin/tests/jenkins-client-plugin-tests-commands.sh where you test for IMAGE_FORMAT being set
3) or instead of 2), update the Makefiles for the plugins to call a script that does the same sort of thing: see what is in IMAGE_FORMAT, and if it has something, do the `oc tag`
https://github.com/openshift/release/pull/26979 is a prototype of how to stick the image built from a PR and conceivably the periodics to get the image built from it and tag it into the jenkins imagestream in the openshift namespace in the test cluster
After installing or upgrading to the latest OCP version, the existing OpenShift route to the prometheus-k8s service is updated to be a path-based route to '/api/v1'.
DoD:
Following up on https://issues.redhat.com/browse/MON-1320, we added three new CLI flags to Prometheus to apply different limits on the samples' labels. These new flags are available starting from Prometheus v2.27.0, which will most likely be shipped in OpenShift 4.9.
The limits that we want to look into for OCP are the following ones:
# Per-scrape limit on number of labels that will be accepted for a sample. If
# more than this number of labels are present post metric-relabeling, the
# entire scrape will be treated as failed. 0 means no limit.
[ label_limit: <int> | default = 0 ]

# Per-scrape limit on length of labels name that will be accepted for a sample.
# If a label name is longer than this number post metric-relabeling, the entire
# scrape will be treated as failed. 0 means no limit.
[ label_name_length_limit: <int> | default = 0 ]

# Per-scrape limit on length of labels value that will be accepted for a sample.
# If a label value is longer than this number post metric-relabeling, the
# entire scrape will be treated as failed. 0 means no limit.
[ label_value_length_limit: <int> | default = 0 ]
We could benefit from them by setting relatively high values that would only be breached in cases of unbound cardinality, and thus reject the offending targets completely if they happened to breach our constraints.
DoD:
When users configure CMO to interact with systems outside of an OpenShift cluster, we want to provide an easy way to add the cluster ID to the data sent.
Technically this can be achieved today by adding an identifying label to the remote_write configuration for a given cluster. The operator adding the remote_write integration needs to take care that the label is unique over the managed fleet of clusters. This however adds management complexity. Any given cluster already has a pseudo-unique datum that can be used for this purpose.
Expose a flag in the CMO configuration that is false by default (keeps backward compatibility) and, when set to true, will add the _id label to a remote_write configuration. More specifically, it will be added to the top of a remote_write relabel_config list via the replace action. This will add the label as expected, but additionally a user could alter this label in a later relabel config to suit any specific requirements (say, rename the label or add additional information to the value).
The location of this flag is the remote_write Spec, so this can be set for individual remote_write configurations.
Add an optional boolean flag to CMO's definition of RemoteWriteSpec that, if true, adds an entry in the spec's WriteRelabelConfigs list.
I went with adding the relabel config to all user-supplied remote_write configurations. This path has no risk for backwards compatibility (unless users use the tmp_openshift_cluster_id label, which seems unlikely) and reduces overall complexity, as well as documentation complexity.
The entry should look like what is already added to the telemetry remote write config, and it should be added as the first entry in the list, before any user-supplied relabel configs.
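A hedged sketch of such an entry (the label value placeholder would be filled by CMO; with the replace action and no source labels, the label is set unconditionally, and later user-supplied relabel configs can still modify it):

```yaml
writeRelabelConfigs:
- targetLabel: _id               # cluster ID label, mirroring the telemetry remote write config
  replacement: "<cluster-id>"    # pseudo-unique cluster identifier injected by CMO
  action: replace
```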
We currently use a sample app to e2e test remote write in CMO.
In order to test the addition of the cluster_id relabel config, we need to confirm that the metrics sent actually have the expected label.
For this test we should use Prometheus as the remote_write target. This allows us to query the metrics sent via remote write and confirm they have the expected label.
The potential target ServiceMonitors are:
As a user, I want the topology view to be less cluttered as I zoom out, showing only information that I can discern while still being able to get a feel for the status of my project.
As a user, I want to understand which service bindings connected a service to a component successfully or not. Currently it's really difficult to understand and requires inspecting each ServiceBinding resource (YAML).
See also https://docs.google.com/document/d/1OzE74z2RGO5LPjtDoJeUgYBQXBSVmD5tCC7xfJotE00/edit
This epic is mainly focused on the 4.10 Release QE activities
1. Identify the scenarios for automation
2. Segregate the test scenarios into smoke, regression and other user stories
a. Update the https://docs.jboss.org/display/ODC/Automation+Status+Report
3. Align with layered operator teams for updating scripts
4. Work closely with dev team for epic automation
5. Create the automation scripts using Cypress
6. Implement CI for nightly builds
7. Execute scripts on sprint basis
To track the QE progress in one place on the 4.10 Release Confluence page
Acceptance criteria:
This epic covers a number of customer requests (RFEs) as well as increases usability.
Customer satisfaction as well as improved usability.
None
As a user, I want to use a form to create Deployments
Edit deployment form ODC-5007
As a user, I should be able to switch between the form and yaml editor while creating the ProjectHelmChartRepository CR.
Form component https://github.com/openshift/console/pull/11227
Currently we are only able to get limited telemetry from the Dev Sandbox, but not from any of our managed clusters or on-prem clusters.
In order to improve properly analyze usage and the user experience, we need to be able to gather as much data as possible.
// JS type
telemetry?: Record<string, string>
./bin/bridge --telemetry SEGMENT_API_KEY=a-key-123-xzy
./bin/bridge --telemetry CONSOLE_LOG=debug
Goal:
Enhance oc adm release new (and related verbs info, extract, mirror) with heterogeneous architecture support
tl;dr
oc adm release new (and related verbs info, extract, mirror) would be enhanced to optionally allow the creation of manifest list release payloads. The manifest list flow would be triggered whenever the CVO image in an imagestream was a manifest list. If the CVO image is a standard manifest, the generated release payload will also be a manifest. If the CVO image is a manifest list, the generated release payload would be a manifest list (containing a manifest for each arch possessed by the CVO manifest list).
In either case, oc adm release new would permit non-CVO component images to be manifest or manifest lists and pass them through directly to the resultant release manifest(s).
If a manifest list release payload is generated, each architecture specific release payload manifest will reference the same pullspecs provided in the input imagestream.
More details in Option 1 of https://docs.google.com/document/d/1BOlPrmPhuGboZbLZWApXszxuJ1eish92NlOeb03XEdE/edit#heading=h.eldc1ppinjjh
This section includes Jira cards that are linked to an Epic, but the Epic itself is not linked to any Feature. These epics were not completed when this image was assembled
I asked Zvonko Kaiser and he seemed open to it. I need to confirm with Shiva Merla
Rename Provider to Infrastructure Provider
Add GPU Provider
https://miro.com/app/board/uXjVOeUB2B4=/?moveToWidget=3458764514332229879&cot=14
As a developer building container images on OpenShift
I want to specify that my build should run without elevated privileges
So that builds do not run as root with elevated privileges from the host's perspective
No QE required for Dev Preview. OpenShift regression testing will verify that existing behavior is not impacted.
We will need to document how to enable this feature, with sufficient warnings regarding Dev Preview.
This likely warrants an OpenShift blog post.
This section includes Jira cards that are not linked to either an Epic or a Feature. These tickets were completed when this image was assembled
1. Proposed title of this feature request
--> Alert generation when the etcd container memory consumption goes beyond 90%
2. What is the nature and description of the request?
--> When the etcd database starts growing rapidly due to a high number of objects such as secrets, events, or configmaps generated by an application/workload, the memory and CPU consumption of the API server and etcd containers (control plane components) spikes up, and eventually the control plane nodes hang, become unresponsive, or crash with out-of-memory errors as some of the critical processes/services running on the master nodes get killed. Hence we request an alert/alarm when the etcd container's memory consumption goes beyond 90% so that the cluster administrator can take action before the cluster/nodes go unresponsive (a sketch of such an alert appears below).
I see we already have an etcdExcessiveDatabaseGrowth Prometheus rule which helps when a surge in etcd writes leads to a 50% increase in database size over the past four hours on an etcd instance; however, it does not consider memory consumption:
$ oc get prometheusrules etcd-prometheus-rules -o yaml|grep -i etcdExcessiveDatabaseGrowth -A 9
- alert: etcdExcessiveDatabaseGrowth
  annotations:
    description: 'etcd cluster "{{ $labels.job }}": Observed surge in etcd writes
      leading to 50% increase in database size over the past four hours on etcd
      instance {{ $labels.instance }}, please check as it might be disruptive.'
  expr: |
    increase(((etcd_mvcc_db_total_size_in_bytes/etcd_server_quota_backend_bytes)*100)[240m:1m]) > 50
  for: 10m
  labels:
    severity: warning
3. Why does the customer need this? (List the business requirements here)
--> Once the etcd memory consumption goes beyond 90-95% of total RAM, as it is a system-critical container, the OCP cluster goes unresponsive, causing revenue loss to the business and impacting the productivity of users of the OpenShift cluster.
4. List any affected packages or components.
--> etcd
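A minimal sketch of the requested alert (the metric names, the cluster-wide memory denominator, and the threshold are assumptions for illustration, not a reviewed rule):

```
# Hypothetical PrometheusRule: fire when the etcd container's working set
# exceeds 90% of total memory (the denominator would need per-node refinement).
cat <<'EOF' | oc apply -f -
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: etcd-memory-alert        # hypothetical name
  namespace: openshift-etcd
spec:
  groups:
  - name: etcd-memory.rules
    rules:
    - alert: etcdHighMemoryUsage
      expr: |
        sum by (pod) (container_memory_working_set_bytes{namespace="openshift-etcd",container="etcd"})
          / on() group_left() sum(machine_memory_bytes) > 0.9
      for: 10m
      labels:
        severity: warning
EOF
```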
This is a clone of issue OCPBUGS-14415. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-14315. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-4501. The following is the description of the original issue:
—
Description of problem:
IPv6 interface and IP are missing in all pods created in OCP 4.12 EC-2.
Version-Release number of selected component (if applicable):
4.12
How reproducible:
Every time
Steps to Reproduce:
We create network-attachment-definitions.k8s.cni.cncf.io in OCP cluster at namespace scope for our software pods to get IPV6 IPs.
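A minimal sketch of such a NetworkAttachmentDefinition (the interface, range, and names are illustrative; the report does not include the actual CNI config):

```
cat <<'EOF' | oc apply -f -
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: ipv6-net          # hypothetical name
  namespace: my-namespace # hypothetical namespace
spec:
  config: |-
    {
      "cniVersion": "0.3.1",
      "type": "macvlan",
      "master": "eth1",
      "ipam": {
        "type": "whereabouts",
        "range": "fd00:10:1::/64"
      }
    }
EOF
```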
Actual results:
Pods do not receive IPv6 addresses
Expected results:
Pods receive IPv6 addresses
Additional info:
This has been working flawlessly until OCP 4.10.21; however, we are trying the same code in OCP 4.12 EC-2 and we notice all our pods are missing the IPv6 address, and we have to restart pods a couple of times for them to get an IPv6 address.
This bug is a backport clone of [Bugzilla Bug 2094362](https://bugzilla.redhat.com/show_bug.cgi?id=2094362). The following is the description of the original bug:
—
Description of problem:
A change [1] was introduced to split the kube-apiserver SLO rules into 2 groups to reduce the load on Prometheus (see bug 2004585).
Version-Release number of selected component (if applicable):
4.9 (because the change was backported to 4.9.z)
How reproducible:
Always
Steps to Reproduce:
1. Install OCP 4.9
2. Retrieve kube-apiserver-slos*
oc get -n openshift-kube-apiserver prometheusrules kube-apiserver-slos -o yaml
oc get -n openshift-kube-apiserver prometheusrules kube-apiserver-slos-basic -o yaml
Actual results:
The KubeAPIErrorBudgetBurn alert with labels
{long="1h",namespace="openshift-kube-apiserver",severity="critical",short="5m"}exists both in kube-apiserver-slos and kube-apiserver-slos-basic.
The alerting rules is evaluated twice. The same is true for recording rules like "apiserver_request:burnrate1h" and in this case, it can trigger warning logs in the Prometheus pods:
> level=warn component="rule manager" group=kube-apiserver.rules msg="Error on ingesting out-of-order result from rule evaluation" numDropped=283
Expected results:
I presume that kube-apiserver-slos shouldn't exist since it's been replaced by kube-apiserver-slos-basic and kube-apiserver-slos-extended.
Additional info:
Discovered while investigating bug 2091902
Description of problem:
With every pod update we are executing a mutate operation to add the pod port to the port group or add the pod IP to an address set. This functionally doesn't hurt, since mutate will not add duplicate values to the same set. However, this is bad for performance. For example, with 730 network policies affecting a pod, issuing 7 pod updates would result in over 5k transactions.
Node healthz server was added in 4.13 with https://github.com/openshift/ovn-kubernetes/commit/c8489e3ff9c321e77f265dc9d484ed2549df4a6b and https://github.com/openshift/ovn-kubernetes/commit/9a836e3a547f3464d433ce8b9eef336624d51858. We need to configure it by default on 0.0.0.0:10256 on CNO for ovnk, just like we do for sdn.
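Once configured, a quick spot check could look like this (the /healthz path is an assumption based on the kube-proxy/sdn healthz convention):

```
# From a node, confirm the ovnk node healthz endpoint answers on 0.0.0.0:10256.
curl -s -o /dev/null -w '%{http_code}\n' http://127.0.0.1:10256/healthz
```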
This is a clone of issue OCPBUGS-12956. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-12910. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-12904. The following is the description of the original issue:
—
Description of problem:
In order to test proxy installations, the CI base image for OpenShift on OpenStack needs netcat.
While running a PerfScale test we noticed that the hosted ovnkube-master pods always initially error on deployment. They eventually succeed on retry however.
This is running quay.io/openshift-release-dev/ocp-release:4.11.11-x86_64 for the hosted clusters and the hypershift operator is quay.io/hypershift/hypershift-operator:4.11 on a 4.11.9 management cluster.
An example of the error in the ovnkube-master container:
```
F1102 13:27:51.935600 1 ovnkube.go:133] error when trying to initialize libovsdb SB client: unable to connect to any endpoints: failed to connect to ssl:ovnkube-master-0.ovnkube-master-internal.clusters-perf-pqd-0021.svc.cluster.local:9642: failed to open connection: dial tcp 10.131.8.25:9642: connect: connection refused. failed to connect to ssl:ovnkube-master-1.ovnkube-master-internal.clusters-perf-pqd-0021.svc.cluste
```
This is a clone of issue OCPBUGS-5100. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-5068. The following is the description of the original issue:
—
Description of problem:
virtual media provisioning fails when iLO Ironic driver is used
Version-Release number of selected component (if applicable):
4.13
How reproducible:
Always
Steps to Reproduce:
1. Attempt virtual media provisioning on a node configured with ilo-virtualmedia:// drivers
Actual results:
Provisioning fails with "An auth plugin is required to determine endpoint URL" error
Expected results:
Provisioning succeeds
Additional info:
Relevant log snippet:
3742 2022-12-19T19:02:05.997747170Z 2022-12-19 19:02:05.995 1 ERROR ironic.drivers.modules.inspector [None req-e58ac1f2-fac6-4d28-be9e-983fa900a19b - - - - - -] Unable to start managed inspection for node e4445d43-3458-4cee-9cbe-6da1de7578cd: An auth plugin is required to determine endpoint URL: keystoneauth1.exceptions.auth_plugins.MissingAuthPlugin: An auth plugin is required to determine endpoint URL
3743 2022-12-19T19:02:05.997747170Z 2022-12-19 19:02:05.995 1 ERROR ironic.drivers.modules.inspector Traceback (most recent call last):
3744 2022-12-19T19:02:05.997747170Z 2022-12-19 19:02:05.995 1 ERROR ironic.drivers.modules.inspector File "/usr/lib/python3.9/site-packages/ironic/drivers/modules/inspector.py", line 210, in _start_managed_inspection
3745 2022-12-19T19:02:05.997747170Z 2022-12-19 19:02:05.995 1 ERROR ironic.drivers.modules.inspector task.driver.boot.prepare_ramdisk(task, ramdisk_params=params)
3746 2022-12-19T19:02:05.997747170Z 2022-12-19 19:02:05.995 1 ERROR ironic.drivers.modules.inspector File "/usr/lib/python3.9/site-packages/ironic_lib/metrics.py", line 59, in wrapped
3747 2022-12-19T19:02:05.997747170Z 2022-12-19 19:02:05.995 1 ERROR ironic.drivers.modules.inspector result = f(*args, **kwargs)
3748 2022-12-19T19:02:05.997747170Z 2022-12-19 19:02:05.995 1 ERROR ironic.drivers.modules.inspector File "/usr/lib/python3.9/site-packages/ironic/drivers/modules/ilo/boot.py", line 408, in prepare_ramdisk
3749 2022-12-19T19:02:05.997747170Z 2022-12-19 19:02:05.995 1 ERROR ironic.drivers.modules.inspector iso = image_utils.prepare_deploy_iso(task, ramdisk_params,
3750 2022-12-19T19:02:05.997747170Z 2022-12-19 19:02:05.995 1 ERROR ironic.drivers.modules.inspector File "/usr/lib/python3.9/site-packages/ironic/drivers/modules/image_utils.py", line 624, in prepare_deploy_iso
3751 2022-12-19T19:02:05.997747170Z 2022-12-19 19:02:05.995 1 ERROR ironic.drivers.modules.inspector return prepare_iso_image(inject_files=inject_files)
3752 2022-12-19T19:02:05.997747170Z 2022-12-19 19:02:05.995 1 ERROR ironic.drivers.modules.inspector File "/usr/lib/python3.9/site-packages/ironic/drivers/modules/image_utils.py", line 537, in _prepare_iso_image
3753 2022-12-19T19:02:05.997747170Z 2022-12-19 19:02:05.995 1 ERROR ironic.drivers.modules.inspector image_url = img_handler.publish_image(
3754 2022-12-19T19:02:05.997747170Z 2022-12-19 19:02:05.995 1 ERROR ironic.drivers.modules.inspector File "/usr/lib/python3.9/site-packages/ironic/drivers/modules/image_utils.py", line 193, in publish_image
3755 2022-12-19T19:02:05.997747170Z 2022-12-19 19:02:05.995 1 ERROR ironic.drivers.modules.inspector swift_api = swift.SwiftAPI()
3756 2022-12-19T19:02:05.997747170Z 2022-12-19 19:02:05.995 1 ERROR ironic.drivers.modules.inspector File "/usr/lib/python3.9/site-packages/ironic/common/swift.py", line 66, in __init__
3757 2022-12-19T19:02:05.997747170Z 2022-12-19 19:02:05.995 1 ERROR ironic.drivers.modules.inspector endpoint = keystone.get_endpoint('swift', session=session)
This is a clone of issue OCPBUGS-13013. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-12854. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-11550. The following is the description of the original issue:
—
Description of problem:
`cluster-reader` ClusterRole should have ["get", "list", "watch"] permissions for a number of privileged CRs, but lacks them for the API Group "k8s.ovn.org", which includes CRs such as EgressFirewalls, EgressIPs, etc.
Version-Release number of selected component (if applicable):
OCP 4.10 - 4.12 OVN
How reproducible:
Always
Steps to Reproduce:
1. Create a cluster with OVN components, e.g. EgressFirewall
2. Check permissions of ClusterRole `cluster-reader`
Actual results:
No permissions for OVN resources
Expected results:
Get, list, and watch verb permissions for OVN resources
Additional info:
Looks like a similar bug was opened for "network-attachment-definitions" in OCPBUGS-6959 (whose closure is being contested).
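As a minimal sketch, the missing permissions could be supplied through an aggregated role along these lines (the role name and the resource list are illustrative, not the full set of k8s.ovn.org CRs):

```
cat <<'EOF' | oc apply -f -
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: ovn-reader   # hypothetical name
  labels:
    rbac.authorization.k8s.io/aggregate-to-cluster-reader: "true"
rules:
- apiGroups: ["k8s.ovn.org"]
  resources: ["egressfirewalls", "egressips"]
  verbs: ["get", "list", "watch"]
EOF
```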
This is a clone of issue OCPBUGS-1428. The following is the description of the original issue:
—
Description of problem:
When using an OperatorGroup attached to a service account, AND if there is a secret present in the namespace, the operator installation will fail with the message: the service account does not have any API secret sa=testx-ns/testx-sa This issue seems similar to https://bugzilla.redhat.com/show_bug.cgi?id=2094303 - which was resolved in 4.11.0 - however, the new element now, is that the presence of a secret in the namespace is causing the issue. The name of the secret seems significant - suggesting something somewhere is depending on the order that secrets are listed in. For example, If the secret in the namespace is called "asecret", the problem does not occur. If it is called "zsecret", the problem always occurs.
"zsecret" is not a "kubernetes.io/service-account-token". The issue I have raised here relates to Opaque secrets - zsecret is an Opaque secret. The issue may apply to other types of secrets, but specifically my issue is that when there is an opaque secret present in the namespace, the operator install fails as described. I aught to be allowed to have an opaque secret present in the namespace where I am installing the operator.
Version-Release number of selected component (if applicable):
4.11.0 & 4.11.1
How reproducible:
100% reproducible
Steps to Reproduce:
1. Create namespace: oc new-project testx-ns
2. oc apply -f api-secret-issue.yaml
Actual results:
Expected results:
Additional info:
API YAML:
cat api-secret-issue.yaml
apiVersion: v1
kind: Secret
metadata:
  name: zsecret
  namespace: testx-ns
  annotations:
    kubernetes.io/service-account.name: testx-sa
type: Opaque
stringData:
  mykey: mypass
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: testx-sa
  namespace: testx-ns
---
kind: OperatorGroup
apiVersion: operators.coreos.com/v1
metadata:
  name: testx-og
  namespace: testx-ns
spec:
  serviceAccountName: "testx-sa"
  targetNamespaces:
  - testx-ns
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: testx-role
  namespace: testx-ns
rules:
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: testx-rolebinding
  namespace: testx-ns
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: testx-role
subjects:
---
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: etcd-operator
  namespace: testx-ns
spec:
  channel: singlenamespace-alpha
  installPlanApproval: Automatic
  name: etcd
  source: community-operators
  sourceNamespace: openshift-marketplace
Description of problem:
Cannot scale up worker nodes after deploying an OCP 4.11.1 cluster via UPI on Azure.
Customer would like to have the installer create machinesets from the initial installation; therefore, the Kubernetes manifest files that define the worker machines were not removed during the installation.
Highlights:
Could you please help verify whether these are the correct steps to have the initial installation create and manage the worker machines? Is there an explanation of how changing the image to -gen2 in [concat(parameters('baseName'),'-gen2')] from the 02_storage.json template can resolve the problem?
Version-Release number of selected component (if applicable):
Environment:
OCP 4.11.1 UPI install on Azure using ARM
VM size:
bootstrap: Standard_D4s_v3
master: Standard_D4s_v3
How reproducible:
Always
Steps to Reproduce:
Following the steps described in the document: Installing a cluster on Azure using ARM templates.
In the install-config.yaml, worker replicas were set to 0:
compute:
- architecture: amd64
  hyperthreading: Enabled
  name: worker
  platform: {}
  replicas: 3
controlPlane:
  architecture: amd64
  hyperthreading: Enabled
  name: master
  platform: {}
  replicas: 3
After creating the manifests described in the step "Creating the Kubernetes manifest and Ignition config files", only the control plane machine manifests were removed; the worker machine manifests remained untouched. After three masters and three worker nodes were created by the ARM templates, additional workers were added using machine sets via the command:
oc scale --replicas=1 machineset cluster-g7rzv-worker-francecentral1 -n openshift-machine-api
Actual results:
No additional node is visible from `oc get nodes` and the following error occurs:
5h2m Warning FailedCreate machine/pokus-2knkh-worker-northeurope1-f6kc4 InvalidConfiguration: failed to reconcile machine "pokus-2knkh-worker-northeurope1-f6kc4": failed to create vm pokus-2knkh-worker-northeurope1-f6kc4: failure sending request for machine pokus-2knkh-worker-northeurope1-f6kc4: cannot create vm: compute.VirtualMachinesClient#CreateOrUpdate: Failure sending request: StatusCode=404 - Original Error: Code="NotFound" Message="The Image '/subscriptions/e639e479-2737-4b3d-b338-f1928f6429a1/resourceGroups/mlpipe-2163-azpln-rg/providers/Microsoft.Compute/images/pokus-2knkh-gen2' cannot be found in 'northeurope' region."
The customer found out that this can be resolved by changing the image to -gen2 in [concat(parameters('baseName'),'-gen2')] in the 02_storage.json template.
Expected results:
The installer should be able to create and manage the machinesets.
Additional info:
SFDC case #03304526
Slack discussion; might be due to MAO not being able to support UPI in Azure: Thread1, Thread2
This is a clone of issue OCPBUGS-7409. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-7374. The following is the description of the original issue:
—
Originally reported by lance5890 in issue https://github.com/openshift/cluster-etcd-operator/issues/1000
The controllers sometimes get stuck on listing members in failure scenarios; this is known and can be mitigated by simply restarting the CEO.
A similar BZ 2093819 with stuck controllers was fixed slightly differently in https://github.com/openshift/cluster-etcd-operator/commit/4816fab709e11e0681b760003be3f1de12c9c103
This fix was contributed by lance5890, thanks a lot!
This is a clone of issue OCPBUGS-10497. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-10213. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-8468. The following is the description of the original issue:
—
Description of problem:
RHCOS is being published to new AWS regions (https://github.com/openshift/installer/pull/6861), but aws-sdk-go needs to be bumped to recognize those regions.
Version-Release number of selected component (if applicable):
master/4.14
How reproducible:
always
Steps to Reproduce:
1. openshift-install create install-config
2. Try to select ap-south-2 as a region
Actual results:
New regions are not found. New regions are: ap-south-2, ap-southeast-4, eu-central-2, eu-south-2, me-central-1.
Expected results:
Installer supports and displays the new regions in the Survey
Additional info:
See https://github.com/openshift/installer/blob/master/pkg/asset/installconfig/aws/regions.go#L13-L23
This is a clone of issue OCPBUGS-11998. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-10678. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-10655. The following is the description of the original issue:
—
Description of problem:
The dev console shows a list of samples. The user can create a sample based on a git repository. But some of these samples don't include a git repository reference and cannot be created.
Version-Release number of selected component (if applicable):
Tested different frontend versions against a 4.11 cluster; all of them (the oldest tested frontend was 4.8) show the samples without a git repository.
But the result also depends on the installed Samples Operator and the installed ImageStreams.
How reproducible:
Always
Steps to Reproduce:
Actual results:
The git repository is not filled and the create button is disabled.
Expected results:
Samples without git repositories should not be displayed in the list.
Additional info:
The Git repository is saved as "sampleRepo" in the ImageStream tag section.
Changelog between 3.5.4 and 3.5.5:
https://github.com/etcd-io/etcd/blob/main/CHANGELOG/CHANGELOG-3.5.md#v355-tbd
Changelog between 3.5.3 and 3.5.4:
https://github.com/etcd-io/etcd/blob/main/CHANGELOG/CHANGELOG-3.5.md#v354-2022-04-24
This is a clone of OCPBUGS-853.
Description of problem:
Large OpenShift Container Platform 4.10.24 cluster is failing to update the router-certs secret in the openshift-config-managed namespace as the given secret is too big:
2022-09-01T06:24:15.157333294Z 2022-09-01T06:24:15.157Z ERROR operator.init.controller.certificate_publisher_controller controller/controller.go:266 Reconciler error {"name": "foo-bar", "namespace": "openshift-ingress-operator", "error": "failed to ensure global secret: failed to update published router certificates secret: Secret \"router-certs\" is invalid: data: Too long: must have at most 1048576 bytes"}
The OpenShift Container Platform 4 cluster has 180 IngressControllers configured with endpointPublishingStrategy set to private. Now the default certificate needs to be replaced, but it is not properly replicated to the openshift-authentication namespace and potentially other locations because of the problem mentioned (since the required secret cannot be updated).
Version-Release number of selected component (if applicable):
OpenShift Container Platform 4.10.24
How reproducible:
Always
Steps to Reproduce:
1. Install OpenShift Container Platform 4.10
2. Create 180 IngressControllers with specific certificates
3. Check the openshift-ingress-operator logs to see how it fails to update/create the necessary secret in openshift-config-managed
Actual results:
2022-09-01T06:24:15.157333294Z 2022-09-01T06:24:15.157Z ERROR operator.init.controller.certificate_publisher_controller controller/controller.go:266 Reconciler error {"name": "foo-bar", "namespace": "openshift-ingress-operator", "error": "failed to ensure global secret: failed to update published router certificates secret: Secret \"router-certs\" is invalid: data: Too long: must have at most 1048576 bytes"}
Expected results:
No matter how many IngressControllers are created, secret management taken care of by operators needs to work, even if the data exceeds the 1 MB size limitation. In that case an approach needs to exist to split the data into multiple secrets or handle it otherwise.
Additional info:
This is a clone of issue OCPBUGS-17703. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-17365. The following is the description of the original issue:
—
When we update a Secret referenced in the BareMetalHost, an immediate reconcile of the corresponding BMH is not triggered. In most states we requeue each CR after a timeout, so we should eventually see the changes.
In the case of BMC Secrets, this has been broken since the fix for OCPBUGS-1080 in 4.12.
This is a clone of issue OCPBUGS-1329. The following is the description of the original issue:
—
Description of problem:
etcd and kube-apiserver pods get restarted due to failed liveness probes while deleting/re-creating pods on SNO
Version-Release number of selected component (if applicable):
4.10.32
How reproducible:
Not always, after ~10 attempts
Steps to Reproduce:
1. Deploy SNO with Telco DU profile applied
2. Create multiple pods with local storage volumes attached (attaching yaml manifest)
3. Force delete and re-create pods 10 times
Actual results:
etcd and kube-apiserver pods get restarted, making the cluster unavailable for a period of time
Expected results:
etcd and kube-apiserver do not get restarted
Additional info:
Attaching must-gather. Please let me know if any additional info is required. Thank you!
This is a clone of issue OCPBUGS-6766. The following is the description of the original issue:
—
This is a clone of https://bugzilla.redhat.com/show_bug.cgi?id=2083087 (OCPBUGSM-44070) to backport this issue.
Description of problem:
"Delete dependent objects of this resource" is a bit of confusing for some users because when creating the Application in Dev console not only the deployment but also IS, route, svc, secret objects will be created as well. When deleting the Application (in fact it is deployment), there is an option called "Delete dependent objects of this resource" and some users might think this means the IS, route, svc and any other objects which are created alongside with the deployment will be deleted as well
Version-Release number of selected component (if applicable):
4.8
How reproducible:
Always
Steps to Reproduce:
1. Create Application in Dev console
2. Delete the deployment
3. Check "Delete dependent objects of this resource"
Actual results:
Only deployment will be deleted and IS, svc, route will not be deleted
Expected results:
We either change the description of this option, or we really delete IS, svc, route and any other objects created under this Application.
Additional info:
This is a clone of issue OCPBUGS-5191. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-5164. The following is the description of the original issue:
—
Description of problem:
It looks like the ODC doesn't register the KNATIVE_SERVING and KNATIVE_EVENTING flags. Those are based on the KnativeServing and KnativeEventing CRs, but they are looking for the v1alpha1 version of those: https://github.com/openshift/console/blob/f72519fdf2267ad91cc0aa51467113cc36423a49/frontend/packages/knative-plugin/console-extensions.json#L6-L8
This PR https://github.com/openshift-knative/serverless-operator/pull/1695 moved the CRs to v1beta1, and that breaks that ODC discovery.
Version-Release number of selected component (if applicable):
OpenShift 4.8, Serverless Operator 1.27
Additional info:
https://coreos.slack.com/archives/CHGU4P8UU/p1671634903447019
This is a clone of issue OCPBUGS-1805. The following is the description of the original issue:
—
The vSphere CSI cloud.conf lists the single datacenter from the platform workspace config, but in a multi-zone setup (https://github.com/openshift/enhancements/pull/918) there may be more than one datacenter.
This issue results in PVs failing to attach because the virtual machines can't be found in any other datacenter. For example:
0s Warning FailedAttachVolume pod/image-registry-85b5d5db54-m78vp AttachVolume.Attach failed for volume "pvc-ab1a0611-cb3b-418d-bb3b-1e7bbe2a69ed" : rpc error: code = Internal desc = failed to find VirtualMachine for node:"rbost-zonal-ghxp2-worker-3-xm7gw". Error: virtual machine wasn't found
The machine above lives in datacenter-2 but the CSI cloud.conf is only aware of the datacenter IBMCloud.
$ oc get cm vsphere-csi-config -o yaml -n openshift-cluster-csi-drivers | grep datacenters
datacenters = "IBMCloud"
This is a clone of issue OCPBUGS-6831. The following is the description of the original issue:
—
Description of problem:
The console crashes when it is used with a user settings ConfigMap that was created with a 4.13+ console. That version saves "null" for the key "console.pinnedResources", which didn't happen before, and the old console version could not handle this well.
Version-Release number of selected component (if applicable):
4.8-4.12
How reproducible:
Always, but only in the edge case that someone used a newer console first and then downgraded.
This can happen only by manually applying the user settings ConfigMap or when downgrading a cluster.
Steps to Reproduce:
Open the user-settings ConfigMap and set "console.pinnedResources" to "null" (with quotes, as all ConfigMap values need to be strings)
Or run this patch command:
oc patch -n openshift-console-user-settings configmaps user-settings-kubeadmin --type=merge --patch '{"data":{"console.pinnedResources":"null"}}'
Open console...
Actual results:
Console crashes
Expected results:
Console should not crash
Description of problem:
To address: 'Static Pod is managed but errored" err="managed container xxx does not have Resource.Requests'
Version-Release number of selected component (if applicable):
4.11
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Already merged in https://github.com/openshift/cluster-kube-controller-manager-operator/pull/660
This bug was initially created as a copy of
Bug #2096605
I am copying this bug because: the parent bug solved the validation aspect of diskType but now the description of diskType in
https://github.com/openshift/installer/blob/master/data/data/install.openshift.io_installconfigs.yaml#L2914-L2923
needs to be updated.
Version: 4.11.0-0.nightly-2022-06-06-201913
Platform: vSphere IPI
What happened?
1. If the user inputs an invalid value for platform.vsphere.diskType in the install-config.yaml file, there is no validation check for diskType; the installer doesn't exit with an error but continues the installation, which is not the same behavior as in 4.10.
After all VMs are provisioned, I checked that the disk provision type is thick.
2. If the user doesn't set platform.vsphere.diskType in the install-config.yaml file, the default disk provision type is thick, not the vSphere default storage policy. On VMC, the default policy is thin, so the description of diskType should also be updated.
$ ./openshift-install explain installconfig.platform.vsphere.diskType
KIND: InstallConfig
VERSION: v1
RESOURCE: <string>
Valid Values: "","thin","thick","eagerZeroedThick"
DiskType is the name of the disk provisioning type, valid values are thin, thick, and eagerZeroedThick. When not specified, it will be set according to the default storage policy of vsphere.
What did you expect to happen?
validation for diskType
How to reproduce it (as minimally and precisely as possible)?
set diskType to invalid value in install-config.yaml and install the cluster
As discussed previously by email, customer support case 03211616 requests a means to use the latest patch version of a given X.Y golang via imagestreams, with the blocking issue being the lack of X.Y tags for the go-toolset containers on RHCC.
The latter has now been fixed with the latest version also getting a :1.17 tag, and the imagestream source has been modified accordingly, which will get picked up in 4.12. We can now fix this in 4.11.z by backporting this to the imagestream files bundled in cluster-samples-operator.
/cc Ian Watson Feny Mehta
This is a clone of issue OCPBUGS-501. The following is the description of the original issue:
—
Description of problem:
Version-Release number of selected component (if applicable): 4.10.16
How reproducible: Always
Steps to Reproduce:
1. Edit the apiserver resource and add the spec.audit.customRules field (a concrete example appears after these steps):
$ oc get apiserver cluster -o yaml
spec:
  audit:
    customRules:
2. Allow the kube-apiserver pods to roll out a new revision.
3. Once the kube-apiserver pods are at the new revision, execute $ oc get dc
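For illustration, a customRules entry could be added like this (the group and profile values are examples, not taken from this report):

```
# Hypothetical audit customRules entry via a merge patch.
oc patch apiserver cluster --type=merge -p \
  '{"spec":{"audit":{"profile":"Default","customRules":[{"group":"system:authenticated:oauth","profile":"WriteRequestBodies"}]}}}'
```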
Actual results:
Error from server (InternalError): an error on the server ("This request caused apiserver to panic. Look in the logs for details.") has prevented the request from succeeding (get deploymentconfigs.apps.openshift.io)
Expected results: The command "oc get dc" should display the deploymentconfig without any error.
Additional info:
This is a clone of issue OCPBUGS-11844. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-5548. The following is the description of the original issue:
—
Description of problem:
This is a follow-up on https://bugzilla.redhat.com/show_bug.cgi?id=2083087 and https://github.com/openshift/console/pull/12390
When creating a Deployment, DeploymentConfig, or Knative Service with Pipeline enabled, and then deleting it again with the option "Delete other resources created by console" enabled (only available on 4.13+ with the PR above), the automatically created Pipeline is not deleted.
When the user tries to create the same resource with a Pipeline again, this fails with the error:
An error occurred
secrets "nodeinfo-generic-webhook-secret" already exists
Version-Release number of selected component (if applicable):
4.13
(we might want to backport this together with https://github.com/openshift/console/pull/12390 and OCPBUGS-5547)
How reproducible:
Always
Steps to Reproduce:
Actual results:
Case 1: Delete resources:
Case 2: Delete application:
Expected results:
Case 1: Delete resource:
Case 2: Delete application:
Additional info:
Description of problem:
Provisioning interface on master node not getting ipv4 dhcp ip address from bootstrap dhcp server on OCP 4.10.16 IPI BareMetal install.
Customer is performing an OCP 4.10.16 IPI BareMetal install and the bootstrap node provisions just fine, but when master nodes are booted for provisioning, they are not getting an IPv4 address via DHCP. As such, the install is not moving forward at this point.
Version-Release number of selected component (if applicable):
OCP 4.10.16
How reproducible:
Perform OCP 4.10.16 IPI BareMetal install.
Actual results:
The provisioning interface comes up (as evidenced by the IPv6 address) but is not getting an IPv4 address via DHCP. The OCP install/provisioning fails at this point.
Expected results:
The provisioning interface successfully receives an IPv4 address, and the master nodes (and subsequently the worker nodes as well) are successfully provisioned.
Additional info:
As a troubleshooting measure, manually adding an IPv4 address did allow the CoreOS image on the bootstrap node to be reached via curl.
Further, the kernel boot line for the first master node was updated with a static IP address assignment to confirm that the master node would successfully image this way, further confirming that the issue is the provisioning interface not receiving an IPv4 address from the DHCP server.
Description of problem:
To address: 'Static Pod is managed but errored" err="managed container xxx does not have Resource.Requests'
Version-Release number of selected component (if applicable):
4.11
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Already merged in https://github.com/openshift/cluster-kube-apiserver-operator/pull/1398
This is a clone of issue OCPBUGS-7445. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-7207. The following is the description of the original issue:
—
At some point in the mtu-migration development, a configuration file was generated at /etc/cno/mtu-migration/config, which was used as a flag to indicate to configure-ovs that a migration procedure was in progress. When that file was missing, it was assumed the migration procedure was over and configure-ovs did some cleaning on its behalf.
But that changed and /etc/cno/mtu-migration/config is never set. That causes configure-ovs to remove mtu-migration information while the procedure is still in progress, making it use incorrect MTU values and either causing nodes to be tainted with "ovn.k8s.org/mtu-too-small", blocking the procedure itself, or causing network disruption until the procedure is over.
However, this was not a problem for the CI job, as it doesn't use the migration procedure as documented, for the sake of saving the limited time available to run CI jobs. The CI merges two steps of the procedure into one so that there is never a reboot while the procedure is in progress, hiding this issue.
This was probably not detected in QE as well for the same reason as CI.
This is a clone of issue OCPBUGS-7732. The following is the description of the original issue:
—
Description of problem:
When services are deleted, the services controller cache should also remove the service from its top level cache to avoid growing forever. While this is not an issue in 4.13 once the lb_cache rework merges [1], the 4.12 and older branches have this problem because that rework is meant for 4.13 only.
[1]: https://github.com/ovn-org/ovn-kubernetes/pull/3387
This is the location where alreadyApplied is not deleting the removal: https://github.com/openshift/ovn-kubernetes/blob/cf9fb51510e1870961bf3a0f064b73536757a4f8/go-controller/pkg/ovn/controller/services/services_controller.go#L269
It should do the similar changes depicted here (currently merged upstream): https://github.com/ovn-org/ovn-kubernetes/blob/cd78ae1af4657d38bdc41003a8737aa958d62b9d/go-controller/pkg/ovn/controller/services/services_controller.go#L322-L324
Version-Release number of selected component (if applicable):
How reproducible:
100%
Steps to Reproduce:
1. create service -- use unique name
2. remove service
3. notice how alreadyApplied grows and never gets smaller
4. repeat
Actual results:
^^
Expected results:
alreadyApplied should not grow forever
Additional info:
Description of problem:
Upgrade to 4.10 is stuck looping in syncEgressFirewall. We see transacting operations failing with context deadline exceeded; it looks to be trying to process 2.8 million records in one go.
2023-02-21T19:55:06.514097513Z I0221 19:55:06.435220 1 client.go:781] "msg"="transacting operations" "database"="OVN_Northbound" "operations"="[{Op:mutate Table:Logical_Switch Row:map[] Rows:[] Columns:[] Mutations:[{Column:acls Mutator:delete Value:{GoSet:[{GoUUID:6a3ad543-a77d-4700-83b8-5ccae6b2d067} ID:1c5297ff-8588-467a-93f4-22f22d609563} {GoUUID:f6288ed3-3928-45a8-ae57-40ed94cfa249} {GoUUID:04bf90c2-fde1-4a10-baaa-6a3f1d8e2931} {GoUUID:c6609536-857c-48ae-9125-9505753180a8} {GoUUID:c79b4398-d7cc-4dcf-8c1d-11484f318324} {GoUUID:4323ac2c-033e-43c3-885b-e951cd7a4159} {GoUUID:7b316a80-076f-4266-b7d2-bd69b1d4b874} {GoUUID:57dfecb2-2f94-4cd8-a277-8b28205e1048} {GoUUID:2c039f15-ff11-4ceb-aa82-bcbe82fc86d1} {GoUUID:063c4121-73c3-4d53-a89d-1063e775146b} {GoUUID:25c788e3-6146-4571-98bf-61010100a22a} {GoUUID:3d3c150f-1296-4d91-b334-506f28bff4bd}]}}] Timeout:<nil> Where:[where column _uuid == {ba9652de-5aae-4a74-a512-29f775e38c19}] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUIDName:}]: context deadline exceeded
2023-02-21T19:55:18.739739417Z E0221 19:55:18.643127 1 master.go:1369] Failed (will retry) in syncing syncEgressFirewall: failed to remove reject acl from node logical switches: error while removing ACLS: [6a3ad543-a77d-4700-83b8-5ccae6b2d067 8e004991-0382-455f-9901-33ef724acbc2
Everything is built into one operation via: https://github.com/openshift/ovn-kubernetes/blob/release-4.10/go-controller/pkg/libovsdbops/switch.go#L243
TransactAndCheck is being called with a 10s timeout and this operation never completes.
Version-Release number of selected component (if applicable):
4.10.50
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Upgrade completes
Additional info:
Description of problem:
Similar to OCPBUGS-11636, ccoctl needs to be updated to account for the S3 bucket changes described in https://aws.amazon.com/blogs/aws/heads-up-amazon-s3-security-changes-are-coming-in-april-of-2023/. These changes have rolled out to us-east-2 and the China regions as of today and will roll out to additional regions in the near future. See OCPBUGS-11636 for additional information.
Version-Release number of selected component (if applicable):
How reproducible:
Reproducible in affected regions.
Steps to Reproduce:
1. Use "ccoctl aws create-all" flow to create STS infrastructure in an affected region like us-east-2. Notice that document upload fails because the s3 bucket is created in a state that does not allow usage of ACLs with the s3 bucket.
Actual results:
./ccoctl aws create-all --name abutchertestue2 --region us-east-2 --credentials-requests-dir ./credrequests --output-dir _output
2023/04/11 13:01:06 Using existing RSA keypair found at _output/serviceaccount-signer.private
2023/04/11 13:01:06 Copying signing key for use by installer
2023/04/11 13:01:07 Bucket abutchertestue2-oidc created
2023/04/11 13:01:07 Failed to create Identity provider: failed to upload discovery document in the S3 bucket abutchertestue2-oidc: AccessControlListNotSupported: The bucket does not allow ACLs status code: 400, request id: 2TJKZC6C909WVRK7, host id: zQckCPmozx+1yEhAj+lnJwvDY9rG14FwGXDnzKIs8nQd4fO4xLWJW3p9ejhFpDw3c0FE2Ggy1Yc=
Expected results:
"ccoctl aws create-all" successfully creates IAM and S3 infrastructure. OIDC discovery and JWKS documents are successfully uploaded to the S3 bucket and are publicly accessible.
Additional info:
This is a clone of issue OCPBUGS-7650. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-672. The following is the description of the original issue:
—
Description of problem:
Redhat-operator, part of the marketplace, is failing regularly due to the startup probe timing out while connecting to the registry-server container of the same pod within 1s, which in turn increases CPU/Mem usage on master nodes:
62m Normal Scheduled pod/redhat-operators-zb4j7 Successfully assigned openshift-marketplace/redhat-operators-zb4j7 to ip-10-0-163-212.us-west-2.compute.internal by ip-10-0-149-93
62m Normal AddedInterface pod/redhat-operators-zb4j7 Add eth0 [10.129.1.112/23] from ovn-kubernetes
62m Normal Pulling pod/redhat-operators-zb4j7 Pulling image "registry.redhat.io/redhat/redhat-operator-index:v4.11"
62m Normal Pulled pod/redhat-operators-zb4j7 Successfully pulled image "registry.redhat.io/redhat/redhat-operator-index:v4.11" in 498.834447ms
62m Normal Created pod/redhat-operators-zb4j7 Created container registry-server
62m Normal Started pod/redhat-operators-zb4j7 Started container registry-server
62m Warning Unhealthy pod/redhat-operators-zb4j7 Startup probe failed: timeout: failed to connect service ":50051" within 1s
62m Normal Killing pod/redhat-operators-zb4j7 Stopping container registry-server
Increasing the threshold of the probe might fix the problem:
livenessProbe:
  exec:
    command:
    - grpc_health_probe
    - -addr=:50051
  failureThreshold: 3
  initialDelaySeconds: 10
  periodSeconds: 10
  successThreshold: 1
  timeoutSeconds: 5
name: registry-server
ports:
- containerPort: 50051
  name: grpc
  protocol: TCP
readinessProbe:
  exec:
    command:
    - grpc_health_probe
    - -addr=:50051
  failureThreshold: 3
  initialDelaySeconds: 5
  periodSeconds: 10
  successThreshold: 1
  timeoutSeconds: 5
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. Install an OSD cluster using the 4.11.0-0.nightly-2022-08-26-162248 payload
2. Inspect the redhat-operator pod in the openshift-marketplace namespace
3. Observe the resource usage (CPU and Memory) of the pod
Actual results:
Redhat-operator failing, leading to increased CPU and Mem usage on master nodes regularly during startup
Expected results:
Redhat-operator startup probe succeeding and no spikes in resource on master nodes
Additional info:
Attached cpu, memory and event traces.
We're seeing a slight uptick in how long upgrades are taking [1][2]. We are not 100% sure of the cause, but it looks like it started with 4.11 rc.7. There are no obvious culprits in the diff [3].
Looking at some of the jobs, we are seeing that the gaps between kube-scheduler being updated and then machine-api appear to take longer. An example job run [4] shows 10+ minutes waiting for it.
TRT had a debugging session, and we have two suggestions:
[1] https://search.ci.openshift.org/graph/metrics?metric=job%3Aduration%3Atotal%3Aseconds&job=periodic-ci-openshift-release-master-ci-4.12-upgrade-from-stable-4.11-e2e-aws-ovn-upgrade&job=periodic-ci-openshift-release-master-ci-4.12-upgrade-from-stable-4.11-e2e-aws-sdn-upgrade&job=periodic-ci-openshift-release-master-ci-4.12-upgrade-from-stable-4.11-e2e-azure-upgrade&job=periodic-ci-openshift-release-master-ci-4.12-upgrade-from-stable-4.11-e2e-gcp-ovn-upgrade&job=periodic-ci-openshift-release-master-ci-4.12-upgrade-from-stable-4.11-e2e-gcp-sdn-upgrade
[2] https://sippy.dptools.openshift.org/sippy-ng/tests/4.12/analysis?test=Cluster%20upgrade.%5Bsig-cluster-lifecycle%5D%20cluster%20upgrade%20should%20complete%20in%2075.00%20minutes
[3] https://amd64.ocp.releases.ci.openshift.org/releasestream/4-stable/release/4.11.0-rc.7
[4] https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.12-e2e-azure-sdn-upgrade/1556865989923049472
libovsdb builds transaction log messages for every transaction and then throws them away if the log level is not 4 or above. This wastes a bunch of CPU at scale and increases pod ready latency.
Description of problem:
To address: 'Static Pod is managed but errored" err="managed container xxx does not have Resource.Requests'
Version-Release number of selected component (if applicable):
4.11
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
This is a clone of issue OCPBUGS-5067. The following is the description of the original issue:
—
Description of problem:
Since coreos-installer writes to stdout, its logs are not available for us.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
After setting the appsDomain flag in the OpenShift Ingress configuration, all Routes that are generated are based on this domain. This includes OpenShift-core Routes, which can result in failing health checks, as seen with the OpenShift Web Console. According to the documentation [0], the appsDomain value should only be used by user-created Routes and is not intended to be seen when OpenShift components are reconciled.
Version-Release number of selected component (if applicable):
OpenShift 4.12
How reproducible:
Everytime
Steps to Reproduce:
1. Provision a new cluster
2. Configure the appsDomain value
3. Delete the OpenShift Console Route: oc delete routes -n openshift-console console
4. The Route is automatically reconciled
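Step 2 could be done, for instance, like this (the domain value is illustrative):

```
oc patch ingresses.config.openshift.io cluster --type=merge \
  -p '{"spec":{"appsDomain":"apps.example.com"}}'
```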
Actual results:
The re-created Route uses the appsDomain value, resulting in failed health-checks throughout the cluster.
Expected results:
The Console and all other OpenShift-core components should use the default domain.
Additional info:
Reviewing the appsDomain code, if the appsDomain value is configured it takes precedence for all Routes. [3][1][2]
Resources:
[0] https://docs.openshift.com/container-platform/4.9/networking/ingress-operator.html#nw-ingress-configuring-application-domain_configuring-ingress
[1] Sets the Route subdomain to appsDomain if it's present, otherwise uses the default domain: https://github.com/openshift/cluster-openshift-apiserver-operator/blob/d2182396d647839f61b027602578ac31c9437e7d/pkg/operator/configobservation/ingresses/observe_ingresses.go#L16-L60
[2] Sets up the Route generator: https://github.com/openshift/openshift-apiserver/blob/9c381fc873b35c074f7c0c485893e038828c1405/pkg/cmd/openshift-apiserver/openshiftapiserver/config.go#L217-L218
[3] Where Route.spec.host is generated: https://github.com/openshift/openshift-apiserver/blob/9c381fc873b35c074f7c0c485893e038828c1405/vendor/github.com/openshift/library-go/pkg/route/hostassignment/plugin.go#L22-L47
This is a clone of issue OCPBUGS-11208. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-11054. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-11038. The following is the description of the original issue:
—
Description of problem:
Backport support for the new GCP region europe-west12, starting in 4.12.z.
Version-Release number of selected component (if applicable):
4.12.z and 4.13.z
How reproducible:
Always
Steps to Reproduce:
1. Use openshift-install to deploy OCP in europe-west12
Actual results:
europe-west12 is not available as a supported region in the user survey
Expected results:
europe-west12 to be available as a supported region in the user survey
Additional info:
Description of problem:
If you set a service's cluster IP to an IP with a leading zero (e.g. 192.168.0.011), ovn-k should normalise this and remove the leading zero before sending it to OVN.
This was seen by me on a CI run executing the k8s test here: test/e2e/network/funny_ips.go +75.
You can reproduce using that test.
Have a read of the text there:
// What are funny IPs:
// The adjective is because of the curl blog that explains the history and the problem of liberal
// parsing of IP addresses and the consequences and security risks caused the lack of normalization,
// mainly due to the use of different notations to abuse parsers misalignment to bypass filters.
// xref: https://daniel.haxx.se/blog/2021/04/19/curl-those-funny-ipv4-addresses/
//
// Since golang 1.17, IPv4 addresses with leading zeros are rejected by the standard library.
// xref: https://github.com/golang/go/issues/30999
//
// Because this change on the parsers can cause that previous valid data become invalid, Kubernetes
// forked the old parsers allowing leading zeros on IPv4 address to not break the compatibility.
//
// Kubernetes interprets leading zeros on IPv4 addresses as decimal, users must not rely on parser
// alignment to not being impacted by the associated security advisory: CVE-2021-29923 golang
// standard library "net" - Improper Input Validation of octal literals in golang 1.16.2 and below
// standard library "net" results in indeterminate SSRF & RFI vulnerabilities. xref:
// https://nvd.nist.gov/vuln/detail/CVE-2021-29923
northd is logging an error about this also:
|socket_util|ERR|172.30.0.011:7180: bad IP address "172.30.0.011" ... 2022-08-23T14:14:21.968Z|01839|ovn_util|WARN|bad ip address or port for load balancer key 172.30.0.011:7180
Also, I see the error:
E0823 14:14:34.135115 3284 gateway_shared_intf.go:600] Failed to delete conntrack entry for service e2e-funny-ips-8626/funny-ip: failed to delete conntrack entry for service e2e-funny-ips-8626/funny-ip with svcVIP 172.30.0.011, svcPort 7180, protocol TCP: value "<nil>" passed to DeleteConntrack is not an IP address
We should normalise the IPs before sending them to OVN. I also see there's a conntrack error when trying to set this bad IP.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. See above k8 test
Actual results:
Leading zero IP sent to OVN
Expected results:
No leading zero IP sent to OVN
Additional info:
The relevant code in ironic-image was not updated to support TLS, so it still uses the old port and explicit http://
This is a clone of issue OCPBUGS-676. The following is the description of the original issue:
—
The machine approver isn't recognizing hostnames that use capital letters as valid, even though DNS is case-insensitive.
an example of this is in OHSS-14709:
I0822 19:04:51.587266 1 controller.go:114] Reconciling CSR: csr-vdtpv
I0822 19:04:51.600941 1 csr_check.go:156] csr-vdtpv: CSR does not appear to be client csr
I0822 19:04:51.603648 1 csr_check.go:542] retrieving serving cert from ip-100-66-119-117.ec2.internal (100.66.119.117:10250)
I0822 19:04:51.604003 1 csr_check.go:181] Failed to retrieve current serving cert: dial tcp 100.66.119.117:10250: connect: connection refused
I0822 19:04:51.604017 1 csr_check.go:201] Falling back to machine-api authorization for ip-100-66-119-117.ec2.internal
E0822 19:04:51.604024 1 csr_check.go:392] csr-vdtpv: DNS name 'ip-100-66-119-117.tech-ace-maint-prd.aws.delta.com' not in machine names: ip-100-66-119-117.ec2.internal ip-100-66-119-117.ec2.internal ip-100-66-119-117.tech-ACE-maint-prd.aws.delta.com
I0822 19:04:51.604033 1 csr_check.go:204] Could not use Machine for serving cert authorization: DNS name 'ip-100-66-119-117.tech-ace-maint-prd.aws.delta.com' not in machine names: ip-100-66-119-117.ec2.internal ip-100-66-119-117.ec2.internal ip-100-66-119-117.tech-ACE-maint-prd.aws.delta.com
I0822 19:04:51.606777 1 controller.go:199] csr-vdtpv: CSR not authorized
This can be worked around by manually approving the CSR
The relevant line in the machine approver appears to be here: https://github.com/openshift/cluster-machine-approver/blob/master/pkg/controller/csr_check.go#L378
This is a clone of issue OCPBUGS-4489. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-4168. The following is the description of the original issue:
—
Description of problem:
Prometheus continuously restarts due to slow WAL replay
Version-Release number of selected component (if applicable):
openshift - 4.11.13
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
[OVN][OSP] After reboot egress node, egress IP cannot be applied anymore.
Version-Release number of selected component (if applicable):
4.12.0-0.nightly-2022-11-07-181244
How reproducible:
Frequently happens in automation, but I didn't reproduce it manually.
Steps to Reproduce:
1. Label one node as an egress node
2. Configure one egressIP object
STEP: Check one EgressIP assigned in the object.
Nov 8 15:28:23.591: INFO: egressIPStatus: [{"egressIP":"192.168.54.72","node":"huirwang-1108c-pg2mt-worker-0-2fn6q"}]
3. Reboot the node, wait for the node to become ready.
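Spelled out, the first two steps could look like this (the node name and EgressIP values are taken from the report; the namespaceSelector is illustrative):

```
# Step 1: tag the node as an egress node.
oc label node huirwang-1108c-pg2mt-worker-0-2fn6q k8s.ovn.org/egress-assignable=""

# Step 2: a minimal EgressIP object.
cat <<'EOF' | oc apply -f -
apiVersion: k8s.ovn.org/v1
kind: EgressIP
metadata:
  name: egressip-47031
spec:
  egressIPs:
  - 192.168.54.72
  namespaceSelector:
    matchLabels:
      env: qa   # hypothetical label
EOF
```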
Actual results:
EgressIP cannot be applied anymore; waited more than 1 hour.
oc get egressip
NAME             EGRESSIPS       ASSIGNED NODE   ASSIGNED EGRESSIPS
egressip-47031   192.168.54.72
Expected results:
The egressIP should be applied correctly.
Additional info:
Some logs:
E1108 07:29:41.849149 1 egressip.go:1635] No assignable nodes found for EgressIP: egressip-47031 and requested IPs: [192.168.54.72]
I1108 07:29:41.849288 1 event.go:285] Event(v1.ObjectReference{Kind:"EgressIP", Namespace:"", Name:"egressip-47031", UID:"", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Warning' reason: 'NoMatchingNodeFound' no assignable nodes for EgressIP: egressip-47031, please tag at least one node with label: k8s.ovn.org/egress-assignable
W1108 07:33:37.401149 1 egressip_healthcheck.go:162] Could not connect to huirwang-1108c-pg2mt-worker-0-2fn6q (10.131.0.2:9107): context deadline exceeded
I1108 07:33:37.401348 1 master.go:1364] Adding or Updating Node "huirwang-1108c-pg2mt-worker-0-2fn6q"
I1108 07:33:37.437465 1 egressip_healthcheck.go:168] Connected to huirwang-1108c-pg2mt-worker-0-2fn6q (10.131.0.2:9107)
After this log, seems like no logs related to "192.168.54.72" happened.
This fix contains the following changes coming from updated versions of kubernetes up to v1.24.11:
Changelog:
v1.24.11: https://github.com/kubernetes/kubernetes/blob/release-1.24/CHANGELOG/CHANGELOG-1.24.md#changelog-since-v12410
v1.24.10: https://github.com/kubernetes/kubernetes/blob/release-1.24/CHANGELOG/CHANGELOG-1.24.md#changelog-since-v1249
v1.24.9: https://github.com/kubernetes/kubernetes/blob/release-1.24/CHANGELOG/CHANGELOG-1.24.md#changelog-since-v1248
v1.24.8: https://github.com/kubernetes/kubernetes/blob/release-1.24/CHANGELOG/CHANGELOG-1.24.md#changelog-since-v1247
v1.24.7: https://github.com/kubernetes/kubernetes/blob/release-1.24/CHANGELOG/CHANGELOG-1.24.md#changelog-since-v1246
Description of problem:
Dummy bug that is needed to track backport of https://github.com/ovn-org/ovn-kubernetes/pull/2975/commits/816e30a1fbb5beb8b20fe3e96906285762dd8eb6 which is already merged in 4.12
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
This is a clone of issue OCPBUGS-262. The following is the description of the original issue:
—
GitHub rate limit failures for the UPI image downloading govc.
Description of problem:
There is a bug affecting the verify steps functionality on iDRAC hardware in OpenShift 4.11 and 4.10. The original bug report has been made against 4.10: https://issues.redhat.com/browse/OCPBUGS-1740. While I am not aware of this issue being reported against 4.11, due to the fact that the fix is only present in the 4.12 codebase, 4.11 versions will also be affected by this issue. This bug is created to meet automation requirements for backporting the fixes from 4.12 to 4.11 (and then to 4.10 in the bug quoted above).
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
Upgrade OCP 4.11 --> 4.12 fails with one node in 'NotReady,SchedulingDisabled' state and MachineConfigDaemonFailed.
Version-Release number of selected component (if applicable):
Upgrade from OCP 4.11.0-0.nightly-2022-09-19-214532 on top of OSP RHOS-16.2-RHEL-8-20220804.n.1 to 4.12.0-0.nightly-2022-09-20-040107. Network Type: OVNKubernetes
How reproducible:
Twice out of two attempts.
Steps to Reproduce:
1. Install OCP 4.11.0-0.nightly-2022-09-19-214532 (IPI) on top of OSP RHOS-16.2-RHEL-8-20220804.n.1. The cluster is up and running with three workers:
$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-09-19-214532   True        False         51m     Cluster version is 4.11.0-0.nightly-2022-09-19-214532
2. Run the oc command to upgrade to 4.12.0-0.nightly-2022-09-20-040107:
$ oc adm upgrade --to-image=registry.ci.openshift.org/ocp/release:4.12.0-0.nightly-2022-09-20-040107 --allow-explicit-upgrade --force=true
warning: Using by-tag pull specs is dangerous, and while we still allow it in combination with --force for backward compatibility, it would be much safer to pass a by-digest pull spec instead
warning: The requested upgrade image is not one of the available updates. You have used --allow-explicit-upgrade for the update to proceed anyway
warning: --force overrides cluster verification of your supplied release image and waives any update precondition failures.
Requesting update to release image registry.ci.openshift.org/ocp/release:4.12.0-0.nightly-2022-09-20-040107
3. The upgrade does not succeed: [0]
$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-09-19-214532   True        True          17h     Unable to apply 4.12.0-0.nightly-2022-09-20-040107: wait has exceeded 40 minutes for these operators: network
One node degraded to 'NotReady,SchedulingDisabled' status:
$ oc get nodes
NAME                          STATUS                        ROLES    AGE   VERSION
ostest-9vllk-master-0         Ready                         master   19h   v1.24.0+07c9eb7
ostest-9vllk-master-1         Ready                         master   19h   v1.24.0+07c9eb7
ostest-9vllk-master-2         Ready                         master   19h   v1.24.0+07c9eb7
ostest-9vllk-worker-0-4x4pt   NotReady,SchedulingDisabled   worker   18h   v1.24.0+3882f8f
ostest-9vllk-worker-0-h6kcs   Ready                         worker   18h   v1.24.0+3882f8f
ostest-9vllk-worker-0-xhz9b   Ready                         worker   18h   v1.24.0+3882f8f
$ oc get pods -A | grep -v -e Completed -e Running
NAMESPACE                   NAME                                  READY   STATUS     RESTARTS   AGE
openshift-openstack-infra   coredns-ostest-9vllk-worker-0-4x4pt   0/2     Init:0/1   0          18h
$ oc get events
LAST SEEN   TYPE      REASON   OBJECT   MESSAGE
7m15s   Warning   OperatorDegraded: MachineConfigDaemonFailed   /machine-config   Unable to apply 4.12.0-0.nightly-2022-09-20-040107: failed to apply machine config daemon manifests: error during waitForDaemonsetRollout: [timed out waiting for the condition, daemonset machine-config-daemon is not ready. status: (desired: 6, updated: 6, ready: 5, unavailable: 1)]
7m15s   Warning   MachineConfigDaemonFailed   /machine-config   Cluster not available for [{operator 4.11.0-0.nightly-2022-09-19-214532}]: failed to apply machine config daemon manifests: error during waitForDaemonsetRollout: [timed out waiting for the condition, daemonset machine-config-daemon is not ready.
status: (desired: 6, updated: 6, ready: 5, unavailable: 1)]
$ oc get co
NAME                                       VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication                             4.12.0-0.nightly-2022-09-20-040107   True        False         False      18h
baremetal                                  4.12.0-0.nightly-2022-09-20-040107   True        False         False      19h
cloud-controller-manager                   4.12.0-0.nightly-2022-09-20-040107   True        False         False      19h
cloud-credential                           4.12.0-0.nightly-2022-09-20-040107   True        False         False      19h
cluster-autoscaler                         4.12.0-0.nightly-2022-09-20-040107   True        False         False      19h
config-operator                            4.12.0-0.nightly-2022-09-20-040107   True        False         False      19h
console                                    4.12.0-0.nightly-2022-09-20-040107   True        False         False      18h
control-plane-machine-set                  4.12.0-0.nightly-2022-09-20-040107   True        False         False      17h
csi-snapshot-controller                    4.12.0-0.nightly-2022-09-20-040107   True        False         False      19h
dns                                        4.12.0-0.nightly-2022-09-20-040107   True        True          False      19h     DNS "default" reports Progressing=True: "Have 5 available node-resolver pods, want 6."
etcd                                       4.12.0-0.nightly-2022-09-20-040107   True        False         False      19h
image-registry                             4.12.0-0.nightly-2022-09-20-040107   True        True          False      18h     Progressing: The registry is ready...
ingress                                    4.12.0-0.nightly-2022-09-20-040107   True        False         False      18h
insights                                   4.12.0-0.nightly-2022-09-20-040107   True        False         False      19h
kube-apiserver                             4.12.0-0.nightly-2022-09-20-040107   True        True          False      18h     NodeInstallerProgressing: 1 nodes are at revision 11; 2 nodes are at revision 13
kube-controller-manager                    4.12.0-0.nightly-2022-09-20-040107   True        False         False      19h
kube-scheduler                             4.12.0-0.nightly-2022-09-20-040107   True        False         False      19h
kube-storage-version-migrator              4.12.0-0.nightly-2022-09-20-040107   True        False         False      19h
machine-api                                4.12.0-0.nightly-2022-09-20-040107   True        False         False      19h
machine-approver                           4.12.0-0.nightly-2022-09-20-040107   True        False         False      19h
machine-config                             4.11.0-0.nightly-2022-09-19-214532   False       True          True       16h     Cluster not available for [{operator 4.11.0-0.nightly-2022-09-19-214532}]: failed to apply machine config daemon manifests: error during waitForDaemonsetRollout: [timed out waiting for the condition, daemonset machine-config-daemon is not ready. status: (desired: 6, updated: 6, ready: 5, unavailable: 1)]
marketplace                                4.12.0-0.nightly-2022-09-20-040107   True        False         False      19h
monitoring                                 4.12.0-0.nightly-2022-09-20-040107   True        False         False      18h
network                                    4.12.0-0.nightly-2022-09-20-040107   True        True          True       19h     DaemonSet "/openshift-ovn-kubernetes/ovnkube-node" rollout is not making progress - last change 2022-09-20T14:16:13Z...
node-tuning                                4.12.0-0.nightly-2022-09-20-040107   True        False         False      17h
openshift-apiserver                        4.12.0-0.nightly-2022-09-20-040107   True        False         False      18h
openshift-controller-manager               4.12.0-0.nightly-2022-09-20-040107   True        False         False      17h
openshift-samples                          4.12.0-0.nightly-2022-09-20-040107   True        False         False      17h
operator-lifecycle-manager                 4.12.0-0.nightly-2022-09-20-040107   True        False         False      19h
operator-lifecycle-manager-catalog         4.12.0-0.nightly-2022-09-20-040107   True        False         False      19h
operator-lifecycle-manager-packageserver   4.12.0-0.nightly-2022-09-20-040107   True        False         False      19h
service-ca                                 4.12.0-0.nightly-2022-09-20-040107   True        False         False      19h
storage                                    4.12.0-0.nightly-2022-09-20-040107   True        True          False      19h     ManilaCSIDriverOperatorCRProgressing: ManilaDriverNodeServiceControllerProgressing: Waiting for DaemonSet to deploy node pods...
[0] http://pastebin.test.redhat.com/1074531
Actual results:
OCP 4.11 --> 4.12 upgrade fails.
Expected results:
OCP 4.11 --> 4.12 upgrade success.
Additional info:
Attached logs of the NotReady node - [^journalctl_ostest-9vllk-worker-0-4x4pt.log.tar.gz]
This is a clone of issue OCPBUGS-6764. The following is the description of the original issue:
—
Description of problem:
The "Add Git Repository" has a "Show configuration options" expandable section that shows the required permissions for a webhook setup, and provides a link to "read more about setting up webhook".
But the permission section shows nothing when this second expandable section is opened, and the link doesn't do anything until the user enters a "supported" GitHub, GitLab or Bitbucket URL.
Version-Release number of selected component (if applicable):
4.11-4.13
How reproducible:
Always
Steps to Reproduce:
Actual results:
Expected results:
Additional info:
This is a clone of issue OCPBUGS-6887. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-3476. The following is the description of the original issue:
—
Description of problem:
When we detect a refs/heads/branchname we should show the label as what we have now:
- Branch: branchname
And when we detect a refs/tags/tagname we should instead show the label as:
- Tag: tagname
I haven't implemented this in cli but there is an old issue for that here openshift-pipelines/pipelines-as-code#181
Version-Release number of selected component (if applicable):
4.11.z
How reproducible:
Steps to Reproduce:
1. Create a repository
2. Trigger the pipelineruns by a push or pull request event on GitHub
Actual results:
We do not show the tag name even when a tag is present; the branch label is shown instead.
Expected results:
We should show the tag if a tag is detected and the branch if a branch is detected.
Additional info:
https://github.com/openshift/console/pull/12247#issuecomment-1306879310
This is a clone of issue OCPBUGS-2438. The following is the description of the original issue:
—
Description of problem:
On the alert details page and alerting rule details page, clicking on a field that has a popover help throws an uncaught JavaScript error.
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. Go to the Observe > Alerting pages
2. Click on an alert (or go to the rules tab, then click on a rule)
3. Click on one of the underlined fields (those that have a popover help)
Actual results:
Expected results:
Additional info:
When a thin provisioned COW format disk is created on OCP on RHV via the CSI driver (a PVC as in
https://github.com/openshift/ovirt-csi-driver/blob/master/deploy/example/storage-claim.yaml
), the disk is created with an actual size close to the full requested virtual size. But this is a thin provisioned disk, so the initial size of the disk should be the engine default and then grow as needed; it shouldn't be this big.
This causes all the disks created this way to be functionally preallocated (since it eats all that space), which is a real waste of space.
How reproducible: 100%
Steps to Reproduce:
1. Create a storage claim (PVC) in Openshift (
https://github.com/openshift/ovirt-csi-driver/blob/master/deploy/example/storage-claim.yaml
) using the default storage class (or any other storage class with thinProvisioning: "true") and with requested storage of, e.g., 100Gi
$ oc create -f storage-claim.yaml
2. In the RHV web console navigate to Storage -> Disks and check Virtual size and Actual size of the created disk (PVC)
Actual results:
Disk from our example with requested storage 100GB reports virtual size 100GB and actual size 110 GB.
Expected results:
Thin provisioned disks should start with small initial size and then grow as needed, so its actual size should be considerably smaller (the default initial size set by the engine should be 2.5 GB if I'm not mistaken).
Note: The extra 10GB in the actual size are caused by overhead for the qcow2 disk format, which is 10%, and this was tracked here as a separate issue:
https://bugzilla.redhat.com/show_bug.cgi?id=2097139
This is a clone of issue OCPBUGS-10603. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-10558. The following is the description of the original issue:
—
Description of problem:
When running a cluster on application credentials, this event appears repeatedly:
ns/openshift-machine-api machineset/nhydri0d-f8dcc-kzcwf-worker-0 hmsg/173228e527 - pathological/true reason/ReconcileError could not find information for "ci.m1.xlarge"
Version-Release number of selected component (if applicable):
How reproducible:
Happens in the CI (https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_release/33330/rehearse-33330-periodic-ci-shiftstack-shiftstack-ci-main-periodic-4.13-e2e-openstack-ovn-serial/1633149670878351360).
Steps to Reproduce:
1. On a living cluster, rotate the OpenStack cloud credentials
2. Invalidate the previous credentials
3. Watch the machine-api events (`oc -n openshift-machine-api get event`). A `Warning` event of the form `could not find information for "name-of-the-flavour"` will appear.
If the cluster was installed using a password that you can't invalidate:
1. Rotate the cloud credentials to application credentials
2. Restart MAPO (`oc -n openshift-machine-api get pods -o NAME | xargs -r oc -n openshift-machine-api delete`)
3. Rotate cloud credentials again
4. Revoke the first application credentials you set
5. Finally watch the events (`oc -n openshift-machine-api get event`)
The event signals that MAPO wasn't able to update flavour information on the MachineSet status.
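A sketch for watching only the warning events while performing the rotation (same namespace as above):
$ oc -n openshift-machine-api get events --field-selector type=Warning --watch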
Actual results:
Expected results:
No issue detecting the flavour details
Additional info:
Offending code likely around this line: https://github.com/openshift/machine-api-provider-openstack/blob/bcb08a7835c08d20606d75757228fd03fbb20dab/pkg/machineset/controller.go#L116
This is a clone of issue OCPBUGS-1704. The following is the description of the original issue:
—
Description of problem:
According to OCP 4.11 doc (https://docs.openshift.com/container-platform/4.11/installing/installing_gcp/installing-gcp-account.html#installation-gcp-enabling-api-services_installing-gcp-account), the Service Usage API (serviceusage.googleapis.com) is an optional API service to be enabled. But, the installation cannot succeed if this API is disabled.
Version-Release number of selected component (if applicable):
4.12.0-0.nightly-2022-09-25-071630
How reproducible:
Always, if the Service Usage API is disabled in the GCP project.
Steps to Reproduce:
1. Make sure the Service Usage API (serviceusage.googleapis.com) is disabled in the GCP project.
2. Try IPI installation in the GCP project.
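A sketch for verifying step 1 before running the installer (project ID is hypothetical; note that gcloud itself queries the Service Usage API, so if the API is disabled this command fails with an error saying so, which also answers the question):
$ gcloud services list --enabled --project my-project | grep serviceusage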
Actual results:
The installation would fail finally, without any worker machines launched.
Expected results:
Installation should succeed, or the OCP doc should be updated.
Additional info:
Please see the attached must-gather logs (http://virt-openshift-05.lab.eng.nay.redhat.com/jiwei/jiwei-0926-03-cnxn5/) and the sanity check results. FYI: if the API is enabled, without changing anything else, the installation succeeds.
Description of problem:
The reconciler removes the overlappingrangeipreservations.whereabouts.cni.cncf.io resources whether the pod is alive or not.
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. Create pods and check the overlappingrangeipreservations.whereabouts.cni.cncf.io resources:
$ oc get overlappingrangeipreservations.whereabouts.cni.cncf.io -A
NAMESPACE          NAME                      AGE
openshift-multus   2001-1b70-820d-4b04--13   4m53s
openshift-multus   2001-1b70-820d-4b05--13   4m49s
2. Verify that the ip-reconciler cronjob removes the overlappingrangeipreservations.whereabouts.cni.cncf.io resources when it runs:
$ oc get cronjob -n openshift-multus
NAME            SCHEDULE       SUSPEND   ACTIVE   LAST SCHEDULE   AGE
ip-reconciler   */15 * * * *   False     0        14m             4d13h
$ oc get overlappingrangeipreservations.whereabouts.cni.cncf.io -A
No resources found
$ oc get cronjob -n openshift-multus
NAME            SCHEDULE       SUSPEND   ACTIVE   LAST SCHEDULE   AGE
ip-reconciler   */15 * * * *   False     0        5s              4d13h
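A possible stop-gap while investigating, suspending the reconciler entirely (sketch; this also stops legitimate cleanup, so it is a mitigation, not a fix):
$ oc -n openshift-multus patch cronjob ip-reconciler --type=merge -p '{"spec":{"suspend":true}}'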
Actual results:
The overlappingrangeipreservations.whereabouts.cni.cncf.io resources are removed for each created pod by the ip-reconciler cronjob, so the "overlapping ranges" are effectively not honored.
Expected results:
The overlappingrangeipreservations.whereabouts.cni.cncf.io resources should not be removed, regardless of whether a pod has used an IP in the overlapping ranges.
Additional info:
This is a clone of issue OCPBUGS-17808. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-16804. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-15327. The following is the description of the original issue:
—
Description of problem:
On OpenShift Container Platform, the etcd Pod is showing messages like the following:
2023-06-19T09:10:30.817918145Z {"level":"warn","ts":"2023-06-19T09:10:30.817Z","caller":"fileutil/purge.go:72","msg":"failed to lock file","path":"/var/lib/etcd/member/wal/000000000000bc4b-00000000183620a4.wal","error":"fileutil: file already locked"}
This is described in KCS https://access.redhat.com/solutions/7000327
Version-Release number of selected component (if applicable):
any currently supported version (> 4.10) running with 3.5.x
How reproducible:
always
Steps to Reproduce:
happens after running etcd for a while
This has been discussed in https://github.com/etcd-io/etcd/issues/15360
It's not a harmful error message, it merely indicates that some WALs have not been included in snapshots yet.
This was caused by changing default numbers: https://github.com/etcd-io/etcd/issues/13889
This was fixed in https://github.com/etcd-io/etcd/pull/15408/files but never backported to 3.5.
To mitigate that error and stop confusing people, we should also supply that argument when starting etcd in: https://github.com/openshift/cluster-etcd-operator/blob/master/bindata/etcd/pod.yaml#L170-L187
That way we're not surprised by changes of the default values upstream.
Tracker issue for bootimage bump in 4.11. This issue should block issues which need a bootimage bump to fix.
The previous bump was OCPBUGS-3362.
This is a clone of issue OCPBUGS-2079. The following is the description of the original issue:
—
Description of problem:
The setting of systemReserved: ephemeral-storage in KubeletConfig is not working as expected.
Version-Release number of selected component (if applicable):
4.10.z, may exist on other OCP versions as well.
How reproducible:
always
Steps to Reproduce:
1. Create a KubeletConfig on the node:
apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: system-reserved-config
spec:
  machineConfigPoolSelector:
    matchLabels:
      pools.operator.machineconfiguration.openshift.io/master: ""
  kubeletConfig:
    systemReserved:
      cpu: 500m
      memory: 500Mi
      ephemeral-storage: 10Gi
2. Check node allocatable storage with the command:
oc describe node | grep -C 5 ephemeral-storage
Actual results:
The Allocatable:ephemeral-storage on the node is not capacity.ephemeral-storage - systemReserved.ephemeral-storage - eviction-thresholds (10% of the capacity.ephemeral-storage by default)
Expected results:
The Allocatable:ephemeral-storage on the node should be capacity.ephemeral-storage - systemReserved.ephemeral-storage - eviction-thresholds (10% of the capacity.ephemeral-storage by default)
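As a worked example, assuming 100Gi of ephemeral-storage capacity: allocatable should be 100Gi - 10Gi (systemReserved) - 10Gi (the default 10% eviction threshold) = 80Gi; an allocatable value that only subtracts the eviction threshold indicates the systemReserved setting was ignored.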
Additional info:
The root cause might be: the process argument '--system-reserved=cpu=500m,memory=500Mi' overwrote the setting in /etc/kubernetes/kubelet.conf. One example:
root 6824 1 27 Sep30 ? 1-09:00:24 kubelet --config=/etc/kubernetes/kubelet.conf --bootstrap-kubeconfig=/etc/kubernetes/kubeconfig --kubeconfig=/var/lib/kubelet/kubeconfig --container-runtime=remote --container-runtime-endpoint=/var/run/crio/crio.sock --runtime-cgroups=/system.slice/crio.service --node-labels=node-role.kubernetes.io/master,node.openshift.io/os_id=rhcos --node-ip=192.168.58.47 --minimum-container-ttl-duration=6m0s --cloud-provider= --volume-plugin-dir=/etc/kubernetes/kubelet-plugins/volume/exec --hostname-override= --register-with-taints=node-role.kubernetes.io/master=:NoSchedule --pod-infra-container-image=quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:4a7b6408460148cb73c59677dbc2c261076bc07226c43b0c9192cc70aef5ba62 --system-reserved=cpu=500m,memory=500Mi --v=2 --housekeeping-interval=30s
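A sketch for comparing the two sources on an affected node (node name hypothetical):
$ oc debug node/<node> -- chroot /host grep -A4 systemReserved /etc/kubernetes/kubelet.conf
$ oc debug node/<node> -- chroot /host pgrep -a kubelet | grep -o -- '--system-reserved=[^ ]*'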
This is a clone of issue OCPBUGS-855. The following is the description of the original issue:
—
Description of problem:
When setting the allowedRegistries like the example below, the openshift-samples operator is degraded:
$ oc get image.config.openshift.io/cluster -o yaml
apiVersion: config.openshift.io/v1
kind: Image
metadata:
  annotations:
    release.openshift.io/create-only: "true"
  creationTimestamp: "2020-12-16T15:48:20Z"
  generation: 2
  name: cluster
  resourceVersion: "422284920"
  uid: d406d5a0-c452-4a84-b6b3-763abb51d7a5
spec:
  additionalTrustedCA:
    name: registry-ca
  allowedRegistriesForImport:
  - domainName: quay.io
    insecure: false
  - domainName: registry.redhat.io
    insecure: false
  - domainName: registry.access.redhat.com
    insecure: false
  - domainName: registry.redhat.io/redhat/redhat-operator-index
    insecure: true
  - domainName: registry.redhat.io/redhat/redhat-marketplace-index
    insecure: true
  - domainName: registry.redhat.io/redhat/certified-operator-index
    insecure: true
  - domainName: registry.redhat.io/redhat/community-operator-index
    insecure: true
  registrySources:
    allowedRegistries:
    - quay.io
    - registry.redhat.io
    - registry.rijksapps.nl
    - registry.access.redhat.com
    - registry.redhat.io/redhat/redhat-operator-index
    - registry.redhat.io/redhat/redhat-marketplace-index
    - registry.redhat.io/redhat/certified-operator-index
    - registry.redhat.io/redhat/community-operator-index
$ oc get co
NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication                             4.10.21   True        False         False      5d13h
baremetal                                  4.10.21   True        False         False      450d
cloud-controller-manager                   4.10.21   True        False         False      94d
cloud-credential                           4.10.21   True        False         False      624d
cluster-autoscaler                         4.10.21   True        False         False      624d
config-operator                            4.10.21   True        False         False      624d
console                                    4.10.21   True        False         False      42d
csi-snapshot-controller                    4.10.21   True        False         False      31d
dns                                        4.10.21   True        False         False      217d
etcd                                       4.10.21   True        False         False      624d
image-registry                             4.10.21   True        False         False      94d
ingress                                    4.10.21   True        False         False      94d
insights                                   4.10.21   True        False         False      104s
kube-apiserver                             4.10.21   True        False         False      624d
kube-controller-manager                    4.10.21   True        False         False      624d
kube-scheduler                             4.10.21   True        False         False      624d
kube-storage-version-migrator              4.10.21   True        False         False      31d
machine-api                                4.10.21   True        False         False      624d
machine-approver                           4.10.21   True        False         False      624d
machine-config                             4.10.21   True        False         False      17d
marketplace                                4.10.21   True        False         False      258d
monitoring                                 4.10.21   True        False         False      161d
network                                    4.10.21   True        False         False      624d
node-tuning                                4.10.21   True        False         False      31d
openshift-apiserver                        4.10.21   True        False         False      42d
openshift-controller-manager               4.10.21   True        False         False      22d
openshift-samples                          4.10.21   True        True          True       31d     Samples installation in error at 4.10.21: &errors.errorString{s:"global openshift image configuration prevents the creation of imagestreams using the registry "}
operator-lifecycle-manager                 4.10.21   True        False         False      624d
operator-lifecycle-manager-catalog         4.10.21   True        False         False      624d
operator-lifecycle-manager-packageserver   4.10.21   True        False         False      31d
service-ca                                 4.10.21   True        False         False      624d
storage                                    4.10.21   True        False         False      113d
After applying the fix as described here (https://access.redhat.com/solutions/6547281) it is resolved:
oc patch configs.samples.operator.openshift.io cluster --type merge --patch '{"spec": {"samplesRegistry": "registry.redhat.io"}}'
But according to the BZ this should be fixed in 4.10.3 (https://bugzilla.redhat.com/show_bug.cgi?id=2027745), yet the issue still occurs in our 4.10.21 cluster:
oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.21   True        False         31d     Error while reconciling 4.10.21: the cluster operator openshift-samples is degraded
Version-Release number of selected component (if applicable):
How reproducible:
100%
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
The test local-test is failing on openshift/thanos when upgrading golang version to 1.18 on the branch release-4.11. Please refer to this test log for details: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_thanos/82/pull-ci-openshift-thanos-release-4.11-test-local/1541516614497734656
Version-Release number of selected component (if applicable):
4.11
How reproducible:
See the local-test job on pull requests in the openshift/thanos repository.
Steps to Reproduce:
Actual results:
local-test fails on the following error: level=error ts=2022-06-27T20:28:12.306Z caller=web.go:99 component=web msg="panic while serving request" client=127.0.0.1:37064 url=/api/v1/metadata err="runtime error: invalid memory address or nil pointer dereference" stack="goroutine 278 [running]:\ngithub.com/prometheus/prometheus/web.withStackTracer.func1.1()\n\t/go/pkg/mod/github.com/prometheus/prometheus@v1.8.2-0.20200724121523-657ba532e42f/web/web.go:98 +0x99\npanic({0x1c34760, 0x308ad40})\n\t/usr/lib/golang/src/runtime/panic.go:838 +0x207\nreflect.mapiternext(0xc000458540?)\n\t/usr/lib/golang/src/runtime/map.go:1378 +0x19\ngithub.com/modern-go/reflect2.(*UnsafeMapIterator).UnsafeNext(0x1bd62e0?)\n\t/go/pkg/mod/github.com/modern-go/reflect2@v1.0.1/unsafe_map.go:136 +0x32\ngithub.com/json-iterator/go.(*sortKeysMapEncoder).Encode(0xc000949d10, 0xc0002966b0, 0xc0006c7740)\n\t/go/pkg/mod/github.com/json-iterator/go@v1.1.9/reflect_map.go:297 +0x31a\ngithub.com/json-iterator/go.(*onePtrEncoder).Encode(0xc0008cb120, 0xc000948fc0, 0xc0001139c0?)\n\t/go/pkg/mod/github.com/json-iterator/go@v1.1.9/reflect.go:219 +0x82\ngithub.com/json-iterator/go.(*Stream).WriteVal(0xc0006c7740, {0x1c16da0, 0xc000948fc0})\n\t/go/pkg/mod/github.com/json-iterator/go@v1.1.9/reflect.go:98 +0x158\ngithub.com/json-iterator/go.(*dynamicEncoder).Encode(0xc00094cd58?, 0xfa9a07?, 0xc0006c7758?)\n\t/go/pkg/mod/github.com/json-iterator/go@v1.1.9/reflect_dynamic.go:15 +0x39\ngithub.com/json-iterator/go.(*structFieldEncoder).Encode(0xc000949620, 0x1a4aaba?, 0xc0006c7740)\n\t/go/pkg/mod/github.com/json-iterator/go@v1.1.9/reflect_struct_encoder.go:110 +0x56\ngithub.com/json-iterator/go.(*structEncoder).Encode(0xc000949740, 0x0?, 0xc0006c7740)\n\t/go/pkg/mod/github.com/json-iterator/go@v1.1.9/reflect_struct_encoder.go:158 +0x652\ngithub.com/json-iterator/go.(*OptionalEncoder).Encode(0xc0001afd60?, 0x0?, 0x0?)\n\t/go/pkg/mod/github.com/json-iterator/go@v1.1.9/reflect_optional.go:74 +0xa4\ngithub.com/json-iterator/go.(*onePtrEncoder).Encode(0xc0008cad40, 0xc0006c76e0, 0xc000949020?)\n\t/go/pkg/mod/github.com/json-iterator/go@v1.1.9/reflect.go:219 +0x82\ngithub.com/json-iterator/go.(*Stream).WriteVal(0xc0006c7740, {0x1ac56e0, 0xc0006c76e0})\n\t/go/pkg/mod/github.com/json-iterator/go@v1.1.9/reflect.go:98 +0x158\ngithub.com/json-iterator/go.(*frozenConfig).Marshal(0xc0001afd60, {0x1ac56e0, 0xc0006c76e0})\n\t/go/pkg/mod/github.com/json-iterator/go@v1.1.9/config.go:299 +0xc9\ngithub.com/prometheus/prometheus/web/api/v1.(*API).respond(0xc0002d7a40, {0x229a448, 0xc00022bd60}, {0x1c16da0?, 0xc000948fc0}, {0x0?, 0x7fe5a05a5b20?, 0x20?})\n\t/go/pkg/mod/github.com/prometheus/prometheus@v1.8.2-0.20200724121523-657ba532e42f/web/api/v1/api.go:1437 +0x162\ngithub.com/prometheus/prometheus/web/api/v1.(*API).Register.func1.1({0x229a448, 0xc00022bd60}, 0x7fe5982c5300?)\n\t/go/pkg/mod/github.com/prometheus/prometheus@v1.8.2-0.20200724121523-657ba532e42f/web/api/v1/api.go:273 +0x20b\nnet/http.HandlerFunc.ServeHTTP(0x7fe5982c5300?, {0x229a448?, 0xc00022bd60?}, 0xc00072b270?)\n\t/usr/lib/golang/src/net/http/server.go:2084 +0x2f\ngithub.com/prometheus/prometheus/util/httputil.CompressionHandler.ServeHTTP({{0x2290780?, 0xc000856288?}}, {0x7fe5982c5300?, 0xc00072b270?}, 0x228fb20?)\n\t/go/pkg/mod/github.com/prometheus/prometheus@v1.8.2-0.20200724121523-657ba532e42f/util/httputil/compression.go:90 +0x69\ngithub.com/prometheus/prometheus/web.(*Handler).testReady.func1({0x7fe5982c5300?, 0xc00072b270?}, 
0x7fe5982c5300?)\n\t/go/pkg/mod/github.com/prometheus/prometheus@v1.8.2-0.20200724121523-657ba532e42f/web/web.go:499 +0x39\nnet/http.HandlerFunc.ServeHTTP(0x7fe5982c5300?, {0x7fe5982c5300?, 0xc00072b270?}, 0x50?)\n\t/usr/lib/golang/src/net/http/server.go:2084 +0x2f\ngithub.com/prometheus/client_golang/prometheus/promhttp.InstrumentHandlerResponseSize.func1({0x7fe5982c5300?, 0xc00072b220?}, 0xc000250c00)\n\t/go/pkg/mod/github.com/prometheus/client_golang@v1.6.0/prometheus/promhttp/instrument_server.go:196 +0xa5\nnet/http.HandlerFunc.ServeHTTP(0x228fb80?, {0x7fe5982c5300?, 0xc00072b220?}, 0xc000948ed0?)\n\t/usr/lib/golang/src/net/http/server.go:2084 +0x2f\ngithub.com/prometheus/client_golang/prometheus/promhttp.InstrumentHandlerDuration.func2({0x7fe5982c5300, 0xc00072b220}, 0xc000250c00)\n\t/go/pkg/mod/github.com/prometheus/client_golang@v1.6.0/prometheus/promhttp/instrument_server.go:76 +0xa2\nnet/http.HandlerFunc.ServeHTTP(0x22a4a68?, {0x7fe5982c5300?, 0xc00072b220?}, 0x0?)\n\t/usr/lib/golang/src/net/http/server.go:2084 +0x2f\ngithub.com/prometheus/client_golang/prometheus/promhttp.InstrumentHandlerCounter.func1({0x22a4a68?, 0xc00072b1d0?}, 0xc000250c00)\n\t/go/pkg/mod/github.com/prometheus/client_golang@v1.6.0/prometheus/promhttp/instrument_server.go:100 +0x94\ngithub.com/prometheus/prometheus/web.setPathWithPrefix.func1.1({0x22a4a68, 0xc00072b1d0}, 0xc000250b00)\n\t/go/pkg/mod/github.com/prometheus/prometheus@v1.8.2-0.20200724121523-657ba532e42f/web/web.go:1142 +0x290\ngithub.com/prometheus/common/route.(*Router).handle.func1({0x22a4a68, 0xc00072b1d0}, 0xc000250a00, {0x0, 0x0, 0xc00022c364?})\n\t/go/pkg/mod/github.com/prometheus/common@v0.10.0/route/route.go:83 +0x2ae\ngithub.com/julienschmidt/httprouter.(*Router).ServeHTTP(0xc0001cc780, {0x22a4a68, 0xc00072b1d0}, 0xc000250a00)\n\t/go/pkg/mod/github.com/julienschmidt/httprouter@v1.3.0/router.go:387 +0x82b\ngithub.com/prometheus/common/route.(*Router).ServeHTTP(0x8?, {0x22a4a68?, 0xc00072b1d0?}, 0x203000?)\n\t/go/pkg/mod/github.com/prometheus/common@v0.10.0/route/route.go:121 +0x26\nnet/http.StripPrefix.func1({0x22a4a68, 0xc00072b1d0}, 0xc000250900)\n\t/usr/lib/golang/src/net/http/server.go:2127 +0x330\nnet/http.HandlerFunc.ServeHTTP(0x10?, {0x22a4a68?, 0xc00072b1d0?}, 0x7fe5c8423f18?)\n\t/usr/lib/golang/src/net/http/server.go:2084 +0x2f\nnet/http.(*ServeMux).ServeHTTP(0x413d87?, {0x22a4a68, 0xc00072b1d0}, 0xc000250900)\n\t/usr/lib/golang/src/net/http/server.go:2462 +0x149\ngithub.com/opentracing-contrib/go-stdlib/nethttp.MiddlewareFunc.func5({0x22a3808?, 0xc000a282a0}, 0xc000250200)\n\t/go/pkg/mod/github.com/opentracing-contrib/go-stdlib@v0.0.0-20190519235532-cf7a6c988dc9/nethttp/server.go:140 +0x662\nnet/http.HandlerFunc.ServeHTTP(0x0?, {0x22a3808?, 0xc000a282a0?}, 0xffffffffffffffff?)\n\t/usr/lib/golang/src/net/http/server.go:2084 +0x2f\ngithub.com/prometheus/prometheus/web.withStackTracer.func1({0x22a3808?, 0xc000a282a0?}, 0xc0008ca850?)\n\t/go/pkg/mod/github.com/prometheus/prometheus@v1.8.2-0.20200724121523-657ba532e42f/web/web.go:103 +0x97\nnet/http.HandlerFunc.ServeHTTP(0x0?, {0x22a3808?, 0xc000a282a0?}, 0xc000100000?)\n\t/usr/lib/golang/src/net/http/server.go:2084 +0x2f\nnet/http.serverHandler.ServeHTTP({0xc000c55380?}, {0x22a3808, 0xc000a282a0}, 0xc000250200)\n\t/usr/lib/golang/src/net/http/server.go:2916 +0x43b\nnet/http.(*conn).serve(0xc0000d1540, {0x22a4e18, 0xc00061a0c0})\n\t/usr/lib/golang/src/net/http/server.go:1966 +0x5d7\ncreated by net/http.(*Server).Serve\n\t/usr/lib/golang/src/net/http/server.go:3071 +0x4db\n" 
level=error ts=2022-06-27T20:28:12.306Z caller=stdlib.go:89 component=web caller="http: panic serving 127.0.0.1:37064" msg="runtime error: invalid memory address or nil pointer dereference"
Expected results:
local-test does not fail on the error above.
Additional info:
This is a clone of issue OCPBUGS-1417. The following is the description of the original issue:
—
Description of problem:
Egress IP is not being assigned to the primary interface of the node as per the hostsubnet definition. The issue is observed on an OpenShift cluster hosted in a disconnected AWS environment. The following steps were performed on the AWS side:
- A disconnected VPC was created and OpenShift was installed as per the documentation.
- An Elastic IP could not be used as it is a disconnected environment. The customer identified a free IP from the same subnet as the node and modified the interface of the node to add a secondary IP.
It seems the cloud.network.openshift.io/egress-ipconfig annotation is needed on the node to attach the IP to the primary interface, but it is missing. From the SDN pod log on the same node I could see it complaining about 'an incomplete annotation "cloud.network.openshift.io/egress-ipconfig"'. Will share more details over comments.
Version-Release number of selected component (if applicable):
Openshift 4.10.28
How reproducible:
Always
Steps to Reproduce:
1. Create a disconnected environment on AWS
2. Find a free IP from the subnet where a worker node is hosted and add it as a secondary IP to the NIC of that node
3. Configure hostsubnet and netnamespace on the OpenShift cluster
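For reference, a sketch of step 3 for OpenShift SDN (IP and names are hypothetical):
$ oc patch hostsubnet <worker-node> --type=merge -p '{"egressIPs":["10.0.1.99"]}'
$ oc patch netnamespace <project> --type=merge -p '{"egressIPs":["10.0.1.99"]}'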
Actual results:
- Egress IP is not being attached to the primary interface of the node for which the hostsubnet has been configured
Expected results:
- Egress IP should get configured without any issue.
Additional info:
This is a clone of issue OCPBUGS-4072. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-4026. The following is the description of the original issue:
—
Description of problem:
There is an endless re-render loop, and the browser feels slow and eventually gets stuck when opening the add page or the topology.
Also saw endless API calls to /api/kubernetes/apis/binding.operators.coreos.com/v1alpha1/bindablekinds/bindable-kinds
Version-Release number of selected component (if applicable):
1. Console UI 4.12-4.13 (master)
2. Service Binding Operator (tested with 1.3.1)
How reproducible:
Always with installed SBO
But the "stuck feeling" depends on the browser (Firefox feels more stuck) and your locale machine power
Steps to Reproduce:
1. Install Service Binding Operator
2. Create or update the BindableKinds resource "bindable-kinds"
apiVersion: binding.operators.coreos.com/v1alpha1
kind: BindableKinds
metadata:
  name: bindable-kinds
3. Open the browser console log
4. Open the console UI and navigate to the add page
Actual results:
1. Saw endless API calls to /api/kubernetes/apis/binding.operators.coreos.com/v1alpha1/bindablekinds/bindable-kinds
2. Browser feels slow and get stuck after some time
3. The page crashes after some time
Expected results:
1. The API call should be called just once
2. The add page should just work without feeling laggy
3. No crash
Additional info:
Get introduced after we watching the bindable-kinds resource with https://github.com/openshift/console/pull/11161
It looks like this happens only if the SBO is installed and the bindable-kinds resource exists but doesn't contain any status.
The status lists all available bindable resource types. I could not reproduce this by installing and uninstalling an operator, but you can manually create or update this resource as mentioned above.
This is a clone of issue OCPBUGS-5879. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-5505. The following is the description of the original issue:
—
The upgradeability check in CVO is throttled (essentially cached) for a nondeterministic period of time, the same as the minimal sync period computed at runtime. The period can be up to 4 minutes, determined at CVO start time as 2 minutes * (a random factor between 0 and 1, plus 1). We agreed with Trevor that such throttling is unnecessarily aggressive (the check is not that expensive). It also causes CI flakes, because the matching test only has a 3 minute timeout. Additionally, the nondeterminism and the longer throttling make the UX worse, because actions done in the cluster may have their observable effect delayed.
discovered in 4.10 -> 4.11 upgrade jobs
The test seems to flake in ~10% of 4.10->4.11 Azure jobs (sippy). There does not seem to be that much impact on non-Azure jobs, though, which is a bit weird.
Inspect the CVO log and E2E logs from failing jobs with the provided [^check-cvo.py] helper:
$ ./check-cvo.py cvo.log && echo PASS || echo FAIL
Preferably, inspect CVO logs of clusters that just underwent an upgrade (upgrades makes the original problematic behavior more likely to surface)
$ ./check-cvo.py openshift-cluster-version_cluster-version-operator-5b6966c474-g4kwk_cluster-version-operator.log && echo PASS || echo FAIL
FAIL: Cache hit at 11:59:55.332339 0:03:13.665006 after check at 11:56:41.667333
FAIL: Cache hit at 12:06:22.663215 0:03:13.664964 after check at 12:03:08.998251
FAIL: Cache hit at 12:12:49.997119 0:03:13.665598 after check at 12:09:36.331521
FAIL: Cache hit at 12:19:17.328510 0:03:13.664906 after check at 12:16:03.663604
FAIL: Cache hit at 12:25:44.662290 0:03:13.666759 after check at 12:22:30.995531
Upgradeability checks: 5
Upgradeability check cache hits: 12
FAIL
Note that the bug is probabilistic, so not all unfixed clusters will exhibit the behavior. My guess of the incidence rate is about 30-40%.
$ ./check-cvo.py openshift-cluster-version_cluster-version-operator-7b8f85d455-mk9fs_cluster-version-operator.log && echo PASS || echo FAIL
Upgradeability checks: 12
Upgradeability check cache hits: 11
PASS
The actual numbers are not relevant; the lack of FAIL lines is (unless the upgradeability check count is zero, which means the test is not conclusive; the script warns about that).
$ curl --silent https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.11-upgrade-from-stable-4.10-e2e-azure-upgrade/1607602927633960960/artifacts/e2e-azure-upgrade/gather-extra/artifacts/pods/openshift-cluster-version_cluster-version-operator-7b7d4b5bbd-zjqdt_cluster-version-operator.log | grep upgradeable.go
...
I1227 06:50:59.023190 1 upgradeable.go:122] Cluster current version=4.10.46
I1227 06:50:59.042735 1 upgradeable.go:42] Upgradeable conditions were recently checked, will try later.
I1227 06:51:14.024345 1 upgradeable.go:42] Upgradeable conditions were recently checked, will try later.
I1227 06:53:23.080768 1 upgradeable.go:42] Upgradeable conditions were recently checked, will try later.
I1227 06:56:59.366010 1 upgradeable.go:122] Cluster current version=4.11.0-0.ci-2022-12-26-193640
$ curl --silent https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.11-upgrade-from-stable-4.10-e2e-azure-upgrade/1607602927633960960/artifacts/e2e-azure-upgrade/openshift-e2e-test/artifacts/e2e.log | grep 'Kubernetes 1.25 and therefore OpenShift 4.12'
Dec 27 06:51:15.319: INFO: Waiting for Upgradeable to be AdminAckRequired for "Kubernetes 1.25 and therefore OpenShift 4.12 remove several APIs which require admin consideration. Please see the knowledge article https://access.redhat.com/articles/6955381 for details and instructions."
...
Dec 27 06:54:15.413: FAIL: Error while waiting for Upgradeable to complain about AdminAckRequired with message "Kubernetes 1.25 and therefore OpenShift 4.12 remove several APIs which require admin consideration. Please see the knowledge article https://access.redhat.com/articles/6955381 for details and instructions.": timed out waiting for the condition
The test passes. Also, the "Upgradeable conditions were recently checked, will try later." messages in CVO logs should never occur after a deterministic, short amount of time (I propose 1 minute) after upgradeability was checked.
I tested the throttling period in https://github.com/openshift/cluster-version-operator/pull/880. With a period of 15m, the test pass rate was 4 of 9. With a period of 1m, the test did not fail at all.
Some context in Slack thread
See these threads https://coreos.slack.com/archives/G01F05P2PTL/p1645982017061749?thread_ts=1645970469.871559&cid=G01F05P2PTL for more information
An RW mutex was introduced to the project auth cache with https://github.com/openshift/openshift-apiserver/pull/267, taking exclusive access during cache syncs. On clusters with extremely high object counts for namespaces and RBAC, syncs appear to be extremely slow (on the order of several minutes). The project LIST handler acquires the same mutex in shared mode as part of its critical path.
This is a clone of issue OCPBUGS-4311. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-4305. The following is the description of the original issue:
—
Description of problem:
Please add an option to DISABLE debug in ironic-api. Presently it is enabled by default and there is no way to disable it or reduce the log level.
https://github.com/metal3-io/ironic-image/blob/main/ironic-config/ironic.conf.j2#L3
Version-Release number of selected component (if applicable): none
How reproducible: Every time
Steps to Reproduce:
Please check source code here: https://github.com/metal3-io/ironic-image/blob/main/ironic-config/ironic.conf.j2#L3
It is enabled by default and there is no way to disable it or reduce log level
Actual results:
Please check case 03371411; the log file grew to 409 GB.
Expected results: Need a way to disable debug
Additional info: Case 03371411. A cluster must gather and log file can be found in the case.
Description of problem: This is a follow-up to OCPBUGS-2795 and OCPBUGS-2941.
The installer fails to destroy the cluster when the OpenStack object storage omits 'content-type' from responses. This can happen on responses with HTTP status code 204, where a reverse proxy truncates content-related headers (see this nginx bug report). In such cases, the installer errors with:
level=error msg=Bulk deleting of container "5ifivltb-ac890-chr5h-image-registry-fnxlmmhiesrfvpuxlxqnkoxdbl" objects failed: Cannot extract names from response with content-type: []
Listing container objects suffers from the same issue as listing the containers, and this one isn't fixed in the latest versions of gophercloud. I've reported https://github.com/gophercloud/gophercloud/issues/2509 and am fixing it with https://github.com/gophercloud/gophercloud/issues/2510; however, we likely won't be able to backport the bump to gophercloud master back to release-4.8, so we'll have to look for alternatives.
I'm setting the priority to critical as it's causing all our jobs to fail in master.
Version-Release number of selected component (if applicable):
4.8.z
How reproducible:
Likely not happening in customer environments where Swift is exposed directly. We're seeing the issue in our CI where we're using a non-RHOSP managed cloud.
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
During OCP multinode spoke cluster creation, agent provisioning is stuck on "configuring" because the machine-config service is crashing on the node.
After a restart, the service still fails with:
Can't read link "/var/lib/containers/storage/overlay/l/V2OP2CCVMKSOHK2XICC546DUCG" because it does not exist. A storage corruption might have occurred, attempting to recreate the missing symlinks. It might be best wipe the storage to avoid further errors due to storage corruption.
Version-Release number of selected component (if applicable):
Podman 4.0.2 +
How reproducible:
sometimes
Steps to Reproduce:
1. Deploy a multinode spoke (iPXE + boot order)
2.
3.
Actual results:
4 agents in done state and 1 is in "configuring"
Expected results:
all agents are in "done" state
Additional info:
issue mentioned in https://github.com/containers/podman/issues/14003
Fix: https://github.com/containers/storage/issues/1136
This is a clone of issue OCPBUGS-7830. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-7729. The following is the description of the original issue:
—
Description of problem:
Etcd's liveness probe should be removed.
Version-Release number of selected component (if applicable):
4.11
Additional info:
When the master hosts hit CPU load, this can cause a cascading restart loop for etcd and kube-apiserver due to the etcd liveness probes failing. Due to this loop, load on the masters stays high because the API server and controllers restart over and over again. There is no reason for etcd to have a liveness probe; we removed this probe in 3.11 due to issues like this.
This is a clone of issue OCPBUGS-11404. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-11333. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-10690. The following is the description of the original issue:
—
Description of problem:
According to PR https://github.com/openshift/cluster-monitoring-operator/pull/1824, the startupProbe for the UWM Prometheus and the platform Prometheus should be 1 hour, but the startupProbe for the UWM Prometheus is still 15m after UWM is enabled. The platform Prometheus does not have the issue; its startupProbe is increased to 1 hour.
$ oc -n openshift-user-workload-monitoring get pod prometheus-user-workload-0 -oyaml | grep startupProbe -A20
startupProbe:
  exec:
    command:
    - sh
    - -c
    - if [ -x "$(command -v curl)" ]; then exec curl --fail http://localhost:9090/-/ready; elif [ -x "$(command -v wget)" ]; then exec wget -q -O /dev/null http://localhost:9090/-/ready; else exit 1; fi
  failureThreshold: 60
  periodSeconds: 15
  successThreshold: 1
  timeoutSeconds: 3
...
$ oc -n openshift-monitoring get pod prometheus-k8s-0 -oyaml | grep startupProbe -A20
startupProbe:
  exec:
    command:
    - sh
    - -c
    - if [ -x "$(command -v curl)" ]; then exec curl --fail http://localhost:9090/-/ready; elif [ -x "$(command -v wget)" ]; then exec wget -q -O /dev/null http://localhost:9090/-/ready; else exit 1; fi
  failureThreshold: 240
  periodSeconds: 15
  successThreshold: 1
  timeoutSeconds: 3
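For reference, the effective startup window is failureThreshold × periodSeconds: 60 × 15s = 900s = 15m for the UWM Prometheus versus 240 × 15s = 3600s = 1h for the platform Prometheus, which matches the observed behavior.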
Version-Release number of selected component (if applicable):
4.13.0-0.nightly-2023-03-19-052243
How reproducible:
always
Steps to Reproduce:
1. Enable UWM, then check the startupProbe for the UWM Prometheus and the platform Prometheus
2.
3.
Actual results:
startupProbe for UWM prometheus is still 15m
Expected results:
startupProbe for UWM prometheus should be 1 hour
Additional info:
Since the startupProbe for the platform Prometheus is increased to 1 hour, and there is no similar bug for the UWM Prometheus, closing this issue as "won't fix" is OK.
Description of problem:
Cluster running 4.10.52 had three aws-ebs-csi-driver-node pods begin to consume multiple GB of memory, causing heavy node memory pressure as the pods have no memory limit. All other aws-ebs-csi-driver-node pods were still in the 50-70MB range:
NAME                                            CPU(cores)   MEMORY(bytes)
aws-ebs-csi-driver-controller-59867579b-d6s2q   0m           397Mi
aws-ebs-csi-driver-controller-59867579b-t4wgq   0m           276Mi
aws-ebs-csi-driver-node-4rmvk                   0m           53Mi
aws-ebs-csi-driver-node-5799f                   0m           50Mi
aws-ebs-csi-driver-node-6dpvg                   0m           59Mi
aws-ebs-csi-driver-node-6ldzk                   0m           65Mi
aws-ebs-csi-driver-node-6mbk5                   0m           54Mi
aws-ebs-csi-driver-node-bkvsr                   0m           50Mi
aws-ebs-csi-driver-node-c2fb2                   0m           62Mi
aws-ebs-csi-driver-node-f422m                   0m           61Mi
aws-ebs-csi-driver-node-lwzbb                   6m           1940Mi
aws-ebs-csi-driver-node-mjznt                   0m           53Mi
aws-ebs-csi-driver-node-pczsj                   0m           62Mi
aws-ebs-csi-driver-node-pmskn                   0m           3493Mi
aws-ebs-csi-driver-node-qft8w                   0m           68Mi
aws-ebs-csi-driver-node-v5bpx                   11m          2076Mi
aws-ebs-csi-driver-node-vn8km                   0m           84Mi
aws-ebs-csi-driver-node-ws6hx                   0m           73Mi
aws-ebs-csi-driver-node-xsk7k                   0m           59Mi
aws-ebs-csi-driver-node-xzwlh                   0m           55Mi
aws-ebs-csi-driver-operator-8c5ffb6d4-fk6zk     5m           88Mi
Deleting the pods caused them to recreate, with normal memory consumption levels.
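A sketch for spotting the outliers and recycling them (the namespace is an assumption based on where the AWS EBS CSI driver pods run):
$ oc adm top pods -n openshift-cluster-csi-drivers --sort-by=memory
$ oc -n openshift-cluster-csi-drivers delete pod aws-ebs-csi-driver-node-pmskn aws-ebs-csi-driver-node-v5bpx aws-ebs-csi-driver-node-lwzbb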
Version-Release number of selected component (if applicable):
4.10.52
How reproducible:
Unknown
Description of problem:
See https://bugzilla.redhat.com/show_bug.cgi?id=2104275
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
This is a clone of issue OCPBUGS-8491. The following is the description of the original issue:
—
Description of problem:
Image registry pods panic while deploying OCP in ap-southeast-4 AWS region
Version-Release number of selected component (if applicable):
4.12.0
How reproducible:
Deploy OCP in AWS ap-southeast-4 region
Steps to Reproduce:
Deploy OCP in AWS ap-southeast-4 region
Actual results:
panic: Invalid region provided: ap-southeast-4
Expected results:
Image registry pods should come up with no errors
Additional info:
This is a clone of issue OCPBUGS-675. The following is the description of the original issue:
—
Description of problem:
A cluster hit a panic in etcd operator in bootstrap:
I0829 14:46:02.736582 1 controller_manager.go:54] StaticPodStateController controller terminated
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x1e940ab]
goroutine 2701 [running]:
github.com/openshift/cluster-etcd-operator/pkg/etcdcli.checkSingleMemberHealth({0x29374c0, 0xc00217d920}, 0xc0021fb110)
github.com/openshift/cluster-etcd-operator/pkg/etcdcli/health.go:135 +0x34b
github.com/openshift/cluster-etcd-operator/pkg/etcdcli.getMemberHealth.func1()
github.com/openshift/cluster-etcd-operator/pkg/etcdcli/health.go:58 +0x7f
created by github.com/openshift/cluster-etcd-operator/pkg/etcdcli.getMemberHealth
github.com/openshift/cluster-etcd-operator/pkg/etcdcli/health.go:54 +0x2ac
Version-Release number of selected component (if applicable):
How reproducible:
Pulled up a 4.12 cluster and hit panic during bootstrap
Steps to Reproduce:
1. 2. 3.
Actual results:
panic as above
Expected results:
no panic
Additional info:
This is a clone of issue OCPBUGS-6661. The following is the description of the original issue:
—
Description of problem:
The CRL list is capped at 1MB due to the configmap max size. If multiple public CRLs are needed for the ingress controller, the CRL PEM file will be over 1MB.
Version-Release number of selected component (if applicable):
How reproducible:
100%
Steps to Reproduce:
1. Create a CRL configmap with the following distribution points:
Issuer: C=US, O=DigiCert Inc, CN=DigiCert Global G2 TLS RSA SHA256 2020 CA1
Subject: SOME SIGNED CERT
X509v3 CRL Distribution Points:
  Full Name:
    URI:http://crl3.digicert.com/DigiCertGlobalG2TLSRSASHA2562020CA1-2.crl
# curl -o DigiCertGlobalG2TLSRSASHA2562020CA1-2.crl http://crl3.digicert.com/DigiCertGlobalG2TLSRSASHA2562020CA1-2.crl
# openssl crl -in DigiCertGlobalG2TLSRSASHA2562020CA1-2.crl -inform DER -out DigiCertGlobalG2TLSRSASHA2562020CA1-2.pem
# du -bsh DigiCertGlobalG2TLSRSASHA2562020CA1-2.pem
604K DigiCertGlobalG2TLSRSASHA2562020CA1-2.pem
I still need to find more intermediate CRLs to grow this.
Actual results:
2023-01-25T13:45:01.443Z ERROR operator.init controller/controller.go:273 Reconciler error {"controller": "crl", "object": {"name":"custom","namespace":"openshift-ingress-operator"}, "namespace": "openshift-ingress-operator", "name": "custom", "reconcileID": "d49d9b96-d509-4562-b3d9-d4fc315226c0", "error": "failed to ensure client CA CRL configmap for ingresscontroller openshift-ingress-operator/custom: failed to update configmap: ConfigMap \"router-client-ca-crl-custom\" is invalid: []: Too long: must have at most 1048576 bytes"}
Expected results:
First, be able to create a configmap where only the data counts toward the 1MB max (see additional info below for more details); second, some way to compress or allow a large CRL list that would be larger than 1MB.
Additional info:
Using only this CRL, at only 600K, still causes the issue, which could be due to the `last-applied-configuration` annotation on the configmap. This is added since we do an apply operation (update) on the configmap; I am not sure if this is counting towards the 1MB max. https://github.com/openshift/cluster-ingress-operator/blob/release-4.10/pkg/operator/controller/crl/crl_configmap.go#L295 Not sure if we could just replace the configmap.
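A sketch to estimate how much that annotation contributes to the object size (configmap name taken from the error above; requires jq):
$ oc -n openshift-ingress-operator get configmap router-client-ca-crl-custom -o json | wc -c
$ oc -n openshift-ingress-operator get configmap router-client-ca-crl-custom -o json | jq 'del(.metadata.annotations["kubectl.kubernetes.io/last-applied-configuration"])' | wc -c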
This is a backport from https://issues.redhat.com/browse/OCPBUGS-1044
Description of problem:
https://github.com/prometheus/node_exporter/issues/2299
The node exporter pod, when run on a bare metal worker using an AMD EPYC CPU, crashes and fails to start up, with the following error message:
State:          Waiting
  Reason:       CrashLoopBackOff
Last State:     Terminated
  Reason:       Error
  Message:      05.145Z caller=node_exporter.go:115 level=info collector=tapestats
ts=2022-09-07T20:25:05.145Z caller=node_exporter.go:115 level=info collector=textfile
ts=2022-09-07T20:25:05.145Z caller=node_exporter.go:115 level=info collector=thermal_zone
ts=2022-09-07T20:25:05.146Z caller=node_exporter.go:115 level=info collector=time
ts=2022-09-07T20:25:05.146Z caller=node_exporter.go:115 level=info collector=timex
ts=2022-09-07T20:25:05.146Z caller=node_exporter.go:115 level=info collector=udp_queues
ts=2022-09-07T20:25:05.146Z caller=node_exporter.go:115 level=info collector=uname
ts=2022-09-07T20:25:05.146Z caller=node_exporter.go:115 level=info collector=vmstat
ts=2022-09-07T20:25:05.146Z caller=node_exporter.go:115 level=info collector=xfs
ts=2022-09-07T20:25:05.146Z caller=node_exporter.go:115 level=info collector=zfs
ts=2022-09-07T20:25:05.146Z caller=node_exporter.go:199 level=info msg="Listening on" address=127.0.0.1:9100
ts=2022-09-07T20:25:05.146Z caller=tls_config.go:195 level=info msg="TLS is disabled." http2=false
panic: "node_rapl_package-0-die-0_joules_total" is not a valid metric name
Apparently this is a known issue (see the GitHub link) and was fixed in a later upstream version.
Version-Release number of selected component (if applicable):
4.11.0
How reproducible:
Every-time
Steps to Reproduce:
1. Provision a bare metal node using an AMD EPYC CPU
2. Node-exporter pods that try to start on the node will crash with the error message
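A sketch for confirming the crashing pods land on the EPYC node (the label and namespace are standard for the in-cluster monitoring stack, but treat them as assumptions):
$ oc -n openshift-monitoring get pods -l app.kubernetes.io/name=node-exporter -o wide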
Actual results:
Node-exporter pods cannot run on the new nodes
Expected results:
Node exporter pods should be able to start up and run like on any other node
Additional info:
As mentioned above, this issue was tracked and fixed in a later upstream version of node-exporter (https://github.com/prometheus/node_exporter/issues/2299). Would we be able to get the fixed version pulled into 4.11?
This is a clone of issue OCPBUGS-3265. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-3172. The following is the description of the original issue:
—
Customer is trying to install the Logging operator, which appears to attempt to install a dynamic plugin. The operator installation fails in the console because permissions aren't available to "patch resource consoles".
We shouldn't block operator installation if permission issues prevent dynamic plugin installation.
This is an OSD cluster, presumably for a customer with "cluster-admin", although it may be a pared-down permission set called "dedicated-admin".
See https://docs.google.com/document/d/1hYS-bm6aH7S6z7We76dn9XOFcpi9CGYcGoJys514YSY/edit for permissions investigation work on OSD
This is a clone of issue OCPBUGS-10514. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-10221. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-5469. The following is the description of the original issue:
—
Description of problem:
When changing channels it's possible that multiple new conditional update risks will need to be evaluated. For instance, a cluster running 4.10.34 in a 4.10 channel today only has to evaluate `OpenStackNodeCreationFails`, but when the channel is changed to a 4.11 channel, multiple new risks require evaluation, and the evaluation of new risks is throttled at one every 10 minutes. This means that if there are three new risks, it may take up to 30 minutes after the channel has changed for the full set of conditional updates to be computed. This leads to a perception that no update paths are recommended, because most users will not wait 30 minutes; they expect immediate feedback.
Version-Release number of selected component (if applicable):
4.10.z, 4.11.z, 4.12, 4.13
How reproducible:
100%
Steps to Reproduce:
1. Install 4.10.34
2. Switch from stable-4.10 to stable-4.11
3.
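A sketch of step 2 from the CLI (assuming the `oc adm upgrade channel` subcommand available in recent oc versions):
$ oc adm upgrade channel stable-4.11
$ oc adm upgrade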
Actual results:
Observe no recommended updates for 10-20 minutes because all available paths to 4.11 have a risk associated with them
Expected results:
Risks are computed in a timely manner for an interactive UX, let's say < 10s
Additional info:
This was intentional in the design; we didn't want risks to continuously re-evaluate or overwhelm the monitoring stack. However, we didn't anticipate that we'd have a long-standing pile of risks, or realize how confusing the user experience would be. We intend to work around this in the deployed fleet by converting older risks from `type: promql` to `type: Always`, avoiding the evaluation period but preserving the notification. While this may lead customers to believe they're exposed to a risk they may not be, as long as the set of outstanding risks to the latest version is limited to no more than one, it's likely no one will notice. All 4.10 and 4.11 clusters currently have a clear path toward a relatively recent 4.10.z or 4.11.z with no more than one risk to be evaluated.
This is a clone of issue OCPBUGS-11989. The following is the description of the original issue:
—
Description of problem:
Customer reports that when trying to create an application using the "Import from Git" workflow, the "Create" button at the very bottom of the form stays inactive. You can observe the issue in the video shared via Google Drive here (timestamp 00:35): https://drive.google.com/file/d/1GEA_TF5vV_ai9YDMZ3uzwEwYKkp_CY8r/view?usp=sharing The customer can work around the issue by selecting another Import Strategy than "Builder Image" and then switching back to "Builder Image" (timestamp 00:49).
Version-Release number of selected component (if applicable):
OpenShift Container Platform 4.11.25
How reproducible:
Always
Steps to Reproduce:
1. Click the "+Add" button on the left menu 2. Enter a Git repository URL 3. Select "Bitbucket" as Git type 3a. If necessary, select a "Source Secret" 4. For "Import Strategy", select "Builder Image" and select one of the available images 5. In "Application" select "Create application" 6. For "Application Name" and "Name" insert any valid value
Actual results:
"Create" button at the bottom of the form is inactive and cannot be clicked. Changing the Import Strategy to something else and then back to "Builder Image" makes the button active.
Expected results:
Button is active after filling out all the required form fields
Additional info:
* Video of the issue provided: https://drive.google.com/file/d/1GEA_TF5vV_ai9YDMZ3uzwEwYKkp_CY8r/view?usp=sharing
This is a clone of issue OCPBUGS-1522. The following is the description of the original issue:
—
Description of problem:
A normal user cannot open the debug container for the pods (in CrashLoopBackOff) they created, and gets the error message: pods "<pod name>" is forbidden: cannot set blockOwnerDeletion if an ownerReference refers to a resource you can't set finalizers on: , <nil>
Version-Release number of selected component (if applicable):
4.12.0-0.nightly-2022-09-20-040107, 4.11.z, 4.10.z
How reproducible:
Always
Steps to Reproduce:
1. Log in to OCP as a normal user, e.g. flexy-htpasswd-provider
2. Create a project, go to the Developer perspective -> +Add page
3. Click "Import from Git", and provide the data below to get a pod in CrashLoopBackOff state
   Git Repo URL: https://github.com/sclorg/nodejs-ex.git
   Name: nodejs-ex-git
   Run command: star a wktw
4. Navigate to the /k8s/ns/<project name>/pods page, find the pod with CrashLoopBackOff status, and go to its details page -> Logs tab
5. Click the "Debug container" link
6. Check whether the debug container can be opened
Actual results:
6. An error message is shown on the page, and the user cannot open the debug container via the UI:
pods "nodejs-ex-git-6dd986d8bd-9h2wj-debug-tkqk2" is forbidden: cannot set blockOwnerDeletion if an ownerReference refers to a resource you can't set finalizers on: , <nil>
Expected results:
6. A normal user can use the debug container without any error message
Additional info:
The debug container can be created successfully for the normal user via the command line:
$ oc debug <crashloopbackoff pod name> -n <project name>
The path used by --rotated-pod-logs to gather the rotated pod logs from the /var/log/pods node folder via /api/v1/nodes/${NODE}/proxy/logs/${LOG_PATH} is only valid for regular pods, not for static pods.
The main problem is that, while normal pods have their rotated logs at /var/log/pods/${POD_NAME}_${POD_UID_IN_API}/${CONTAINER_NAME}, static pods have them at /var/log/pods/${POD_NAME}_${CONFIG_HASH}/${CONTAINER_NAME}, because the UID cannot be known at the time the static pod is born (static pods are created by the kubelet before being registered in the kube-apiserver, and the UID is assigned by the kube-apiserver).
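For example, the two layouts differ only in the suffix after the second underscore (hypothetical paths):
/var/log/pods/openshift-dns_dns-default-abc12_8f2c41d6-1e0f-4e6b-9f3a-0123456789ab/dns/0.log   <- regular pod: pod UID from the kube-apiserver
/var/log/pods/openshift-etcd_etcd-master-0.example.net_a1b2c3d4e5f67890/etcd/0.log             <- static pod: kubelet config hash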
Version-Release number of selected component (if applicable):
4.10
How reproducible:
Always if there are static pods.
Steps to Reproduce:
1. oc adm inspect --rotated-pod-logs ns/openshift-etcd (or any other project with static pods).
Actual results:
error: errors occurred while gathering data: one or more errors occurred while gathering pod-specific data for namespace: openshift-etcd [one or more errors occurred while gathering container data for pod etcd-master-0.example.net: the server could not find the requested resource, one or more errors occurred while gathering container data for pod etcd-master-1.example.net: the server could not find the requested resource, one or more errors occurred while gathering container data for pod etcd-master-2.example.net: the server could not find the requested resource]
Expected results:
No errors like the ones above, and rotated pod logs gathered, if present.
Despite being marked as experimental, --rotated-pod-logs is used by must-gather, so this issue can easily be reproduced by just running a default must-gather. I focused on bare oc adm inspect reproducers for simplicity.
Just like kube-proxy, ovnk should expose port 10256 on every node, so that cloud LBs can send health checks and know which nodes are available. This is relevant for services with externalTrafficPolicy=Cluster.
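As a sketch of what a cloud LB health check would then be able to do against every node (assuming the endpoint mirrors kube-proxy's /healthz semantics on this port; the node IP is illustrative):
$ curl -s -o /dev/null -w '%{http_code}\n' http://10.0.1.5:10256/healthz
200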
Description of problem:
Manual backport of:
* https://github.com/openshift/cluster-dns-operator/pull/336
* https://github.com/openshift/cluster-dns-operator/pull/339
Version-Release number of selected component (if applicable):
4.11
Description of problem:
The oc new-app command using a private Git repository no longer works with oc v4.10. Specifically, the private Git repository authenticates over SSH, and the failure occurs when no image stream is specified in the command, so the language detection step is required. Here is an example command:
oc new-app git@github.com:scottishkiwi/test-oc-newapp.git --source-secret github-repo-key --name test-app
Version-Release number of selected component (if applicable):
oc v4.10
How reproducible:
Easily reproducible.
Steps to Reproduce:
1. Download v4.10 of the oc tool and add to local executable path as 'oc':
https://access.redhat.com/downloads/content/290/ver=4.10/rhel---8/4.10.10/x86_64/product-software
➜ ~ oc version
Client Version: 4.10.9
2. Setup a private Github repository
3. Add SSH public key as a deploy key to the private Github repository
4. Push some empty test file like index.php to the private repository (used for language detection)
5. Create a new OpenShift project:
➜ ~ oc new-project test-oc-newapp
6. Create a secret to hold the private key of the SSH key pair
➜ ~ oc create secret generic github-repo-key --from-file=ssh-privatekey=/Users/daniel/test/github-repo --type=kubernetes.io/ssh-auth
secret/github-repo-key created
7. Enable access to the secret from the builder service account:
➜ ~ oc secrets link builder github-repo-key
8. Create a new application using the source secret:
oc new-app git@github.com:scottishkiwi/test-oc-newapp.git --source-secret github-repo-key --name test-app
Actual results:
➜ ~ oc new-app git@github.com:scottishkiwi/test-oc-newapp.git --source-secret github-repo-key --name test-app
warning: Cannot check if git requires authentication.
error: local file access failed with: stat git@github.com:scottishkiwi/test-oc-newapp.git: no such file or directory
error: unable to locate any images in image streams, templates loaded in accessible projects, template files, local docker images with name "git@github.com:scottishkiwi/test-oc-newapp.git"
Argument 'git@github.com:scottishkiwi/test-oc-newapp.git' was classified as an image, image~source, or loaded template reference.
The 'oc new-app' command will match arguments to the following types:
1. Images tagged into image streams in the current project or the 'openshift' project
--allow-missing-images can be used to point to an image that does not exist yet.
Expected results (with oc v4.8):
➜ ~ oc-4.8 version
Client Version: 4.8.37
➜ oc-4.8 new-app git@github.com:scottishkiwi/test-oc-newapp.git --source-secret github-repo-key --name test-app
warning: Cannot check if git requires authentication.
--> Found image 22f1bf3 (4 weeks old) in image stream "openshift/php" under tag "7.4-ubi8" for "php"
Apache 2.4 with PHP 7.4
-----------------------
PHP 7.4 available as container is a base platform for building and running various PHP 7.4 applications and frameworks. PHP is an HTML-embedded scripting language. PHP attempts to make it easy for developers to write dynamically generated web pages. PHP also offers built-in database integration for several commercial and non-commercial database management systems, so writing a database-enabled webpage with PHP is fairly simple. The most common use of PHP coding is probably as a replacement for CGI scripts.
Tags: builder, php, php74, php-74
--> Creating resources ...
imagestream.image.openshift.io "test-app" created
buildconfig.build.openshift.io "test-app" created
deployment.apps "test-app" created
service "test-app" created
--> Success
Build scheduled, use 'oc logs -f buildconfig/test-app' to track its progress.
Application is not exposed. You can expose services to the outside world by executing one or more of the commands below:
'oc expose service/test-app'
Run 'oc status' to view your app.
Additional info:
Also tested with oc v4.9 and works as expected:
➜ ~ oc-4.9 version
Client Version: 4.9.29
➜ ~ oc-4.9 new-app git@github.com:scottishkiwi/test-oc-newapp.git --source-secret github-repo-key --name test-app
warning: Cannot check if git requires authentication.
--> Found image 22f1bf3 (4 weeks old) in image stream "openshift/php" under tag "7.4-ubi8" for "php"
Apache 2.4 with PHP 7.4
-----------------------
PHP 7.4 available as container is a base platform for building and running various PHP 7.4 applications and frameworks. PHP is an HTML-embedded scripting language. PHP attempts to make it easy for developers to write dynamically generated web pages. PHP also offers built-in database integration for several commercial and non-commercial database management systems, so writing a database-enabled webpage with PHP is fairly simple. The most common use of PHP coding is probably as a replacement for CGI scripts.
Tags: builder, php, php74, php-74
--> Creating resources ...
buildconfig.build.openshift.io "test-app" created
deployment.apps "test-app" created
service "test-app" created
--> Success
Build scheduled, use 'oc logs -f buildconfig/test-app' to track its progress.
Application is not exposed. You can expose services to the outside world by executing one or more of the commands below:
'oc expose service/test-app'
Run 'oc status' to view your app.
➜ ~ oc status
In project dan-test-oc-newapp on server https://api.dsquirre.2b7w.p1.openshiftapps.com:6443
svc/test-app - 172.30.238.75 ports 8080, 8443
deployment/test-app deploys istag/test-app:latest <-
bc/test-app source builds git@github.com:scottishkiwi/test-oc-newapp.git on openshift/php:7.4-ubi8
deployment #3 running for 38 minutes - 1 pod
deployment #2 deployed 38 minutes ago
deployment #1 deployed 38 minutes ago
1 info identified, use 'oc status --suggest' to see details.
Description of problem:
Created two egressIP objects; the egressIPs in one of them cannot be applied successfully.
Version-Release number of selected component (if applicable):
4.11.0-0.nightly-2022-11-27-164248
How reproducible:
Happens frequently in the automated case.
Steps to Reproduce:
1. Label two nodes as egress nodes

% oc get nodes -o wide
NAME                                   STATUS   ROLES    AGE    VERSION           INTERNAL-IP   EXTERNAL-IP   OS-IMAGE                                                        KERNEL-VERSION                 CONTAINER-RUNTIME
huirwang-1128a-s6j6t-master-0          Ready    master   154m   v1.24.6+5658434   10.0.0.8      <none>        Red Hat Enterprise Linux CoreOS 411.86.202211232221-0 (Ootpa)  4.18.0-372.32.1.el8_6.x86_64   cri-o://1.24.3-6.rhaos4.11.gitc4567c0.el8
huirwang-1128a-s6j6t-master-1          Ready    master   154m   v1.24.6+5658434   10.0.0.7      <none>        Red Hat Enterprise Linux CoreOS 411.86.202211232221-0 (Ootpa)  4.18.0-372.32.1.el8_6.x86_64   cri-o://1.24.3-6.rhaos4.11.gitc4567c0.el8
huirwang-1128a-s6j6t-master-2          Ready    master   153m   v1.24.6+5658434   10.0.0.6      <none>        Red Hat Enterprise Linux CoreOS 411.86.202211232221-0 (Ootpa)  4.18.0-372.32.1.el8_6.x86_64   cri-o://1.24.3-6.rhaos4.11.gitc4567c0.el8
huirwang-1128a-s6j6t-worker-westus-1   Ready    worker   135m   v1.24.6+5658434   10.0.1.5      <none>        Red Hat Enterprise Linux CoreOS 411.86.202211232221-0 (Ootpa)  4.18.0-372.32.1.el8_6.x86_64   cri-o://1.24.3-6.rhaos4.11.gitc4567c0.el8
huirwang-1128a-s6j6t-worker-westus-2   Ready    worker   136m   v1.24.6+5658434   10.0.1.4      <none>        Red Hat Enterprise Linux CoreOS 411.86.202211232221-0 (Ootpa)  4.18.0-372.32.1.el8_6.x86_64   cri-o://1.24.3-6.rhaos4.11.gitc4567c0.el8

% oc get node huirwang-1128a-s6j6t-worker-westus-1 --show-labels
NAME                                   STATUS   ROLES    AGE    VERSION           LABELS
huirwang-1128a-s6j6t-worker-westus-1   Ready    worker   136m   v1.24.6+5658434   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=Standard_D4s_v3,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=westus,failure-domain.beta.kubernetes.io/zone=0,k8s.ovn.org/egress-assignable=true,kubernetes.io/arch=amd64,kubernetes.io/hostname=huirwang-1128a-s6j6t-worker-westus-1,kubernetes.io/os=linux,node-role.kubernetes.io/worker=,node.kubernetes.io/instance-type=Standard_D4s_v3,node.openshift.io/os_id=rhcos,topology.disk.csi.azure.com/zone=,topology.kubernetes.io/region=westus,topology.kubernetes.io/zone=0

% oc get node huirwang-1128a-s6j6t-worker-westus-2 --show-labels
NAME                                   STATUS   ROLES    AGE    VERSION           LABELS
huirwang-1128a-s6j6t-worker-westus-2   Ready    worker   136m   v1.24.6+5658434   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=Standard_D4s_v3,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=westus,failure-domain.beta.kubernetes.io/zone=0,k8s.ovn.org/egress-assignable=true,kubernetes.io/arch=amd64,kubernetes.io/hostname=huirwang-1128a-s6j6t-worker-westus-2,kubernetes.io/os=linux,node-role.kubernetes.io/worker=,node.kubernetes.io/instance-type=Standard_D4s_v3,node.openshift.io/os_id=rhcos,topology.disk.csi.azure.com/zone=,topology.kubernetes.io/region=westus,topology.kubernetes.io/zone=0

2. Create two egressIP objects
Actual results:
egressip-47032 was not applied to any egress node

% oc get egressip
NAME             EGRESSIPS    ASSIGNED NODE                          ASSIGNED EGRESSIPS
egressip-47032   10.0.1.166
egressip-47034   10.0.1.181   huirwang-1128a-s6j6t-worker-westus-1   10.0.1.181

% oc get cloudprivateipconfig
NAME         AGE
10.0.1.130   6m25s
10.0.1.138   6m34s
10.0.1.166   6m34s
10.0.1.181   6m25s

% oc get cloudprivateipconfig 10.0.1.166 -o yaml
apiVersion: cloud.network.openshift.io/v1
kind: CloudPrivateIPConfig
metadata:
  annotations:
    k8s.ovn.org/egressip-owner-ref: egressip-47032
  creationTimestamp: "2022-11-28T10:27:37Z"
  finalizers:
  - cloudprivateipconfig.cloud.network.openshift.io/finalizer
  generation: 1
  name: 10.0.1.166
  resourceVersion: "87528"
  uid: 5221075a-35d0-4670-a6a7-ddfc6cbc700b
spec:
  node: huirwang-1128a-s6j6t-worker-westus-1
status:
  conditions:
  - lastTransitionTime: "2022-11-28T10:33:29Z"
    message: 'Error processing cloud assignment request, err: <nil>'
    observedGeneration: 1
    reason: CloudResponseError
    status: "False"
    type: Assigned
  node: huirwang-1128a-s6j6t-worker-westus-1

% oc get cloudprivateipconfig 10.0.1.138 -o yaml
apiVersion: cloud.network.openshift.io/v1
kind: CloudPrivateIPConfig
metadata:
  annotations:
    k8s.ovn.org/egressip-owner-ref: egressip-47032
  creationTimestamp: "2022-11-28T10:27:37Z"
  finalizers:
  - cloudprivateipconfig.cloud.network.openshift.io/finalizer
  generation: 1
  name: 10.0.1.138
  resourceVersion: "87523"
  uid: e4604e76-64d8-4735-87a2-eb50d28854cc
spec:
  node: huirwang-1128a-s6j6t-worker-westus-2
status:
  conditions:
  - lastTransitionTime: "2022-11-28T10:33:29Z"
    message: 'Error processing cloud assignment request, err: <nil>'
    observedGeneration: 1
    reason: CloudResponseError
    status: "False"
    type: Assigned
  node: huirwang-1128a-s6j6t-worker-westus-2

oc logs cloud-network-config-controller-6f7b994ddc-vhtbp -n openshift-cloud-network-config-controller
.......
E1128 10:30:43.590807 1 controller.go:165] error syncing '10.0.1.138': error assigning CloudPrivateIPConfig: "10.0.1.138" to node: "huirwang-1128a-s6j6t-worker-westus-2", err: network.InterfacesClient#CreateOrUpdate: Failure sending request: StatusCode=0 -- Original Error: Code="InvalidRequestFormat" Message="Cannot parse the request." Details=[{"code":"DuplicateResourceName","message":"Resource /subscriptions//resourceGroups//providers/Microsoft.Network/networkInterfaces/ has two child resources with the same name (huirwang-1128a-s6j6t-worker-westus-2_10.0.1.138)."}], requeuing in cloud-private-ip-config workqueue
I1128 10:30:44.051422 1 cloudprivateipconfig_controller.go:271] CloudPrivateIPConfig: "10.0.1.166" will be added to node: "huirwang-1128a-s6j6t-worker-westus-1"
E1128 10:30:44.301259 1 controller.go:165] error syncing '10.0.1.166': error assigning CloudPrivateIPConfig: "10.0.1.166" to node: "huirwang-1128a-s6j6t-worker-westus-1", err: network.InterfacesClient#CreateOrUpdate: Failure sending request: StatusCode=0 -- Original Error: Code="InvalidRequestFormat" Message="Cannot parse the request." Details=[{"code":"DuplicateResourceName","message":"Resource /subscriptions//resourceGroups//providers/Microsoft.Network/networkInterfaces/ has two child resources with the same name (huirwang-1128a-s6j6t-worker-westus-1_10.0.1.166)."}], requeuing in cloud-private-ip-config workqueue
..........
Expected results:
EgressIP can be applied successfully.
Additional info:
This is a clone of issue OCPBUGS-4499. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-860. The following is the description of the original issue:
—
Description of problem:
In GCP, once an external IP address is assigned to a master/infra node through the GCP console, the number of pending CSRs from kubernetes.io/kubelet-serving keeps increasing, and the following errors are reported:
I0902 10:48:29.254427 1 controller.go:121] Reconciling CSR: csr-q7bwd
I0902 10:48:29.365774 1 csr_check.go:157] csr-q7bwd: CSR does not appear to be client csr
I0902 10:48:29.371827 1 csr_check.go:545] retrieving serving cert from build04-c92hb-master-1.c.openshift-ci-build-farm.internal (10.0.0.5:10250)
I0902 10:48:29.375052 1 csr_check.go:188] Found existing serving cert for build04-c92hb-master-1.c.openshift-ci-build-farm.internal
I0902 10:48:29.375152 1 csr_check.go:192] Could not use current serving cert for renewal: CSR Subject Alternate Name values do not match current certificate
I0902 10:48:29.375166 1 csr_check.go:193] Current SAN Values: [build04-c92hb-master-1.c.openshift-ci-build-farm.internal 10.0.0.5], CSR SAN Values: [build04-c92hb-master-1.c.openshift-ci-build-farm.internal 10.0.0.5 35.211.234.95]
I0902 10:48:29.375175 1 csr_check.go:202] Falling back to machine-api authorization for build04-c92hb-master-1.c.openshift-ci-build-farm.internal
E0902 10:48:29.375184 1 csr_check.go:420] csr-q7bwd: IP address '35.211.234.95' not in machine addresses: 10.0.0.5
I0902 10:48:29.375193 1 csr_check.go:205] Could not use Machine for serving cert authorization: IP address '35.211.234.95' not in machine addresses: 10.0.0.5
I0902 10:48:29.379457 1 csr_check.go:218] Falling back to serving cert renewal with Egress IP checks
I0902 10:48:29.382668 1 csr_check.go:221] Could not use current serving cert and egress IPs for renewal: CSR Subject Alternate Names includes unknown IP addresses
I0902 10:48:29.382702 1 controller.go:233] csr-q7bwd: CSR not authorized
Version-Release number of selected component (if applicable):
4.11.2
Steps to Reproduce:
1. Assign external IPs to master/infra nodes in GCP
2. oc get csr | grep kubernetes.io/kubelet-serving
Actual results:
CSRs are not approved
Expected results:
CSRs are approved
Additional info:
This issue only happens in GCP. The same OpenShift installations in AWS do not have this issue. It looks like the CSRs are created using the external IP addresses once assigned. Ref: https://coreos.slack.com/archives/C03KEQZC1L2/p1662122007083059
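To confirm that a pending CSR really carries the external IP in its SANs, the request can be decoded (a sketch with standard tooling; the CSR name comes from the logs above):
$ oc get csr csr-q7bwd -o jsonpath='{.spec.request}' | base64 -d \
    | openssl req -noout -text | grep -A1 'Subject Alternative Name'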
This bug is a backport clone of [Bugzilla Bug 2118318](https://bugzilla.redhat.com/show_bug.cgi?id=2118318). The following is the description of the original bug:
—
+++ This bug was initially created as a clone of Bug #2117569 +++
Description of problem:
The garbage collector and resource quota controllers must ignore ALL events; otherwise, if a rogue controller or a workload causes unbounded event creation, performance will degrade as they have to process the events.
Fix: https://github.com/kubernetes/kubernetes/pull/110939
This bug tracks the fix in master (4.12) and also allows backporting to 4.11.1.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1.
2.
3.
Actual results:
Expected results:
Additional info:
— Additional comment from Michal Fojtik on 2022-08-11 10:52:28 UTC —
I'm using FastFix here as we need to backport this to 4.11.1 to avoid support churn for busy clusters or clusters doing upgrades.
— Additional comment from ART BZ Bot on 2022-08-11 15:13:32 UTC —
Elliott changed bug status from MODIFIED to ON_QA.
This bug is expected to ship in the next 4.12 release.
— Additional comment from zhou ying on 2022-08-12 03:03:34 UTC —
Checked the payload commit ID; payload 4.12.0-0.nightly-2022-08-11-191750 contains the fixed PR:
oc adm release info registry.ci.openshift.org/ocp/release:4.12.0-0.nightly-2022-08-11-191750 --commit-urls |grep hyperkube
Warning: the default reading order of registry auth file will be changed from "${HOME}/.docker/config.json" to podman registry config locations in the future version. "${HOME}/.docker/config.json" is deprecated, but can still be used for storing credentials as a fallback. See https://github.com/containers/image/blob/main/docs/containers-auth.json.5.md for the order of podman registry config locations.
hyperkube https://github.com/openshift/kubernetes/commit/da80cd038ee5c3b45ba36d4b48b42eb8a74439a3
commit da80cd038ee5c3b45ba36d4b48b42eb8a74439a3 (HEAD -> master, origin/release-4.13, origin/release-4.12, origin/master, origin/HEAD)
Merge: a9d6306a701 055b96e614a
Author: OpenShift Merge Robot <openshift-merge-robot@users.noreply.github.com>
Date: Thu Aug 11 15:13:05 2022 +0000
Merge pull request #1338 from benluddy/openshift-pick-110888
Bug 2117569: UPSTREAM: 110888: feat: fix a bug thaat not all event be ignored by gc controller
This is a clone of issue OCPBUGS-2495. The following is the description of the original issue:
—
Failures like:
$ oc login --token=...
Logged into "https://api..." as "..." using the token provided.
Error from server (Timeout): the server was unable to return a response in the time allotted, but may still be processing the request (get projects.project.openshift.io)
break login, which tries to gather information before saving the configuration, including a giant project list.
Ideally, login would be able to save the successful login credentials even when the informative gathering has difficulties. And possibly the informative gathering could be made conditional (--quiet or similar?) so expensive gathering could be skipped in use cases where the context is not needed.
Please review the following PR: https://github.com/openshift/cluster-bootstrap/pull/72
The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
NodePort port not accessible
Version-Release number of selected component (if applicable):
OCP 4.8.20
How reproducible:
$oc -n ui-nprd get services -o wide
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE SELECTOR
docker-registry ClusterIP 10.201.219.240 <none> 5000/TCP 24d app=registry
docker-registry-lb LoadBalancer 10.201.252.253 internal-xxxxxx.xx-xxxx-1.elb.amazonaws.com 5000:30779/TCP 3d22h app=registry
docker-registry-np NodePort 10.201.216.26 <none> 5000:32428/TCP 3d16h app=registry
$oc debug node/ip-xxx.ca-central-1.compute.internal
Starting pod/ip-xxx.ca-central-1computeinternal-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.81.23.96
If you don't see a command prompt, try pressing enter.
sh-4.2# chroot /host
sh-4.4# nc -vz 10.81.23.96 32428
Ncat: Version 7.70 ( https://nmap.org/ncat )
Ncat: Connection timed out.
In a newly created namespace the same deployment works:
[RHEL7:> oc project
Using project "test-c1" on server "https://api.xx.xx.xxxx.xx.xx:6443".
[RHEL7:- ~/tmp]> oc port-forward service/docker-registry-np 5000:5000
Forwarding from 127.0.0.1:5000 -> 5000
[1]+ Stopped oc4 port-forward service/docker-registry-np 5000:5000
[RHEL7: ~/tmp]> bg %1
[1]+ oc4 port-forward service/docker-registry-np 5000:5000 &
[RHEL7: ~/tmp]> nc -v localhost 5000
Ncat: Version 7.50 ( https://nmap.org/ncat )
Ncat: Connected to 127.0.0.1:5000.
Handling connection for 5000
[RHEL7: ~/tmp]> kill %1
[RHEL7: ~/tmp]>
[1]+ Terminated oc4 port-forward service/docker-registry-np 5000:5000
[RHEL7: ~/tmp]> oc get services
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
docker-registry-np NodePort 10.201.224.174 <none> 5000:31793/TCP 68s
[RHEL7: ~/tmp]> oc get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
registry-75b7c7fd94-rx29j 1/1 Running 0 7m5s 10.201.1.29 ip-xxx.ca-central-1.compute.internal <none> <none>
[RHEL7: ~/tmp]> oc debug node/ip-xxx.ca-central-1.compute.internal
Starting pod/ip-xxxca-central-1computeinternal-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.81.23.87
If you don't see a command prompt, try pressing enter.
sh-4.2# chroot /host
sh-4.4# nc -v 10.81.23.87 31793
Ncat: Version 7.70 ( https://nmap.org/ncat )
Ncat: Connected to 10.81.23.87:31793.
Actual results:
Expected results:
Additional info:
This is a clone of issue OCPBUGS-7885. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-7617. The following is the description of the original issue:
—
Description of problem:
Azure Disk volumes take a long time to attach/detach.
Version-Release number of selected component (if applicable):
Openshift ARO 4.10.30
How reproducible:
While performing scale-down and scale-up of a StatefulSet, the pod takes time to attach and detach the volume from nodes.
Reviewed must-gather and test output; I will share my findings in comments.
Steps to Reproduce:
1.
2.
3.
Actual results:
Expected results:
Additional info:
This is a clone of issue OCPBUGS-1226. The following is the description of the original issue:
—
We added server groups for the control plane and computes as part of OSASINFRA-2570, except for UPI, which only creates a server group for the control plane.
We need to update the UPI scripts to create a server group for computes, to be consistent with IPI and have the instructions at https://docs.openshift.com/container-platform/4.11/machine_management/creating_machinesets/creating-machineset-osp.html work out of the box in case customers want to create MachineSets on their UPI clusters.
Related to OCPCLOUD-1135.
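For reference, creating the compute server group by hand would look roughly like this (a sketch; the name and soft-anti-affinity policy mirror what IPI does, but are illustrative here):
$ openstack server group create --policy soft-anti-affinity <infra-id>-worker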
This is a clone of issue OCPBUGS-7800. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-266. The following is the description of the original issue:
—
Description of problem: I am working with a customer who uses the web console. From the Developer Perspective's Project Access tab, they cannot differentiate between users and groups and furthermore cannot add groups from this web console. This has led to confusion whether existing resources were in fact users or groups, and furthermore they have added users when they intended to add groups instead. What we really need is a third column in the Project Access tab that says whether a resource is a user or group.
Version-Release number of selected component (if applicable): This is an issue in OCP 4.10 and 4.11, and I presume future versions as well
How reproducible: Every time. My customer is running on ROSA, but I have determined this issue to be general to OpenShift.
Steps to Reproduce:
From the oc cli, I create a group and add a user to it.
$ oc adm groups new techlead
group.user.openshift.io/techlead created
$ oc adm groups add-users techlead admin
group.user.openshift.io/techlead added: "admin"
$ oc get groups
NAME USERS
cluster-admins
dedicated-admins admin
techlead admin
I create a new namespace so that I can assign a group project level access:
$ oc new-project my-namespace
$ oc adm policy add-role-to-group edit techlead -n my-namespace
I then went to the web console -> Developer perspective -> Project -> Project Access. I verified the rolebinding named 'edit' is bound to a group named 'techlead'.
$ oc get rolebinding
NAME ROLE AGE
admin ClusterRole/admin 15m
admin-dedicated-admins ClusterRole/admin 15m
admin-system:serviceaccounts:dedicated-admin ClusterRole/admin 15m
dedicated-admins-project-dedicated-admins ClusterRole/dedicated-admins-project 15m
dedicated-admins-project-system:serviceaccounts:dedicated-admin ClusterRole/dedicated-admins-project 15m
edit ClusterRole/edit 2m18s
system:deployers ClusterRole/system:deployer 15m
system:image-builders ClusterRole/system:image-builder 15m
system:image-pullers ClusterRole/system:image-puller 15m
$ oc get rolebinding edit -o yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  creationTimestamp: "2022-08-15T14:16:56Z"
  name: edit
  namespace: my-namespace
  resourceVersion: "108357"
  uid: 4abca27d-08e8-43a3-b9d3-d20d5c294bbe
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: edit
subjects:
Now back to the CLI, I view the newly created rolebinding named 'developer-view-c15b720facbc8deb' (created via the Project Access tab), and find that the "view" role is assigned to a user named 'developer', rather than a group.
$ oc get rolebinding
NAME ROLE AGE
admin ClusterRole/admin 17m
admin-dedicated-admins ClusterRole/admin 17m
admin-system:serviceaccounts:dedicated-admin ClusterRole/admin 17m
dedicated-admins-project-dedicated-admins ClusterRole/dedicated-admins-project 17m
dedicated-admins-project-system:serviceaccounts:dedicated-admin ClusterRole/dedicated-admins-project 17m
edit ClusterRole/edit 4m25s
developer-view-c15b720facbc8deb ClusterRole/view 90s
system:deployers ClusterRole/system:deployer 17m
system:image-builders ClusterRole/system:image-builder 17m
system:image-pullers ClusterRole/system:image-puller 17m
[10:21:21] kechung:~ $ oc get rolebinding developer-view-c15b720facbc8deb -o yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  creationTimestamp: "2022-08-15T14:19:51Z"
  name: developer-view-c15b720facbc8deb
  namespace: my-namespace
  resourceVersion: "113298"
  uid: cc2d1b37-922b-4e9b-8e96-bf5e1fa77779
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: view
subjects:
So in conclusion, from the Project Access tab, we're unable to add groups and unable to differentiate between users and groups. This is in essence our ask for this RFE.
Actual results:
Developer perspective -> Project -> Project Access tab shows a list of resources which can be users or groups, but does not differentiate between them. Furthermore, when we add resources, they are only users and there is no way to add a group from this tab in the web console.
Expected results:
Should have the ability to add groups and differentiate between users and groups. Ideally, we're looking at a third column for user or group.
Additional info:
This is a clone of issue OCPBUGS-11636. The following is the description of the original issue:
—
Description of problem:
ACLs are disabled for all newly created S3 buckets; this causes all OCP installs to fail because the bootstrap ignition cannot be uploaded:
level=info msg=Creating infrastructure resources...
level=error
level=error msg=Error: error creating S3 bucket ACL for yunjiang-acl413-4dnhx-bootstrap: AccessControlListNotSupported: The bucket does not allow ACLs
level=error msg= status code: 400, request id: HTB2HSH6XDG0Q3ZA, host id: V6CrEgbc6eyfJkUbLXLxuK4/0IC5hWCVKEc1RVonSbGpKAP1RWB8gcl5dfyKjbrLctVlY5MG2E4=
level=error
level=error msg= with aws_s3_bucket_acl.ignition,
level=error msg= on main.tf line 62, in resource "aws_s3_bucket_acl" "ignition":
level=error msg= 62: resource "aws_s3_bucket_acl" ignition {
level=error
level=error msg=failed to fetch Cluster: failed to generate asset "Cluster": failure applying terraform for "bootstrap" stage: failed to create cluster: failed to apply Terraform: exit status 1
level=error
level=error msg=Error: error creating S3 bucket ACL for yunjiang-acl413-4dnhx-bootstrap: AccessControlListNotSupported: The bucket does not allow ACLs
level=error msg= status code: 400, request id: HTB2HSH6XDG0Q3ZA, host id: V6CrEgbc6eyfJkUbLXLxuK4/0IC5hWCVKEc1RVonSbGpKAP1RWB8gcl5dfyKjbrLctVlY5MG2E4=
level=error
level=error msg= with aws_s3_bucket_acl.ignition,
level=error msg= on main.tf line 62, in resource "aws_s3_bucket_acl" "ignition":
level=error msg= 62: resource "aws_s3_bucket_acl" ignition {
Version-Release number of selected component (if applicable):
4.11+
How reproducible:
Always
Steps to Reproduce:
1. Create a cluster via IPI
Actual results:
install fail
Expected results:
install succeed
Additional info:
Heads-Up: Amazon S3 Security Changes Are Coming in April of 2023 - https://aws.amazon.com/blogs/aws/heads-up-amazon-s3-security-changes-are-coming-in-april-of-2023/
https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-ownership-error-responses.html - After you apply the bucket owner enforced setting for Object Ownership, ACLs are disabled.
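One way to confirm the new default on an affected bucket (a sketch with the AWS CLI; the bucket name is taken from the error above):
$ aws s3api get-bucket-ownership-controls --bucket yunjiang-acl413-4dnhx-bootstrap
If the output reports ObjectOwnership: BucketOwnerEnforced, ACLs are disabled and any aws_s3_bucket_acl resource will fail as shown.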
Description of problem:
The 4.11 version of openshift-installer does not support the mon01 zone
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
container_network* metrics stop reporting after a container restarts. Other container_* metrics continue to report for the same pod.
How reproducible:
Issue can be reproduced by triggering a container restart
Steps to Reproduce:
1. Restart container
2. Check metrics and see container_network* not reporting
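For example, querying a container_network* series next to another container_* series for the same pod in the console's metrics page should make the gap visible after the restart (namespace and pod are placeholders):
container_network_receive_bytes_total{namespace="<ns>", pod="<pod>"}
container_cpu_usage_seconds_total{namespace="<ns>", pod="<pod>"}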
Additional info:
Ticket with a more detailed debugging process: OHSS-16739
Description of problem:
Jenkins and Jenkins Agent Base image versions need to be updated to use the latest images to mitigate known CVEs in plugins and Jenkins versions.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
This is a clone of issue OCPBUGS-1677. The following is the description of the original issue:
—
Description of problem:
pkg/devfile/sample_test.go fails after the devfile registry was updated (https://github.com/devfile/registry/pull/126).
This issue is about updating our assertion so that the CI job runs successfully again. We might want to backport this as well.
OCPBUGS-1678 is about updating the code that the test should use a mock response instead of the latest registry content OR check some specific attributes instead of comparing the full JSON response.
Version-Release number of selected component (if applicable):
4.12
How reproducible:
Always
Steps to Reproduce:
1. Clone openshift/console
2. Run ./test-backend.sh
Actual results:
Unit tests fail
Expected results:
Unit tests should pass again
Additional info:
This is a clone of issue OCPBUGS-4504. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-1557. The following is the description of the original issue:
—
Seen in an instance created recently by a 4.12.0-ec.2 GCP provider:
"scheduling": { "automaticRestart": false, "onHostMaintenance": "MIGRATE", "preemptible": false, "provisioningModel": "STANDARD" },
From GCP's docs, they may stop instances on hardware failures and other causes, and we'd need automaticRestart: true to auto-recover from that. Also from GCP docs, the default for automaticRestart is true. And on the Go provider side, we doc:
If omitted, the platform chooses a default, which is subject to change over time, currently that default is "Always".
But the implementing code does not actually float the setting. Seems like a regression here, which is part of 4.10:
$ git clone https://github.com/openshift/machine-api-provider-gcp.git
$ cd machine-api-provider-gcp
$ git log --oneline origin/release-4.10 | grep 'migrate to openshift/api'
44f0f958 migrate to openshift/api
But that's not where the 4.9 and earlier code is located:
$ git branch -a | grep origin/release
remotes/origin/release-4.10
remotes/origin/release-4.11
remotes/origin/release-4.12
remotes/origin/release-4.13
Hunting for 4.9 code:
$ oc adm release info --commits quay.io/openshift-release-dev/ocp-release:4.9.48-x86_64 | grep gcp
gcp-machine-controllers https://github.com/openshift/cluster-api-provider-gcp c955c03b2d05e3b8eb0d39d5b4927128e6d1c6c6
gcp-pd-csi-driver https://github.com/openshift/gcp-pd-csi-driver 48d49f7f9ef96a7a42a789e3304ead53f266f475
gcp-pd-csi-driver-operator https://github.com/openshift/gcp-pd-csi-driver-operator d8a891de5ae9cf552d7d012ebe61c2abd395386e
So looking there:
$ git clone https://github.com/openshift/cluster-api-provider-gcp.git
$ cd cluster-api-provider-gcp
$ git log --oneline | grep 'migrate to openshift/api'
...no hits...
$ git grep -i automaticRestart origin/release-4.9 | grep -v '"description"\|compute-gen.go'
origin/release-4.9:vendor/google.golang.org/api/compute/v1/compute-api.json: "automaticRestart": {
Not actually clear to me how that code is structured. So 4.10 and later GCP machine-API providers are impacted, and I'm unclear on 4.9 and earlier.
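A quick way to check what an affected instance actually received (a sketch with gcloud; instance name and zone are placeholders):
$ gcloud compute instances describe <instance-name> --zone <zone> \
    --format='value(scheduling.automaticRestart)'
On an affected 4.10+ instance this prints False, matching the scheduling block above.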
Description of problem:
In order to understand what is going on with OCPBUGS-5379 we want to add more logs
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
ENV:
OCP 4.11.29
VSphere IPI
ISSUE:
If scaling up 3-5 machines, the machines transition to running and join the cluster.
If scaling up a large number of machines (10-20+), some will transition to running and join the cluster.
The remaining machines stay in the provisioned/provisioning state and never power on.
If those nodes are manually powered on, they join the cluster and are healthy.
OBSERVATION:
There are ~16k tags shared across multiple VMware Cloud Foundation data centers.
In the logs it is observed that the reconciliation of tags occurs, and then some of the nodes power on and transition to running.
The logs do not show the reconciliation of tags on the remaining nodes running again or completing.
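To gauge how much tag data the controller has to reconcile, something like the following could be used (a sketch, assuming govc is configured against the same vCenter):
$ govc tags.ls | wc -l    # ~16k in this environment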