Jump to: Complete Features | Incomplete Features | Complete Epics | Incomplete Epics | Other Complete | Other Incomplete |
Note: this page shows the Feature-Based Change Log for a release
These features were completed when this image was assembled
Configure audit logging to capture login, logout and login failure details
TODO(PM): update this
Customer who needs login, logout and login failure details inside the openshift container platform.
I have checked for this on my test cluster but the audit logs do not contain any user name specifying login or logout details. For successful logins or logout, on CLI and openshift console as well we can see 'Login successful' or 'Invalid credentials'.
Expected results: Login, logout and login failures should be captured in audit logging.
The apiserver pods today have ´/var/log/<kube|oauth|openshift>-apiserver` mounted from the host and create audit files there using the upstream audit event format (JSON lines following https://github.com/kubernetes/apiserver/blob/92392ef22153d75b3645b0ae339f89c12767fb52/pkg/apis/audit/v1/types.go#L72). These events are apiserver specific, but as oauth authentication flow events are also requests, we can use the apiserver event format to log logins, login failures and logouts. Hence, we propose to make oauth-server to create /var/log/oauth-server/audit.log files on the master nodes using that format.
When the login flow does not finish within a certain time (e.g. 10min), we can artificially create an event to show a login failure in the audit logs.
Right now there's no way to generate audit logs from this.
Let the Cluster Authentication Operator deliver the policy to OAuthServer.
In order to know if authn events should be logged, OAuthServer needs to be aware of it.
* Stanislav LázničkaCreate an observer to deliver the audit policy to the oauth server
Make the authentication-operator react to the new audit field in the oauth.config/cluster object. Write an observer watching this field, such an observer will translate the top-level configuration into oauth-server config and add it to the rest of the observed config.
Right now there's no way to generate audit logs from this.
OCP/Telco Definition of Done
Feature Template descriptions and documentation.
Early customer feedback is that they see SNO as a great solution covering smaller footprint deployment, but are wondering what is the evolution story OpenShift is going to provide where more capacity or high availability are needed in the future.
While migration tooling (moving workload/config to new cluster) could be a mid-term solution, customer desire is not to include extra hardware to be involved in this process.
For Telecommunications Providers, at the Far Edge they intend to start small and then grow. Many of these operators will start with a SNO-based DU deployment as an initial investment, but as DUs evolve, different segments of the radio spectrum are added, various radio hardware is provisioned and features delivered to the Far Edge, the Telecommunication Providers desire the ability for their Far Edge deployments to scale up from 1 node to 2 nodes to n nodes. On the opposite side of the spectrum from SNO is MMIMO where there is a robust cluster and workloads use HPA.
Requirement | Notes | isMvp? |
---|---|---|
CI - MUST be running successfully with test automation | This is a requirement for ALL features. | YES |
Release Technical Enablement | Provide necessary release enablement details and documents. | YES |
This Section:
This Section: What does the person writing code, testing, documenting need to know? What context can be provided to frame this feature.
Questions to be addressed:
This is a ticket meant to track all the all the OCP PRs that are involved in the implementation of the SNO + workers enhancement
When this image was assembled, these features were not yet completed. Therefore, only the Jira Cards included here are part of this release
Problem:
Certain Insights Advisor features differentiate between RHEL and OCP advisor
Goal:
Address top priority UI misalignments between RHEL and OCP advisor. Address UI features dropped from Insights ADvisor for OCP GA.
Scope:
Specific tasks and priority of them tracked in https://issues.redhat.com/browse/CCXDEV-7432
This contains all the Insights Advisor widget deliverables for the OCP release 4.11.
Scope
It covers only minor bug fixes and improvements:
Show the error message (mocked in CCXDEV-5868) if the Prometheus metrics `cluster_operator_conditions{name="insights"}` contain two true conditions: UploadDegraded and Degraded at the same time. This state occurs if there was an IO archive upload error = problems with the pipeline.
Expected for 4.11 OCP release.
Scenario: Check if the Insights Advisor widget in the OCP WebConsole UI shows the time of the last data analysis Given: OCP WebConsole UI and the cluster dashboard is accessible And: CCX external data pipeline is in a working state And: administrator A1 has access to his cluster's dashboard And: Insights Operator for this cluster is sending archives When: administrator A1 clicks on the Insights Advisor widget Then: the results of the last analysis are showed in the Insights Advisor widget And: the time of the last analysis is shown in the Insights Advisor widget
Acceptance criteria:
max_over_time(timestamp(changes(insightsclient_request_send_total\{status_code="202"}[1m]) > 0)[24h:1m])
Cloning the existing rule should end up with a new rule in the same namespace.
Modifications can now be done to the new rule.
(Optional) You can silence the existing rule.
Create a new PrometheusRule object inside the namespace that includes the metrics you need to form the alerting rule.
CMO should reconcile the platform Prometheus configuration with the alert-relabel-config resources.
DoD
CMO should reconcile the platform Prometheus configuration with the AlertingRule resources.
DoD
Managing PVs at scale for a fleet creates difficulties where "one size does not fit all". The ability for SRE to deploy prometheus with PVs and have retention based an on a desired size would enable easier management of these volumes across the fleet.
The prometheus-operator exposes retentionSize.
Field | Description |
---|---|
retentionSize | Maximum amount of disk space used by blocks. Supported units: B, KB, MB, GB, TB, PB, EB. Ex: 512MB. |
This is a feature request to enable this configuration option via CMO cluster-monitoring-config ConfigMap.
Today, all configuration for setting individual, for example, routing configuration is done via a single configuration file that only admins have access to. If an environment uses multiple tenants and each tenant, for example, has different systems that they are using to notify teams in case of an issue, then someone needs to file a request w/ an admin to add the required settings.
That can be bothersome for individual teams, since requests like that usually disappear in the backlog of an administrator. At the same time, administrators might get tons of requests that they have to look at and prioritize, which takes them away from more crucial work.
We would like to introduce a more self service approach whereas individual teams can create their own configuration for their needs w/o the administrators involvement.
Last but not least, since Monitoring is deployed as a Core service of OpenShift there are multiple restrictions that the SRE team has to apply to all OSD and ROSA clusters. One restriction is the ability for customers to use the central Alertmanager that is owned and managed by the SRE team. They can't give access to the central managed secret due to security concerns so that users can add their own routing information.
Provide a new API (based on the Operator CRD approach) as part of the Prometheus Operator that allows creating a subset of the Alertmanager configuration without touching the central Alertmanager configuration file.
Please note that we do not plan to support additional individual webhooks with this work. Customers will need to deploy their own version of the third party webhooks.
Team A wants to send all their important notifications to a specific Slack channel.
* CI - CI is running, tests are automated and merged.
* Release Enablement <link to Feature Enablement Presentation>
* DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
* DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
* DEV - Downstream build attached to advisory: <link to errata>
* QE - Test plans in Polarion: <link or reference to Polarion>
* QE - Automated tests merged: <link or reference to automated tests>
* DOC - Downstream documentation merged: <link to meaningful PR>
Now that upstream supports AlertmanagerConfig v1beta1 (see MON-2290 and https://github.com/prometheus-operator/prometheus-operator/pull/4709), it should be deployed by CMO.
DoD:
DoD
DoD
Copy/paste from [_https://github.com/openshift-cs/managed-openshift/issues/60_]
Which service is this feature request for?
OpenShift Dedicated and Red Hat OpenShift Service on AWS
What are you trying to do?
Allow ROSA/OSD to integrate with AWS Managed Prometheus.
Describe the solution you'd like
Remote-write of metrics is supported in OpenShift but it does not work with AWS Managed Prometheus since AWS Managed Prometheus requires AWS SigV4 auth.
Describe alternatives you've considered
There is the workaround to use the "AWS SigV4 Proxy" but I'd think this is not properly supported by RH.
https://mobb.ninja/docs/rosa/cluster-metrics-to-aws-prometheus/
Additional context
The customer wants to use an open and portable solution to centralize metrics storage and analysis. If they also deploy to other clouds, they don't want to have to re-configure. Since most clouds offer a Prometheus service (or it's easy to self-manage Prometheus), app migration should be simplified.
The cluster monitoring operator should allow OpenShift customers to configure remote write with all authentication methods supported by upstream Prometheus.
We will extend CMO's configuration API to support the following authentications with remote write:
Customers want to send metrics to AWS Managed Prometheus that require sigv4 authentication (see https://docs.aws.amazon.com/prometheus/latest/userguide/AMP-secure-metric-ingestion.html#AMP-secure-auth).
Prometheus and Prometheus operator already support sigv4 authentication for remote write. This should be possible to configure the same in the CMO configuration:
apiVersion: v1
kind: ConfigMap
metadata:
name: cluster-monitoring-config
namespace: openshift-monitoring
data:
config.yaml: |
prometheusK8s:
remoteWrite:
- url: "https://remote-write.endpoint"
sigv4:
accessKey:
name: aws-credentialss
key: access
secretKey:
name: aws-credentials
key: secret
profile: "SomeProfile"
roleArn: "SomeRoleArn"
DoD:
Prometheus and Prometheus operator already support custom Authorization for remote write. This should be possible to configure the same in the CMO configuration:
apiVersion: v1
kind: ConfigMap
metadata:
name: cluster-monitoring-config
namespace: openshift-monitoring
data:
config.yaml: |
prometheusK8s:
remoteWrite:
- url: "https://remote-write.endpoint"
Authorization:
type: Bearer
credentials:
name: credentials
key: token
DoD:
As WMCO user, I want to make sure containerd logging information has been updated in documents and scripts.
We drive OpenShift cross-market customer success and new customer adoption with constant improvements and feature additions to the existing capabilities of our OpenShift Core Networking (SDN and Network Edge). This feature captures that natural progression of the product.
There are definitely grey areas, but in general:
Questions to be addressed:
Create a PR in openshift/cluster-ingress-operator to implement configurable router probe timeouts.
The PR should include the following:
User Story: As a customer in a highly regulated environment, I need the ability to secure DNS traffic when forwarding requests to upstream resolvers so that I can ensure additional DNS traffic and data privacy.
tldr: three basic claims, the rest is explanation and one example
While bugs are an important metric, fixing bugs is different than investing in maintainability and debugability. Investing in fixing bugs will help alleviate immediate problems, but doesn't improve the ability to address future problems. You (may) get a code base with fewer bugs, but when you add a new feature, it will still be hard to debug problems and interactions. This pushes a code base towards stagnation where it gets harder and harder to add features.
One alternative is to ask teams to produce ideas for how they would improve future maintainability and debugability instead of focusing on immediate bugs. This would produce designs that make problem determination, bug resolution, and future feature additions faster over time.
I have a concrete example of one such outcome of focusing on bugs vs quality. We have resolved many bugs about communication failures with ingress by finding problems with point-to-point network communication. We have fixed the individual bugs, but have not improved the code for future debugging. In so doing, we chase many hard to diagnose problem across the stack. The alternative is to create a point-to-point network connectivity capability. this would immediately improve bug resolution and stability (detection) for kuryr, ovs, legacy sdn, network-edge, kube-apiserver, openshift-apiserver, authentication, and console. Bug fixing does not produce the same impact.
We need more investment in our future selves. Saying, "teams should reserve this" doesn't seem to be universally effective. Perhaps an approach that directly asks for designs and impacts and then follows up by placing the items directly in planning and prioritizing against PM feature requests would give teams the confidence to invest in these areas and give broad exposure to systemic problems.
Relevant links:
Per the 4.6.30 Monitoring DNS Post Mortem, we should add E2E tests to openshift/cluster-dns-operator to reduce the risk that changes to our CoreDNS configuration break DNS resolution for clients.
To begin with, we add E2E DNS testing for 2 or 3 client libraries to establish a framework for testing DNS resolvers; the work of adding additional client libraries to this framework can be left for follow-up stories. Two common libraries are Go's resolver and glibc's resolver. A somewhat common library that is known to have quirks is musl libc's resolver, which uses a shorter timeout value than glibc's resolver and reportedly has issues with the EDNS0 protocol extension. It would also make sense to test Java or other popular languages or runtimes that have their own resolvers.
Additionally, as talked about in our DNS Issue Retro & Testing Coverage meeting on Feb 28th 2024, we also decided to add a test for testing a non-EDNS0 query for a larger than 512 byte record, as once was an issue in bug OCPBUGS-27397.
The ultimate goal is that the test will inform us when a change to OpenShift's DNS or networking has an effect that may impact end-user applications.
In OCP 4.8 the router was changed to use the "random" balancing algorithm for non-passthrough routes by default. It was previously "leastconn".
Bug https://bugzilla.redhat.com/show_bug.cgi?id=2007581 shows that using "random" by default incurs significant memory overhead for each backend that uses it.
PR https://github.com/openshift/cluster-ingress-operator/pull/663
reverted the change and made "leastconn" the default again (OCP 4.8 onwards).
The analysis in https://bugzilla.redhat.com/show_bug.cgi?id=2007581#c40 shows that the default haproxy behaviour is to multiply the weight (specified in the route CR) by 16 as it builds its data structures for each backend. If no weight is specified then openshift-router sets the weight to 256. If you have many, many thousands of routes then this balloons quickly and leads to a significant increase in memory usage, as highlighted by customer cases attached to BZ#2007581.
The purpose of this issue is to both explore changing the openshift-router default weight (i.e., 256) to something smaller, or indeed unset (assuming no explicit weight has been requested), and to measure the memory usage within the context of the existing perf&scale tests that we use for vetting new haproxy releases.
It may be that the low-hanging change is to not default to weight=256 for backends that only have one pod replica (i.e., if no value specified, and there is only 1 pod replica, then don't default to 256 for that single server entry).
Outcome: does changing the [default] weight value make it feasible to switch back to "random" as the default balancing algorithm for a future OCP release.
Revert router to using "random" once again in 4.11 once analysis is done on impact of weight and static memory allocation.
Requirement | Notes | isMvp? |
---|
CI - MUST be running successfully with test automation | This is a requirement for ALL features. | YES |
Release Technical Enablement | Provide necessary release enablement details and documents. | YES |
This Section:
This Section: What does the person writing code, testing, documenting need to know? What context can be provided to frame this feature.
Questions to be addressed:
Goal
Add support for PDB (Pod Disruption Budget) to the console.
Requirements:
Designs:
During master nodes upgrade when nodes are getting drained there's currently no protection from two or more operands going down. If your component is required to be available during upgrade or other voluntary disruptions, please consider deploying PDB to protect your operands.
The effort is tracked in https://issues.redhat.com/browse/WRKLDS-293.
Example:
Acceptance Criteria:
1. Create PDB controller in console-operator for both console and downloads pods
2. Add e2e tests for PDB in single node and multi node cluster
Note: We should consider to backport this to 4.10
When viewing the Installed Operators list set to 'All projects' and then selecting an operator that is available in 'All namespaces' (globally installed,) upon clicking the operator to view its details the user is taken into the details of that operator in installed namespace (project selector will switch to the install namespace.)
This can be disorienting then to look at the lists of custom resource instances and see them all blank, since the lists are showing instances only in the currently selected project (the install namespace) and not across all namespaces the operator is available in.
It is likely that making use of the new Operator resource will improve this experience (CONSOLE-2240,) though that may still be some releases away. it should be considered if it's worth a "short term" fix in the meantime.
Note: The informational alert was not implemented. It was decided that since "All namespaces" is displayed in the radio button, the alert was not needed.
OCP/Telco Definition of Done
Epic Template descriptions and documentation.
<--- Cut-n-Paste the entire contents of this description into your new Epic --->
Rebase openshift/builder to k8s 1.24
4.11 MVP Requirements
Out of scope use cases (that are part of the Kubeframe/factory project):
Questions to be addressed:
As a deployer, I want to be able to:
so that I can achieve
Currently the Assisted Service generates the credentials by running the ignition generation step of the oepnshift-installer. This is why the credentials are only retrievable from the REST API towards the end of the installation.
In the BILLI usage, which takes down assisted service before the installation is complete there is no obvious point at which to alert the user that they should retrieve the credentials. This means that we either need to:
This requires/does not require a design proposal.
This requires/does not require a feature gate.
The AWS-specific code added in OCPPLAN-6006 needs to become GA and with this we want to introduce a couple of Day2 improvements.
Currently the AWS tags are defined and applied at installation time only and saved in the infrastructure CRD's status field for further operator use, which in turn just add the tags during creation.
Saving in the status field means it's not included in Velero backups, which is a crucial feature for customers and Day2.
Thus the status.resourceTags field should be deprecated in favour of a newly created spec.resourceTags with the same content. The installer should only populate the spec, consumers of the infrastructure CRD must favour the spec over the status definition if both are supplied, otherwise the status should be honored and a warning shall be issued.
Being part of the spec, the behaviour should also tag existing resources that do not have the tags yet and once the tags in the infrastructure CRD are changed all the AWS resources should be updated accordingly.
On AWS this can be done without re-creating any resources (the behaviour is basically an upsert by tag key) and is possible without service interruption as it is a metadata operation.
Tag deletes continue to be out of scope, as the customer can still have custom tags applied to the resources that we do not want to delete.
Due to the ongoing intree/out of tree split on the cloud and CSI providers, this should not apply to clusters with intree providers (!= "external").
Once confident we have all components updated, we should introduce an end2end test that makes sure we never create resources that are untagged.
After that, we can remove the experimental flag and make this a GA feature.
Requirement | Notes | isMvp? |
---|---|---|
CI - MUST be running successfully with test automation | This is a requirement for ALL features. | YES |
Release Technical Enablement | Provide necessary release enablement details and documents. | YES |
List any affected packages or components.
RFE-1101 described user defined tags for AWS resources provisioned by an OCP cluster. Currently user can define tags which are added to the resources during creation. These tags cannot be updated subsequently. The propagation of the tags is controlled using experimental flag. Before this feature goes GA we should define and implement a mechanism to exclude any experimental flags. Day2 operations and deletion of tags is not in the scope.
RFE-2012 aims to make the user-defined resource tags feature GA. This means that user defined tags should be updatable.
Currently the user-defined tags during install are passed directly as parameters of the Machine and Machineset resources for the master and worker. As a result these tags cannot be updated by consulting the Infrastructure resource of the cluster where the user defined tags are written.
The MCO should be changed such that during provisioning the MCO looks up the values of the tags in the Infrastructure resource and adds the tags during creation of the EC2 resources. The MCO should also watch the infrastructure resource for changes and when the resource tags are updated it should update the tags on the EC2 instances without restarts.
Acceptance Criteria:
Customers are asking for improvements to the upgrade experience (both over-the-air and disconnected). This is a feature tracking epics required to get that work done.
OCP/Telco Definition of Done
Epic Template descriptions and documentation.
<--- Cut-n-Paste the entire contents of this description into your new Epic --->
Goal
Add the ability to choose between a full cluster upgrade (which exists today) or control plane upgrade (which will pause all worker pools) in the console.
Background
Currently in the console, users only have the ability to complete a full cluster upgrade. For many customers, upgrades take longer than what their maintenance window allows. Users need the ability to upgrade the control plane independently of the other worker nodes.
Ex. Upgrades of huge clusters may take too long so admins may do the control plane this weekend, worker-pool-A next weekend, worker-pool-B the weekend after, etc. It is all at a pool level, they will not be able to choose specific hosts.
Requirements
Design deliverables:
Goal
Improve the UX on the machine config pool page to reflect the new enhancements on the cluster settings that allows users to select the ability to update the control plane only.
Background
Currently in the console, users only have the ability to complete a full cluster upgrade. For many customers, upgrades take longer than what their maintenance window allows. Users need the ability to upgrade the control plane independently of the other worker nodes.
Ex. Upgrades of huge clusters may take too long so admins may do the control plane this weekend, worker-pool-A next weekend, worker-pool-B the weekend after, etc. It is all at a pool level, they will not be able to choose specific hosts.
Requirements
Design deliverables:
Enable sharing ConfigMap and Secret across namespaces
Requirement | Notes | isMvp? |
---|---|---|
Secrets and ConfigMaps can get shared across namespaces | YES |
NA
NA
Consumption of RHEL entitlements has been a challenge on OCP 4 since it moved to a cluster-based entitlement model compared to the node-based (RHEL subscription manager) entitlement mode. In order to provide a sufficiently similar experience to OCP 3, the entitlement certificates that are made available on the cluster (OCPBU-93) should be shared across namespaces in order to prevent the need for cluster admin to copy these entitlements in each namespace which leads to additional operational challenges for updating and refreshing them.
Questions to be addressed:
* What educational or reference material (docs) is required to support this product feature? For users/admins? Other functions (security officers, etc)?
* Does this feature have doc impact?
* New Content, Updates to existing content, Release Note, or No Doc Impact
* If unsure and no Technical Writer is available, please contact Content Strategy.
* What concepts do customers need to understand to be successful in [action]?
* How do we expect customers will use the feature? For what purpose(s)?
* What reference material might a customer want/need to complete [action]?
* Is there source material that can be used as reference for the Technical Writer in writing the content? If yes, please link if available.
* What is the doc impact (New Content, Updates to existing content, or Release Note)?
OCP/Telco Definition of Done
Epic Template descriptions and documentation.
<--- Cut-n-Paste the entire contents of this description into your new Epic --->
As an OpenShift engineer
I want the shared resource CSI Driver webhook to be installed with the cluster storage operator
So that the webhook is deployed when the CSI driver is deployed
None - no new functional capabilities will be added
None - we can verify in CI that we are deploying the webhook correctly.
None - no new functional capabilities will be added
The scope of this story is to just deploy the "hello world" webhook with the Cluster Storage Operator.
Adding the live ValidatingWebhook configuration and service will be done in a separate story.
As a developer using SharedSecrets and ConfigMaps
I want to ensure all pods set readOnly; true on admission
So that I don't have pods stuck in the "Pending" state because of a bad volume mount
QE will need to verify the new Pod Admission behavior
Docs will need to ensure that readOnly: true is required and must be set to true.
None.
QE testing/verification of the feature - require readOnly to be true
Actions:
1. Create smoke test and submit to GitHub
2. Run script to integrate smoke test with Polarion
As an OpenShift engineer,
I want to initialize a validating admission webhook for the shared resource CSI driver
So that I can eventually require readOnly: true to be set on all pods that use the Shared Resource CSI Driver
None.
None.
None.
This is a prerequisite for implementing the validating admission webhook.
We need to have ART build the container image downstream so that we can add the correct image references for the CVO.
If we reference images in the CVO manifests which do not have downstream counterparts, we break the downstream build for the payload.
CI is capable of producing multiple images for a GitHub repository. For example, github.com/openshift/oc produces 4-5 images with various capabilities.
We did similar work in BUILD-234 - some of these steps are not required.
See also:
Tasks:
This section includes Jira cards that are linked to an Epic, but the Epic itself is not linked to any Feature. These epics were completed when this image was assembled
https://issues.redhat.com/browse/AUTH-2 revealed that, in prinicipal, Pod Security Admission is possible to integrate into OpenShift while retaining SCC functionality.
This epic is about the concrete steps to enable Pod Security Admission by default in OpenShift
Enhancement - https://github.com/openshift/enhancements/pull/1010
ingress-operator must comply to pod security. The current audit warning is:
{ "objectRef": "openshift-ingress-operator/deployments/ingress-operator", "pod-security.kubernetes.io/audit-violations": "would violate PodSecurity \"restricted:latest\": allowPrivilegeEscalation != false (containers \"ingress-operator\", \"kube-rbac-proxy\" must set securityContext.allowPrivilegeEscalation=false), unrestricted capabilities (containers \"ingress-operator\", \"kube-rbac-proxy\" must set securityContext.capabilities.drop=[\"ALL\"]), runAsNonRoot != true (pod or containers \"ingress-operator\", \"kube-rbac-proxy\" must set securityContext.run AsNonRoot=true), seccompProfile (pod or containers \"ingress-operator\", \"kube-rbac-proxy\" must set securityContext.seccompProfile.type to \"RuntimeDefault\" or \"Localhost\")" }
dns-operator must comply to restricted pod security level. The current audit warning is:
{ "objectRef": "openshift-dns-operator/deployments/dns-operator", "pod-security.kubernetes.io/audit-violations": "would violate PodSecurity \"restricted:latest\": allowPrivilegeEscalation != false (containers \"dns-operator\", \"kube-rbac-proxy\" must set securityContext.allowPrivilegeEscalation=false), unre stricted capabilities (containers \"dns-operator\", \"kube-rbac-proxy\" must set securityContext.capabilities.drop=[\"ALL\"]), runAsNonRoot != true (pod or containers \"dns-operator\", \"kube-rbac-proxy\" must set securityContext.runAsNonRoot=tr ue), seccompProfile (pod or containers \"dns-operator\", \"kube-rbac-proxy\" must set securityContext.seccompProfile.type to \"RuntimeDefault\" or \"Localhost\")" }HyperShift provisions OpenShift clusters with externally managed control-planes. It follows a slightly different process for provisioning clusters. For example, HyperShift uses cluster API as a backend and moves all the machine management bits to the management cluster.
showing machine management/cluster auto-scaling tabs in the console is likely to confuse users and cause unnecessary side effects.
See Design Doc: https://docs.google.com/document/d/1k76JtRRHBdCCEjHPqKcYvbNVsuaGmRhWDLESWIm0mbo/edit#
It's based on the SERVER_FLAG controlPlaneTopology being set to External is really the driving factor here; this can be done in one of two ways:
To test work related to cluster upgrade process, use a 4.10.3 cluster set on the candidate-4.10 upgrade channel using 4.11 frontend code.
If the Infrastructure.Status.ControlPlaneTopology is set to 'External', the console-operator will pass this information via the console-config.yaml co the console. Console pod will get re-deployed and will store the topology mode information as a SERVER_FLAG. Based on that value we need to remove the ability to “Add identity providers” under “Set up your Cluster”. In addition to the getting started card, we should remove the ability to update a cluster on the details card when applicable (anything that changes a cluster version should be read only).
Summary of changes to the overview page:
Check section 03 for more info: https://docs.google.com/document/d/1k76JtRRHBdCCEjHPqKcYvbNVsuaGmRhWDLESWIm0mbo/edit#
If the Infrastructure.Status.ControlPlaneTopology is set to 'External', the console-operator will pass this information via the console-config.yaml co the console. Console pod will get re-deployed and will store the topology mode information as a SERVER_FLAG. Based on that value we need surface a message that the control plane is externally managed and add following changes:
In general, anything that changes a cluster version should be read only.
Check section 02 for more info: https://docs.google.com/document/d/1k76JtRRHBdCCEjHPqKcYvbNVsuaGmRhWDLESWIm0mbo/edit#
Based on Cesar's comment we should be removing the `Control Plane` section, if the infrastructure.status.controlplanetopology being "External".
If the Infrastructure.Status.ControlPlaneTopology is set to 'External', the console-operator will pass this information via the console-config.yaml to the console. Console pod will get re-deployed and will store the topology mode information as a SERVER_FLAG. Based on that value we need to suspend these notifications:
For these we will need to check `ControlPlaneTopology`, if it's set to 'External' and also check if the user can edit cluster version(either by creating a hook or an RBAC call, eg. `canEditClusterVersion`)
Check section 05 for more info: https://docs.google.com/document/d/1k76JtRRHBdCCEjHPqKcYvbNVsuaGmRhWDLESWIm0mbo/edit#
If the Infrastructure.Status.ControlPlaneTopology is set to 'External', the console-operator will pass this information via the console-config.yaml to the console. Console pod will get re-deployed and will store the topology mode information as a SERVER_FLAG. Based on that value we need to suspend kubeadmin notifier, from the global notifications, since it contain link for updating the cluster OAuth configuration (see attachment).
PatternFly Dark Theme Handbook: https://docs.google.com/document/d/1mRYEfUoOjTsSt7hiqjbeplqhfo3_rVDO0QqMj2p67pw/edit
Admin Console -> Workloads & Pods
Dev Console -> Gotcha pages: Observe Dashboard and Metrics, Add, Pipelines: builder, list, log, and run
As a developer, I want to be able to scope the changes needed to enable dark mode for the admin console. As such, I need to investigate how much of the console will display dark mode using PF variables and also define a list of gotcha pages/components which will need special casing above and beyond PF variable settings.
Acceptance criteria:
As a developer, I want to be able to fix remaining issues from the spreadsheet of issues generated after the initial pass and spike of adding dark theme to the console.. As such, I need to make sure to either complete all remaining issues for the spreadsheet, or, create a bug or future story for any remaining issues in these two documents.
Acceptance criteria:
An epic we can duplicate for each release to ensure we have a place to catch things we ought to be doing regularly but can tend to fall by the wayside.
The Cluster Dashboard Details Card Protractor integration test was failing at high rate, and despite multiple attempts to fix, was never fully resolved, so it was disabled as a way to fix https://bugzilla.redhat.com/show_bug.cgi?id=2068594. Migrating this entire file to Cypress should give us better debugging capability, which is what was done to fix a similarly problematic project dashboard Protractor test.
Currently, you need to navigate to
Cluster Settings ->
Global configuration ->
Console (operator) config ->
Console plugins
to see and managed plugins. This takes a lot of clicks and is not discoverable. We should look at surfacing plugin details where they're easier to find – perhaps on the Cluster Settings page – or at least provide a more convenient link somewhere in the UI.
AC: Add the Dynamic Plugins section to the Status Card in the overview that will contain:
Currently, enabled plugins can fail to load for a variety of reasons. For instance, plugins don't load if the plugin name in the manifest doesn't match the ConsolePlugin name or the plugin has an invalid codeRef. There is no indication in the UI that something has gone wrong. We should explore ways to report this problem in the UI to cluster admins. Depending on the nature of the issue, an admin might be able to resolve the issue or at least report a bug against the plugin.
The message about failing could appear in the notification drawer and/or console plugins tab on the operator config. We could also explore creating an alert if a plugin is failing.
AC:
We have a Timestamp component for consistent display of dates and times that we should expose through the SDK. We might also consider a hook that formats dates and times for places were you don't want or cant use the component, eg. times on a chart.
This will become important when we add a user preference for dates so that plugins show consistent dates and times as console. If I set my user preference to UTC dates, console should show UTC dates everywhere.
AC:
We need to provide a base for running integration tests using the dynamic plugins. The tests should initially
Once the basic framework is in place, we can update the demo plugin and add new integration tests when we add new extension points.
https://github.com/openshift/console/tree/master/frontend/dynamic-demo-plugin
https://github.com/openshift/enhancements/blob/master/enhancements/console/dynamic-plugins.md
https://github.com/openshift/console/tree/master/frontend/packages/console-plugin-sdk
In the 4.11 release, a console.openshift.io/default-i18next-namespace annotation is being introduced. The annotation indicates whether the ConsolePlugin contains localization resources. If the annotation is set to "true", the localization resources from the i18n namespace named after the dynamic plugin (e.g. plugin__kubevirt), are loaded. If the annotation is set to any other value or is missing on the ConsolePlugin resource, localization resources are not loaded.
In case these resources are not present in the dynamic plugin, the initial console load will be slowed down. For more info check BZ#2015654
AC:
Follow up of https://issues.redhat.com/browse/CONSOLE-3159
Goal
Background
RFE: for 4.10, Cincinnati and the cluster-version operator are adding conditional updates (a.k.a. targeted edge blocking): https://issues.redhat.com/browse/OTA-267
High-level plans in https://github.com/openshift/enhancements/blob/master/enhancements/update/targeted-update-edge-blocking.md#update-client-support-for-the-enhanced-schema
Example of what the oc adm upgrade UX will be in https://github.com/openshift/enhancements/blob/master/enhancements/update/targeted-update-edge-blocking.md#cluster-administrator.
The oc implementation landed via https://github.com/openshift/oc/pull/961.
Design
See design doc: https://docs.google.com/document/d/1Nja4whdsI5dKmQNS_rXyN8IGtRXDJ8gXuU_eSxBLMIY/edit#
See marvel: https://marvelapp.com/prototype/h3ehaa4/screen/86077932
Update the cluster settings page to inform the user when the latest available update is supported but not recommended. Add an informational popover to the latest version in update path visualization.
The "Update Version" modal on the cluster settings page should be updated to give users information about recommended, not recommended, and blocked update versions.
Story: As an administrator I want to rely on a default configuration that spreads image registry pods across topology zones so that I don't suffer from a long recovery time (>6 mins) in case of a complete zone failure if all pods are impacted.
Background: The image registry currently uses affinity/anti-affinity rules to spread registry pods across different hosts. However this might cause situations in which all pods end up on hosts of a single zone, leading to a long recovery time of the registry if that zone is lost entirely. However due to problems in the past with the preferred setting of anti-affinity rule adherence the configuration was forced instead with required and the rules became constraints. With zones as constraints the internal registry would not have deployed anymore in environments with a single zone, e.g. internal CI environment. Pod topology constraints is a new API that is supported in OCP which can also relax constraints in case they cannot be satisfied. Details here: https://docs.openshift.com/container-platform/4.7/nodes/scheduling/nodes-scheduler-pod-topology-spread-constraints.html
Acceptance criteria:
Open Questions:
As an OpenShift administrator
I want to provide the registry operator with a custom certificate authority for S3 storage
so that I can use a third-party S3 storage provider.
Remove Jenkins from the OCP Payload.
See epic linking - need alternative non payload image available to provide relatively seamless migration
Also, the EP for this is approved and merged at https://github.com/openshift/enhancements/blob/master/enhancements/builds/remove-jenkins-payload.md
PARTIAL ANSWER ^^: confirmed with Ben Parees in https://coreos.slack.com/archives/C014MHHKUSF/p1646683621293839 that EP merging is currently sufficient OCP "technical leadership" approval.
assuming none
As maintainers of the OpenShift jenkins component, we need run Jenkins CI for PR testing against openshift/jenkins, openshift/jenkins-sync-plugin, openshift/jenkins-client-plugin, openshift/jenkins-openshift-login-plugin, using images built in the CI pipeline but not injected into CI test clusters via sample operator overriding the jenkins sample imagestream with the jenkins payload image.
As maintainers of the OpenShift Jenkins component, we need Jenkins periodics for the client and sync plugins to run against the latest non payload, CPaas image, promoted to CI's image locations on quay.io, for the current release in development.
As maintainers of the OpenShift Jenkins component, we need Jenkins related tests outside of very basic Jenkins Pipieline Strategy Build Config verification, removed from openshift-tests in OpenShift Origin, using a non-payload, CPaas image pertinent to the branch in question.
High Level, we ideally want to vet the new CPaas image via CI and periodics BEFORE we start changing the samples operator so that it does not manipulate the jenkins imagestream (our tests will override the samples operator override)
NONE ... QE should wait until JNKS-254
NONE
NONE
Dependencies identified
Blockers noted and expected delivery timelines set
Design is implementable
Acceptance criteria agreed upon
Story estimated
Possible staging
1) before CPaas is available, we can validate images generated by PRs to openshift/jenkins, openshift/jenkins-sync-plugin, openshift/jenkins-client-plugin by taking the image built by the image (where the info needed to get the right image from the CI registry is in the IMAGE_FORMAT env var) and then doing an `oc tag --source=docker <PR image ref> openshift/jenkins:2` to replace the use of the payload image in the jenkins imagestream in the openshift namespace with the PRs image
2) insert 1) in https://github.com/openshift/release/blob/master/ci-operator/step-registry/jenkins/sync-plugin/e2e/jenkins-sync-plugin-e2e-commands.sh and https://github.com/openshift/release/blob/master/ci-operator/step-registry/jenkins/client-plugin/tests/jenkins-client-plugin-tests-commands.sh where you test for IMAGE_FORMAT being set
3) or instead of 2) you update the Makefiles for the plugins to call a script that does the same sort of thing, see what is in IMAGE_FORMAT, and if it has something, do the `oc tag`
https://github.com/openshift/release/pull/26979 is a prototype of how to stick the image built from a PR and conceivably the periodics to get the image built from it and tag it into the jenkins imagestream in the openshift namespace in the test cluster
After installing or upgrading to the latest OCP version, the existing OpenShift route to the prometheus-k8s service is updated to be a path-based route to '/api/v1'.
DoD:
Following up on https://issues.redhat.com/browse/MON-1320, we added three new CLI flags to Prometheus to apply different limits on the samples' labels. These new flags are available starting from Prometheus v2.27.0, which will most likely be shipped in OpenShift 4.9.
The limits that we want to look into for OCP are the following ones:
# Per-scrape limit on number of labels that will be accepted for a sample. If # more than this number of labels are present post metric-relabeling, the # entire scrape will be treated as failed. 0 means no limit. [ label_limit: <int> | default = 0 ] # Per-scrape limit on length of labels name that will be accepted for a sample. # If a label name is longer than this number post metric-relabeling, the entire # scrape will be treated as failed. 0 means no limit. [ label_name_length_limit: <int> | default = 0 ] # Per-scrape limit on length of labels value that will be accepted for a sample. # If a label value is longer than this number post metric-relabeling, the # entire scrape will be treated as failed. 0 means no limit. [ label_value_length_limit: <int> | default = 0 ]
We could benefit from them by setting relatively high values that could only induce unbound cardinality and thus reject the targets completely if they happened to breach our constrainst.
DoD:
When users configure CMO to interact with systems outside of an OpenShift cluster, we want to provide an easy way to add the cluster ID to the data send.
Technically this can be achieved today, by adding an identifying label to the remote_write configuration for a given cluster. The operator adding the remote_write integration needs to take care that the label is unique over the managed fleet of clusters. This however adds management complexity. Any given cluster already has a pseudo-unique datum, that can be used for this purpose.
Expose a flag in the CMO configuration, that is false by default (keeps backward compatibility) and when set to true will add the _id label to a remote_write configuration. More specifically it will be added to the top of a remote_write relabel_config list via the replace action. This will add the label as expect, but additionally a user could alter this label in a later relabel config to suit any specific requirements (say rename the label or add additional information to the value).
The location of this flag is the remote_write Spec, so this can be set for individual remote_write configurations.
Add an optional boolean flag to CMOs definition of RemoteWriteSpec that if true adds an entry in the specs WriteRelabelConfigs list.
I went with adding the relabel config to all user-supplied remote_write configurations. This path has no risk for backwards compatibility (unless users use the {}tmp_openshift_cluster_id{} label, seems unlikely) and reduces overall complexity, as well as documentation complexity.
The entry should look like what is already added to the telemetry remote write config and it should be added as the first entry in the list, before any user supplied relabel configs.
We currently use a sample app to e2e test remote write in CMO.
In order to test the addition of the cluster_id relabel config, we need to confirm that the metrics send actually have the expected label.
For this test we should use Prometheus as the remote_write target. This allows us to query the metrics send via remote write and confirm they have the expected label.
The potential target ServiceMonitors are:
As a user, I want to understand which service bindings connected a service to a component successfully or not. Currently it's really difficult to understand and needs inspection into each ServiceBinding resource (yaml).
See also https://docs.google.com/document/d/1OzE74z2RGO5LPjtDoJeUgYBQXBSVmD5tCC7xfJotE00/edit
As a user, I want the topology view to be less cluttered as I doom out showing only information that I can discern and still be able to get a feel for the status of my project.
This epic is mainly focused on the 4.10 Release QE activities
1. Identify the scenarios for automation
2. Segregate the test Scenarios into smoke, Regression and other user stories
a. Update the https://docs.jboss.org/display/ODC/Automation+Status+Report
3. Align with layered operator teams for updating scripts
3. Work closely with dev team for epic automation
4. Create the automation scripts using cypress
5. Implement CI for nightly builds
6. Execute scripts on sprint basis
To the track the QE progress at one place in 4.10 Release Confluence page
Acceptance criteria:
This epic covers a number of customer requests(RFEs) as well as increases usability.
Customer satisfaction as well as improved usability.
None
As a user, I should be able to switch between the form and yaml editor while creating the ProjectHelmChartRepository CR.
Form component https://github.com/openshift/console/pull/11227
As a user, I want to use a form to create Deployments
Edit deployment form ODC-5007
Currently we are only able to get limited telemetry from the Dev Sandbox, but not from any of our managed clusters or on prem clusters.
In order to improve properly analyze usage and the user experience, we need to be able to gather as much data as possible.
// JS type
telemetry?: Record<string, string>
./bin/bridge --telemetry SEGMENT_API_KEY=a-key-123-xzy ./bin/bridge --telemetry CONSOLE_LOG=debug
Goal:
Enhance oc adm release new (and related verbs info, extract, mirror) with heterogeneous architecture support
tl;dr
oc adm release new (and related verbs info, extract, mirror) would be enhanced to optionally allow the creation of manifest list release payloads. The manifest list flow would be triggered whenever the CVO image in an imagestream was a manifest list. If the CVO image is a standard manifest, the generated release payload will also be a manifest. If the CVO image is a manifest list, the generated release payload would be a manifest list (containing a manifest for each arch possessed by the CVO manifest list).
In either case, oc adm release new would permit non-CVO component images to be manifest or manifest lists and pass them through directly to the resultant release manifest(s).
If a manifest list release payload is generated, each architecture specific release payload manifest will reference the same pullspecs provided in the input imagestream.
More details in Option 1 of https://docs.google.com/document/d/1BOlPrmPhuGboZbLZWApXszxuJ1eish92NlOeb03XEdE/edit#heading=h.eldc1ppinjjh
This section includes Jira cards that are linked to an Epic, but the Epic itself is not linked to any Feature. These epics were not completed when this image was assembled
I asked Zvonko Kaiser and he seemed open to it. I need to confirm with Shiva Merla
Rename Provider to Infrastructure Provider
Add GPU Provider
https://miro.com/app/board/uXjVOeUB2B4=/?moveToWidget=3458764514332229879&cot=14
OCP/Telco Definition of Done
Epic Template descriptions and documentation.
<--- Cut-n-Paste the entire contents of this description into your new Epic --->
As a developer building container images on OpenShift
I want to specify that my build should run without elevated privileges
So that builds do not run as root from the host's perspective with elevated privileges
No QE required for Dev Preview. OpenShift regression testing will verify that existing behavior is not impacted.
We will need to document how to enable this feature, with sufficient warnings regarding Dev Preview.
This likely warrants an OpenShift blog post, potentially?
This section includes Jira cards that are not linked to either an Epic or a Feature. These tickets were completed when this image was assembled
#Description of problem:
Developer Console > +ADD > Develoeper Catalog > Service > select Types Templates > Initiate Template
Input values in Instantiate Template are disappeared randomly.
#Version-Release number of selected component (if applicable):
#How reproducible:
I reproduced this issue in ocp410ovn shared cluster in the quicklab
Select Apache HTTP Server > Input name "test" in Application Hostname box
After several seconds, the value has disappeared in the web console.
#Steps to Reproduce:
0. Developer Console > +ADD > Develoeper Catalog > Service > select Types Templates > Initiate Template
1. Input values in the box of template menu.
2. The values are disappeared after several seconds later. (20s~ or randomly)
3. Many users have experienced this issue.
==> the browser version doesn't matter.
#Actual results:
Input values in "Instantiate Template" are disappeared randomly.
Users can't use the Initiate Template feature in the Dev console.
#Expected results:
Input values remain in the web console and users creat the object by the "Instantiate Template"
#Additional info:
See "Application Name" has disappeared in the video I attached.
This is a clone of OCPBUGSM-47085
Version:
$ openshift-install version
4.11.0-rc2
Platform:
Nutanix
On `openshift-installer create manifests` stage a connection to Prism is made (see https://github.com/openshift/installer/blob/master/pkg/asset/installconfig/nutanix/validation.go#L15-L36=)
This make generating manifests separately impossible, which breaks Assisted Installer flow. Instead of storing sensitive user information, Assisted Installer sets fake details in install-config.yaml and asks user to update these after installation has completed.
With validation happening on `openshift-install create manifests` phase installation process can't start with invalid credentials.
Please move this validation to ValidateForProvisioning, similar to vSphere
We will want to establish some basic metrics we can report back to Telemetry.
Let's consider:
Below is some background info from MTC when we added Telemetry support that may help
See: https://github.com/konveyor/metrics-queries/blob/master/README.md
Design/Development info:
OpenShift Monitoring Integration Guide
Monitoring integration with OLM operators
https://www.openshift.com/blog/observability-superpower-correlation
Source Code:
https://github.com/konveyor/mig-controller/blob/master/pkg/controller/migmigration/metrics.go
See these threads https://coreos.slack.com/archives/G01F05P2PTL/p1645982017061749?thread_ts=1645970469.871559&cid=G01F05P2PTL for more information
Description of problem:
This is a clone of https://issues.redhat.com/browse/OCPBUGS-469
Description of problem: Numerous erroreneous logs in OVN master
I0823 18:00:11.163491 1 obj_retry.go:1063] Retry object setup: *v1.Pod openshift-operator-lifecycle-manager/collect-profiles-27687900-hlp6k
I0823 18:00:11.163546 1 obj_retry.go:1096] Removing old object: *v1.Pod openshift-operator-lifecycle-manager/collect-profiles-27687900-hlp6k
I0823 18:00:11.163555 1 pods.go:124] Deleting pod: openshift-operator-lifecycle-manager/collect-profiles-27687900-hlp6k
I0823 18:00:11.163631 1 obj_retry.go:1103] Retry delete failed for *v1.Pod openshift-operator-lifecycle-manager/collect-profiles-27687900-hlp6k, will try again later: deleteLogicalPort failed for pod openshift-operator-lifecycle-manager_collect-profiles-27687900-hlp6k: unable to locate portUUID+nodeName for pod openshift-operator-lifecycle-manager/collect-profiles-27687900-hlp6k: error getting logical port <nil>: object not found
W0823 18:00:41.163633 1 obj_retry.go:1031] Dropping retry entry for *v1.Pod openshift-operator-lifecycle-manager/collect-profiles-27687900-hlp6k: exceeded number of failed attempts
Must-gather: http://shell.lab.bos.redhat.com/~anusaxen/must-gather.local.2234927131259452300/
Version-Release number of selected component (if applicable): 4.12.0-0.nightly-2022-08-23-031342
How reproducible: Always
Steps to Reproduce:
1. Bring up OVN cluster on 4.12
2.
3.
Actual results: deleteLogicalPort failed for already gone object
Expected results: deleteLogicalPort should not keep retrying post object deletion
Additional info:
[Updated story request]
Decision is to always display Red Hat OpenShift logo for OCP instead of conditionally. And also update the OCP login, errors, providers templates. https://openshift.github.io/oauth-templates/
Related note in comments.
[Original request]
If the ACM or the ACS dynamic plugin is enabled and there is not a custom branding set, then the default "Red Hat Openshift" branding should be shown.
This was identified as an issue during the Hybrid Console Scrum on 11/15/20201
PRs associated with this change
https://github.com/openshift/console/pull/10940 [merged]
https://github.com/openshift/oauth-templates/pull/20 [merged]
https://github.com/openshift/cluster-authentication-operator/pull/540 [merged]
Description of problem:
During ocp multinode spoke cluster creation agent provisioning is stuck on "configuring" because machineConfig service is crashing on the node.
After restarting the service still fails with
Can't read link "/var/lib/containers/storage/overlay/l/V2OP2CCVMKSOHK2XICC546DUCG" because it does not exist. A storage corruption might have occurred, attempting to recreate the missing symlinks. It might be best wipe the storage to avoid further errors due to storage corruption.
Version-Release number of selected component (if applicable):
Podman 4.0.2 +
How reproducible:
sometimes
Steps to Reproduce:
1. deploy multinode spoke (ipxe + boot order ) 2. 3.
Actual results:
4 agents in done state and 1 is in "configuring"
Expected results:
all agents are in "done" state
Additional info:
issue mentioned in https://github.com/containers/podman/issues/14003
Fix: https://github.com/containers/storage/issues/1136
Description of problem:
Setting a telemeter proxy in the cluster-monitoring-config config map does not work as expected
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
the following KCS details steps to add a proxy.
The steps have been verified at 4.7 but do not work at 4.8, 4.9 or 4.10
https://access.redhat.com/solutions/6172402
When testing at 4.8, 4.9 and 4.10 the proxy setting where also nested under `telemeterClient`
which triggered a telemeter restart but the proxy setting do not get set in the deployment as they do in 4.7
Actual results:
4.8, 4.9 and 4.10 without the nested `telemeterClient`
does not trigger a restart of the telemeter pod
Expected results:
I think the proxy setting should be nested under telemeterClient
but should set the environment variables in the deployment
Additional info:
This is a backport of https://bugzilla.redhat.com/show_bug.cgi?id=2116382 from 4.12 to 4.11.z. Creating manually because as seen in https://github.com/openshift/cluster-monitoring-operator/pull/1743 `/cherry-pick` doesn't work for bugs originally created in bugzilla
Description of problem: defined in https://bugzilla.redhat.com/show_bug.cgi?id=2051533
When adding remote worker node using ZTP the agent finishes the installation and is marked as done. oc get agent -o wide NAME CLUSTER APPROVED ROLE STAGE HOSTNAME REQUESTED HOSTNAME 0277804e-2a7c-4d95-9d0f-e22a190d582a spoke-0 true worker Done spoke-worker-0-0.spoke-0.qe.lab.redhat.com spoke-worker-0-0 12efa520-5b99-4474-805d-931e46ad43f7 spoke-0 true master Done spoke-master-0-2.spoke-0.qe.lab.redhat.com spoke-master-0-2 3b8eec89-f26f-4896-8f71-8a810894c560 spoke-0 true master Done spoke-master-0-0.spoke-0.qe.lab.redhat.com spoke-master-0-0 3fb3749e-c132-4258-ad1a-08a0445c9022 spoke-0 true worker Done spoke-worker-0-1.spoke-0.qe.lab.redhat.com spoke-worker-0-1 728559e9-5543-41d9-adb0-e58196f765af spoke-0 true master Done spoke-master-0-1.spoke-0.qe.lab.redhat.com spoke-master-0-1 982e1ff6-6e83-4800-b061-8cdfd0b844fb spoke-0 true worker Done spoke-rwn-0-1.spoke-rwn-0.qe.lab.redhat.com spoke-rwn-0-1 a76eaa6a-b351-429f-bfa1-e53a70503573 spoke-0 true worker Done spoke-rwn-0-0.spoke-rwn-0.qe.lab.redhat.com spoke-rwn-0-0 Logging into the spoke cluster the bmh and machine resources are created and the node resource is not: oc get bmh -n openshift-machine-api NAME STATE CONSUMER ONLINE ERROR AGE spoke-master-0-0 unmanaged spoke-0-pxbfh-master-0 true 3h32m spoke-master-0-1 unmanaged spoke-0-pxbfh-master-1 true 3h32m spoke-master-0-2 unmanaged spoke-0-pxbfh-master-2 true 3h32m spoke-rwn-0-0-bmh externally provisioned spoke-0-spoke-rwn-0-0-bmh true provisioned registration error 168m spoke-rwn-0-1-bmh externally provisioned spoke-0-spoke-rwn-0-1-bmh true provisioned registration error 168m spoke-worker-0-0 unmanaged spoke-0-pxbfh-worker-0-65mrb true 3h32m spoke-worker-0-1 unmanaged spoke-0-pxbfh-worker-0-nnmcq true 3h32m oc get machine -n openshift-machine-api NAME PHASE TYPE REGION ZONE AGE spoke-0-pxbfh-master-0 Running 3h33m spoke-0-pxbfh-master-1 Running 3h33m spoke-0-pxbfh-master-2 Running 3h33m spoke-0-pxbfh-worker-0-65mrb Running 3h19m spoke-0-pxbfh-worker-0-nnmcq Running 3h20m spoke-0-spoke-rwn-0-0-bmh Provisioned 169m spoke-0-spoke-rwn-0-1-bmh Provisioned 169m Note: bmh is in error state: Normal ProvisionedRegistrationError 30m metal3-baremetal-controller Host adoption failed: Error while attempting to adopt node 529b3e75-5d04-4486-9296-269081d0ec02: Error validating Redfish virtual media. Some parameters were missing in node's driver_info. Missing are: ['deploy_kernel', 'deploy_ramdisk']. oc get nodes NAME STATUS ROLES AGE VERSION spoke-master-0-0.spoke-0.qe.lab.redhat.com Ready master 72m v1.22.3+2cb6068 spoke-master-0-1.spoke-0.qe.lab.redhat.com Ready master 50m v1.22.3+2cb6068 spoke-master-0-2.spoke-0.qe.lab.redhat.com Ready master 72m v1.22.3+2cb6068 spoke-worker-0-0.spoke-0.qe.lab.redhat.com Ready worker 51m v1.22.3+2cb6068 spoke-worker-0-1.spoke-0.qe.lab.redhat.com Ready worker 51m v1.22.3+2cb6068 node-bootstrapper CSR is created but not auto-approved; periodically another node-strapper csr is created until it is manually approved: oc get csr | grep Pending csr-5ll2g 9m9s kubernetes.io/kube-apiserver-client-kubelet system:serviceaccount:openshift-machine-config-operator:node-bootstrapper <none> Pending csr-f8vbl 8m24s kubernetes.io/kube-apiserver-client-kubelet system:serviceaccount:openshift-machine-config-operator:node-bootstrapper <none> Pending
Version-Release number of selected component (if applicable):
assisted-service master at revision af0bafb3f7f629932f8c3dc31ccddedfe6984926 ocp version: 4.10.0-rc.1
How reproducible:
1. Install remote worker node using ztp 2. Wait for node resource to be created
Steps to Reproduce:
1. Install remote worker node using ztp 2. Wait for node resource to be created
Actual results:
node-bootstrapper and node CSR are not auto-approved and node resource is not created. The bmh resource remains in registration error
Expected results:
node-bootstrapper and node CSR should be auto-approved and node resource created. The bmh resource should not be in registration error
Additional info:
We're seeing a slight uptick in how long upgrades are taking[1][2]. We are not 100% sure of the cause, but it looks like it started with 4.11 rc.7. There's no obvious culprits in the diff[3].
Looking at some of the jobs, we are seeing the gaps between kube-scheduler being updated and then machine-api appear to take longer. Example job run[4] showing 10+ minutes waiting for it.
TRT had a debugging session, and we have two suggestions:
[1] https://search.ci.openshift.org/graph/metrics?metric=job%3Aduration%3Atotal%3Aseconds&job=periodic-ci-openshift-release-master-ci-4.12-upgrade-from-stable-4.11-e2e-aws-ovn-upgrade&job=periodic-ci-openshift-release-master-ci-4.12-upgrade-from-stable-4.11-e2e-aws-sdn-upgrade&job=periodic-ci-openshift-release-master-ci-4.12-upgrade-from-stable-4.11-e2e-azure-upgrade&job=periodic-ci-openshift-release-master-ci-4.12-upgrade-from-stable-4.11-e2e-gcp-ovn-upgrade&job=periodic-ci-openshift-release-master-ci-4.12-upgrade-from-stable-4.11-e2e-gcp-sdn-upgrade
[2] https://sippy.dptools.openshift.org/sippy-ng/tests/4.12/analysis?test=Cluster%20upgrade.%5Bsig-cluster-lifecycle%5D%20cluster%20upgrade%20should%20complete%20in%2075.00%20minutes
[3] https://amd64.ocp.releases.ci.openshift.org/releasestream/4-stable/release/4.11.0-rc.7
[4] https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.12-e2e-azure-sdn-upgrade/1556865989923049472
update ironic software to pick up latest bug fixes
update dependencies as needed
this is also related to https://bugzilla.redhat.com/show_bug.cgi?id=2115122
The two modules that are auto generated for the CLI docs need to add ":_content-type: REFERENCE" to the top of the files. Update the doc generation templates to add these.
When a thin provisioned COW format disk is created on OCP on RHV via CSI driver (a PVC -
https://github.com/openshift/ovirt-csi-driver/blob/master/deploy/example/storage-claim.yaml
But this is thin provisioned disk, so the initial size of the disk should be default of the engine and then grow as needed, it shouldn't be this big.
This causes all the disks created this way to be functionally preallocated (since it eats all that space), which is a real waste of space.
How reproducible: 100%
Steps to Reproduce:
1. Create a storage claim (PVC) in Openshift (
https://github.com/openshift/ovirt-csi-driver/blob/master/deploy/example/storage-claim.yaml
) using the default storage class (or any other storage class with thinProvisioning: "true") and with requested storage i.e. 100Gi
$ oc create -f storage-claim.yaml
2. In the RHV web console navigate to Storage -> Disks and check Virtual size and Actual size of the created disk (PVC)
Actual results:
Disk from our example with requested storage 100GB reports virtual size 100GB and actual size 110 GB.
Expected results:
Thin provisioned disks should start with small initial size and then grow as needed, so its actual size should be considerably smaller (the default initial size set by the engine should be 2.5 GB if I'm not mistaken).
Note: The extra 10GB in the actual size are caused by overhead for the qcow2 disk format, which is 10%, and this was tracked here as a separate issue:
https://bugzilla.redhat.com/show_bug.cgi?id=2097139
Description of problem:
This repository is out of sync with the downstream product builds for this component.
One or more images differ from those being used by ART to create product builds. This
should be addressed to ensure that the component's CI testing is accurately
reflecting what customers will experience.
The information within the following ART component metadata is driving this alignment
request: ose-baremetal-installer.yml.
The vast majority of these PRs are opened because a different Golang version is being
used to build the downstream component. ART compiles most components with the version
of Golang being used by the control plane for a given OpenShift release. Exceptions
to this convention (i.e. you believe your component must be compiled with a Golang
version independent from the control plane) must be granted by the OpenShift
architecture team and communicated to the ART team.
Version-Release number of selected component (if applicable): 4.11
Additional info: This issue is needed for the bot-created PR to merge after the 4.11 GA.
The relevant code in ironic-image was not updated to support TLS, so it still uses the old port and explicit http://
This is a clone of issue OCPBUGS-884. The following is the description of the original issue:
—
Description of problem:
Since the decomissioning of the psi cluster, and subsequent move of the rhcos release browser, product builds machine-os-images builds have been failing. See e.g. https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=47565717
Version-Release number of selected component (if applicable):
4.12, 4.11, 4.10.
How reproducible:
Have ART build the image
Steps to Reproduce:
1. Have ART build the image
Actual results:
Build failure
Expected results:
Build succesful
Additional info:
Description of problem:
Users search a resource (for example, Pod) with Name filter applied and input a text to the filter field then the search results filtered accordingly.
Once the results are shown, when the user clear the value in one-shot (i.e. select whole filter text from the field and clear it using delete or backspace key) from the field,
then the search result doesn't clear accordingly and the previous result stays on the page.
Version-Release number of selected components (if applicable):
4.11.0-0.nightly-2022-08-16-194731 & works fine with OCP 4.12 latest version.
How reproducible:
Always
Steps to Reproduce:
Actual results:
Search result doesn't clear when user clears name filter in one-shot for any resources.
Expected results:
Search results should clear when the user clears name filters in one-shot for any resources.
Additional info:
Reproduced in both chrome[103.0.5060.114 (Official Build) (64-bit)] and firefox[91.11.0esr (64-bit)] browsers.
Attached screen share for the same issue. SearchIssues.mp4
1. Proposed title of this feature request
--> Alert generation when the etcd container memory consumption goes beyond 90%
2. What is the nature and description of the request?
--> When the etcd database starts growing rapidly due to some high number of objects like secrets, events, or configmap generation by application/workload, the memory and CPU consumption of APIserver and etcd container (control plane component) spikes up and eventually the control plane nodes goes to hung/unresponsive or crash due to out of memory errors as some of the critical processes/services running on master nodes get killed. Hence we request an alert/alarm when the ETCD container's memory consumption goes beyond 90% so that the cluster administrator can take some action before the cluster/nodes go unresponsive.
I see we already have a etcdExcessiveDatabaseGrowth Prometheus rule which helps when the surge in etcd writes leading to a 50% increase in database size over the past four hours on etcd instance however it does not consider the memory consumption:
$ oc get prometheusrules etcd-prometheus-rules -o yaml|grep -i etcdExcessiveDatabaseGrowth -A 9
- alert: etcdExcessiveDatabaseGrowth
annotations:
description: 'etcd cluster "{{ $labels.job }}": Observed surge in etcd writes
leading to 50% increase in database size over the past four hours on etcd
instance {{ $labels.instance }}, please check as it might be disruptive.'
expr: |
increase(((etcd_mvcc_db_total_size_in_bytes/etcd_server_quota_backend_bytes)*100)[240m:1m]) > 50
for: 10m
labels:
severity: warning
3. Why does the customer need this? (List the business requirements here)
--> Once the etcd memory consumption goes beyond 90-95% of total ram as it's system critical container, the OCP cluster goes unresponsive causing revenue loss to business and impacting the productivity of users of the openshift cluster.
4. List any affected packages or components.
--> etcd
This is a clone of issue OCPBUGS-212. The following is the description of the original issue:
—
Description of problem:
oc --context build02 get clusterversion NAME VERSION AVAILABLE PROGRESSING SINCE STATUS version 4.12.0-ec.1 True False 45h Error while reconciling 4.12.0-ec.1: the cluster operator kube-controller-manager is degraded oc --context build02 get co kube-controller-manager NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE kube-controller-manager 4.12.0-ec.1 True False True 2y87d GarbageCollectorDegraded: error fetching rules: Get "https://thanos-querier.openshift-monitoring.svc:9091/api/v1/rules": dial tcp 172.30.153.28:9091: connect: cannot assign requested address
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1.
2.
3.
Actual results:
Expected results:
Additional info:
build02 is a build farm cluster in CI production.
I can provide credentials to access the cluster if needed.
Users can't configure the retention period for Thanos Ruler currently and the default value is 24h (from the prometheus operator).
This section includes Jira cards that are not linked to either an Epic or a Feature. These tickets were not completed when this image was assembled
When running yarn dev, type warnings can be seen in the console and in the dev overlay UI. These need to be resolved.