Jump to: Complete Features | Incomplete Features | Complete Epics | Incomplete Epics | Other Complete | Other Incomplete |
Note: this page shows the Feature-Based Change Log for a release
These features were completed when this image was assembled
Dependencies (internal and external)
The tool should be able to upload an OpenID Connect (OIDC) configuration to an S3 bucket, and create an AWS IAM Identity Provider that trusts identities from the OIDC provider. It should take the infra name as input so that the user can identify all the resources created in AWS. Make sure that resources created in AWS are tagged appropriately.
Sample command with existing key pair:
tool-name create identity-provider <infra-name> --public-key ./path/to/public/key
Ensure the Identity Provider includes audience config for both the in-cluster components ('openshift') and the pod-identity-webhook ('sts.amazonaws.com').
ccoctl should be able to delete AWS resources it created
ccoctl delete <infra-name>
https://github.com/openshift/enhancements/pull/555
https://github.com/openshift/api/pull/827
The console operator will need to support single-node clusters.
We have a console deployment and a downloads deployment. Each will need to be updated so that there's only a single replica when high availability mode is disabled in the Infrastructure config. We should also remove the anti-affinity rule in the console deployment that tries to spread console pods across nodes.
The downloads deployment is currently a static manifest. It likely needs to be created by the console operator instead going forward.
Acceptance Criteria:
Bump github.com/openshift/api to pickup changes from openshift/api#827
As an OpenShift administrator
I want the registry operator to use the topology mode from Infrastructure (HighAvailable = 2 replicas, SingleReplica = 1 replica)
so that the operator does not spend resources on high availability when it's not needed.
See also:
https://github.com/openshift/enhancements/blob/master/enhancements/cluster-high-availability-mode-api.md
https://github.com/openshift/api/pull/827/files
Platform | SingleReplica | HighAvailable |
---|---|---|
AWS | 1 replica | 2 replicas |
Azure | 1 replica | 2 replicas |
GCP | 1 replica | 2 replicas |
OpenStack (swift) | 1 replica | 2 replicas |
OpenStack (cinder) | 1 replica | 1 replica (PVC) |
oVirt | 1 replica | 1 replica (PVC) |
bare metal | Removed | Removed |
vSphere | Removed | Removed |
Research if we can dynamically reserve memory and CPU for nodes.
Feature Overview
This will be phase 1 of Internationalization of the OpenShift Console.
Phase 1 will include the following:
Phase 1 will not include:
Initial List of Languages to Support
---------- 4.7* ----------
*This will be based on the ability to get all the strings externalized, there is a good chance this gets pushed to 4.8.
---------- Post 4.7 ----------
POC
Goals
Internationalization has become table stakes. OpenShift Console needs to support different languages in each of the major markets. This is key functionality that will help unlock sales in different regions.
Requirements
Requirement | Notes | isMvp? |
---|---|---|
Language Selector | | YES |
Localized Date + Time | | YES |
Externalization and translation of all client side strings | | YES |
Translation for Chinese and Japanese | | YES |
Process, infra, and testing capabilities put into place | | YES |
CI - MUST be running successfully with test automation | This is a requirement for ALL features. | YES |
Out of Scope
Assumptions
Customer Considerations
We are rolling this feature out in phases; based on customer feedback, there may be no phase 2.
Documentation Considerations
I believe documentation already supports a large language set.
We have too many namespaces if we're loading them upfront. We should consolidate some of the files.
Consolidate namespaces E-I to reduce change size
Just do namespaces from A-D to reduce the number of files being changed at once
Consolidate namespaces N-R to reduce change size
Consolidate namespaces K-M to reduce change size
Consolidate namespaces S-Z to reduce change size
We need to automate how we send and receive updated translations using Memsource for the Red Hat Globalization team. The Ansible Tower team already has automation in place that we might be able to reuse.
Acceptance Criteria:
OpenShift Sandboxed Containers provide the ability to add an additional layer of isolation through virtualization for many workloads. The main way to enable the use of Kata Containers on an OpenShift cluster is by first installing the Operator (for more information about operator enablement, check [1]).
Once the feature is enabled on the cluster, it is just a matter of a one-line YAML modification at the pod/deployment level to run a workload using Kata Containers. That might sound easy for some, but others who don't want to deal with YAML might want more abstraction around how to use Kata Containers for their workloads.
This feature covers all the efforts required to integrate and present Kata in the OpenShift UI (console) to cater to all user personas.
To enable users to adopt Kata as a runtime, it is important to make it easy to use. Adding hook points in the UI with ease of use as a goal is one way to bring in more users.
The main goal of this feature is to make sure that:
Questions to be addressed:
References
[1] https://issues.redhat.com/browse/KATA-429?jql=project %3D KATA AND issuetype %3D Feature
The grand goal is to improve the usability of Kata from the OpenShift UI. This EPIC aims to cover only a subset that would help:
To use a different runtime e.g., Kata, the "runtimeClassName" will be set to the desired low-level runtime. Also please see [1]:
"RuntimeClassName refers to a RuntimeClass object in the node.k8s.io group, which should be used to run this pod. If no RuntimeClass resource matches the named class, the pod will not be run. If unset or empty, the "legacy" RuntimeClass will be used, which is an implicit class with an empty definition that uses the default runtime handler. More info: https://git.k8s.io/enhancements/keps/sig-node/runtime-class.md This is a beta feature as of Kubernetes v1.14.."
apiVersion: v1
kind: Pod
metadata:
  name: nginx-runc
spec:
  runtimeClassName: runC
The value of the runtime class cannot be changed on the pod level, but it can be changed on the deployment level
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sandboxed-nginx
spec:
  replicas: 2
  selector:
    matchLabels:
      app: sandboxed-nginx
  template:
    metadata:
      labels:
        app: sandboxed-nginx
    spec:
      runtimeClassName: kata   # ---> This can be changed
      containers:
        - name: nginx
          image: nginx
          ports:
            - containerPort: 80
              protocol: TCP
[1] https://docs.openshift.com/container-platform/4.6/rest_api/workloads_apis/pod-core-v1.html
We should show the runtime class on workloads pages and add a badge to the heading in the case a workload uses Kata. A workload uses Kata if its pod template has `runtimeClassName` set to `kata`.
Acceptance Criteria:
Andrew Ronaldson indicated that adding a "kata" badge in the heading would be too much noise around other heading badges (ContainerCreating, Failed, etc).
When this image was assembled, these features were not yet completed. Therefore, only the Jira Cards included here are part of this release
As a stakeholder aiming to adopt KubeSaw as a Namespace-as-a-Service solution, I want the project to provide streamlined tooling and a clear code-base, ensuring seamless adoption and integration into my clusters.
Efficient adoption of KubeSaw, especially as a Namespace-as-a-Service solution, relies on intuitive tooling and a transparent codebase. Improving these aspects will empower stakeholders to effortlessly integrate KubeSaw into their Kubernetes clusters, ensuring a smooth transition to enhanced namespace management.
As a Stakeholder, I want a streamlined setup of the KubeSaw project and a fully automated way of upgrading this setup along with updates to the installation.
The expected outcome within the market is both growth and retention. The improved tooling and codebase will attract new stakeholders (growth) and enhance the experience for existing users (retention) by providing a straightforward path to adopting KubeSaw's Namespace-as-a-Service features in their clusters.
This epic is to track all the unplanned work related to security incidents, fixing flaky e2e tests, and other urgent and unplanned efforts that may arise during the sprint.
We drive OpenShift cross-market customer success and new customer adoption with constant improvements and feature additions to the existing capabilities of our OpenShift Core Networking (SDN and Network Edge). This feature captures that natural progression of the product.
There are definitely grey areas, but in general:
Questions to be addressed:
Create a PR in openshift/cluster-ingress-operator to specify the random balancing algorithm if the feature gate is enabled, and to specify the leastconn balancing algorithm (the current default) otherwise.
This story is for actually updating the version of CoreDNS in github.com/openshift/coredns. Our fork will need to be rebased onto https://github.com/coredns/coredns/releases/tag/v1.8.1, which may involve some git fu. Refer to previous CoreDNS Rebase PR's for any pointers there.
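For orientation, the usual shape of such a rebase looks roughly like the following (a sketch only; the exact remotes, branches, and carry-patch workflow for openshift/coredns may differ):

```
git clone git@github.com:openshift/coredns.git && cd coredns
git remote add upstream https://github.com/coredns/coredns.git
git fetch upstream --tags
git rebase v1.8.1   # replay the downstream carry patches on top of the upstream tag
# resolve any conflicts, run the tests, then open the rebase PR against openshift/coredns
```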
We need to verify that no new CoreDNS dual stack features require any configuration changes or feature flags.
(All dual stack changes should just work once we rebase to coredns v1.8.1).
See https://github.com/coredns/coredns/pull/4339 .
We also need to verify that cluster DNS works for both v4 and v6 for a dual stack cluster IP service. (ie request via A and AAAA, make sure you get the desired response, and not just one or the other). A brief CI test on our dual stack metal CI might make the most sense here (KNI Might have a job like this already, need to investigate our options to add dual stack coverage to openshift/coredns).
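A minimal manual spot-check of the A/AAAA behaviour described above could look like this (run from a pod that has dig available; the service and namespace names are placeholders):

```
dig +short A    my-svc.my-namespace.svc.cluster.local
dig +short AAAA my-svc.my-namespace.svc.cluster.local
# both queries should return the corresponding v4 and v6 cluster IPs, not just one family
```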
CoreDNS v1.7 renamed some metrics that we use in our alerting rules. Make sure the alerting rules in https://github.com/openshift/cluster-dns-operator/blob/master/manifests/0000_90_dns-operator_03_prometheusrules.yaml are using the correct metrics names (and still work as intended).
Create a PR in openshift/cluster-ingress-operator to implement the PROXY protocol API.
The multiple destinations provided as part of the allowedDestinations field are not working on OCP4 as they used to: https://github.com/openshift/images/blob/master/egress/router/egress-router.sh#L70-L109
We need to parse this from the NAD and modify the iptables here to support them:
https://github.com/openshift/egress-router-cni/blob/master/pkg/macvlan/macvlan.go#L272-L349
Testing:
1) Created NAD:
[dsal@bkr-hv02 surya_multiple_destinations]$ cat nad_multiple_destination.yaml
---
apiVersion: "k8s.cni.cncf.io/v1"
kind: NetworkAttachmentDefinition
metadata:
  name: egress-router
spec:
  config: '{
    "cniVersion": "0.4.0",
    "type": "egress-router",
    "name": "egress-router",
    "ip": {
      "addresses": [ "10.200.16.10/24" ],
      "destinations": [
        "80 tcp 10.100.3.200",
        "8080 tcp 203.0.113.26 80",
        "8443 tcp 203.0.113.26 443"
      ],
      "gateway": "10.200.16.1"
    }
  }'
2) Created pod:
[dsal@bkr-hv02 surya_multiple_destinations]$ cat egress-router-pod.yaml
---
apiVersion: v1
kind: Pod
metadata:
  name: egress-router-pod
  annotations:
    k8s.v1.cni.cncf.io/networks: egress-router
spec:
  containers:
    - name: openshift-egress-router-pod
      command: ["/bin/bash", "-c", "sleep 999999999"]
      image: centos/tools
      securityContext:
        privileged: true
3) Checked IPtables:
[root@worker-1 core]# iptables-save -t nat
# Generated by iptables-save v1.8.4 on Mon Feb 1 12:08:05 2021
*nat
:PREROUTING ACCEPT [0:0]
:INPUT ACCEPT [0:0]
:POSTROUTING ACCEPT [0:0]
:OUTPUT ACCEPT [0:0]
-A POSTROUTING -o net1 -j SNAT --to-source 10.200.16.10
COMMIT
# Completed on Mon Feb 1 12:08:05 2021
As we can see, only the SNAT rule is added. The DNAT doesn't get picked up because of the syntax difference.
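For reference, a sketch of the kind of DNAT rules the destinations in the NAD above would be expected to produce (modelled on the OCP 3 egress-router behaviour; the interface name and chains are illustrative, not actual output):

```
-A PREROUTING -i net1 -p tcp --dport 80   -j DNAT --to-destination 10.100.3.200
-A PREROUTING -i net1 -p tcp --dport 8080 -j DNAT --to-destination 203.0.113.26:80
-A PREROUTING -i net1 -p tcp --dport 8443 -j DNAT --to-destination 203.0.113.26:443
-A POSTROUTING -o net1 -j SNAT --to-source 10.200.16.10
```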
Plugin teams need a mechanism to extend the OCP console that is decoupled enough so they can deliver at the cadence of their projects and not be forced into the OCP Console release timelines.
The OCP Console Dynamic Plugin Framework will enable all our plugin teams to do the following:
Requirement | Notes | isMvp? |
---|---|---|
UI to enable and disable plugins | | YES |
Dynamic Plugin Framework in place | | YES |
Testing Infra up and running | | YES |
Docs and README for creating and testing Plugins | | YES |
CI - MUST be running successfully with test automation | This is a requirement for ALL features. | YES |
Release Technical Enablement | Provide necessary release enablement details and documents. | YES |
Documentation Considerations
Questions to be addressed:
Related to CONSOLE-2380
We need a way for cluster admins to disable a console plugin when uninstalling an operator if it's enabled in the console operator config. Otherwise, the config will reference a plugin that no longer exists. This won't prevent console from loading, but it's something that we can clean up during uninstall.
The UI will always remove the console plugin when an operator is uninstalled; there will not be an option to keep the plugin enabled. We should have a sentence in the dialog letting the user know that the plugin will be disabled when the operator is uninstalled (but only if the CSV has the plugin annotation).
If the user doesn't have authority to patch the operator config, we should warn them that the operator config can't be updated to remove the plugin.
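For context, a minimal sketch of the kind of change the UI would apply to the console operator config when removing a plugin (plugin names are placeholders; the real flow is driven by the uninstall dialog):

```
# replace spec.plugins with the list minus the plugin being removed
oc patch consoles.operator.openshift.io cluster --type=merge \
  -p '{"spec":{"plugins":["some-other-plugin"]}}'
```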
Requirement | Notes | isMvp? |
---|---|---|
CI - MUST be running successfully with test automation | This is a requirement for ALL features. | YES |
Release Technical Enablement | Provide necessary release enablement details and documents. | YES |
Questions to be addressed:
Story:
As a user viewing the pod logs tab with a selected container, I want the ability to view past logs if they are available for the container.
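For reference, this is the CLI equivalent of what the UI would surface (standard oc/kubectl behaviour; the pod and container names are placeholders):

```
# --previous (-p) returns the logs of the prior instance of the container, if one exists
oc logs my-pod -c my-container --previous
```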
Acceptance Criteria:
Design doc: https://docs.google.com/document/d/1PB8_D5LTWhFPFp3Ovf85jJTc-zAxwgFR-sAOcjQCSBQ/edit#
The work on this story is dependent on the following changes:
The console already supports custom routes on the operator config. The newly proposed CustomDomains API introduces a unified way to set custom domains for the routes that ship with a stock install, covering both the names and the serving certs/keys that customers want to customize. From the console perspective those are:
The setup should be done on the Ingress config, where two new fields are introduced:
Console-operator will only be consuming the API and checking for any changes. If a custom domain is set for either the `console` or `downloads` route in the `openshift-console` namespace, console-operator will read the setup and set a custom route accordingly. When a custom route is set up for any of console's routes, the default route won't be deleted; instead it will be updated so that it redirects to the custom one. This is done for two reasons:
Console-operator will still need to support the CustomDomain API that is available on its config.
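A sketch of what the Ingress config setup could look like under the proposed API (field names follow the CustomDomains enhancement; the hostname and secret name are examples):

```
apiVersion: config.openshift.io/v1
kind: Ingress
metadata:
  name: cluster
spec:
  componentRoutes:
    - name: console
      namespace: openshift-console
      hostname: console.custom.example.com
      servingCertKeyPairSecret:
        name: console-custom-tls
```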
Acceptance criteria:
Questions:
Implement console-operator changes to consume new CustomDomains API, based on the story details.
Bump the openshift/api godep to pick up the new CustomDomain API for the Ingress config.
This would let us import YAML with multiple resources and add YAML templates that create related resources like image streams and build configs together.
See CONSOLE-580
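For illustration, the kind of multi-document YAML this would let users import in one go (all names and spec details below are made up):

```
apiVersion: image.openshift.io/v1
kind: ImageStream
metadata:
  name: my-app
---
apiVersion: build.openshift.io/v1
kind: BuildConfig
metadata:
  name: my-app
spec:
  source:
    git:
      uri: https://github.com/example/my-app.git
  strategy:
    dockerStrategy: {}
  output:
    to:
      kind: ImageStreamTag
      name: my-app:latest
```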
Acceptance criteria:
When moving to OCP 4 we didn't port the metrics charts for Deployments, Deployment Configs, StatefulSets, DaemonSets, ReplicaSets, and ReplicationControllers. These should be the same charts that we show on the Pods page: Memory, CPU, Filesystem, Network In and Out.
This was only done for pods.
We need to decide if we want to use a multiline chart or some other representation.
As a result of Hashicorp's license change to BSL, Red Hat OpenShift needs to remove the use of Hashicorp's Terraform from the installer – specifically for OpenStack deployments which currently use Terraform for setting up the infrastructure.
To avoid an increased support overhead once the license changes at the end of the year, we want to provision OpenStack infrastructure without the use of Terraform.
Essentially: bring the upstream-master branch of shiftstack/cluster-api-provider-openstack under the github.com/openshift organisation.
The OCP Console needs to detect if the ACM Operator has been installed, if detected then a new multi-cluster perspective option shows up in the perspective chooser.
As a user I need the ability to switch to the ACM UI from the OCP Console and vice versa without requiring the user to log in multiple times.
This option also needs to be hidden if the user doesn't have the correct RBAC.
The console should detect the presence of the ACM operator and add an Advanced Cluster Management item to the perspective switcher. We will need to work with the ACM team to understand how to detect the operator and how to discover the ACM URL.
Additionally, we will need to provide a query parameter or URL fragment to indicate which perspective to use. This will allow ACM to link back to a specific perspective since it will share the same perspective switcher in its UI. ACM will need to be able to discover the console URL.
This story does not include handling SSO, which will be tracked in a separate story.
We need to determine what RBAC checks to make before showing the ACM link.
Acceptance Criteria
1. Console shows a link to ACM in its perspective switcher
2. Console provides a way for ACM to link back to a specific perspective
3. The ACM option only appears when the ACM operator is installed
4. ACM should open in the same browser tab to give the appearance of it being one application
5. Only users with appropriate RBAC should see the link (access review TBD)
Upstream K8s deprecated PodSecurityPolicy and replaced it with a new built-in admission controller that enforces the Pod Security Standards (see here for the motivations for deprecation). There is an OpenShift-specific dedicated pod admission system called Security Context Constraints. Our aim is to keep the Security Context Constraints pod admission system while also allowing users to have access to the Kubernetes Pod Security Admission.
With OpenShift 4.11, we turned on Pod Security Admission with global "privileged" enforcement. Additionally, we set the "restricted" profile for warnings and audit. This configuration made it possible for users to opt their namespaces in to Pod Security Admission with the per-namespace labels. We also introduced a new mechanism that automatically synchronizes the Pod Security Admission "warn" and "audit" labels.
With OpenShift 4.15, we intend to move the global configuration to enforce the "restricted" pod security profile globally. With this change, the label synchronization mechanism will also switch into a mode where it synchronizes the "enforce" Pod Security Admission label rather than the "audit" and "warn".
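For reference, the per-namespace opt-in mentioned above uses the standard Pod Security Admission labels; a minimal example namespace (the name and the chosen levels are illustrative):

```
apiVersion: v1
kind: Namespace
metadata:
  name: my-namespace
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/audit: restricted
    pod-security.kubernetes.io/warn: restricted
```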
Epic Goal
Get Pod Security admission to be run in "restricted" mode globally by default alongside with SCC admission.
When creating a custom SCC, it is possible to assign a priority that is higher than existing SCCs. This means that any SA with access to all SCCs might use the higher priority custom SCC, and this might mutate a workload in an unexpected/unintended way.
To protect platform workloads from such an effect (which, combined with PSa, might result in rejecting the workload once we start enforcing the "restricted" profile) we must pin the required SCC to all workloads in platform namespaces (openshift-, kube-, default).
Each workload should pin the SCC with the least privilege, except workloads in runlevel 0 namespaces, which should pin the "privileged" SCC (SCC admission is not enabled on these namespaces, but we should pin an SCC for tracking purposes).
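As a sketch of what pinning looks like in practice, assuming the openshift.io/required-scc pod annotation (the workload below is made up):

```
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-operator
  namespace: openshift-example
spec:
  replicas: 1
  selector:
    matchLabels:
      app: example-operator
  template:
    metadata:
      labels:
        app: example-operator
      annotations:
        openshift.io/required-scc: restricted-v2   # pin the least-privileged SCC that fits the workload
    spec:
      containers:
        - name: operator
          image: example.com/example-operator:latest
```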
The following tables track progress.
| 4.18 | 4.17 | 4.16 | 4.15 |
---|---|---|---|---|
monitored | 82 | 82 | 82 | 82 |
fix needed | 69 | 69 | 69 | 69 |
fixed | 38 | 35 | 31 | 39 |
remaining | 31 | 34 | 38 | 30 |
~ remaining non-runlevel | 9 | 12 | 16 | 8 |
~ remaining runlevel (low-prio) | 22 | 22 | 22 | 22 |
~ untested | 2 | 2 | 2 | 82 |
# | namespace | 4.18 | 4.17 | 4.16 | 4.15 |
---|---|---|---|---|---|
1 | oc debug node pods | #1763 | #1816 | #1818 | |
2 | openshift-apiserver-operator | #573 | #581 | ||
3 | openshift-authentication | #656 | #675 | ||
4 | openshift-authentication-operator | #656 | #675 | ||
5 | openshift-catalogd | #50 | #58 | ||
6 | openshift-cloud-credential-operator | #681 | #736 | ||
7 | openshift-cloud-network-config-controller | #2282 | #2490 | #2496 | |
8 | openshift-cluster-csi-drivers | #524 #131 #6 #127 #108 #118 #306 #265 #75 | #170 #459 | #484 | |
9 | openshift-cluster-node-tuning-operator | #968 | #1117 | ||
10 | openshift-cluster-olm-operator | #54 | n/a | ||
11 | openshift-cluster-samples-operator | #535 | #548 | ||
12 | openshift-cluster-storage-operator | #516 | #459 #196 | #484 #211 | |
13 | openshift-cluster-version | #1038 | #1068 | ||
14 | openshift-config-operator | #410 | #420 | ||
15 | openshift-console | #871 | #908 | #924 | |
16 | openshift-console-operator | #871 | #908 | #924 | |
17 | openshift-controller-manager | #336 | #361 | ||
18 | openshift-controller-manager-operator | #336 | #361 | ||
19 | openshift-e2e-loki | #56579 | #56579 | #56579 | #56579 |
20 | openshift-image-registry | #1008 | #1067 | ||
21 | openshift-ingress | #1031 | |||
22 | openshift-ingress-canary | #1031 | |||
23 | openshift-ingress-operator | #1031 | |||
24 | openshift-insights | #1041 | #915 | #967 | |
25 | openshift-kni-infra | #4504 | #4542 | #4539 | #4540 |
26 | openshift-kube-storage-version-migrator | #107 | #112 | ||
27 | openshift-kube-storage-version-migrator-operator | #107 | #112 | ||
28 | openshift-machine-api | #1308 | #407 | #315 #282 #1220 #73 #50 #433 | #332 #326 #1288 #81 #57 #443 |
29 | openshift-machine-config-operator | #4636 | #4219 | #4384 | #4393 |
30 | openshift-manila-csi-driver | #234 | #235 | #236 | |
31 | openshift-marketplace | #578 | #561 | #570 | |
32 | openshift-metallb-system | #238 | #240 | #241 | |
33 | openshift-monitoring | #2498 | #2335 | #2420 | |
34 | openshift-network-console | #2545 | |||
35 | openshift-network-diagnostics | #2282 | #2490 | #2496 | |
36 | openshift-network-node-identity | #2282 | #2490 | #2496 | |
37 | openshift-nutanix-infra | #4504 | #4504 | #4539 | #4540 |
38 | openshift-oauth-apiserver | #656 | #675 | ||
39 | openshift-openstack-infra | #4504 | #4504 | #4539 | #4540 |
40 | openshift-operator-controller | #100 | #120 | ||
41 | openshift-operator-lifecycle-manager | #703 | #828 | ||
42 | openshift-route-controller-manager | #336 | #361 | ||
43 | openshift-service-ca | #235 | #243 | ||
44 | openshift-service-ca-operator | #235 | #243 | ||
45 | openshift-sriov-network-operator | #754 #995 | #999 | #1003 | |
46 | openshift-user-workload-monitoring | #2335 | #2420 | ||
47 | openshift-vsphere-infra | #4504 | #4542 | #4539 | #4540 |
48 | (runlevel) default | ||||
49 | (runlevel) kube-system | ||||
50 | (runlevel) openshift-cloud-controller-manager | ||||
51 | (runlevel) openshift-cloud-controller-manager-operator | ||||
52 | (runlevel) openshift-cluster-api | ||||
53 | (runlevel) openshift-cluster-machine-approver | ||||
54 | (runlevel) openshift-dns | ||||
55 | (runlevel) openshift-dns-operator | ||||
56 | (runlevel) openshift-etcd | ||||
57 | (runlevel) openshift-etcd-operator | ||||
58 | (runlevel) openshift-kube-apiserver | ||||
59 | (runlevel) openshift-kube-apiserver-operator | ||||
60 | (runlevel) openshift-kube-controller-manager | ||||
61 | (runlevel) openshift-kube-controller-manager-operator | ||||
62 | (runlevel) openshift-kube-proxy | ||||
63 | (runlevel) openshift-kube-scheduler | ||||
64 | (runlevel) openshift-kube-scheduler-operator | ||||
65 | (runlevel) openshift-multus | ||||
66 | (runlevel) openshift-network-operator | ||||
67 | (runlevel) openshift-ovn-kubernetes | ||||
68 | (runlevel) openshift-sdn | ||||
69 | (runlevel) openshift-storage |
This section includes Jira cards that are linked to an Epic, but the Epic itself is not linked to any Feature. These epics were completed when this image was assembled
An epic we can duplicate for each release to ensure we have a place to catch things we ought to be doing regularly but can tend to fall by the wayside.
Console operator should swap from using monis.app to openshift/operator-boilerplate-legacy. This will allow switching to klog/v2, which the shared libs (api,client-go,library-go) have already done.
Node is currently 10.x. Let's increase that to at least 14.x.
It will require some changes on the ART side as well as OSBS builds
This is required to bump node to avoid https://github.com/webpack/webpack/issues/4629. We need to evaluate whether this has a domino effect on our webpack dependencies.
See https://github.com/openshift/console/pull/7306#issuecomment-755509361
This epic is mainly focused on tracking the dev console QE automation activities for the 4.8 release
1. Identify the scenarios for automation
2. Segregate the test cases into smoke, regression, and user stories
3. Design the Gherkin scripts with the below priority
- Update the Smoke test suite
- Update the Regression test suite
4. Create the automation scripts using cypress
5. Implement CI
This improves the quality of the product
This is not related to any UI features. It is mainly focused on UI automation
Design the Cypress scripts for the epic ODC-3991
Refer to the Gherkin scripts: https://issues.redhat.com/browse/ODC-5430
As a user,
all possible test scenarios related to epic ODC-3991 should be automated.
The Pipelines operator needs to be installed.
This story is mainly about pushing the pipelines code from the dev console to the gitops plugin folder for extensibility purposes
As an operator QE, I should be able to execute them in my operator folder
1. All pipelines scripts should be executable in the gitops plugin folder
2. The gitops operator installation needs to be done by the script
Currently the PR looks too large. To reduce its size, we are creating these sub-tasks:
Updating the README documentation for the knative plugin folder
CI implementation for pipelines, knative, devconsole
Update the package.json file
CI for pipelines:
Any update related to pipelines should execute pipelines smoke tests
on nightly builds, pipelines regression should be executed [TBD]
CI for devconsole:
Any update related to devconsole should execute devconsole smoke tests
on nightly builds, devconsole regression should be executed [TBD]
CI for knative:
Any update related to knative should execute knative smoke tests
on nightly builds, knative regression should be executed [TBD]
Setup the CI for all plugins smoke test scripts
References for CI implementation
Fixing the feature file lint issues, which occur on executing `yarn run test-cypress-devconsole-headless`, and moving the topology features to the topology folder
This story is mainly about pushing the pipelines code from the dev console to the pipelines plugin folder for extensibility purposes
Verify the pipelines regression test suite
As an operator QE, I should be able to execute them in my operator folder
1. All pipelines scripts should be executable in the pipelines plugin folder
2. The Pipelines operator installation needs to be done by the script
Would like to include integration-tests for topology folder
Consolidate the Cypress Cucumber and Cypress frameworks related to plugins/index.js files
Update all automation scripts and verify the execution on a remote cluster
As a user,
Execute them on Chrome browser and 4.8 release cluster
Adding the OWNERS file to service mesh helps us add automatic reviewers for updates to these Gherkin scripts
Create GitHub templates with certain criteria to meet the Gherkin script standards and automation script standards
Fixing all gherkin linter errors
This helps to automatically notify the web terminal team members on test scenario changes
As this .gherkin-lintrc is mainly used by the QE team, it doesn't need to be in the frontend folder, so I am moving it to the dev-console/integration-tests folder
Adding all necessary tags and modifying below rules due to recently observed scenarios
This section includes Jira cards that are linked to an Epic, but the Epic itself is not linked to any Feature. These epics were not completed when this image was assembled
Please read: migrating-protractor-tests-to-cypress
Protractor test to migrate: `frontend/integration-tests/tests/filter.scenario.ts`
4) Filtering
✔ filters Pod from object detail
✔ filters invalid Pod from object detail
✔ filters from Pods list
⚠ CONSOLE-1503 - searches for object by label
✔ searches for pod by label and filtering by name
✔ searches for object by label using by other kind of workload
Acceptance Criteria
Please read: migrating-protractor-tests-to-cypress
Protractor test to migrate: `frontend/integration-tests/tests/storage.scenario.ts`
Loops through 6 storage kinds:
15) Add storage is applicable for all workloads
16) replicationcontrollers
✔ create a replicationcontrollers resource
✔ add storage to replicationcontrollers
17) daemonsets
✔ create a daemonsets resource
✔ add storage to daemonsets
18) deployments
✔ create a deployments resource
✔ add storage to deployments
19) replicasets
✔ create a replicasets resource
✔ add storage to replicasets
20) statefulsets
✔ create a statefulsets resource
✔ add storage to statefulsets
21) deploymentconfigs
✔ create a deploymentconfigs resource
✔ add storage to deploymentconfigs
Acceptance Criteria
Many internal projects rely on Red Hat's fork of the OAuth2 Proxy project. The fork differs from the main upstream project in that it added an OpenShift authentication backend provider, allowing the OAuth2 Proxy service to use the OpenShift platform as an authentication broker.
Unfortunately, it was never contributed back to the upstream project, which caused the two projects, the fork and the upstream, to diverge severely. The fork is also extremely outdated and lacks features.
Among such features not present in the forked version is the support for setting up a timeout for requests from the proxy to the upstream service, otherwise controlled using the --upstream-timeout command-line switch in the official OAuth2 Proxy project.
Without the ability to specifically request timeout, the default value of 30 seconds is assumed (coming from Go's libraries), and this is often not enough to serve a response from a busy backend.
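For illustration, how the flag is used in the upstream project (a sketch only; the flag surface of the Red Hat fork may look different after the backport, and the addresses are placeholders):

```
oauth2-proxy \
  --http-address=0.0.0.0:4180 \
  --upstream=http://localhost:8080/ \
  --upstream-timeout=60s   # defaults to 30s when unset
```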
Thus, we need to backport this feature from the upstream project.
Backport the Pull Request from the upstream project into Red Hat's fork.
We need to bump the version of Kubernetes and run a library sync for OCP 4.13. Two stories will be created, one for each activity.
As a Samples Operator developer, I would like to run the library sync process, so the new libraries can be pushed to OCP 4.15
This is a runbook we need to execute on every release of OpenShift
NA
NA
NA
Follow instructions here: https://source.redhat.com/groups/public/appservices/wiki/cluster_samples_operator_release_activities
Library Repo
Library sync PR is merged in master
Dependencies identified
Blockers noted and expected delivery timelines set
Design is implementable
Acceptance criteria agreed upon
Story estimated
Unknown
Verified
Unsatisfied
We need to bump the Kubernetes version to the latest API version OCP is using.
This is what was done last time:
https://github.com/openshift/cluster-samples-operator/pull/409
Find latest stable version from here: https://github.com/kubernetes/api
This is described in wiki: https://source.redhat.com/groups/public/appservices/wiki/cluster_samples_operator_release_activities
We need to bump the version of Kubernetes and run a library sync for OCP 4.13. Two stories will be created, one for each activity.
As a Samples Operator developer, I would like to run the library sync process, so the new libraries can be pushed to OCP 4.16
This is a runbook we need to execute on every release of OpenShift
NA
NA
NA
Follow instructions here: https://source.redhat.com/groups/public/appservices/wiki/cluster_samples_operator_release_activities
Library Repo
Library sync PR is merged in master
Dependencies identified
Blockers noted and expected delivery timelines set
Design is implementable
Acceptance criteria agreed upon
Story estimated
Unknown
Verified
Unsatisfied
OKD updates the samples more or less independently from OCP. It would be good to add support for this in library-sync.sh so that OCP and OKD don't "step on each other's toes" when doing the updates.
library-sync.sh should accept a parameter, say --okd, that when set will update only the OKD samples (all of them, because we don't have unsupported samples in OKD) and, when not set, will update only the supported OCP samples.
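Hypothetical usage once such a parameter exists:

```
./library-sync.sh --okd   # update only the OKD samples
./library-sync.sh         # update only the supported OCP samples
```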
We need to bump the Kubernetes version to the latest API version OCP is using.
This is what was done last time:
https://github.com/openshift/cluster-samples-operator/pull/409
Find latest stable version from here: https://github.com/kubernetes/api
This is described in wiki: https://source.redhat.com/groups/public/appservices/wiki/cluster_samples_operator_release_activities
The samples need to be resynced for OCP 4.18. Pay attention to only update the OCP samples. OKD does this independently.
Note that the Rails templates are currently out of sync with the upstream, so care needs to be taken not to mess those up by adopting the upstream version again.
This section includes Jira cards that are not linked to either an Epic or a Feature. These tickets were completed when this image was assembled
+++ This bug was initially created as a clone of Bug #2060329 +++
Description of problem:
As a user, I was stopped from using the developer perspective when switching into a namespace with a lot of workloads (Deployments, Pods, etc.)
This is a follow up on https://bugzilla.redhat.com/show_bug.cgi?id=2006395
We recommend the following safety precautions against a laggy or crashing topology, even if we continue to work on performance improvements to allow more workloads to be rendered.
At the moment we expect that a topology with roughly 100 nodes can be displayed. This also depends on the node types, the browser used, the computing power of the PC, and how often the workload conditions change.
Recommended safety guard:
The topology graph (maybe the list as well) should check how many nodes are fetched and will be rendered.
1. We need to evaluate if we could make this decision based on the shown graph nodes and edges or the number of underlying resources.
For example, is it required to count each Pod in a Deployment or not?
2. Based on a threshold (~ 100?) the topology graph should skip the rendering.
3. We should show a 'warning page' instead, which explains that the topology could not handle this amount of X nodes at the moment.
4. This page could have an option to "Show topology anyway" so that users who don't have issues here can still use the topology.
— Additional comment from bugzilla@redhat.com on 2022-05-09 08:32:18 UTC —
Account disabled by LDAP Audit for extended failure
— Additional comment from aos-team-art-private@redhat.com on 2022-05-09 19:42:54 UTC —
Elliott changed bug status from MODIFIED to ON_QA.
This bug is expected to ship in the next 4.11 release.
— Additional comment from openshift-bugzilla-robot@redhat.com on 2022-05-10 04:50:55 UTC —
Bugfix included in accepted release 4.11.0-0.nightly-2022-05-09-224745
Bug will not be automatically moved to VERIFIED for the following reasons:
PR openshift/console#11334 not approved by QA contact
This bug must now be manually moved to VERIFIED by spathak@redhat.com
Description of problem:
The bug was found when debugging https://issues.redhat.com/browse/OCPQE-13200
Deployed 4.8.0-0.nightly-2022-11-30-073158 with the aos-4_8/ipi-on-aws/versioned-installer-ovn-winc-ci template; the template created a cluster with 3 Linux masters, 3 Linux workers and 2 Windows workers. ip-10-0-149-219.us-east-2.compute.internal and ip-10-0-158-129.us-east-2.compute.internal are the Windows workers in this bug (they have the kubernetes.io/os=windows label, not kubernetes.io/os=linux).
# oc get node --show-labels NAME STATUS ROLES AGE VERSION LABELS ip-10-0-139-166.us-east-2.compute.internal Ready worker 4h35m v1.21.14+a17bdb3 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=m5.large,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=us-east-2,failure-domain.beta.kubernetes.io/zone=us-east-2a,kubernetes.io/arch=amd64,kubernetes.io/hostname=ip-10-0-139-166,kubernetes.io/os=linux,node-role.kubernetes.io/worker=,node.kubernetes.io/instance-type=m5.large,node.openshift.io/os_id=rhcos,topology.ebs.csi.aws.com/zone=us-east-2a,topology.kubernetes.io/region=us-east-2,topology.kubernetes.io/zone=us-east-2a ip-10-0-143-178.us-east-2.compute.internal Ready master 4h47m v1.21.14+a17bdb3 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=m5.xlarge,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=us-east-2,failure-domain.beta.kubernetes.io/zone=us-east-2a,kubernetes.io/arch=amd64,kubernetes.io/hostname=ip-10-0-143-178,kubernetes.io/os=linux,node-role.kubernetes.io/master=,node.kubernetes.io/instance-type=m5.xlarge,node.openshift.io/os_id=rhcos,topology.ebs.csi.aws.com/zone=us-east-2a,topology.kubernetes.io/region=us-east-2,topology.kubernetes.io/zone=us-east-2a ip-10-0-149-219.us-east-2.compute.internal Ready worker 3h51m v1.21.11-rc.0.1506+5cc9227e4695d1 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=m5a.large,beta.kubernetes.io/os=windows,failure-domain.beta.kubernetes.io/region=us-east-2,failure-domain.beta.kubernetes.io/zone=us-east-2a,kubernetes.io/arch=amd64,kubernetes.io/hostname=ec2amaz-2hcbpla,kubernetes.io/os=windows,node-role.kubernetes.io/worker=,node.kubernetes.io/instance-type=m5a.large,node.kubernetes.io/windows-build=10.0.17763,node.openshift.io/os_id=Windows,topology.kubernetes.io/region=us-east-2,topology.kubernetes.io/zone=us-east-2a ip-10-0-158-129.us-east-2.compute.internal Ready worker 3h45m v1.21.11-rc.0.1506+5cc9227e4695d1 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=m5a.large,beta.kubernetes.io/os=windows,failure-domain.beta.kubernetes.io/region=us-east-2,failure-domain.beta.kubernetes.io/zone=us-east-2a,kubernetes.io/arch=amd64,kubernetes.io/hostname=ec2amaz-golrucd,kubernetes.io/os=windows,node-role.kubernetes.io/worker=,node.kubernetes.io/instance-type=m5a.large,node.kubernetes.io/windows-build=10.0.17763,node.openshift.io/os_id=Windows,topology.kubernetes.io/region=us-east-2,topology.kubernetes.io/zone=us-east-2a ip-10-0-175-105.us-east-2.compute.internal Ready worker 4h35m v1.21.14+a17bdb3 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=m5.large,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=us-east-2,failure-domain.beta.kubernetes.io/zone=us-east-2b,kubernetes.io/arch=amd64,kubernetes.io/hostname=ip-10-0-175-105,kubernetes.io/os=linux,node-role.kubernetes.io/worker=,node.kubernetes.io/instance-type=m5.large,node.openshift.io/os_id=rhcos,topology.ebs.csi.aws.com/zone=us-east-2b,topology.kubernetes.io/region=us-east-2,topology.kubernetes.io/zone=us-east-2b ip-10-0-188-67.us-east-2.compute.internal Ready master 4h43m v1.21.14+a17bdb3 
beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=m5.xlarge,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=us-east-2,failure-domain.beta.kubernetes.io/zone=us-east-2b,kubernetes.io/arch=amd64,kubernetes.io/hostname=ip-10-0-188-67,kubernetes.io/os=linux,node-role.kubernetes.io/master=,node.kubernetes.io/instance-type=m5.xlarge,node.openshift.io/os_id=rhcos,topology.ebs.csi.aws.com/zone=us-east-2b,topology.kubernetes.io/region=us-east-2,topology.kubernetes.io/zone=us-east-2b ip-10-0-192-42.us-east-2.compute.internal Ready worker 4h35m v1.21.14+a17bdb3 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=m5.large,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=us-east-2,failure-domain.beta.kubernetes.io/zone=us-east-2c,kubernetes.io/arch=amd64,kubernetes.io/hostname=ip-10-0-192-42,kubernetes.io/os=linux,node-role.kubernetes.io/worker=,node.kubernetes.io/instance-type=m5.large,node.openshift.io/os_id=rhcos,topology.ebs.csi.aws.com/zone=us-east-2c,topology.kubernetes.io/region=us-east-2,topology.kubernetes.io/zone=us-east-2c ip-10-0-210-137.us-east-2.compute.internal Ready master 4h43m v1.21.14+a17bdb3 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=m5.xlarge,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=us-east-2,failure-domain.beta.kubernetes.io/zone=us-east-2c,kubernetes.io/arch=amd64,kubernetes.io/hostname=ip-10-0-210-137,kubernetes.io/os=linux,node-role.kubernetes.io/master=,node.kubernetes.io/instance-type=m5.xlarge,node.openshift.io/os_id=rhcos,topology.ebs.csi.aws.com/zone=us-east-2c,topology.kubernetes.io/region=us-east-2,topology.kubernetes.io/zone=us-east-2c # oc get node -l kubernetes.io/os=linux NAME STATUS ROLES AGE VERSION ip-10-0-139-166.us-east-2.compute.internal Ready worker 4h31m v1.21.14+a17bdb3 ip-10-0-143-178.us-east-2.compute.internal Ready master 4h43m v1.21.14+a17bdb3 ip-10-0-175-105.us-east-2.compute.internal Ready worker 4h31m v1.21.14+a17bdb3 ip-10-0-188-67.us-east-2.compute.internal Ready master 4h39m v1.21.14+a17bdb3 ip-10-0-192-42.us-east-2.compute.internal Ready worker 4h31m v1.21.14+a17bdb3 ip-10-0-210-137.us-east-2.compute.internal Ready master 4h40m v1.21.14+a17bdb3 # oc get node -l kubernetes.io/os=windows NAME STATUS ROLES AGE VERSION ip-10-0-149-219.us-east-2.compute.internal Ready worker 3h48m v1.21.11-rc.0.1506+5cc9227e4695d1 ip-10-0-158-129.us-east-2.compute.internal Ready worker 3h41m v1.21.11-rc.0.1506+5cc9227e4695d1
Monitoring is degraded because of "expected 8 ready pods for "node-exporter" daemonset, got 6"
# oc get co monitoring -oyaml
...
  - lastTransitionTime: "2022-12-21T03:08:47Z"
    message: 'Failed to rollout the stack. Error: running task Updating node-exporter failed: reconciling node-exporter DaemonSet failed: updating DaemonSet object failed: waiting for DaemonSetRollout of openshift-monitoring/node-exporter: expected 8 ready pods for "node-exporter" daemonset, got 6 '
    reason: UpdatingnodeExporterFailed
    status: "True"
    type: Degraded
  extension: null
same errors in CMO logs
# oc -n openshift-monitoring logs -c cluster-monitoring-operator cluster-monitoring-operator-7fd77f4b87-pnfm9 | grep "reconciling node-exporter DaemonSet failed" | tail I1221 07:30:52.343230 1 operator.go:503] ClusterOperator reconciliation failed (attempt 55), retrying. Err: running task Updating node-exporter failed: reconciling node-exporter DaemonSet failed: updating DaemonSet object failed: waiting for DaemonSetRollout of openshift-monitoring/node-exporter: expected 8 ready pods for "node-exporter" daemonset, got 6 E1221 07:30:52.343253 1 operator.go:402] sync "openshift-monitoring/cluster-monitoring-config" failed: running task Updating node-exporter failed: reconciling node-exporter DaemonSet failed: updating DaemonSet object failed: waiting for DaemonSetRollout of openshift-monitoring/node-exporter: expected 8 ready pods for "node-exporter" daemonset, got 6 I1221 07:35:54.713045 1 operator.go:503] ClusterOperator reconciliation failed (attempt 56), retrying. Err: running task Updating node-exporter failed: reconciling node-exporter DaemonSet failed: updating DaemonSet object failed: waiting for DaemonSetRollout of openshift-monitoring/node-exporter: expected 8 ready pods for "node-exporter" daemonset, got 6 E1221 07:35:54.713064 1 operator.go:402] sync "openshift-monitoring/cluster-monitoring-config" failed: running task Updating node-exporter failed: reconciling node-exporter DaemonSet failed: updating DaemonSet object failed: waiting for DaemonSetRollout of openshift-monitoring/node-exporter: expected 8 ready pods for "node-exporter" daemonset, got 6
node-exporter pods are in kubernetes.io/os=linux nodes
# oc -n openshift-monitoring get ds
NAME            DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR            AGE
node-exporter   6         6         6       6            6           kubernetes.io/os=linux   4h33m
# oc -n openshift-monitoring get pod -o wide | grep node-exporter node-exporter-2tkxv 2/2 Running 0 5h35m 10.0.188.67 ip-10-0-188-67.us-east-2.compute.internal <none> <none> node-exporter-hbn65 2/2 Running 0 5h31m 10.0.175.105 ip-10-0-175-105.us-east-2.compute.internal <none> <none> node-exporter-prn9h 2/2 Running 0 5h35m 10.0.143.178 ip-10-0-143-178.us-east-2.compute.internal <none> <none> node-exporter-q4tsw 2/2 Running 0 5h31m 10.0.192.42 ip-10-0-192-42.us-east-2.compute.internal <none> <none> node-exporter-qx7dc 2/2 Running 0 5h31m 10.0.139.166 ip-10-0-139-166.us-east-2.compute.internal <none> <none> node-exporter-zrsnx 2/2 Running 0 5h35m 10.0.210.137 ip-10-0-210-137.us-east-2.compute.internal <none> <none> # oc -n openshift-monitoring get ds node-exporter -oyaml ... status: currentNumberScheduled: 6 desiredNumberScheduled: 6 numberAvailable: 6 numberMisscheduled: 0 numberReady: 6 observedGeneration: 1 updatedNumberScheduled: 6
The reason why CMO reports monitoring as degraded is that 4.8 counts every Ready node toward nodeReadyCount, no matter whether it has the kubernetes.io/os=linux label or not
the issue is fixed in 4.9+
Version-Release number of selected component (if applicable):
deploy 4.8.0-0.nightly-2022-11-30-073158 with aos-4_8/ipi-on-aws/versioned-installer-ovn-winc-ci template, the template created cluster with 3 linux masters, 3 linux workers and 2 windows workers
How reproducible:
deploy OCP 4.8 on linux worker + windows worker
Steps to Reproduce:
1. See the description
Actual results:
monitoring is degraded for waiting for DaemonSetRollout of openshift-monitoring/node-exporter: expected 8 ready pods for "node-exporter" daemonset, got 6
Expected results:
no degraded
Additional info:
if we don't want to fix the bug in 4.8, we can close this bug
Pull in the latest openshift/library content into the samples operator
If image eco e2e's fail, work with upstream SCL to address
List of EOL images needs to be sent to the Docs team and added to the release notes.
Description of problem:
Jenkins and Jenkins Agent Base image versions need to be updated to use the latest images to mitigate known CVEs in plugins and Jenkins versions.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
Actual results:
Expected results:
Additional info:
Description of problem:
Related to https://bugzilla.redhat.com/show_bug.cgi?id=2060476
A backport of ovs-configuration introduced the use of ip -j addr show to output JSON for easier parsing. Unfortunately the iproute2 version on RHEL 7.9 is too old to support the -j JSON option
configure-ovs.sh[1516]: + extra_if_brex_args=
configure-ovs.sh[1516]: ++ ip -j a show dev bond0
configure-ovs.sh[1516]: ++ jq '.[0].addr_info | map(. | select(.family == "inet")) | length'
configure-ovs.sh[1516]: Option "-j" is unknown, try "ip -help".
configure-ovs.sh[1516]: + num_ipv4_addrs=
configure-ovs.sh[1516]: + '[' '' -gt 0 ']'
configure-ovs.sh[1516]: /usr/local/bin/configure-ovs.sh: line 290: [: : integer expression expected
Version-Release number of selected component (if applicable):
4.8.0-0.nightly-2022-11-02-105425
How reproducible:
Always
Steps to Reproduce:
1. Deploy OVN cluster
2. Add RHEL 7.9 DHCP workers
3. oc adm node-logs $node -u ovs-configuration
Actual results:
As above
Option "-j" is unknown, try "ip -help".
Expected results:
ovs-configuration succeeds
+ extra_if_brex_args=
++ ip a show dev bond0
++ grep -E '^[[:blank:]]*inet\b'
++ wc -l
+ num_ipv4_addrs=1
+ '[' 1 -gt 0 ']'
+ extra_if_brex_args+='ipv4.may-fail no '
++ ip a show dev bond0
++ grep -E '^[[:blank:]]*inet6\b'
++ grep -v '\bscope link\b'
++ wc -l
+ num_ip6_addrs=1
+ '[' 1 -gt 0 ']'
+ extra_if_brex_args+='ipv6.may-fail no '
++ nmcli --get-values ipv4.dhcp-client-id conn show a7cc816d-3dbd-34c5-9902-d6b2f2956d92
+ dhcp_client_id=
+ extra_if_brex_args=
++ ip a show dev bond0
++ grep -E '^[[:blank:]]*inet\b'
++ wc -l
+ num_ipv4_addrs=1
+ '[' 1 -gt 0 ']'
+ extra_if_brex_args+='ipv4.may-fail no '
++ ip a show dev bond0
++ grep -E '^[[:blank:]]*inet6\b'
++ grep -v '\bscope link\b'
++ wc -l
+ num_ip6_addrs=1
+ '[' 1 -gt 0 ']'
+ extra_if_brex_args+='ipv6.may-fail no '
++ nmcli --get-values ipv4.dhcp-client-id conn show a7cc816d-3dbd-34c5-9902-d6b2f2956d92
+ dhcp_client_id=
Additional info:
OAuth-Proxy should send an Audit-Id header with its requests to the kube-apiserver so that we can easily track its requests and be able to tell which arrived and which were processed.
This comes from a time when the CI was in disarray and oauth-proxy requests were failing to reach the KAS but we did not know if at least any were processed or if they were just all plainly rejected somewhere in the middle.
Description of problem:
Update to use Jenkins 4.13 images to address CVEs
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
Actual results:
Expected results:
Additional info:
This bug is a backport clone of [Bugzilla Bug 2115926](https://bugzilla.redhat.com/show_bug.cgi?id=2115926). The following is the description of the original bug:
—
+++ This bug was initially created as a clone of Bug #2109442 +++
An important commit was missed during the downstream merge
Commit: https://github.com/openshift/ovn-kubernetes/pull/956/commits/96b2a2555a654d72a8546366032063a98a016f29
Initial downstream merge to master branch: https://github.com/openshift/ovn-kubernetes/pull/956
Downstream merge into the Release 4.10 branch: https://github.com/openshift/ovn-kubernetes/pull/971
Pull request to include the missing commit in release 4.10: https://github.com/openshift/ovn-kubernetes/pull/1195
+++ This bug was initially created as a clone of Bug #2048538 +++
Description of problem:
In one of our customer's clusters we see that new network policies are not created or updated by OVN-Kubernetes.
For one application this means it cannot reach the DNS service because the network policy that allows that is not being implemented.
In our own test on this cluster, pods in a namespace CAN reach each other despite this network policy:
~~~
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  creationTimestamp: "2022-01-27T14:41:05Z"
  generation: 2
  name: default-deny
  namespace: customer-debug
  resourceVersion: "311846645"
  uid: 87646222-c86d-4000-8997-7f0557ac34cf
spec:
  podSelector: {}
  policyTypes:
~~~
In one of our dev clusters this network policy is enforced.
Version-Release number of selected component (if applicable):
OCP 4.8.25
How reproducible:
This happens randomly and very difficult to predict.
Steps to Reproduce:
Actual results:
Expected results:
Additional info:
The case has the must-gathers in from the cluster.
— Additional comment from Tim Rozet on 2022-02-03 16:04:01 UTC —
Upon finishing my analysis of the logs, there are several bugs/errors happening here, all of which compound to either make network policies fail to be enforced properly or cause them to stay enforced when they shouldn't be:
1.policy.go:818] Failed to set ports in PortGroup for network policy ie-st-montun-filebeat/default-deny: Reconnecting...Transaction Failed due to an error: syntax error details: expected ["set", <array>] in
{update Port_Group map[name:a11253394058733577533 ports:0xc001f1a1b0] [] [] [] 0 [[name == a11253394058733577533]] }
This is due to a bug in the go-ovn library that was fixed in 4.9. I'm going to backport the same fix to 4.8z.
2. policy.go:1166] no pod IPs found on pod redhat-marketplace-brhvf: could not find OVN pod annotation in map[openshift.io/scc:anyuid operatorframework.io/managed-by:marketplace-operator]
This error is spammed throughout the log, but is benign. On pod add we could fail to get the OVN annotation due to racing with pod handler. However, once the pod handler annotates the pod an update event will happen and this code will be executed again. I'm going to ignore printing this error on pod add.
3. policy.go:733] logical port cd-argocd-cdteam_testssl2 not found in cache
This is the same as https://bugzilla.redhat.com/show_bug.cgi?id=2037884. The bug references stateful sets, but this was really true about any pod being added. When the network policy is created or pods are added that belong to the network policy's namespace, we attempt to get the pod's information from an internal cache. This races with the pod being added to the cache by the pod handler. The fix makes the network policy handler wait until the pod is added to the cache. Otherwise the network policy is created and potentially skips being applied to some pods in the namespace. This is already fixed in 4.8.29
4. policy.go:1166] failed to add IPs ... set contains duplicate value
The duplicate value here being added is a VIP for a load balancer. In 4.9 and later there is a lower probability of this happening (because we no longer store an internal cache, so there shouldn't be duplicates), however I'm still going to add checks to ensure we filter out any duplicate values before adding to them to the cache or sending the RPC to OVN. I'm going to ensure a proper fix going in master and then backport to 4.8z.
5. E0125 18:40:32.759129 1 policy.go:955] Failed to create port_group for network policy allow-prometheus in namespace ie-st-montun-filebeat
This is the most egregious bug. First of all, the log is not printing the actual error. Second, this failure causes the network policy to fail creation, and then it is not retried again (unless the policy is updated). We need a retry mechanism to attempt to recreate the policy just like we do with pods. This will require a heavier fix in master and then a backport down to 4.8z.
— Additional comment from Tim Rozet on 2022-02-03 21:59:27 UTC —
Fix for number 2: https://github.com/ovn-org/ovn-kubernetes/pull/2792
— Additional comment from Tim Rozet on 2022-02-03 22:41:21 UTC —
Fix for number 4: https://github.com/ovn-org/ovn-kubernetes/pull/2794
— Additional comment from Tim Rozet on 2022-02-04 23:23:59 UTC —
Partial fix for number 5: https://github.com/ovn-org/ovn-kubernetes/pull/2797
Will need a follow up part 2 after this is reviewed + accepted.
— Additional comment from Tim Rozet on 2022-02-09 01:45:15 UTC —
Posted https://github.com/ovn-org/ovn-kubernetes/pull/2809 which will supersede PR 2797. That should be the complete fix for issue number 5.
— Additional comment from Andy Bartlett on 2022-02-09 10:33:15 UTC —
@trozet@redhat.com Do you have a link for the BZ / PR for:
1.policy.go:818] Failed to set ports in PortGroup for network policy ie-st-montun-filebeat/default-deny: Reconnecting...Transaction Failed due to an error: syntax error details: expected ["set", <array>] in
{update Port_Group map[name:a11253394058733577533 ports:0xc001f1a1b0] [] [] [] 0 [[name == a11253394058733577533]] }
This is due to a bug in the go-ovn library that was fixed in 4.9. I'm going to backport the same fix to 4.8z.
Many thanks,
Andy
— Additional comment from Tim Rozet on 2022-02-14 16:55:30 UTC —
Yeah, the fix for number 1 is a one-liner in the ebay/libovsdb library:
@@ -24,7 +24,7 @@ func NewOvsSet(goSlice interface{}) (*OvsSet, error)
{ return nil, errors.New("OvsSet supports only Go Slice types") }
— Additional comment from Tim Rozet on 2022-02-15 14:51:40 UTC —
Moving back to assigned, a small issue was found with the previous patch: https://github.com/ovn-org/ovn-kubernetes/pull/2823
— Additional comment from Tim Rozet on 2022-02-16 17:21:06 UTC —
Found another issue where a delete/recreate of a policy with the same name may not clean up the stale version. Pushed a fix here: https://github.com/ovn-org/ovn-kubernetes/pull/2826
— Additional comment from anusaxen@redhat.com on 2022-07-21 20:30:21 UTC —
Tested with cluster bot build referencing PR #1195
All network policy regression tests and checks passed in the QE environment.
— Additional comment from trozet@redhat.com on 2022-07-25 13:52:33 UTC —
The description states the found-in version is OCP 4.8, but the version on this bug is 4.10. It looks like the bug exists in 4.8 as well. Can we fix the version and make sure we backport to 4.8 and 4.9?
— Additional comment from aos-team-art-private@bot.bugzilla.redhat.com on 2022-08-05 03:32:07 UTC —
Elliott changed bug status from MODIFIED to ON_QA.
This bug is expected to ship in the next 4.10 release.
This is a clone of issue OCPBUGS-1942. The following is the description of the original issue:
—
Description of problem:
Bump the Jenkins version to 2.361.1 and also test the built images by running the verify-jenkins.sh script. This script verifies the Jenkins version and plugins in an image. The verify script is available at https://gist.githubusercontent.com/coreydaley/fbf11d3b1a7a567f8c494da6a07bad41/raw/80e569131479c212d5e023bc41ce26fb15a17752/verify-jenkins.sh
Version-Release number of selected component (if applicable):
2.361.1
Additional info:
Verify script is present at https://gist.githubusercontent.com/coreydaley/fbf11d3b1a7a567f8c494da6a07bad41/raw/80e569131479c212d5e023bc41ce26fb15a17752/verify-jenkins.sh
Description of problem:
Update the sample imagestreams with the latest 4.11 image, using a specific image tag reference.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
On a fresh 4.12.0-0.nightly-2022-09-20-095559 cluster, the alertmanager pod restarted once before becoming ready. This is a 4.12 regression; we should make sure /etc/alertmanager/config_out/alertmanager.env.yaml exists before Alertmanager starts (one way to avoid the race is sketched after the log below).
# oc -n openshift-monitoring get pod NAME READY STATUS RESTARTS AGE alertmanager-main-0 6/6 Running 1 (118m ago) 118m alertmanager-main-1 6/6 Running 1 (118m ago) 118m ... # oc -n openshift-monitoring describe pod alertmanager-main-0 ... Containers: alertmanager: Container ID: cri-o://31b6f3231f5a24fe85188b8b8e26c45b660ebc870ee6915919031519d493d7f8 Image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:34003d434c6f07e4af6e7a52e94f703c68e1f881e90939702c764729e2b513aa Image ID: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:34003d434c6f07e4af6e7a52e94f703c68e1f881e90939702c764729e2b513aa Ports: 9094/TCP, 9094/UDP Host Ports: 0/TCP, 0/UDP Args: --config.file=/etc/alertmanager/config_out/alertmanager.env.yaml --storage.path=/alertmanager --data.retention=120h --cluster.listen-address=[$(POD_IP)]:9094 --web.listen-address=127.0.0.1:9093 --web.external-url=https:/console-openshift-console.apps.qe-daily1-412-0922.qe.azure.devcluster.openshift.com/monitoring --web.route-prefix=/ --cluster.peer=alertmanager-main-0.alertmanager-operated:9094 --cluster.peer=alertmanager-main-1.alertmanager-operated:9094 --cluster.reconnect-timeout=5m --web.config.file=/etc/alertmanager/web_config/web-config.yaml State: Running Started: Wed, 21 Sep 2022 19:40:14 -0400 Last State: Terminated Reason: Error Message: s=2022-09-21T23:40:06.507Z caller=main.go:231 level=info msg="Starting Alertmanager" version="(version=0.24.0, branch=rhaos-4.12-rhel-8, revision=4efb3c1f9bc32ba0cce7dd163a639ca8759a4190)" ts=2022-09-21T23:40:06.507Z caller=main.go:232 level=info build_context="(go=go1.18.4, user=root@b2df06f7fbc3, date=20220916-18:08:09)" ts=2022-09-21T23:40:07.119Z caller=cluster.go:260 level=warn component=cluster msg="failed to join cluster" err="2 errors occurred:\n\t* Failed to resolve alertmanager-main-0.alertmanager-operated:9094: lookup alertmanager-main-0.alertmanager-operated on 172.30.0.10:53: no such host\n\t* Failed to resolve alertmanager-main-1.alertmanager-operated:9094: lookup alertmanager-main-1.alertmanager-operated on 172.30.0.10:53: no such host\n\n" ts=2022-09-21T23:40:07.119Z caller=cluster.go:262 level=info component=cluster msg="will retry joining cluster every 10s" ts=2022-09-21T23:40:07.119Z caller=main.go:329 level=warn msg="unable to join gossip mesh" err="2 errors occurred:\n\t* Failed to resolve alertmanager-main-0.alertmanager-operated:9094: lookup alertmanager-main-0.alertmanager-operated on 172.30.0.10:53: no such host\n\t* Failed to resolve alertmanager-main-1.alertmanager-operated:9094: lookup alertmanager-main-1.alertmanager-operated on 172.30.0.10:53: no such host\n\n" ts=2022-09-21T23:40:07.119Z caller=cluster.go:680 level=info component=cluster msg="Waiting for gossip to settle..." 
interval=2s ts=2022-09-21T23:40:07.173Z caller=coordinator.go:113 level=info component=configuration msg="Loading configuration file" file=/etc/alertmanager/config_out/alertmanager.env.yaml ts=2022-09-21T23:40:07.174Z caller=coordinator.go:118 level=error component=configuration msg="Loading configuration file failed" file=/etc/alertmanager/config_out/alertmanager.env.yaml err="open /etc/alertmanager/config_out/alertmanager.env.yaml: no such file or directory" ts=2022-09-21T23:40:07.174Z caller=cluster.go:689 level=info component=cluster msg="gossip not settled but continuing anyway" polls=0 elapsed=54.469985ms Exit Code: 1 Started: Wed, 21 Sep 2022 19:40:06 -0400 Finished: Wed, 21 Sep 2022 19:40:07 -0400 Ready: True Restart Count: 1 Requests: cpu: 4m memory: 40Mi Startup: exec [sh -c exec curl --fail http://localhost:9093/-/ready] delay=20s timeout=3s period=10s #success=1 #failure=40 ... # oc -n openshift-monitoring exec -c alertmanager alertmanager-main-0 -- cat /etc/alertmanager/config_out/alertmanager.env.yaml "global": "resolve_timeout": "5m" "inhibit_rules": - "equal": - "namespace" - "alertname" "source_matchers": - "severity = critical" "target_matchers": - "severity =~ warning|info" - "equal": - "namespace" - "alertname" ...
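One way to avoid this startup race would be to wait for the config-reloader's generated file before launching Alertmanager. A minimal sketch under that assumption (the actual cluster-monitoring-operator fix may take a different form, for example ordering the containers differently):

```go
// Sketch only: poll for the generated configuration file before starting,
// instead of exiting with "no such file or directory" and relying on a
// container restart. The path and timeout are illustrative.
package main

import (
	"fmt"
	"os"
	"time"
)

func waitForFile(path string, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		if _, err := os.Stat(path); err == nil {
			return nil // file is present, safe to start
		}
		time.Sleep(500 * time.Millisecond)
	}
	return fmt.Errorf("timed out waiting for %s", path)
}

func main() {
	cfg := "/etc/alertmanager/config_out/alertmanager.env.yaml"
	if err := waitForFile(cfg, 2*time.Minute); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	// ... exec the real alertmanager process here ...
}
```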
Version-Release number of selected component (if applicable):
# oc get clusterversion NAME VERSION AVAILABLE PROGRESSING SINCE STATUS version 4.12.0-0.nightly-2022-09-20-095559 True False 109m Cluster version is 4.12.0-0.nightly-2022-09-20-095559
How reproducible:
always
Steps to Reproduce:
1. see the steps 2. 3.
Actual results:
alertmanager pod restarted once to become ready
Expected results:
no restart
Additional info:
no issue with 4.11
# oc get clusterversion NAME VERSION AVAILABLE PROGRESSING SINCE STATUS version 4.11.0-0.nightly-2022-09-20-140029 True False 16m Cluster version is 4.11.0-0.nightly-2022-09-20-140029 # oc -n openshift-monitoring get pod | grep alertmanager-main alertmanager-main-0 6/6 Running 0 54m alertmanager-main-1 6/6 Running 0 55m
Description of problem:
This is wrapper bug for library sync of 4.12
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
discover-etcd-initial-cluster was written very early in the cluster-etcd-operator life cycle. We have observed at least one bug in this code, and in order to validate its logical correctness it needs to be rewritten with unit tests.
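A hedged sketch of the kind of table-driven coverage the rewrite is after; the helper under test is a hypothetical stand-in, not the actual cluster-etcd-operator API:

```go
// Sketch only: table-driven tests around a hypothetical member-matching
// helper, illustrating the test style the rewrite should enable.
package discover

import "testing"

// memberPresent is a stand-in for "is this peer URL already in the member list".
func memberPresent(members []string, peerURL string) bool {
	for _, m := range members {
		if m == peerURL {
			return true
		}
	}
	return false
}

func TestMemberPresent(t *testing.T) {
	tests := []struct {
		name    string
		members []string
		peerURL string
		want    bool
	}{
		{"empty member list", nil, "https://10.0.0.1:2380", false},
		{"member found", []string{"https://10.0.0.1:2380"}, "https://10.0.0.1:2380", true},
		{"member missing", []string{"https://10.0.0.2:2380"}, "https://10.0.0.1:2380", false},
	}
	for _, tc := range tests {
		t.Run(tc.name, func(t *testing.T) {
			if got := memberPresent(tc.members, tc.peerURL); got != tc.want {
				t.Errorf("memberPresent(%v, %q) = %v, want %v", tc.members, tc.peerURL, got, tc.want)
			}
		})
	}
}
```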
Description of problem:
The network-tools image stream is missing in the cluster samples. It is needed for CI tests.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
We need to update the operator to be in sync with the Kubernetes API version used by OCP 4.13. We also need to sync our samples libraries with the latest available libraries. Any deprecated libraries should be removed as well.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
A Helm chart README file is written in Chinese; its content turns into garbled text in the Developer perspective while configuring the Helm chart.
Version-Release number of selected component (if applicable):
OpenShift Container Platform : 4.8.20 and also found same behavior on : 4.10.16
How reproducible:
Steps to Reproduce:
1. Create a custom HelmChartRepository that contains a helm chart with a README.md file written in Chinese. 2. Check and try to install the helm chart from Developer Catalog > Helm Charts; the README file contents will display as garbled text.
Actual results:
The Helm chart README file is written in Chinese, and its content displays as garbled text in the Developer perspective while configuring the Helm chart.
Expected results:
The README file's Chinese characters must display correctly.
Additional info:
Description of problem:
prometheus-k8s-0 ends up in CrashLoopBackOff with level=error err="opening storage failed: /prometheus/chunks_head/000002: invalid magic number 0" on SNO after hard reboot tests
Version-Release number of selected component (if applicable):
4.11.6
How reproducible:
Not always, after ~10 attempts
Steps to Reproduce:
1. Deploy SNO with Telco DU profile applied 2. Hard reboot node via out of band interface 3. oc -n openshift-monitoring get pods prometheus-k8s-0
Actual results:
NAME READY STATUS RESTARTS AGE prometheus-k8s-0 5/6 CrashLoopBackOff 125 (4m57s ago) 5h28m
Expected results:
Running
Additional info:
Attaching must-gather. The pod recovers successfully after deleting/re-creating. [kni@registry.kni-qe-0 ~]$ oc -n openshift-monitoring logs prometheus-k8s-0 ts=2022-09-26T14:54:01.919Z caller=main.go:552 level=info msg="Starting Prometheus Server" mode=server version="(version=2.36.2, branch=rhaos-4.11-rhel-8, revision=0d81ba04ce410df37ca2c0b1ec619e1bc02e19ef)" ts=2022-09-26T14:54:01.919Z caller=main.go:557 level=info build_context="(go=go1.18.4, user=root@371541f17026, date=20220916-14:15:37)" ts=2022-09-26T14:54:01.919Z caller=main.go:558 level=info host_details="(Linux 4.18.0-372.26.1.rt7.183.el8_6.x86_64 #1 SMP PREEMPT_RT Sat Aug 27 22:04:33 EDT 2022 x86_64 prometheus-k8s-0 (none))" ts=2022-09-26T14:54:01.919Z caller=main.go:559 level=info fd_limits="(soft=1048576, hard=1048576)" ts=2022-09-26T14:54:01.919Z caller=main.go:560 level=info vm_limits="(soft=unlimited, hard=unlimited)" ts=2022-09-26T14:54:01.921Z caller=web.go:553 level=info component=web msg="Start listening for connections" address=127.0.0.1:9090 ts=2022-09-26T14:54:01.922Z caller=main.go:989 level=info msg="Starting TSDB ..." ts=2022-09-26T14:54:01.924Z caller=tls_config.go:231 level=info component=web msg="TLS is disabled." http2=false ts=2022-09-26T14:54:01.926Z caller=main.go:848 level=info msg="Stopping scrape discovery manager..." ts=2022-09-26T14:54:01.926Z caller=main.go:862 level=info msg="Stopping notify discovery manager..." ts=2022-09-26T14:54:01.926Z caller=manager.go:951 level=info component="rule manager" msg="Stopping rule manager..." ts=2022-09-26T14:54:01.926Z caller=manager.go:961 level=info component="rule manager" msg="Rule manager stopped" ts=2022-09-26T14:54:01.926Z caller=main.go:899 level=info msg="Stopping scrape manager..." ts=2022-09-26T14:54:01.926Z caller=main.go:858 level=info msg="Notify discovery manager stopped" ts=2022-09-26T14:54:01.926Z caller=main.go:891 level=info msg="Scrape manager stopped" ts=2022-09-26T14:54:01.926Z caller=notifier.go:599 level=info component=notifier msg="Stopping notification manager..." ts=2022-09-26T14:54:01.926Z caller=main.go:844 level=info msg="Scrape discovery manager stopped" ts=2022-09-26T14:54:01.926Z caller=manager.go:937 level=info component="rule manager" msg="Starting rule manager..." ts=2022-09-26T14:54:01.926Z caller=main.go:1120 level=info msg="Notifier manager stopped" ts=2022-09-26T14:54:01.926Z caller=main.go:1129 level=error err="opening storage failed: /prometheus/chunks_head/000002: invalid magic number 0"
Description of problem:
Setting a telemeter proxy in the cluster-monitoring-config config map does not work as expected
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
The following KCS details the steps to add a proxy.
The steps have been verified on 4.7 but do not work on 4.8, 4.9 or 4.10:
https://access.redhat.com/solutions/6172402
When testing on 4.8, 4.9 and 4.10, the proxy settings were also nested under `telemeterClient`,
which triggered a telemeter restart, but the proxy settings do not get set in the deployment as they do in 4.7.
Actual results:
On 4.8, 4.9 and 4.10, configuring the proxy without nesting it under `telemeterClient` does not trigger a restart of the telemeter pod.
Expected results:
I think the proxy settings should be nested under `telemeterClient` and should set the environment variables in the deployment.
Additional info:
Description of problem:
Optional operators fail to unpack due to the `cpb` issue.
MacBook-Pro:~ jianzhang$ oc get pods NAME READY STATUS RESTARTS AGE 29f049d4a61031babbbcd0e303a2787017361bdeb45286f923cb891a00d5pz2 0/1 Init:Error 0 4h24m 29f049d4a61031babbbcd0e303a2787017361bdeb45286f923cb891a00ftpqw 0/1 Init:Error 0 4h25m 29f049d4a61031babbbcd0e303a2787017361bdeb45286f923cb891a00ftts8 0/1 Init:Error 0 4h24m 29f049d4a61031babbbcd0e303a2787017361bdeb45286f923cb891a00jbz8v 0/1 Init:Error 0 4h24m certified-operators-xjh27 1/1 Running 0 5h25m MacBook-Pro:~ jianzhang$ oc describe pods pods 29f049d4a61031babbbcd0e303a2787017361bdeb45286f923cb891a00jbz8v Name: 29f049d4a61031babbbcd0e303a2787017361bdeb45286f923cb891a00jbz8v Namespace: openshift-marketplace Priority: 0 ... ... Image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:90944becade86164c70b8dcd70415de83cab3951cb5dcda9e4fac6968c4f2492 [cloud-user@preserve-olm-env2 jian]$ sudo podman run --rm -ti quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:90944becade86164c70b8dcd70415de83cab3951cb5dcda9e4fac6968c4f2492 bash-4.4$ which cob which: no cob in (/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin) bash-4.4$ which cpb /usr/bin/cpb bash-4.4$ ldd /usr/bin/cpb linux-vdso.so.1 (0x00007ffe341a2000) libdl.so.2 => /lib64/libdl.so.2 (0x00007f51a0214000) libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f519fff4000) libc.so.6 => /lib64/libc.so.6 (0x00007f519fc2f000) /lib64/ld-linux-x86-64.so.2 (0x00007f51a0418000)
Version-Release number of selected component (if applicable):
Cluster version is 4.8.0-0.nightly-2023-07-21-001905
How reproducible:
always
Steps to Reproduce:
1. Install OCP 4.8 2. Subscribe to an operator.
Actual results:
Failed to install the operator. Unpack pod failed to init.
MacBook-Pro:~ jianzhang$ oc get pods NAME READY STATUS RESTARTS AGE 29f049d4a61031babbbcd0e303a2787017361bdeb45286f923cb891a00d5pz2 0/1 Init:Error 0 4h24m 29f049d4a61031babbbcd0e303a2787017361bdeb45286f923cb891a00ftpqw 0/1 Init:Error 0 4h25m 29f049d4a61031babbbcd0e303a2787017361bdeb45286f923cb891a00ftts8 0/1 Init:Error 0 4h24m 29f049d4a61031babbbcd0e303a2787017361bdeb45286f923cb891a00jbz8v 0/1 Init:Error 0 4h24m certified-operators-xjh27 1/1 Running 0 5h25m community-operators-wdzp9 1/1 Running 0 5h25m marketplace-operator-6d8f7f6f89-bgfxb 1/1 Running 0 5h36m qe-app-registry-zc4pz 1/1 Running 0 5h4m redhat-marketplace-bffjz 1/1 Running 0 5h25m redhat-operators-hjjfc 1/1 Running 0 5h25m MacBook-Pro:~ jianzhang$ oc logs 29f049d4a61031babbbcd0e303a2787017361bdeb45286f923cb891a00jbz8v Defaulted container "extract" out of: extract, util (init), pull (init) Error from server (BadRequest): container "extract" in pod "29f049d4a61031babbbcd0e303a2787017361bdeb45286f923cb891a00jbz8v" is waiting to start: PodInitializing
Expected results:
The unpack pod runs successfully.
Additional info:
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Please review the following PR: https://github.com/openshift/cluster-bootstrap/pull/72
The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/prometheus-operator/pull/230
The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
Prometheus fails to scrape metrics from the storage operator after some time.
Version-Release number of selected component (if applicable):
4.11
How reproducible:
Always
Steps to Reproduce:
1. Install storage operator. 2. Wait for 24h (time for the certificate to be recycled). 3.
Actual results:
Targets go down because Prometheus didn't reload the CA certificate.
Expected results:
Prometheus reloads its client TLS certificate and scrapes the target successfully.
Additional info:
This is a clone of issue OCPBUGS-249. The following is the description of the original issue:
—
+++ This bug was initially created as a clone of Bug #2070318 +++
Description of problem:
In an OCP VRRP deployment (using OCP cluster networking), we have an additional data interface which is configured along with the regular management interface on each control node. In some deployments, the kubernetes address 172.30.0.1:443 is NAT'ed to the data interface instead of the management interface (10.40.1.4:6443 vs 10.30.1.4:6443, as we configure the bootstrap node), even though the default route is set to the 10.30.1.0 network. Because of that, all requests to 172.30.0.1:443 failed. After 10-15 minutes, OCP magically fixes it and NATs correctly to 10.30.1.4:6443.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. Provision an OCP cluster using cluster networking for DNS & Load Balancer instead of an external DNS & Load Balancer. Provision the hosts with 1 management interface and an additional interface for the data network. Along with the OCP manifests, add a manifest to create a pod which will trigger communication with kube-apiserver.
2. Start cluster installation.
3. Check the custom pod's log in the cluster while the first 2 master nodes are installing, to see that GET operations to kube-apiserver timed out. Check the nft table and chase the IP chains to see that the kubernetes service IP address was NAT'ed to the data IP address instead of the management IP. This is not happening all the time; we have seen roughly a 50:50 chance.
Actual results:
After 10-15 minutes OCP will correct that by itself.
Expected results:
Wrong natting should not happen.
Additional info:
ClusterID: 24bbde0b-79b3-4ae6-afc5-cb694fa48895
ClusterVersion: Stable at "4.8.29"
ClusterOperators:
clusteroperator/authentication is not available (OAuthServerRouteEndpointAccessibleControllerAvailable: Get "https://oauth-openshift.apps.ocp-binhle-wqepch.contrail.juniper.net/healthz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)) because OAuthServerRouteEndpointAccessibleControllerDegraded: Get "https://oauth-openshift.apps.ocp-binhle-wqepch.contrail.juniper.net/healthz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
clusteroperator/baremetal is degraded because metal3 deployment inaccessible
clusteroperator/console is not available (RouteHealthAvailable: failed to GET route (https://console-openshift-console.apps.ocp-binhle-wqepch.contrail.juniper.net/health): Get "https://console-openshift-console.apps.ocp-binhle-wqepch.contrail.juniper.net/health": context deadline exceeded (Client.Timeout exceeded while awaiting headers)) because RouteHealthDegraded: failed to GET route (https://console-openshift-console.apps.ocp-binhle-wqepch.contrail.juniper.net/health): Get "https://console-openshift-console.apps.ocp-binhle-wqepch.contrail.juniper.net/health": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
clusteroperator/dns is progressing: DNS "default" reports Progressing=True: "Have 4 available DNS pods, want 5."
clusteroperator/ingress is degraded because The "default" ingress controller reports Degraded=True: DegradedConditions: One or more other status conditions indicate a degraded state: CanaryChecksSucceeding=False (CanaryChecksRepetitiveFailures: Canary route checks for the default ingress controller are failing)
clusteroperator/insights is degraded because Unable to report: unable to build request to connect to Insights server: Post "https://cloud.redhat.com/api/ingress/v1/upload": dial tcp: lookup cloud.redhat.com on 172.30.0.10:53: read udp 10.128.0.26:53697->172.30.0.10:53: i/o timeout
clusteroperator/network is progressing: DaemonSet "openshift-network-diagnostics/network-check-target" is not available (awaiting 1 nodes)
— Additional comment from
bnemec@redhat.com
on 2022-03-30 20:00:25 UTC —
This is not managed by runtimecfg, but in order to route the bug correctly I need to know which CNI plugin you're using - OpenShiftSDN or OVNKubernetes. Thanks.
— Additional comment from
lpbinh@gmail.com
on 2022-03-31 08:09:11 UTC —
Hi Ben,
We were deploying Contrail CNI with OCP. However, this issue happens at very early deployment time, right after the bootstrap node is started
and there's no SDN/CNI there yet.
— Additional comment from
bnemec@redhat.com
on 2022-03-31 15:26:23 UTC —
Okay, I'm just going to send this to the SDN team then. They'll be able to provide more useful input than I can.
— Additional comment from
trozet@redhat.com
on 2022-04-04 15:22:21 UTC —
Can you please provide the iptables rules causing the DNAT as well as the routes on the host? Might be easiest to get a sosreport during initial bring up during that 10-15 min when the problem occurs.
— Additional comment from
lpbinh@gmail.com
on 2022-04-05 16:45:13 UTC —
All nodes have two interfaces:
eth0: 10.30.1.0/24
eth1: 10.40.1.0/24
machineNetwork is 10.30.1.0/24
default route points to 10.30.1.1
The kubeapi service ip is 172.30.0.1:443
all Kubernetes services are supposed to be reachable via machineNetwork (10.30.1.0/24)
To make the kubeapi service ip reachable in hostnetwork, something (openshift installer?) creates a set of nat rules which translates the service ip to the real ip of the nodes which have kubeapi active.
Initially kubeapi is only active on the bootstrap node so there should be a nat rule like
172.30.0.1:443 -> 10.30.1.10:6443 (assuming that 10.30.1.10 is the bootstrap nodes' ip address in the machine network)
However, what we see is
172.30.0.1:443 -> 10.40.1.10:6443 (which is the bootstrap nodes' eth1 ip address)
The rule is configured on the controller nodes and leads to asymmetrical routing: the controller sends a packet FROM the machineNetwork (10.30.1.x) to 172.30.0.1, which is then translated and forwarded to 10.40.1.10, which then tries to reply back on the 10.40.1.0 network; this fails because the request came from the 10.30.1.0 network.
So, we want to understand why the openshift installer picks the 10.40.1.x IP address rather than the 10.30.1.x IP for the NAT rule. What is the mechanism for choosing the IP when the system has multiple interfaces with IPs configured?
Note: after a while (10-20 minutes) the bootstrap process resets itself and then it picks the correct ip address from the machineNetwork and things start to work.
— Additional comment from
smerrow@redhat.com
on 2022-04-13 13:55:04 UTC —
Note from Juniper regarding requested SOS report:
In reference to https://bugzilla.redhat.com/show_bug.cgi?id=2070318 that @Binh Le has been working on. The mustgather was too big to upload for this Bugzilla. Can you access this link?
https://junipernetworks-my.sharepoint.com/:u:/g/personal/sleigon_juniper_net/ETOrHMqao1tLm10Gmq9rzikB09H5OUwQWZRAuiOvx1nZpQ
— Additional comment from
smerrow@redhat.com
on 2022-04-21 12:24:33 UTC —
Can we please get an update on this BZ?
Do let us know if there is any other information needed.
— Additional comment from
trozet@redhat.com
on 2022-04-21 14:06:00 UTC —
Can you please provide another link to the sosreport? Looks like the link is dead.
— Additional comment from
smerrow@redhat.com
on 2022-04-21 19:01:39 UTC —
See mustgather here:
https://drive.google.com/file/d/16y9IfLAs7rtO-SMphbYBPgSbR4od5hcQ
— Additional comment from
trozet@redhat.com
on 2022-04-21 20:57:24 UTC —
Looking at the must-gather I think your iptables rules are most likely coming from the fact that kube-proxy is installed:
[trozet@fedora must-gather.local.288458111102725709]$ omg get pods -n openshift-kube-proxy
NAME READY STATUS RESTARTS AGE
openshift-kube-proxy-kmm2p 2/2 Running 0 19h
openshift-kube-proxy-m2dz7 2/2 Running 0 16h
openshift-kube-proxy-s9p9g 2/2 Running 1 19h
openshift-kube-proxy-skrcv 2/2 Running 0 19h
openshift-kube-proxy-z4kjj 2/2 Running 0 19h
I'm not sure why this is installed. Is it intentional? I don't see the configuration in CNO to enable kube-proxy. Anyway the node IP detection is done via:
https://github.com/kubernetes/kubernetes/blob/f173d01c011c3574dea73a6fa3e20b0ab94531bb/cmd/kube-proxy/app/server.go#L844
Which just looks at the IP of the node. During bare metal install a VIP is chosen and used with keepalived for kubelet to have kapi access. I don't think there is any NAT rule for services until CNO comes up. So I suspect what really is happening is your node IP is changing during install, and kube-proxy is getting deployed (either intentionally or unintentionally) and that is causing the behavior you see. The node IP is chosen via the node ip configuration service:
https://github.com/openshift/machine-config-operator/blob/da6494c26c643826f44fbc005f26e0dfd10513ae/templates/common/_base/units/nodeip-configuration.service.yaml
This service determines the node IP by looking at which interfaces have a default route and which one has the lowest metric. With your 2 interfaces, do they both have default routes? If so, are they using DHCP, and is it perhaps random which route gets installed with the lower metric?
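A minimal sketch of that selection rule over plain data (the real nodeip-configuration service inspects the kernel routing table; the struct and values here are illustrative):

```go
// Sketch only: choose the interface that owns the lowest-metric default
// route. If both interfaces get DHCP-assigned default routes with shifting
// metrics, the winner (and therefore the node IP) can change between boots.
package main

import "fmt"

type defaultRoute struct {
	iface  string
	metric int
}

func pickNodeInterface(routes []defaultRoute) (string, bool) {
	best := ""
	bestMetric := int(^uint(0) >> 1) // max int
	for _, r := range routes {
		if r.metric < bestMetric {
			best, bestMetric = r.iface, r.metric
		}
	}
	return best, best != ""
}

func main() {
	routes := []defaultRoute{
		{iface: "eth0", metric: 100}, // 10.30.1.0/24 machine network
		{iface: "eth1", metric: 101}, // 10.40.1.0/24 data network
	}
	fmt.Println(pickNodeInterface(routes)) // eth0 true
}
```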
— Additional comment from
trozet@redhat.com
on 2022-04-21 21:13:15 UTC —
Correction: looks like standalone kube-proxy is installed by default when the provider is not SDN, OVN, or kuryr so this looks like the correct default behavior for kube-proxy to be deployed.
— Additional comment from
lpbinh@gmail.com
on 2022-04-25 04:05:14 UTC —
Hi Tim,
You are right, kube-proxy is deployed by default and we don't change that behavior.
There is only 1 default route, configured for the management interface (10.30.1.x); we used to have a default route for the data/VRRP interface (10.40.1.x) with a higher metric. As said, we no longer have the default route for the second interface but still encounter the issue pretty often.
— Additional comment from
trozet@redhat.com
on 2022-04-25 14:24:05 UTC —
Binh, can you please provide a sosreport for one of the nodes that shows this behavior? Then we can try to figure out what is going on with the interfaces and the node ip service. Thanks.
— Additional comment from
trozet@redhat.com
on 2022-04-25 16:12:04 UTC —
Actually, Ben reminded me that the invalid endpoint is actually the bootstrap node itself:
172.30.0.1:443 -> 10.30.1.10:6443 (assuming that 10.30.1.10 is the bootstrap nodes' ip address in the machine network)
vs
172.30.0.1:443 -> 10.40.1.10:6443 (which is the bootstrap nodes' eth1 ip address)
So maybe a sosreport off that node is necessary? I'm not as familiar with the bare metal install process, moving back to Ben.
— Additional comment from
lpbinh@gmail.com
on 2022-04-26 08:33:45 UTC —
Created attachment 1875023 [details] sosreport
— Additional comment from
lpbinh@gmail.com
on 2022-04-26 08:34:59 UTC —
Created attachment 1875024 [details] sosreport-part2
Hi Tim,
We observe this issue when deploying clusters using OpenStack instances as our infrastructure is based on OpenStack.
I followed the steps here to collect the sosreport:
https://docs.openshift.com/container-platform/4.8/support/gathering-cluster-data.html
The sosreport is 22MB, which exceeds the permitted size (19MB), so I split it into 2 files (xaa and xab); if you can't join them, we will need to put the collected sosreport on a shared drive like we did with the must-gather data.
Here are some notes about the cluster:
First two control nodes are below, ocp-binhle-8dvald-ctrl-3 is the bootstrap node.
[core@ocp-binhle-8dvald-ctrl-2 ~]$ oc get node
NAME STATUS ROLES AGE VERSION
ocp-binhle-8dvald-ctrl-1 Ready master 14m v1.21.8+ed4d8fd
ocp-binhle-8dvald-ctrl-2 Ready master 22m v1.21.8+ed4d8fd
We see that the wrong NAT'ing was done at the beginning and then corrected later:
sh-4.4# nft list table ip nat | grep 172.30.0.1
meta l4proto tcp ip daddr 172.30.0.1 tcp dport 443 counter packets 3 bytes 180 jump KUBE-SVC-NPX46M4PTMTKRN6Y
sh-4.4# nft list chain ip nat KUBE-SVC-NPX46M4PTMTKRN6Y
table ip nat {
chain KUBE-SVC-NPX46M4PTMTKRN6Y
}
sh-4.4# nft list chain ip nat KUBE-SEP-VZ2X7DROOLWBXBJ4
table ip nat {
chain KUBE-SEP-VZ2X7DROOLWBXBJ4
}
sh-4.4#
sh-4.4#
<....after a while....>
sh-4.4# nft list chain ip nat KUBE-SVC-NPX46M4PTMTKRN6Y
table ip nat {
chain KUBE-SVC-NPX46M4PTMTKRN6Y
}
sh-4.4# nft list chain ip nat KUBE-SEP-X33IBTDFOZRR6ONM
table ip nat {
chain KUBE-SEP-X33IBTDFOZRR6ONM
}
sh-4.4#
— Additional comment from
lpbinh@gmail.com
on 2022-05-12 17:46:51 UTC —
@trozet@redhat.com May we have an update on the fix, or the plan for the fix? Thank you.
— Additional comment from
lpbinh@gmail.com
on 2022-05-18 21:27:45 UTC —
Created support Case 03223143.
— Additional comment from
vkochuku@redhat.com
on 2022-05-31 16:09:47 UTC —
Hello Team,
Any update on this?
Thanks,
Vinu K
— Additional comment from
smerrow@redhat.com
on 2022-05-31 17:28:54 UTC —
This issue is causing delays in Juniper's CI/CD pipeline and makes for a less than ideal user experience for deployments.
I'm getting a lot of pressure from the partner on this for an update and progress. I've had them open a case [1] to help progress.
Please let us know if there is any other data needed by Juniper or if there is anything I can do to help move this forward.
[1]
https://access.redhat.com/support/cases/#/case/03223143
— Additional comment from
vpickard@redhat.com
on 2022-06-02 22:14:23 UTC —
@bnemec@redhat.com Tim mentioned in https://bugzilla.redhat.com/show_bug.cgi?id=2070318#c14 that this issue appears to be at BM install time. Is this something you can help with, or do we need help from the BM install team?
— Additional comment from
bnemec@redhat.com
on 2022-06-03 18:15:17 UTC —
Sorry, I missed that this came back to me.
(In reply to Binh Le from comment #16)
> We observe this issue when deploying clusters using OpenStack instances as
> our infrastructure is based on OpenStack.
This does not match the configuration in the must-gathers provided so far, which are baremetal. Are we talking about the same environments?
I'm currently discussing this with some other internal teams because I'm unfamiliar with this type of bootstrap setup. I need to understand what the intended behavior is before we decide on a path forward.
— Additional comment from
rurena@redhat.com
on 2022-06-06 14:36:54 UTC —
(In reply to Ben Nemec from comment #22)
> Sorry, I missed that this came back to me.
>
> (In reply to Binh Le from comment #16)
> > We observe this issue when deploying clusters using OpenStack instances as
> > our infrastructure is based on OpenStack.
>
> This does not match the configuration in the must-gathers provided so far,
> which are baremetal. Are we talking about the same environments?
>
> I'm currently discussing this with some other internal teams because I'm
> unfamiliar with this type of bootstrap setup. I need to understand what the
> intended behavior is before we decide on a path forward.
I spoke to the CU; they tell me that all work should be on baremetal. They were probably just testing on OSP and pointing out that they saw the same behavior.
— Additional comment from
bnemec@redhat.com
on 2022-06-06 16:19:37 UTC —
Okay, I see now that this is an assisted installer deployment. Can we get the cluster ID assigned by AI so we can take a look at the logs on our side? Thanks.
— Additional comment from
lpbinh@gmail.com
on 2022-06-06 16:38:56 UTC —
Here is the cluster ID, copied from the bug description:
ClusterID: 24bbde0b-79b3-4ae6-afc5-cb694fa48895
In regard to your earlier question about OpenStack & baremetal (2022-06-03 18:15:17 UTC):
We had an issue with platform validation in OpenStack earlier. Host validation was failing with the error message “Platform network settings: Platform OpenStack Compute is allowed only for Single Node OpenShift or user-managed networking.”
It turns out that there is no platform type "OpenStack" available in [https://github.com/openshift/assisted-service/blob/master/models/platform_type.go#L29], so we set "baremetal" as the platform type on our computes. That's the reason why you are seeing baremetal as the platform type.
Thank you
— Additional comment from
ercohen@redhat.com
on 2022-06-08 08:00:18 UTC —
Hey, first, you are correct: when you set 10.30.1.0/24 as the machine network, the bootstrap process should use the IP on that subnet on the bootstrap node.
I'm trying to understand how exactly this cluster was installed.
You are using on-prem deployment of assisted-installer (podman/ACM)?
You are trying to form a cluster from OpenStack VMs?
You set the platform to Baremetal where?
Did you set user-managed-networking?
Some more info: when using the OpenStack platform you should install the cluster with user-managed-networking.
And that's what the failing validation is for.
— Additional comment from
bnemec@redhat.com
on 2022-06-08 14:56:53 UTC —
Moving to the assisted-installer component for further investigation.
— Additional comment from
lpbinh@gmail.com
on 2022-06-09 07:37:54 UTC —
@Eran Cohen:
Please see my response inline.
You are using on-prem deployment of assisted-installer (podman/ACM)?
--> Yes, we are using on-prem deployment of assisted-installer.
You are trying to form a cluster from OpenStack VMs?
--> Yes.
You set the platform to Baremetal where?
--> It was set in the Cluster object, Platform field when we model the cluster.
Did you set user-managed-networking?
--> Yes, we set it to false for VRRP.
— Additional comment from
itsoiref@redhat.com
on 2022-06-09 08:17:23 UTC —
@lpbinh@gmail.com can you please share the assisted logs that you can download when the cluster has failed or installed? It will help us to see the full picture.
— Additional comment from
ercohen@redhat.com
on 2022-06-09 08:23:18 UTC —
OK, as noted before, when using the OpenStack platform you should install the cluster with user-managed-networking (set to true).
Can you explain how you workaround this failing validation? “Platform network settings: Platform OpenStack Compute is allowed only for Single Node OpenShift or user-managed networking.”
What does this mean exactly? 'we set "baremetal" as the platform type on our computes'
To be honest I'm surprised that the installation was completed successfully.
@oamizur@redhat.com I thought installing on OpenStack VMs with baremetal platform (user-managed-networking=false) would always fail?
— Additional comment from
lpbinh@gmail.com
on 2022-06-10 16:04:56 UTC —
@itsoiref@redhat.com: I will reproduce and collect the logs. Is that supposed to be included in the provided must-gather?
@ercohen@redhat.com:
— Additional comment from
itsoiref@redhat.com
on 2022-06-13 13:08:17 UTC —
@lpbinh@gmail.com you will have a download_logs link in the UI. Those logs are not part of the must-gather.
— Additional comment from
lpbinh@gmail.com
on 2022-06-14 18:52:02 UTC —
Created attachment 1889993 [details] cluster log per need-info request - Cluster ID caa475b0-df04-4c52-8ad9-abfed1509506
Attached is the cluster log per need info request.
Cluster ID: caa475b0-df04-4c52-8ad9-abfed1509506
In this reproduction, the issue was not resolved by OpenShift itself; the wrong NAT remained and the cluster deployment eventually failed.
sh-4.4# nft list table ip nat | grep 172.30.0.1
meta l4proto tcp ip daddr 172.30.0.1 tcp dport 443 counter packets 2 bytes 120 jump KUBE-SVC-NPX46M4PTMTKRN6Y
sh-4.4# nft list chain ip nat KUBE-SVC-NPX46M4PTMTKRN6Y
table ip nat {
chain KUBE-SVC-NPX46M4PTMTKRN6Y
}
sh-4.4# nft list chain ip nat KUBE-SEP-VZ2X7DROOLWBXBJ4; date
table ip nat {
chain KUBE-SEP-VZ2X7DROOLWBXBJ4
}
Tue Jun 14 17:40:19 UTC 2022
sh-4.4# nft list chain ip nat KUBE-SEP-VZ2X7DROOLWBXBJ4; date
table ip nat {
chain KUBE-SEP-VZ2X7DROOLWBXBJ4
}
Tue Jun 14 17:59:19 UTC 2022
sh-4.4# nft list chain ip nat KUBE-SEP-VZ2X7DROOLWBXBJ4; date
table ip nat {
chain KUBE-SEP-VZ2X7DROOLWBXBJ4
}
Tue Jun 14 18:17:38 UTC 2022
sh-4.4#
sh-4.4#
sh-4.4# nft list chain ip nat KUBE-SEP-VZ2X7DROOLWBXBJ4; date
table ip nat {
chain KUBE-SEP-VZ2X7DROOLWBXBJ4
}
Tue Jun 14 18:49:28 UTC 2022
sh-4.4#
— Additional comment from
itsoiref@redhat.com
on 2022-06-15 15:59:22 UTC —
@lpbinh@gmail.com just for the record, we don't support baremetal OCP on OpenStack; that's why the validation is failing.
— Additional comment from
lpbinh@gmail.com
on 2022-06-15 17:47:39 UTC —
@itsoiref@redhat.com as explained, it's just a workaround on our side to make OCP work in our lab; from my understanding, from the OCP perspective it will see the deployment as baremetal only, not related to OpenStack (please correct me if I am wrong).
We have been doing thousands of OCP cluster deployments in our automation so far; if that were why the validation is failing, it should fail every time. However, it only occurs occasionally when nodes have 2 interfaces and use the OCP internal DNS and load balancer, and it is sometimes resolved by itself and sometimes not.
— Additional comment from
itsoiref@redhat.com
on 2022-06-19 17:00:01 UTC —
For now I can assume that this endpoint is causing the issue:
{
"apiVersion": "v1",
"kind": "Endpoints",
"metadata": {
"creationTimestamp": "2022-06-14T17:31:10Z",
"labels":
,
"name": "kubernetes",
"namespace": "default",
"resourceVersion": "265",
"uid": "d8f558be-bb68-44ac-b7c2-85ca7a0fdab3"
},
"subsets": [
{
"addresses": [
],
"ports": [
{
"name": "https",
"port": 6443,
"protocol": "TCP"
}
]
}
]
},
— Additional comment from
itsoiref@redhat.com
on 2022-06-21 17:03:51 UTC —
The issue is that the kube-api service advertises the wrong IP, but it does so because kubelet chooses an IP arbitrarily, and we currently have no mechanism to set the kubelet IP, especially in the bootstrap flow.
— Additional comment from
lpbinh@gmail.com
on 2022-06-22 16:07:29 UTC —
@itsoiref@redhat.com how do you perform OCP deployments in setups that have multiple interfaces, if kubelet chooses an interface arbitrarily instead of being configured with a specific IP address to listen on? With what you describe above, the chance of deployment failure on a system with multiple interfaces would be high.
— Additional comment from
dhellard@redhat.com
on 2022-06-24 16:32:26 UTC —
I set the Customer Escalation flag = Yes, per ACE EN-52253.
The impact is noted by the RH Account team: "Juniper is pressing and this impacts the Unica Next Project at Telefónica Spain. Unica Next is a critical project for Red Hat. We go live the 1st of July and this issue could impact the go-live dates. We need clear information about the status and its possible resolution."
— Additional comment from
itsoiref@redhat.com
on 2022-06-26 07:28:44 UTC —
I have sent an image with a possible fix to Juniper and am waiting for their feedback; once they confirm it works for them, we will proceed with the PRs.
— Additional comment from
pratshar@redhat.com
on 2022-06-30 13:26:26 UTC —
=== In Red Hat Customer Portal Case 03223143 ===
— Comment by Prateeksha Sharma on 6/30/2022 6:56 PM —
//EMT note//
Update from our consultant Manuel Martinez Briceno -
====
On 28th June 2022, the latest feedback from the Juniper Project Manager and our Partner Manager was that they are testing the fix. They didn't give an estimated time to finish, but we will be tracking this closely and will pass along any news.
====
Thanks & Regards,
Prateeksha Sharma
Escalation Manager | RHCSA
Global Support Services, Red Hat
This is a clone of issue OCPBUGS-6881. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-5947. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-4833. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-4819. The following is the description of the original issue:
—
Description of problem:
s2i/run script has a bug - /usr/libexec/s2i/run: line 578: [: too many arguments
Version-Release number of selected component (if applicable):
v4.10
How reproducible:
Start jenkins from the container image using /usr/libexec/s2i/run while having a route that contains a certificate or key that includes special characters.
Steps to Reproduce:
1. create a route that contains a TLS certificate 2. start a pod using openshift4/ose-jenkins:v4.10.0 3. view the log
Actual results:
2022/12/12 17:30:33 [go-init] No pre-start command defined, skip 2022/12/12 17:30:33 [go-init] Main command launched : /usr/libexec/s2i/run CONTAINER_MEMORY_IN_MB='12288', using /usr/lib/jvm/java-11-openjdk-11.0.16.0.8-1.el8_4.x86_64/bin/java and /usr/lib/jvm/java-11-openjdk-11.0.16.0.8-1.el8_4.x86_64/bin/javac Administrative monitors that contact the update center will remain active Migrating slave image configuration to current version tag ... /usr/libexec/s2i/run: line 578: [: too many arguments Using JENKINS_SERVICE_NAME=jenkins Generating jenkins.model.JenkinsLocationConfiguration.xml using (/var/lib/jenkins/jenkins.model.JenkinsLocationConfiguration.xml.tpl) ... Jenkins URL set to: https://bojenkinsdev.micron.com/ in file: /var/lib/jenkins/jenkins.model.JenkinsLocationConfiguration.xml + exec java -XX:+DisableExplicitGC -XX:+HeapDumpOnOutOfMemoryError -XX:+ParallelRefProcEnabled -XX:+UseG1GC -XX:+UseStringDeduplication -XX:HeapDumpPath=/var/log/jenkins/ '-Xlog:gc*=debug:file=/var/log/jenkins-engserv/gc-%t.log:utctime:filecount=2,filesize=100m' -Xms2g -Xmx8g -Dfile.encoding=UTF8 -Djavamelody.displayed-counters=log,error -Djava.util.logging.config.file=/var/lib/jenkins/logging.properties -Djavax.net.ssl.trustStore=/var/lib/jenkins/ca-anchors-keystore -Dcom.redhat.fips=false -Djdk.http.auth.tunneling.disabledSchemes= -Djdk.http.auth.proxying.disabledSchemes= -Duser.home=/var/lib/jenkins -Djavamelody.application-name=jenkins -Dhudson.security.csrf.GlobalCrumbIssuerConfiguration.DISABLE_CSRF_PROTECTION=true -Djenkins.install.runSetupWizard=false -Dhudson.security.csrf.GlobalCrumbIssuerConfiguration.DISABLE_CSRF_PROTECTION=false -XX:+AlwaysPreTouch -XX:ErrorFile=/var/log/jenkins-engserv -Dhudson.model.ParametersAction.keepUndefinedParameters=false -jar /usr/lib/jenkins/jenkins.war --prefix=/je... Picked up JAVA_TOOL_OPTIONS: -XX:+UnlockExperimentalVMOptions -Dsun.zip.disableMemoryMapping=true
Expected results:
The following error should not appear: /usr/libexec/s2i/run: line 578: [: too many arguments
Additional info:
Description of problem:
We need to update the operator to be in sync with the Kubernetes API version used by OCP 4.13. We also need to sync our samples libraries with the latest available libraries. Any deprecated libraries should be removed as well.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
- After upgrading to OCP 4.10.41, thanos-ruler-user-workload-1 in the openshift-user-workload-monitoring namespace is consistently being created and deleted.
- We had to scale down the Prometheus operator multiple times so that the upgrade is considered successful.
- This fix is temporary. After some time the issue appears again and the Prometheus operator needs to be scaled down and up again.
- The issue is present on all clusters in this customer environment which are upgraded to 4.10.41.
Version-Release number of selected component (if applicable):
How reproducible:
N/A, I wasn't able to reproduce the issue.
Steps to Reproduce:
Actual results:
Expected results:
Additional info:
P-06-TC01 | Text change is required |
P-06-TC04 | Text change is required |
P-06-TC13 | Text change is required |
P-03-TC03 also get fixed with this bug
3.4.21 is about to go out soonish with this plan: https://github.com/etcd-io/etcd/issues/14438
Two important BZs from our side are pending this rebase:
Update the Kafka test scenarios in the eventing-kafka-event-source.feature file.
While executing the regression tests, the test scenarios were updated.
Description of problem:
Update the Kubernetes (k8s.io/*) library versions to v0.27.2 in the cluster samples operator for the OCP 4.14 release.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
This is a clone of issue OCPBUGS-1758. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-1354. The following is the description of the original issue:
—
This was originally reported in BZ as https://bugzilla.redhat.com/show_bug.cgi?id=2046335
—
Description of problem:
The issue reported at https://bugzilla.redhat.com/show_bug.cgi?id=1954121 still occurs (tested on OCP 4.8.11; the CU also verified that the issue can happen even with OpenShift 4.7.30, 4.8.17 and 4.9.11).
How reproducible:
Attaching a NIC to a master node will trigger the issue.
Steps to Reproduce:
1. Deploy an OCP cluster (I've tested it IPI on AWS)
2. Attach a second NIC to a running master node (in my case "ip-10-0-178-163.eu-central-1.compute.internal")
Actual results:
~~~
$ oc get node ip-10-0-178-163.eu-central-1.compute.internal -o json | jq ".status.addresses"
[
,
,
,
{ "address": "ip-10-0-178-163.eu-central-1.compute.internal", "type": "InternalDNS" }]
$ oc get co etcd
NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE
etcd 4.8.11 True False True 31h
$ oc get co etcd -o json | jq ".status.conditions[0]"
{ "lastTransitionTime": "2022-01-26T15:47:42Z", "message": "EtcdCertSignerControllerDegraded: [x509: certificate is valid for 10.0.178.163, not 10.0.187.247, x509: certificate is valid for ::1, 10.0.178.163, 127.0.0.1, ::1, not 10.0.187.247]", "reason": "EtcdCertSignerController_Error", "status": "True", "type": "Degraded" }~~~
Expected results:
To have the certificate valid also for the second IP (the newly created one "10.0.187.247")
Additional info:
Deleting the following secrets seems to solve the issue:
~~~
$ oc get secret -n openshift-etcd | grep kubernetes.io/tls | grep ^etcd
etcd-client kubernetes.io/tls 2 61s
etcd-peer-ip-10-0-132-49.eu-central-1.compute.internal kubernetes.io/tls 2 61s
etcd-peer-ip-10-0-178-163.eu-central-1.compute.internal kubernetes.io/tls 2 61s
etcd-peer-ip-10-0-202-187.eu-central-1.compute.internal kubernetes.io/tls 2 60s
etcd-serving-ip-10-0-132-49.eu-central-1.compute.internal kubernetes.io/tls 2 60s
etcd-serving-ip-10-0-178-163.eu-central-1.compute.internal kubernetes.io/tls 2 59s
etcd-serving-ip-10-0-202-187.eu-central-1.compute.internal kubernetes.io/tls 2 59s
etcd-serving-metrics-ip-10-0-132-49.eu-central-1.compute.internal kubernetes.io/tls 2 58s
etcd-serving-metrics-ip-10-0-178-163.eu-central-1.compute.internal kubernetes.io/tls 2 59s
etcd-serving-metrics-ip-10-0-202-187.eu-central-1.compute.internal kubernetes.io/tls 2 58s
$ oc get secret -n openshift-etcd | grep kubernetes.io/tls | grep ^etcd | awk '{print $1}' | xargs -I {} oc delete secret {} -n openshift-etcd
secret "etcd-client" deleted
secret "etcd-peer-ip-10-0-132-49.eu-central-1.compute.internal" deleted
secret "etcd-peer-ip-10-0-178-163.eu-central-1.compute.internal" deleted
secret "etcd-peer-ip-10-0-202-187.eu-central-1.compute.internal" deleted
secret "etcd-serving-ip-10-0-132-49.eu-central-1.compute.internal" deleted
secret "etcd-serving-ip-10-0-178-163.eu-central-1.compute.internal" deleted
secret "etcd-serving-ip-10-0-202-187.eu-central-1.compute.internal" deleted
secret "etcd-serving-metrics-ip-10-0-132-49.eu-central-1.compute.internal" deleted
secret "etcd-serving-metrics-ip-10-0-178-163.eu-central-1.compute.internal" deleted
secret "etcd-serving-metrics-ip-10-0-202-187.eu-central-1.compute.internal" deleted
$ oc get co etcd -o json | jq ".status.conditions[0]"
{ "lastTransitionTime": "2022-01-26T15:52:21Z", "message": "NodeControllerDegraded: All master nodes are ready\nEtcdMembersDegraded: No unhealthy members found", "reason": "AsExpected", "status": "False", "type": "Degraded" }~~~
Description of problem:
When queried dns hostname from certain pod on the certain node, responded from random coredns pod, not prefer local one. Is it expected result ? # In OCP v4.8.13 case // Ran dig command on the certain node which is running the following test-7cc4488d48-tqc4m pod. sh-4.4# while : ; do echo -n "$(date '+%H:%M:%S') :"; dig google.com +short; sleep 1; done : 07:16:33 :172.217.175.238 07:16:34 :172.217.175.238 <--- Refreshed the upstream result 07:16:36 :142.250.207.46 07:16:37 :142.250.207.46 // The dig results is matched with the running node one as you can see the above one. $ oc rsh test-7cc4488d48-tqc4m bash -c 'while : ; do echo -n "$(date '+%H:%M:%S') :"; dig google.com +short; sleep 1; done' : 07:16:35 :172.217.175.238 07:16:36 :172.217.175.238 <--- At the same time, the pod dig result is also refreshed. 07:16:37 :142.250.207.46 07:16:38 :142.250.207.46 But in v4.10 case, in contrast, the dns query result is various and responded randomly regardless local dns results on the node as follows. # In OCP v4.10.23 case, pod's response from DNS services are not consistent. $ oc rsh test-848fcf8ddb-zrcbx bash -c 'while : ; do echo -n "$(date '+%H:%M:%S') :"; dig google.com +short; sleep 1; done' 07:23:00 :142.250.199.110 07:23:01 :142.250.207.46 07:23:02 :142.250.207.46 07:23:03 :142.250.199.110 07:23:04 :142.250.199.110 07:23:05 :172.217.161.78 # Even though the node which is running the pod keep responding the same IP... sh-4.4# while : ; do echo -n "$(date '+%H:%M:%S') :"; dig google.com +short; sleep 1; done 07:23:00 :172.217.161.78 07:23:01 :172.217.161.78 07:23:02 :172.217.161.78 07:23:03 :172.217.161.78 07:23:04 :172.217.161.78 07:23:05 :172.217.161.78
Version-Release number of selected component (if applicable):
v4.10.23 (ROSA) SDN: OpenShiftSDN
How reproducible:
You can always reproduce this issue by running "dig google.com" from both any pod and the node the pod is running on, per the "Description" details above.
Steps to Reproduce:
1. Run any usual pod, and check which node the pod is running on. 2. Run dig google.com on the pod and the node. 3. Check the IP is consistent with the running node each other.
Actual results:
The response IPs are not consistent; a random IP is returned.
Expected results:
The response IP should be consistent with the node-local DNS, i.e. the local resolver should be preferred.
Additional info:
This issue affects EgressNetworkPolicy dnsName feature.
This is a clone of issue OCPBUGS-6086. The following is the description of the original issue:
—
We're still seeing errors with swift in the 4.8 branch:
level=error msg=Cannot extract names from response with content-type: []
See for instance this job: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cloud-provider-openstack/169/pull-ci-openshift-cloud-provider-openstack-release-4.8-e2e-openstack/1616019853557108736
The Create Namespaces script keeps failing due to a load issue.
Unable to execute the create namespace script.
The Create Namespace script should work without any issue.
Description of problem:
Search link: https://search.ci.openshift.org/?search=Create+namespace+from+install+operators+creates+namespace+from+operator+install+page&maxAge=12h&context=1&type=junit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Some of the steps in test scenario A-06-TC02 require a script fix.
A-06-TC05 requires a script fix.
A-06-TC11 requires an update as per the latest UI.
This is a clone of issue OCPBUGS-2173. The following is the description of the original issue:
—
Description of problem:
When a custom machineConfigPool is created and no node is associated with it, the mcp remains at 0% progress.
Version-Release number of selected component (if applicable):
How reproducible:
100%
Steps to Reproduce:
1. Create a custom mcp: ~~~ cat << EOF | oc create -f - apiVersion: machineconfiguration.openshift.io/v1 kind: MachineConfigPool metadata: name: custom spec: machineConfigSelector: matchExpressions: - {key: machineconfiguration.openshift.io/role, operator: In, values: [worker,custom]} nodeSelector: matchLabels: node-role.kubernetes.io/custom: "" EOF ~~~
Actual results:
The mcp is visible from "Administrator view > Cluster Settings > Details" at 0% progress.
Expected results:
It shouldn't be stuck at 0%
Additional info:
Please review the following PR: https://github.com/openshift/prometheus-operator/pull/222
The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
release-4.8 of openshift/cloud-provider-openstack is missing some commits that were backported into the upstream project's release-1.21 branch. We should import them into our downstream fork.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
The statefulset thanos-ruler-user-workload has no serviceName. As the documentation describes, serviceName is required for a StatefulSet. I'm not sure if we need a service here, but one question: if we don't need a service, why not use a regular Deployment? Thanks!
MacBook-Pro:k8sgpt jianzhang$ oc explain statefulset.spec.serviceName KIND: StatefulSet VERSION: apps/v1FIELD: serviceName <string>DESCRIPTION: serviceName is the name of the service that governs this StatefulSet. This service must exist before the StatefulSet, and is responsible for the network identity of the set. Pods get DNS/hostnames that follow the pattern: pod-specific-string.serviceName.default.svc.cluster.local where "pod-specific-string" is managed by the StatefulSet controller. MacBook-Pro:k8sgpt jianzhang$ oc get statefulset -n openshift-user-workload-monitoring -o=jsonpath={.spec.serviceName} MacBook-Pro:k8sgpt jianzhang$ MacBook-Pro:k8sgpt jianzhang$ oc get statefulset -n openshift-user-workload-monitoring NAME READY AGE prometheus-user-workload 2/2 4h44m thanos-ruler-user-workload 2/2 4h44m MacBook-Pro:k8sgpt jianzhang$ oc get svc -n openshift-user-workload-monitoring NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE prometheus-operated ClusterIP None <none> 9090/TCP,10901/TCP 4h44m prometheus-operator ClusterIP None <none> 8443/TCP 4h44m prometheus-user-workload ClusterIP 172.30.46.204 <none> 9091/TCP,9092/TCP,10902/TCP 4h44m prometheus-user-workload-thanos-sidecar ClusterIP None <none> 10902/TCP 4h44m thanos-ruler ClusterIP 172.30.110.49 <none> 9091/TCP,9092/TCP,10901/TCP 4h44m thanos-ruler-operated ClusterIP None <none> 10902/TCP,10901/TCP 4h44m
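For context, a minimal sketch of setting the governing Service via the typed API; pointing serviceName at the existing `thanos-ruler-operated` headless Service is an assumption here, not the confirmed fix:

```go
// Sketch only: a StatefulSet must name a governing (usually headless)
// Service in .spec.serviceName so its pods get stable DNS identities like
// thanos-ruler-user-workload-0.<serviceName>.<namespace>.svc.
package main

import (
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	labels := map[string]string{"app.kubernetes.io/name": "thanos-ruler"}
	sts := appsv1.StatefulSet{
		ObjectMeta: metav1.ObjectMeta{
			Name:      "thanos-ruler-user-workload",
			Namespace: "openshift-user-workload-monitoring",
		},
		Spec: appsv1.StatefulSetSpec{
			// Governing headless Service; it must exist before the
			// StatefulSet is created. "thanos-ruler-operated" is assumed.
			ServiceName: "thanos-ruler-operated",
			Selector:    &metav1.LabelSelector{MatchLabels: labels},
			Template: corev1.PodTemplateSpec{
				ObjectMeta: metav1.ObjectMeta{Labels: labels},
			},
		},
	}
	fmt.Println("serviceName:", sts.Spec.ServiceName)
}
```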
Version-Release number of selected component (if applicable):
MacBook-Pro:k8sgpt jianzhang$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.14.0-0.nightly-2023-05-31-080250   True        False         7h30m   Cluster version is 4.14.0-0.nightly-2023-05-31-080250
How reproducible:
always
Steps to Reproduce:
1. Install OCP 4.14 cluster.
2. Check cluster's statefulset instances or run `k8sgpt analyze -d`
3.
Actual results:
MacBook-Pro:k8sgpt jianzhang$ k8sgpt analyze -d
Service nfs-provisioner/example.com-nfs does not exist
AI Provider: openai
0 openshift-user-workload-monitoring/thanos-ruler-user-workload(thanos-ruler-user-workload)
- Error: StatefulSet uses the service openshift-user-workload-monitoring/ which does not exist.
  Kubernetes Doc: serviceName is the name of the service that governs this StatefulSet. This service must exist before the StatefulSet, and is responsible for the network identity of the set. Pods get DNS/hostnames that follow the pattern: pod-specific-string.serviceName.default.svc.cluster.local where "pod-specific-string" is managed by the StatefulSet controller.
Expected results:
The statefulset should have a serviceName set.
Additional info:
The console-operator codebase contains a lot of inline manifests. Instead, we should put those manifests into a `/bindata` folder, from which they can be read and then adjusted for each purpose.
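As a rough sketch of that direction (assumptions only, not the console-operator's actual code), manifests kept under a `bindata/` folder can be embedded at build time and decoded into typed objects before being modified and applied; the package name and `DeploymentFromAsset` helper are illustrative:
```
package assets

import (
	"embed"

	appsv1 "k8s.io/api/apps/v1"
	"k8s.io/apimachinery/pkg/util/yaml"
)

//go:embed bindata/*.yaml
var content embed.FS

// DeploymentFromAsset reads an embedded manifest and decodes it into a
// typed object that the operator can adjust per purpose before applying.
func DeploymentFromAsset(name string) (*appsv1.Deployment, error) {
	raw, err := content.ReadFile("bindata/" + name)
	if err != nil {
		return nil, err
	}
	deployment := &appsv1.Deployment{}
	if err := yaml.Unmarshal(raw, deployment); err != nil {
		return nil, err
	}
	return deployment, nil
}
```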
Description of problem:
We need to update the operator to be in sync with the Kubernetes API version used by OCP 4.13. We also need to sync our samples libraries with the latest available libraries. Any deprecated libraries should be removed as well.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
1) We want to fix the order of imports in the files.
2) We want vendor imports first, followed by console/package imports, with relative imports coming last.
This can be done manually or by introducing linter rules for it.
Description of problem:
We need to update the operator to be in sync with the Kubernetes API version used by OCP 4.14. We also need to sync our samples libraries with the latest available libraries. Any deprecated libraries should be removed as well.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
This is a clone of issue OCPBUGS-2594. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-1538. The following is the description of the original issue:
—
Description of problem:
Tracking this for backport of https://bugzilla.redhat.com/show_bug.cgi?id=2072710
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
ovnkube master container crash because service controller panic 2022-11-07T04:45:39.923889951+11:00 E1106 17:45:39.923862 1 runtime.go:78] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference) 2022-11-07T04:45:39.923889951+11:00 goroutine 16732 [running]: 2022-11-07T04:45:39.923889951+11:00 k8s.io/apimachinery/pkg/util/runtime.logPanic(0x190bb00, 0x28ce990) 2022-11-07T04:45:39.923889951+11:00 /go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:74 +0x95 2022-11-07T04:45:39.923889951+11:00 k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0) 2022-11-07T04:45:39.923889951+11:00 /go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:48 +0x86 2022-11-07T04:45:39.923889951+11:00 panic(0x190bb00, 0x28ce990) 2022-11-07T04:45:39.923889951+11:00 /usr/lib/golang/src/runtime/panic.go:965 +0x1b9 2022-11-07T04:45:39.923889951+11:00 k8s.io/api/core/v1.(*Service).GetObjectKind(0x0, 0x7f25ab6ead68, 0x0) 2022-11-07T04:45:39.923889951+11:00 <autogenerated>:1 +0x5 2022-11-07T04:45:39.923889951+11:00 k8s.io/client-go/tools/reference.GetReference(0xc00023b420, 0x1d5f6e0, 0x0, 0x7f257803baf8, 0x0, 0x0) 2022-11-07T04:45:39.923889951+11:00 /go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/k8s.io/client-go/tools/reference/ref.go:59 +0x14d 2022-11-07T04:45:39.923889951+11:00 k8s.io/client-go/tools/record.(*recorderImpl).generateEvent(0xc0078afd80, 0x1d5f6e0, 0x0, 0x0, 0xc0d21a90f70f1ab5, 0x8c5d1203c70a, 0x290c9a0, 0x1b168b6, 0x7, 0x1b36c3b, ...) 2022-11-07T04:45:39.923889951+11:00 /go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/k8s.io/client-go/tools/record/event.go:327 +0x5d 2022-11-07T04:45:39.923889951+11:00 k8s.io/client-go/tools/record.(*recorderImpl).Event(0xc0078afd80, 0x1d5f6e0, 0x0, 0x1b168b6, 0x7, 0x1b36c3b, 0x1d, 0xc0167a8340, 0x186) 2022-11-07T04:45:39.923889951+11:00 /go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/k8s.io/client-go/tools/record/event.go:349 +0xc5 2022-11-07T04:45:39.923889951+11:00 k8s.io/client-go/tools/record.(*recorderImpl).Eventf(0xc0078afd80, 0x1d5f6e0, 0x0, 0x1b168b6, 0x7, 0x1b36c3b, 0x1d, 0x1b7df1f, 0x41, 0xc00ef72060, ...) 2022-11-07T04:45:39.923889951+11:00 /go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/k8s.io/client-go/tools/record/event.go:353 +0xca 2022-11-07T04:45:39.923889951+11:00 github.com/ovn-org/ovn-kubernetes/go-controller/pkg/ovn/controller/services.(*Controller).syncServices(0xc001e60c60, 0xc00f1de210, 0x2f, 0x0, 0x0) 2022-11-07T04:45:39.923889951+11:00 /go/src/github.com/openshift/ovn-kubernetes/go-controller/pkg/ovn/controller/services/services_controller.go:246 +0x682 2022-11-07T04:45:39.923889951+11:00 github.com/ovn-org/ovn-kubernetes/go-controller/pkg/ovn/controller/services.(*Controller).processNextWorkItem(0xc001e60c60, 0x203000) 2022-11-07T04:45:39.923889951+11:00 /go/src/github.com/openshift/ovn-kubernetes/go-controller/pkg/ovn/controller/services/services_controller.go:184 +0xcd 2022-11-07T04:45:39.923889951+11:00 github.com/ovn-org/ovn-kubernetes/go-controller/pkg/ovn/controller/services.(*Controller).worker(...) 
2022-11-07T04:45:39.923889951+11:00 /go/src/github.com/openshift/ovn-kubernetes/go-controller/pkg/ovn/controller/services/services_controller.go:173 2022-11-07T04:45:39.923889951+11:00 k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0xc0082cdda0) 2022-11-07T04:45:39.923889951+11:00 /go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:155 +0x5f 2022-11-07T04:45:39.923889951+11:00 k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc0082cdda0, 0x1d535c0, 0xc005ef6e70, 0xc0106e2701, 0xc0000db500) 2022-11-07T04:45:39.923889951+11:00 /go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:156 +0x9b 2022-11-07T04:45:39.923889951+11:00 k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc0082cdda0, 0x3b9aca00, 0x0, 0xc0106e2701, 0xc0000db500) 2022-11-07T04:45:39.923889951+11:00 /go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133 +0x98 2022-11-07T04:45:39.923889951+11:00 k8s.io/apimachinery/pkg/util/wait.Until(0xc0082cdda0, 0x3b9aca00, 0xc0000db500) 2022-11-07T04:45:39.923889951+11:00 /go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:90 +0x4d 2022-11-07T04:45:39.923889951+11:00 created by github.com/ovn-org/ovn-kubernetes/go-controller/pkg/ovn/controller/services.(*Controller).Run 2022-11-07T04:45:39.923889951+11:00 /go/src/github.com/openshift/ovn-kubernetes/go-controller/pkg/ovn/controller/services/services_controller.go:161 +0x3b1 2022-11-07T04:45:39.927176450+11:00 panic: runtime error: invalid memory address or nil pointer dereference [recovered] 2022-11-07T04:45:39.927176450+11:00 panic: runtime error: invalid memory address or nil pointer dereference 2022-11-07T04:45:39.927176450+11:00 [signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0xd3c385] 2022-11-07T04:45:39.927176450+11:00
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Migrate the existing tests which are located here :
Helper functions/Views location:
Description of problem:
Based on bugs from the ART team (example: https://issues.redhat.com/browse/OCPBUGS-12347), 4.14 images should be built with go 1.20, but the prometheus container image is built with go1.19.6.
$ token=`oc create token prometheus-k8s -n openshift-monitoring`
$ oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://thanos-querier.openshift-monitoring.svc:9091/api/v1/label/goversion/values' | jq
{
  "status": "success",
  "data": [
    "go1.19.6",
    "go1.20.3"
  ]
}
Searched via the Thanos API:
$ oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://thanos-querier.openshift-monitoring.svc:9091/api/v1/query?' --data-urlencode 'query={__name__=~".*",goversion="go1.19.6"}' | jq { "status": "success", "data": { "resultType": "vector", "result": [ { "metric": { "__name__": "prometheus_build_info", "branch": "rhaos-4.14-rhel-8", "container": "kube-rbac-proxy", "endpoint": "metrics", "goarch": "amd64", "goos": "linux", "goversion": "go1.19.6", "instance": "10.128.2.19:9092", "job": "prometheus-k8s", "namespace": "openshift-monitoring", "pod": "prometheus-k8s-0", "prometheus": "openshift-monitoring/k8s", "revision": "fe01b9f83cb8190fc8f04c16f4e05e87217ab03e", "service": "prometheus-k8s", "tags": "unknown", "version": "2.43.0" }, "value": [ 1682576802.496, "1" ] }, ...
prometheus-k8s-0 container names: [prometheus config-reloader thanos-sidecar prometheus-proxy kube-rbac-proxy kube-rbac-proxy-thanos]; the prometheus image is built with go1.19.6
$ oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- prometheus --version
prometheus, version 2.43.0 (branch: rhaos-4.14-rhel-8, revision: fe01b9f83cb8190fc8f04c16f4e05e87217ab03e)
  build user:       root@402ffbe02b57
  build date:       20230422-00:43:08
  go version:       go1.19.6
  platform:         linux/amd64
  tags:             unknown

$ oc -n openshift-monitoring exec -c config-reloader prometheus-k8s-0 -- prometheus-config-reloader --version
prometheus-config-reloader, version 0.63.0 (branch: rhaos-4.14-rhel-8, revision: ce71a7d)
  build user:       root
  build date:       20230424-15:53:51
  go version:       go1.20.3
  platform:         linux/amd64

$ oc -n openshift-monitoring exec -c thanos-sidecar prometheus-k8s-0 -- thanos --version
thanos, version 0.31.0 (branch: rhaos-4.14-rhel-8, revision: d58df6d218925fd007e16965f50047c9a4194c42)
  build user:       root@c070c5e6af32
  build date:       20230422-00:44:21
  go version:       go1.20.3
  platform:         linux/amd64

# owned by oauth team, not the responsibility of Monitoring
$ oc -n openshift-monitoring exec -c prometheus-proxy prometheus-k8s-0 -- oauth-proxy --version
oauth2_proxy was built with go1.18.10

# the issue below is tracked by bug OCPBUGS-12821
$ oc -n openshift-monitoring exec -c kube-rbac-proxy prometheus-k8s-0 -- kube-rbac-proxy --version
Kubernetes v0.0.0-master+$Format:%H$
$ oc -n openshift-monitoring exec -c kube-rbac-proxy-thanos prometheus-k8s-0 -- kube-rbac-proxy --version
Kubernetes v0.0.0-master+$Format:%H$
Files that should be fixed:
https://github.com/openshift/prometheus/blob/master/.ci-operator.yaml#L4
https://github.com/openshift/prometheus/blob/master/Dockerfile.ocp#L1
Version-Release number of selected component (if applicable):
4.14.0-0.nightly-2023-04-26-154754
How reproducible:
always
Actual results:
4.14 prometheus is built with go1.19.6
Expected results:
4.14 prometheus image should be built with go1.20
Additional info:
no functional impact
In the topology view, if you select any grouping (Application, Helm Release, Operator Backed service, etc), an extraneous blue box is displayed
This is a regression.
Create an application in any way ... but this will do ...
This animated gif shows the issue:
The blue box shouldn't be shown
Always
Seen on 4/26/2021 4.8 daily, but this behavior was discussed in slack last week
This is a regression
This is a clone of issue OCPBUGS-1010. The following is the description of the original issue:
—
Description of problem:
+++ This bug was initially created as a clone of https://issues.redhat.com//browse/OCPBUGS-784 Various CI steps use the upi-installer container for its access to the aws cli tools, among other things. However, most of those steps also curl yq directly from GitHub. We can save ourselves some headaches when GitHub is down by embedding the binary in the image already. Whenever GitHub has issues or throttles us, the yq download errors out with a hash mismatch. The hash mismatch happens because GitHub is probably returning an error page, although our scripts hide it.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
When logged in as a non-admin user, I can't silence alerts from the Dev Console.
Version-Release number of selected component (if applicable):
4.10 but the same issue may exist for previous versions.
How reproducible:
Steps to Reproduce:
1. Login to the dev console as a non-admin user.
2. Follow the OCP documentation to deploy the example application including the service monitor and the rule (the user needs to be granted the monitoring-edit role). See https://docs.openshift.com/container-platform/4.9/monitoring/enabling-monitoring-for-user-defined-projects.html for details.
3. Go to the Observe > Alerts page and disable notifications for 30 minutes for the VersionAlert alert.
Actual results:
The alert notification seems to be disabled but on refresh, the notification is still enabled.
Expected results:
The notification is permanently disabled.
Additional info:
The console backend hits the wrong service, which results in a 403 response code; it should use the tenancy-aware service.
Copied from https://bugzilla.redhat.com/show_bug.cgi?id=2117608
Description of problem:
The hack/check-plugins-supply-chain-change.sh script is not executable when running the ci/prow/jenkins-check-plugins-supply-chain-change job
Version-Release number of selected component (if applicable):
4.8.z
How reproducible:
Always
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
P-02-TC02 | Script fix required - unable to identify locators |
P-02-TC03 | Script fix required - unable to identify locators |
P-02-TC06 | Script fix required - unable to identify locators |
Description of problem:
When setting the allowedregistries like the example below, the openshift-samples operator is degraded: oc get image.config.openshift.io/cluster -o yaml apiVersion: config.openshift.io/v1 kind: Image metadata: annotations: release.openshift.io/create-only: "true" creationTimestamp: "2020-12-16T15:48:20Z" generation: 2 name: cluster resourceVersion: "422284920" uid: d406d5a0-c452-4a84-b6b3-763abb51d7a5 spec: additionalTrustedCA: name: registry-ca allowedRegistriesForImport: - domainName: quay.io insecure: false - domainName: registry.redhat.io insecure: false - domainName: registry.access.redhat.com insecure: false - domainName: registry.redhat.io/redhat/redhat-operator-index insecure: true - domainName: registry.redhat.io/redhat/redhat-marketplace-index insecure: true - domainName: registry.redhat.io/redhat/certified-operator-index insecure: true - domainName: registry.redhat.io/redhat/community-operator-index insecure: true registrySources: allowedRegistries: - quay.io - registry.redhat.io - registry.rijksapps.nl - registry.access.redhat.com - registry.redhat.io/redhat/redhat-operator-index - registry.redhat.io/redhat/redhat-marketplace-index - registry.redhat.io/redhat/certified-operator-index - registry.redhat.io/redhat/community-operator-index oc get co NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE authentication 4.10.21 True False False 5d13h baremetal 4.10.21 True False False 450d cloud-controller-manager 4.10.21 True False False 94d cloud-credential 4.10.21 True False False 624d cluster-autoscaler 4.10.21 True False False 624d config-operator 4.10.21 True False False 624d console 4.10.21 True False False 42d csi-snapshot-controller 4.10.21 True False False 31d dns 4.10.21 True False False 217d etcd 4.10.21 True False False 624d image-registry 4.10.21 True False False 94d ingress 4.10.21 True False False 94d insights 4.10.21 True False False 104s kube-apiserver 4.10.21 True False False 624d kube-controller-manager 4.10.21 True False False 624d kube-scheduler 4.10.21 True False False 624d kube-storage-version-migrator 4.10.21 True False False 31d machine-api 4.10.21 True False False 624d machine-approver 4.10.21 True False False 624d machine-config 4.10.21 True False False 17d marketplace 4.10.21 True False False 258d monitoring 4.10.21 True False False 161d network 4.10.21 True False False 624d node-tuning 4.10.21 True False False 31d openshift-apiserver 4.10.21 True False False 42d openshift-controller-manager 4.10.21 True False False 22d openshift-samples 4.10.21 True True True 31d Samples installation in error at 4.10.21: &errors.errorString{s:"global openshift image configuration prevents the creation of imagestreams using the registry "} operator-lifecycle-manager 4.10.21 True False False 624d operator-lifecycle-manager-catalog 4.10.21 True False False 624d operator-lifecycle-manager-packageserver 4.10.21 True False False 31d service-ca 4.10.21 True False False 624d storage 4.10.21 True False False 113d After applying the fix as described here( https://access.redhat.com/solutions/6547281 ) it is resolved: oc patch configs.samples.operator.openshift.io cluster --type merge --patch '{"spec": {"samplesRegistry": "registry.redhat.io"}}' But according the the BZ this should be fixed in 4.10.3 https://bugzilla.redhat.com/show_bug.cgi?id=2027745 but the issue is still occur in our 4.10.21 cluster: oc get clusterversion NAME VERSION AVAILABLE PROGRESSING SINCE STATUS version 4.10.21 True False 31d Error while reconciling 4.10.21: the cluster operator openshift-samples is 
degraded
Version-Release number of selected component (if applicable):
How reproducible:
100%
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
This is a clone of issue OCPBUGS-1523. The following is the description of the original issue:
—
Description of problem:
In a completely disconnected cluster, the dev catalog takes too much time to load
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. A complete disconnected cluster
2. In add page go to the All services page
3.
Actual results:
Taking too much time to load
Expected results:
Time taken should be reduced
Additional info:
Attached a gif for reference
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
This task adds support for setting the socket options SO_REUSEADDR and SO_REUSEPORT on etcd listeners via ListenConfig. These options give flexibility to cluster admins who want more explicit control of these features. What we have found is that during an etcd process restart there can be a considerable wait for the port to be released, as it is held open by TIME_WAIT, which on many systems is 60s.
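For illustration, a minimal sketch (assuming golang.org/x/sys/unix; not the actual etcd change) of setting SO_REUSEADDR and SO_REUSEPORT through net.ListenConfig's Control hook:
```
package listener

import (
	"context"
	"net"
	"syscall"

	"golang.org/x/sys/unix"
)

// newReusableListener opens a TCP listener with SO_REUSEADDR and SO_REUSEPORT
// set, so a restarted process can rebind the port without waiting out TIME_WAIT.
func newReusableListener(ctx context.Context, addr string) (net.Listener, error) {
	lc := net.ListenConfig{
		Control: func(network, address string, c syscall.RawConn) error {
			var sockErr error
			err := c.Control(func(fd uintptr) {
				sockErr = unix.SetsockoptInt(int(fd), unix.SOL_SOCKET, unix.SO_REUSEADDR, 1)
				if sockErr != nil {
					return
				}
				sockErr = unix.SetsockoptInt(int(fd), unix.SOL_SOCKET, unix.SO_REUSEPORT, 1)
			})
			if err != nil {
				return err
			}
			return sockErr
		},
	}
	return lc.Listen(ctx, "tcp", addr)
}
```
With both options set, a restarted member can rebind its port immediately instead of waiting for TIME_WAIT to expire.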
Hello team,
Raising this bug to backport the fix for the bug (https://bugzilla.redhat.com/show_bug.cgi?id=1986375) in OCP version 4.8
This is a clone of issue OCPBUGS-572. The following is the description of the original issue:
—
This is a clone of Bug 2117557 to track backport to 4.9.z
+++ This bug was initially created as a clone of
Bug #2108021
+++
+++ This bug was initially created as a clone of
Bug #2106733
+++
Description of problem:
During a replacement of worker nodes, we noticed that the machine-controller container, which is deployed as part of the `openshift-machine-api` namespace, would panic when an OpenShift Machine was still in the "Provisioning" state but the corresponding AWS instance was already "Terminated".
```
I0628 10:09:02.518169 1 reconciler.go:123] my-super-worker-skghqwd23: deleting machine
I0628 10:09:03.090641 1 reconciler.go:464] my-super-worker-skghqwd23: Found instance by id: i-11111111111111
I0628 10:09:03.090662 1 reconciler.go:138] my-super-worker-skghqwd23: found 1 existing instances for machine
I0628 10:09:03.090669 1 utils.go:231] Cleaning up extraneous instance for machine: i-11111111111111, state: running, launchTime: 2022-06-28 08:56:52 +0000 UTC
I0628 10:09:03.090682 1 utils.go:235] Terminating i-05332b08d4cc3ab28 instance
panic: assignment to entry in nil map
goroutine 125 [running]:
sigs.k8s.io/cluster-api-provider-aws/pkg/actuators/machine.(*Reconciler).delete(0xc0012df980, 0xc0004bd530, 0x234c4c0)
/go/src/sigs.k8s.io/cluster-api-provider-aws/pkg/actuators/machine/reconciler.go:165 +0x95b
sigs.k8s.io/cluster-api-provider-aws/pkg/actuators/machine.(*Actuator).Delete(0xc000a3a900, 0x25db9b8, 0xc0004bd530, 0xc000b9a000, 0x35e0100, 0x0)
/go/src/sigs.k8s.io/cluster-api-provider-aws/pkg/actuators/machine/actuator.go:171 +0x365
github.com/openshift/machine-api-operator/pkg/controller/machine.(*ReconcileMachine).Reconcile(0xc0007bc960, 0x25db9b8, 0xc0004bd530, 0xc0007c5fc8, 0x15, 0xc0005e4a80, 0x2a, 0xc0004bd530, 0xc000032000, 0x206d640, ...)
/go/src/sigs.k8s.io/cluster-api-provider-aws/vendor/github.com/openshift/machine-api-operator/pkg/controller/machine/controller.go:231 +0x2352
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc0003b20a0, 0x25db910, 0xc00087e040, 0x1feb8e0, 0xc00009f460)
/go/src/sigs.k8s.io/cluster-api-provider-aws/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:298 +0x30d
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc0003b20a0, 0x25db910, 0xc00087e040, 0x0)
/go/src/sigs.k8s.io/cluster-api-provider-aws/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:253 +0x205
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2(0xc000a38790, 0xc0003b20a0, 0x25db910, 0xc00087e040)
/go/src/sigs.k8s.io/cluster-api-provider-aws/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:214 +0x6b
created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2
/go/src/sigs.k8s.io/cluster-api-provider-aws/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:210 +0x425
```
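"assignment to entry in nil map" is the standard Go panic for writing into a map that was never initialized. A minimal, purely illustrative sketch of the guard that prevents it (the helper and label key are assumptions, not the actual machine-api fix):
```
package main

import "fmt"

// setLabel writes a key into a possibly-nil label map. Without the nil check,
// writing into a nil map panics with "assignment to entry in nil map", which
// is the failure mode shown in the trace above.
func setLabel(labels map[string]string, key, value string) map[string]string {
	if labels == nil {
		labels = map[string]string{}
	}
	labels[key] = value
	return labels
}

func main() {
	var labels map[string]string // nil, as on an object that never had labels set
	labels = setLabel(labels, "machine.openshift.io/instance-state", "terminated") // key/value illustrative
	fmt.Println(labels)
}
```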
What is the business impact? Please also provide timeframe information.
We failed to recover from a major outage due to this bug.
Where are you experiencing the behavior? What environment?
Production and all envs.
When does the behavior occur? Frequency? Repeatedly? At certain times?
It appeared only once so far, but can appear in larger scaling scenarios.
Version-Release number of selected component (if applicable):
4.8.39
Actual results:
With the panicking machine-controller, no new instances could be provisioned, resulting in an unscalable cluster. The solution/workaround to the problem was to delete the offending Machines.
Expected results:
Make the cluster scalable again without manual deletion.
Additional info:
— Additional comment from
gferrazs@redhat.com
on 2022-07-13 13:34:08 UTC —
Probably the issue is here: https://github.com/openshift/machine-api-provider-aws/blob/d701bcb720a12bd7d169d79699962c447a1f026d/pkg/actuators/machine/reconciler.go#L416-L426 (the fields referenced are in the file below; probably duplicate the lines or move them here).
Since the issue is in machine-api, moving it to the correct team.
— Additional comment from
rmanak@redhat.com
on 2022-07-14 08:20:00 UTC —
I am working on a fix for this.
— Additional comment from
aos-team-art-private@redhat.com
on 2022-07-14 19:10:14 UTC —
Elliott changed bug status from MODIFIED to ON_QA.
This bug is expected to ship in the next 4.12 release.
— Additional comment from
jspeed@redhat.com
on 2022-07-18 15:18:16 UTC —
Waiting for the first 4.11.z stream before we merge
— Additional comment from
jspeed@redhat.com
on 2022-08-08 15:07:16 UTC —
Waiting on 4.11 GA to move ahead here
+++ This bug was initially created as a clone of Bug #2106414 +++
+++ This bug was initially created as a clone of Bug #2065749 +++
Description of problem:
Over time the kubelet slowly consumes memory until, at some point, pods are no longer able to start on the node; coinciding with this are container runtime errors. It appears that even rebooting the node does not resolve the issue once it occurs - the node has to be completely rebuilt.
How reproducible: Consistently
Actual results: Pods are eventually unable to start on the node; rebuilding the node is the only workaround
Expected results: kubelet/crio would continue working as expected
— Additional comment from lstanton@redhat.com on 2022-03-18 17:12:39 UTC —
OVERVIEW
========
"Error: relabel failed /var/lib/kubelet/pods/47eb3e14-a631-412b-b07a-b19499faddbb/volumes/kubernetes.io~csi/pvc-05a55ed9-d63b-4c68-b3af-644a4353d9be/mount: lsetxattr /var/lib/kubelet/pods/47eb3e14-a631-412b-b07a-b19499faddbb/volumes/kubernetes.io~csi/pvc-05a55ed9-d63b-4c68-b3af-644a4353d9be/mount: operation not permitted"
— Additional comment from lstanton@redhat.com on 2022-03-18 17:20:39 UTC —
There is a lot of data available on case 2065749, though this is before they upgraded from 4.8 to 4.9. If there's any specific data that we haven't captured yet but would be helpful please let me know.
— Additional comment from lstanton@redhat.com on 2022-03-18 19:08:44 UTC —
The latest data from an occurrence before the customer upgraded to 4.9.21 can be found here (so this is from 4.8):
— sosreport —
https://attachments.access.redhat.com/hydra/rest/cases/03130785/attachments/057aead6-71c7-47ad-8faf-b0ba437625f3
— Other data (including pprof) —
https://attachments.access.redhat.com/hydra/rest/cases/03130785/attachments/b384715b-4189-4f34-bd85-02a718b1314a
Would it be possible to get your opinion on the above node data? Does anything look obviously out of line in terms of kubelet or crio behavior that might explain pods failing to start? I've requested newer data in the case.
— Additional comment from rphillips@redhat.com on 2022-03-28 14:47:34 UTC —
This dump shows the kubelet is in a state of constant GC. scanobject taking 270ms is high. Syscall6 is LSTAT taking 520ms is
I noticed that pods are failing to be fully cleaned up and created within the kubelet: Failed to remove cgroup (will retry)
This means the pods are staying within Kubelet's memory and the kubelet is retrying the cleanup operation. (Effectively leaking memory).
Additionally, pods are failing to be started (from Comment #1), failing during the start phase with the selinux rename. The failure in the start phase leads me to believe we are hitting (1) (not backported yet into 4.9). PR 107845 fixes an issue where pods on CRI error are arbitrarily marked as terminated when they should be marked as waiting.
1. https://github.com/kubernetes/kubernetes/pull/107845 .
— Additional comment from rphillips@redhat.com on 2022-03-28 15:40:19 UTC —
What are the custom SCCs? Fixing the issue of the relabel is the first step in solving this.
— Additional comment from rphillips@redhat.com on 2022-03-29 14:31:26 UTC —
Upstream leak https://github.com/kubernetes/kubernetes/pull/109103. The pprof on this BZ shows the same issue.
I'll attach the screenshot.
— Additional comment from rphillips@redhat.com on 2022-03-29 14:32:06 UTC —
Created attachment 1869023
leak pprof
— Additional comment from lstanton@redhat.com on 2022-04-05 15:02:09 UTC —
@rphillips@redhat.com do you still need info from me?
— Additional comment from rphillips@redhat.com on 2022-04-13 14:08:15 UTC —
No thank you!
— Additional comment from hshukla@redhat.com on 2022-04-25 09:37:55 UTC —
Created attachment 1874784
pprof from 3 nodes for Swisscom case 03125278
— Additional comment from rphillips@redhat.com on 2022-04-25 13:50:14 UTC —
@Hradayesh: CU has the same leak as this BZ.
❯ go tool pprof w0-d-altais.corproot.net-heap.pprof
File: kubelet
Build ID: 6eb513a78ba65574e291855722d4efa0a3fc9c23
Type: inuse_space
Time: Apr 25, 2022 at 3:49am (CDT)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 54.37MB, 75.57% of 71.94MB total
Showing top 10 nodes out of 313
flat flat% sum% cum cum%
26.56MB 36.92% 36.92% 26.56MB 36.92% k8s.io/kubernetes/pkg/kubelet/cm/containermap.ContainerMap.Add
10.50MB 14.60% 51.52% 10.50MB 14.60% k8s.io/kubernetes/vendor/k8s.io/cri-api/pkg/apis/runtime/v1alpha2.(*CreateContainerResponse).Unmarshal
5.05MB 7.02% 58.53% 5.05MB 7.02% k8s.io/kubernetes/pkg/kubelet/volumemanager/populator.(*desiredStateOfWorldPopulator).markPodProcessed
3.50MB 4.87% 63.40% 4MB 5.56% k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/apis/meta/v1.(*ObjectMeta).Unmarshal
2.50MB 3.48% 66.88% 4.01MB 5.57% k8s.io/kubernetes/vendor/github.com/google/cadvisor/container/libcontainer.newContainerStats
1.50MB 2.09% 68.97% 2MB 2.78% k8s.io/kubernetes/vendor/k8s.io/api/core/v1.(*Pod).DeepCopy
1.50MB 2.09% 71.05% 1.50MB 2.09% k8s.io/kubernetes/vendor/k8s.io/api/core/v1.(*Container).Unmarshal
1.13MB 1.57% 72.62% 1.13MB 1.57% k8s.io/kubernetes/vendor/google.golang.org/protobuf/internal/strs.(*Builder).MakeString
1.12MB 1.55% 74.18% 1.62MB 2.25% k8s.io/kubernetes/vendor/github.com/golang/groupcache/lru.(*Cache).Add
1MB 1.39% 75.57% 1MB 1.39% k8s.io/kubernetes/vendor/github.com/aws/aws-sdk-go/aws/endpoints.init
— Additional comment from bugzilla@redhat.com on 2022-05-09 08:31:37 UTC —
Account disabled by LDAP Audit for extended failure
— Additional comment from bsmitley@redhat.com on 2022-05-24 15:17:47 UTC —
Is there any ETA on getting a fix for this? This issue is happening a lot on Ford's Tekton pods.
— Additional comment from dahernan@redhat.com on 2022-06-02 14:35:33 UTC —
Once this is verified, is it viable to also backport it to 4.8, 4.9 or 4.10(at least)? I do not observe any cherry pick or backport related to other versions (yet) but we have a customer (Swisscom) pushing for that, as this is impacting them.
— Additional comment from aos-team-art-private@redhat.com on 2022-06-09 06:10:29 UTC —
Elliott changed bug status from MODIFIED to ON_QA.
This bug is expected to ship in the next 4.11 release.
— Additional comment from openshift-bugzilla-robot@redhat.com on 2022-06-13 13:13:11 UTC —
Bugfix included in accepted release 4.11.0-0.nightly-2022-06-11-054027
Bug will not be automatically moved to VERIFIED for the following reasons:
This bug must now be manually moved to VERIFIED by schoudha@redhat.com
— Additional comment from schoudha@redhat.com on 2022-06-15 15:21:28 UTC —
Checked on 4.11.0-0.nightly-2022-06-14-172335 by running pods over a day and don't see unexpectedly high memory usage by kubelet on node.
% oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.11.0-0.nightly-2022-06-14-172335 True False 8h Cluster version is 4.11.0-0.nightly-2022-06-14-172335
— Additional comment from errata-xmlrpc@redhat.com on 2022-06-15 17:57:08 UTC —
This bug has been added to advisory RHEA-2022:5069 by OpenShift Release Team Bot (ocp-build/buildvm.openshift.eng.bos.redhat.com@REDHAT.COM)
— Additional comment from bsmitley@redhat.com on 2022-06-27 20:34:28 UTC —
Do we have a timeframe for this into 4.10?
This issue is impacting Ford.
Account name: Ford
Account Number: 5561914
TAM customer: yes
CSE customer: yes
Strategic: yes
— Additional comment from bsmitley@redhat.com on 2022-06-28 15:28:55 UTC —
Ford brought this up in my TAM meeting as a hot issue.
I need an ETA of when this will be in 4.10. This just needs to be a rough date. I know normally we don't share exact dates because timeframes could change.
Please review the following PR: https://github.com/openshift/prometheus-operator/pull/221
The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
We have presubmit and periodic jobs failing on: [sig-arch] events should not repeat pathologically for namespace openshift-monitoring
{  2 events happened too frequently

event happened 21 times, something is wrong: ns/openshift-monitoring statefulset/prometheus-k8s hmsg/6f9bc9e1d7 - pathological/true reason/RecreatingFailedPod StatefulSet openshift-monitoring/prometheus-k8s is recreating failed Pod prometheus-k8s-1 From: 16:11:36Z To: 16:11:37Z result=reject

event happened 22 times, something is wrong: ns/openshift-monitoring statefulset/prometheus-k8s hmsg/ecfdd1d225 - pathological/true reason/SuccessfulDelete delete Pod prometheus-k8s-1 in StatefulSet prometheus-k8s successful From: 16:11:36Z To: 16:11:37Z result=reject }

The failure occurs when the event happens over 20 times. The RecreatingFailedPod reason shows up in 4.14 and Presubmits and does not show up in 4.13.
Version-Release number of selected component (if applicable):
4.14
How reproducible:
Run presubmits or periodics; here are latest examples:
2023-05-24 06:25:52.551883+00 | https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.14-e2e-aws-sdn-serial/1661210557367193600 | {aws,amd64,sdn,ha,serial}
2023-05-24 10:20:54.91883+00 | https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.14-e2e-gcp-sdn-serial/1661267817128792064 | {gcp,amd64,sdn,ha,serial}
2023-05-24 14:17:18.849402+00 | https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/27899/pull-ci-openshift-origin-master-e2e-gcp-ovn-upgrade/1661321663389634560 | {gcp,amd64,ovn,upgrade,upgrade-micro,ha}
2023-05-24 14:17:51.908405+00 | https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_kubernetes/1583/pull-ci-openshift-kubernetes-master-e2e-azure-ovn-upgrade/1661324100011823104 | {azure,amd64,ovn,upgrade,upgrade-micro,ha}
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
That event/reason should not show up as a failure in the pathological test
Additional info:
This table shows which variants are affected on 4.14 and Presubmits:
variants                                           | test_count
---------------------------------------------------+------------
{aws,amd64,ovn,upgrade,upgrade-micro,ha}           | 63
{gcp,amd64,ovn,upgrade,upgrade-micro,ha}           | 14
{gcp,amd64,sdn,ha,serial,techpreview}              | 12
{azure,amd64,sdn,ha,serial,techpreview}            | 7
{aws,amd64,sdn,upgrade,upgrade-micro,ha}           | 6
{aws,amd64,ovn,ha}                                 | 6
{vsphere-ipi,amd64,ovn,upgrade,upgrade-micro,ha}   | 5
{aws,amd64,sdn,ha,serial}                          | 5
{azure,amd64,ovn,upgrade,upgrade-micro,ha}         | 5
{metal-ipi,amd64,ovn,upgrade,upgrade-micro,ha}     | 5
{vsphere-ipi,amd64,ovn,ha,serial}                  | 4
{gcp,amd64,sdn,ha,serial}                          | 3
{aws,amd64,ovn,single-node}                        | 3
{metal-ipi,amd64,ovn,ha,serial}                    | 2
{aws,amd64,ovn,ha,serial}                          | 2
{aws,amd64,upgrade,upgrade-micro,ha}               | 1
{aws,arm64,sdn,ha,serial}                          | 1
{aws,arm64,ovn,ha,serial,techpreview}              | 1
{vsphere-ipi,amd64,ovn,ha,serial,techpreview}      | 1
{aws,amd64,sdn,ha,serial,techpreview}              | 1
{libvirt,ppc64le,ovn,ha,serial}                    | 1
{amd64,upgrade,upgrade-micro,ha}                   | 1
Just for my record, I'm using this query to check 4.14 and Presubmits:
SELECT rt.created_at, url, variants
FROM prow_jobs pj
JOIN prow_job_runs r ON r.prow_job_id = pj.id
JOIN prow_job_run_tests rt ON rt.prow_job_run_id = r.id
JOIN prow_job_run_test_outputs o ON o.prow_job_run_test_id = rt.id
JOIN tests ON rt.test_id = tests.id
WHERE pj.release IN ('4.14', 'Presubmits')
  AND rt.status = 12
  AND tests.id = 65991
  AND o.output LIKE '%RecreatingFailedPod%'
ORDER BY rt.created_at, variants DESC;
And this query for checking 4.13:
SELECT rt.created_at, url, variants
FROM prow_jobs pj
JOIN prow_job_runs r ON r.prow_job_id = pj.id
JOIN prow_job_run_tests rt ON rt.prow_job_run_id = r.id
JOIN prow_job_run_test_outputs o ON o.prow_job_run_test_id = rt.id
JOIN tests ON rt.test_id = tests.id
WHERE pj.release IN ('4.13')
  AND rt.status = 12
  AND tests.id IN (65991, 244, 245)
  AND o.output LIKE '%RecreatingFailedPod%'
ORDER BY rt.created_at, variants DESC;
This shows jobs beginning on 4/13 to today.
Description of problem:
While installing clusters with the assisted installer, we have lately seen cases where one master joins very quickly and starts all the pods needed for cluster bootstrap to finish, but the second one joins only after that. Keepalived can't start if there is only one joined master, as it doesn't have enough data to build its configuration files. In HA mode, cluster bootstrap should wait for at least 2 joined masters before removing the bootstrap control plane, as without that the installation will fail. (A rough sketch of the waiting logic is shown under Additional info below.)
Version-Release number of selected component (if applicable):
How reproducible:
Start bm installation and start one master, wait till it starts all required pods and then add others.
Steps to Reproduce:
1. Start bm installation
2. Start one master
3. Wait till it starts all required pods.
4. Add others
Actual results:
no vip, installation fails
Expected results:
installation succeeds, vip moves to master
Additional info:
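A rough sketch (assumptions only, not the actual cluster-bootstrap change) of what waiting for a second control-plane node could look like; the function name, label selector, and timeouts are illustrative:
```
package bootstrap

import (
	"context"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// waitForSecondMaster blocks until at least two control-plane nodes have joined,
// so keepalived has enough peers to build its configuration before the
// bootstrap control plane is removed.
func waitForSecondMaster(ctx context.Context, client kubernetes.Interface) error {
	return wait.PollUntilContextTimeout(ctx, 10*time.Second, 60*time.Minute, true,
		func(ctx context.Context) (bool, error) {
			nodes, err := client.CoreV1().Nodes().List(ctx, metav1.ListOptions{
				LabelSelector: "node-role.kubernetes.io/master",
			})
			if err != nil {
				return false, nil // tolerate transient API errors and keep polling
			}
			return len(nodes.Items) >= 2, nil
		})
}
```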
P-01-TC03 | On Second run, script worked fine |
P-01-TC06 | created separate functions for docker file page |
P-01-TC09 | Removing this test case, by updating P-04-TC04 test scenario Updating pipelines section title in side bar |
This is a clone of issue OCPBUGS-2523. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-2451. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-2181. The following is the description of the original issue:
—
Description of problem:
The E2E test "Installs Red Hat Integration - 3scale operator" is failing due to a change of the Operator name
This is a clone of issue OCPBUGS-4948. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-4799. The following is the description of the original issue:
—
[This issue is for a backport to 4.10.z for our CI. This issue was already addressed for 4.11+ in https://github.com/openshift/operator-framework-olm/pull/285]
Description of problem:
When installing an operator, OLM creates an "operator" Custom Resource whose status will be updated to contain a list of resources associated with the operator. This is done by labeling each resource associated with an operator with a label based off this code: https://github.com/operator-framework/operator-lifecycle-manager/blob/7eccf5342199b88f4657b6c996d4e66d9fa978fa/pkg/controller/operators/decorators/operator.go#L92-L105
Version-Release number of selected component (if applicable):
4.8
How reproducible:
Always
Steps to Reproduce:
1. Create a subscription named managed-node-metadata-operator in the openshift-managed-node-metadata-operator namespace, which causes the truncated label to end on `-`, which is an illegal character.
2. Watch the OLM Operator logs.
Actual results:
The adoption controller within OLM continuously fails to adopt the subscription due to an illegal label value: {"level":"error","ts":1670862754.2096953,"logger":"controllers.adoption","msg":"Error adopting Subscription","request":"openshift-managed-node-metadata-operator/managed-node-metadata-operator","error":"Subscription.operators.coreos.com \"managed-node-metadata-operator\" is invalid: metadata.labels: Invalid value: \"operators.coreos.com/managed-node-metadata-operator.openshift-managed-node-metadata-\": name part must consist of alphanumeric characters, '-', '_' or '.', and must start and end with an alphanumeric character (e.g. 'MyName', or 'my.name', or '123-abc', regex used for validation is '([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9]')","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/build/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:114\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/build/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:311\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/build/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/build/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:227"} {"level":"error","ts":1670862754.2097518,"logger":"controller.subscription","msg":"Reconciler error","reconciler group":"operators.coreos.com","reconciler kind":"Subscription","name":"managed-node-metadata-operator","namespace":"openshift-managed-node-metadata-operator","error":"Subscription.operators.coreos.com \"managed-node-metadata-operator\" is invalid: metadata.labels: Invalid value: \"operators.coreos.com/managed-node-metadata-operator.openshift-managed-node-metadata-\": name part must consist of alphanumeric characters, '-', '_' or '.', and must start and end with an alphanumeric character (e.g. 'MyName', or 'my.name', or '123-abc', regex used for validation is '([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9]')","errorCauses":[{"error":"Subscription.operators.coreos.com \"managed-node-metadata-operator\" is invalid: metadata.labels: Invalid value: \"operators.coreos.com/managed-node-metadata-operator.openshift-managed-node-metadata-\": name part must consist of alphanumeric characters, '-', '_' or '.', and must start and end with an alphanumeric character (e.g. 'MyName', or 'my.name', or '123-abc', regex used for validation is '([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9]')"} ],"stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(Controller).Start.func2.2\n\t/build/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:227"}
Expected results:
The adoption controller creates a label that can be applied to the subscription so it may be "adopted" by the controller.
Additional info:
This was originally fixed in 4.11 here: https://github.com/operator-framework/operator-lifecycle-manager/pull/2731
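For reference, the general shape of such a fix is to trim trailing non-alphanumeric characters after truncating the value to the 63-character label limit; this is an illustrative sketch, not the code from the linked PR:
```
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// truncateLabelValue cuts a label value to the Kubernetes 63-character limit
// and trims trailing characters that are not alphanumeric, so the result
// never ends on '-' or '.', which would make it an invalid label value.
func truncateLabelValue(v string) string {
	const maxLabelLength = 63
	if len(v) > maxLabelLength {
		v = v[:maxLabelLength]
	}
	return strings.TrimRightFunc(v, func(r rune) bool {
		return !unicode.IsLetter(r) && !unicode.IsDigit(r)
	})
}

func main() {
	// The operator/namespace combination from this bug truncates to a value
	// ending in '-', which the trim step drops.
	fmt.Println(truncateLabelValue("managed-node-metadata-operator.openshift-managed-node-metadata-operator"))
}
```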
The OWNERS files for multiple branches in the openshift/jenkins repository need to be updated to reflect current team members for approvals.
P-09-TC01, P-09-TC04, P-09-TC05, P-09-TC06, P-09-TC07, P-09-TC11 test scripts update required
Page objects updated for pipelines
This section includes Jira cards that are not linked to either an Epic or a Feature. These tickets were not completed when this image was assembled
Bump prometheus-operator to 0.77.2
Please review the following PR: https://github.com/openshift/prometheus-operator/pull/260
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
https://github.com/prometheus/prometheus/pull/14446 is a fix for https://github.com/prometheus/prometheus/issues/14087 (see there for details). The issue was introduced in Prometheus 2.51.0: https://github.com/openshift/cluster-monitoring-operator/blob/master/Documentation/deps-versions.md
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Please review the following PR: https://github.com/openshift/cluster-bootstrap/pull/100
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/prometheus-operator/pull/266
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/cluster-samples-operator/pull/527
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
In https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-shiftstack-ci-release-4.18-e2e-openstack-ovn-etcd-scaling/1834144693181485056 I noticed the following panic:
Undiagnosed panic detected in pod expand_less 0s { pods/openshift-monitoring_prometheus-k8s-1_prometheus_previous.log.gz:ts=2024-09-12T09:30:09.273Z caller=klog.go:124 level=error component=k8s_client_runtime func=Errorf msg="Observed a panic: &runtime.TypeAssertionError{_interface:(*abi.Type)(0x3180480), concrete:(*abi.Type)(0x34a31c0), asserted:(*abi.Type)(0x3a0ac40), missingMethod:\"\"} (interface conversion: interface {} is cache.DeletedFinalStateUnknown, not *v1.Node)\ngoroutine 13218 [running]:\nk8s.io/apimachinery/pkg/util/runtime.logPanic({0x32f1080, 0xc05be06840})\n\t/go/src/github.com/prometheus/prometheus/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:75 +0x90\nk8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0xc010ef6000?})\n\t/go/src/github.com/prometheus/prometheus/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:49 +0x6b\npanic({0x32f1080?, 0xc05be06840?})\n\t/usr/lib/golang/src/runtime/panic.go:770 +0x132\ngithub.com/prometheus/prometheus/discovery/kubernetes.NewEndpoints.func11({0x34a31c0?, 0xc05bf3a580?})\n\t/go/src/github.com/prometheus/prometheus/discovery/kubernetes/endpoints.go:170 +0x4e\nk8s.io/client-go/tools/cache.ResourceEventHandlerFuncs.OnDelete(...)\n\t/go/src/github.com/prometheus/prometheus/vendor/k8s.io/client-go/tools/cache/controller.go:253\nk8s.io/client-go/tools/cache.(*processorListener).run.func1()\n\t/go/src/github.com/prometheus/prometheus/vendor/k8s.io/client-go/tools/cache/shared_informer.go:977 +0x9f\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0x30?)\n\t/go/src/github.com/prometheus/prometheus/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:226 +0x33\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc00fc92f70, {0x456ed60, 0xc031a6ba10}, 0x1, 0xc015a04fc0)\n\t/go/src/github.com/prometheus/prometheus/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:227 +0xaf\nk8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc011678f70, 0x3b9aca00, 0x0, 0x1, 0xc015a04fc0)\n\t/go/src/github.com/prometheus/prometheus/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:204 +0x7f\nk8s.io/apimachinery/pkg/util/wait.Until(...)\n\t/go/src/github.com/prometheus/prometheus/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:161\nk8s.io/client-go/tools/cache.(*processorListener).run(0xc04c607440)\n\t/go/src/github.com/prometheus/prometheus/vendor/k8s.io/client-go/tools/cache/shared_informer.go:966 +0x69\nk8s.io/apimachinery/pkg/util/wait.(*Group).Start.func1()\n\t/go/src/github.com/prometheus/prometheus/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:72 +0x52\ncreated by k8s.io/apimachinery/pkg/util/wait.(*Group).Start in goroutine 12933\n\t/go/src/github.com/prometheus/prometheus/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:70 +0x73\n"}
This issue seems relatively common on OpenStack; these runs very frequently hit this failure.
Linked test name: Undiagnosed panic detected in pod
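The panic happens because the delete handler assumes every deleted object is a *v1.Node, while informers can also deliver cache.DeletedFinalStateUnknown tombstones. A hedged sketch of the usual unwrapping pattern (not the upstream Prometheus patch):
```
package nodewatch

import (
	v1 "k8s.io/api/core/v1"
	"k8s.io/client-go/tools/cache"
)

// onNodeDelete unwraps DeletedFinalStateUnknown tombstones before the type
// assertion, so a missed delete event cannot trigger a TypeAssertionError.
func onNodeDelete(obj interface{}, handle func(*v1.Node)) {
	if tombstone, ok := obj.(cache.DeletedFinalStateUnknown); ok {
		obj = tombstone.Obj
	}
	node, ok := obj.(*v1.Node)
	if !ok {
		return // unexpected object type; drop instead of panicking
	}
	handle(node)
}
```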
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Please review the following PR: https://github.com/openshift/prometheus/pull/203
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
The issue was found when verifying OCPBUGS-21637: the prometheus-operator-admission-webhook logs are very verbose.
$ oc -n openshift-monitoring get pod -l app.kubernetes.io/name=prometheus-operator-admission-webhook
NAME                                                      READY   STATUS    RESTARTS   AGE
prometheus-operator-admission-webhook-5d96cbcbfc-6lx4m    1/1     Running   0          56m
prometheus-operator-admission-webhook-5d96cbcbfc-jj66x    1/1     Running   0          53m
$ oc -n openshift-monitoring logs prometheus-operator-admission-webhook-5d96cbcbfc-6lx4m
level=info ts=2023-11-06T01:50:33.617049649Z caller=main.go:140 address=[::]:8443 msg="Starting TLS enabled server" http2=false
ts=2023-11-06T01:50:34.601774794Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)"
ts=2023-11-06T01:50:40.439015896Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)"
ts=2023-11-06T01:50:40.43925044Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)"
(the same "superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" message then repeats, twice roughly every 10 seconds, for the remainder of the log)
(main.go:173)" ts=2023-11-06T01:55:40.427502475Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T01:55:50.426456346Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T01:55:50.426610051Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T01:56:00.426520596Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T01:56:00.426676076Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T01:56:10.435077603Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T01:56:10.435135319Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T01:56:20.427693249Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T01:56:20.428171589Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T01:56:30.428760772Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T01:56:30.428828762Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T01:56:40.428545666Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T01:56:40.429005303Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T01:56:50.426103842Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T01:56:50.426578009Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T01:57:00.427041793Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T01:57:00.427482797Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T01:57:10.427963834Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T01:57:10.428440451Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T01:57:20.428877932Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T01:57:20.428945521Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T01:57:30.426157935Z caller=stdlib.go:105 caller=server.go:3215 msg="http: 
superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T01:57:30.426639545Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T01:57:40.42875961Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T01:57:40.42884264Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T01:57:50.426450177Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T01:57:50.426939532Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T01:58:00.428456873Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T01:58:00.428904131Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T01:58:10.428931448Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T01:58:10.428987646Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T01:58:20.429377819Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T01:58:20.4294396Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T01:58:30.428108184Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T01:58:30.428580595Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T01:58:40.426962512Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T01:58:40.427429076Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T01:58:50.429177401Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T01:58:50.429637834Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T01:59:00.428197981Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T01:59:00.428655487Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T01:59:10.426418388Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T01:59:10.426908577Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T01:59:20.426705875Z 
caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T01:59:20.427197531Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T01:59:30.427909675Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T01:59:30.428395421Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T01:59:40.429100447Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T01:59:40.429871853Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T01:59:50.4268663Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T01:59:50.427329161Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T02:00:00.429149297Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T02:00:00.429205811Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T02:00:10.426857098Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T02:00:10.427290243Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T02:00:20.42638474Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T02:00:20.426901703Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T02:00:30.428885162Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T02:00:30.429373666Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T02:00:40.427093878Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T02:00:40.427622056Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T02:00:50.428691098Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T02:00:50.428743261Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T02:01:00.426355861Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T02:01:00.42685464Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 
(main.go:173)" ts=2023-11-06T02:01:10.426208743Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T02:01:10.426710363Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T02:01:20.426872491Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T02:01:20.42731801Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T02:01:30.426612427Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T02:01:30.427084214Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T02:01:40.428796629Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T02:01:40.429400491Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T02:01:50.427001992Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T02:01:50.42827597Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T02:02:00.428013056Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T02:02:00.428469744Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T02:02:10.426711057Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T02:02:10.427247058Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T02:02:20.429136255Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T02:02:20.429208369Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T02:02:30.427158806Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T02:02:30.427593326Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T02:02:40.426389918Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T02:02:40.426875768Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T02:02:50.429551365Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T02:02:50.429619241Z caller=stdlib.go:105 caller=server.go:3215 msg="http: 
superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T02:03:00.426621326Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T02:03:00.427126079Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T02:03:10.426301507Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T02:03:10.426803336Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T02:03:13.952615577Z caller=stdlib.go:105 caller=server.go:3215 msg="http: TLS handshake error from 10.130.0.1:52552: EOF" ts=2023-11-06T02:03:20.426371089Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T02:03:20.426852234Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T02:03:30.428789504Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T02:03:30.428874536Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T02:03:40.427028458Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T02:03:40.427463333Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T02:03:50.429615112Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T02:03:50.429679407Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T02:04:00.4285878Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T02:04:00.429074488Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T02:04:10.4279579Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T02:04:10.428403727Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T02:04:20.426433063Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T02:04:20.426940057Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T02:04:30.428317498Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T02:04:30.428730147Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T02:04:40.42911069Z caller=stdlib.go:105 
caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T02:04:40.429194383Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T02:04:50.42820753Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T02:04:50.428643464Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T02:05:00.427890872Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T02:05:00.428356508Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from ...
Version-Release number of selected component (if applicable):
4.15.0-0.nightly-2023-11-04-120954
How reproducible:
always
Steps to Reproduce:
1. check prometheus-operator-admission-webhook logs
Actual results:
The prometheus-operator-admission-webhook logs are flooded with the repeated "superfluous response.WriteHeader" message.
The samples operator in OKD refers to docker.io/openshift/wildfly images, which are no longer available. The library sync should update the samples to use quay.io links.
Please review the following PR: https://github.com/openshift/cluster-samples-operator/pull/517
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/prometheus/pull/195
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
The prometheus operator fails to reconcile when proxy settings like no_proxy are set in the Alertmanager configuration secret.
Version-Release number of selected component (if applicable):
4.15.z and later
How reproducible:
Always when AlertmanagerConfig is enabled
Steps to Reproduce:
1. Enable UWM with AlertmanagerConfig in the cluster-monitoring-config ConfigMap:
   enableUserWorkload: true
   alertmanagerMain:
     enableUserAlertmanagerConfig: true
2. Edit the "alertmanager.yaml" key in the alertmanager-main secret (see attached configuration file).
3. Wait for a couple of minutes.
Actual results:
Monitoring ClusterOperator goes Degraded=True.
Expected results:
No error
Additional info:
The Prometheus operator logs show that it doesn't understand the proxy_from_environment field. The newer proxy fields have been supported since Alertmanager v0.26.0, which is what ships with OCP 4.15 and later.
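For reference, a minimal sketch (not the attached customer configuration; the receiver name and webhook URL are placeholders) of an alertmanager.yaml receiver using the newer proxy fields that the operator fails to validate:

  receivers:
    - name: example-webhook                          # hypothetical receiver name
      webhook_configs:
        - url: https://webhook.example.com/alerts    # placeholder endpoint
          http_config:
            # Newer proxy setting accepted by Alertmanager >= v0.26.0; the operator's
            # validation in this bug does not recognize it and reconciliation fails.
            proxy_from_environment: true
            # alternatively, an explicit proxy with exclusions:
            # proxy_url: http://proxy.example.com:3128
            # no_proxy: .svc,.cluster.local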
Please review the following PR: https://github.com/openshift/prometheus/pull/187
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/prometheus-operator/pull/289
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Bump prometheus-operator to 0.75.1 in CMO and the downstream fork.
Please review the following PR: https://github.com/openshift/oauth-proxy/pull/265
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/prometheus-operator/pull/258
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/prometheus-alertmanager/pull/75
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/prometheus-operator/pull/270
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
While debugging a problem, I noticed some containers lack FallbackToLogsOnError. This is important for debugging via the API. Found via https://github.com/openshift/origin/pull/28547
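For context, FallbackToLogsOnError is a value of the container-level terminationMessagePolicy field. A minimal sketch of what setting it looks like in a pod or deployment spec (container name and image are placeholders):

  containers:
    - name: example-container                    # placeholder name
      image: quay.io/example/image:latest        # placeholder image
      # With FallbackToLogsOnError, if the container exits with an error and did not
      # write a termination message, the tail of its log is used as the termination
      # message, so the failure reason is visible directly from the API (pod status).
      terminationMessagePolicy: FallbackToLogsOnError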
Remove the github handle of Haoyu Sun from OWNER files of monitoring components.
Bump prometheus-operator to 0.73.2
Description of problem:
Cluster and user-workload Alertmanager instances inadvertently become peered during upgrade.
Version-Release number of selected component (if applicable):
How reproducible:
Infrequently - the customer observed this on 3 clusters out of 15.
Steps to Reproduce:
1. Deploy user workload monitoring:
   config.yaml: |
     enableUserWorkload: true
     prometheusK8s:
2. Deploy the user workload Alertmanager:
   name: user-workload-monitoring-config
   namespace: openshift-user-workload-monitoring
   data:
     config.yaml: |
       alertmanager:
         enabled: true
3. Upgrade the cluster.
4. Verify the state of the Alertmanager clusters:
   $ oc exec -n openshift-monitoring alertmanager-main-0 -- amtool cluster show -o json --alertmanager.url=http://localhost:9093
Actual results:
alertmanager show 4 peers
Expected results:
we should have 2 pairs
Additional info:
Mitigation steps: Scaling down one of the alertmanager statefulsets to 0 and then scaling up again restores the expected configuration (i.e. 2 separate alertmanager clusters) - the customer then added networkpolicies to prevent alertmanager gossip between namespaces.
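The NetworkPolicy side of that mitigation could look roughly like the sketch below. This is a hypothetical illustration, not the customer's actual policy: it only allows ingress to the user-workload Alertmanager pods from pods in the same namespace, so gossip (port 9094) from the openshift-monitoring Alertmanager is blocked; a real policy would also need explicit allow rules for any legitimate cross-namespace clients of these pods.

  apiVersion: networking.k8s.io/v1
  kind: NetworkPolicy
  metadata:
    name: restrict-alertmanager-gossip           # hypothetical name
    namespace: openshift-user-workload-monitoring
  spec:
    # Select only the Alertmanager pods in this namespace (label is assumed).
    podSelector:
      matchLabels:
        app.kubernetes.io/name: alertmanager
    policyTypes:
      - Ingress
    ingress:
      # Allow traffic from pods in the same namespace, which covers gossip
      # between the user-workload Alertmanager replicas themselves.
      - from:
          - podSelector: {}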
While debugging a problem, I noticed some containers lack FallbackToLogsOnError. This is important for debugging via the API. Found via https://github.com/openshift/origin/pull/28547
Please review the following PR: https://github.com/openshift/cluster-bootstrap/pull/107
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
We’re unable to find a stable and accessible OAuth-proxy image, which is causing a bug that we haven’t fully resolved yet. Krzys opened a PR to address this, but it isn’t a complete solution since the image path doesn’t seem to be consistently available. Krzys tried referencing the OAuth-proxy image from the openshift namespace, but it didn’t work reliably. There is an imagestream for OAuth-proxy in the openshift namespace that we might be able to reference in tests, but we aren’t certain of the correct Docker URL format for it. It’s also possible that permission issues are why the image isn’t accessible when referenced this way.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Please review the following PR: https://github.com/openshift/prometheus-operator/pull/244
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/oauth-proxy/pull/270
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/prometheus-operator/pull/242
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
4.15 and 4.16
$ oc explain prometheus.spec.remoteWrite.sendExemplars
GROUP:      monitoring.coreos.com
KIND:       Prometheus
VERSION:    v1

FIELD: sendExemplars <boolean>

DESCRIPTION:
    Enables sending of exemplars over remote write. Note that exemplar-storage
    itself must be enabled using the `spec.enableFeature` option for exemplars
    to be scraped in the first place. It requires Prometheus >= v2.27.0.
But there is no `spec.enableFeature` option:
$ oc explain prometheus.spec.enableFeature
GROUP: monitoring.coreos.com
KIND: Prometheus
VERSION: v1
error: field "enableFeature" does not exist
should be `spec.enableFeatures`
$ oc explain prometheus.spec.enableFeatures
GROUP:      monitoring.coreos.com
KIND:       Prometheus
VERSION:    v1

FIELD: enableFeatures <[]string>

DESCRIPTION:
    Enable access to Prometheus feature flags. By default, no features are
    enabled. Enabling features which are disabled by default is entirely
    outside the scope of what the maintainers will support and by doing so,
    you accept that this behaviour may break at any time without notice. For
    more information see https://prometheus.io/docs/prometheus/latest/feature_flags/
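For context, a minimal sketch (resource name and remote-write URL are placeholders) of how the two fields are meant to be used together on a Prometheus resource:

  apiVersion: monitoring.coreos.com/v1
  kind: Prometheus
  metadata:
    name: example                                # placeholder name
  spec:
    # The correct field name is enableFeatures (a list), not enableFeature.
    enableFeatures:
      - exemplar-storage                         # needed so exemplars are stored at all
    remoteWrite:
      - url: https://remote-write.example.com/api/v1/write   # placeholder endpoint
        sendExemplars: true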
Version-Release number of selected component (if applicable):
4.15 and 4.16
How reproducible:
always
Please review the following PR: https://github.com/openshift/prometheus-operator/pull/292
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
See https://github.com/prometheus/prometheus/issues/14503 for more details
Version-Release number of selected component (if applicable):
4.16
How reproducible:
Steps to Reproduce:
1. Make Prometheus scrape a target that exposes multiple samples of the same series with different explicit timestamps, for example:
# TYPE requests_per_second_requests gauge
# UNIT requests_per_second_requests requests
# HELP requests_per_second_requests test-description
requests_per_second_requests 16 1722466225604
requests_per_second_requests 14 1722466226604
requests_per_second_requests 40 1722466227604
requests_per_second_requests 15 1722466228604
# EOF
2. Not all the samples will be ingested.
3. If Prometheus keeps scraping that target for a while, the PrometheusDuplicateTimestamps alert will fire.
Actual results:
Expected results: all the samples should be ingested (of course, if the timestamps are too old or too far in the future, Prometheus may refuse them).
Additional info:
Regression introduced in Prometheus 2.52. Proposed upstream fixes: https://github.com/prometheus/prometheus/pull/14683 https://github.com/prometheus/prometheus/pull/14685
Please review the following PR: https://github.com/openshift/prometheus/pull/226
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/prometheus-operator/pull/259
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
Version-Release number of selected component (if applicable):
4.16.0-0.nightly-2024-06-13-084629
How reproducible:
100%
Steps to Reproduce:
1. Apply the ConfigMap:

   apiVersion: v1
   kind: ConfigMap
   metadata:
     name: cluster-monitoring-config
     namespace: openshift-monitoring
   data:
     config.yaml: |
       prometheusK8s:
         remoteWrite:
         - url: "http://invalid-remote-storage.example.com:9090/api/v1/write"
           queue_config:
             max_retries: 1

2. Check the logs:

   % oc logs -c prometheus prometheus-k8s-0 -n openshift-monitoring
   ...
   ts=2024-06-14T01:28:01.804Z caller=dedupe.go:112 component=remote level=warn remote_name=5ca657 url=http://invalid-remote-storage.example.com:9090/api/v1/write msg="Failed to send batch, retrying" err="Post \"http://invalid-remote-storage.example.com:9090/api/v1/write\": dial tcp: lookup invalid-remote-storage.example.com on 172.30.0.10:53: no such host"

3. Query after 15 minutes:

   % oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://thanos-querier.openshift-monitoring.svc:9091/api/v1/query?' --data-urlencode 'query=ALERTS{alertname="PrometheusRemoteStorageFailures"}' | jq
   {
     "status": "success",
     "data": {
       "resultType": "vector",
       "result": [],
       "analysis": {}
     }
   }

   % oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://thanos-querier.openshift-monitoring.svc:9091/api/v1/query?' --data-urlencode 'query=prometheus_remote_storage_failures_total' | jq
   {
     "status": "success",
     "data": {
       "resultType": "vector",
       "result": [],
       "analysis": {}
     }
   }
Actual results:
The alert did not trigger.
Expected results:
The alert fires, and both the alert and the metrics are visible.
Additional info:
The metrics below show `No datapoints found.`:
prometheus_remote_storage_failures_total
prometheus_remote_storage_samples_dropped_total
prometheus_remote_storage_retries_total
The value of `prometheus_remote_storage_samples_failed_total` is 0.
Description of problem:
In OCP 4.13, when trying to reach the Prometheus UI via port-forward, e.g. `oc port-forward prometheus-k8s-0`, the UI URL ($HOST:9090/graph) returns `Error opening React index.html: open web/ui/static/react/index.html: no such file or directory`. The same applies to OCP 4.14.
Version-Release number of selected component (if applicable):
4.13.0-0.nightly-2023-01-24-061922
How reproducible:
100%
Steps to Reproduce:
1. oc -n openshift-monitoring port-forward prometheus-k8s-0 9090:9090 --address='0.0.0.0'
2. curl http://localhost:9090/graph
Actual results:
Error opening React index.html: open web/ui/static/react/index.html: no such file or directory
Expected results:
Prometheus UI is loaded
Additional info:
The UI loads fine when following the same steps in 4.12.
Description of problem:
4.16.0-0.nightly-2024-04-07-182401 with Prometheus Operator 0.73.0 logs too many warnings of the form "'bearerTokenFile' is deprecated, use 'authorization' instead."; see below.
$ oc -n openshift-monitoring logs -c prometheus-operator deploy/prometheus-operator
level=info ts=2024-04-08T07:06:17.191301889Z caller=main.go:186 msg="Starting Prometheus Operator" version="(version=0.73.0, branch=rhaos-4.16-rhel-9, revision=3541f90)"
level=info ts=2024-04-08T07:06:17.195797026Z caller=main.go:187 build_context="(go=go1.21.7 (Red Hat 1.21.7-1.el9) X:loopvar,strictfipsruntime, platform=linux/amd64, user=root, date=20240405-12:29:19, tags=strictfipsruntime)"
level=info ts=2024-04-08T07:06:17.195888428Z caller=main.go:198 msg="namespaces filtering configuration " config="{allow_list=\"\",deny_list=\"\",prometheus_allow_list=\"openshift-monitoring\",alertmanager_allow_list=\"openshift-monitoring\",alertmanagerconfig_allow_list=\"\",thanosruler_allow_list=\"openshift-monitoring\"}"
level=info ts=2024-04-08T07:06:17.212735844Z caller=main.go:227 msg="connection established" cluster-version=v1.29.3+e994e5d
level=warn ts=2024-04-08T07:06:17.228748881Z caller=main.go:75 msg="resource \"scrapeconfigs\" (group: \"monitoring.coreos.com/v1alpha1\") not installed in the cluster"
level=info ts=2024-04-08T07:06:17.2563
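As background on the deprecation itself, a hedged sketch of a ServiceMonitor endpoint moving from the deprecated field to the newer one (names and the Secret reference are placeholders; note that authorization.credentials points at a Secret key rather than a file path, so it is not a drop-in replacement for file-based tokens):

  apiVersion: monitoring.coreos.com/v1
  kind: ServiceMonitor
  metadata:
    name: example-servicemonitor                 # placeholder name
    namespace: openshift-monitoring
  spec:
    selector:
      matchLabels:
        app: example                             # placeholder selector
    endpoints:
      - port: metrics
        # Deprecated style that triggers the warning:
        # bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
        # Newer style:
        authorization:
          type: Bearer
          credentials:
            name: example-token-secret           # placeholder Secret holding the token
            key: token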