Back to index

4.8.0-0.ci-2023-07-15-003738

Jump to: Complete Features | Incomplete Features | Complete Epics | Incomplete Epics | Other Complete | Other Incomplete |

Changes from 4.7.60

Note: this page shows the Feature-Based Change Log for a release

Complete Features

These features were completed when this image was assembled

Epic Goal

  • Complete the implementation for AWS STS, including support and documentation.

Why is this important?

  • Many customers want to follow best security practices for handling credentials.
  • This is the way recommended by AWS. 
  • Customer interest: EMEA, AMER

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.

Dependencies (internal and external)

Open questions:

  1. Will this cover existing OCP deployments or only new OCP deployments?
  2. Is there a migration path for existing customers to start using AWS STS?
  3. Are there considerations that apply to Operators so they can work with limited privilege credentials?

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

The tool should be able to upload an OpenID Connect (OIDC) configuration to an S3 bucket, and create an AWS IAM Identity Provider that trusts identities from the OIDC provider. It should take the infra name as input so that the user can identify all the resources created in AWS. Make sure that resources created in AWS are tagged appropriately.

Sample command with existing key pair:

tool-name create identity-provider <infra-name> --public-key ./path/to/public/key

 

Ensure the Identity Provider includes audience config for both the in-cluster components ('openshift') and the pod-identity-webhook ('sts.amazonaws.com').

Epic Goal

  • Support running console in single-node OpenShift configurations for production use in edge computing use cases.
  • Support disabling the console entirely in some of these configurations to reduce overhead in constrained environments.

Why is this important?

  • Some bare metal edge customers, especially in the telco market, want to use kubernetes at physically remote sites with minimal hardware.

Scenarios

  1. As a user, I want to deploy a fully supported instance of OpenShift on a single node.

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • console can be deployed with a single replica

Dependencies (internal and external)

  1. CORS-1589

Previous Work (Optional):

  1. https://github.com/openshift/enhancements/pull/504
  2. https://github.com/openshift/enhancements/pull/560

Open questions::

  1. Should the console configuration API have a separate option for this setting, or should it use the API created from CORS-1589?

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

https://github.com/openshift/enhancements/pull/555
https://github.com/openshift/api/pull/827

The console operator will need to support single-node clusters.

We have a console deployment and a downloads deployment. Each will need to be updated so that there's only a single replica when high availability mode is disabled in the Infrastructure config. We should also remove the anti-affinity rule in the console deployment that tries to spread console pods across nodes.

The downloads deployment is currently a static manifest. That likely needs to be created by the console operator instead going forward.
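For context, a minimal sketch of the Infrastructure config status the operator would consult (fields come from the config.openshift.io/v1 API added via openshift/api#827; the values shown are illustrative for a single-node cluster):

Infrastructure-status.yaml
apiVersion: config.openshift.io/v1
kind: Infrastructure
metadata:
  name: cluster
status:
  # SingleReplica tells operators to deploy one replica and drop anti-affinity rules
  controlPlaneTopology: SingleReplica
  infrastructureTopology: SingleReplica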

Acceptance Criteria:

  • Console operator deploys console with 1 replica and no anti-affinity rules when not in high availability mode
  • Console operator deploys the downloads deployment with 1 replica when not in high availability mode
  • The console and downloads deployments do not change when in high availability mode
  • The feature is well-covered by tests

Epic Goal

  • Support running the image registry services in single-node OpenShift configurations for production use in edge computing use cases.

Why is this important?

  • Some bare metal edge customers, especially in the telco market, want to use kubernetes at physically remote sites with minimal hardware.

Scenarios

  1. As a user, I want to deploy a fully supported instance of OpenShift on a single node.

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

  1. https://github.com/openshift/enhancements/pull/504
  2. https://github.com/openshift/enhancements/pull/560
  3. OCP Single Node Production Edge Profile
  4. We're pretty sure we need the node-ca deployed since our first few customers are using disconnected environments.
  5. It's not clear if we need the image-registry and image-pruner.

Open questions::

  1. ...

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

As an OpenShift administrator
I want the registry operator to use the topology mode from the Infrastructure config (HighAvailable = 2 replicas, SingleReplica = 1 replica)
so that the operator does not spend resources on high availability when it's not needed.

See also:

https://github.com/openshift/enhancements/blob/master/enhancements/cluster-high-availability-mode-api.md
https://github.com/openshift/api/pull/827/files

Number of replicas on different platforms 

Platform SingleReplica HighAvailable
AWS 1 replica 2 replicas
Azure 1 replica 2 replicas
GCP 1 replica 2 replicas
OpenStack (swift) 1 replica 2 replicas
OpenStack (cinder) 1 replica 1 replica (PVC)
oVirt 1 replica 1 replica (PVC)
bare metal Removed Removed
vSphere Removed Removed

 

 Feature Overview

This will be phase 1 of Internationalization of the OpenShift Console.

 Phase 1 will include the following:

  1. UI based language Selector instead of using browser detection
  2. Externalize all hard coded strings in the client code including all OpenShift static plugins
    1. Admin Console
    2. Dev Console
    3. Serverless
    4. Pipelines
    5. CNV
    6. OCS
    7. CSO
  3. Localized Date/Time
  4. Setup all processes, infrastructure, and testing required
  5. We will start with support for the Chinese and Japanese languages

Phase 1 will not include:

  1. Dynamically generated UI (Operator, OpenAPIV3Schema)
    1. Operators that surface informational messages may not have translations available
  2. Strings from non client code
    1. This may include items such as events surfaced from Kubernetes, alerts, and error messages displayed to the user or in logs
  3. Localization of logging messages at any level is not in scope
  4. Any CLI
  5. Language support for right-to-left languages, e.g., Arabic

Initial List of Languages to Support

---------- 4.7* ----------

  1. Japanese - Code: ja 
  2. Chinese - Code: zh_CN, zh_TW 
  3. Korean - Code: ko

*This will be based on the ability to get all the strings externalized; there is a good chance this gets pushed to 4.8.

---------- Post 4.7 ----------

  1. Spanish: - Code: es_419, es 
  2. German: - Code: de
  3. French - Code: fr
  4. Italian - Code: it
  5. Portuguese - Code: pt_BR
  6. Korean - Code: ko
  7. Hindi - Code: hi

POC

 Initial POC PR

Goals

Internationalization has become table stakes. OpenShift Console needs to support different languages in each of the major markets. This is key functionality that will help unlock sales in different regions.

 

Requirements

 

Requirement Notes isMvp?
Language Selector   YES
Localized Date + Time   YES
Externalization and translation of all client side strings   YES
Translation for Chinese and Japanese   YES
Process, infra, and testing capabilities put into place   YES
CI - MUST be running successfully with test automation This is a requirement for ALL features. YES
     

  

Out of Scope

  1. Dynamically generated UI (Operator, OpenAPIV3Schema)
    1. Operators that surface informational messages may not have translations available
  2. Strings from non client code
    1. This may include items such as events surfaced from Kubernetes, alerts, and error messages displayed to the user or in logs
  3. Localization of logging messages at any level is not in scope
  4. Any CLI support
  5. Language support for right-to-left languages, e.g., Arabic

 

Assumptions

  • Each static plugin team will be responsible for externalizing all their client code strings.
  • Quick Starts will need to be translated.

Customer Considerations

We are rolling this feature out in phases; based on customer feedback, there may be no phase 2.

Documentation Considerations

I believe documentation already supports a large language set.

Epic Goal

  • This is the continuation of the Internationalization work... the following items remain:
    • All existing QuickStarts get Translated
    • Automation Completed
    • Any remaining items cleaned up

Why is this important?

  • Automating as much as possible (detecting duplicate strings, building, and pushing translation drops) will ensure we are successful in all future releases
  • Quick Starts are an important part of the product that enables our users to maximize usage of the Console
  • Best to clean up anything left over to reduce future Tech Debt

Acceptance Criteria

  • Quick Starts are translated 
  • Everything is automated for building, and pushing translation drops to the globalization team
  • Source code should be up to quality standards

Previous Work (Optional):

  1. https://issues.redhat.com/browse/CONSOLE-2325

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

We need to automate how we send and receive updated translations using Memsource for the Red Hat Globalization team. The Ansible Tower team already has automation in place that we might be able to reuse.

Acceptance Criteria:

  • We have a script that takes the current messages from the console repos and pushes them to Memsource
  • We have a script that pulls the updated translations from Memsource and creates a PR against openshift/console
  • We work with the DPTP team to determine if this process can be automated such that it runs periodically (e.g. once a sprint)

Feature Overview

OpenShift Sandboxed Containers provide the ability to add an additional layer of isolation through virtualization for many workloads. The main way to enable the use of Kata Containers on an OpenShift cluster is by first installing the Operator (for more information about operator enablement, check [1]).

Once the feature is enabled on the cluster, it is just a matter of a one-line YAML modification at the pod/deployment level to run the workload using Kata Containers. That might sound easy for some, but others who don't care about YAML might want more abstractions for using Kata Containers with their workloads.

This feature covers all the efforts required to integrate and present Kata in the OpenShift UI (console) to cater to all user personas.

Background, and strategic fit

To enable users to adopt Kata as a runtime, it is important to make it easy to use. Adding hook points in the UI with ease of use in mind is one way to bring in more users.

Goal(s)

The main goal of this feature is to make sure that:

  1. It is easy for users to find out how to use/enable Openshift Sandboxed Containers on their clusters (e.g., Getting started guide in the UI).
  2. Cluster-admins are able to differentiate between normal pods and Kata pods.
  3. Developers (application, CNF, ...) have an easy way to create Kata Containers (without peeking at YAMLs)
  4. Application end-users are able to collectively activate Kata on their app packages/content (e.g., Helm, odo, ...)

Documentation Considerations

Questions to be addressed:

  • What educational or reference material (docs) is required to support this product feature? For users/admins? Other functions (security officers, etc)?
  • Does this feature have doc impact?
  • New Content, Updates to existing content, Release Note, or No Doc Impact
  • If unsure and no Technical Writer is available, please contact Content Strategy.
  • What concepts do customers need to understand to be successful in [action]?
  • How do we expect customers will use the feature? For what purpose(s)?
  • What reference material might a customer want/need to complete [action]?
  • Is there source material that can be used as a reference for the Technical Writer in writing the content? If yes, please link if available.
  • What is the doc impact (New Content, Updates to existing content, or Release Note)?

References

[1] https://issues.redhat.com/browse/KATA-429?jql=project %3D KATA AND issuetype %3D Feature

Goal

The grand goal is to improve the usability of Kata from the OpenShift UI. This epic aims to cover only a subset that would help:

  • Make it easy to differentiate between native cluster runtime (e.g., runC) and kata.
  • Enable Kata as a runtime without modifying YAMLs.

To use a different runtime, e.g., Kata, "runtimeClassName" must be set to the desired low-level runtime. Also see [1]:

"RuntimeClassName refers to a RuntimeClass object in the node.k8s.io group, which should be used to run this pod. If no RuntimeClass resource matches the named class, the pod will not be run. If unset or empty, the "legacy" RuntimeClass will be used, which is an implicit class with an empty definition that uses the default runtime handler. More info: https://git.k8s.io/enhancements/keps/sig-node/runtime-class.md This is a beta feature as of Kubernetes v1.14.." 

 

Pod-Runtimeclass.yaml
apiVersion: v1
kind: Pod
metadata:
  name: nginx-runc
spec:
  runtimeClassName: runC
  containers:        # a Pod spec requires at least one container
  - name: nginx
    image: nginx

 

The value of the runtime class cannot be changed on the pod level, but it can be changed on the deployment level 

 

Deployment-Runtimeclass.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sandboxed-nginx
spec:
  replicas: 2
  selector:
    matchLabels:
      app: sandboxed-nginx
  template:
    metadata:
      labels:
        app: sandboxed-nginx
    spec:
      runtimeClassName: kata # ---> This can be changed
      containers:
      - name: nginx
        image: nginx
        ports:
        - containerPort: 80
          protocol: TCP
          

User-stories

  • As a cluster-admin, I would like to be able to differentiate between a normal pod and a Kata Containers pod from the UI.
  • As a developer, I would like to create Kata Containers-based pods without dealing with YAML, i.e., from the UI.
  • As a developer, I would like to switch my deployments to use Kata instead of runC (native).

Requirements

  • Kata runtime MUST be viewable when checking running workloads.
  • A checkbox or a similar method to create Kata Containers from the UI MUST be provided.
  • The above two requirements MUST be tested.

 

 

References

 [1] https://docs.openshift.com/container-platform/4.6/rest_api/workloads_apis/pod-core-v1.html 
 

We should show the runtime class on workloads pages and add a badge to the heading in the case a workload uses Kata. A workload uses Kata if its pod template has `runtimeClassName` set to `kata`.

Acceptance Criteria:

  • Kata runtime must be viewable when checking running workloads, including Pods, ReplicaSets, ReplicationControllers, StatefulSets, Deployments, and DeploymentConfigs.
  • Automated test must be written to verify coverage

 

Andrew Ronaldson indicated that adding a "kata" badge in the heading would be too much noise around other heading badges (ContainerCreating, Failed, etc).

 

 

Why?

  • Decouple control and data plane. 
    • Customers do not pay Red Hat more to run HyperShift control planes and supporting infrastructure than Standalone control planes and supporting infrastructure.
  • Improve security
    • Shift credentials out of cluster that support the operation of core platform vs workload
  • Improve cost
    • Allow a user to toggle what they don’t need.
    • Ensure a smooth path to scale to 0 workers and upgrade with 0 workers.

 

Assumption

  • A customer will be able to associate a cluster as “Infrastructure only”
  • E.g. one option: management cluster has role=master, and role=infra nodes only, control planes are packed on role=infra nodes
  • OR the entire cluster is labeled infrastructure, and node roles are ignored.
  • Anything that runs on a master node by default in Standalone that is present in HyperShift MUST be hosted and not run on a customer worker node.

 

 

Doc: https://docs.google.com/document/d/1sXCaRt3PE0iFmq7ei0Yb1svqzY9bygR5IprjgioRkjc/edit 

Epic Goal

  • To improve the debuggability of ovn-k in HyperShift
  • To verify the stability of ovn-k in HyperShift
  • To introduce an EgressIP reachability check that will work in HyperShift

Why is this important?

  • ovn-k is supposed to be GA in 4.12. We need to make sure it is stable, that we know the limitations, and that we are able to debug it similarly to a self-hosted cluster.

Acceptance Criteria

  • CI - MUST be running successfully with tests automated

Dependencies (internal and external)

  1. This will need consultation with the people working on HyperShift

Previous Work (Optional):

  1. https://issues.redhat.com/browse/SDN-2589

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

CNCC was moved to the management cluster and it should use proxy settings defined for the management cluster.

Review the OVN Interconnect proposal, figure out the work that needs to be done in ovn-kubernetes to be able to move to this new OVN architecture. 

Why is this important?

OVN IC will be the model used in Hypershift. 

OCP/Telco Definition of Done
Epic Template descriptions and documentation.


Epic Goal

  • Review the OVN Interconnect proposal, figure out the work that needs to be done in ovn-kubernetes to be able to move to this new OVN architecture. 

Why is this important?

OVN IC will be the model used in Hypershift. 

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Feature Overview

RHEL CoreOS should be updated to RHEL 9.2 sources to take advantage of newer features, hardware support, and performance improvements.

 

Requirements

  • RHEL 9.x sources for RHCOS builds starting with OCP 4.13 and RHEL 9.2.

 

Requirement Notes isMvp?
CI - MUST be running successfully with test automation This is a requirement for ALL features. YES
Release Technical Enablement Provide necessary release enablement details and documents. YES

(Optional) Use Cases

  • 9.2 Preview via Layering: no longer necessary, assuming we stay the course of going all in on 9.2

Assumptions

  • ...

Customer Considerations

  • ...

Documentation Considerations

Questions to be addressed:

  • What educational or reference material (docs) is required to support this product feature? For users/admins? Other functions (security officers, etc)?
  • Does this feature have doc impact?
  • New Content, Updates to existing content, Release Note, or No Doc Impact
  • If unsure and no Technical Writer is available, please contact Content Strategy.
  • What concepts do customers need to understand to be successful in [action]?
  • How do we expect customers will use the feature? For what purpose(s)?
  • What reference material might a customer want/need to complete [action]?
  • Is there source material that can be used as reference for the Technical Writer in writing the content? If yes, please link if available.
  • What is the doc impact (New Content, Updates to existing content, or Release Note)?

PROBLEM

We would like to improve our signal for RHEL9 readiness by increasing internal engineering engagement and external partner engagement on our community OpenShift offering, OKD.

PROPOSAL

Adding OKD running on SCOS (CentOS Stream CoreOS) brings the community offering closer to what a partner or an internal engineering team might expect on OCP.

ACCEPTANCE CRITERIA

Image has been switched/included: 

DEPENDENCIES

The SCOS build payload.

RELATED RESOURCES

OKD+SCOS proposal: https://docs.google.com/presentation/d/1_Xa9Z4tSqB7U2No7WA0KXb3lDIngNaQpS504ZLrCmg8/edit#slide=id.p

OKD+SCOS work draft: https://docs.google.com/document/d/1cuWOXhATexNLWGKLjaOcVF4V95JJjP1E3UmQ2kDVzsA/edit

 

Acceptance Criteria

A stable OKD on SCOS is built and available to the community every sprint.

 

This comes up when installing ipi-on-aws on arm64 with the custom payload build at quay.io/aleskandrox/okd-release:4.12.0-0.okd-centos9-full-rebuild-arm64, which uses SCOS as the machine-os-content image.

 

```

[root@ip-10-0-135-176 core]# crictl logs c483c92e118d8
2022-08-11T12:19:39+00:00 [cnibincopy] FATAL ERROR: Unsupported OS ID=scos
```

 

The probable fix has to land on https://github.com/openshift/cluster-network-operator/blob/master/bindata/network/multus/multus.yaml#L41-L53

 

Feature Overview

  • Kubernetes offers different ways to consume storage: one can request persistent volumes that survive pod termination, or ask for ephemeral storage space that is consumed during the lifetime of the pod.
  • This feature tracks the improvements around ephemeral storage, as some workloads rely on reliable temporary storage space: batch jobs, caching services, or any app that does not care whether the data is stored persistently across restarts.

Goals

 

As described in the Kubernetes "ephemeral volumes" documentation, this feature tracks GA and improvements in

OCPPLAN-9193 implemented local ephemeral capacity management as well as CSI generic ephemeral volumes. This feature tracks the remaining work to GA CSI ephemeral in-line volumes, especially the admission plugin to make the feature secure and prevent any insecure driver from using it. Ephemeral in-line volumes are required by some CSI drivers as a key feature to operate (e.g. SecretStore CSI); ODF is also planning to GA ephemeral in-line with Ceph CSI.

Requirements

Requirement Notes isMvp?
CI - MUST be running successfully with test automation This is a requirement for ALL features. YES
Release Technical Enablement Provide necessary release enablement details and documents. YES

Use Cases

This Section:

  • As an OCP user I want to consume ephemeral storage for my workload
  • As an OCP user I would like to include my PV definition directly in my app definition
  • As an OCP admin I would like to offer ephemeral volumes to my users through CSI
  • As a partner I would like to onboard a driver that relies on CSI inline volumes

Customer Considerations

  • Make sure each ephemeral volume option is clearly identified and documented for each purpose.
  • Make sure we highlight ephemeral volume options that require a specific driver support

Goal: 

The goal is to provide inline volume support (also known as ephemeral volumes) via a CSI driver/operator. This epic also tracks the development of the new admission plugin required to make inline volumes safe.

 

Problem: 

  • The only practical way to extend pods such that node-local integrations can happen is with inline volumes. So if we want to integrate with IAM for per-pod credentials, we need inline CSI volumes. If we want to do better build cache integration, we need inline CSI.

 

Why is this important: 

  • (from https://kubernetes-csi.github.io/docs/ephemeral-local-volumes.html) Traditionally, volumes that are backed by CSI drivers can only be used with a PersistentVolume and PersistentVolumeClaim object combination. This feature will support ephemeral storage use cases and allows CSI volumes to be specified directly in the pod specification. At runtime, nested inline volumes follow the ephemeral lifecycle of their associated pods where the driver handles all phases of volume operations as pods are created and destroyed.
  • Vault integration can be implemented via in-line volumes (see https://github.com/deislabs/secrets-store-csi-driver/blob/master/README.md).
  • Inline volumes would allow us to give out tokens for cloud integration and nuke cloud credential operator’s use of secrets.
  • In OpenShift we already have Shared Resource CSI driver, which uses in-line CSI volumes to distribute cluster-wide secrets and/or config maps.

 

Dependencies (internal and external):

  • CSI API

 

Prioritized epics + deliverables (in scope / not in scope):

  • In Scope
    • A working CSI based inline volume
    • Documentation
    • Admission plugin
  • Not in Scope
    • Implementing the use cases for inline volumes (i.e. integration with IAM)

Estimate (XS, S, M, L, XL, XXL):

 

Previous Work:

Customers:

Open questions:

 

Notes:

 

Create a validating admission plugin that allows pods to be created if:

  • The pod references CSI volume(s)
  • The CSI Driver(s) for the referenced volume(s) are allowed based on the PodSecurity namespace labels.

Enhancement: https://github.com/openshift/enhancements/blob/master/enhancements/storage/csi-inline-vol-security.md 
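For illustration, a minimal sketch of a pod requesting a CSI in-line (ephemeral) volume; the driver name and volume attributes are hypothetical and would be subject to the admission plugin's checks:

inline-csi-pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: inline-csi-example
spec:
  containers:
  - name: app
    image: registry.access.redhat.com/ubi8/ubi
    command: ["sleep", "infinity"]
    volumeMounts:
    - name: inline-vol
      mountPath: /mnt/data
  volumes:
  - name: inline-vol
    csi:                               # in-line (ephemeral) CSI volume, no PVC/PV involved
      driver: example.csi.vendor.com   # hypothetical driver; must be allowed by the admission plugin
      volumeAttributes:
        secretProviderClass: my-provider   # driver-specific attribute (illustrative)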

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Create a warning-severity alert to notify the admin that packet loss is occurring due to failed OVS vswitchd lookups. This may occur if vswitchd is CPU constrained and there are also numerous lookups.

Use the metric ovs_vswitchd_netlink_overflow, which shows netlink messages dropped by the vswitchd daemon due to buffer overflow in userspace.

For the kernel equivalent, use the metric ovs_vswitchd_dp_flows_lookup_lost. Both metrics usually have the same value but may differ if vswitchd restarts.

Both of these metrics should be aggregated into a single alert that fires if the value has increased recently.
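A rough sketch of what such an alert could look like as a PrometheusRule; the metric names come from the description above, while the alert name, expression window, and threshold are illustrative assumptions:

ovs-packet-loss-alert.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: ovs-vswitchd-packet-loss
  namespace: openshift-ovn-kubernetes
spec:
  groups:
  - name: ovs-vswitchd.rules
    rules:
    - alert: OVSVSwitchdPacketLoss          # illustrative alert name
      expr: |
        increase(ovs_vswitchd_netlink_overflow[5m])
          + increase(ovs_vswitchd_dp_flows_lookup_lost[5m]) > 0
      for: 15m
      labels:
        severity: warning
      annotations:
        summary: OVS vswitchd is dropping packets due to failed lookups.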

 

DoD: QE test case, code merged to CNO, metrics document updated ( https://docs.google.com/document/d/1lItYV0tTt5-ivX77izb1KuzN9S8-7YgO9ndlhATaVUg/edit )

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The ovnk manifests in CNO are not up to date; we want to sync them with the manifests in the MicroShift repo.


Feature Overview

Extend the Workload Partitioning feature to support multi-node clusters.

Goals

Customers running RAN workloads on C-RAN Hubs (i.e. multi-node clusters) that want to maximize the cores available to the workloads (DU) should be able to utilize workload partitioning to isolate control-plane processes to reserved cores.
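Assuming the same mechanism used for single-node workload partitioning is extended to multi-node clusters, this could be requested at install time roughly as follows (the field and value are shown for illustration; exact naming may differ):

install-config.yaml (excerpt)
apiVersion: v1
metadata:
  name: example-cluster
# Enable workload partitioning on all nodes so control plane / management
# workloads are pinned to the reserved CPU set
cpuPartitioningMode: AllNodes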

Requirements

A list of specific needs or objectives that a Feature must deliver to satisfy the Feature. Some requirements will be flagged as MVP. If an MVP gets shifted, the feature shifts.  If a non MVP requirement slips, it does not shift the feature.

requirement Notes isMvp?
     
     
     

 

Describe Use Cases (if needed)

< How will the user interact with this feature? >

< Which users will use this and when will they use it? >

< Is this feature used as part of current user interface? >

Out of Scope

 

Background, and strategic fit

< What does the person writing code, testing, documenting need to know? >

Assumptions

< Are there assumptions being made regarding prerequisites and dependencies?>

< Are there assumptions about hardware, software or people resources?>

Customer Considerations

< Are there specific customer environments that need to be considered (such as working with existing h/w and software)?>

< Are there Upgrade considerations that customers need to account for or that the feature should address on behalf of the customer?>

<Does the Feature introduce data that could be gathered and used for Insights purposes?>

Documentation Considerations

< What educational or reference material (docs) is required to support this product feature? For users/admins? Other functions (security officers, etc)? >

< What does success look like?>

< Does this feature have doc impact?  Possible values are: New Content, Updates to existing content,  Release Note, or No Doc Impact>

< If unsure and no Technical Writer is available, please contact Content Strategy. If yes, complete the following.>

  • <What concepts do customers need to understand to be successful in [action]?>
  • <How do we expect customers will use the feature? For what purpose(s)?>
  • <What reference material might a customer want/need to complete [action]?>
  • <Is there source material that can be used as reference for the Technical Writer in writing the content? If yes, please link if available. >
  • <What is the doc impact (New Content, Updates to existing content, or Release Note)?>

Interoperability Considerations

< Which other products and versions in our portfolio does this feature impact?>

< What interoperability test scenarios should be factored by the layered product(s)?>

Questions

Question Outcome
   

 

 

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Update admission controller to remove check for SNO

Repo Link

Add a Node Admission controller to stop nodes from joining that do not have CPU Partitioning turned on.

Incomplete Features

When this image was assembled, these features were not yet completed. Therefore, only the Jira Cards included here are part of this release

The goal of this effort is to leverage OVN Kubernetes SDN to satisfy networking requirements of both traditional and modern virtualization. This Feature describes the envisioned outcome and tracks its implementation.

Current state

In its current state, OpenShift Virtualization provides a flexible toolset allowing customers to connect VMs to the physical network. It also has limited secondary overlay network capabilities and Pod network support.

It suffers from several gaps: the topology of the default pod network is not suitable for typical VM workloads; because of that we are missing out on many of the advanced capabilities of OpenShift networking, and we also don't have a good solution for public cloud. Another problem is that while we provide plenty of tools to build a network solution, we are not very good at guiding cluster administrators in configuring their network, making them rely on their account team.

Desired outcome

Provide:

  • Networking solution for public cloud
  • Advanced SDN networking functionality such as IPAM, routed ingress, DNS and cloud-native integration
  • Ability to host traditional VM workload imported from other virtualization platforms

... while maintaining networking expectations of a typical VM workload:

  • Sticky IPs allowing seamless live migration
  • External IP reflected inside the guest, i.e. no NAT for east-west traffic

Additionally, make our networking configuration more accessible to newcomers by providing a finite list of user stories mapped to recommended solutions.

Timeline

Complete milestones

Next actions (tentative)

4.17 user defined networks TP:

4.17 user defined networks DP:

  • CNV-42490 Networks binding (integration between user-defined networks and KubeVirt)
  • CNV-33753 Egress

4.17 other work:

4.18 user-defined networks for public cloud GA:

  • CNV-42637 Live-migration (not to be confused with seamless live migration CNV-27147)
  • CNV-41302 IPAM (graduation on both CNV and OCP)
  • CNV-44233 Network binding (graduation, removing feature gates and deploying it with CNV)
  • CNV-44224 Egress (graduation, just testing)
  • CNV-24258 LoadBalancer ingress (straight to GA unless we see there are integration points with CNV)

4.18 user-defined networks, other, GA:

4.18 localnet enhancements:

User stories

You can find more info about this effort in https://docs.google.com/document/d/1jNr0E0YMIHsHu-aJ4uB2YjNY00L9TpzZJNWf3LxRsKY/edit

Goal

Provide IPAM to customers connecting VMs to OVN Kubernetes secondary networks.

User Stories

  • As a developer running VMs,
    I want to offload IPAM to somebody else,
    so I don't need to manage my own IP pools, DHCP server, or static IP configuration.

Non-Requirements

  • IPv6 support is not required.

Notes

  • KubeVirt cannot support CNI IPAM. For that reason we cannot utilize the current implementation of IP management in OVN Kubernetes.
  • OVN supports IPAM, where an IP range is defined per port, and the port then offers the assigned IP to the client using DHCP. We can use this (see the sketch after this list).
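A rough sketch of an OVN Kubernetes secondary network attachment where OVN handles IPAM for a given subnet; the names and the subnet are illustrative, and the field layout assumes the ovn-k8s-cni-overlay plugin configuration:

vm-l2-network.yaml
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: vm-l2-network
  namespace: default
spec:
  config: '{
      "cniVersion": "0.3.1",
      "name": "vm-l2-network",
      "type": "ovn-k8s-cni-overlay",
      "topology": "layer2",
      "subnets": "10.200.10.0/24",
      "netAttachDefName": "default/vm-l2-network"
  }'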

Done Checklist

Who What Reference
DEV Upstream roadmap issue <link to GitHub Issue>
DEV Upstream code and tests merged <link to meaningful PR>
DEV Upstream documentation merged <link to meaningful PR>
DEV gap doc updated <name sheet and cell>
DEV Upgrade consideration <link to upgrade-related test or design doc>
DEV CEE/PX summary presentation label epic with cee-training and add a <link to your support-facing preso>
QE Test plans in Polarion https://polarion.engineering.redhat.com/polarion/#/project/CNV/workitem?id=CNV-10864
QE Automated tests merged <link or reference to automated tests>
DOC Downstream documentation merged <link to meaningful PR>

 

Add a knob to CNO to control the installation of the IPAMClaim CRD.

Requires a new OpenShift feature gate only allowing the feature to be installed in Dev / Tech preview.
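One way the gating could be wired up (a sketch; the TechPreviewNoUpgrade feature set is the existing API, while the specific gate guarding the IPAMClaim CRD is an assumption):

featuregate.yaml
apiVersion: config.openshift.io/v1
kind: FeatureGate
metadata:
  name: cluster
spec:
  # Enables tech-preview feature gates cluster-wide; CNO would only install
  # the IPAMClaim CRD when the corresponding gate is on
  featureSet: TechPreviewNoUpgrade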

OCP/Telco Definition of Done
Epic Template descriptions and documentation.

Epic Goal

  • Support live-migration of HyperShift VMs connected to the pod network provided through OVN Kubernetes

Why is this important?

  • HyperShift on KubeVirt should GA in 4.14. Worker nodes there run on KubeVirt VMs and are interconnected using OVN Kubernetes. These VMs must be able to live-migrate in case their current nodes need to undergo maintenance. With the current OVN Kubernetes implementation, live-migration leads to a new IP being assigned to the Pod hosting the VM, thereby breaking connectivity.

Scenarios

Stable IPs for migration:

  1. A Pod running a VM is given an IP by OVN Kubernetes
  2. The VM is being migrated to a new Pod running on a different Node
  3. The new Pod should obtain the same IP and use the same gateway IP

IP negotiated using DHCP:

  1. A Pod running a VM is being started
  2. An IP is allocated for the Pod
  3. The IP is not set on an interface inside the netns
  4. The IP is offered through OVN LP's DHCP server

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • VM must be able to maintain an IP between migrations (from one Pod to another)
  • VM must be assigned an IP through DHCP, without the IP being set inside the Pod netns
  • The VM is a first-class citizen of the Pod network; it can utilize NetworkPolicies and Services, and communicate with other Pods
  • The IP seen within the VM should be reachable from other VMs and Pods

Dependencies (internal and external)

  1. This depends on Proxy ARP being available in OVN. This work is underway, tracked via BZ#2155306 and targeted for OVN 23.06.00
  2. Depends on the Kubernetes EndpointSlices same-IP fix https://github.com/kubernetes/kubernetes/pull/116084
  3. Depends on a new annotation in KubeVirt to implement post-copy live migration correctly https://github.com/kubevirt/kubevirt/pull/9290

Previous Work (Optional):

  1. Supporting this scenario using secondary OVN Kubernetes networks was considered, documented, PoC'd and presented to "OCP Networking Architecture" team.
  2. An alternative approach using the primary network was suggested by the team. This was documented, successfully PoC'd and presented back to the group.
  3. An Enhancement proposal was published.

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Market Problem

As a stakeholder aiming to adopt KubeSaw as a Namespace-as-a-Service solution, I want the project to provide streamlined tooling and a clear code-base, ensuring seamless adoption and integration into my clusters.

Why it Matters

Efficient adoption of KubeSaw, especially as a Namespace-as-a-Service solution, relies on intuitive tooling and a transparent codebase. Improving these aspects will empower stakeholders to effortlessly integrate KubeSaw into their Kubernetes clusters, ensuring a smooth transition to enhanced namespace management.

Illustrative User Stories

As a Stakeholder, I want streamlined setup of the KubeSaw project and a fully automated way of upgrading this setup along with the updates of the installation.

Expected Outcomes

  • Intuitive and user-friendly tooling for seamless configuration and management of KubeSaw instance.
  • A transparent and well-documented codebase, facilitating a quick understanding of KubeSaw internals.

Effect

The expected outcome within the market is both growth and retention. The improved tooling and codebase will attract new stakeholders (growth) and enhance the experience for existing users (retention) by providing a straightforward path to adopting KubeSaw's Namespace-as-a-Service features in their clusters.

Partner

  • Developer Sandbox
  • Konflux

Additional/tangential areas for future development

  • Integration with popular Kubernetes management platforms and tooling for enhanced interoperability.
  • Regular updates to compatibility matrices to support evolving Kubernetes technologies.
  • Collaboration with stakeholders to gather feedback and continuously improve the integration experience, including advanced namespace management features tailored to user needs.

This epic is to track all the unplanned work related to security incidents, fixing flaky e2e tests, and other urgent and unplanned efforts that may arise during the sprint.

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Feature Overview

We drive OpenShift cross-market customer success and new customer adoption with constant improvements and feature additions to the existing capabilities of our OpenShift Core Networking (SDN and Network Edge). This feature captures that natural progression of the product.

Goals

  • Feature enhancements (performance, scale, configuration, UX, ...)
  • Modernization (incorporation and productization of new technologies)

Requirements

  • Core Networking Stability
  • Core Networking Performance and Scale
  • Core Networking Extensibility (Multus CNIs)
  • Core Networking UX (Observability)
  • Core Networking Security and Compliance

In Scope

  • Network Edge (ingress, DNS, LB)
  • SDN (CNI plugins, openshift-sdn, OVN, network policy, egressIP, egress Router, ...)
  • Networking Observability

Out of Scope

There are definitely grey areas, but in general:

  • CNV
  • Service Mesh
  • CNF

Documentation Considerations

Questions to be addressed:

  • What educational or reference material (docs) is required to support this product feature? For users/admins? Other functions (security officers, etc)?
  • Does this feature have doc impact?
  • New Content, Updates to existing content, Release Note, or No Doc Impact
  • If unsure and no Technical Writer is available, please contact Content Strategy.
  • What concepts do customers need to understand to be successful in [action]?
  • How do we expect customers will use the feature? For what purpose(s)?
  • What reference material might a customer want/need to complete [action]?
  • Is there source material that can be used as reference for the Technical Writer in writing the content? If yes, please link if available.
  • What is the doc impact (New Content, Updates to existing content, or Release Note)?
The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

We need to verify that no new CoreDNS dual stack features require any configuration changes or feature flags.
(All dual stack changes should just work once we rebase to coredns v1.8.1).

See https://github.com/coredns/coredns/pull/4339 .

We also need to verify that cluster DNS works for both v4 and v6 for a dual-stack cluster IP service (i.e. request via A and AAAA, and make sure you get the desired response, not just one or the other). A brief CI test on our dual-stack metal CI might make the most sense here (KNI might have a job like this already; we need to investigate our options to add dual-stack coverage to openshift/coredns).
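For the verification, a minimal sketch of a dual-stack ClusterIP Service to query over both A and AAAA records (names and port are illustrative):

dual-stack-service.yaml
apiVersion: v1
kind: Service
metadata:
  name: dual-stack-echo
spec:
  ipFamilyPolicy: PreferDualStack   # request both an IPv4 and an IPv6 cluster IP
  ipFamilies:
  - IPv4
  - IPv6
  selector:
    app: echo
  ports:
  - port: 8080
    protocol: TCP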

This story is for actually updating the version of CoreDNS in github.com/openshift/coredns. Our fork will need to be rebased onto https://github.com/coredns/coredns/releases/tag/v1.8.1, which may involve some git fu. Refer to previous CoreDNS Rebase PR's for any pointers there.

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Create a PR in openshift/cluster-ingress-operator to specify the random balancing algorithm if the feature gate is enabled, and to specify the leastconn balancing algorithm (the current default) otherwise.

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The multiple destinations provided as part of the allowedDestinations field are not working in OCP 4 as they used to: https://github.com/openshift/images/blob/master/egress/router/egress-router.sh#L70-L109

 

We need to parse this from the NAD and modify the iptables here to support them:

https://github.com/openshift/egress-router-cni/blob/master/pkg/macvlan/macvlan.go#L272-L349

 

Testing:

1) Created NAD:

[dsal@bkr-hv02 surya_multiple_destinations]$ cat nad_multiple_destination.yaml 
---
apiVersion: "k8s.cni.cncf.io/v1"
kind: NetworkAttachmentDefinition
metadata:
 name: egress-router
spec:
 config: '{
     "cniVersion": "0.4.0",
     "type": "egress-router",
     "name": "egress-router",
 "ip": {
     "addresses": [
         "10.200.16.10/24"
     ],
     "destinations": [
         "80 tcp 10.100.3.200",
         "8080 tcp 203.0.113.26 80",
         "8443 tcp 203.0.113.26 443"
     ],
     "gateway": "10.200.16.1"
  }
}'

2) Created pod:

[dsal@bkr-hv02  surya_multiple_destinations]$ cat egress-router-pod.yaml 
---
apiVersion: v1
kind: Pod
metadata:
  name: egress-router-pod
  annotations:
    k8s.v1.cni.cncf.io/networks: egress-router
spec:
  containers:
    - name: openshift-egress-router-pod
      command: ["/bin/bash", "-c", "sleep 999999999"]
      image: centos/tools
      securityContext:
        privileged: true

3) Checked IPtables:

[root@worker-1 core]# iptables-save -t nat 
# Generated by iptables-save v1.8.4 on Mon Feb 1 12:08:05 2021
*nat
:PREROUTING ACCEPT [0:0]
:INPUT ACCEPT [0:0]
:POSTROUTING ACCEPT [0:0]
:OUTPUT ACCEPT [0:0]
-A POSTROUTING -o net1 -j SNAT --to-source 10.200.16.10
COMMIT
# Completed on Mon Feb 1 12:08:05 2021

As we can see, only the SNAT rule is added. The DNAT doesn't get picked up because of the syntax difference.

Feature Overview

Plugin teams need a mechanism to extend the OCP console that is decoupled enough so they can deliver at the cadence of their projects and not be forced into the OCP Console release timelines.

The OCP Console Dynamic Plugin Framework will enable all our plugin teams to do the following:

  • Extend the Console
  • Deliver UI code with their Operator
  • Work in their own git Repo
  • Deliver at their own cadence

Goals

    • Operators can deliver console plugins separate from the console image and update plugins when the operator updates.
    • The dynamic plugin API is similar to the static plugin API to ease migration.
    • Plugins can use shared console components such as list and details page components.
    • Shared components from core will be part of a well-defined plugin API.
    • Plugins can use Patternfly 4 components.
    • Cluster admins control what plugins are enabled.
    • Misbehaving plugins should not break console.
    • Existing static plugins are not affected and will continue to work as expected.

Out of Scope

    • Initially we don't plan to make this a public API. The target use is for Red Hat operators. We might reevaluate later when dynamic plugins are more mature.
    • We can't avoid breaking changes in console dependencies such as Patternfly even if we don't break the console plugin API itself. We'll need a way for plugins to declare compatibility.
    • Plugins won't be sandboxed. They will have full JavaScript access to the DOM and network. Plugins won't be enabled by default, however. A cluster admin will need to enable the plugin.
    • This proposal does not cover allowing plugins to contribute backend console endpoints.

 

Requirements

 

Requirement Notes isMvp?
 UI to enable and disable plugins    YES 
 Dynamic Plugin Framework in place    YES 
Testing Infra up and running   YES 
 Docs and read me for creating and testing Plugins    YES 
CI - MUST be running successfully with test automation This is a requirement for ALL features. YES
Release Technical Enablement Provide necessary release enablement details and documents. YES

 
 Documentation Considerations

Questions to be addressed:

  • What educational or reference material (docs) is required to support this product feature? For users/admins? Other functions (security officers, etc)?
  • Does this feature have doc impact?  
  • New Content, Updates to existing content,  Release Note, or No Doc Impact
  • If unsure and no Technical Writer is available, please contact Content Strategy.
  • What concepts do customers need to understand to be successful in [action]?
  • How do we expect customers will use the feature? For what purpose(s)?
  • What reference material might a customer want/need to complete [action]?
  • Is there source material that can be used as reference for the Technical Writer in writing the content? If yes, please link if available.
  • What is the doc impact (New Content, Updates to existing content, or Release Note)?
The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Related to CONSOLE-2380

We need a way for cluster admins to disable a console plugin when uninstalling an operator if it's enabled in the console operator config. Otherwise, the config will reference a plugin that no longer exists. This won't prevent console from loading, but it's something that we can clean up during uninstall.

The UI will always remove the console plugin when an operator is uninstalled. There will not be an option to keep the plugin. We should have a sentence in the dialog letting the user know that the plugin will be disabled when the operator is uninstalled (but only if the CSV has the plugin annotation).

If the user doesn't have authority to patch the operator config, we should warn them that the operator config can't be updated to remove the plugin.
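For reference, the enablement that the uninstall flow would need to clean up lives in the console operator config's plugins list (the plugin name is illustrative):

console-operator-config.yaml (excerpt)
apiVersion: operator.openshift.io/v1
kind: Console
metadata:
  name: cluster
spec:
  plugins:
  - my-operator-plugin   # hypothetical plugin name; should be removed when the owning operator is uninstalled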

cc Peter Kreuser Tony Wu Robb Hamilton

Feature Overview

  • This Section: High-level description of the feature, i.e., Executive Summary
  • Note: A Feature is a capability or a well defined set of functionality that delivers business value. Features can include additions or changes to existing functionality. Features can easily span multiple teams, and multiple releases.

 

Goals

  • This Section: Provide a high-level goal statement, providing user context and expected user outcome(s) for this feature

 

Requirements

  • This Section: A list of specific needs or objectives that a Feature must deliver to satisfy the Feature. Some requirements will be flagged as MVP. If an MVP gets shifted, the feature shifts. If a non-MVP requirement slips, it does not shift the feature.

 

Requirement Notes isMvp?
CI - MUST be running successfully with test automation This is a requirement for ALL features. YES
Release Technical Enablement Provide necessary release enablement details and documents. YES

 

(Optional) Use Cases

This Section: 

  • Main success scenarios - high-level user stories
  • Alternate flow/scenarios - high-level user stories
  • ...

 

Questions to answer…

  • ...

 

Out of Scope

 

Background, and strategic fit

This Section: What does the person writing code, testing, documenting need to know? What context can be provided to frame this feature.

 

Assumptions

  • ...

 

Customer Considerations

  • ...

 

Documentation Considerations

Questions to be addressed:

  • What educational or reference material (docs) is required to support this product feature? For users/admins? Other functions (security officers, etc)?
  • Does this feature have doc impact?  
  • New Content, Updates to existing content,  Release Note, or No Doc Impact
  • If unsure and no Technical Writer is available, please contact Content Strategy.
  • What concepts do customers need to understand to be successful in [action]?
  • How do we expect customers will use the feature? For what purpose(s)?
  • What reference material might a customer want/need to complete [action]?
  • Is there source material that can be used as reference for the Technical Writer in writing the content? If yes, please link if available.
  • What is the doc impact (New Content, Updates to existing content, or Release Note)?
The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The work on this story is dependent on following changes:

 

The console already supports custom routes on the operator config. The newly proposed CustomDomains API introduces a unified way to set custom domains for routes that are installed by default, where customers want to customise both the names and the serving certs/keys. From the console perspective those are:

  • openshift-console / console
  • openshift-console / downloads (CLI downloads)

 

The setup should be done on the Ingress config, where two new fields are introduced:

  • ComponentRouteSpec - contains configuration for the custom domain (name, namespace, custom hostname, TLS secret reference)
  • ComponentRouteStatus - contains status of the custom domain (conditions, previous hostname, RBAC needed to read the TLS secret, ...)

 

The console-operator will only be consuming the API and checking for any changes. If a custom domain is set for either the `console` or `downloads` route in the `openshift-console` namespace, the console-operator will read the setup and set a custom route accordingly. When a custom route is set up for any of the console's routes, the default route won't be deleted; instead it will be updated so it redirects to the custom one. This is done for two reasons:

  1. we want to prevent somebody from stealing the default hostname of both routes (console, downloads)
  2. we want to prevent users from having unusable bookmarks that are pointing to the default hostname

 

The console-operator will still need to support the CustomDomain API that is available on its config.
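A rough sketch of how such a setup could look on the Ingress config (the hostname and secret name are illustrative):

ingress-config.yaml (excerpt)
apiVersion: config.openshift.io/v1
kind: Ingress
metadata:
  name: cluster
spec:
  componentRoutes:
  - name: console
    namespace: openshift-console
    hostname: console.apps.example.com     # illustrative custom hostname
    servingCertKeyPairSecret:
      name: console-custom-tls             # illustrative secret holding the serving cert/key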

Acceptance criteria:

  • Console supports the new CustomDomains API for configuring a custom domain for both `console` and `downloads` routes
  • Console falls back to the deprecated API in the console operator config if present
  • Console supports the original default domains and redirects to the new ones

 

Questions:

  • Which CustomDomain API takes precedence? Ingress config vs. console-operator config. Can an upgrade cause any issues?

Story:
As a user viewing the pod logs tab with a selected container, I want the ability to view past logs if they are available for the container.

Acceptance Criteria:

  • Provide a mechanism to expose past logs, if they are available.

 

Design doc: https://docs.google.com/document/d/1PB8_D5LTWhFPFp3Ovf85jJTc-zAxwgFR-sAOcjQCSBQ/edit#

When moving to OCP 4 we didn't port the metrics charts for Deployments, Deployment Configs, StatefulSets, DaemonSets, ReplicaSets, and ReplicationControllers. These should be the same charts that we show on the Pods page: Memory, CPU, Filesystem, Network In and Out.

This was only done for pods.

We need to decide if we want to use a multiline chart or some other representation.

This would let us import YAML with multiple resources and add YAML templates that create related resources like image streams and build configs together.

See CONSOLE-580

Acceptance criteria:

  • Users should be able to drag multiple files into the import yaml.
    • the YAML files should be displayed in the editor separated by "---" (see the example after this list)
  • After clicking create
    • a dry run will be initiated and will report back any errors
    • upon receiving no errors from the dry run, the resources will be created
    • the results page will appear showing links for each resource
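As a simple illustration of the multi-document input (resource names and the repository URL are illustrative):

multi-resource-import.yaml
apiVersion: image.openshift.io/v1
kind: ImageStream
metadata:
  name: my-app
---
apiVersion: build.openshift.io/v1
kind: BuildConfig
metadata:
  name: my-app
spec:
  source:
    git:
      uri: https://github.com/example/my-app.git   # illustrative repository
  strategy:
    dockerStrategy: {}
  output:
    to:
      kind: ImageStreamTag
      name: my-app:latest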

Feature Overview (aka. Goal Summary)

As a result of Hashicorp's license change to BSL, Red Hat OpenShift needs to remove the use of Hashicorp's Terraform from the installer – specifically for OpenStack deployments which currently use Terraform for setting up the infrastructure.

To avoid an increased support overhead once the license changes at the end of the year, we want to provision OpenStack infrastructure without the use of Terraform.

Requirements (aka. Acceptance Criteria):

  • The OpenStack Installer no longer contains or uses Terraform.
  • The new provider should aim to provide the same results and have parity with the existing OpenStack Terraform provider. Specifically, we should aim for feature parity against the install config and the cluster it creates to minimize impact on existing customers' UX.

Use Cases (Optional):

Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.

Questions to Answer (Optional):

Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.

Out of Scope

High-level list of items that are out of scope. Initial completion during Refinement status.

Background

Provide any additional context that is needed to frame the feature. Initial completion during Refinement status.

Customer Considerations

Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.

Documentation Considerations

Provide information that needs to be considered and planned so that documentation will meet customer needs. If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.

Interoperability Considerations

Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.

Goal

  • Create cluster and OpenStackCluster resource for the install-config.yaml
  • Create OpenStackMachine
  • Remove terraform dependency for OpenStack

Why is this important?

  • To have a CAPO cluster functionally equivalent to the installer

Scenarios


  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Essentially: bring the upstream-master branch of shiftstack/cluster-api-provider-openstack under the github.com/openshift organisation.

Feature Overview (aka. Goal Summary)

As a result of Hashicorp's license change to BSL, Red Hat OpenShift needs to remove the use of Hashicorp's Terraform from the installer – specifically for IPI deployments which currently use Terraform for setting up the infrastructure.

To avoid an increased support overhead once the license changes at the end of the year, we want to provision OpenShift on the existing supported providers' infrastructure without the use of Terraform.

This feature will be used to track all the CAPI preparation work that is common to all the supported providers.

Use Cases (Optional):

Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.

Questions to Answer (Optional):

Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.

Out of Scope

High-level list of items that are out of scope. Initial completion during Refinement status.

Background

Provide any additional context that is needed to frame the feature. Initial completion during Refinement status.

Customer Considerations

Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.

Documentation Considerations

Provide information that needs to be considered and planned so that documentation will meet customer needs. If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.

Interoperability Considerations

Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.

Epic Goal

  • Day 0 Cluster Provisioning
  • Compatibility with existing workflows that do not require a container runtime on the host

Why is this important?

  • This epic would maintain compatibility with existing customer workflows that do not have access to a management cluster and do not have the dependency of a container runtime

Scenarios

  1. openshift-install running in customer automation

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

User Story:

 I want hack/build.sh to embed the kube-apiserver and etcd dependencies in openshift-install without making external network calls so that ART/OSBS can build the installer with CAPI dependencies.

Acceptance Criteria:

Description of criteria:

  • dependencies are not obtained over the internet
  • gated by OPENSHIFT_INSTALL_CLUSTER_API env var
  • should work when building for various architectures

(optional) Out of Scope:

Engineering Details:

  • Currently the dependencies are obtained through the sync_envtest function in build-cluster-api.sh
  • Cluster API provider dependencies are vendored and built here

This requires/does not require a design proposal.
This requires/does not require a feature gate.

The 100.88.0.0/14 IPv4 subnet is currently reserved for the transit switch in OVN-Kubernetes for east/west traffic in the OVN Interconnect architecture. We need to make this value configurable so that users can avoid conflicts with their local infrastructure. We need to support this config both prior to installation (day 0) and post-installation (day 2).

This epic will include stories for the upstream ovn-org work, getting that work downstream, an API change, and a CNO change to consume the new API.
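
As a hedged sketch of what the day-2 configuration might look like once the API work lands, the transit switch subnet could be overridden on the Network operator config roughly as below; the field name and value are assumptions for illustration, not the final API:

apiVersion: operator.openshift.io/v1
kind: Network
metadata:
  name: cluster
spec:
  defaultNetwork:
    type: OVNKubernetes
    ovnKubernetesConfig:
      ipv4:
        internalTransitSwitchSubnet: 100.70.0.0/16   # assumed field name; example replacement for the reserved default range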

The scope of this card is to track the work of getting the required pieces into CNO so that users can apply custom configuration to the transit switch subnet on both day 0 (install) and day 2 (post-install).

This card will complement https://issues.redhat.com/browse/SDN-4156 

You can create the cluster-bot cluster with Ben's PR and do CNO changes locally and test them out.

Template:

Networking Definition of Planned

Epic Template descriptions and documentation

Epic Goal

Support EgressIP feature with ExternalTrafficPolicy=Local and External2Pod direct routing in OVNKubernetes.

Why is this important?

We see a lot of customers using Multi-Egress Gateway with EgressIP. 

Currently, connections which reach a pod via the OVN routing gateway are sent back via the EgressIP if it is associated with the specific namespace.

Multiple bugs have been reported by customers: 

https://issues.redhat.com/browse/OCPBUGS-16792 

https://issues.redhat.com/browse/OCPBUGS-7454

https://issues.redhat.com/browse/OCPBUGS-18400

This also resulted in RFEs being filed, as it was too complicated to be fixed via a bug.

https://issues.redhat.com/browse/RFE-4614

https://issues.redhat.com/browse/RFE-3944

This is observed by multiple customers using MetalLB and F5 load balancers. We haven't really tested this combination.

From the initial discussion, it looks like the fix is needed in OVN. We request the team to expedite this fix, given that a bunch of customers are hitting it.
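
For reference, the combination this epic targets looks roughly like the following: a namespace selected by an EgressIP object while one of its services uses ExternalTrafficPolicy=Local (names and addresses are placeholders):

apiVersion: k8s.ovn.org/v1
kind: EgressIP
metadata:
  name: egressip-prod
spec:
  egressIPs:
  - 192.168.10.100
  namespaceSelector:
    matchLabels:
      env: prod
---
apiVersion: v1
kind: Service
metadata:
  name: app
  namespace: prod
spec:
  type: LoadBalancer
  externalTrafficPolicy: Local   # return traffic must leave from the node that received it
  selector:
    app: app
  ports:
  - port: 80
    targetPort: 8080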

Planning Done Checklist

The following items must be completed on the Epic prior to moving the Epic from Planning to the ToDo status

  • Priority is set by engineering
  • Epic must be linked to a Parent Feature
  • Target version must be set
  • Assignee must be set
  • Enhancement Proposal is Implementable
  • No outstanding questions about major work breakdown
  • Are all Stakeholders known? Have they all been notified about this item?
  • Does this epic affect SD? Have they been notified? (View plan definition for current suggested assignee)
    1. Please use the “Discussion Needed: Service Delivery Architecture Overview” checkbox to facilitate the conversation with SD Architects. The SD architecture team monitors this checkbox which should then spur the conversation between SD and epic stakeholders. Once the conversation has occurred, uncheck the “Discussion Needed: Service Delivery Architecture Overview” checkbox and record the outcome of the discussion in the epic description here.
    2. The guidance here is that unless it is very clear that your epic doesn’t have any managed services impact, default to use the Discussion Needed checkbox to facilitate that conversation.

Additional information on each of the above items can be found here: Networking Definition of Planned

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement
    details and documents.

...

Dependencies (internal and external)

  1. OVN team has to do https://issues.redhat.com/browse/FDP-42 and only then can we consume that into OVNKubernetes
  2. Design discussions Doc: https://docs.google.com/document/d/1VgDuEhkDzNOjIlPtwfIhEGY1Odatp-rLF6Pmd7bQtt0/edit 

...

Previous Work (Optional):

1. …

Open questions::

1. …

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

In 4.15, before conducting the live migration, CNO will check if a cluster is managed by the SD team. We need to remove this check to support unmanaged clusters.

Epic Goal*

Provide a long term solution to SELinux context labeling in OCP.

 
Why is this important? (mandatory)

As of today, when SELinux is enabled, the PV's files are relabeled when attaching the PV to the pod. This can cause timeouts when the PV contains a lot of files, as well as overload the storage backend.

https://access.redhat.com/solutions/6221251 provides a few workarounds until the proper fix is implemented. Unfortunately these workarounds are not perfect and we need a long-term, seamless, optimised solution.

This feature tracks the long-term solution where the PV filesystem will be mounted with the right SELinux context, thus avoiding relabeling every file.

 
Scenarios (mandatory) 

Provide details for user scenarios including actions to be performed, platform specifications, and user personas.  

  1. Apply new context when there is none
  2. Change context of all files/folders when changing context
  3. RWO & RWX PVs
    1. ReadWriteOncePod PVs first
    2. RWX PV in a second phase

As we are relying on the mount context, there should not be any relabeling (chcon) because all files/folders will inherit the context from the mount context.

More on design & scenarios in the KEP  and related epic STOR-1173

Dependencies (internal and external) (mandatory)

None for the core feature

However the driver will have to set SELinuxMountSupported to true in the CSIDriverSpec to enable this feature. 

Contributing Teams(and contacts) (mandatory) 

Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.

  • Development - STOR
  • Documentation - STOR
  • QE - STOR
  • PX - 
  • Others -

Done - Checklist (mandatory)

The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.

  • CI Testing - Basic e2e automation tests are merged and completing successfully
  • Documentation - Content development is complete.
  • QE - Test scenarios are written and executed successfully.
  • Technical Enablement - Slides are complete (if requested by PLM)
  • Engineering Stories Merged
  • All associated work items with the Epic are closed
  • Epic status should be “Release Pending” 

Epic Goal

Support upstream feature "SELinux relabeling using mount options (CSIDriver API change)" in OCP as Beta, i.e. test it and have docs for it (unless it's Alpha upstream).

Summary: If a Pod has a defined SELinux context (e.g. it uses the "restricted" SCC), it uses a ReadWriteOncePod PVC, and the CSI driver responsible for the volume supports this feature, kubelet + the CSI driver will mount the volume directly with the correct SELinux labels. Therefore CRI-O does not need to recursively relabel the volume and pod startup can be significantly faster. We will need thorough documentation for this.
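
As a minimal sketch of the setup this feature optimizes (image, labels, and claim name are placeholders), a pod with an explicit SELinux context using a ReadWriteOncePod claim might look like:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data
spec:
  accessModes:
  - ReadWriteOncePod
  resources:
    requests:
      storage: 1Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: app
spec:
  securityContext:
    seLinuxOptions:
      level: "s0:c123,c456"   # context the volume can be mounted with, instead of recursively relabeling files
  containers:
  - name: app
    image: registry.access.redhat.com/ubi9/ubi-minimal
    command: ["sleep", "infinity"]
    volumeMounts:
    - name: data
      mountPath: /data
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: data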

This upstream epic actually will be implemented by us!

Why is this important?

  • We get this upstream feature through Kubernetes rebase. We should ensure it works well in OCP and we have docs for it.

Upstream links

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. External: the feature is currently scheduled for Beta in Kubernetes 1.27, i.e. OCP 4.14, but it may change before Kubernetes 1.27 GA.

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Test that the metrics described in the KEP provide useful data. I.e. check that volume_manager_selinux_volume_context_mismatch_warnings_total increases when there are two Pods that have two different SELinux contexts and use the same volume and different subpath of it.

The goal of this feature is to add support for:

  • telemetry
  • nmstate ipv6
  • nmstate net2net

Why is this important?

  • Without the API, customers are forced to use MCO. This brings with it a set of limitations (mainly a reboot per change and the fact that config is shared across each pool, so per-node configuration isn't possible).
  • A better upgrade solution will give us the ability to support a single-host-based implementation.
  • Telemetry will give us more info on how widely IPsec is used.

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • Must allow for the possibility of offloading the IPsec encryption to a SmartNIC.

 

  • nmstate
  • k8s-nmstate
  • easier mechanism for cert injection (??)
  • telemetry
  • improve ci and test coverage
     

Dependencies (internal and external)

  1.  nmstate tasks

Related:

  • ITUP-44 - OpenShift support for North-South OVN IPSec
  • HATSTRAT-33 - Encrypt All Traffic to/from Cluster (aka IPSec as a Service)

Previous Work (Optional):

  1. SDN-717 - Support IPSEC on ovn-kubernetes
  2. SDN-3604 - Fully supported non-GA N-S IPSec implementation using machine config.

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Epic Goal

  • telemetry
  • nmstate ipv6
  • nmstate net2net

Why is this important?

  • Without the API, customers are forced to use MCO. This brings with it a set of limitations (mainly a reboot per change and the fact that config is shared across each pool, so per-node configuration isn't possible).
  • A better upgrade solution will give us the ability to support a single-host-based implementation.
  • Telemetry will give us more info on how widely IPsec is used.

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • Must allow for the possibility of offloading the IPsec encryption to a SmartNIC.

 

  • nmstate
  • k8s-nmstate
  • easier mechanism for cert injection (??)
  • telemetry
  • improve ci and test coverage
     

Dependencies (internal and external)

  1.  nmstate tasks

Related:

  • ITUP-44 - OpenShift support for North-South OVN IPSec
  • HATSTRAT-33 - Encrypt All Traffic to/from Cluster (aka IPSec as a Service)

Previous Work (Optional):

  1. SDN-717 - Support IPSEC on ovn-kubernetes
  2. SDN-3604 - Fully supported non-GA N-S IPSec implementation using machine config.

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Epic Goal*

Drive the technical part of the Kubernetes 1.29 upgrade, including rebasing the openshift/kubernetes repository and coordination across the OpenShift organization to get e2e tests green for the OCP release.

 
Why is this important? (mandatory)

OpenShift 4.16 cannot be released without Kubernetes 1.29

 
Scenarios (mandatory) 

  1.  

 
Dependencies (internal and external) (mandatory)

What items must be delivered by other teams/groups to enable delivery of this epic. 

Contributing Teams(and contacts) (mandatory) 

Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.

  • Development - 
  • Documentation -
  • QE - 
  • PX - 
  • Others -

Acceptance Criteria (optional)

Provide some (testable) examples of how we will know if we have achieved the epic goal.  

Drawbacks or Risk (optional)

Reasons we should consider NOT doing this such as: limited audience for the feature, feature will be superseded by other work that is planned, resulting feature will introduce substantial administrative complexity or user confusion, etc.

Done - Checklist (mandatory)

The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.

  • CI Testing - Basic e2e automation tests are merged and completing successfully
  • Documentation - Content development is complete.
  • QE - Test scenarios are written and executed successfully.
  • Technical Enablement - Slides are complete (if requested by PLM)
  • Engineering Stories Merged
  • All associated work items with the Epic are closed
  • Epic status should be “Release Pending” 

PRs:

Template:

Networking Definition of Planned

Epic Template descriptions and documentation

Epic Goal

Why is this important?

Planning Done Checklist

The following items must be completed on the Epic prior to moving the Epic from Planning to the ToDo status

  • Priority is set by engineering
  • Epic must be linked to a Parent Feature
  • Target version must be set
  • Assignee must be set
  • Enhancement Proposal is Implementable
  • No outstanding questions about major work breakdown
  • Are all Stakeholders known? Have they all been notified about this item?
  • Does this epic affect SD? Have they been notified? (View plan definition for current suggested assignee)
    1. Please use the “Discussion Needed: Service Delivery Architecture Overview” checkbox to facilitate the conversation with SD Architects. The SD architecture team monitors this checkbox which should then spur the conversation between SD and epic stakeholders. Once the conversation has occurred, uncheck the “Discussion Needed: Service Delivery Architecture Overview” checkbox and record the outcome of the discussion in the epic description here.
    2. The guidance here is that unless it is very clear that your epic doesn’t have any managed services impact, default to use the Discussion Needed checkbox to facilitate that conversation.

Additional information on each of the above items can be found here: Networking Definition of Planned

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement
    details and documents.

...

Dependencies (internal and external)

1.

...

Previous Work (Optional):

1. …

Open questions::

1. …

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Feature Overview (aka. Goal Summary)  

Support network isolation and multiple primary networks (with the possibility of overlapping IP subnets) without having to use Kubernetes Network Policies.

Goals (aka. expected user outcomes)

  • Provide a configurable way to indicate that a pod should be connected to a unique network of a specific type via its primary interface.
  • Allow networks to have overlapping IP address space.
  • The primary network defined today will remain in place as the default network that pods attach to when no unique network is specified.
  • Support cluster ingress/egress traffic for unique networks, including secondary networks.
  • Support for ingress/egress features where possible, such as:
    • EgressQoS
    • EgressService
    • EgressIP
    • Load Balancer Services

Requirements (aka. Acceptance Criteria):

  • Support for 10,000 namespaces
  •  

Anyone reviewing this Feature needs to know which deployment configurations the Feature will apply to (or not) once it's been completed. Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.

Deployment considerations List applicable specific needs (N/A = not applicable)
Self-managed, managed, or both  
Classic (standalone cluster)  
Hosted control planes  
Multi node, Compact (three node), or Single node (SNO), or all  
Connected / Restricted Network  
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x)  
Operator compatibility  
Backport needed (list applicable versions)  
UI need (e.g. OpenShift Console, dynamic plugin, OCM)  
Other (please specify)  

Design Document

Use Cases (Optional):

  • As an OpenStack or vSphere/vCenter user, who is migrating to OpenShift Kubernetes, I want to guarantee my OpenStack/vSphere tenant network isolation remains intact as I move into Kubernetes namespaces.
  • As an OpenShift Kubernetes user, I do not want to have to rely on Kubernetes Network Policy and prefer to have native network isolation per tenant using a layer 2 domain.
  • As an OpenShift Network Administrator with multiple identical application deployments across my cluster, I require a consistent IP-addressing subnet per deployment type. Multiple applications in different namespaces must always be accessible using the same, predictable IP address.

Questions to Answer (Optional):

  •  

Out of Scope

  • Multiple External Gateway (MEG) Support - support will remain for default primary network.
  • Pod Ingress support - support will remain for default primary network.
  • Cluster IP Service reachability across networks. Services and endpoints will be available only within the unique network.
  • Allowing different service CIDRs to be used in different networks.
  • Localnet will not be supported initially for primary networks.
  • Allowing multiple primary networks per namespace.
  • Allow connection of multiple networks via explicit router configuration. This may be handled in a future enhancement.
  • Hybrid overlay support on unique networks.

Background

OVN-Kubernetes today allows multiple different types of networks per secondary network: layer 2, layer 3, or localnet. Pods can be connected to different networks without discretion. For the primary network, OVN-Kubernetes only supports all pods connecting to the same layer 3 virtual topology.

As users migrate from OpenStack to Kubernetes, there is a need to provide network parity for those users. In OpenStack, each tenant (analog to a Kubernetes namespace) by default has a layer 2 network, which is isolated from any other tenant. Connectivity to other networks must be specified explicitly as network configuration via a Neutron router. In Kubernetes the paradigm is the opposite; by default all pods can reach other pods, and security is provided by implementing Network Policy.

Network Policy has its issues:

  • it can be cumbersome to configure and manage for a large cluster
  • it can be limiting as it only matches TCP, UDP, and SCTP traffic
  • large amounts of network policy can cause performance issues in CNIs

With all these factors considered, there is a clear need to address network security in a native fashion, by using networks per user to isolate traffic instead of using Kubernetes Network Policy.

Therefore, the scope of this effort is to bring the same flexibility of the secondary network to the primary network and allow pods to connect to different types of networks that are independent of networks that other pods may connect to.

Customer Considerations

  •  

Documentation Considerations

  •  

Interoperability Considerations

Test scenarios:

  • E2E upstream and downstream jobs covering supported features across multiple networks.
  • E2E tests ensuring network isolation between OVN networked and host networked pods, services, etc.
  • E2E tests covering network subnet overlap and reachability to external networks.
  • Scale testing to determine limits and impact of multiple unique networks.

In order for the network API related CRDs to be installed and usable out-of-the-box, the new CRD manifests should be replicated to the CNO repository in a way that it will install them along with the other OVN-K CRDs.

Example https://github.com/openshift/cluster-network-operator/pull/1765

See https://github.com/ovn-org/ovn-kubernetes/pull/4276#discussion_r1628111584 for more details

  1. Currently we seem to be handling the same network from multiple threads when different NADs refer to the same network
  2. This leads to race conditions
  3. we need a level-driven, single-threaded way of handling networks
  4. this card tracks the refactoring needed for this as the 1st step
  • Make the necessary API changes, if needed, to indicate that this value is modifiable
  • Make any CNO changes to ensure that a new range set on day 2 does not conflict with other ranges
  • Make OVNK changes to allow for day-2 config changes - disruption is not an issue
  • Ensure users can provide any value to this, not just 169.x.x.x, along with allowing expansion of the current range

The goal of this task is simply to add a feature gate, both upstream in OVNK and downstream in ocp/api, to then be leveraged via CNO once the entire feature merges. This is going to be a huge epic, so with the breakdown, this card is intentionally ONLY tracking the glue work to have the feature gate piece done in both places.

  1. Controller changes must leverage this feature gate,
  2. test changes must leverage this
  3. all topology changes that depend on "specific ifs for this feature" need this feature gate
  4. be smart about the naming because it will be user-facing in docs
  5. also expose it via KIND

This card DOES NOT HAVE TO USE THE FEATURE GATE. It is meant to allow other cards to use this.
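
A minimal sketch of how such a gate might eventually be switched on through the usual OpenShift FeatureGate config is shown below; the gate name is purely illustrative, since the naming is still open:

apiVersion: config.openshift.io/v1
kind: FeatureGate
metadata:
  name: cluster
spec:
  featureSet: CustomNoUpgrade
  customNoUpgrade:
    enabled:
    - NetworkSegmentation   # illustrative gate name, not the final one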

Epic Goal*

Drive the technical part of the Kubernetes 1.31 upgrade, including rebasing the openshift/kubernetes repository and coordination across the OpenShift organization to get e2e tests green for the OCP release.

 
Why is this important? (mandatory)

OpenShift 4.18 cannot be released without Kubernetes 1.31

 
Scenarios (mandatory) 

  1.  

 
Dependencies (internal and external) (mandatory)

What items must be delivered by other teams/groups to enable delivery of this epic. 

Contributing Teams(and contacts) (mandatory) 

Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.

  • Development - 
  • Documentation -
  • QE - 
  • PX - 
  • Others -

Acceptance Criteria (optional)

Provide some (testable) examples of how we will know if we have achieved the epic goal.  

Drawbacks or Risk (optional)

Reasons we should consider NOT doing this such as: limited audience for the feature, feature will be superseded by other work that is planned, resulting feature will introduce substantial administrative complexity or user confusion, etc.

Done - Checklist (mandatory)

The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.

  • CI Testing - Basic e2e automation tests are merged and completing successfully
  • Documentation - Content development is complete.
  • QE - Test scenarios are written and executed successfully.
  • Technical Enablement - Slides are complete (if requested by PLM)
  • Engineering Stories Merged
  • All associated work items with the Epic are closed
  • Epic status should be “Release Pending” 

PRs:

Template:

Networking Definition of Planned

Epic Template descriptions and documentation

Epic Goal

Track the stories that cannot be completed before live migration GA.

Why is this important?

These tasks shall not block the live migration GA, but we still need to get them done.

Planning Done Checklist

The following items must be completed on the Epic prior to moving the Epic from Planning to the ToDo status

  • Priority is set by engineering
  • Epic must be linked to a Parent Feature
  • Target version must be set
  • Assignee must be set
  • Enhancement Proposal is Implementable
  • No outstanding questions about major work breakdown
  • Are all Stakeholders known? Have they all been notified about this item?
  • Does this epic affect SD? Have they been notified? (View plan definition for current suggested assignee)
    1. Please use the "Discussion Needed: Service Delivery Architecture Overview" checkbox to facilitate the conversation with SD Architects. The SD architecture team monitors this checkbox which should then spur the conversation between SD and epic stakeholders. Once the conversation has occurred, uncheck the "Discussion Needed: Service Delivery Architecture Overview" checkbox and record the outcome of the discussion in the epic description here.
    2. The guidance here is that unless it is very clear that your epic doesn't have any managed services impact, default to use the Discussion Needed checkbox to facilitate that conversation.

Additional information on each of the above items can be found here: Networking Definition of Planned

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement
    details and documents.

...

Dependencies (internal and external)

1.

...

Previous Work (Optional):

1. ...

Open questions::

1. ...

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

The SDN live migration cannot work properly in a cluster with specific configurations. CNO shall refuse to proceed with the live migration in such a case. We need to add pre-migration validation to CNO.

The live migration shall be blocked for clusters with the following configuration

  • OpenShiftSDN multitenant mode.
  • Egress Router
  • cluster network or service network ranges conflict with the OVN-K internal subnets

As the live migration process may take hours for a large cluster, the workload in the cluster may trigger cluster extension by adding new nodes. We need to support adding new nodes while an SDN live migration is in progress.

We need to backport this to 4.15.

The SD team manages many clusters. Metrics can help them monitor the status of many clusters at a time. Something similar has been done for cluster upgrades; we may want to follow the same recipe.

Elaborate more dashboards (monitoring dashboards, accessible from menu Observe > Dashboards ; admin perspective) related to networking.

Start with just a couple of areas:

  • Host network dashboard (using node-exporter network / netstat metrics - related to CMO)
  • OVN/OVS health dashboard (using ovn/ovs metrics)
  • Ingress dashboard (routes, shards stats) related to Ingress operator / netedge team
    (- DNS dashboard, if time)

More info/discussion in this work doc: https://docs.google.com/document/d/1ByNIJiOzd6w5csFYpC27NdOydnBg8Tx45uL4-7v-aCM/edit

Elaborate more dashboards (monitoring dashboards, accessible from menu Observe > Dashboards ; admin perspective) related to networking.

Start with just a couple of areas:

  • Host network dashboard (using node-exporter network / netstat metrics - related to CMO)
  • OVN/OVS health dashboard (using ovn/ovs metrics)

More info/discussion in this work doc: https://docs.google.com/document/d/1ByNIJiOzd6w5csFYpC27NdOydnBg8Tx45uL4-7v-aCM/edit

Martin Kennelly is our contact point from the SDN team

Create a dashboard from the CNO

cf https://docs.google.com/document/d/1ByNIJiOzd6w5csFYpC27NdOydnBg8Tx45uL4-7v-aCM/edit#heading=h.as5l4d8fepgw

Current metrics documentation:

Include metrics for:

  • pod/svc/netpol setup latency
  • ovs/ovn CPU and memory
  • network stats: rx/tx bytes, drops, errs per interface (not all interfaces are monitored by default, but they're going to be more configurable via another task: NETOBSERV-1021)

Feature Overview (aka. Goal Summary)  

Customers have requested the ability to apply tolerations to the HCP control plane pods. This provides the flexibility to have the HCP pods scheduled to nodes with taints applied to them that are not currently tolerated by default.

API 

Add new field to HostedCluster. hc.Spec.Tolerations

Tolerations   []corev1.Toleration `json:"tolerations,omitempty"` 

Implementation

In support/config/deployment.go, add hc.spec.tolerations from hc when generating the default config. This will cause the toleration to naturally get spread to the deployments and statefulsets.

CLI

Add new cli argument called --tolerations to the hcp cli tool during cluster creation. This argument should be able to be set multiple times. The syntax of the field should follow the convention set by the kubectl client tool when setting a taint on a node.

For example, the kubectl client tool can be used to set the following taint on a node.

kubectl taint nodes node1 key1=value1:NoSchedule

And then the hcp cli tool should be able to add a toleration for this taint during creation with the following cli arg.

hcp cluster create kubevirt --toleration "key1=value1:NoSchedule" …
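
Putting the API and CLI pieces together, the resulting HostedCluster might carry something like the following (apiVersion and surrounding fields are assumed/elided for illustration):

apiVersion: hypershift.openshift.io/v1beta1
kind: HostedCluster
metadata:
  name: example
  namespace: clusters
spec:
  tolerations:               # spread to the HCP deployments/statefulsets as described above
  - key: key1
    operator: Equal
    value: value1
    effect: NoSchedule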

Goals (aka. expected user outcomes)

  • Support for customer defined tolerations for HCP pods

    Requirements (aka. Acceptance Criteria):

A list of specific needs or objectives that a feature must deliver in order to be considered complete.  Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc.  Initial completion during Refinement status.

<enter general Feature acceptance here>

 

Anyone reviewing this Feature needs to know which deployment configurations the Feature will apply to (or not) once it's been completed. Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.

Deployment considerations List applicable specific needs (N/A = not applicable)
Self-managed, managed, or both  
Classic (standalone cluster)  
Hosted control planes  
Multi node, Compact (three node), or Single node (SNO), or all  
Connected / Restricted Network  
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x)  
Operator compatibility  
Backport needed (list applicable versions)  
UI need (e.g. OpenShift Console, dynamic plugin, OCM)  
Other (please specify)  

Use Cases (Optional):

Include use case diagrams, main success scenarios, alternative flow scenarios.  Initial completion during Refinement status.

<your text here>

Questions to Answer (Optional):

Include a list of refinement / architectural questions that may need to be answered before coding can begin.  Initial completion during Refinement status.

<your text here>

Out of Scope

High-level list of items that are out of scope.  Initial completion during Refinement status.

<your text here>

Background

Provide any additional context that is needed to frame the feature. Initial completion during Refinement status.

<your text here>

Customer Considerations

Provide any additional customer-specific considerations that must be made when designing and delivering the Feature.  Initial completion during Refinement status.

<your text here>

Documentation Considerations

Provide information that needs to be considered and planned so that documentation will meet customer needs.  If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.

<your text here>

Interoperability Considerations

Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact?  What interoperability test scenarios should be factored by the layered products?  Initial completion during Refinement status.

<your text here>

Customers have requested the ability to apply tolerations to the HCP control plane pods. This provides the flexibility to have the HCP pods scheduled to nodes with taints applied to them that are not currently tolerated by default.

API 

Add new field to HostedCluster. hc.Spec.Tolerations

Tolerations   []corev1.Toleration `json:"tolerations,omitempty"` 

Implementation

In support/config/deployment.go, add hc.spec.tolerations from hc when generating the default config. This will cause the toleration to naturally get spread to the deployments and statefulsets.

CLI

Add new cli argument called --tolerations to the hcp cli tool during cluster creation. This argument should be able to be set multiple times. The syntax of the field should follow the convention set by the kubectl client tool when setting a taint on a node.

For example, the kubectl client tool can be used to set the following taint on a node.

kubectl taint nodes node1 key1=value1:NoSchedule

And then the hcp cli tool should be able to add a toleration for this taint during creation with the following cli arg.

hcp cluster create kubevirt --toleration "key1=value1:NoSchedule" …

The cluster-network-operator needs to be HCP tolerations aware, otherwise controllers (like multus and ovn) won't be deployed by the CNO with the correct tolerations.

The code that looks at the HostedControlPlane within the CNO can be found in pkg/hypershift/hypershift.go. https://github.com/openshift/cluster-network-operator/blob/33070b57aac78118eea34060adef7f2fb7b7b4bf/pkg/hypershift/hypershift.go#L134

Continue scale testing and performance improvements for ovn-kubernetes

Template:

Networking Definition of Planned

Epic Template descriptions and documentation

Epic Goal

Manage Openshift Virtual Machines IP addresses from within the SDN solution provided by OVN-Kubernetes.

Why is this important?

Customers want to offload IPAM from their custom solutions (e.g. custom DHCP server running on their cluster network) to SDN.

Planning Done Checklist

The following items must be completed on the Epic prior to moving the Epic from Planning to the ToDo status

  • Priority is set by engineering
  • Epic must be linked to a Parent Feature
  • Target version must be set
  • Assignee must be set
  • Enhancement Proposal is Implementable
  • No outstanding questions about major work breakdown
  • Are all Stakeholders known? Have they all been notified about this item?
  • Does this epic affect SD? Have they been notified? (View plan definition for current suggested assignee)
    1. Please use the “Discussion Needed: Service Delivery Architecture Overview” checkbox to facilitate the conversation with SD Architects. The SD architecture team monitors this checkbox which should then spur the conversation between SD and epic stakeholders. Once the conversation has occurred, uncheck the “Discussion Needed: Service Delivery Architecture Overview” checkbox and record the outcome of the discussion in the epic description here.
    2. The guidance here is that unless it is very clear that your epic doesn’t have any managed services impact, default to use the Discussion Needed checkbox to facilitate that conversation.

Additional information on each of the above items can be found here: Networking Definition of Planned

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement
    details and documents.

...

Dependencies (internal and external)

1.

...

Previous Work (Optional):

1. …

Open questions::

1. …

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Feature Overview (aka. Goal Summary)  

Enable CPU manager on s390x.

Why is this important?

CPU manager is an important component to manage performance of OpenShift and utilize the respective platforms.

Goals (aka. expected user outcomes)

Enable CPU manager on s390x.

Requirements (aka. Acceptance Criteria):

CPU manager works on s390x.

 

Anyone reviewing this Feature needs to know which deployment configurations the Feature will apply to (or not) once it's been completed. Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.

Deployment considerations List applicable specific needs (N/A = not applicable)
Self-managed, managed, or both Y
Classic (standalone cluster) Y
Hosted control planes Y
Multi node, Compact (three node), or Single node (SNO), or all Y
Connected / Restricted Network Y
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) IBM Z
Operator compatibility n/a
Backport needed (list applicable versions) n/a
UI need (e.g. OpenShift Console, dynamic plugin, OCM) n/a
Other (please specify)  

Use Cases (Optional):

Include use case diagrams, main success scenarios, alternative flow scenarios.  Initial completion during Refinement status.

<your text here>

Questions to Answer (Optional):

Include a list of refinement / architectural questions that may need to be answered before coding can begin.  Initial completion during Refinement status.

<your text here>

Out of Scope

High-level list of items that are out of scope.  Initial completion during Refinement status.

<your text here>

Background

Provide any additional context that is needed to frame the feature. Initial completion during Refinement status.

<your text here>

Customer Considerations

Provide any additional customer-specific considerations that must be made when designing and delivering the Feature.  Initial completion during Refinement status.

<your text here>

Documentation Considerations

Provide information that needs to be considered and planned so that documentation will meet customer needs.  If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.

<your text here>

Interoperability Considerations

Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact?  What interoperability test scenarios should be factored by the layered products?  Initial completion during Refinement status.

<your text here>

Feature Overview (aka. Goal Summary)  

OVN-Kubernetes Developer's Preview for BGP as a routing protocol, providing User Defined Network (Segmentation) pod and VM addressability via common data center networking and removing the need to negotiate NAT at the cluster's edge.

Goals (aka. expected user outcomes)

OVN-Kubernetes currently has no native routing protocol integration, and relies on a Geneve overlay for east/west traffic, as well as third party operators to handle external network integration into the cluster. The purpose of this Developer's Preview enhancement is to introduce BGP as a supported routing protocol with OVN-Kubernetes. The extent of this support will allow OVN-Kubernetes to integrate into different BGP user environments, enabling it to dynamically expose cluster scoped network entities into a provider’s network, as well as program BGP learned routes from the provider’s network into OVN. In a follow-on release, this enhancement will provide support for EVPN, which is a common data center networking fabric that relies on BGP.

Requirements (aka. Acceptance Criteria):

  • Provide a user-facing API to allow configuration of iBGP or eBGP peers, along with typical BGP configurations to include communities, route targets, vpnv4/v6, etc
  • Support for advertising Egress IP addresses
  • Enable BFD to BGP peers
  • Support EVPN configuration and integration with a user’s DC fabric, along with MAC-VRFs and IP-VRFs
  • ECMP routing support within OVN for BGP learned routes
     
    Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed.  Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.
    Deployment considerations List applicable specific needs (N/A = not applicable)
    Self-managed, managed, or both  
    Classic (standalone cluster)  
    Hosted control planes  
    Multi node, Compact (three node), or Single node (SNO), or all  
    Connected / Restricted Network  
    Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x)  
    Operator compatibility  
    Backport needed (list applicable versions)  
    UI need (e.g. OpenShift Console, dynamic plugin, OCM)  
    Other (please specify)  

Design Document

Use Cases (Optional):

  • Integration with 3rd-party load balancers that send packets directly to OpenShift nodes with the destination IP address of a targeted pod, without needing custom operators to detect which node a pod is scheduled to and then add routes into the load balancer to send the packet to the right node.

Questions to Answer (Optional):

Out of Scope

  • EVPN integration
  • Support of any other routing protocol
  • Running separate BGP instances per VRF network
  • Support for any other type of L3VPN with BGP, including MPLS
  • Providing any type of API or operator to automatically connect two Kubernetes clusters via L3VPN
  • Replacing the support that MetalLB provides today for advertising service IPs
  • Asymmetric Integrated Routing and Bridging (IRB) with EVPN

Background

BGP

Importing Routes from the Provider Network
Today in OpenShift there is no API for a user to be able to configure routes into OVN. In order for a user to change how egress traffic leaving the cluster is routed, the user leverages local gateway mode, which forces egress traffic to hop through the Linux host's networking stack, where a user can configure routes inside of the host via NMState. This manual configuration would need to be performed and maintained across nodes and VRFs within each node.

Additionally, if a user chooses to not manage routes within the host and use local gateway mode, then by default traffic is always sent to the default gateway. The only other way to affect egress routing is by using the Multiple External Gateways (MEG) feature. With this feature the user may choose to have multiple different egress gateways per namespace to send traffic to.

As an alternative, configuring BGP peers and which route-targets to import would eliminate the need to manually configure routes in the host, and would allow dynamic routing updates based on changes in the provider’s network.

Exporting Routes into the Provider Network
There exists a need for provider networks to learn routes directly to services and pods today in Kubernetes. MetalLB is already one solution whereby load balancer IPs are advertised by BGP to provider networks, and this feature development does not intend to duplicate or replace the function of MetalLB. MetalLB should be able to interoperate with OVN-Kubernetes, and be responsible for advertising services to a provider’s network.

However, there is an alternative need to advertise pod IPs on the provider network. One use case is integration with 3rd party load balancers, where they terminate a load balancer and then send packets directly to OCP nodes with the destination IP address being the pod IP itself. Today these load balancers rely on custom operators to detect which node a pod is scheduled to and then add routes into its load balancer to send the packet to the right node.

By integrating BGP and advertising the pod subnets/addresses directly on the provider network, load balancers and other entities on the network would be able to reach the pod IPs directly.

EVPN (to be integrated with BGP in a follow-on release targeting 4.18)

Extending OVN-Kubernetes VRFs into the Provider Network
This is the most powerful motivation for bringing support of EVPN into OVN-Kubernetes. A previous development effort enabled the ability to create a network per namespace (VRF) in OVN-Kubernetes, allowing users to create multiple isolated networks for namespaces of pods. However, the VRFs terminate at node egress, and routes are leaked from the default VRF so that traffic is able to route out of the OCP node. With EVPN, we can now extend the VRFs into the provider network using a VPN. This unlocks the ability to have L3VPNs that extend across the provider networks.

Utilizing the EVPN Fabric as the Overlay for OVN-Kubernetes
In addition to extending VRFs to the outside world for ingress and egress, we can also leverage EVPN to handle extending VRFs into the fabric for east/west traffic. This is useful in EVPN DC deployments where EVPN is already being used in the TOR network, and there is no need to use a Geneve overlay. In this use case, both layer 2 (MAC-VRFs) and layer 3 (IP-VRFs) can be advertised directly to the EVPN fabric. One advantage of doing this is that with Layer 2 networks, broadcast, unknown-unicast and multicast (BUM) traffic is suppressed across the EVPN fabric. Therefore the flooding domain in L2 networks for this type of traffic is limited to the node.

Multi-homing, Link Redundancy, Fast Convergence
Extending the EVPN fabric to OCP nodes brings other added benefits that are not present in OCP natively today. In this design there are at least 2 physical NICs and links leaving the OCP node to the EVPN leaves. This provides link redundancy, and when coupled with BFD and mass withdrawal, it can also provide fast failover. Additionally, the links can be used by the EVPN fabric to utilize ECMP routing.

Customer Considerations

  • For customers using MetalLB, it will continue to function correctly regardless of this development.

Documentation Considerations

Interoperability Considerations

  • Multiple External Gateways (MEG)
  • Egress IP
  • Services
  • Egress Service
  • Egress Firewall
  • Egress QoS

 

Epic Goal

OVN Kubernetes support for BGP as a routing protocol.

Planning Done Checklist

The following items must be completed on the Epic prior to moving the Epic from Planning to the ToDo status

  • Priority is set by engineering
  • Epic must be linked to a Parent Feature
  • Target version must be set
  • Assignee must be set
  • Enhancement Proposal is Implementable
  • No outstanding questions about major work breakdown
  • Are all Stakeholders known? Have they all been notified about this item?
  • Does this epic affect SD? Have they been notified? (View plan definition for current suggested assignee)
    1. Please use the “Discussion Needed: Service Delivery Architecture Overview” checkbox to facilitate the conversation with SD Architects. The SD architecture team monitors this checkbox which should then spur the conversation between SD and epic stakeholders. Once the conversation has occurred, uncheck the “Discussion Needed: Service Delivery Architecture Overview” checkbox and record the outcome of the discussion in the epic description here.
    2. The guidance here is that unless it is very clear that your epic doesn’t have any managed services impact, default to use the Discussion Needed checkbox to facilitate that conversation.

Additional information on each of the above items can be found here: Networking Definition of Planned

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement
    details and documents.

...

Dependencies (internal and external)

1.

...

Previous Work (Optional):

1. …

Open questions::

1. …

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Epic Goal

  • Review design and development PRs that require feedback from NE team.

Why is this important?

  • Customer requires certificates to be managed by cert-manager on configured/newly added routes.

Acceptance Criteria

  • All PRs are reviewed and merged.

Dependencies (internal and external)

  1. CFE team dependency for addressing review suggestions.

Done Checklist

  • DEV - All related PRs are merged.

In the current version, the router does not support loading secrets directly; it loads the private key and certificates from the route resource, which exposes these security artifacts.

 

Acceptance criteria :

  1. Support the router loading secrets from a secret reference (see the illustration below).
  2. E2E testcases (Targeted for GA)
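
As a rough illustration of the intended usage (resource names are placeholders; the exact API shape is defined by the RouteExternalCertificate work), a route would reference a TLS secret instead of embedding the certificate and key inline:

apiVersion: route.openshift.io/v1
kind: Route
metadata:
  name: myedge
  namespace: demo
spec:
  host: myedge.apps.example.com
  to:
    kind: Service
    name: my-service
  tls:
    termination: edge
    # Reference a kubernetes.io/tls secret in the route's namespace instead of
    # inlining spec.tls.certificate and spec.tls.key in the route itself.
    externalCertificate:
      name: my-serving-cert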

Description of problem:

    should reduce error message details for Not Found secret when edit/patch route with spec.tls.externalCertificate

Version-Release number of selected component (if applicable):

    4.16.0-0.ci.test-2024-05-13-005506-ci-ln-05s0z32-latest

How reproducible:

    100%

Steps to Reproduce:

    1. enable TP feature "RouteExternalCertificate"
    2. create pod,svc and route
    3. oc -n hongli patch route myedge --type=merge --patch='{"spec":{"tls":{"externalCertificate":{"name": "newtls"}}}}'     

Actual results:

the error message:    
The Route "myedge" is invalid: spec.tls.externalCertificate: Not found: errors.StatusError{ErrStatus:v1.Status{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ListMeta:v1.ListMeta{SelfLink:"", ResourceVersion:"", Continue:"", RemainingItemCount:(*int64)(nil)}, Status:"Failure", Message:"secrets \"newtls\" not found", Reason:"NotFound", Details:(*v1.StatusDetails)(0xc0077e25a0), Code:404}}

Expected results:

    something like: `spec.tls.externalCertificate: Not found: "secrets \"newtls\" not found"`

Additional info:

    discuss in thread of https://redhat-internal.slack.com/archives/C06EK9ZH3Q8/p1715243443244879

 

Goal:
Support enablement of dual-stack VIPs on existing clusters created as dual-stack but at a time when it was not possible to have both v4 and v6 VIPs at the same time.

Why is this important?
This is a followup to SDN-2213 ("Support dual ipv4 and ipv6 ingress and api VIPs").

We expect that customers with existing dual stack clusters will want to make use of the new dual stack VIPs fixes/enablement, but it's unclear how this will work because we've never supported modifying on-prem networking configuration after initial deployment. Once we have dual stack VIPs enabled, we will need to investigate how to alter the configuration to add VIPs to an existing cluster.

We will need to make changes to the VIP fields in the Infrastructure and/or ControllerConfig objects. Infrastructure would be the first option since that would make all of the fields consistent, but that relies on the ability to change that object and have the changes persist and be propagated to the ControllerConfig. If that's not possible, we may need to make changes just in ControllerConfig.

For epics https://issues.redhat.com/browse/OPNET-14 and https://issues.redhat.com/browse/OPNET-80 we need a mechanism to change configuration values related to our static pods. Today that is not possible because all of the values are put in the status field of the Infrastructure object.

We had previously discussed this as part of https://issues.redhat.com/browse/OPNET-21 because there was speculation that people would want to move from internal LB to external, which would require mutating a value in Infrastructure. In fact, there was a proposal to put that value in the spec directly and skip the status field entirely, but that was discarded because a migration would be needed in that case and we need separate fields to indicate what was requested and what the current state actually is.

There was some followup discussion about that with Joel Speed from the API team (which unfortunately I have not been able to find a record of yet) where it was concluded that if/when we want to modify Infrastructure values we would add them to the Infrastructure spec and when a value was changed it would trigger a reconfiguration of the affected services, after which the status would be updated.

This means we will need new logic in MCO to look at the spec field (currently there are only fields in the status, so spec is ignored completely) and determine the correct behavior when they do not match. This will mean the values in ControllerConfig will not always match those in Infrastructure.Status. That's about as far as the design has gone so far, but we should keep the three use cases we know of (internal/external LB, VIP addition, and DNS record overrides) in mind as we design the underlying functionality to allow mutation of Infrastructure status values.

Depending on how the design works out, we may only track the design phase in this epic and do the implementation as part of one of the other epics. If there is common logic that is needed by all and can be implemented independently we could do that under this epic though.

For clusters that are installed as fresh 4.15, o/installer will propagate Infrastructure.Spec and Infrastructure.Status based on the install-config. However, for clusters that are upgraded, this code in o/installer will never run.

In order to have a consistent state at upgrade, we will make CNO propagate Status back to Spec when the cluster is upgraded to OCP 4.15.

As we have already done it when introducing multiple VIPs (API change that created plural field next to the singular), all the necessary code scaffolding is already in place.

Infrastructure.Spec will be modified by the end user. CNO needs to validate those changes and, if valid, propagate them to Infrastructure.Status.
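
A minimal sketch of what a spec-driven VIP change could look like, assuming the plural VIP fields are mirrored from the existing status into the platform spec (the field placement and names here are illustrative, not a final API):

apiVersion: config.openshift.io/v1
kind: Infrastructure
metadata:
  name: cluster
spec:
  platformSpec:
    type: BareMetal
    baremetal:
      # Desired state: an IPv6 VIP added alongside the existing IPv4 VIP.
      # Once reconciled, the corresponding status.platformStatus fields
      # would be updated to match.
      apiServerInternalIPs:
        - 192.0.2.10
        - fd2e:6f44:5dd8::10
      ingressIPs:
        - 192.0.2.11
        - fd2e:6f44:5dd8::11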

BU Priority Overview

Create custom roles for GCP with minimal set of required permissions.

Goals

Enable customers to better scope credential permissions and create custom roles on GCP that only include the minimum subset of what is needed for OpenShift.

State of the Business

Some of the service accounts that CCO creates, e.g. the service account with the role roles/iam.serviceAccountUser, provide elevated permissions that are not required/used by the requesting OpenShift components. This is because we use predefined GCP roles that come with a bunch of additional permissions. The goal is to create custom roles with only the required permissions.

Execution Plans

TBD

 

Evaluate if any of the GCP predefined roles in the credentials request manifest of Cluster Network Operator give elevated permissions. Remove any such predefined role from spec.predefinedRoles field and replace it with required permissions in the new spec.permissions field.

The new GCP provider spec for credentials request CR is as follows:

type GCPProviderSpec struct {
   metav1.TypeMeta `json:",inline"`
   // PredefinedRoles is the list of GCP pre-defined roles
   // that the CredentialsRequest requires.
   PredefinedRoles []string `json:"predefinedRoles"`
   // Permissions is the list of GCP permissions required to
   // create a more fine-grained custom role to satisfy the
   // CredentialsRequest.
   // When both Permissions and PredefinedRoles are specified
   // service account will have union of permissions from
   // both the fields
   Permissions []string `json:"permissions"`
   // SkipServiceCheck can be set to true to skip the check whether the requested roles or permissions
   // have the necessary services enabled
   // +optional
   SkipServiceCheck bool `json:"skipServiceCheck,omitempty"`
} 

we can use the following command to check permissions associated with a GCP predefined role

gcloud iam roles describe <role_name>

 

The sample output for role roleViewer is as follows. The permission are listed in "includedPermissions" field.

[akhilrane@localhost cloud-credential-operator]$ gcloud iam roles describe roles/iam.roleViewer
description: Read access to all custom roles in the project.
etag: AA==
includedPermissions:
- iam.roles.get
- iam.roles.list
- resourcemanager.projects.get
- resourcemanager.projects.getIamPolicy
name: roles/iam.roleViewer
stage: GA
title: Role Viewer

Update the GCP Credentials Request manifest of the Cluster Network Operator to use the new API field for requesting permissions.
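
A hedged sketch of such a manifest (the component name and the permission list below are placeholders, not the actual Cluster Network Operator request), showing the move from predefinedRoles to fine-grained permissions:

apiVersion: cloudcredential.openshift.io/v1
kind: CredentialsRequest
metadata:
  name: openshift-example-component-gcp
  namespace: openshift-cloud-credential-operator
spec:
  secretRef:
    name: gcp-credentials
    namespace: openshift-example-component
  providerSpec:
    apiVersion: cloudcredential.openshift.io/v1
    kind: GCPProviderSpec
    # Instead of a broad predefined role such as roles/iam.roleViewer,
    # list only the permissions the component actually needs.
    permissions:
      - compute.networks.get
      - compute.networks.list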

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

This enhances the EgressQoS CRD with status information and provides an implementation to update this field with relevant information when creating/updating an EgressQoS.
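
A hedged sketch of what this could look like on an EgressQoS object; the spec follows the existing CRD, while the status stanza shown here is illustrative only, since its exact schema is what this work defines:

apiVersion: k8s.ovn.org/v1
kind: EgressQoS
metadata:
  name: default
  namespace: demo
spec:
  egress:
    - dscp: 46
      dstCIDR: 203.0.113.0/24
status:
  # Illustrative only: the real fields are defined by this enhancement.
  conditions:
    - type: Ready
      status: "True"
      reason: EgressQoSApplied
      message: EgressQoS rules have been applied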

 

Review the OVN Interconnect proposal, figure out the work that needs to be done in ovn-kubernetes to be able to move to this new OVN architecture. 

Phase-2 of this project in continuation of what was delivered in the earlier release. 

Why is this important?

OVN IC will be the model used in Hypershift. 

OCP/Telco Definition of Done
Epic Template descriptions and documentation.

<--- Cut-n-Paste the entire contents of this description into your new Epic --->

Epic Goal

  • ...

Why is this important?

See https://docs.google.com/presentation/d/17wipFv5wNjn1KfFZBUaVHN3mAKVkMgGWgQYcvss2yQQ/edit#slide=id.g547716335e_0_220 

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions:

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>
  1. We need to be strict about what's at level 5 and what's below that; adding too many logs can make it difficult to catch the important ones, while adding too few can make it hard to debug. There needs to be a balance.
  2. nbctl logs were smaller than the libovsdb transact logs; as a result, in real deployments we see log rotation happening very fast, and there is a limit to the number of log files stored on nodes from a specific pod. We need to make an effort to reduce the log size as much as possible.

 

One idea that @kyrtapz had was to remove all nil fields in the transact and configure logs, since they don't provide any useful info anyway, and removing them can cut down the size of a single transact line.

See https://github.com/ovn-org/ovn-kubernetes/issues/3183 

Epic Goal

The OCP Console needs to detect if the ACM Operator has been installed; if it is detected, a new multi-cluster perspective option shows up in the perspective chooser.

As a user I need the ability to switch to the ACM UI from the OCP Console and vice versa without requiring the user to log in multiple times.

This option also needs to be hidden if the user doesn't have the correct RBAC.

Marvel design mockup

Why is this important?

  • Multi-cluster functionality is very important to our users. We need to provide a seamless experience for users.

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions:

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

The console should detect the presence of the ACM operator and add an Advanced Cluster Management item to the perspective switcher. We will need to work with the ACM team to understand how to detect the operator and how to discover the ACM URL.

Additionally, we will need to provide a query parameter or URL fragment to indicate which perspective to use. This will allow ACM to link back to a specific perspective since it will share the same perspective switcher in its UI. ACM will need to be able to discover the console URL.

This story does not include handling SSO, which will be tracked in a separate story.

We need to determine what RBAC checks to make before showing the ACM link.

Acceptance Criteria

1. Console shows a link to ACM in its perspective switcher
2. Console provides a way for ACM to link back to a specific perspective
3. The ACM option only appears when the ACM operator is installed
4. ACM should open in the same browser tab to give the appearance of it being one application
5. Only users with appropriate RBAC should see the link (access review TBD)

During the migration, a node will start as an SDN node (a hybrid overlay node from OVN-K perspective), then become an OVN-K node. So OVN-K needs to support such dynamical role switching.

We need to enhance cluster network operator to automate the whole SDN live-migration.

  • New API will be introduced to CNO to facilitate the migration
  • CNO shall be able to deploy OVN-K and SDN in 'migration mode'
  • CNO shall be able to annotate nodes to bypass the IP allocation of OVN-K while MCO is updating the MachineConfig of nodes
  • CNO shall be able to redeploy the OVN-K and SDN in 'regular mode' after the migration is done.

Goal

The goals of this feature are:

  • optimize and streamline the operations of HyperShift Operator (HO) on Azure Kubernetes Service (AKS) clusters
  • Enable auto-detection of the underlying environment (managed or self-managed) to optimize the HO accordingly.

Goal

We need to be able to install the HO with external DNS and create HCPs on AKS clusters

Why is this important?

  • AKS clusters will serve as the management clusters on ARO HCP.

Scenarios

  1. ...

Acceptance Criteria

  • Dev - Has a valid enhancement if necessary
  • CI - MUST be running successfully with tests automated
  • QE - covered in Polarion test plan and tests implemented
  • Release Technical Enablement - Must have TE slides
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions:

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Technical Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Enhancement merged: <link to meaningful PR or GitHub Issue>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

The cloud-network-config-operator is being deployed on HyperShift with `runAsNonRoot` set to true. When HCP is deployed on non-OpenShift management clusters, such as AKS, this needs to be unset so the pod can run as root.

This is currently causing issues deploying this pod on HCP on AKS with the following error:

      state:
        waiting:
          message: 'container has runAsNonRoot and image will run as root (pod: "cloud-network-config-controller-59d4677589-bpkfp_clusters-brcox-hypershift-arm(62a4b447-1df7-4e4a-9716-6e10ec55d8fd)", container: hosted-cluster-kubecfg-setup)'
          reason: CreateContainerConfigError 
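
The relevant setting is the container securityContext; a hedged sketch (names and image are illustrative) of what must change when the management cluster is AKS:

apiVersion: v1
kind: Pod
metadata:
  name: cloud-network-config-controller
  namespace: clusters-example
spec:
  containers:
    - name: hosted-cluster-kubecfg-setup
      image: quay.io/example/cloud-network-config-controller:latest
      securityContext:
        # On OpenShift management clusters this is set to true. On AKS the
        # image runs as root, so the field must be left unset (or false)
        # for the container to be admitted.
        runAsNonRoot: false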

We drive OpenShift cross-market customer success and new customer adoption with constant improvements and feature additions to the existing capabilities of our OpenShift Core Networking (SDN, Network Edge, Network Observability). This feature captures the natural progression of the product for development that does not align neatly to an existing Jira Feature.

Goals

  • Feature enhancements (performance, scale, configuration, UX, ...)
  • Modernization (incorporation and productization of new technologies)

Requirements

  • Core Networking Stability
  • Core Networking Performance and Scale
  • Core Networking Extensibility (e.g. Multus CNIs)
  • Core Networking UX (Observability)
  • Core Networking Security and Compliance

In Scope

  • SDN (CNI plugins: openshift-sdn, ovn-kubernetes, network plumbing, network policy, bare metal networking, traffic routing, ...)

Out of Scope

There are grey areas, but in general:

  • CNV
  • Service Mesh
  • CNF
  • Telco-specific customer solutions (versus in-product features)

Documentation Considerations

Questions to be addressed:

  • What educational or reference material (docs) is required to support this product feature? For users/admins? Other functions (security officers, etc)?
  • Does this feature have doc impact?
  • New Content, Updates to existing content, Release Note, or No Doc Impact
  • If unsure and no Technical Writer is available, please contact Content Strategy.
  • What concepts do customers need to understand to be successful in [action]?
  • How do we expect customers will use the feature? For what purpose(s)?
  • What reference material might a customer want/need to complete [action]?
  • Is there source material that can be used as reference for the Technical Writer in writing the content? If yes, please link if available.
  • What is the doc impact (New Content, Updates to existing content, or Release Note)?

Upstream K8s deprecated PodSecurityPolicy and replaced it with a new built-in admission controller that enforces the Pod Security Standards (see here for the motivations for deprecation). There is an OpenShift-specific dedicated pod admission system called Security Context Constraints. Our aim is to keep the Security Context Constraints pod admission system while also allowing users to have access to the Kubernetes Pod Security Admission.

With OpenShift 4.11, we turned on the Pod Security Admission with global "privileged" enforcement. Additionally, we set the "restricted" profile for warnings and audit. This configuration made it possible for users to opt their namespaces in to Pod Security Admission with the per-namespace labels. We also introduced a new mechanism that automatically synchronizes the Pod Security Admission "warn" and "audit" labels.

With OpenShift 4.15, we intend to move the global configuration to enforce the "restricted" pod security profile globally. With this change, the label synchronization mechanism will also switch into a mode where it synchronizes the "enforce" Pod Security Admission label rather than "audit" and "warn".
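
For reference, the per-namespace Pod Security Admission labels involved look like the following (the namespace name is illustrative); today the synchronization mechanism manages the audit and warn labels, and with the planned change it will manage enforce as well:

apiVersion: v1
kind: Namespace
metadata:
  name: example-app
  labels:
    # Synchronized today based on the SCCs available to the workloads.
    pod-security.kubernetes.io/audit: restricted
    pod-security.kubernetes.io/warn: restricted
    # With the 4.15 change, this label is synchronized and enforced as well.
    pod-security.kubernetes.io/enforce: restricted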

Epic Goal

Get Pod Security admission to be run in "restricted" mode globally by default alongside with SCC admission.

Modify the PodSecurityViolation alert to show namespace information. To prevent cardinality explosion on the namespace label, limit the values of the label to platform namespaces only ("openshift", "default", "kube-").

 

This will also need a carry patch in o/k

When creating a custom SCC, it is possible to assign a priority that is higher than existing SCCs. This means that any SA with access to all SCCs might use the higher priority custom SCC, and this might mutate a workload in an unexpected/unintended way.

To protect platform workloads from such an effect (which, combined with PSa, might result in rejecting the workload once we start enforcing the "restricted" profile) we must pin the required SCC to all workloads in platform namespaces (openshift-, kube-, default).

Each workload should pin the SCC with the least-privilege, except workloads in runlevel 0 namespaces that should pin the "privileged" SCC (SCC admission is not enabled on these namespaces, but we should pin an SCC for tracking purposes).
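
Pinning is done per workload; a minimal sketch (pod name, namespace and image are placeholders) using the openshift.io/required-scc annotation:

apiVersion: v1
kind: Pod
metadata:
  name: example-operator
  namespace: openshift-example
  annotations:
    # Pin the least-privileged SCC this workload needs so that a
    # higher-priority custom SCC cannot change the admission result.
    openshift.io/required-scc: restricted-v2
spec:
  containers:
    - name: operator
      image: quay.io/example/operator:latest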

The following tables track progress.

Progress summary

# namespaces 4.18 4.17 4.16 4.15
monitored 82 82 82 82
fix needed 69 69 69 69
fixed 34 30 30 39
remaining 35 39 39 30
~ remaining non-runlevel 15 19 19 10
~ remaining runlevel (low-prio) 20 20 20 20
~ untested 2 2 2 82

Progress breakdown

# namespace 4.18 4.17 4.16 4.15
1 oc debug node pods #1763 #1816 #1818
2 openshift-apiserver-operator #573 #581
3 openshift-authentication #656 #675
4 openshift-authentication-operator #656 #675
5 openshift-catalogd #50 #58
6 openshift-cloud-credential-operator #681 #736
7 openshift-cloud-network-config-controller #2282 #2490 #2496  
8 openshift-cluster-csi-drivers     #170 #459 #484
9 openshift-cluster-node-tuning-operator #968 #1117
10 openshift-cluster-olm-operator #54 n/a
11 openshift-cluster-samples-operator #535 #548
12 openshift-cluster-storage-operator     #459 #196 #484 #211
13 openshift-cluster-version     #1038 #1068
14 openshift-config-operator #410 #420
15 openshift-console #871 #908 #924
16 openshift-console-operator #871 #908 #924
17 openshift-controller-manager #336 #361
18 openshift-controller-manager-operator #336 #361
19 openshift-e2e-loki #56579 #56579 #56579 #56579
20 openshift-image-registry     #1008 #1067
21 openshift-infra        
22 openshift-ingress #1031      
23 openshift-ingress-canary #1031      
24 openshift-ingress-operator #1031      
25 openshift-insights     #915 #967
26 openshift-kni-infra #4504 #4542 #4539 #4540
27 openshift-kube-storage-version-migrator #107 #112
28 openshift-kube-storage-version-migrator-operator #107 #112
29 openshift-machine-api   #407 #315 #282 #1220 #73 #50 #433 #332 #326 #1288 #81 #57 #443
30 openshift-machine-config-operator   #4219 #4384 #4393
31 openshift-manila-csi-driver #234 #235 #236
32 openshift-marketplace     #561 #570
33 openshift-metallb-system #238 #240 #241  
34 openshift-monitoring     #2335 #2420
35 openshift-network-console        
36 openshift-network-diagnostics #2282 #2490 #2496  
37 openshift-network-node-identity #2282 #2490 #2496  
38 openshift-nutanix-infra #4504 #4504 #4539 #4540
39 openshift-oauth-apiserver #656 #675
40 openshift-openstack-infra #4504 #4504 #4539 #4540
41 openshift-operator-controller #100 #120
42 openshift-operator-lifecycle-manager #703 #828
43 openshift-route-controller-manager #336 #361
44 openshift-service-ca #235 #243
45 openshift-service-ca-operator #235 #243
46 openshift-sriov-network-operator #754 #995 #999 #1003
47 openshift-storage        
48 openshift-user-workload-monitoring #2335 #2420
49 openshift-vsphere-infra #4504 #4542 #4539 #4540
50 (runlevel) kube-system        
51 (runlevel) openshift-cloud-controller-manager        
52 (runlevel) openshift-cloud-controller-manager-operator        
53 (runlevel) openshift-cluster-api        
54 (runlevel) openshift-cluster-machine-approver        
55 (runlevel) openshift-dns        
56 (runlevel) openshift-dns-operator        
57 (runlevel) openshift-etcd        
58 (runlevel) openshift-etcd-operator        
59 (runlevel) openshift-kube-apiserver        
60 (runlevel) openshift-kube-apiserver-operator        
61 (runlevel) openshift-kube-controller-manager        
62 (runlevel) openshift-kube-controller-manager-operator        
63 (runlevel) openshift-kube-proxy        
64 (runlevel) openshift-kube-scheduler        
65 (runlevel) openshift-kube-scheduler-operator        
66 (runlevel) openshift-multus        
67 (runlevel) openshift-network-operator        
68 (runlevel) openshift-ovn-kubernetes        
69 (runlevel) openshift-sdn        

Feature Overview

  • Customers want to create and manage OpenShift clusters using managed identities for Azure resources for authentication.

Goals

  • A customer using ARO wants to spin up an OpenShift cluster with "az aro create" without needing additional input, i.e. without the need for an AD account or service principal credentials, and the identity used is never visible to the customer and cannot appear in the cluster.
  • As an administrator, I want to deploy OpenShift 4 and run Operators on Azure using access controls (IAM roles) with temporary, limited privilege credentials.

Requirements

  • Azure managed identities must work for installation with all install methods including IPI and UPI, work with upgrades, and day-to-day cluster lifecycle operations.
  • Support HyperShift and non-HyperShift clusters.
  • Support use of Operators with Azure managed identities.
  • Support in all Azure regions where Azure managed identity is available. Note: Federated credentials is associated with Azure Managed Identity, and federated credentials is not available in all Azure regions.

More details at ARO managed identity scope and impact.

 

This Section: A list of specific needs or objectives that a Feature must deliver to satisfy the Feature. Some requirements will be flagged as MVP. If an MVP gets shifted, the feature shifts. If a non-MVP requirement slips, it does not shift the feature.

Requirement | Notes | isMvp?
CI - MUST be running successfully with test automation | This is a requirement for ALL features. | YES
Release Technical Enablement | Provide necessary release enablement details and documents. | YES

(Optional) Use Cases

This Section:

  • Main success scenarios - high-level user stories
  • Alternate flow/scenarios - high-level user stories
  • ...

Questions to answer…

  • ...

Out of Scope

Background, and strategic fit

This Section: What does the person writing code, testing, documenting need to know? What context can be provided to frame this feature.

Assumptions

  • ...

Customer Considerations

  • ...

Documentation Considerations

Questions to be addressed:

  • What educational or reference material (docs) is required to support this product feature? For users/admins? Other functions (security officers, etc)?
  • Does this feature have doc impact?
  • New Content, Updates to existing content, Release Note, or No Doc Impact
  • If unsure and no Technical Writer is available, please contact Content Strategy.
  • What concepts do customers need to understand to be successful in [action]?
  • How do we expect customers will use the feature? For what purpose(s)?
  • What reference material might a customer want/need to complete [action]?
  • Is there source material that can be used as reference for the Technical Writer in writing the content? If yes, please link if available.
  • What is the doc impact (New Content, Updates to existing content, or Release Note)?

References

Epic Goal

  • Build list of specific permissions to run Openshift on Azure - Components grant roles, but we need more granularity.
  • Determine and document the Azure roles and required permissions for Azure managed identity.

Why is this important?

  • Many of our customers have security policies in their organization that restrict credentials to only minimal permissions that conflict with the documented list of permissions needed for OpenShift. Customers need to know the explicit list of permissions minimally needed for deploying and running OpenShift and what they're used for so they can request the right permissions. Without this information, it can/will block adoption of OpenShift 4 in many cases.

Scenarios

  1. ...

Acceptance Criteria

  • Document explicit list of required credential permissions for installing (Day 1) OpenShift on Azure using the IPI and UPI deployment workflows and what each of the permissions are used for.
  • Document explicit list of required role and credential permissions for the operation (Day 2) of an OpenShift cluster on Azure and what each of the permissions are used for
  • Verify minimum list of permissions for Azure with IPI and UPI installation workflows
  • (Day 2) operations of OpenShift on Azure - MUST complete successfully with automated tests
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. Installer [both UPI & IPI Workflows]
  2. Control Plane
    • Kube Controller Manager
  3. Compute [Managed Identity]
  4. Cloud API enabled components
    • Cloud Credential Operator
    • Machine API
    • Internal Registry
    • Ingress
  5. ?
  6.  

Previous Work (Optional):

Open questions:

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

 

 

Epic Overview

  • Enable customers to create and manage OpenShift clusters using managed identities for Azure resources for authentication.
  • A customer using ARO wants to spin up an OpenShift cluster with "az aro create" without needing additional input, i.e. without the need for an AD account or service principal credentials, and the identity used is never visible to the customer and cannot appear in the cluster.

Epic Goal

  • A customer creates an OpenShift cluster ("az aro create") using Azure managed identity.
  • Azure managed identities must work for installation with all install methods including IPI and UPI, work with upgrades, and day-to-day cluster lifecycle operations.
  • After Azure deprecated their old API and failed to provide workable golang API changes, we removed mint mode and work entirely in passthrough mode. Azure has plans to implement pod/workload identity similar to how it has been implemented in AWS and GCP, and when this feature is available, we should implement permissions similar to AWS/GCP.
  • This work cannot start until Azure have implemented this feature - as such, this Epic is a placeholder to track the effort when available.

Why is this important?

  • Microsoft and the customer would prefer that we use Managed Identities vs. Service Principal (which requires putting the Service Principal and principal password in clear text within the azure.conf file).

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions:

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

 

 

This effort is dependent on the completion of work for CCO-187, and effort in dependent modules is planned to be worked on by the CCO team unless individual repo owners can help. Operators owners/teams will be expected to review merge requests and complete appropriate QE effort for an openshift release.

  • azure-sdk-for-go module dependency updated to support workload identity federation.
  • Mount the OIDC token in the operator pod. This needs to go in the deployment. See example from addition to the cluster-image-registry-operator here

This effort is dependent on the completion of work for CCO-187, and effort in dependent modules is planned to be worked on by the CCO team unless individual repo owners can help. Operators owners/teams will be expected to review merge requests and complete appropriate QE effort for an openshift release.

  • azure-sdk-for-go module dependency updated to support workload identity federation.
  • Mount the OIDC token in the operator pod. This needs to go in the deployment. See example from addition to the cluster-image-registry-operator here

While trying to block requests going from the pods to different domain names, for example:

  • registry.access.redhat.com
  • registry.access.redhat.com.edgekey.net
  • registry-1.docker.io

Here, the EgressNetworkPolicy works for `registry.access.redhat.com` and `registry.access.redhat.com.edgekey.net`; however, for `registry-1.docker.io`, it is not denying access despite the deny entry being configured.

"Domain name updates are polled based on the TTL (time to live) value of the domain returned by the local non-authoritative servers. The pod should also resolve the domain from the same local nameservers when necessary, otherwise, the IP addresses for the domain perceived by the egress network policy controller and the pod will be different, and the egress network policy may not be enforced as expected. Since egress network policy controller and pod are asynchronously polling the same local nameserver, there could be a race condition where pod may get the updated IP before the egress controller. Due to this current limitation, domain name usage in EgressNetworkPolicy is only recommended for domains with infrequent IP address changes."

[1] https://docs.openshift.com/container-platform/3.11/admin_guide/managing_networking.html#admin-guide-limit-pod-access-egress

The aim of this feature is to fix this and also to support wildcard entries in EgressNetworkPolicy.
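
For context, an EgressNetworkPolicy of the kind described above looks roughly like this (the namespace and rule ordering are illustrative):

apiVersion: network.openshift.io/v1
kind: EgressNetworkPolicy
metadata:
  name: default
  namespace: demo
spec:
  egress:
    # Deny by DNS name; enforcement depends on the controller and the pod
    # resolving the name to the same addresses (see the limitation quoted above).
    - type: Deny
      to:
        dnsName: registry-1.docker.io
    - type: Deny
      to:
        dnsName: registry.access.redhat.com
    # Allow all remaining egress traffic.
    - type: Allow
      to:
        cidrSelector: 0.0.0.0/0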

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Make the changes as per the proposed enhancement https://github.com/openshift/enhancements/pull/1335

  • To add support for the DNSNameResolver CRD in OVN-K, add the flag --enable-dns-name-resolver to the corresponding OVN-K pods.

Note: The flag should be added to OVN-K after checking if the feature-gate DNSNameResolver is enabled.

  • Add RBAC permissions for DNSNameResolver resources to ovn-kubernetes. The following permissions should be added to the 002-rbac-node.yaml, 003-rbac-controller.yaml and 004-rbac-control-plane.yaml files in the bindata/network/ovn-kubernetes/common/ directory:
- apiGroups: ["network.openshift.io"]
  resources:
  - dnsnameresolvers
  verbs:
  - create
  - delete
  - get
  - list
  - patch
  - update
  - watch 
  • Update the 001-crd.yaml file in bindata/network/ovn-kubernetes/common/ directory with the latest EgressFirewall CRD.
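
A hedged sketch of a DNSNameResolver object as proposed in the enhancement (the object name and the status contents are illustrative; the enhancement is the authoritative schema):

apiVersion: network.openshift.io/v1alpha1
kind: DNSNameResolver
metadata:
  name: example-resolver
  namespace: demo
spec:
  # The DNS name whose resolved addresses should be tracked; the trailing
  # dot denotes a fully qualified domain name.
  name: registry-1.docker.io.
status:
  # Illustrative only.
  resolvedNames:
    - dnsName: registry-1.docker.io.
      resolvedAddresses:
        - ip: 203.0.113.20
          ttlSeconds: 300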

Epic Goal*

What is our purpose in implementing this?  What new capability will be available to customers?

 
Why is this important? (mandatory)

What are the benefits to the customer or Red Hat?   Does it improve security, performance, supportability, etc?  Why is work a priority?

 
Scenarios (mandatory) 

Provide details for user scenarios including actions to be performed, platform specifications, and user personas.  

  1.  

 
Dependencies (internal and external) (mandatory)

What items must be delivered by other teams/groups to enable delivery of this epic. 

Contributing Teams(and contacts) (mandatory) 

Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.

  • Development - 
  • Documentation -
  • QE - 
  • PX - 
  • Others -

Acceptance Criteria (optional)

Provide some (testable) examples of how we will know if we have achieved the epic goal.  

Drawbacks or Risk (optional)

Reasons we should consider NOT doing this such as: limited audience for the feature, feature will be superseded by other work that is planned, resulting feature will introduce substantial administrative complexity or user confusion, etc.

Done - Checklist (mandatory)

The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.

  • CI Testing - Basic e2e automation tests are merged and completing successfully
  • Documentation - Content development is complete.
  • QE - Test scenarios are written and executed successfully.
  • Technical Enablement - Slides are complete (if requested by PLM)
  • Engineering Stories Merged
  • All associated work items with the Epic are closed
  • Epic status should be “Release Pending” 

Epic Goal*

What is our purpose in implementing this?  What new capability will be available to customers?

 
Why is this important? (mandatory)

What are the benefits to the customer or Red Hat?   Does it improve security, performance, supportability, etc?  Why is work a priority?

 
Scenarios (mandatory) 

Provide details for user scenarios including actions to be performed, platform specifications, and user personas.  

  1.  

 
Dependencies (internal and external) (mandatory)

What items must be delivered by other teams/groups to enable delivery of this epic. 

Contributing Teams(and contacts) (mandatory) 

Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.

  • Development - 
  • Documentation -
  • QE - 
  • PX - 
  • Others -

Acceptance Criteria (optional)

Provide some (testable) examples of how we will know if we have achieved the epic goal.  

Drawbacks or Risk (optional)

Reasons we should consider NOT doing this such as: limited audience for the feature, feature will be superseded by other work that is planned, resulting feature will introduce substantial administrative complexity or user confusion, etc.

Done - Checklist (mandatory)

The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.

  • CI Testing - Basic e2e automation tests are merged and completing successfully
  • Documentation - Content development is complete.
  • QE - Test scenarios are written and executed successfully.
  • Technical Enablement - Slides are complete (if requested by PLM)
  • Engineering Stories Merged
  • All associated work items with the Epic are closed
  • Epic status should be “Release Pending” 

Template:

Networking Definition of Planned

Epic Template descriptions and documentation

Epic Goal

With ovn-ic we have multiple actors (zones) setting status on some CRs. We need to make sure individual zone statuses are reported and then optionally merged to a single status

Why is this important?

Without that change zones will overwrite each others statuses.

Planning Done Checklist

The following items must be completed on the Epic prior to moving the Epic from Planning to the ToDo status

  • Priority is set by engineering
  • Epic must be linked to a Parent Feature
  • Target version must be set
  • Assignee must be set
  • Enhancement Proposal is Implementable
  • No outstanding questions about major work breakdown
  • Are all Stakeholders known? Have they all been notified about this item?
  • Does this epic affect SD? Have they been notified? (View plan definition for current suggested assignee)
    1. Please use the “Discussion Needed: Service Delivery Architecture Overview” checkbox to facilitate the conversation with SD Architects. The SD architecture team monitors this checkbox which should then spur the conversation between SD and epic stakeholders. Once the conversation has occurred, uncheck the “Discussion Needed: Service Delivery Architecture Overview” checkbox and record the outcome of the discussion in the epic description here.
    2. The guidance here is that unless it is very clear that your epic doesn’t have any managed services impact, default to use the Discussion Needed checkbox to facilitate that conversation.

Additional information on each of the above items can be found here: Networking Definition of Planned

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement
    details and documents.

...

Dependencies (internal and external)

1.

...

Previous Work (Optional):

1. …

Open questions:

1. …

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

OCP/Telco Definition of Done
Epic Template descriptions and documentation.

<--- Cut-n-Paste the entire contents of this description into your new Epic --->

Epic Goal

  • ...

Why is this important?

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions:

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

This card is about:

  • making the 4.13 -> 4.14 upgrade compulsory, that is, we shouldn't allow an upgrade from a version < 4.14 directly to a version > 4.14, in order to ensure that the IC upgrade logic on 4.14 is executed.
  • removing in CNO 4.15 all the code related to the non-IC to IC upgrade: the logic itself in ovn_kubernetes.go, the operator status hack in pod_status.go, the yamls in the single-zone folder and the yamls (which are mostly symlinks) in the multizone-tmp folder.

Feature Overview (aka. Goal Summary)

Migrate every occurrence of iptables in OpenShift to use nftables, instead.

Goals (aka. expected user outcomes)

Implement a full migration from iptables to nftables within a series of "normal" upgrades of OpenShift with the goal of not causing any more network disruption than would normally be required for an OpenShift upgrade. (Different components may migrate from iptables to nftables in different releases; no coordination is needed between unrelated components.)

Requirements (aka. Acceptance Criteria):

  • Discover what components are using iptables (directly or indirectly, e.g. via ipfailover) and reduce the “unknown unknowns”.
  • Port components away from iptables.

Use Cases (Optional):

Questions to Answer (Optional):

  • Do we need a better “warning: you are using iptables” warning for customers? (eg, per-container rather than per-node, which always fires because OCP itself is using iptables). This could help provide improved visibility of the issue to other components that aren't sure if they need to take action and migrate to nftables, as well.

Out of Scope

  • Non-OVN primary CNI plug-in solutions

Background

Customer Considerations

  • What happens to clusters that don't migrate all iptables use to nftables?
    • In RHEL 9.x it will generate a single log message during node startup on every OpenShift node. There are Insights rules that will trigger on all OpenShift nodes.
    • In RHEL 10 iptables will just no longer work at all. Neither the command-line tools nor the kernel modules will be present.

Documentation Considerations

Interoperability Considerations

Template:

 

Networking Definition of Planned

Epic Template descriptions and documentation 

 

Epic Goal

  • OCP needs to detect when customer workloads are making use of iptables, and present this information to the customer (e.g. via alerts, metrics, insights, etc)
  • The RHEL 9 kernel logs a warning if iptables is used at any point anywhere in the system, but this is not helpful because OCP itself still uses iptables, so the warning is always logged.
  • We need to avoid false positives due to OCP's own use of iptables in pod namespaces (e.g. the rules to block access to the MCS). Porting those rules to nftables sooner rather than later is one solution.

Why is this important?

  • iptables will not exist in RHEL 10, so if customers are depending on it, they need to be warned.
  • Contrariwise, we are getting questions from customers who are not using iptables in their own workload containers, who are confused about the kernel warning. Clearer messaging should help reduce confusion here.

Planning Done Checklist

The following items must be completed on the Epic prior to moving the Epic from Planning to the ToDo status

  • Priority is set by engineering
  • Epic must be Linked to a Parent Feature
  • Target version must be set
  • Assignee must be set
  • Enhancement Proposal is Implementable
  • No outstanding questions about major work breakdown
  • Are all Stakeholders known? Have they all been notified about this item?
  • Does this epic affect SD? Have they been notified? (View plan definition for current suggested assignee)

Additional information on each of the above items can be found here: Networking Definition of Planned

 

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement
    details and documents.

...

Dependencies (internal and external)

1.

...

 

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Consume the newly introduced API and apply the scheduling configuration (taints and node selectors) to network-check-source and network-check-target.

Feature Overview (aka. Goal Summary)

A guest cluster can use an external OIDC token issuer.  This will allow machine-to-machine authentication workflows

Goals (aka. expected user outcomes)

A guest cluster can configure OIDC providers to support the current capability: https://kubernetes.io/docs/reference/access-authn-authz/authentication/#openid-connect-tokens and the future capability: https://github.com/kubernetes/kubernetes/blob/2b5d2cf910fd376a42ba9de5e4b52a53b58f9397/staging/src/k8s.io/apiserver/pkg/apis/apiserver/types.go#L164 with an API that 

  1. allows fixing mistakes
  2. alerts the owner of the configuration that it's likely that there is a misconfiguration (self-service)
  3. makes distinction between product failure (expressed configuration not applied) from configuration failure (the expressed configuration was wrong), easy to determine
  4. makes cluster recovery possible in cases where the external token issuer is permanently gone
  5. allow (might not require) removal of the existing oauth server
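
A hedged sketch of what such a configuration could look like, modeled on the structured authentication configuration referenced above; the issuer URL, audiences and claim names are placeholders, and the exact OpenShift API surface is what this feature defines:

apiVersion: config.openshift.io/v1
kind: Authentication
metadata:
  name: cluster
spec:
  type: OIDC
  oidcProviders:
    - name: external-issuer
      issuer:
        issuerURL: https://login.example.com/tenant-id/v2.0
        audiences:
          - openshift-console
      claimMappings:
        username:
          claim: email
          prefixPolicy: NoPrefix
        groups:
          claim: groups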

 

Requirements (aka. Acceptance Criteria):

A list of specific needs or objectives that a feature must deliver in order to be considered complete. Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc. Initial completion during Refinement status.

Use Cases (Optional):

Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.

Questions to Answer (Optional):

Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.

Out of Scope

High-level list of items that are out of scope. Initial completion during Refinement status.

Background

Provide any additional context is needed to frame the feature. Initial completion during Refinement status.

Customer Considerations

Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.

Documentation Considerations

Provide information that needs to be considered and planned so that documentation will meet customer needs. If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.

Interoperability Considerations

Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.

A guest cluster can use an external OIDC token issuer.  This will allow machine-to-machine authentication workflows

Goals (aka. expected user outcomes)

A guest cluster can configure OIDC providers to support the current capability: https://kubernetes.io/docs/reference/access-authn-authz/authentication/#openid-connect-tokens and the future capability: https://github.com/kubernetes/kubernetes/blob/2b5d2cf910fd376a42ba9de5e4b52a53b58f9397/staging/src/k8s.io/apiserver/pkg/apis/apiserver/types.go#L164 with an API that 

  1. allows fixing mistakes
  2. alerts the owner of the configuration that it's likely that there is a misconfiguration (self-service)
  3. makes distinction between product failure (expressed configuration not applied) from configuration failure (the expressed configuration was wrong), easy to determine
  4. makes cluster recovery possible in cases where the external token issuer is permanently gone
  5. allow (might not require) removal of the existing oauth server

Epic Goal

  • Add an API extension for North-South IPsec.
  • close gaps from SDN-3604 - mainly around upgrade
  • add telemetry

Why is this important?

  • Without an API, customers are forced to use MCO. This brings with it a set of limitations (mainly a reboot per change, and the fact that config is shared among each pool, so per-node configuration is not possible).
  • A better upgrade solution will give us the ability to support a single host based implementation.
  • Telemetry will give us more info on how widely IPsec is used.
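
As a hedged illustration of the direction (the exact API extension is what this epic defines), the east-west IPsec toggle already hangs off the cluster Network operator configuration, and a mode-style field along these lines could distinguish pod-to-pod from north-south encryption:

apiVersion: operator.openshift.io/v1
kind: Network
metadata:
  name: cluster
spec:
  defaultNetwork:
    type: OVNKubernetes
    ovnKubernetesConfig:
      ipsecConfig:
        # Illustrative: "Full" covering pod-to-pod plus north-south traffic,
        # "External" covering only traffic leaving the cluster.
        mode: Full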

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • Must allow for the possibility of offloading the IPsec encryption to a SmartNIC.

 

  • nmstate
  • k8s-nmstate
  • easier mechanism for cert injection (??)
  • telemetry

Dependencies (internal and external)

  1.  

Related:

  • ITUP-44 - OpenShift support for North-South OVN IPSec
  • HATSTRAT-33 - Encrypt All Traffic to/from Cluster (aka IPSec as a Service)

Previous Work (Optional):

  1. SDN-717 - Support IPSEC on ovn-kubernetes
  2. SDN-3604 - Fully supported non-GA N-S IPSec implementation using machine config.

Open questions:

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>
The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://redhat-internal.slack.com/archives/GQ0CU2623/p1692107036750429?thread_ts=1689276746.185269&cid=GQ0CU2623 

This card adds support for implementing ANP.Egress.Networks Peer in OVNKubernetes:

  1. vendoring in api from netpol api repo
  2. designing the ovnk pieces into the existing controller
  3. writing unit tests
  4. bringing in the conformance tests from upstream
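
A hedged example of the kind of policy this enables, using the upstream AdminNetworkPolicy API with a Networks egress peer (the labels and CIDRs are placeholders):

apiVersion: policy.networking.k8s.io/v1alpha1
kind: AdminNetworkPolicy
metadata:
  name: restrict-external-egress
spec:
  priority: 10
  subject:
    namespaces:
      matchLabels:
        kubernetes.io/metadata.name: demo
  egress:
    - name: deny-to-lab-range
      action: Deny
      to:
        # The Networks peer selects destinations by CIDR, typically addresses
        # outside the cluster.
        - networks:
            - 203.0.113.0/24
    - name: allow-other-egress
      action: Allow
      to:
        - networks:
            - 0.0.0.0/0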

Feature Overview (aka. Goal Summary)

Migrate every occurrence of iptables in OpenShift to use nftables, instead.

Goals (aka. expected user outcomes)

Implement a full migration from iptables to nftables within a series of "normal" upgrades of OpenShift with the goal of not causing any more network disruption than would normally be required for an OpenShift upgrade. (Different components may migrate from iptables to nftables in different releases; no coordination is needed between unrelated components.)

Requirements (aka. Acceptance Criteria):

  • Discover what components are using iptables (directly or indirectly, e.g. via ipfailover) and reduce the “unknown unknowns”.
  • Port components away from iptables.

Use Cases (Optional):

Questions to Answer (Optional):

  • Do we need a better “warning: you are using iptables” warning for customers? (eg, per-container rather than per-node, which always fires because OCP itself is using iptables). This could help provide improved visibility of the issue to other components that aren't sure if they need to take action and migrate to nftables, as well.

Out of Scope

  • Non-OVN primary CNI plug-in solutions

Background

Customer Considerations

  • What happens to clusters that don't migrate all iptables use to nftables?
    • In RHEL 9.x it will generate a single log message during node startup on every OpenShift node. There are Insights rules that will trigger on all OpenShift nodes.
    • In RHEL 10 iptables will just no longer work at all. Neither the command-line tools nor the kernel modules will be present.

Documentation Considerations

Interoperability Considerations

Template:

 

Networking Definition of Planned

Epic Template descriptions and documentation 

 

Epic Goal

  • Replace the random bits of iptables glue in ovn-kubernetes with exactly equivalent nftables versions

Why is this important?

  • iptables will not be supported in RHEL 10, so we need to replace all uses of it in OCP with nftables. See OCPSTRAT-873.

Planning Done Checklist

The following items must be completed on the Epic prior to moving the Epic from Planning to the ToDo status

  • Priority is set by engineering
  • Epic must be Linked to a Parent Feature
  • Target version must be set
  • Assignee must be set
  • Enhancement Proposal is Implementable
  • No outstanding questions about major work breakdown
  • Are all Stakeholders known? Have they all been notified about this item?
  • Does this epic affect SD? Have they been notified? (View plan definition for current suggested assignee)

Additional information on each of the above items can be found here: Networking Definition of Planned

 

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement
    details and documents.

...

Dependencies (internal and external)

1.

...

 

Feature Overview (aka. Goal Summary)

When the internal oauth-server and oauth-apiserver are removed and replaced with an external OIDC issuer (like azure AD), the console must work for human users of the external OIDC issuer.

Goals (aka. expected user outcomes)

An end user can use the openshift console without a notable difference in experience.  This must eventually work on both hypershift and standalone, but hypershift is the first priority if it impacts delivery

Requirements (aka. Acceptance Criteria):

  1. User can log in and use the console
  2. User can get a kubeconfig that functions on the CLI with matching oc
  3. Both of those work on hypershift
  4. both of those work on standalone.

 

OCP/Telco Definition of Done
Epic Template descriptions and documentation.

<--- Cut-n-Paste the entire contents of this description into your new Epic --->

Epic Goal

  • When installed with external OIDC, the clientID and clientSecret need to be configurable to match the external (and unmanaged) OIDC server

Why is this important?

  • Without a configurable clientID and secret, I don't think the console can identify the user.
  • There must be a mechanism to do this on both hypershift and openshift, though the API may be very similar.
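
As a purely illustrative sketch (names and namespace are assumptions, and the actual API surface for wiring this into the console was still to be decided at this point), the client secret issued by the external OIDC provider could simply live in a Secret that the console configuration references, alongside the client ID:

apiVersion: v1
kind: Secret
metadata:
  name: console-oidc-client        # assumed name
  namespace: openshift-config      # assumed namespace; the console configuration would reference this Secret
type: Opaque
stringData:
  clientID: openshift-console                            # client registered with the external OIDC issuer
  clientSecret: <client secret issued by the OIDC issuer>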

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

OCP/Telco Definition of Done
Epic Template descriptions and documentation.

<--- Cut-n-Paste the entire contents of this description into your new Epic --->

Epic Goal

  • The goal of this epic is to capture all of the work and effort required to update the OpenShift control plane to upstream Kubernetes v1.29

Why is this important?

  • Rebasing is a required process for every OCP release in order to leverage all the new features implemented upstream

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

  1. Following epic captured the previous rebase work of k8s v1.28
    https://issues.redhat.com/browse/STOR-1425 

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

While monitoring the payload job failures, open a parallel openshift/origin bump.

Note: There is a high chance of job failures in the openshift/origin bump until the openshift/kubernetes PR merges, as we only update the tests and not the actual kube.

 

The benefit of opening this PR before the ocp/k8s merge is to identify and fix issues beforehand.

Prev Ref: https://github.com/openshift/origin/pull/28097 

Goal Summary

This feature aims to make sure that the HyperShift operator and the control plane it deploys use Managed Service Identities (MSI) and have access to scoped credentials (potentially also via access to AKS's image gallery). Additionally, operators deployed in the customer's account (system components) would be scoped with Azure workload identities.

Template:

Networking Definition of Planned

Epic Template descriptions and documentation

Epic Goal

Support Managed Service Identity (MSI) authentication in Azure.

Why is this important?

Controllers that require cloud access and run on the control plane side in ARO hosted clusters will need to use MSI to acquire tokens to interact with the hosted cluster's cloud resources.

The cluster network operator runs the following pods that require cloud credentials:

  • cloud-network-config-controller

The following components use the token-minter but do not require cloud access:

  • network-node-identity
  • ovnkube-control-plane

 

These pods will need to use MSI when running in hosted control plane mode.

Planning Done Checklist

The following items must be completed on the Epic prior to moving the Epic from Planning to the ToDo status

  • Priority is set by engineering
  • Epic must be linked to a Parent Feature
  • Target version must be set
  • Assignee must be set
  • Enhancement Proposal is Implementable
  • No outstanding questions about major work breakdown
  • Are all Stakeholders known? Have they all been notified about this item?
  • Does this epic affect SD? Have they been notified? (View plan definition for current suggested assignee)
    1. Please use the “Discussion Needed: Service Delivery Architecture Overview” checkbox to facilitate the conversation with SD Architects. The SD architecture team monitors this checkbox which should then spur the conversation between SD and epic stakeholders. Once the conversation has occurred, uncheck the “Discussion Needed: Service Delivery Architecture Overview” checkbox and record the outcome of the discussion in the epic description here.
    2. The guidance here is that unless it is very clear that your epic doesn’t have any managed services impact, default to use the Discussion Needed checkbox to facilitate that conversation.

Additional information on each of the above items can be found here: Networking Definition of Planned

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement
    details and documents.

...

Dependencies (internal and external)

1.

...

Previous Work (Optional):

1. …

Open questions::

1. …

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Template:

Networking Definition of Planned

Epic Template descriptions and documentation

Epic Goal

openshift-sdn is no longer part of OCP in 4.17, so remove references to it in the networking APIs.

Consider whether we can remove the entire network.openshift.io API, which will now be no-ops.

In places where both sdn and ovn-k are supported, remove references to sdn.

In some places (notably the migration API), we will probably leave an API in place that currently has no purpose.

Why is this important?

Planning Done Checklist

The following items must be completed on the Epic prior to moving the Epic from Planning to the ToDo status

  • Priority is set by engineering
  • Epic must be linked to a Parent Feature
  • Target version must be set
  • Assignee must be set
  • Enhancement Proposal is Implementable
  • No outstanding questions about major work breakdown
  • Are all Stakeholders known? Have they all been notified about this item?
  • Does this epic affect SD? Have they been notified? (View plan definition for current suggested assignee)
    1. Please use the “Discussion Needed: Service Delivery Architecture Overview” checkbox to facilitate the conversation with SD Architects. The SD architecture team monitors this checkbox which should then spur the conversation between SD and epic stakeholders. Once the conversation has occurred, uncheck the “Discussion Needed: Service Delivery Architecture Overview” checkbox and record the outcome of the discussion in the epic description here.
    2. The guidance here is that unless it is very clear that your epic doesn’t have any managed services impact, default to use the Discussion Needed checkbox to facilitate that conversation.

Additional information on each of the above items can be found here: Networking Definition of Planned

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement
    details and documents.

...

Dependencies (internal and external)

1.

...

Previous Work (Optional):

1. …

Open questions::

1. …

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Template:

Networking Definition of Planned

Epic Template descriptions and documentation

Epic Goal

openshift-sdn is no longer part of OCP in 4.17, so CNO must stop referring to its image

Why is this important?

Planning Done Checklist

The following items must be completed on the Epic prior to moving the Epic from Planning to the ToDo status

  • Priority is set by engineering
  • Epic must be linked to a Parent Feature
  • Target version must be set
  • Assignee must be set
  • Enhancement Proposal is Implementable
  • No outstanding questions about major work breakdown
  • Are all Stakeholders known? Have they all been notified about this item?
  • Does this epic affect SD? Have they been notified? (View plan definition for current suggested assignee)
    1. Please use the “Discussion Needed: Service Delivery Architecture Overview” checkbox to facilitate the conversation with SD Architects. The SD architecture team monitors this checkbox which should then spur the conversation between SD and epic stakeholders. Once the conversation has occurred, uncheck the “Discussion Needed: Service Delivery Architecture Overview” checkbox and record the outcome of the discussion in the epic description here.
    2. The guidance here is that unless it is very clear that your epic doesn’t have any managed services impact, default to use the Discussion Needed checkbox to facilitate that conversation.

Additional information on each of the above items can be found here: Networking Definition of Planned

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement
    details and documents.

...

Dependencies (internal and external)

1.

...

Previous Work (Optional):

1. …

Open questions::

1. …

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Feature Overview (aka. Goal Summary)

Stop generating long-lived service account tokens. Long-lived service account tokens are currently generated in order to then create an image pull secret for the internal image registry. This feature calls for using the TokenRequest API to generate a bound service account token for use in the image pull secret.

Goals (aka. expected user outcomes)

Use TokenRequest API to create image pull secrets. 
Performance benefits:

One fewer secret created per service account. This will result in at least three fewer secrets generated per namespace.

Security benefits:

Eliminates long-lived tokens, which are no longer recommended as they present a possible security risk.
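
For illustration only (the audience and expiry below are assumptions, not necessarily the values the registry operator uses), a bound token is obtained by POSTing a TokenRequest to the service account's token subresource (/api/v1/namespaces/<namespace>/serviceaccounts/<name>/token):

apiVersion: authentication.k8s.io/v1
kind: TokenRequest
spec:
  audiences:
  - openshift-image-registry      # assumed audience accepted by the internal registry
  expirationSeconds: 3600         # bound tokens expire and are rotated, unlike the legacy long-lived secrets

Recent oc/kubectl versions expose the same call via `oc create token <serviceaccount>`, which can be handy for manual verification.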

Requirements (aka. Acceptance Criteria):

A list of specific needs or objectives that a feature must deliver in order to be considered complete. Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc. Initial completion during Refinement status.

Use Cases (Optional):

Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.

Questions to Answer (Optional):

Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.

Out of Scope

High-level list of items that are out of scope. Initial completion during Refinement status.

Background

Provide any additional context that is needed to frame the feature. Initial completion during Refinement status.

Customer Considerations

Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.

Documentation Considerations

Provide information that needs to be considered and planned so that documentation will meet customer needs. If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.

Interoperability Considerations

Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The upstream test `ServiceAccounts no secret-based service account token should be auto-generated` was previously patched to allow for the internal image registry's managed image pull secret to be present in the `Secrets` field. This will no longer be the case as of 4.16.

Post merge of API-1644, we can remove the patch entirely.

Problem statement

DPDK applications require dedicated CPUs, isolated from any preemption (other processes, kernel threads, interrupts), and this can be achieved with the "static" policy of the CPU manager: the container resources need to include an integer number of CPUs of equal value in "limits" and "requests". For instance, to get six exclusive CPUs:

spec:
  containers:
  - name: CNF
    image: myCNF
    resources:
      limits:
        cpu: "6"
      requests:
        cpu: "6"

 

The six CPUs are dedicated to that container; however, real DPDK applications do not use all of those CPUs, as there is always at least one CPU running a slow path: processing configuration, printing logs (among the DPDK coding rules: no syscall in PMD threads, or you are in trouble). Even the DPDK PMD drivers and core libraries include pthreads which are intended to sleep; they are infrastructure pthreads processing link change interrupts, for instance.

Can we envision going with two processes, one with the isolated cores and one with the slow-path ones, so we can have two containers? Unfortunately no: a multi-process design, where only dedicated pthreads run in a given process, is not an option, as DPDK multi-process is being deprecated upstream and never picked up because it never properly worked. Fixing it and changing the DPDK architecture to systematically use two processes is not possible within a year and would require all DPDK applications to be rewritten. Knowing that the first and current multi-process implementation is a failure, nothing guarantees that a second one would be successful.

The slow-path CPUs consume only a fraction of a real CPU and can safely run on the "shared" CPU pool of the CPU Manager; however, container specifications do not allow requesting two kinds of CPUs, for instance:

 

spec:
  containers:
  - name: CNF
    image: myCNF
    resources:
      limits:
        cpu_dedicated: "4"
        cpu_shared: "20m"
      requests:
        cpu_dedicated: "4"
        cpu_shared: "20m"

Why do we care about allocating one extra CPU per container?

  • Allocating one extra CPU means allocating an additional physical core, as the CPUs running a DPDK application should run on dedicated physical cores in order to get maximum and deterministic performance, since caches and CPU units are shared between the two hyperthreads.
  • CNFs are built with a minimum number of CPUs per container. Today this is still between 10 and 20, sometimes more, but the intent is to decrease the number of CPUs per container and increase the number of containers: having containers that are too large to schedule wastes resources, like in the VNF days (tetris effect), and is not the "cloud native" way.

Let's take a realistic example, based on a real RAN CNF: running 6 containers with dedicated CPUs on a worker node, each with a slow path requiring only 0.1 CPUs, means that we waste about 5 CPUs, i.e. 3 physical cores. With real-life numbers:

  • For a single datacenter composed of 100 nodes, we waste 300 physical cores
  • For a single datacenter composed of 500 nodes, we waste 1500 physical cores
  • For single node OpenShift deployed on 1 million nodes, we waste 3 million physical cores

Intel public CPU price per core is around 150 US$, not even taking into account the ecological aspect of the waste of (rare) materials and the electricity and cooling…

 

Goals

Requirements

  • This section: A list of specific needs or objectives that a Feature must deliver to satisfy the Feature. Some requirements will be flagged as MVP. If an MVP requirement gets shifted, the feature shifts. If a non-MVP requirement slips, it does not shift the feature.
Requirement | Notes | isMvp?
CI - MUST be running successfully with test automation | This is a requirement for ALL features. | YES
Release Technical Enablement | Provide necessary release enablement details and documents. | YES

Questions to answer…

  • Would an implementation based on annotations be possible rather than an implementation requiring a container (so pod) definition change, like the CPU pooler does?

Out of Scope

Background, and strategic fit

This issue has been addressed lately by OpenStack.

Assumptions

  • ...

Customer Considerations

  • ...

Documentation Considerations

  • The feature needs documentation on how to configure OCP, create pods, and troubleshoot

Epic Goal

  • An NRI plugin that is invoked by CRI-O right before container creation and updates the container's cpuset and quota to match the mixed-cpus request.
  • The CPU pinning reconciliation operation must also execute the NRI API call on every update (so we can intercept the kubelet and it does not destroy our changes).
  • Dev Preview for 4.15

Why is this important?

  • This would unblock lots of options, including mixed-CPU workloads where some CPUs could be shared among containers / pods (CNF-3706)
  • This would also allow further research on dynamic (simulated) hyperthreading (CNF-3743)

Scenarios

  1. ...

Acceptance Criteria

  • Have an NRI plugin which is called by the runtime and updates the container with the mutual CPUs.
  • The plugin must be able to override the CPU manager reconciliation loop and be immune to future CPU manager changes.
  • The plugin must be robust and handle node reboot and kubelet/CRI-O restart scenarios.
  • upstream CI - MUST be running successfully with tests automated.
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • OCP adoption in relevant OCP version 
  • NTO shall be able to deploy the new plugin

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

  1. https://issues.redhat.com/browse/CNF-3706 : Spike - mix of shared and pinned/dedicated cpus within a container
  2. https://issues.redhat.com/browse/CNF-3743 : Spike: Dynamic offlining of cpu siblings to simulate no-smt
  3. upstream Node Resource Interface project - https://github.com/containerd/nri 
  4. https://issues.redhat.com/browse/CNF-6082: [SPIKE] Cpus assigned hook point in CRI-O
  5. https://issues.redhat.com/browse/CNF-7603 

Open questions::

  N/A

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

We need to extend the node admission plugin to support the shared cpus.

The admission should provide the following functionalities: 
1. In case a user specifies more than a single `openshift.io/enabled-shared-cpus` resource, it rejects the pod request with an error explaining to the user how to fix their pod spec.
2. It adds an annotation `cpu-shared.crio.io` that will be used to tell the runtime that shared CPUs were requested.
For every container that requested shared CPUs, it adds an annotation with the following scheme:
`cpu-shared.crio.io/<container name>`
 

Example of how it's done for core pinning: https://github.com/openshift/kubernetes/commit/04ff5090bae1cb181a2464696adde8709cdd0a93 
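
A minimal sketch (container name, image, and annotation value are assumptions) of what a pod could look like after the admission plugin has mutated it, using the resource name and annotation scheme described above:

apiVersion: v1
kind: Pod
metadata:
  name: shared-cpus-example                   # hypothetical name
  annotations:
    cpu-shared.crio.io/app: ""                # added by the admission plugin for container "app"; value shape assumed
spec:
  containers:
  - name: app
    image: registry.example.com/cnf:latest    # placeholder image
    resources:
      requests:
        cpu: "4"
        openshift.io/enabled-shared-cpus: "1"
      limits:
        cpu: "4"
        openshift.io/enabled-shared-cpus: "1"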

We need to add support to the kubelet to advertise the shared CPUs as `openshift.io/enabled-shared-cpus` through extended resources.

This should be off by default and only activated when a configuration file is supplied.
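
Illustrative only (the CPU counts below are made up): once activated, the node might then report the extended resource in its status alongside the regular cpu resource, for example:

status:
  capacity:
    cpu: "64"
    openshift.io/enabled-shared-cpus: "2"
  allocatable:
    cpu: "62"
    openshift.io/enabled-shared-cpus: "2"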

Feature Overview - Problem statement

Any Telco deployment seeks the best performance, determinism, and low TCO. Kubernetes was designed for cloud usage, where pods run on vCPUs. In a telco deployment, a vCPU can be:

  1. a dedicated physical core, meaning a pair of hyperthreads: high determinism and performance, but high cost
  2. a single dedicated hyperthread: half of the above in terms of determinism and cost
  3. a quantum of a shared hyperthread: non-deterministic performance (but still approximated by the quantum value), and low cost

Another parameter that greatly impacts performance is NUMA:

  1. best performance requires NUMA alignment
  2. best TCO means ensuring NUMA alignment only when required

 

OCP as of today (4.7) partitions server CPUs into multiple shared pools and a dedicated pool, without hyperthreading or NUMA awareness.

Detailed status on OCP 4.12 and OCP 4.14; the key missing item for OCP 4.14+ is how we spread/pack NIC interrupts in order to get maximum parallelism: https://docs.google.com/presentation/d/1Aet59myjjSIesubSKZyD5Ty6pVrd0SftbVbUFbbAK8w/edit#slide=id.g290f9655170_0_903

Goals

Permit efficient CPU usage on OCP servers: share as much as possible when possible, and dedicate only what really needs to be dedicated, at hyperthread granularity, taking NUMA locality into account.

Non Goals

Multiple NUMA per socket CPU systems like AMD Rome with NPS>1 (Node Per Socket) are out of this feature scope. More details on NPS in this gdoc.

Requirements

 

Requirement | Notes | isMvp?
CI - MUST be running successfully with test automation | This is a requirement for ALL features. | YES
Release Technical Enablement | Provide necessary release enablement details and documents. | YES
Must work on Real Time and non real time RHCOS | | YES

 

Use Cases

  • Sidecar containers: just provide what they need in terms of CPU allocation and NUMA locality
  • The sidecar may or may not need to be on the same NUMA node as the other containers of the pod
  • Pack systemd services and OpenShift infrastructure pods on the same CPUs, as well as the user pods not requiring dedicated CPUs

Out of Scope

N/A

Background, and strategic fit

Today's implementation splits the available CPUs into sets dedicated either to systemd or to OCP pods (OCP infrastructure pods and applications). This avoids noisy-neighbor syndromes across the pools, but not within a pool, and leads to overconsumption of CPUs, as each and every pool has its own margin.

Assumptions

N/A

Customer Considerations

N/A

Documentation Considerations

N/A

Epic Goal

  • Figure out how to increase the traffic handling capability for kernel networking workloads on clusters that do not use all cpus for guaranteed workloads.

Why is this important?

  • Telco Core has only a handful of guaranteed pods, but a lot of burstable kernel networking services. So they need CPU partitioning, but the networking stack needs to handle fairly high traffic too.

Scenarios

  1. High traffic over the kernel and OVS with a small guaranteed pod running on the node, with the reserved set using the least possible number of CPU threads (4/8).
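
For context, a hedged sketch of how such CPU partitioning is typically expressed through a PerformanceProfile; the CPU ranges below only illustrate the small-reserved-set scenario above and are not a recommendation:

apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
  name: telco-core-example                  # hypothetical name
spec:
  cpu:
    reserved: "0-3"                         # small reserved set (4 threads) for infra, OVS and kernel networking
    isolated: "4-63"                        # remaining threads available to workloads
  nodeSelector:
    node-role.kubernetes.io/worker-cnf: ""  # assumed node role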

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • Kernel networking must be a supported use case
  • OVS must get the necessary amount of cpus either automatically or by configuration
  • All logic must support overrides for emergencies and manual tweaks

Dependencies (internal and external)

  1. cri-o / OCI hooks
  2. systemd / OVS slice configuration
  3. kernel / RPS mask tunables

Previous Work (Optional):

  1. IRQ balancing https://github.com/cri-o/cri-o/blob/9ed9393df13cee1bb056be0f2068ed972e5cc05d/internal/runtimehandlerhooks/high_performance_hooks.go#L76
  2. RPS mask https://github.com/openshift/cluster-node-tuning-operator/blob/master/assets/performanceprofile/scripts/set-rps-mask.sh
  3. https://issues.redhat.com/browse/CNF-1360
  4. Dynamic OVS pinning: https://docs.google.com/document/d/18BtBkB3tHldt-zLLqNWA94JSd7aDUGnd1XwmYElde4E/edit

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Feature Overview

  • OVN-K provides the (primary) Kubernetes IP address to pods, and "connects" this IP address to the Kubernetes service IP, configured via metalLB.
  • On a given OpenShift cluster, service IPs can be reached via multiple host interfaces (in this example we have a single bond topped by VLANs, but we could also have multiple physical NICs, or multiple VFs... any kind of kernel netdevice should be supported); on a typical 5G Core deployment, this translates to:

 

 

 

Goals

  • Note1: the service IP is not configured on the host when using metalLB with Flannel; the implementation is solely based on iptables NAT rules in kube-proxy (the example is taken from Flannel/metalLB as an upstream reference implementation).

Example with kube-proxy (not OVN-K) of existing iptables rules (nat table) with two services, 10.10.10.1 and 9.9.9.1:

 

Chain KUBE-SERVICES (2 references)
 pkts bytes target     prot opt in     out     source               destination
    0     0 KUBE-SVC-MZWLROVU74UDI3HJ  tcp  --  *      *       0.0.0.0/0            10.103.205.40        /* default/nodeportamsterdam cluster IP */ tcp dpt:80
    0     0 KUBE-SVC-RQWI4M5IL64FQFRX  tcp  --  *      *       0.0.0.0/0            10.97.176.133        /* default/laposte cluster IP */ tcp dpt:443
    0     0 KUBE-FW-RQWI4M5IL64FQFRX  tcp  --  *      *       0.0.0.0/0            9.9.9.1              /* default/laposte loadbalancer IP */ tcp dpt:443
    0     0 KUBE-SVC-DGNTJ5HIKQEDKIWG  tcp  --  *      *       0.0.0.0/0            10.111.241.130       /* default/anyboss cluster IP */ tcp dpt:443
    0     0 KUBE-FW-DGNTJ5HIKQEDKIWG  tcp  --  *      *       0.0.0.0/0            10.10.10.1           /* default/anyboss loadbalancer IP */ tcp dpt:443
    0     0 KUBE-SVC-NPX46M4PTMTKRN6Y  tcp  --  *      *       0.0.0.0/0            10.96.0.1            /* default/kubernetes:https cluster IP */ tcp dpt:443
    0     0 KUBE-SVC-TCOU7JCQXEZGVUNU  udp  --  *      *       0.0.0.0/0            10.96.0.10           /* kube-system/kube-dns:dns cluster IP */ udp dpt:53
    0     0 KUBE-SVC-ERIFXISQEP7F7OF4  tcp  --  *      *       0.0.0.0/0            10.96.0.10           /* kube-system/kube-dns:dns-tcp cluster IP */ tcp dpt:53
    0     0 KUBE-SVC-JD5MR3NA4I4DYORP  tcp  --  *      *       0.0.0.0/0            10.96.0.10           /* kube-system/kube-dns:metrics cluster IP */ tcp dpt:9153
 2734  164K KUBE-NODEPORTS  all  --  *      *       0.0.0.0/0            0.0.0.0/0            /* kubernetes service nodeports; NOTE: this must be the last rule in this chain */ ADDRTYPE match dst-type LOCAL

  • Note2: the current OVN-K implementation with "OVN-K local gateway" requires all traffic to be routed via br-ex; this is a known bug to be fixed in January 2022, which will permit part of this feature to work with "OVN-K local gateway" mode. This feature is intended to also work with "OVN-K shared gateway" mode or any other mode, but that is not MVP.
  • Note3: overlapping CIDRs are referenced as Network VRFs by partners/customers, which can be implemented via kernel VRF but not necessarily, as the Service IP doesn’t need to be configured on any host interface, it is rather implemented as a set of NAT rules (OpenFlow or nftable)
  • IPv4 and IPv6 and dual stack Services should be supported
  • Service spec.externalTrafficPolicy Cluster and Local should be supported. In both cases, the pod's external traffic should use the Service IP if the pod specification requires it; such an option doesn't exist today. Likewise, pod-to-pod communication should use this IP address only if required in the pod specification. Pod annotations such as the following could be used (how this is implemented is not constrained; only the overall behavior is):
  • External <-> pod communication (the main use case of metalLB for Telcos; here we use a supervision app as an example, but it can be any kind of app): PodUseServiceIPForEgress=True, default to False

  • Internal pod-to-pod communication (allowed by Kubernetes, but unlikely to be relevant for Telco): PodToPodUseServiceIPForEgress=True, default to False
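
A hedged sketch of how the annotations proposed above could appear on a pod; the annotation names come from this feature description and are not an existing API, and the image is a placeholder:

apiVersion: v1
kind: Pod
metadata:
  name: supervision-app                        # example app type from the text
  annotations:
    PodUseServiceIPForEgress: "True"           # external <-> pod traffic egresses with the Service IP
    PodToPodUseServiceIPForEgress: "False"     # pod-to-pod traffic keeps the pod IP (the default)
spec:
  containers:
  - name: app
    image: registry.example.com/supervision:latest   # placeholder image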

Requirements

  • This section: A list of specific needs or objectives that a Feature must deliver to satisfy the Feature. Some requirements will be flagged as MVP. If an MVP requirement gets shifted, the feature shifts. If a non-MVP requirement slips, it does not shift the feature.
Requirement | Notes | isMvp?
CI - MUST be running successfully with test automation | This is a requirement for ALL features. | YES
Release Technical Enablement | Provide necessary release enablement details and documents. | YES
metalLB BGP | to be validated with BFD as well | YES
metalLB L2 | not part of the MVP, but should be implemented as well | NO
OVN-K shared gateway | local gateway is enough for MVP, but ultimately any mode will have to be supported | NO
Should work with other load balancers than metalLB | This should not be tested, but the implementation should not be metalLB-specific on the OVN-K side, so it can be reused with F5 SPK for instance | NO
PodToPodUseServiceIPForEgress | To be implemented if this is a low-hanging fruit or when we get explicit demand, but the implementation of the MVP should permit its later implementation. | NO

 

Out of Scope

  • OCP SDN
  • Kubernetes services on secondary interfaces: this Jira feature is about Kubernetes primary networking on any host interface (aka kernel netdevice: VLAN, bond, interface, VF, ....)

Documentation Considerations

Questions to be addressed:

  • What educational or reference material (docs) is required to support this product feature? For users/admins? Other functions (security officers, etc)?
  • Does this feature have doc impact?
  • New Content, Updates to existing content, Release Note, or No Doc Impact
  • If unsure and no Technical Writer is available, please contact Content Strategy.
  • What concepts do customers need to understand to be successful in [action]?
  • How do we expect customers will use the feature? For what purpose(s)?
  • What reference material might a customer want/need to complete [action]?
  • Is there source material that can be used as reference for the Technical Writer in writing the content? If yes, please link if available.
  • What is the doc impact (New Content, Updates to existing content, or Release Note)?

 

 

OCP/Telco Definition of Done
Epic Template descriptions and documentation.

<--- Cut-n-Paste the entire contents of this description into your new Epic --->

Epic Goal

  • Having a way to mark the return traffic coming from a service with a given mark
  • Having a way to mark, with a given mark, the traffic originating from a pod that belongs to a service carrying the "k8s.ovn.org/egress-service" annotation
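
For illustration only (the annotation value format, and whether a higher-level construct is used instead, is exactly the open question listed below), a Service opting into this behavior might look like:

apiVersion: v1
kind: Service
metadata:
  name: egress-example                    # hypothetical name
  annotations:
    k8s.ovn.org/egress-service: ""        # value/shape intentionally left open here
spec:
  type: LoadBalancer
  selector:
    app: egress-example
  ports:
  - port: 443
    targetPort: 8443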

Why is this important?

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

  1. Do we want to expose the mark to be applied to the service's traffic, or a higher-level construct?

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Complete Epics

This section includes Jira cards that are linked to an Epic, but the Epic itself is not linked to any Feature. These epics were completed when this image was assembled

An epic we can duplicate for each release to ensure we have a place to catch things we ought to be doing regularly but that tend to fall by the wayside.

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The console needs to know the network type capabilities in order to show/hide some Network Policy form fields.

As a result of https://issues.redhat.com/browse/NETOBSERV-27, this logic is implemented as a features document inside the console code. The console fetches the network type from the network operator and checks the supported features against this document.

However, this limits the feature to admin users, as other logged-in users do not have permissions to fetch the network type.

This task aims to modify the current Cluster Network Operator to expose the network capabilities as an `sdn-public` Config Map, writable only by the SDN, readable by any `system:authenticated` user.

Enhancement Proposal PR: https://github.com/openshift/enhancements/pull/875
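
A minimal sketch (resource names, namespace, and example payload are assumptions; the enhancement above defines the actual ones) of how such a ConfigMap could be made readable by all authenticated users via RBAC:

apiVersion: v1
kind: ConfigMap
metadata:
  name: sdn-public
  namespace: openshift-network-operator      # assumed namespace
data:
  networkType: OVNKubernetes                 # example payload; real keys are defined by the enhancement
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: sdn-public-reader                    # hypothetical name
  namespace: openshift-network-operator
rules:
- apiGroups: [""]
  resources: ["configmaps"]
  resourceNames: ["sdn-public"]
  verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: sdn-public-reader
  namespace: openshift-network-operator
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: sdn-public-reader
subjects:
- apiGroup: rbac.authorization.k8s.io
  kind: Group
  name: system:authenticated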

OCP/Telco Definition of Done
Epic Template descriptions and documentation.

<--- Cut-n-Paste the entire contents of this description into your new Epic --->

Epic Goal

  • Rebase OpenShift components to k8s v1.22
  • Rebase Jenkins and plugins to latest long term support versions

Why is this important?

  • Rebasing ensures components work with the upcoming release of Kubernetes
  • Address tech debt related to upstream deprecations and removals.

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. k8s 1.22 release - expected August 4th 2021

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

User Story

Rebase samples operator to k8s 1.22

Acceptance Criteria

  • Samples operator deploys with k8s 1.22 libraries
  • Core components continue to function (CI tests pass, including build suite).

Docs Impact

None

Notes

Description of problem:

Enable default sysctls for kubelet.

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:

1.
2.
3.

Actual results:


Expected results:


Additional info:


Goal:

This epic is mainly focused on tracking the dev console QE automation activities for the 4.8 release:
1. Identify the scenarios for automation
2. Segregate the test cases into smoke, regression and user stories
3. Design the gherkin scripts with the below priority

  • Update the Smoke test suite
  • Update the Regression test suite

4. Create the automation scripts using cypress
5. Implement CI

Why is it important?

This improves the quality of the product

Note:

This is not related to any UI features. It is mainly focused on UI automation

Description

This story is mainly about moving the pipelines code from the dev console to the pipelines plugin folder for extensibility purposes, and verifying the pipelines regression test suite.

As an operator QE, I should be able to execute them in my operator folder.

Acceptance Criteria

1. All pipelines scripts should be able to execute in the pipelines plugin folder
2. Pipelines operator installation needs to be done by the script

Additional Details:

Description

CI implementation for pipelines, knative, devconsole

Acceptance Criteria

update package.json file
CI for pipelines:
Any update related to pipelines should execute pipelines smoke tests
on nightly builds, pipelines regression should be executed [TBD]

CI for devconsole:
Any update related to devconsole should execute devconsole smoke tests
on nightly builds, devconsole regression should be executed [TBD]

Ci for knative
Any update related to knative should execute knative smoke tests
on nightly builds, knative regression should be executed [TBD]

Additional Details:

  1. Update the package.json file with the below scripts
    • test-cypress-pipelines-headless & test-cypress-pipelines
    • test-cypress-knative-headless & test-cypress-knative
    • test-cypress-gitops-headless & test-cypress-gitops
  2. Update the .gherkinlintrc file 
    • updated knative related tags
    • max scenarios per file
    • file name style
  3. Updated the OWNERS file for gitops, knative and pipelines

Fixing the feature file lint issues that occur when executing `yarn run test-cypress-devconsole-headless`, and moving the topology features to the topology folder.

Description

This story is mainly about moving the pipelines code from the dev console to the gitops plugin folder for extensibility purposes.

As an operator QE, I should be able to execute them in my operator folder.

Acceptance Criteria

1. All pipelines scripts should be able to execute in the gitops plugin folder
2. gitops operator installation needs to be done by the script

Description

This story is to move the existing scripts from Dev Console to Topology plugin folder

As a user,

Acceptance Criteria

  1. I should be able to execute the topology scripts

Additional Details:

Description

Merging the existing knative smoke suite scripts

As a tester, I should be able to execute the smoke test scripts

Acceptance Criteria

  1. <criteria>

Additional Details:

Currently the PR looks too large; to reduce its size, we are creating these sub-tasks.
Updating the ReadMe documentation for the knative plugin folder

Description

Update the smoke test cases related to knative
Remove the duplicate scenarios

As a user,

Acceptance Criteria

  1. <criteria>

Additional Details:

Description

Update all automation scripts and verify the execution on a remote cluster

As a user,

Acceptance Criteria

  1. Follow the automation testing practices while designing scripts
  2. Execute the scripts on remote cluster
  3. In Automation PR, follow the PR Hygiene guidelines

Additional Details:

Execute them on Chrome browser and 4.8 release cluster

Description

Design the cypress scripts for the epic ODC-3991
Refer to the Gherkin scripts: https://issues.redhat.com/browse/ODC-5430

As a user,

Acceptance Criteria

All automation possible test scenarios related to EPIC ODC-3991 should be automated

Additional Details:

Pipelines operator needs to be installed

Description

Set up cypress-cucumber in the helm plugin folder

Acceptance Criteria

  • Helm scenarios, along with their respective pageObjects, page functions and step definitions, should move to the helm folder
  • Able to execute smoke and regression tests

Additional Details:

Adding the OWNERS file to service mesh helps us add automatic reviewers for gherkin script updates.

Create GitHub templates with criteria to meet the Gherkin script standards and automation script standards.

As this .gherkin-lintrc is mainly used by the QE team, it does not need to be in the frontend folder, so I am moving it to the dev-console/integration-tests folder.

Adding all necessary tags and modifying the below rules due to recently observed scenarios:

  1. While uploading scenarios to Polarion, it is useful to switch off the "no-homogenous-tags" rule
  2. Update the allowed tags with respect to new feature files

This epic tracks network tooling improvements for 4.12

A new framework and process should be developed to make sharing network tools with devs, support and customers convenient. We are going to add some tools for OVN troubleshooting before OVN-K becomes the default, as well as some tools that came from customer cases, and more to help analyze and debug collected logs, based on the stable must-gather/sosreport format we now get thanks to the 4.11 Epic.

Our estimation for this Epic is 1 engineer * 2 Sprints

WHY:
This epic is important to help reduce the time it takes our customers and our team to understand an issue within the cluster.
A focus of this epic is to develop tools to quickly allow debugging of a problematic cluster. This is crucial for the engineering team to help us scale. We want to provide a tool to our customers to help lower the cognitive burden of getting to the root cause of an issue.

 

Alert if any of the ovn controllers is disconnected from the southbound database for a period of time, using the metric ovn_controller_southbound_database_connected.

The metric updates every 2 minutes, so please be mindful of this when creating the alert.

If the controller is disconnected for 10 minutes, fire an alert.

DoD: Merged to CNO and tested by QE
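
A minimal sketch of such an alerting rule (rule/alert names, namespace, and severity are assumptions; the metric name and 10-minute window come from the text, and it is assumed that a value of 0 means disconnected):

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: ovn-controller-sbdb-alerts               # hypothetical name
  namespace: openshift-ovn-kubernetes            # assumed namespace
spec:
  groups:
  - name: ovn-controller.rules
    rules:
    - alert: OVNControllerDisconnectedSouthboundDatabase   # hypothetical alert name
      # The metric only refreshes every ~2 minutes, so a 10m "for" clause spans several samples.
      expr: ovn_controller_southbound_database_connected == 0
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: ovn-controller has been disconnected from the OVN southbound database for at least 10 minutes.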

OCP/Telco Definition of Done
Epic Template descriptions and documentation.

<--- Cut-n-Paste the entire contents of this description into your new Epic --->

Epic Goal

  • Come up with a consistent way to detect node down on OCP and hypershift. The current mechanism for OCP (probing port 9) does not work for hypershift, meaning hypershift node-down detection will take longer (~40 secs). We should aim to have a common mechanism for both. We should also consider alternatives to probing port 9, perhaps BFD or another detection mechanism.
  • Get clarification on node-down detection times. Some customers have (apparently) asked for detection on the order of 100ms; the recommendation is to use multiple Egress IPs, so this may not be a hard requirement. Need clarification from PM/Customers.

Why is this important?

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Add a SOCKS proxy to cluster-network-operator so egress IP can use gRPC to reach worker nodes.

With the introduction of gRPC as the means for determining the state of a given egress node, hypershift should be able to leverage the SOCKS proxy and be able to know the state of each egress node.

References relevant to this work:
1281-network-proxy
https://coreos.slack.com/archives/C01C8502FMM/p1658427627751939
https://github.com/openshift/hypershift/pull/1131/commits/28546dc587dc028dc8bded715847346ff99d65ea

Incomplete Epics

This section includes Jira cards that are linked to an Epic, but the Epic itself is not linked to any Feature. These epics were not completed when this image was assembled

Epic Goal

  • Port all remaining Protractor tests to Cypress

Why is this important?

  • Protractor is very hard to debug when tests fail/flake
  • Once all protractor tests are ported we can remove all Protractor dependencies, scripts, and configuration files.
  • Cypress has better debugging, plug-ins, and reporting tools

Acceptance Criteria

  • CI - MUST be running successfully with tests automated

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Please read: migrating-protractor-tests-to-cypress

Protractor test to migrate:  `frontend/integration-tests/tests/storage.scenario.ts`

Loops through 6 storage kinds:

15) Add storage is applicable for all workloads

   16) replicationcontrollers
      ✔ create a replicationcontrollers resource
      ✔ add storage to replicationcontrollers

   17) daemonsets
      ✔ create a daemonsets resource
      ✔ add storage to daemonsets

   18) deployments
      ✔ create a deployments resource
      ✔ add storage to deployments

   19) replicasets
      ✔ create a replicasets resource
      ✔ add storage to replicasets

   20) statefulsets
      ✔ create a statefulsets resource
      ✔ add storage to statefulsets

   21) deploymentconfigs
      ✔ create a deploymentconfigs resource
      ✔ add storage to deploymentconfigs

 

Acceptance Criteria

  • Protractor test ported to cypress
  • Remove any unused legacy `data-test-id`s
  • Protractor test deleted, and no longer referenced in `frontend/integration-tests/protractor.conf.ts`

Please read: migrating-protractor-tests-to-cypress

Protractor test to migrate:  `frontend/integration-tests/tests/filter.scenario.ts`

4) Filtering 
   ✔ filters Pod from object detail 
   ✔ filters invalid Pod from object detail 
   ✔ filters from Pods list 
   ⚠ CONSOLE-1503 - searches for object by label 
   ✔ searches for pod by label and filtering by name 
   ✔ searches for object by label using by other kind of workload

 

Acceptance Criteria

  • Protractor test ported to cypress
  • Remove any unused legacy `data-test-id`s
  • Protractor test deleted, and no longer referenced in `frontend/integration-tests/protractor.conf.ts`