diff --git a/AGENTS.md b/AGENTS.md index a3f088c..b63f6fa 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -11,6 +11,7 @@ This file is the execution guide for `ocdp-workload-manifests`. CRs created through `ocdp-server`. - Keep reusable Kubernetes resources under `apps/*/base`. - Keep optional Kustomize components under `apps/*/components`. +- Keep Helm post-render presets under `packages/*/presets/*`. - Keep per-deployment runtime values out of this repository. They belong in temporary source files or runtime specs generated by `ocdp-server`. - Use `tests/kustomize/*` only for generic validation overlays, never for real @@ -21,10 +22,10 @@ This file is the execution guide for `ocdp-workload-manifests`. ## Runtime Rules - Do not add a global catalog index unless the server explicitly needs one later. -- `ocdp-server` WorkloadTemplate records should reference apps with - `repositoryUrl`, `ref`, and `path`. -- Runtime source generation may create Secret, ConfigMap, and patch files, then - run Kustomize. +- `ocdp-server` WorkloadTemplate records may reference Helm charts plus + `source.preset.repositoryUrl/ref/path`. +- Runtime source generation may render Helm, create Secret, ConfigMap, and patch + files, then run Kustomize. - Bases should stay template-free YAML. - App bases should keep Services internally reachable; expose apps from WorkloadClaim top-level intent such as `exposure=internal` or diff --git a/README.md b/README.md index 063a107..42efe6c 100644 --- a/README.md +++ b/README.md @@ -1,10 +1,11 @@ # ocdp-workload-manifests -Standalone Kubernetes manifests for OCDP workloads. +Standalone Kubernetes manifests and post-render presets for OCDP workloads. This repository is intentionally just a Git repository of app manifests. There is -no global catalog index and no dependency on the Gitea API. `ocdp-server` can -read a workload by building a Kustomize target directly from Git. +no global catalog index and no dependency on the Gitea API. 
`ocdp-server` stores +the template contract; the operator renders Helm sources and then applies +Kustomize presets from this repository. ## Architecture @@ -14,12 +15,15 @@ OCDP keeps the responsibilities split: Git repo apps//base apps//components + packages//presets/ + kustomization.yaml + userInputs.yaml ocdp-server PostgreSQL WorkloadTemplate metadata - WorkloadTemplate source.repositoryUrl/ref/path + WorkloadTemplate Helm source + WorkloadTemplate source.preset.repositoryUrl/ref/path user-facing values schema / parameters - environment overlay and policy metadata access bindings no WorkloadClaim instance storage @@ -29,35 +33,50 @@ target cluster runtime Kubernetes resources ``` -This repository stores only the Git base and reusable components. It does not -store `WorkloadTemplate` records and does not store user `WorkloadClaim` -instances. +This repository stores Git bases, reusable components, and Kustomize presets. +For Helm-backed workloads, the operator runs `helm template` first, writes that +output as `rendered.yaml`, then renders the selected preset with Workload +`spec.values`. A preset may include `userInputs.yaml` so admins can see which +values should become the WorkloadTemplate user-facing parameter schema. This +metadata file is not a Kubernetes resource and is not referenced by Kustomize. An admin creates or updates a `WorkloadTemplate` in `ocdp-server`. 
That template -can point at one of these Git paths: +can point at a Helm chart plus one of these Git presets: ```yaml templateType: kustomize source: - type: gitKustomize - repositoryUrl: https://gitea.example.com/ocdp/ocdp-workload-manifests.git - ref: code-server-v0.1.0 - path: apps/code-server/base + type: helm + repositoryUrl: https://kuoss.github.io/helm-charts + chart: code-server + version: 3.16.1 + releaseName: "{{ name }}" + values: | + fullnameOverride: "{{ name }}" + serviceAccount: + create: false + persistence: + enabled: true + size: 20Gi + preset: + type: kustomize + repositoryUrl: https://gitea.example.com/ocdp/ocdp-workload-manifests.git + ref: main + path: packages/code-server/presets/k3s-hami ``` After the template is stored in PostgreSQL and assigned to users or groups, users call `ocdp-server` to create a claim by submitting `templateId`, `workspaceId`, -and values. `ocdp-server` resolves the template from PostgreSQL, reads the Git -base, generates any temporary source files or runtime specs outside this repo, -then writes a `WorkloadClaim` CR into the target Kubernetes cluster. The final -user-created claim lives in Kubernetes, not in PostgreSQL. +and values. `ocdp-server` resolves the template from PostgreSQL and writes a +`WorkloadClaim` CR into the target Kubernetes cluster. The operator renders Helm +with the resolved values, applies the Kustomize preset, and then applies the +final Kubernetes resources. The final user-created claim lives in Kubernetes, +not in PostgreSQL. 
## Layout ```text apps/ - earth2studio-demo/ - base/ code-server/ base/ components/ @@ -67,32 +86,46 @@ apps/ litellm/ base/ components/ +packages/ + code-server/ + presets/ + k3s-hami/ + kustomization.yaml + userInputs.yaml tests/ kustomize/ ``` ## Server Usage -`ocdp-server` should store a normal `gitKustomize` reference: +`ocdp-server` should store a Helm source plus a Kustomize preset reference: ```yaml source: - type: gitKustomize - repositoryUrl: https://gitea.example.com/ocdp/ocdp-workload-manifests.git - ref: code-server-v0.1.0 - path: apps/code-server/base + type: helm + repositoryUrl: https://kuoss.github.io/helm-charts + chart: code-server + version: 3.16.1 + releaseName: "{{ name }}" + preset: + type: kustomize + repositoryUrl: https://gitea.example.com/ocdp/ocdp-workload-manifests.git + ref: main + path: packages/code-server/presets/k3s-hami ``` -For one deployment, `ocdp-server` should generate temporary source files outside -this repository. Those files can point at the Git base and add generated -Secrets, ConfigMaps, components, and patches. +For one deployment, `ocdp-server` stores the Helm chart reference, static Helm +values, preset reference, and the template value contract. The preset renders +with Workload `spec.values`, so environment-specific implementation details +should live in the selected preset instead of in the WorkloadTemplate create +form. ## Exposure -Base services should stay internally reachable. User-facing exposure choices -belong on the WorkloadClaim top-level intent such as `exposure=internal` or -`exposure=external`; the agent/template renderer turns that intent into runtime -Service resources. +Base services should stay internally reachable unless an environment preset +intentionally changes the Service shape. For standard self-service workloads, +exposure is an admin preset decision and a user read view, not a user claim +input. 
Reusable components are still useful implementation building blocks: @@ -101,8 +134,7 @@ Reusable components are still useful implementation building blocks: - `components/service-loadbalancer`: change the app Service to `LoadBalancer`. - `components/service-nodeport`: change the app Service to `NodePort`. -When `exposure=external` is selected by the user, the agent/template renderer -may generate a NodePort Service and leave the concrete nodePort for Kubernetes +When a preset needs NodePort, it may leave the concrete nodePort for Kubernetes to allocate: ```yaml @@ -113,10 +145,13 @@ to allocate: Do not hard-code shared NodePort values in app bases or reusable components. -Environment overlays are different from user choices. They are selected by the +Environment overlays are different from user values. They are selected by the platform from cluster, workspace, or customer policy information and can carry -things like StorageClass, IngressClass, GPU runtime class, registry prefix, -pull-secret wiring, node selectors, tolerations, and site-specific labels. +things like Service type, StorageClass, IngressClass, GPU runtime class, +registry prefix, pull-secret wiring, node selectors, tolerations, and +site-specific labels. The user-facing exposure view is derived after reconcile: +ClusterIP is hidden from ordinary users, NodePort uses the agent access host and +observed nodePort, and LoadBalancer uses observed external IP/hostname and port. ## Validate diff --git a/apps/code-server/README.md b/apps/code-server/README.md index cd235d4..c2b8653 100644 --- a/apps/code-server/README.md +++ b/apps/code-server/README.md @@ -1,17 +1,22 @@ # code-server -The base deploys code-server from `harbor.bwgdi.com/library/earth2studio-demo:v6` -with a Service and password Secret reference. +`base/` and `components/` stay plain Kustomize YAML for reusable validation and +composition. -The Secret is generated by the instance overlay. 
Workspace storage is mounted -from top-level WorkloadClaim `storage`, and exposure is rendered from top-level -`exposure`. Do not commit real passwords or tokens to this catalog. +The OCDP self-service code-server flow is Helm-backed now: -The WorkloadTemplate exposes storage intent (`temporary`, `retained`, -`existing`) and exposure intent (`internal`, `external`) as claim-time choices. -Workspace storage defaults to retained. The template also mounts a hidden -retained `weight` StorageClass PVC at `/models` for model weights; this -StorageClass detail is platform-owned and is not exposed as a user parameter. +```text +Helm chart + -> operator helm template + -> packages/code-server/presets/k3s-hami + -> final Kubernetes resources +``` -The Deployment keeps HAMi resource keys in `resources.limits` at all times: -`nvidia.com/gpu` for GPU count and `nvidia.com/gpumem` for GPU memory in MiB. +The environment-specific preset lives in +`packages/code-server/presets/k3s-hami`. That preset owns the platform image +`harbor.bwgdi.com/library/earth2studio-demo:v6`, registry pull secret, HAMi +scheduler, Service shape, code-server auth mode, and GPU resource keys. + +Users only fill CPU, memory, GPU count, and GPU memory. See +`packages/code-server/presets/k3s-hami/userInputs.yaml` for the values schema +that the console should render into the WorkloadClaim form. 
diff --git a/apps/code-server/base/deployment.yaml b/apps/code-server/base/deployment.yaml index c5e6894..0db3196 100644 --- a/apps/code-server/base/deployment.yaml +++ b/apps/code-server/base/deployment.yaml @@ -56,9 +56,9 @@ spec: resources: requests: cpu: "500m" - memory: 1Gi + memory: 1024Mi limits: - cpu: "2" - memory: 4Gi + cpu: "2000m" + memory: 4096Mi nvidia.com/gpu: "1" nvidia.com/gpumem: "8192" diff --git a/packages/code-server/README.md b/packages/code-server/README.md new file mode 100644 index 0000000..414a6fe --- /dev/null +++ b/packages/code-server/README.md @@ -0,0 +1,29 @@ +# code-server Helm postRender + +This package is consumed by OCDP as a Kustomize postRender for a Helm rendered +code-server chart. + +The platform chain is: + +```text +Helm chart + resolved values + -> helm template + -> packages/code-server/post-renders/k3s-hami + -> final Kubernetes resources +``` + +`post-renders/k3s-hami/kustomization.yaml` patches the Helm output with +environment-managed choices: + +- image: `harbor.bwgdi.com/library/earth2studio-demo:v6` +- pull secret: `regcred` +- scheduler: `hami-scheduler` +- HAMi resource limit keys: `nvidia.com/gpu` and `nvidia.com/gpumem` +- NodePort Service on port `80` +- `weight` StorageClass PVC mounted at `/models` + +`post-renders/k3s-hami/userInputs.yaml` is the user-facing value contract. Users +only choose CPU, memory, GPU count, and GPU memory. The console renders these +fields as the WorkloadClaim form; the operator receives the resolved values on +the Workload CR and applies the postRender patches. Storage, exposure, image, +scheduler, pull secret, and code-server auth mode stay in the admin postRender. 
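The value flow described in the package README (user-facing fields from `userInputs.yaml`, resolved values substituted into the post-render files) can be sketched as follows. This is a minimal illustration, not the operator's implementation: the `resolve_values` and `render` helpers are hypothetical, the schema is copied inline from `userInputs.yaml` rather than parsed, and plain `{{ key }}` string substitution is an assumption about how the placeholders are filled.

```python
import re

# Defaults and minimums copied from post-renders/k3s-hami/userInputs.yaml.
USER_INPUTS = {
    "cpuRequestMillicores": {"default": 500, "minimum": 0},
    "cpuLimitMillicores": {"default": 2000, "minimum": 0},
    "memoryRequestMiB": {"default": 1024, "minimum": 0},
    "memoryLimitMiB": {"default": 4096, "minimum": 0},
    "gpuCount": {"default": 1, "minimum": 0},
    "gpuMemoryMiB": {"default": 8192, "minimum": 0},
}

# Matches placeholders such as "{{ gpuCount }}" (assumed syntax).
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def resolve_values(user_values, schema=USER_INPUTS):
    """Hypothetical helper: apply schema defaults and check minimums."""
    unknown = set(user_values) - set(schema)
    if unknown:
        raise ValueError(f"unknown inputs: {sorted(unknown)}")
    resolved = {}
    for key, spec in schema.items():
        value = user_values.get(key, spec["default"])
        if value < spec["minimum"]:
            raise ValueError(f"{key}={value} is below minimum {spec['minimum']}")
        resolved[key] = value
    return resolved


def render(text, values):
    """Hypothetical helper: replace each placeholder, failing on unknown keys."""
    def substitute(match):
        key = match.group(1)
        if key not in values:
            raise KeyError(f"no value for placeholder {key!r}")
        return str(values[key])
    return PLACEHOLDER.sub(substitute, text)


# Fragment of the Deployment patch from kustomization.yaml.
FRAGMENT = '''resources:
  requests:
    cpu: "{{ cpuRequestMillicores }}m"
    memory: "{{ memoryRequestMiB }}Mi"
  limits:
    nvidia.com/gpu: "{{ gpuCount }}"
'''

# The user overrides two fields; the rest fall back to schema defaults.
values = resolve_values({"cpuRequestMillicores": 1000, "gpuCount": 2})
rendered = render(FRAGMENT, values)
print(rendered)
```

Rejecting unknown keys and unresolved placeholders is a design choice of this sketch: it keeps Storage, exposure, image, and scheduler settings out of reach of user input, which matches the README's statement that those stay in the admin postRender.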
diff --git a/packages/code-server/post-renders/k3s-hami/kustomization.yaml b/packages/code-server/post-renders/k3s-hami/kustomization.yaml new file mode 100644 index 0000000..1a3b334 --- /dev/null +++ b/packages/code-server/post-renders/k3s-hami/kustomization.yaml @@ -0,0 +1,119 @@ +apiVersion: kustomize.config.k8s.io/v1beta1 +kind: Kustomization +namespace: "{{ namespace }}" +resources: + - rendered.yaml + - weights-pvc.yaml +patches: + - target: + group: apps + version: v1 + kind: Deployment + name: "{{ name }}" + patch: | + apiVersion: apps/v1 + kind: Deployment + metadata: + name: "{{ name }}" + labels: + app.kubernetes.io/component: ide + app.kubernetes.io/part-of: ocdp-workload + spec: + template: + metadata: + labels: + app.kubernetes.io/component: ide + app.kubernetes.io/part-of: ocdp-workload + spec: + imagePullSecrets: + - name: regcred + schedulerName: hami-scheduler + securityContext: + fsGroup: 1000 + volumes: + - name: data + persistentVolumeClaim: + claimName: "{{ name }}" + - name: model-weights + persistentVolumeClaim: + claimName: "{{ name }}-weights" + containers: + - name: code-server + image: harbor.bwgdi.com/library/earth2studio-demo:v6 + imagePullPolicy: IfNotPresent + securityContext: + runAsUser: 1000 + command: + - code-server + args: + - --bind-addr + - 0.0.0.0:8080 + - --auth + - none + - /workspace + ports: + - name: http + containerPort: 8080 + protocol: TCP + env: + - name: HOME + value: /workspace + - name: XDG_CONFIG_HOME + value: /workspace/.config + readinessProbe: + httpGet: + path: / + port: http + livenessProbe: + httpGet: + path: / + port: http + resources: + requests: + cpu: "{{ cpuRequestMillicores }}m" + memory: "{{ memoryRequestMiB }}Mi" + limits: + cpu: "{{ cpuLimitMillicores }}m" + memory: "{{ memoryLimitMiB }}Mi" + nvidia.com/gpu: "{{ gpuCount }}" + nvidia.com/gpumem: "{{ gpuMemoryMiB }}" + volumeMounts: + - name: data + mountPath: /workspace + - name: model-weights + mountPath: /models + - target: + version: v1 + kind: 
Secret + name: "{{ name }}" + patch: | + apiVersion: v1 + kind: Secret + metadata: + name: "{{ name }}" + labels: + app.kubernetes.io/component: auth + app.kubernetes.io/part-of: ocdp-workload + annotations: {} + type: Opaque + data: + password: dW51c2Vk + - target: + version: v1 + kind: Service + name: "{{ name }}" + patch: | + apiVersion: v1 + kind: Service + metadata: + name: "{{ name }}" + labels: + app.kubernetes.io/component: ide + app.kubernetes.io/part-of: ocdp-workload + spec: + type: NodePort + ports: + - name: http + port: 80 + targetPort: http + protocol: TCP diff --git a/packages/code-server/post-renders/k3s-hami/userInputs.yaml b/packages/code-server/post-renders/k3s-hami/userInputs.yaml new file mode 100644 index 0000000..4fdf2ab --- /dev/null +++ b/packages/code-server/post-renders/k3s-hami/userInputs.yaml @@ -0,0 +1,35 @@ +cpuRequestMillicores: + label: CPU request + type: number + default: 500 + minimum: 0 + +cpuLimitMillicores: + label: CPU limit + type: number + default: 2000 + minimum: 0 + +memoryRequestMiB: + label: Memory request + type: number + default: 1024 + minimum: 0 + +memoryLimitMiB: + label: Memory limit + type: number + default: 4096 + minimum: 0 + +gpuCount: + label: GPU count + type: number + default: 1 + minimum: 0 + +gpuMemoryMiB: + label: GPU memory + type: number + default: 8192 + minimum: 0 diff --git a/packages/code-server/post-renders/k3s-hami/weights-pvc.yaml b/packages/code-server/post-renders/k3s-hami/weights-pvc.yaml new file mode 100644 index 0000000..b3e0706 --- /dev/null +++ b/packages/code-server/post-renders/k3s-hami/weights-pvc.yaml @@ -0,0 +1,18 @@ +apiVersion: v1 +kind: PersistentVolumeClaim +metadata: + name: "{{ name }}-weights" + namespace: "{{ namespace }}" + labels: + app.kubernetes.io/name: "{{ name }}" + app.kubernetes.io/component: model-weights + app.kubernetes.io/part-of: ocdp-workload + annotations: + platform.ocdp.io/storage-role: model-weights +spec: + accessModes: + - ReadWriteMany + 
storageClassName: weight + resources: + requests: + storage: 100Gi diff --git a/tests/kustomize/code-server-nodeport/deployment-patch.yaml b/tests/kustomize/code-server-nodeport/deployment-patch.yaml index e518fe4..4180ef7 100644 --- a/tests/kustomize/code-server-nodeport/deployment-patch.yaml +++ b/tests/kustomize/code-server-nodeport/deployment-patch.yaml @@ -1,15 +1,15 @@ - op: replace path: /spec/template/spec/containers/0/resources/requests/cpu - value: "1" + value: "1000m" - op: replace path: /spec/template/spec/containers/0/resources/requests/memory - value: 2Gi + value: 2048Mi - op: replace path: /spec/template/spec/containers/0/resources/limits/cpu - value: "2" + value: "2000m" - op: replace path: /spec/template/spec/containers/0/resources/limits/memory - value: 4Gi + value: 4096Mi - op: replace path: /spec/template/spec/containers/0/resources/limits/nvidia.com~1gpu value: "1"
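The resource changes in `apps/code-server/base/deployment.yaml` and in the test patch above only change notation, so that the values line up with the millicore and MiB fields in `userInputs.yaml`; the quantities themselves stay the same. A small sketch to check that, assuming hypothetical helpers `cpu_to_millicores` and `memory_to_mib` that handle only the suffixes used in this diff, not the full Kubernetes quantity grammar:

```python
def cpu_to_millicores(quantity: str) -> int:
    """Convert a CPU quantity such as "2" or "500m" to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return int(round(float(quantity) * 1000))


def memory_to_mib(quantity: str) -> int:
    """Convert a memory quantity with a Mi or Gi suffix to MiB."""
    if quantity.endswith("Gi"):
        return int(quantity[:-2]) * 1024
    if quantity.endswith("Mi"):
        return int(quantity[:-2])
    raise ValueError(f"unsupported memory suffix in {quantity!r}")


# (old, new) pairs taken from the diff.
CPU_PAIRS = [("2", "2000m"), ("1", "1000m")]
MEMORY_PAIRS = [("1Gi", "1024Mi"), ("4Gi", "4096Mi"), ("2Gi", "2048Mi")]

cpu_equal = all(cpu_to_millicores(a) == cpu_to_millicores(b) for a, b in CPU_PAIRS)
memory_equal = all(memory_to_mib(a) == memory_to_mib(b) for a, b in MEMORY_PAIRS)
print(cpu_equal, memory_equal)
```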