


Resource Quota Enforcement Test Report

Date: 2026-05-11
Tester: test-user-b
Namespace: ocdp-u-test-b
User Quota: cpu=2, memory=4Gi, gpu=0, gpumem=0

Test Summary

Test	Description	Expected	Actual	Result
A	Deploy nginx (default, within quota)	Success	Deployed (status: `deployed`)	✅ PASS
B	Deploy nginx (cpu=4, mem=8Gi, replicas=5, exceeds quota)	Blocked by quota	Helm release created, Service created, all pods blocked by ResourceQuota (status: `pending-install`)	⚠️ PARTIAL
C	Deploy vllm-serve with gpu=1 (gpu quota = 0)	Blocked by quota	Helm release created, all pods blocked by ResourceQuota (status: `pending-install`)	⚠️ PARTIAL

Detailed Results

Test A: Deploy nginx within quota limits

Instance: quota-test-nginx (ed846c33-3631-4d54-adce-c7f00210176f)
Chart: charts/nginx:22.1.1
Values: defaults
API Response: HTTP 200, status: pending-install
Final Status after 21s: deployed ("Instance deployed successfully")
K8s Resource Usage: requests.cpu=100m/2, requests.memory=128Mi/4Gi

Test B: Deploy nginx exceeding quota

Instance: quota-test-nginx-2 (36c0350f-089c-41c2-a66e-e93539c00d52)
Chart: charts/nginx:22.1.1
Values: replicaCount=5, resources.limits.cpu=4/memory=8Gi, resources.requests.cpu=2/memory=4Gi
API Response: HTTP 200, status: pending-install
Final Status (observed for 90s+): pending-install (never transitioned to deployed or failed)
K8s Behavior:
- Helm release created: sh.helm.release.v1.quota-test-nginx-2.v1
- TLS secret created
- Service created, IP assigned
- Deployment created, ReplicaSet scaled up
- All pod creations FAILED with: Error creating: pods "..." is forbidden: exceeded quota: tenant-quota, requested: requests.cpu=2,requests.memory=4Gi, used: requests.cpu=100m,requests.memory=128Mi, limited: requests.cpu=2,requests.memory=4Gi
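
The rejection message above carries the requested, used, and limited amounts in one line. As a minimal sketch of how a backend could turn it into structured data (for example to surface it through the API), the following parser splits the three sections into dictionaries. The function name `parse_quota_rejection` is hypothetical, and the regular expression assumes the message layout shown in this report.

```python
import re

# Matches each "requested: ...", "used: ...", "limited: ..." section and stops
# at the next section label or at the end of the message.
_SECTION = re.compile(
    r"(requested|used|limited): (.*?)(?=, (?:requested|used|limited): |$)"
)


def parse_quota_rejection(message: str) -> dict:
    """Return {'requested': {...}, 'used': {...}, 'limited': {...}}."""
    result = {}
    for section, body in _SECTION.findall(message):
        values = {}
        for item in body.split(","):
            # Skip fragments without "=", such as a truncated "..." entry.
            if "=" in item:
                key, value = item.split("=", 1)
                values[key.strip()] = value.strip()
        result[section] = values
    return result


# Message copied from Test B.
message = (
    'pods "..." is forbidden: exceeded quota: tenant-quota, '
    "requested: requests.cpu=2,requests.memory=4Gi, "
    "used: requests.cpu=100m,requests.memory=128Mi, "
    "limited: requests.cpu=2,requests.memory=4Gi"
)
parsed = parse_quota_rejection(message)
print(parsed["requested"], parsed["used"], parsed["limited"])
```

Read this way, the rejection is easy to explain: 100m of CPU was already in use by the Test A instance, so a single pod requesting 2 CPUs would bring usage to 2.1 against a limit of 2.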

Test C: Deploy GPU instance (gpu quota = 0)

Instance: quota-test-gpu (a0d692c8-cdf8-4248-a6d4-1468ad4a7cc7)
Chart: charts/vllm-serve:0.6.0
Values: resources.gpuLimit=1, resources.gpuMem=5000
API Response: HTTP 200, status: pending-install
Final Status (observed for 30s+): pending-install
K8s Behavior:
- vllm-serve chart defaults: requests.cpu=8, requests.memory=16Gi, requests.nvidia.com/gpu=1, requests.nvidia.com/gpumem=5k
- All pods blocked: exceeded quota: tenant-quota, requested: requests.cpu=8,requests.memory=16Gi,requests.nvidia.com/gpu=1,..., limited: requests.cpu=2,requests.memory=4Gi,requests.nvidia.com/gpu=0

Key Findings

1. No API-Level (Pre-flight) Quota Enforcement

The backend API accepts all deployment requests regardless of whether they exceed the user's quota. There is no validation at the API layer that checks:

- Whether the requested resources exceed the user's quota limits
- Whether the user's quota is already fully consumed by existing deployments

Evidence: All three deployments returned HTTP 200 with status: pending-install. The backend logs contain zero quota-related entries.
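
A pre-flight check of the kind that is missing could look roughly like the sketch below. It is an illustration only: the report does not state the backend's language, so Python is used here, and the names `cpu_millicores`, `memory_bytes`, and `preflight_violations` are hypothetical. The quantity parsing covers only a subset of the Kubernetes quantity syntax, and the "used" figure for Test A is assumed to be zero.

```python
from decimal import Decimal

# Subset of Kubernetes quantity suffixes; two-letter binary suffixes come
# first so that "Mi" is matched before "M".
_MEMORY_FACTORS = {
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,
    "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12,
}


def cpu_millicores(quantity: str) -> int:
    """Convert a CPU quantity such as "2" or "100m" to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return int(Decimal(quantity) * 1000)


def memory_bytes(quantity: str) -> int:
    """Convert a memory quantity such as "4Gi" or "128Mi" to bytes."""
    for suffix, factor in _MEMORY_FACTORS.items():
        if quantity.endswith(suffix):
            return int(Decimal(quantity[: -len(suffix)]) * factor)
    return int(Decimal(quantity))


def preflight_violations(hard, used, per_pod, replicas):
    """Return one message per resource whose total request exceeds the quota."""
    converters = {"cpu": cpu_millicores, "memory": memory_bytes}
    violations = []
    for resource, convert in converters.items():
        requested = convert(per_pod[resource]) * replicas
        if convert(used[resource]) + requested > convert(hard[resource]):
            violations.append(
                f"{resource}: {replicas} x {per_pod[resource]} on top of "
                f"{used[resource]} used exceeds the limit of {hard[resource]}"
            )
    return violations


quota = {"cpu": "2", "memory": "4Gi"}
# Test A: chart defaults, assuming nothing else is running in the namespace.
test_a = preflight_violations(quota, {"cpu": "0", "memory": "0"},
                              {"cpu": "100m", "memory": "128Mi"}, replicas=1)
# Test B: five replicas of cpu=2 / memory=4Gi while Test A is still deployed.
test_b = preflight_violations(quota, {"cpu": "100m", "memory": "128Mi"},
                              {"cpu": "2", "memory": "4Gi"}, replicas=5)
print(test_a)
print(test_b)
```

Unlike the per-pod check that Kubernetes performs, this sketch multiplies by the replica count, so a request such as Test B would be rejected as a whole before any Helm release is created.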

2. Kubernetes ResourceQuota Enforces at Pod Level

The Kubernetes ResourceQuota object tenant-quota in namespace ocdp-u-test-b does enforce limits, but only at the pod creation level:

spec:
  hard:
    requests.cpu: "2"
    requests.memory: 4Gi
    requests.nvidia.com/gpu: "0"
    requests.nvidia.com/gpumem: "0"

When pods exceed quota, Kubernetes explicitly refuses to create them with a clear error message. However, Helm releases, Services, Deployments, and ReplicaSets are still created even when pods are blocked.

3. Stuck at "pending-install"

Instances that exceed quota stayed in pending-install status for the entire observation window (90s+ for Test B, 30s+ for Test C): they never transitioned to deployed, failed, or any other error status. The OCDP platform neither detects the ResourceQuota rejection nor updates the instance status. The only way to learn about the failure is to check Kubernetes events directly:

kubectl get events -n ocdp-u-test-b
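
The same check can be automated. The sketch below filters the JSON that `kubectl get events -n ocdp-u-test-b -o json` would produce, assuming the core/v1 Event shape (`type`, `reason`, `message`, `involvedObject`). The function name `quota_rejections` is hypothetical, and the sample event list is illustrative rather than captured output; the ReplicaSet name in it is made up.

```python
def quota_rejections(event_list: dict) -> list:
    """Return the Warning events whose message reports an exceeded quota."""
    hits = []
    for event in event_list.get("items", []):
        message = event.get("message", "")
        if event.get("type") == "Warning" and "exceeded quota" in message:
            involved = event.get("involvedObject", {})
            hits.append({
                "kind": involved.get("kind"),
                "name": involved.get("name"),
                "reason": event.get("reason"),
                "message": message,
            })
    return hits


# Illustrative input only; object names are placeholders.
sample = {
    "items": [
        {
            "type": "Normal",
            "reason": "ScalingReplicaSet",
            "message": "Scaled up replica set example-rs to 5",
            "involvedObject": {"kind": "Deployment", "name": "quota-test-nginx-2"},
        },
        {
            "type": "Warning",
            "reason": "FailedCreate",
            "message": 'Error creating: pods "..." is forbidden: '
                       "exceeded quota: tenant-quota, requested: requests.cpu=2",
            "involvedObject": {"kind": "ReplicaSet", "name": "example-rs"},
        },
    ]
}
rejections = quota_rejections(sample)
print(rejections)
```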

4. GPU Quota Enforcement

Users with gpu=0 quota can submit deployments referencing GPU-enabled charts. The API does not reject them. Only the K8s ResourceQuota blocks pod creation at runtime. This could lead to:

- Unnecessary Helm releases and resource overhead in the cluster
- Confusion for users whose deployments appear to hang at pending-install

The login response includes quota information:

{
  "quotaCpu": "2",
  "quotaMemory": "4Gi",
  "quotaGpu": "0",
  "quotaGpuMemory": "0"
}

This could be used by the frontend to show usage limits, but no pre-flight check uses it server-side.
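
As a minimal illustration of such a server-side check, the sketch below compares a requested GPU count against the `quotaGpu` field from the payload above. The function name `gpu_request_allowed` is hypothetical, and the sketch deliberately ignores GPUs already in use and GPU memory; it only shows where the existing quota data could be consulted.

```python
def gpu_request_allowed(login_quota: dict, gpu_limit: int) -> bool:
    """Return True when the requested GPU count fits within the GPU quota."""
    # A missing field is treated as a quota of zero (an assumption).
    quota_gpu = int(login_quota.get("quotaGpu", "0"))
    return gpu_limit <= quota_gpu


# Quota fields as they appear in the login response.
login_quota = {
    "quotaCpu": "2",
    "quotaMemory": "4Gi",
    "quotaGpu": "0",
    "quotaGpuMemory": "0",
}
# Test C requested resources.gpuLimit=1 with a GPU quota of 0.
test_c_allowed = gpu_request_allowed(login_quota, gpu_limit=1)
print(test_c_allowed)
```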

Recommendations

1. Add pre-flight quota validation in the backend API: before accepting a deployment, check whether the requested resources (from chart values) would exceed the user's quota. Return HTTP 4xx with a clear error message.
2. Handle "pending-install" timeout: implement a watcher that detects when a Helm release has been created but pods remain stuck (e.g., due to ResourceQuota) and:
   - Updates the instance status to failed with a descriptive statusReason
   - Cleans up the Helm release, Service, etc.
   - Optionally surfaces the K8s error message via the API
3. GPU quota pre-check: if a chart requests GPU resources and the user's GPU quota is 0, reject the deployment at the API level before creating any Kubernetes resources.
4. UI quota indicator: show remaining quota (used vs. hard limit) on the deployment form so users know their limits before submitting.
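
The status handling for stuck instances could be sketched as a small decision function. Everything here is an assumption for illustration: the function name `resolve_pending_status`, the 120-second threshold, and the choice to fail immediately when a quota rejection has been seen are not taken from the OCDP codebase.

```python
PENDING_TIMEOUT_SECONDS = 120  # hypothetical threshold, not from the report


def resolve_pending_status(status, seconds_pending, quota_messages,
                           timeout=PENDING_TIMEOUT_SECONDS):
    """Return (new_status, status_reason) for an instance being watched."""
    if status != "pending-install":
        return status, None
    if quota_messages:
        # A ResourceQuota rejection will not resolve by waiting, so fail at
        # once and carry the Kubernetes message as the reason.
        return "failed", quota_messages[0]
    if seconds_pending > timeout:
        return "failed", f"still pending-install after {seconds_pending}s"
    return status, None


# Test B after 90 seconds, with the quota rejection taken from its events.
outcome = resolve_pending_status(
    "pending-install", 90, ["exceeded quota: tenant-quota"]
)
print(outcome)
```

Cleanup of the Helm release and related resources would follow the transition to failed; that step is left out because it depends on how the platform drives Helm.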

ResourceQuota YAML (for reference)

apiVersion: v1
kind: ResourceQuota
metadata:
  name: tenant-quota
  namespace: ocdp-u-test-b
  labels:
    ocdp.io/managed-by: ocdp
    ocdp.io/tenant: ocdp-u-test-b
spec:
  hard:
    requests.cpu: "2"
    requests.memory: 4Gi
    requests.nvidia.com/gpu: "0"
    requests.nvidia.com/gpumem: "0"

Cleanup Verification

All test instances were removed after testing:

- quota-test-nginx ✅ deleted (pods terminated, helm release removed, quota back to 0)
- quota-test-nginx-2 ✅ cleaned up (no pods created, resources released)
- quota-test-gpu ✅ cleaned up (no pods created, resources released)
- ResourceQuota used: all resources at 0
