
# Resource Quota Enforcement Test Report
**Date:** 2026-05-11
**Tester:** test-user-b
**Namespace:** ocdp-u-test-b
**User Quota:** cpu=2, memory=4Gi, gpu=0, gpumem=0
---
## Test Summary
| Test | Description | Expected | Actual | Result |
|------|-------------|----------|--------|--------|
| A | Deploy nginx (default, within quota) | Success | Deployed (status: `deployed`) | ✅ PASS |
| B | Deploy nginx (cpu=4, mem=8Gi, replicas=5, exceeds quota) | Blocked by quota | Helm release created, Service created, all pods blocked by ResourceQuota (status: `pending-install`) | ⚠️ PARTIAL |
| C | Deploy vllm-serve with gpu=1 (gpu quota = 0) | Blocked by quota | Helm release created, all pods blocked by ResourceQuota (status: `pending-install`) | ⚠️ PARTIAL |
---
## Detailed Results
### Test A: Deploy nginx within quota limits
- **Instance:** `quota-test-nginx` (ed846c33-3631-4d54-adce-c7f00210176f)
- **Chart:** charts/nginx:22.1.1
- **Values:** defaults
- **API Response:** HTTP 200, status: `pending-install`
- **Final Status after 21s:** `deployed` ("Instance deployed successfully")
- **K8s Resource Usage:** requests.cpu=100m/2, requests.memory=128Mi/4Gi
### Test B: Deploy nginx exceeding quota
- **Instance:** `quota-test-nginx-2` (36c0350f-089c-41c2-a66e-e93539c00d52)
- **Chart:** charts/nginx:22.1.1
- **Values:** replicaCount=5, resources.limits.cpu=4/memory=8Gi, resources.requests.cpu=2/memory=4Gi
- **API Response:** HTTP 200, status: `pending-install`
- **Final Status (observed for 90s+):** `pending-install` (never transitioned to `deployed` or `failed`)
- **K8s Behavior:**
  - Helm release created: `sh.helm.release.v1.quota-test-nginx-2.v1`
  - TLS secret created
  - Service created, IP assigned
  - Deployment created, ReplicaSet scaled up
  - **All pod creations FAILED** with: `Error creating: pods "..." is forbidden: exceeded quota: tenant-quota, requested: requests.cpu=2,requests.memory=4Gi, used: requests.cpu=100m,requests.memory=128Mi, limited: requests.cpu=2,requests.memory=4Gi`
### Test C: Deploy GPU instance (gpu quota = 0)
- **Instance:** `quota-test-gpu` (a0d692c8-cdf8-4248-a6d4-1468ad4a7cc7)
- **Chart:** charts/vllm-serve:0.6.0
- **Values:** resources.gpuLimit=1, resources.gpuMem=5000
- **API Response:** HTTP 200, status: `pending-install`
- **Final Status (observed for 30s+):** `pending-install`
- **K8s Behavior:**
  - vllm-serve chart defaults: requests.cpu=8, requests.memory=16Gi, requests.nvidia.com/gpu=1, requests.nvidia.com/gpumem=5k
  - All pods blocked: `exceeded quota: tenant-quota, requested: requests.cpu=8,requests.memory=16Gi,requests.nvidia.com/gpu=1,..., limited: requests.cpu=2,requests.memory=4Gi,requests.nvidia.com/gpu=0`
---
## Key Findings
### 1. No API-Level (Pre-flight) Quota Enforcement
The backend API accepts **all** deployment requests regardless of whether they exceed the user's quota. There is no validation at the API layer that checks:
- Whether the requested resources exceed the user's quota limits
- Whether the user's quota is already fully consumed by existing deployments
**Evidence:** All three deployments returned HTTP 200 with `status: pending-install`. The backend logs contain zero quota-related entries.
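
The missing check can be sketched as follows. This is a minimal illustration, not the OCDP backend's code: `DeployRequest`, `QuotaState`, and `Preflight` are hypothetical names, and the quantity parsers handle only the formats that appear in this report (`100m`, `2`, `128Mi`, `4Gi`). A real implementation would more likely rely on `resource.ParseQuantity` from `k8s.io/apimachinery`.

```go
package main

import (
	"fmt"
	"math"
	"strconv"
	"strings"
)

// parseCPUMilli converts a CPU quantity such as "100m" or "2" to millicores.
// Simplified for illustration: only plain cores and the "m" suffix are handled.
func parseCPUMilli(s string) (int64, error) {
	if strings.HasSuffix(s, "m") {
		return strconv.ParseInt(strings.TrimSuffix(s, "m"), 10, 64)
	}
	cores, err := strconv.ParseFloat(s, 64)
	if err != nil {
		return 0, err
	}
	return int64(math.Round(cores * 1000)), nil
}

// parseMemBytes converts a memory quantity such as "128Mi" or "4Gi" to bytes.
// Simplified: only Ki, Mi, Gi and plain byte counts are handled.
func parseMemBytes(s string) (int64, error) {
	units := []struct {
		suffix string
		factor int64
	}{{"Ki", 1 << 10}, {"Mi", 1 << 20}, {"Gi", 1 << 30}}
	for _, u := range units {
		if strings.HasSuffix(s, u.suffix) {
			n, err := strconv.ParseInt(strings.TrimSuffix(s, u.suffix), 10, 64)
			return n * u.factor, err
		}
	}
	return strconv.ParseInt(s, 10, 64)
}

// parseAll applies one parser to several quantities and stops at the first error.
func parseAll(parse func(string) (int64, error), vals ...string) ([]int64, error) {
	out := make([]int64, len(vals))
	for i, v := range vals {
		n, err := parse(v)
		if err != nil {
			return nil, fmt.Errorf("invalid quantity %q: %w", v, err)
		}
		out[i] = n
	}
	return out, nil
}

// DeployRequest holds per-pod requests taken from chart values (hypothetical type).
type DeployRequest struct {
	Replicas int64
	CPU      string
	Memory   string
}

// QuotaState mirrors the hard and used values of the tenant quota (hypothetical type).
type QuotaState struct {
	HardCPU, UsedCPU       string
	HardMemory, UsedMemory string
}

// Preflight rejects a request when used + replicas*request exceeds the hard limit.
func Preflight(req DeployRequest, q QuotaState) error {
	cpu, err := parseAll(parseCPUMilli, req.CPU, q.UsedCPU, q.HardCPU)
	if err != nil {
		return err
	}
	mem, err := parseAll(parseMemBytes, req.Memory, q.UsedMemory, q.HardMemory)
	if err != nil {
		return err
	}
	if need := cpu[1] + req.Replicas*cpu[0]; need > cpu[2] {
		return fmt.Errorf("cpu quota exceeded: need %dm, limit %dm", need, cpu[2])
	}
	if need := mem[1] + req.Replicas*mem[0]; need > mem[2] {
		return fmt.Errorf("memory quota exceeded: need %d bytes, limit %d bytes", need, mem[2])
	}
	return nil
}
```

With the Test B values (5 replicas at cpu=2 against a hard limit of 2 with 100m already used), `Preflight` returns an error before any Helm release would be created. Note that the sketch checks the total across all replicas, which is stricter than Kubernetes itself: the ResourceQuota admits pods one at a time until the limit is reached.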
### 2. Kubernetes ResourceQuota Enforces at Pod Level
The Kubernetes `ResourceQuota` object `tenant-quota` in namespace `ocdp-u-test-b` does enforce limits, but only at the **pod creation** level:
```yaml
spec:
  hard:
    requests.cpu: "2"
    requests.memory: 4Gi
    requests.nvidia.com/gpu: "0"
    requests.nvidia.com/gpumem: "0"
```
When pods exceed quota, Kubernetes explicitly refuses to create them with a clear error message.
However, Helm releases, Services, Deployments, and ReplicaSets are **still created** even when pods are blocked.
### 3. Stuck at "pending-install"
Instances that exceeded quota remained in `pending-install` for the entire observation window (90s+ in Test B, 30s+ in Test C) and never transitioned to `deployed`, `failed`, or any error status. The OCDP platform does not detect the ResourceQuota rejection or update the instance status accordingly. The only way to learn of the failure is to inspect Kubernetes events directly:
```bash
kubectl get events -n ocdp-u-test-b
```
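
A backend could recognize these events by their message text. The sketch below assumes the message format observed in Test B (`exceeded quota: <name>, requested: ..., used: ..., limited: ...`); `QuotaRejection` and `ParseQuotaRejection` are hypothetical names, and how the events are fetched from the cluster is out of scope here.

```go
package main

import "strings"

// QuotaRejection holds the fields of a quota-exceeded event message (hypothetical type).
type QuotaRejection struct {
	Quota     string // name of the ResourceQuota, e.g. "tenant-quota"
	Requested string // raw "requested" list, e.g. "requests.cpu=2,requests.memory=4Gi"
	Used      string
	Limited   string
}

// ParseQuotaRejection extracts quota details from an event message.
// It returns ok=false when the message is not a quota rejection.
func ParseQuotaRejection(msg string) (r QuotaRejection, ok bool) {
	const marker = "exceeded quota: "
	i := strings.Index(msg, marker)
	if i < 0 {
		return QuotaRejection{}, false
	}
	// Sections are separated by ", "; the resource lists inside a section
	// use "," without a space, so this split keeps each list intact.
	parts := strings.Split(msg[i+len(marker):], ", ")
	r.Quota = parts[0]
	for _, p := range parts[1:] {
		key, val, found := strings.Cut(p, ": ")
		if !found {
			continue
		}
		switch key {
		case "requested":
			r.Requested = val
		case "used":
			r.Used = val
		case "limited":
			r.Limited = val
		}
	}
	return r, true
}
```

The parsed fields could feed a `statusReason` on the instance, so users would see the rejection without needing `kubectl` access.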
### 4. GPU Quota Enforcement
Users with `gpu=0` quota **can** submit deployments referencing GPU-enabled charts. The API does not reject them. Only the K8s ResourceQuota blocks pod creation at runtime. This could lead to:
- Unnecessary Helm releases and resource overhead in the cluster
- Confusion for users whose deployments appear to hang at `pending-install`
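
An API-level GPU guard could be as small as the sketch below. `CheckGPUQuota` is a hypothetical helper; it assumes the GPU quota is available as a decimal string (as in the `gpu=0` setting above) and that the number of GPUs a chart requests can be read from its values before installation.

```go
package main

import (
	"fmt"
	"strconv"
)

// CheckGPUQuota returns an error when a deployment asks for more GPUs than
// the quota still allows. quotaGpu is the hard limit as a decimal string.
func CheckGPUQuota(quotaGpu string, usedGPUs, requestedGPUs int64) error {
	limit, err := strconv.ParseInt(quotaGpu, 10, 64)
	if err != nil {
		return fmt.Errorf("invalid gpu quota %q: %w", quotaGpu, err)
	}
	// Deployments that request no GPU always pass this particular check.
	if requestedGPUs > 0 && usedGPUs+requestedGPUs > limit {
		return fmt.Errorf("gpu quota exceeded: requested %d, used %d, limit %d",
			requestedGPUs, usedGPUs, limit)
	}
	return nil
}
```

For Test C this would mean `CheckGPUQuota("0", 0, 1)` fails immediately, before any Helm release is created.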
### 5. Quota Exposed in Login Response
The login response includes quota information:
```json
{
  "quotaCpu": "2",
  "quotaMemory": "4Gi",
  "quotaGpu": "0",
  "quotaGpuMemory": "0"
}
```
This could be used by the frontend to show usage limits, but no pre-flight check uses it server-side.
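
For illustration, the quota fields shown above can be decoded in Go as follows. `LoginQuota` and `DecodeLoginQuota` are hypothetical names, and the sketch assumes the four fields sit at the top level of the response body; any other fields in the real response are simply ignored by `json.Unmarshal`.

```go
package main

import "encoding/json"

// LoginQuota maps the quota fields of the login response (hypothetical type).
// Values stay as strings because they are Kubernetes-style quantities.
type LoginQuota struct {
	CPU       string `json:"quotaCpu"`
	Memory    string `json:"quotaMemory"`
	GPU       string `json:"quotaGpu"`
	GPUMemory string `json:"quotaGpuMemory"`
}

// DecodeLoginQuota extracts the quota fields from a login response body.
func DecodeLoginQuota(body []byte) (LoginQuota, error) {
	var q LoginQuota
	err := json.Unmarshal(body, &q)
	return q, err
}
```

Keeping the values as strings defers quantity parsing to whichever component performs the actual comparison.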
---
## Recommendations
1. **Add pre-flight quota validation** in the backend API: before accepting a deployment, check whether the requested resources (from chart values) would exceed the user's quota. Return HTTP 4xx with a clear error message.
2. **Handle "pending-install" timeout**: implement a watcher that detects when a Helm release has been created but pods remain stuck (e.g., due to ResourceQuota) and:
   - Update instance status to `failed` with a descriptive `statusReason`
   - Clean up the Helm release, Service, etc.
   - Optionally surface the K8s error message via the API
3. **GPU quota pre-check**: if a chart requests GPU resources and the user's `gpu=0`, reject the deployment at the API level before creating any Kubernetes resources.
4. **UI quota indicator**: show remaining quota (used vs. hard limit) on the deployment form so users know their limits before submitting.
---
## ResourceQuota YAML (for reference)
```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: tenant-quota
  namespace: ocdp-u-test-b
  labels:
    ocdp.io/managed-by: ocdp
    ocdp.io/tenant: ocdp-u-test-b
spec:
  hard:
    requests.cpu: "2"
    requests.memory: 4Gi
    requests.nvidia.com/gpu: "0"
    requests.nvidia.com/gpumem: "0"
```
---
## Cleanup Verification
All test instances were removed after testing:
- `quota-test-nginx` ✅ deleted (pods terminated, helm release removed, quota back to 0)
- `quota-test-nginx-2` ✅ cleaned up (no pods created, resources released)
- `quota-test-gpu` ✅ cleaned up (no pods created, resources released)
- ResourceQuota used: all resources at 0