# OCDP Final Document Structure

## Quota lifecycle monitoring implementation 2026-05-14

- [x] Main: integrate per-user per-cluster quota semantics and final verification.
  - [x] Treat ordinary-user empty CPU/memory/GPU/GPU memory quotas as explicit zero.
  - [x] Make create/update/scale quota checks use the selected cluster binding and sync ResourceQuota first.
  - [x] Reject GPU=0 user vllm deployment on k3s before DB instance/release creation.
- [x] Worker A: implement backend quota evaluator/resource quota sync without touching frontend.
- [x] Worker B: implement user lifecycle cleanup, snake_case DTO normalization, and safe admin/user role transitions.
  - [x] Add auth DTO alternate snake_case fields plus `Normalize()` for register/update requests, and call it in auth handler before service mapping.
  - [x] Make admin-to-user role transitions create or safely reuse the username-derived workspace; detect namespace ownership conflicts and return an explicit domain conflict error.
  - [x] Extend workspace binding repository to list/delete all bindings for a workspace so user deletion can clean every cluster binding.
  - [x] Extend tenant kube client with idempotent tenant cleanup for namespace/service account/role binding/resource quota, refusing system namespaces such as `default` and `kube-system`.
  - [x] Extend `AuthService` dependencies for instance/cluster/binding/tenant cleanup, preserving existing callers and avoiding frontend changes.
  - [x] Update `DeleteUser` to reject deletion when the user owns instances; when safe, clean exclusive user workspace cluster bindings and OCDP tenant resources before deleting the user.
  - [x] Add focused Go tests for DTO normalization, role downgrade workspace reuse/conflict, delete-with-instances conflict, cleanup path, and protected namespace cleanup.
  - [x] Run targeted Go tests, review diff, and add Worker B Review summary here.
  - Review: changed auth DTO normalization, auth handler normalization calls, auth service workspace reuse/delete cleanup logic, workspace binding repository ports/adapters, tenant kube cleanup, domain errors, mock/test coverage, and API wiring. Namespace conflicts now return `ErrWorkspaceNamespaceConflict`/HTTP 409; deleting users with owned/workspace instances returns `ErrUserHasInstances`/HTTP 409; protected tenant namespaces return forbidden-style `ErrProtectedNamespace`. Validation passed with targeted Worker B tests and full backend `go test ./...`.
- [x] Worker C: implement monitoring resource aggregation and instance owner username fields.
- [x] Worker D: implement frontend user management, instance card, and monitoring UI changes.
  - [x] Inspect current API/generated/UI type contracts for owner and monitoring resource fields without changing backend.
  - [x] Rework User Management accounts area into a wider operations layout with quota chips/split columns and actions that do not squeeze quota content.
  - [x] Change admin-to-user downgrade flow to open/reuse the tenant resource limit editor and submit role plus namespace/cluster/quota fields together.
  - [x] Show instance owner as `ownerUsername` when present, otherwise a shortened `ownerId`.
  - [x] Extend monitoring frontend types/adapters as needed for GPU allocation, GPU memory, and per-user resource rows returned by the backend.
  - [x] Update Cluster Monitoring cards/page to render GPU allocation/GPU Mem and per-user resource tables while respecting backend-scoped data for normal users.
  - [x] Check responsive behavior for the touched UI and avoid obvious desktop/mobile overflow.
  - [x] Run targeted frontend type/build tests available in the repo and review diff.
  - [x] Add Worker D Review summary with changed files and verification results.
  - Review: changed `frontend/src/features/configuration/users/pages/UserManagementPage.tsx`, `frontend/src/features/artifact/instances/components/InstanceCard.tsx`, `frontend/src/features/monitoring/clusters/components/ClusterMonitorCard.tsx`, `frontend/src/features/monitoring/clusters/pages/MonitoringClustersPage.tsx`, `frontend/src/core/types/index.ts`, and `frontend/src/api/index.ts`. User Management now uses wider operation rows with quota chips and admin-to-user downgrade saves role plus tenant limits. Instance cards show owner username or short owner ID. Cluster monitoring renders GPU allocation, GPU memory, and backend-returned per-user resource rows. Validation: `npm run build` passed; targeted `npx eslint ...` on changed frontend source files passed; full `npm run lint` remains blocked by pre-existing generated/cache and legacy lint errors; Playwright viewport check passed for `/configuration/users` and `/monitoring/clusters` at 390x844 and 1440x1000 with mocked API data and no horizontal overflow detected.
- [x] Worker E: add API/Playwright/k3s regression tests for this plan.
- [x] Worker F: read-only review for quota bypass, namespace deletion safety, and monitoring privacy.
- [x] Run `go test ./...` and `npm run build`.
- [x] Run Docker Compose smoke plus API/Playwright regression scripts.
- [x] Run real k3s negative vllm quota deployment test and clean up test users.
- [ ] Run positive GPU=1 k3s vllm deployment when cluster resources are available.
- [x] Add Review summary and lessons.
- Review: ordinary users are now forced into username-derived private workspace/namespace so per-cluster bindings behave as per-user/per-cluster quota buckets. Empty user CPU/memory/GPU/GPU Mem default to explicit `0`; create/update/scale use rendered Helm resource estimates, live ResourceQuota usage minus current release delta, and synced tenant ResourceQuota before persistence/Helm mutation.
  User deletion blocks on owned/workspace instances, then cleans tenant bindings/namespaces and deletes the exclusive workspace record. Monitoring now returns per-user resource rows and strips cluster-wide node/total metrics for ordinary users. Frontend renders `resourceUsageByUser`, instance owners, less cramped user quotas, and wraps InstanceCard actions to keep Delete inside the viewport. Validation passed: backend `go test ./...`, frontend `npm run build`, Docker Compose health checks, `test/unresolved_bugs_security_gateway_contract.py`, `test/unresolved_bugs_api_contract.py`, `test/user_namespace_quota_api_contract.py`, `test/frontend-playwright-smoke.py`, and `test/instance_card_action_layout_playwright.py`. Positive GPU=1 vllm deployment was not run because the cluster resource constraint remains external; negative GPU=0 vllm quota rejection on k3s passed.

## Unresolved bugs implementation 2026-05-14

- [x] Worker A: fix instance API contract: detail replicas, list values, values/valuesYaml conflict, namespace 403.
  - [x] Inspect current instance handler/service tests and avoid touching other workers' areas.
  - [x] Add request validation so `values` plus `valuesYaml`/`values_yaml` conflicts return HTTP 400, while YAML-only still populates values.
  - [x] Enrich `GetInstance` replica count using the same live K8s source as list.
  - [x] Include `values` in list responses for API compatibility.
  - [x] Change normal-user tenant namespace mismatch from silent override to `ErrForbidden`/HTTP 403.
  - [x] Add focused Go tests for namespace mismatch and replica enrichment/list values where practical.
  - [x] Run targeted Go tests and review diff.
  - Review: changed only scoped instance backend files plus this task tracker; validated with `go test ./internal/domain/service` and `go test ./internal/adapter/input/http/rest` from `backend/`.
- [x] Worker B: add Helm-rendered quota pre-check helper before DB create/Helm install.
  - [x] Inspect Helm client/service quota contracts and preserve other workers' edits.
  - [x] Add domain quota precheck types and compare logic for CPU, memory, GPU, and integer-MB gpumem.
  - [x] Add Helm render estimator output port and real/mock Helm implementations that render final chart values and sum Pod template requests/limits.
  - [x] Add focused Go tests for quota comparison and rendered manifest estimation where feasible.
  - [x] Run targeted Go tests and review diff.
  - Review: exposed `QuotaPrecheckService.EstimateAndCompare` plus `CompareWorkspaceQuota`; real Helm now dry-renders `/tmp/charts/{chart}-{version}.tgz` with final values and estimates Pod template requests/limits. Added quota and manifest estimator tests. Validation passed with `go test ./internal/domain/service ./internal/adapter/output/helm/...` and full backend `go test ./...`.
- [ ] Worker C: add compatibility/security backend endpoints and auth/CORS/rate-limit fixes.
  - [x] Inspect backend route/handler/service contracts and preserve other workers' edits.
  - [x] Add `/repositories/{repo}/tags` compatibility alias without changing existing artifact behavior.
  - [x] Add `/monitoring/clusters/{id}/metrics` alias and `/clusters/{id}/stats` compatibility response.
  - [x] Add `/clusters/{id}/kubeconfig` tenant kubeconfig endpoint scoped to the authenticated user's workspace and requested cluster.
  - [x] Make login failures uniform and add a lightweight per-client login rate limit.
  - [x] Replace permissive CORS reflection/wildcard defaults with an allowlist-driven default suitable for local dev.
  - [ ] Add focused Go tests where straightforward, then run relevant Go tests and review diff.
  - Review: changed scoped backend route/auth/CORS handlers, added CORS and login limiter tests, and removed the direct SSE CORS wildcard so global CORS applies.
    Validation attempted with `go test ./cmd/api ./internal/adapter/input/http/rest` from `backend/`, but it is currently blocked by concurrent Worker B compile errors in `internal/domain/service/quota_precheck.go` and Helm client implementations missing `EstimateInstanceResources`.
- [x] Worker D: harden Nginx gateway `/health`, server tokens, and security headers.
- [ ] Worker E: align frontend/API client and Playwright coverage for conflict/namespace/scale flows.
- [ ] Worker F: add API/security/regression test scripts and review coverage.
- [ ] Integrate worker changes, resolve conflicts, and run Go/frontend builds.
- [ ] Run Docker Compose smoke, API contracts, Playwright, and real k3s deploy cleanup.
- [ ] Update `tasks/lessons.md` and add Review summary here.

## docs/ directory (cleaned up)

| File | Purpose | Status |
|------|------|------|
| `user-guide.md` | User operation guide | ✅ Permanent reference |
| `test-scenarios.md` | 100+ test case designs | ✅ Permanent reference |
| `test-users.json` | Credentials for 4 test accounts | ✅ Permanent reference |
| `regression-full-report.md` | Latest comprehensive regression report | ✅ Can be deleted (next release) |
| `UNRESOLVED-BUGS.md` | List of unresolved issues (15) | ✅ Current version |

## Worker C monitoring and instance owner backend 2026-05-14

- [x] Inspect existing instance/monitoring permission, repository, DTO, and K8s metrics contracts without reverting other workers' changes.
- [x] Add `ownerUsername` to instance entity/DTO responses and hydrate it for detail/list via user repository while preserving ordinary-user/admin visibility rules.
- [x] Add K8s Pod resource allocation collection from requests/limits, including GPU and `requests.nvidia.com/gpumem` as integer MB.
- [x] Aggregate `resourceUsageByUser` in monitoring service by matching Pods to visible instances/workspaces/owners, with ordinary users scoped to themselves and admins seeing all visible owners.
- [x] Expose cluster-level GPU/GPU memory allocation fields and per-user resource usage in `/monitoring/clusters`, detail, and existing aliases.
- [x] Add focused Go tests for instance owner username and monitoring resource aggregation/privacy.
- [x] Run relevant Go tests, review diff, and add Review summary here.
- Review: Instance list/detail now include `ownerUsername` hydrated from the user repository. Monitoring responses now include per-user resource usage plus CPU/memory/GPU/GPU-memory request/limit allocation fields derived from Kubernetes Pod resources and DB instance ownership mapping; ordinary users only see their own allocation rows/totals, admins see all visible instance owners. Validation passed with `go test ./internal/domain/service`, `go test ./cmd/api ./internal/adapter/input/http/rest ./internal/adapter/output/k8s`, and backend `go test ./...`.

## Debug quota limits monitoring UI 2026-05-15

- [x] Inspect current runtime logs for workspace conflict/quota errors without killing other services.
- [x] Fix quota semantics: CPU/memory blank means unlimited; GPU/GPU Mem blank means explicit zero for ordinary users.
- [x] Fix admin user update so editing an existing user's quota does not recreate/reassign namespace and does not raise false `workspace namespace conflict`.
- [x] Rework User Management action controls so they wrap inside the viewport on desktop and mobile.
- [x] Improve monitoring for ordinary users with self-scoped useful fields instead of all `N/A`; make admin monitoring show the new resource allocation rows clearly.
- [x] Rebuild Docker Compose stack and run backend/frontend tests plus Playwright overflow smoke.
- [x] Use ivanwu on k3s with vllm-serve 0.6.0, CPU/memory unlimited and gpumem `10000`, then verify/clean up.
- [x] Add Review summary and lessons.

Review:

- Runtime logs were checked before and after changes. The only 502s observed were during intentional backend rebuild; final backend/nginx logs had no error/fatal/5xx entries.
- Admin can now update ivanwu without `workspace namespace conflict`; ivanwu was migrated to workspace `ivanwu`, namespace `ocdp-u-ivanwu`, default cluster k3s, CPU/memory unlimited, GPU `1`, GPU Mem `10000`.
- k3s ResourceQuota for ivanwu contains only GPU and GPU Mem hard limits; CPU/memory are omitted as unlimited. A vllm-serve `0.6.0` deployment used `harbor.bwgdi.com/library/vllm-openai:v0.17.1`, reached `deployed`, Pod `1/1 Running`, then was deleted through the platform and quota usage returned to `0/1` GPU and `0/10k` gpumem.
- Monitoring now shows ordinary users self-scoped allocation rows and admin per-user rows. The vLLM deployment was visible as CPU `1.00 cores`, memory `9.8 GiB`, GPU `1`, GPU Mem `10000`.
- Verification passed: `go test ./...`, `npm run build`, `test/frontend-playwright-smoke.py`, `test/instance_card_action_layout_playwright.py`, `test/user_management_layout_playwright.py`, `test/user_namespace_quota_api_contract.py`, and `test/unresolved_bugs_api_contract.py`.

## Restart docs and user management overflow 2026-05-18

- [x] Inspect current Docker Compose service lifecycle and identify why frontend/backend feel disconnected.
- [x] Update Makefile/README so one clear command starts the whole platform, with explicit rebuild/restart/status/log commands.
- [x] Restart the full stack through the documented command and verify health endpoints.
- [x] Reproduce User Management overflow with Playwright at desktop/tablet/mobile widths.
- [x] Fix User Management layout so action buttons and quota controls stay inside the viewport.
- [x] Run backend/frontend builds plus Playwright layout smoke.
- [x] Record Review summary and lessons.

Review:

- `make up` is now the single documented platform start command. It runs `docker compose up --build -d` for the whole stack, and old commands (`run-2`, `docker-dev`, `docker-prod`, `docker-up`) are compatibility aliases.
- `docker-compose.yml` now keeps `nginx` under `restart: unless-stopped`; `make docker-ps` and `make up` show `docker compose ps -a`, so the expected `frontend-build Exited (0)` state is visible and less confusing.
- README now explains that frontend-build is a one-shot build job and the actual frontend runtime is `nginx`, which also proxies `/api`.
- User Management layout was changed from a fixed four-column row to a responsive card layout with a wrapping action area. The app shell content column also has `min-w-0` so wide children cannot force browser overflow.
- Verification passed: `go test ./...`, `npm run build`, `make up`, health checks for backend/nginx/web, `test/user_management_layout_playwright.py` across 1440/1280/1024/900/768 widths, `test/frontend-playwright-smoke.py`, and `test/instance_card_action_layout_playwright.py`.
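
The horizontal-overflow condition that the layout smoke checks above look for can be sketched roughly as follows. This is an illustration only, not the contents of `test/user_management_layout_playwright.py`: the names `PageMetrics`, `has_horizontal_overflow`, and `overflowing_widths`, the 1px tolerance, and the sample numbers are all assumptions. In a real browser test the two widths would typically be read from `document.documentElement.clientWidth` and `document.documentElement.scrollWidth` after resizing the viewport.

```python
# Hypothetical sketch: names, tolerance, and sample data are assumptions,
# not taken from the repository's Playwright scripts.
from dataclasses import dataclass

# Viewport widths listed in the verification note above.
VIEWPORT_WIDTHS = [1440, 1280, 1024, 900, 768]


@dataclass(frozen=True)
class PageMetrics:
    """Values a browser test would read from document.documentElement."""
    viewport_width: int  # width the browser window was set to
    client_width: int    # documentElement.clientWidth
    scroll_width: int    # documentElement.scrollWidth


def has_horizontal_overflow(m: PageMetrics, tolerance_px: int = 1) -> bool:
    """Return True when the content is wider than the visible area.

    tolerance_px absorbs sub-pixel rounding; the value 1 is an assumption.
    """
    return m.scroll_width > m.client_width + tolerance_px


def overflowing_widths(samples: list[PageMetrics]) -> list[int]:
    """Collect the viewport widths whose sample overflows horizontally."""
    return [m.viewport_width for m in samples if has_horizontal_overflow(m)]


if __name__ == "__main__":
    # Made-up data: a rigid 1100px-wide row overflows every narrower viewport.
    samples = [PageMetrics(w, w, max(w, 1100)) for w in VIEWPORT_WIDTHS]
    print(overflowing_widths(samples))  # [1024, 900, 768]
```

With a wrapping layout the scroll width should track the client width at every viewport, so the returned list would be empty.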