fix: scale replicas in response, K8s metrics client, quota precheck, auth tests

- Add GetMetrics method to MetricsClient interface and implement cluster metrics API
- Add QuotaPrecheck service for validating resource quotas before deployment
- Add auth DTO with role/permission models and auth handler tests
- Add instance diagnostics: mounted NFS volumes, labels, annotations in pod diagnostics
- Update workspace handler with GetWorkspace endpoint and shared-user list
- Fix monitoring handler to use correct service method name
- Add tail_lines fallback in instance handler for snake_case query params
- Update nginx config for SSE log streaming support (no buffering)
- Add comprehensive test coverage: auth_service_test, auth_handler_test, auth_dto_test, metrics_client_test, quota_precheck_test
- Update error messages for quota validation and instance operations
- ModifyModal: fix YAML lineWidth:0, modified keys summary, delta-only submit
- InstanceCard: correctly disable scale-minus when replicas <= 0
- SidebarLayout: add hover transition for sidebar items
- Update todo.md and lessons.md with latest fixes
2026-05-20 16:56:29 +08:00
parent 8f90cf0f0d
commit 33ddaf97db
59 changed files with 4805 additions and 457 deletions
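The `tail_lines` fallback listed in the commit message could look roughly like the sketch below. It assumes the handler's primary query parameter is the camelCase `tailLines` and that a default applies when neither form parses; the helper name `tailLinesFromQuery` and the fallback argument are hypothetical and are not taken from the instance handler.

```go
package main

import (
	"net/url"
	"strconv"
)

// tailLinesFromQuery is a hypothetical helper, not the handler's real code.
// It reads the assumed camelCase parameter first and falls back to the
// snake_case form, so clients that convert keys to snake_case still work.
func tailLinesFromQuery(q url.Values, fallback int) int {
	raw := q.Get("tailLines") // assumed primary parameter name
	if raw == "" {
		raw = q.Get("tail_lines") // snake_case fallback
	}
	n, err := strconv.Atoi(raw)
	if err != nil || n <= 0 {
		// Missing, non-numeric, or non-positive values use the default.
		return fallback
	}
	return n
}
```

In an HTTP handler this would typically be called as `tailLinesFromQuery(r.URL.Query(), 100)`, where `100` is only an example default.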
--- a/tasks/lessons.md
+++ b/tasks/lessons.md
@@ -7,3 +7,9 @@
 - For real Helm smoke tests, wait for platform instance deletion to remove the DB record before deleting the Kubernetes namespace manually. Deleting the namespace too early can make the async Helm uninstall mark the instance failed.
 - When embedding Helm, setting `actionConfig.Init(..., namespace, ...)` and `Install.Namespace` is not enough. The custom `RESTClientGetter` must also override the raw kubeconfig loader namespace, or manifests without `metadata.namespace` can be created in the kubeconfig context namespace such as `default`.
 - **Axios keysToSnake recursively converts ALL object keys, including those inside the user-provided values map.** This silently renames Helm chart values (gpuMem → gpu_mem), causing the chart to ignore user settings. Fix: skip recursion for known data fields (values, valuesYaml) while still converting field names. Backend DTOs must provide dual json tags (camelCase + snake_case) with a Normalize() fallback.
+- In the current two-role model, ordinary users must be forced into username-derived private workspaces/namespaces. Do not accept arbitrary `workspaceId` for role=`user`, or `(workspace_id, cluster_id)` quotas become shared across users. When editing an existing user, update the existing private workspace in place; only migrate users still attached to the default workspace.
+- CPU and memory quotas are allowed to be blank, which means no platform ResourceQuota limit for that resource. GPU and `requests.nvidia.com/gpumem` should still default to explicit `0` for ordinary users unless admin sets them.
+- Layout regression tests must not depend on deliberately invalid charts leaving DB instances behind. The safer behavior is to reject before DB persistence; use mocked API data for pure frontend overflow checks.
+- Monitoring Pod-to-instance attribution cannot rely only on Helm standard labels. Some local charts, including `vllm-serve`, use only `app=<release>`; include that fallback before concluding allocation is zero.
+- In this Compose stack the React frontend is not a long-running frontend container. `frontend-build` is a one-shot asset build and `nginx` is the frontend runtime plus API gateway; README and status commands must make that explicit or users will think the stack is partially down.
+- For admin tables/cards inside the sidebar shell, fixed multi-column grids can still overflow even when individual buttons use `min-w-0`. Prefer responsive card layouts with a wrapping action region, and test at 1440/1280/1024/900/768 widths.
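The blank-quota lesson above (blank CPU/memory means no limit, blank GPU and `requests.nvidia.com/gpumem` mean explicit `0`) can be sketched as a function that builds the ResourceQuota hard map. This is an illustration only: the `UserQuota` type, the helper names, and the `limits.cpu`, `limits.memory`, and `requests.nvidia.com/gpu` keys are assumptions; only `requests.nvidia.com/gpumem` appears in the notes.

```go
package main

// UserQuota holds admin-entered quota strings for an ordinary user.
// An empty string means the field was left blank. (Hypothetical type.)
type UserQuota struct {
	CPU, Memory, GPU, GPUMem string
}

// orZero turns a blank value into an explicit "0".
func orZero(v string) string {
	if v == "" {
		return "0"
	}
	return v
}

// hardLimits builds the "hard" map of a ResourceQuota for an ordinary user.
// GPU resources are always present; CPU and memory are omitted when blank,
// which leaves those resources without a platform limit.
func hardLimits(q UserQuota) map[string]string {
	hard := map[string]string{
		"requests.nvidia.com/gpu":    orZero(q.GPU), // assumed key
		"requests.nvidia.com/gpumem": orZero(q.GPUMem),
	}
	if q.CPU != "" {
		hard["limits.cpu"] = q.CPU // assumed key
	}
	if q.Memory != "" {
		hard["limits.memory"] = q.Memory // assumed key
	}
	return hard
}
```

With `UserQuota{GPU: "1", GPUMem: "10000"}` the map contains only the two GPU entries, which mirrors a quota where CPU and memory are left unlimited.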
--- a/tasks/todo.md
+++ b/tasks/todo.md
@@ -1,5 +1,77 @@
 # OCDP Final Documentation Structure

+## Quota lifecycle monitoring implementation 2026-05-14
+
+- [x] Main: integrate per-user per-cluster quota semantics and final verification.
+  - [x] Treat ordinary-user empty CPU/memory/GPU/GPU memory quotas as explicit zero.
+  - [x] Make create/update/scale quota checks use the selected cluster binding and sync ResourceQuota first.
+  - [x] Reject GPU=0 user vllm deployment on k3s before DB instance/release creation.
+- [x] Worker A: implement backend quota evaluator/resource quota sync without touching frontend.
+- [x] Worker B: implement user lifecycle cleanup, snake_case DTO normalization, and safe admin/user role transitions.
+  - [x] Add auth DTO alternate snake_case fields plus `Normalize()` for register/update requests, and call it in auth handler before service mapping.
+  - [x] Make admin-to-user role transitions create or safely reuse the username-derived workspace; detect namespace ownership conflicts and return an explicit domain conflict error.
+  - [x] Extend workspace binding repository to list/delete all bindings for a workspace so user deletion can clean every cluster binding.
+  - [x] Extend tenant kube client with idempotent tenant cleanup for namespace/service account/role binding/resource quota, refusing system namespaces such as `default` and `kube-system`.
+  - [x] Extend `AuthService` dependencies for instance/cluster/binding/tenant cleanup, preserving existing callers and avoiding frontend changes.
+  - [x] Update `DeleteUser` to reject deletion when the user owns instances; when safe, clean exclusive user workspace cluster bindings and OCDP tenant resources before deleting the user.
+  - [x] Add focused Go tests for DTO normalization, role downgrade workspace reuse/conflict, delete-with-instances conflict, cleanup path, and protected namespace cleanup.
+  - [x] Run targeted Go tests, review diff, and add Worker B Review summary here.
+  - Review: changed auth DTO normalization, auth handler normalization calls, auth service workspace reuse/delete cleanup logic, workspace binding repository ports/adapters, tenant kube cleanup, domain errors, mock/test coverage, and API wiring. Namespace conflicts now return `ErrWorkspaceNamespaceConflict`/HTTP 409; deleting users with owned/workspace instances returns `ErrUserHasInstances`/HTTP 409; protected tenant namespaces return forbidden-style `ErrProtectedNamespace`. Validation passed with targeted Worker B tests and full backend `go test ./...`.
+- [x] Worker C: implement monitoring resource aggregation and instance owner username fields.
+- [x] Worker D: implement frontend user management, instance card, and monitoring UI changes.
+  - [x] Inspect current API/generated/UI type contracts for owner and monitoring resource fields without changing backend.
+  - [x] Rework User Management accounts area into a wider operations layout with quota chips/split columns and actions that do not squeeze quota content.
+  - [x] Change admin-to-user downgrade flow to open/reuse the tenant resource limit editor and submit role plus namespace/cluster/quota fields together.
+  - [x] Show instance owner as `ownerUsername` when present, otherwise a shortened `ownerId`.
+  - [x] Extend monitoring frontend types/adapters as needed for GPU allocation, GPU memory, and per-user resource rows returned by the backend.
+  - [x] Update Cluster Monitoring cards/page to render GPU allocation/GPU Mem and per-user resource tables while respecting backend-scoped data for normal users.
+  - [x] Check responsive behavior for the touched UI and avoid obvious desktop/mobile overflow.
+  - [x] Run targeted frontend type/build tests available in the repo and review diff.
+  - [x] Add Worker D Review summary with changed files and verification results.
+  - Review: changed `frontend/src/features/configuration/users/pages/UserManagementPage.tsx`, `frontend/src/features/artifact/instances/components/InstanceCard.tsx`, `frontend/src/features/monitoring/clusters/components/ClusterMonitorCard.tsx`, `frontend/src/features/monitoring/clusters/pages/MonitoringClustersPage.tsx`, `frontend/src/core/types/index.ts`, and `frontend/src/api/index.ts`. User Management now uses wider operation rows with quota chips and admin-to-user downgrade saves role plus tenant limits. Instance cards show owner username or short owner ID. Cluster monitoring renders GPU allocation, GPU memory, and backend-returned per-user resource rows. Validation: `npm run build` passed; targeted `npx eslint ...` on changed frontend source files passed; full `npm run lint` remains blocked by pre-existing generated/cache and legacy lint errors; Playwright viewport check passed for `/configuration/users` and `/monitoring/clusters` at 390x844 and 1440x1000 with mocked API data and no horizontal overflow detected.
+- [x] Worker E: add API/Playwright/k3s regression tests for this plan.
+- [x] Worker F: read-only review for quota bypass, namespace deletion safety, and monitoring privacy.
+- [x] Run `go test ./...` and `npm run build`.
+- [x] Run Docker Compose smoke plus API/Playwright regression scripts.
+- [x] Run real k3s negative vllm quota deployment test and clean up test users.
+- [ ] Run positive GPU=1 k3s vllm deployment when cluster resources are available.
+- [x] Add Review summary and lessons.
+  - Review: ordinary users are now forced into username-derived private workspace/namespace so per-cluster bindings behave as per-user/per-cluster quota buckets. Empty user CPU/memory/GPU/GPU Mem default to explicit `0`; create/update/scale use rendered Helm resource estimates, live ResourceQuota usage minus current release delta, and synced tenant ResourceQuota before persistence/Helm mutation. User deletion blocks on owned/workspace instances, then cleans tenant bindings/namespaces and deletes the exclusive workspace record. Monitoring now returns per-user resource rows and strips cluster-wide node/total metrics for ordinary users. Frontend renders `resourceUsageByUser`, instance owners, less cramped user quotas, and wraps InstanceCard actions to keep Delete inside the viewport. Validation passed: backend `go test ./...`, frontend `npm run build`, Docker Compose health checks, `test/unresolved_bugs_security_gateway_contract.py`, `test/unresolved_bugs_api_contract.py`, `test/user_namespace_quota_api_contract.py`, `test/frontend-playwright-smoke.py`, and `test/instance_card_action_layout_playwright.py`. Positive GPU=1 vllm deployment was not run because the cluster resource constraint remains external; negative GPU=0 vllm quota rejection on k3s passed.
+
+## Unresolved bugs implementation 2026-05-14
+
+- [x] Worker A: fix instance API contract: detail replicas, list values, values/valuesYaml conflict, namespace 403.
+  - [x] Inspect current instance handler/service tests and avoid touching other workers' areas.
+  - [x] Add request validation so `values` plus `valuesYaml`/`values_yaml` conflicts return HTTP 400, while YAML-only still populates values.
+  - [x] Enrich `GetInstance` replica count using the same live K8s source as list.
+  - [x] Include `values` in list responses for API compatibility.
+  - [x] Change normal-user tenant namespace mismatch from silent override to `ErrForbidden`/HTTP 403.
+  - [x] Add focused Go tests for namespace mismatch and replica enrichment/list values where practical.
+  - [x] Run targeted Go tests and review diff.
+  - Review: changed only scoped instance backend files plus this task tracker; validated with `go test ./internal/domain/service` and `go test ./internal/adapter/input/http/rest` from `backend/`.
+- [x] Worker B: add Helm-rendered quota pre-check helper before DB create/Helm install.
+  - [x] Inspect Helm client/service quota contracts and preserve other workers' edits.
+  - [x] Add domain quota precheck types and compare logic for CPU, memory, GPU, and integer-MB gpumem.
+  - [x] Add Helm render estimator output port and real/mock Helm implementations that render final chart values and sum Pod template requests/limits.
+  - [x] Add focused Go tests for quota comparison and rendered manifest estimation where feasible.
+  - [x] Run targeted Go tests and review diff.
+  - Review: exposed `QuotaPrecheckService.EstimateAndCompare` plus `CompareWorkspaceQuota`; real Helm now dry-renders `/tmp/charts/{chart}-{version}.tgz` with final values and estimates Pod template requests/limits. Added quota and manifest estimator tests. Validation passed with `go test ./internal/domain/service ./internal/adapter/output/helm/...` and full backend `go test ./...`.
+- [ ] Worker C: add compatibility/security backend endpoints and auth/CORS/rate-limit fixes.
+  - [x] Inspect backend route/handler/service contracts and preserve other workers' edits.
+  - [x] Add `/repositories/{repo}/tags` compatibility alias without changing existing artifact behavior.
+  - [x] Add `/monitoring/clusters/{id}/metrics` alias and `/clusters/{id}/stats` compatibility response.
+  - [x] Add `/clusters/{id}/kubeconfig` tenant kubeconfig endpoint scoped to the authenticated user's workspace and requested cluster.
+  - [x] Make login failures uniform and add a lightweight per-client login rate limit.
+  - [x] Replace permissive CORS reflection/wildcard defaults with an allowlist-driven default suitable for local dev.
+  - [ ] Add focused Go tests where straightforward, then run relevant Go tests and review diff.
+  - Review: changed scoped backend route/auth/CORS handlers, added CORS and login limiter tests, and removed the direct SSE CORS wildcard so global CORS applies. Validation attempted with `go test ./cmd/api ./internal/adapter/input/http/rest` from `backend/`, but it is currently blocked by concurrent Worker B compile errors in `internal/domain/service/quota_precheck.go` and Helm client implementations missing `EstimateInstanceResources`.
+- [x] Worker D: harden Nginx gateway `/health`, server tokens, and security headers.
+- [ ] Worker E: align frontend/API client and Playwright coverage for conflict/namespace/scale flows.
+- [ ] Worker F: add API/security/regression test scripts and review coverage.
+- [ ] Integrate worker changes, resolve conflicts, and run Go/frontend builds.
+- [ ] Run Docker Compose smoke, API contracts, Playwright, and real k3s deploy cleanup.
+- [ ] Update `tasks/lessons.md` and add Review summary here.
+
 ## docs/ directory (cleaned up)

 | File | Purpose | Status |
@@ -9,3 +81,49 @@
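 | `test-users.json` | Credentials for 4 test accounts | ✅ Permanent reference |
 | `regression-full-report.md` | Latest comprehensive regression report | ✅ Can be deleted (next release) |
 | `UNRESOLVED-BUGS.md` | List of unresolved issues (15) | ✅ Current version |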
+
+## Worker C monitoring and instance owner backend 2026-05-14
+
+- [x] Inspect existing instance/monitoring permission, repository, DTO, and K8s metrics contracts without reverting other workers' changes.
+- [x] Add `ownerUsername` to instance entity/DTO responses and hydrate it for detail/list via user repository while preserving ordinary-user/admin visibility rules.
+- [x] Add K8s Pod resource allocation collection from requests/limits, including GPU and `requests.nvidia.com/gpumem` as integer MB.
+- [x] Aggregate `resourceUsageByUser` in monitoring service by matching Pods to visible instances/workspaces/owners, with ordinary users scoped to themselves and admins seeing all visible owners.
+- [x] Expose cluster-level GPU/GPU memory allocation fields and per-user resource usage in `/monitoring/clusters`, detail, and existing aliases.
+- [x] Add focused Go tests for instance owner username and monitoring resource aggregation/privacy.
+- [x] Run relevant Go tests, review diff, and add Review summary here.
+  - Review: Instance list/detail now include `ownerUsername` hydrated from the user repository. Monitoring responses now include per-user resource usage plus CPU/memory/GPU/GPU-memory request/limit allocation fields derived from Kubernetes Pod resources and DB instance ownership mapping; ordinary users only see their own allocation rows/totals, admins see all visible instance owners. Validation passed with `go test ./internal/domain/service`, `go test ./cmd/api ./internal/adapter/input/http/rest ./internal/adapter/output/k8s`, and backend `go test ./...`.
+
+## Debug quota limits monitoring UI 2026-05-15
+
+- [x] Inspect current runtime logs for workspace conflict/quota errors without killing other services.
+- [x] Fix quota semantics: CPU/memory blank means unlimited; GPU/GPU Mem blank means explicit zero for ordinary users.
+- [x] Fix admin user update so editing an existing user's quota does not recreate/reassign namespace and does not raise false `workspace namespace conflict`.
+- [x] Rework User Management action controls so they wrap inside the viewport on desktop and mobile.
+- [x] Improve monitoring for ordinary users with self-scoped useful fields instead of all `N/A`; make admin monitoring show the new resource allocation rows clearly.
+- [x] Rebuild Docker Compose stack and run backend/frontend tests plus Playwright overflow smoke.
+- [x] Use ivanwu on k3s with vllm-serve 0.6.0, CPU/memory unlimited and gpumem `10000`, then verify/clean up.
+- [x] Add Review summary and lessons.
+
+Review:
+- Runtime logs were checked before and after changes. The only 502s observed were during intentional backend rebuild; final backend/nginx logs had no error/fatal/5xx entries.
+- Admin can now update ivanwu without `workspace namespace conflict`; ivanwu was migrated to workspace `ivanwu`, namespace `ocdp-u-ivanwu`, default cluster k3s, CPU/memory unlimited, GPU `1`, GPU Mem `10000`.
+- k3s ResourceQuota for ivanwu contains only GPU and GPU Mem hard limits; CPU/memory are omitted as unlimited. A vllm-serve `0.6.0` deployment used `harbor.bwgdi.com/library/vllm-openai:v0.17.1`, reached `deployed`, Pod `1/1 Running`, then was deleted through the platform and quota usage returned to `0/1` GPU and `0/10k` gpumem.
+- Monitoring now shows ordinary users self-scoped allocation rows and admin per-user rows. The vLLM deployment was visible as CPU `1.00 cores`, memory `9.8 GiB`, GPU `1`, GPU Mem `10000`.
+- Verification passed: `go test ./...`, `npm run build`, `test/frontend-playwright-smoke.py`, `test/instance_card_action_layout_playwright.py`, `test/user_management_layout_playwright.py`, `test/user_namespace_quota_api_contract.py`, and `test/unresolved_bugs_api_contract.py`.
+
+## Restart docs and user management overflow 2026-05-18
+
+- [x] Inspect current Docker Compose service lifecycle and identify why frontend/backend feel disconnected.
+- [x] Update Makefile/README so one clear command starts the whole platform, with explicit rebuild/restart/status/log commands.
+- [x] Restart the full stack through the documented command and verify health endpoints.
+- [x] Reproduce User Management overflow with Playwright at desktop/tablet/mobile widths.
+- [x] Fix User Management layout so action buttons and quota controls stay inside the viewport.
+- [x] Run backend/frontend builds plus Playwright layout smoke.
+- [x] Record Review summary and lessons.
+
+Review:
+- `make up` is now the single documented platform start command. It runs `docker compose up --build -d` for the whole stack, and old commands (`run-2`, `docker-dev`, `docker-prod`, `docker-up`) are compatibility aliases.
+- `docker-compose.yml` now keeps `nginx` under `restart: unless-stopped`; `make docker-ps` and `make up` show `docker compose ps -a`, so the expected `frontend-build Exited (0)` state is visible and less confusing.
+- README now explains that frontend-build is a one-shot build job and the actual frontend runtime is `nginx`, which also proxies `/api`.
+- User Management layout was changed from a fixed four-column row to a responsive card layout with a wrapping action area. The app shell content column also has `min-w-0` so wide children cannot force browser overflow.
+- Verification passed: `go test ./...`, `npm run build`, `make up`, health checks for backend/nginx/web, `test/user_management_layout_playwright.py` across 1440/1280/1024/900/768 widths, `test/frontend-playwright-smoke.py`, and `test/instance_card_action_layout_playwright.py`.
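The create/update/scale quota check recorded in this task log (rendered Helm estimate compared against live ResourceQuota usage minus the current release's own usage) reduces to one comparison per resource. The sketch below is a minimal illustration under stated assumptions: values are integers in a single unit per resource (for example GPU count or integer-MB gpumem), a negative hard limit is a hypothetical sentinel for "no limit", and `fitsQuota` is not the signature of `QuotaPrecheckService.EstimateAndCompare`.

```go
package main

// fitsQuota reports whether replacing a release's current usage with a new
// estimate keeps the namespace within its hard limit for one resource.
//
//	hard      - quota hard limit; negative means unlimited (assumed sentinel)
//	used      - live usage in the namespace, including this release
//	current   - the part of used that belongs to the release being changed
//	            (0 when creating a new instance)
//	estimated - usage of the release after the create/update/scale
func fitsQuota(hard, used, current, estimated int64) bool {
	if hard < 0 {
		return true // no platform limit for this resource
	}
	// Subtract the release's own usage first so an update or scale is not
	// double-counted against the quota.
	return used-current+estimated <= hard
}
```

For example, a user with GPU quota `0` asking for one GPU gives `fitsQuota(0, 0, 0, 1) == false`, so the request can be rejected before any DB record or Helm release is created.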