- Add GetMetrics method to MetricsClient interface and implement cluster metrics API
- Add QuotaPrecheck service for validating resource quotas before deployment
- Add auth DTO with role/permission models and auth handler tests
- Add instance diagnostics: mounted NFS volumes, labels, annotations in pod diagnostics
- Update workspace handler with GetWorkspace endpoint and shared-user list
- Fix monitoring handler to use correct service method name
- Add tail_lines fallback in instance handler for snake_case query params
- Update nginx config for SSE log streaming support (no buffering)
- Add comprehensive test coverage: auth_service_test, auth_handler_test, auth_dto_test, metrics_client_test, quota_precheck_test
- Update error messages for quota validation and instance operations
- ModifyModal: fix YAML lineWidth:0, modified keys summary, delta-only submit
- InstanceCard: correctly disable scale-minus when replicas <= 0
- SidebarLayout: add hover transition for sidebar items
- Update todo.md and lessons.md with latest fixes
OCDP final documentation structure
Quota lifecycle monitoring implementation 2026-05-14
- Main: integrate per-user per-cluster quota semantics and final verification.
- Treat ordinary-user empty CPU/memory/GPU/GPU memory quotas as explicit zero.
- Make create/update/scale quota checks use the selected cluster binding and sync ResourceQuota first.
- Reject GPU=0 user vllm deployment on k3s before DB instance/release creation.
- Worker A: implement backend quota evaluator/resource quota sync without touching frontend.
- Worker B: implement user lifecycle cleanup, snake_case DTO normalization, and safe admin/user role transitions.
- Add auth DTO alternate snake_case fields plus `Normalize()` for register/update requests, and call it in auth handler before service mapping.
- Make admin-to-user role transitions create or safely reuse the username-derived workspace; detect namespace ownership conflicts and return an explicit domain conflict error.
- Extend workspace binding repository to list/delete all bindings for a workspace so user deletion can clean every cluster binding.
- Extend tenant kube client with idempotent tenant cleanup for namespace/service account/role binding/resource quota, refusing system namespaces such as `default` and `kube-system`.
- Extend `AuthService` dependencies for instance/cluster/binding/tenant cleanup, preserving existing callers and avoiding frontend changes.
- Update `DeleteUser` to reject deletion when the user owns instances; when safe, clean exclusive user workspace cluster bindings and OCDP tenant resources before deleting the user.
- Add focused Go tests for DTO normalization, role downgrade workspace reuse/conflict, delete-with-instances conflict, cleanup path, and protected namespace cleanup.
- Run targeted Go tests, review diff, and add Worker B Review summary here.
- Review: changed auth DTO normalization, auth handler normalization calls, auth service workspace reuse/delete cleanup logic, workspace binding repository ports/adapters, tenant kube cleanup, domain errors, mock/test coverage, and API wiring. Namespace conflicts now return `ErrWorkspaceNamespaceConflict`/HTTP 409; deleting users with owned/workspace instances returns `ErrUserHasInstances`/HTTP 409; protected tenant namespaces return forbidden-style `ErrProtectedNamespace`. Validation passed with targeted Worker B tests and full backend `go test ./...`.
- Worker C: implement monitoring resource aggregation and instance owner username fields.
- Worker D: implement frontend user management, instance card, and monitoring UI changes.
- Inspect current API/generated/UI type contracts for owner and monitoring resource fields without changing backend.
- Rework User Management accounts area into a wider operations layout with quota chips/split columns and actions that do not squeeze quota content.
- Change admin-to-user downgrade flow to open/reuse the tenant resource limit editor and submit role plus namespace/cluster/quota fields together.
- Show instance owner as `ownerUsername` when present, otherwise a shortened `ownerId`.
- Extend monitoring frontend types/adapters as needed for GPU allocation, GPU memory, and per-user resource rows returned by the backend.
- Update Cluster Monitoring cards/page to render GPU allocation/GPU Mem and per-user resource tables while respecting backend-scoped data for normal users.
- Check responsive behavior for the touched UI and avoid obvious desktop/mobile overflow.
- Run targeted frontend type/build tests available in the repo and review diff.
- Add Worker D Review summary with changed files and verification results.
- Review: changed `frontend/src/features/configuration/users/pages/UserManagementPage.tsx`, `frontend/src/features/artifact/instances/components/InstanceCard.tsx`, `frontend/src/features/monitoring/clusters/components/ClusterMonitorCard.tsx`, `frontend/src/features/monitoring/clusters/pages/MonitoringClustersPage.tsx`, `frontend/src/core/types/index.ts`, and `frontend/src/api/index.ts`. User Management now uses wider operation rows with quota chips, and the admin-to-user downgrade saves role plus tenant limits. Instance cards show owner username or short owner ID. Cluster monitoring renders GPU allocation, GPU memory, and backend-returned per-user resource rows. Validation: `npm run build` passed; targeted `npx eslint ...` on changed frontend source files passed; full `npm run lint` remains blocked by pre-existing generated/cache and legacy lint errors; Playwright viewport check passed for `/configuration/users` and `/monitoring/clusters` at 390x844 and 1440x1000 with mocked API data and no horizontal overflow detected.
- Worker E: add API/Playwright/k3s regression tests for this plan.
- Worker F: read-only review for quota bypass, namespace deletion safety, and monitoring privacy.
- Run `go test ./...` and `npm run build`.
- Run Docker Compose smoke plus API/Playwright regression scripts.
- Run real k3s negative vllm quota deployment test and clean up test users.
- Run positive GPU=1 k3s vllm deployment when cluster resources are available.
- Add Review summary and lessons.
- Review: ordinary users are now forced into username-derived private workspace/namespace so per-cluster bindings behave as per-user/per-cluster quota buckets. Empty user CPU/memory/GPU/GPU Mem default to explicit `0`; create/update/scale use rendered Helm resource estimates, live ResourceQuota usage minus current release delta, and synced tenant ResourceQuota before persistence/Helm mutation. User deletion blocks on owned/workspace instances, then cleans tenant bindings/namespaces and deletes the exclusive workspace record. Monitoring now returns per-user resource rows and strips cluster-wide node/total metrics for ordinary users. Frontend renders `resourceUsageByUser`, instance owners, less cramped user quotas, and wraps InstanceCard actions to keep Delete inside the viewport. Validation passed: backend `go test ./...`, frontend `npm run build`, Docker Compose health checks, `test/unresolved_bugs_security_gateway_contract.py`, `test/unresolved_bugs_api_contract.py`, `test/user_namespace_quota_api_contract.py`, `test/frontend-playwright-smoke.py`, and `test/instance_card_action_layout_playwright.py`. Positive GPU=1 vllm deployment was not run because the cluster resource constraint remains external; negative GPU=0 vllm quota rejection on k3s passed.
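The create/update/scale check described in the review above can be sketched roughly as follows. This is a minimal illustration, not the project's actual evaluator: the `Resources` type, the `CheckQuota` function, and the units are hypothetical names chosen for this sketch. It assumes the requested amounts come from the rendered Helm estimate and that the release being changed is subtracted from live usage before comparing.

```go
package main

import "fmt"

// Resources is a hypothetical flat view of the four quota dimensions.
type Resources struct {
	CPUMilli  int64 // CPU in millicores
	MemoryMiB int64 // memory in MiB
	GPU       int64 // whole GPUs
	GPUMemMB  int64 // GPU memory in integer MB
}

// CheckQuota reports an error when the requested resources do not fit.
// hard is the user's limit on the selected cluster, used is the live
// ResourceQuota usage, and current is what the release being updated or
// scaled already holds (the zero value for a new instance).
func CheckQuota(hard, used, current, requested Resources) error {
	dims := []struct {
		name                           string
		hard, used, current, requested int64
	}{
		{"cpu", hard.CPUMilli, used.CPUMilli, current.CPUMilli, requested.CPUMilli},
		{"memory", hard.MemoryMiB, used.MemoryMiB, current.MemoryMiB, requested.MemoryMiB},
		{"gpu", hard.GPU, used.GPU, current.GPU, requested.GPU},
		{"gpumem", hard.GPUMemMB, used.GPUMemMB, current.GPUMemMB, requested.GPUMemMB},
	}
	for _, d := range dims {
		// Usage by everything except the release being changed.
		others := d.used - d.current
		if others < 0 {
			others = 0
		}
		if others+d.requested > d.hard {
			return fmt.Errorf("quota exceeded for %s: requested %d, used by other releases %d, limit %d",
				d.name, d.requested, others, d.hard)
		}
	}
	return nil
}

func main() {
	// GPU and GPU memory were left empty, so they are explicit zero here.
	hard := Resources{CPUMilli: 4000, MemoryMiB: 16384}
	vllm := Resources{CPUMilli: 1000, MemoryMiB: 8192, GPU: 1, GPUMemMB: 10000}
	if err := CheckQuota(hard, Resources{}, Resources{}, vllm); err != nil {
		fmt.Println("rejected before any DB or Helm change:", err)
	}
}
```

Subtracting the current release keeps an in-place update or scale from being counted twice against the same quota bucket.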
Unresolved bugs implementation 2026-05-14
- Worker A: fix instance API contract: detail replicas, list values, values/valuesYaml conflict, namespace 403.
- Inspect current instance handler/service tests and avoid touching other workers' areas.
- Add request validation so `values` plus `valuesYaml`/`values_yaml` conflicts return HTTP 400, while YAML-only still populates values.
- Enrich `GetInstance` replica count using the same live K8s source as list.
- Include `values` in list responses for API compatibility.
- Change normal-user tenant namespace mismatch from silent override to `ErrForbidden`/HTTP 403.
- Add focused Go tests for namespace mismatch and replica enrichment/list values where practical.
- Run targeted Go tests and review diff.
- Review: changed only scoped instance backend files plus this task tracker; validated with `go test ./internal/domain/service` and `go test ./internal/adapter/input/http/rest` from `backend/`.
- Worker B: add Helm-rendered quota pre-check helper before DB create/Helm install.
- Inspect Helm client/service quota contracts and preserve other workers' edits.
- Add domain quota precheck types and compare logic for CPU, memory, GPU, and integer-MB gpumem.
- Add Helm render estimator output port and real/mock Helm implementations that render final chart values and sum Pod template requests/limits.
- Add focused Go tests for quota comparison and rendered manifest estimation where feasible.
- Run targeted Go tests and review diff.
- Review: exposed `QuotaPrecheckService.EstimateAndCompare` plus `CompareWorkspaceQuota`; real Helm now dry-renders `/tmp/charts/{chart}-{version}.tgz` with final values and estimates Pod template requests/limits. Added quota and manifest estimator tests. Validation passed with `go test ./internal/domain/service ./internal/adapter/output/helm/...` and full backend `go test ./...`.
- Worker C: add compatibility/security backend endpoints and auth/CORS/rate-limit fixes.
- Inspect backend route/handler/service contracts and preserve other workers' edits.
- Add `/repositories/{repo}/tags` compatibility alias without changing existing artifact behavior.
- Add `/monitoring/clusters/{id}/metrics` alias and `/clusters/{id}/stats` compatibility response.
- Add `/clusters/{id}/kubeconfig` tenant kubeconfig endpoint scoped to the authenticated user's workspace and requested cluster.
- Make login failures uniform and add a lightweight per-client login rate limit.
- Replace permissive CORS reflection/wildcard defaults with an allowlist-driven default suitable for local dev.
- Add focused Go tests where straightforward, then run relevant Go tests and review diff.
- Review: changed scoped backend route/auth/CORS handlers, added CORS and login limiter tests, and removed the direct SSE CORS wildcard so global CORS applies. Validation attempted with `go test ./cmd/api ./internal/adapter/input/http/rest` from `backend/`, but it is currently blocked by concurrent Worker B compile errors in `internal/domain/service/quota_precheck.go` and Helm client implementations missing `EstimateInstanceResources`.
- Worker D: harden Nginx gateway `/health`, server tokens, and security headers.
- Worker E: align frontend/API client and Playwright coverage for conflict/namespace/scale flows.
- Worker F: add API/security/regression test scripts and review coverage.
- Integrate worker changes, resolve conflicts, and run Go/frontend builds.
- Run Docker Compose smoke, API contracts, Playwright, and real k3s deploy cleanup.
- Update `tasks/lessons.md` and add Review summary here.
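The instance request rules from Worker A's items (a `values` plus `valuesYaml`/`values_yaml` conflict returns HTTP 400, and a namespace mismatch for a normal user returns `ErrForbidden`/HTTP 403) might look roughly like the sketch below. The `CreateInstanceRequest` shape, `ErrValuesConflict`, `Validate`, and `StatusFor` are hypothetical names for illustration only; YAML parsing for the YAML-only case is deliberately left out.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
)

// ErrValuesConflict is a hypothetical sentinel error for this sketch.
var ErrValuesConflict = errors.New("values and valuesYaml are mutually exclusive")

// ErrForbidden stands in for the domain error named in the task list.
var ErrForbidden = errors.New("forbidden")

// CreateInstanceRequest is an assumed request shape, not the real DTO.
type CreateInstanceRequest struct {
	Namespace       string
	Values          map[string]any
	ValuesYAML      string // camelCase field "valuesYaml"
	ValuesYAMLSnake string // snake_case alternate "values_yaml"
}

// Validate rejects conflicting value sources and, for non-admin users,
// a namespace that differs from the tenant namespace.
func (r CreateInstanceRequest) Validate(tenantNamespace string, isAdmin bool) error {
	yamlText := r.ValuesYAML
	if yamlText == "" {
		yamlText = r.ValuesYAMLSnake
	}
	if len(r.Values) > 0 && yamlText != "" {
		return ErrValuesConflict
	}
	if !isAdmin && r.Namespace != "" && r.Namespace != tenantNamespace {
		return ErrForbidden // no silent override of the requested namespace
	}
	return nil
}

// StatusFor maps validation errors to HTTP status codes.
func StatusFor(err error) int {
	switch {
	case err == nil:
		return http.StatusOK
	case errors.Is(err, ErrValuesConflict):
		return http.StatusBadRequest
	case errors.Is(err, ErrForbidden):
		return http.StatusForbidden
	default:
		return http.StatusInternalServerError
	}
}

func main() {
	req := CreateInstanceRequest{
		Values:          map[string]any{"replicas": 1},
		ValuesYAMLSnake: "replicas: 2",
	}
	fmt.Println(StatusFor(req.Validate("ocdp-u-alice", false)))
}
```

Checking both spellings of the YAML field in one place keeps the conflict rule identical for camelCase and snake_case clients.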
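docs/ directory (cleaned up)
| File | Purpose | Status |
|---|---|---|
| `user-guide.md` | User operation guide | ✅ Permanent reference |
| `test-scenarios.md` | Design of 100+ test cases | ✅ Permanent reference |
| `test-users.json` | Credentials for 4 test accounts | ✅ Permanent reference |
| `regression-full-report.md` | Latest comprehensive regression report | ✅ Can be deleted (next version) |
| `UNRESOLVED-BUGS.md` | List of unresolved issues (15) | ✅ Current version |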
Worker C monitoring and instance owner backend 2026-05-14
- Inspect existing instance/monitoring permission, repository, DTO, and K8s metrics contracts without reverting other workers' changes.
- Add `ownerUsername` to instance entity/DTO responses and hydrate it for detail/list via user repository while preserving ordinary-user/admin visibility rules.
- Add K8s Pod resource allocation collection from requests/limits, including GPU and `requests.nvidia.com/gpumem` as integer MB.
- Aggregate `resourceUsageByUser` in monitoring service by matching Pods to visible instances/workspaces/owners, with ordinary users scoped to themselves and admins seeing all visible owners.
- Expose cluster-level GPU/GPU memory allocation fields and per-user resource usage in `/monitoring/clusters`, detail, and existing aliases.
- Add focused Go tests for instance owner username and monitoring resource aggregation/privacy.
- Run relevant Go tests, review diff, and add Review summary here.
- Review: Instance list/detail now include `ownerUsername` hydrated from the user repository. Monitoring responses now include per-user resource usage plus CPU/memory/GPU/GPU-memory request/limit allocation fields derived from Kubernetes Pod resources and DB instance ownership mapping; ordinary users only see their own allocation rows/totals, admins see all visible instance owners. Validation passed with `go test ./internal/domain/service`, `go test ./cmd/api ./internal/adapter/input/http/rest ./internal/adapter/output/k8s`, and backend `go test ./...`.
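The per-user aggregation and its privacy scoping can be illustrated with a small sketch. The `PodAllocation` and `UserUsage` types and the `AggregateByUser` function are hypothetical; the sketch assumes Pods have already been matched to an owner username through the instance ownership mapping described above.

```go
package main

import (
	"fmt"
	"sort"
)

// PodAllocation is an assumed per-Pod record after owner matching.
type PodAllocation struct {
	OwnerUsername string
	CPUMilli      int64
	MemoryMiB     int64
	GPU           int64
	GPUMemMB      int64 // integer MB
}

// UserUsage is one row of the per-user resource table.
type UserUsage struct {
	Username  string
	Pods      int
	CPUMilli  int64
	MemoryMiB int64
	GPU       int64
	GPUMemMB  int64
}

// AggregateByUser sums Pod allocations per owner. Ordinary users only
// get their own row; admins get a row for every owner in the input.
func AggregateByUser(pods []PodAllocation, viewer string, isAdmin bool) []UserUsage {
	byUser := map[string]*UserUsage{}
	for _, p := range pods {
		if !isAdmin && p.OwnerUsername != viewer {
			continue // hide other users' allocations from ordinary users
		}
		u, ok := byUser[p.OwnerUsername]
		if !ok {
			u = &UserUsage{Username: p.OwnerUsername}
			byUser[p.OwnerUsername] = u
		}
		u.Pods++
		u.CPUMilli += p.CPUMilli
		u.MemoryMiB += p.MemoryMiB
		u.GPU += p.GPU
		u.GPUMemMB += p.GPUMemMB
	}
	rows := make([]UserUsage, 0, len(byUser))
	for _, u := range byUser {
		rows = append(rows, *u)
	}
	// Sort for a stable response order.
	sort.Slice(rows, func(i, j int) bool { return rows[i].Username < rows[j].Username })
	return rows
}

func main() {
	pods := []PodAllocation{
		{OwnerUsername: "alice", CPUMilli: 1000, MemoryMiB: 8192, GPU: 1, GPUMemMB: 10000},
		{OwnerUsername: "bob", CPUMilli: 250, MemoryMiB: 512},
	}
	fmt.Println(AggregateByUser(pods, "bob", false))
	fmt.Println(AggregateByUser(pods, "admin", true))
}
```

Filtering before summing means an ordinary user's totals never include other owners' Pods, which matches the self-scoped behaviour in the review.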
Debug quota limits monitoring UI 2026-05-15
- Inspect current runtime logs for workspace conflict/quota errors without killing other services.
- Fix quota semantics: CPU/memory blank means unlimited; GPU/GPU Mem blank means explicit zero for ordinary users.
- Fix admin user update so editing an existing user's quota does not recreate/reassign namespace and does not raise false `workspace namespace conflict`.
- Rework User Management action controls so they wrap inside the viewport on desktop and mobile.
- Improve monitoring for ordinary users with self-scoped useful fields instead of all `N/A`; make admin monitoring show the new resource allocation rows clearly.
- Rebuild Docker Compose stack and run backend/frontend tests plus Playwright overflow smoke.
- Use ivanwu on k3s with vllm-serve 0.6.0, CPU/memory unlimited and gpumem `10000`, then verify/clean up.
- Add Review summary and lessons.
Review:
- Runtime logs were checked before and after changes. The only 502s observed were during intentional backend rebuild; final backend/nginx logs had no error/fatal/5xx entries.
- Admin can now update ivanwu without `workspace namespace conflict`; ivanwu was migrated to workspace `ivanwu`, namespace `ocdp-u-ivanwu`, default cluster k3s, CPU/memory unlimited, GPU `1`, GPU Mem `10000`.
- k3s ResourceQuota for ivanwu contains only GPU and GPU Mem hard limits; CPU/memory are omitted as unlimited. A vllm-serve `0.6.0` deployment used `harbor.bwgdi.com/library/vllm-openai:v0.17.1`, reached `deployed`, Pod `1/1 Running`, then was deleted through the platform and quota usage returned to `0/1` GPU and `0/10k` gpumem.
- Monitoring now shows ordinary users self-scoped allocation rows and admin per-user rows. The vLLM deployment was visible as CPU `1.00 cores`, memory `9.8 GiB`, GPU `1`, GPU Mem `10000`.
- Verification passed: `go test ./...`, `npm run build`, `test/frontend-playwright-smoke.py`, `test/instance_card_action_layout_playwright.py`, `test/user_management_layout_playwright.py`, `test/user_namespace_quota_api_contract.py`, and `test/unresolved_bugs_api_contract.py`.
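The revised quota semantics (blank CPU/memory means unlimited and is omitted from the ResourceQuota; blank GPU/GPU Mem means explicit zero) can be sketched as a function that builds the hard-limit map. The `UserQuota` type and `HardLimits` function are hypothetical, and the key names other than `requests.nvidia.com/gpumem` (which appears in the Worker C notes) are assumptions for this sketch.

```go
package main

import "fmt"

// UserQuota holds raw form values; an empty string means the field was left blank.
type UserQuota struct {
	CPU    string
	Memory string
	GPU    string
	GPUMem string // integer MB
}

// orZero turns a blank value into an explicit "0".
func orZero(v string) string {
	if v == "" {
		return "0"
	}
	return v
}

// HardLimits builds the hard-limit entries for a tenant ResourceQuota.
// Blank CPU/memory are omitted (unlimited); blank GPU/GPU Mem become "0".
func HardLimits(q UserQuota) map[string]string {
	hard := map[string]string{}
	if q.CPU != "" {
		hard["requests.cpu"] = q.CPU // assumed key name
	}
	if q.Memory != "" {
		hard["requests.memory"] = q.Memory // assumed key name
	}
	hard["requests.nvidia.com/gpu"] = orZero(q.GPU) // assumed key name
	hard["requests.nvidia.com/gpumem"] = orZero(q.GPUMem)
	return hard
}

func main() {
	// Mirrors the ivanwu setup: CPU/memory unlimited, GPU 1, GPU Mem 10000.
	fmt.Println(HardLimits(UserQuota{GPU: "1", GPUMem: "10000"}))
}
```

With this shape, a user whose GPU fields are blank still gets GPU entries of `0`, so a GPU workload is rejected by quota rather than silently admitted.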
Restart docs and user management overflow 2026-05-18
- Inspect current Docker Compose service lifecycle and identify why frontend/backend feel disconnected.
- Update Makefile/README so one clear command starts the whole platform, with explicit rebuild/restart/status/log commands.
- Restart the full stack through the documented command and verify health endpoints.
- Reproduce User Management overflow with Playwright at desktop/tablet/mobile widths.
- Fix User Management layout so action buttons and quota controls stay inside the viewport.
- Run backend/frontend builds plus Playwright layout smoke.
- Record Review summary and lessons.
Review:
- `make up` is now the single documented platform start command. It runs `docker compose up --build -d` for the whole stack, and old commands (`run-2`, `docker-dev`, `docker-prod`, `docker-up`) are compatibility aliases. `docker-compose.yml` now keeps `nginx` under `restart: unless-stopped`; `make docker-ps` and `make up` show `docker compose ps -a`, so the expected `frontend-build Exited (0)` state is visible and less confusing.
- README now explains that frontend-build is a one-shot build job and the actual frontend runtime is `nginx`, which also proxies `/api`.
- User Management layout was changed from a fixed four-column row to a responsive card layout with a wrapping action area. The app shell content column also has `min-w-0` so wide children cannot force browser overflow.
- Verification passed: `go test ./...`, `npm run build`, `make up`, health checks for backend/nginx/web, `test/user_management_layout_playwright.py` across 1440/1280/1024/900/768 widths, `test/frontend-playwright-smoke.py`, and `test/instance_card_action_layout_playwright.py`.