
Ivan087 33ddaf97db fix: scale replicas in response, K8s metrics client, quota precheck, auth tests

- Add GetMetrics method to MetricsClient interface and implement cluster metrics API
- Add QuotaPrecheck service for validating resource quotas before deployment
- Add auth DTO with role/permission models and auth handler tests
- Add instance diagnostics: mounted NFS volumes, labels, annotations in pod diagnostics
- Update workspace handler with GetWorkspace endpoint and shared-user list
- Fix monitoring handler to use correct service method name
- Add tail_lines fallback in instance handler for snake_case query params
- Update nginx config for SSE log streaming support (no buffering)
- Add comprehensive test coverage: auth_service_test, auth_handler_test,
  auth_dto_test, metrics_client_test, quota_precheck_test
- Update error messages for quota validation and instance operations
- ModifyModal: fix YAML lineWidth:0, modified keys summary, delta-only submit
- InstanceCard: correctly disable scale-minus when replicas <= 0
- SidebarLayout: add hover transition for sidebar items
- Update todo.md and lessons.md with latest fixes

2026-05-20 16:56:29 +08:00


OCDP final documentation structure

Quota lifecycle monitoring implementation 2026-05-14

Main: integrate per-user per-cluster quota semantics and final verification.
- Treat ordinary-user empty CPU/memory/GPU/GPU memory quotas as explicit zero.
- Make create/update/scale quota checks use the selected cluster binding and sync ResourceQuota first.
- Reject GPU=0 user vllm deployment on k3s before DB instance/release creation.
Worker A: implement backend quota evaluator/resource quota sync without touching frontend.
Worker B: implement user lifecycle cleanup, snake_case DTO normalization, and safe admin/user role transitions.
- Add auth DTO alternate snake_case fields plus Normalize() for register/update requests, and call it in auth handler before service mapping.
- Make admin-to-user role transitions create or safely reuse the username-derived workspace; detect namespace ownership conflicts and return an explicit domain conflict error.
- Extend workspace binding repository to list/delete all bindings for a workspace so user deletion can clean every cluster binding.
- Extend tenant kube client with idempotent tenant cleanup for namespace/service account/role binding/resource quota, refusing system namespaces such as default and kube-system.
- Extend AuthService dependencies for instance/cluster/binding/tenant cleanup, preserving existing callers and avoiding frontend changes.
- Update DeleteUser to reject deletion when the user owns instances; when safe, clean exclusive user workspace cluster bindings and OCDP tenant resources before deleting the user.
- Add focused Go tests for DTO normalization, role downgrade workspace reuse/conflict, delete-with-instances conflict, cleanup path, and protected namespace cleanup.
- Run targeted Go tests, review diff, and add Worker B Review summary here.
- Review: changed auth DTO normalization, auth handler normalization calls, auth service workspace reuse/delete cleanup logic, workspace binding repository ports/adapters, tenant kube cleanup, domain errors, mock/test coverage, and API wiring. Namespace conflicts now return ErrWorkspaceNamespaceConflict/HTTP 409; deleting users with owned/workspace instances returns ErrUserHasInstances/HTTP 409; protected tenant namespaces return forbidden-style ErrProtectedNamespace. Validation passed with targeted Worker B tests and full backend go test ./....
Worker C: implement monitoring resource aggregation and instance owner username fields.
Worker D: implement frontend user management, instance card, and monitoring UI changes.
- Inspect current API/generated/UI type contracts for owner and monitoring resource fields without changing backend.
- Rework User Management accounts area into a wider operations layout with quota chips/split columns and actions that do not squeeze quota content.
- Change admin-to-user downgrade flow to open/reuse the tenant resource limit editor and submit role plus namespace/cluster/quota fields together.
- Show instance owner as ownerUsername when present, otherwise a shortened ownerId.
- Extend monitoring frontend types/adapters as needed for GPU allocation, GPU memory, and per-user resource rows returned by the backend.
- Update Cluster Monitoring cards/page to render GPU allocation/GPU Mem and per-user resource tables while respecting backend-scoped data for normal users.
- Check responsive behavior for the touched UI and avoid obvious desktop/mobile overflow.
- Run targeted frontend type/build tests available in the repo and review diff.
- Add Worker D Review summary with changed files and verification results.
- Review: changed frontend/src/features/configuration/users/pages/UserManagementPage.tsx, frontend/src/features/artifact/instances/components/InstanceCard.tsx, frontend/src/features/monitoring/clusters/components/ClusterMonitorCard.tsx, frontend/src/features/monitoring/clusters/pages/MonitoringClustersPage.tsx, frontend/src/core/types/index.ts, and frontend/src/api/index.ts. User Management now uses wider operation rows with quota chips and admin-to-user downgrade saves role plus tenant limits. Instance cards show owner username or short owner ID. Cluster monitoring renders GPU allocation, GPU memory, and backend-returned per-user resource rows. Validation: npm run build passed; targeted npx eslint ... on changed frontend source files passed; full npm run lint remains blocked by pre-existing generated/cache and legacy lint errors; Playwright viewport check passed for /configuration/users and /monitoring/clusters at 390x844 and 1440x1000 with mocked API data and no horizontal overflow detected.
Worker E: add API/Playwright/k3s regression tests for this plan.
Worker F: read-only review for quota bypass, namespace deletion safety, and monitoring privacy.
Run go test ./... and npm run build.
Run Docker Compose smoke plus API/Playwright regression scripts.
Run real k3s negative vllm quota deployment test and clean up test users.
Run positive GPU=1 k3s vllm deployment when cluster resources are available.
Add Review summary and lessons.
- Review: ordinary users are now forced into username-derived private workspace/namespace so per-cluster bindings behave as per-user/per-cluster quota buckets. Empty user CPU/memory/GPU/GPU Mem default to explicit 0; create/update/scale use rendered Helm resource estimates, live ResourceQuota usage minus current release delta, and synced tenant ResourceQuota before persistence/Helm mutation. User deletion blocks on owned/workspace instances, then cleans tenant bindings/namespaces and deletes the exclusive workspace record. Monitoring now returns per-user resource rows and strips cluster-wide node/total metrics for ordinary users. Frontend renders resourceUsageByUser, instance owners, less cramped user quotas, and wraps InstanceCard actions to keep Delete inside the viewport. Validation passed: backend go test ./..., frontend npm run build, Docker Compose health checks, test/unresolved_bugs_security_gateway_contract.py, test/unresolved_bugs_api_contract.py, test/user_namespace_quota_api_contract.py, test/frontend-playwright-smoke.py, and test/instance_card_action_layout_playwright.py. Positive GPU=1 vllm deployment was not run because the cluster resource constraint remains external; negative GPU=0 vllm quota rejection on k3s passed.
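
The create/update/scale check described in the review above can be sketched roughly as follows. This is a minimal illustration, not the project's actual evaluator: the `Resources` type, its integer units, `CheckQuota`, and `ErrQuotaExceeded` are all hypothetical names. It assumes every limit is explicit (a blank quota is 0, as in this section) and that the current release's own footprint is subtracted from live usage before the new estimate is added.

```go
package main

import (
	"errors"
	"fmt"
)

// Resources uses integer units chosen for this sketch:
// CPU in millicores, memory and GPU memory in MB, GPU in whole devices.
type Resources struct {
	CPUMilli int64
	MemoryMB int64
	GPU      int64
	GPUMemMB int64
}

// ErrQuotaExceeded is a hypothetical error name, not taken from the codebase.
var ErrQuotaExceeded = errors.New("quota exceeded")

// checkDim verifies one dimension. Usage that belongs to the release being
// changed is subtracted first, so an update is charged only for its delta.
func checkDim(name string, hard, used, current, requested int64) error {
	others := used - current
	if others < 0 {
		others = 0
	}
	if others+requested > hard {
		return fmt.Errorf("%w: %s requested %d, available %d",
			ErrQuotaExceeded, name, requested, hard-others)
	}
	return nil
}

// CheckQuota compares a rendered estimate against the hard limits.
// current is the zero value for a create, and the existing release's
// footprint for an update or scale.
func CheckQuota(hard, used, current, requested Resources) error {
	dims := []struct {
		name                           string
		hard, used, current, requested int64
	}{
		{"cpu", hard.CPUMilli, used.CPUMilli, current.CPUMilli, requested.CPUMilli},
		{"memory", hard.MemoryMB, used.MemoryMB, current.MemoryMB, requested.MemoryMB},
		{"gpu", hard.GPU, used.GPU, current.GPU, requested.GPU},
		{"gpumem", hard.GPUMemMB, used.GPUMemMB, current.GPUMemMB, requested.GPUMemMB},
	}
	for _, d := range dims {
		if err := checkDim(d.name, d.hard, d.used, d.current, d.requested); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	// A user whose GPU quota is an explicit 0 asks for a one-GPU release.
	hard := Resources{CPUMilli: 4000, MemoryMB: 16384}
	want := Resources{CPUMilli: 1000, MemoryMB: 8192, GPU: 1}
	fmt.Println(CheckQuota(hard, Resources{}, Resources{}, want))
}
```

Running the check before any DB or Helm mutation means a rejected request leaves nothing to roll back, which matches the ordering the review describes.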

Unresolved bugs implementation 2026-05-14

Worker A: fix instance API contract: detail replicas, list values, values/valuesYaml conflict, namespace 403.
- Inspect current instance handler/service tests and avoid touching other workers' areas.
- Add request validation so values plus valuesYaml/values_yaml conflicts return HTTP 400, while YAML-only still populates values.
- Enrich GetInstance replica count using the same live K8s source as list.
- Include values in list responses for API compatibility.
- Change normal-user tenant namespace mismatch from silent override to ErrForbidden/HTTP 403.
- Add focused Go tests for namespace mismatch and replica enrichment/list values where practical.
- Run targeted Go tests and review diff.
- Review: changed only scoped instance backend files plus this task tracker; validated with go test ./internal/domain/service and go test ./internal/adapter/input/http/rest from backend/.
Worker B: add Helm-rendered quota pre-check helper before DB create/Helm install.
- Inspect Helm client/service quota contracts and preserve other workers' edits.
- Add domain quota precheck types and compare logic for CPU, memory, GPU, and integer-MB gpumem.
- Add Helm render estimator output port and real/mock Helm implementations that render final chart values and sum Pod template requests/limits.
- Add focused Go tests for quota comparison and rendered manifest estimation where feasible.
- Run targeted Go tests and review diff.
- Review: exposed QuotaPrecheckService.EstimateAndCompare plus CompareWorkspaceQuota; real Helm now dry-renders /tmp/charts/{chart}-{version}.tgz with final values and estimates Pod template requests/limits. Added quota and manifest estimator tests. Validation passed with go test ./internal/domain/service ./internal/adapter/output/helm/... and full backend go test ./....
Worker C: add compatibility/security backend endpoints and auth/CORS/rate-limit fixes.
- Inspect backend route/handler/service contracts and preserve other workers' edits.
- Add /repositories/{repo}/tags compatibility alias without changing existing artifact behavior.
- Add /monitoring/clusters/{id}/metrics alias and /clusters/{id}/stats compatibility response.
- Add /clusters/{id}/kubeconfig tenant kubeconfig endpoint scoped to the authenticated user's workspace and requested cluster.
- Make login failures uniform and add a lightweight per-client login rate limit.
- Replace permissive CORS reflection/wildcard defaults with an allowlist-driven default suitable for local dev.
- Add focused Go tests where straightforward, then run relevant Go tests and review diff.
- Review: changed scoped backend route/auth/CORS handlers, added CORS and login limiter tests, and removed the direct SSE CORS wildcard so global CORS applies. Validation attempted with go test ./cmd/api ./internal/adapter/input/http/rest from backend/, but it is currently blocked by concurrent Worker B compile errors in internal/domain/service/quota_precheck.go and Helm client implementations missing EstimateInstanceResources.
Worker D: harden Nginx gateway /health, server tokens, and security headers.
Worker E: align frontend/API client and Playwright coverage for conflict/namespace/scale flows.
Worker F: add API/security/regression test scripts and review coverage.
Integrate worker changes, resolve conflicts, and run Go/frontend builds.
Run Docker Compose smoke, API contracts, Playwright, and real k3s deploy cleanup.
Update tasks/lessons.md and add Review summary here.
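
As an illustration of the "lightweight per-client login rate limit" from Worker C's list, here is a minimal fixed-window sketch. The tracker does not say which algorithm the backend uses, so the fixed window, the `LoginLimiter` type, the `ManualClock` helper, and the one-minute `defaultWindow` are all assumptions made for this example.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// defaultWindow is an illustrative value, not a setting from the project.
const defaultWindow = time.Minute

// ManualClock is a test helper whose time only moves when Advance is called.
type ManualClock struct{ t time.Time }

func (c *ManualClock) Now() time.Time          { return c.t }
func (c *ManualClock) Advance(d time.Duration) { c.t = c.t.Add(d) }

// window counts attempts made by one client since start.
type window struct {
	start time.Time
	count int
}

// LoginLimiter allows at most max attempts per client in each period.
type LoginLimiter struct {
	mu      sync.Mutex
	max     int
	period  time.Duration
	now     func() time.Time
	clients map[string]*window
}

// NewLoginLimiter takes the clock as a parameter so tests can control time.
func NewLoginLimiter(max int, period time.Duration, now func() time.Time) *LoginLimiter {
	return &LoginLimiter{max: max, period: period, now: now, clients: map[string]*window{}}
}

// Allow records one attempt for client (for example the remote IP)
// and reports whether it is still within the limit.
func (l *LoginLimiter) Allow(client string) bool {
	l.mu.Lock()
	defer l.mu.Unlock()

	t := l.now()
	w, ok := l.clients[client]
	if !ok || t.Sub(w.start) >= l.period {
		// First attempt, or the previous window has expired: start a new one.
		w = &window{start: t}
		l.clients[client] = w
	}
	if w.count >= l.max {
		return false
	}
	w.count++
	return true
}

func main() {
	lim := NewLoginLimiter(3, defaultWindow, time.Now)
	for i := 1; i <= 4; i++ {
		fmt.Println("attempt", i, "allowed:", lim.Allow("203.0.113.7"))
	}
}
```

A handler would typically call `Allow` before checking credentials and answer HTTP 429 when it returns false. The sketch never evicts idle clients, so a real deployment would need periodic cleanup of the map.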

docs/ directory (cleaned up)

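| File | Purpose | Status |
| --- | --- | --- |
| `user-guide.md` | User operation guide | ✅ Permanent reference |
| `test-scenarios.md` | 100+ test case designs | ✅ Permanent reference |
| `test-users.json` | Credentials for 4 test accounts | ✅ Permanent reference |
| `regression-full-report.md` | Latest comprehensive regression report | ✅ Can be deleted (next version) |
| `UNRESOLVED-BUGS.md` | List of unresolved issues (15) | ✅ Current version |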

Worker C monitoring and instance owner backend 2026-05-14

Inspect existing instance/monitoring permission, repository, DTO, and K8s metrics contracts without reverting other workers' changes.
Add ownerUsername to instance entity/DTO responses and hydrate it for detail/list via user repository while preserving ordinary-user/admin visibility rules.
Add K8s Pod resource allocation collection from requests/limits, including GPU and requests.nvidia.com/gpumem as integer MB.
Aggregate resourceUsageByUser in monitoring service by matching Pods to visible instances/workspaces/owners, with ordinary users scoped to themselves and admins seeing all visible owners.
Expose cluster-level GPU/GPU memory allocation fields and per-user resource usage in /monitoring/clusters, detail, and existing aliases.
Add focused Go tests for instance owner username and monitoring resource aggregation/privacy.
Run relevant Go tests, review diff, and add Review summary here.
- Review: Instance list/detail now include ownerUsername hydrated from the user repository. Monitoring responses now include per-user resource usage plus CPU/memory/GPU/GPU-memory request/limit allocation fields derived from Kubernetes Pod resources and DB instance ownership mapping; ordinary users only see their own allocation rows/totals, admins see all visible instance owners. Validation passed with go test ./internal/domain/service, go test ./cmd/api ./internal/adapter/input/http/rest ./internal/adapter/output/k8s, and backend go test ./....
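
The per-user aggregation and privacy scoping described above could look roughly like this. It is a sketch under stated assumptions: `PodAllocation`, `UserUsage`, `AggregateByUser`, the integer units, and the instance-to-owner map are hypothetical stand-ins for whatever the monitoring service actually uses.

```go
package main

import (
	"fmt"
	"sort"
)

// PodAllocation is one Pod's requested resources, already linked to the
// instance it belongs to. Units are an assumption of this sketch.
type PodAllocation struct {
	InstanceID string
	CPUMilli   int64
	MemoryMB   int64
	GPU        int64
	GPUMemMB   int64
}

// UserUsage is one row of a resourceUsageByUser-style response.
type UserUsage struct {
	Username string
	CPUMilli int64
	MemoryMB int64
	GPU      int64
	GPUMemMB int64
}

// AggregateByUser sums Pod allocations per instance owner.
// ownerByInstance maps an instance ID to its owner's username.
// Admins get every owner; ordinary users get only their own row.
func AggregateByUser(pods []PodAllocation, ownerByInstance map[string]string, viewer string, isAdmin bool) []UserUsage {
	totals := map[string]*UserUsage{}
	for _, p := range pods {
		owner, ok := ownerByInstance[p.InstanceID]
		if !ok {
			continue // Pod does not belong to a visible instance
		}
		if !isAdmin && owner != viewer {
			continue // never leak other users' rows to ordinary users
		}
		u := totals[owner]
		if u == nil {
			u = &UserUsage{Username: owner}
			totals[owner] = u
		}
		u.CPUMilli += p.CPUMilli
		u.MemoryMB += p.MemoryMB
		u.GPU += p.GPU
		u.GPUMemMB += p.GPUMemMB
	}

	rows := make([]UserUsage, 0, len(totals))
	for _, u := range totals {
		rows = append(rows, *u)
	}
	// Sort for a stable response order.
	sort.Slice(rows, func(i, j int) bool { return rows[i].Username < rows[j].Username })
	return rows
}

func main() {
	pods := []PodAllocation{
		{InstanceID: "i1", CPUMilli: 1000, MemoryMB: 10000, GPU: 1, GPUMemMB: 10000},
		{InstanceID: "i2", CPUMilli: 500, MemoryMB: 512},
	}
	owners := map[string]string{"i1": "ivanwu", "i2": "admin"}
	fmt.Println(AggregateByUser(pods, owners, "ivanwu", false))
}
```

Filtering inside the aggregation loop, rather than trimming the result afterwards, keeps other users' totals from ever being computed for an ordinary viewer.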

Debug quota limits monitoring UI 2026-05-15

Inspect current runtime logs for workspace conflict/quota errors without killing other services.
Fix quota semantics: CPU/memory blank means unlimited; GPU/GPU Mem blank means explicit zero for ordinary users.
Fix admin user update so editing an existing user's quota does not recreate/reassign namespace and does not raise false workspace namespace conflict.
Rework User Management action controls so they wrap inside the viewport on desktop and mobile.
Improve monitoring for ordinary users with self-scoped useful fields instead of all N/A; make admin monitoring show the new resource allocation rows clearly.
Rebuild Docker Compose stack and run backend/frontend tests plus Playwright overflow smoke.
Deploy vllm-serve 0.6.0 as ivanwu on k3s with CPU/memory unlimited and gpumem 10000, then verify and clean up.
Add Review summary and lessons.

Review:

Runtime logs were checked before and after changes. The only 502s observed were during intentional backend rebuild; final backend/nginx logs had no error/fatal/5xx entries.
Admin can now update ivanwu without workspace namespace conflict; ivanwu was migrated to workspace ivanwu, namespace ocdp-u-ivanwu, default cluster k3s, CPU/memory unlimited, GPU 1, GPU Mem 10000.
The k3s ResourceQuota for ivanwu contains only GPU and GPU Mem hard limits; CPU and memory are omitted, which leaves them unlimited. A vllm-serve 0.6.0 deployment using harbor.bwgdi.com/library/vllm-openai:v0.17.1 reached the deployed state with its Pod 1/1 Running, then was deleted through the platform, after which quota usage returned to 0/1 GPU and 0/10k gpumem.
Monitoring now shows ordinary users self-scoped allocation rows and admin per-user rows. The vLLM deployment was visible as CPU 1.00 cores, memory 9.8 GiB, GPU 1, GPU Mem 10000.
Verification passed: go test ./..., npm run build, test/frontend-playwright-smoke.py, test/instance_card_action_layout_playwright.py, test/user_management_layout_playwright.py, test/user_namespace_quota_api_contract.py, and test/unresolved_bugs_api_contract.py.
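
The revised blank-quota semantics (CPU/memory blank means unlimited, GPU/GPU Mem blank means explicit zero) can be sketched as a mapping from optional user quota fields to ResourceQuota hard limits. This is an illustration only: `UserQuota` and `HardLimits` are hypothetical names, and the `requests.cpu`, `requests.memory`, and `requests.nvidia.com/gpu` keys are assumptions; only `requests.nvidia.com/gpumem` appears in this tracker.

```go
package main

import (
	"fmt"
	"strconv"
)

// UserQuota holds the optional quota fields; nil means the field was left blank.
type UserQuota struct {
	CPU      *string // Kubernetes quantity such as "4"
	Memory   *string // Kubernetes quantity such as "16Gi"
	GPU      *int64
	GPUMemMB *int64
}

// orZero treats a blank GPU field as an explicit zero.
func orZero(v *int64) int64 {
	if v == nil {
		return 0
	}
	return *v
}

// HardLimits builds the hard section of a tenant ResourceQuota.
// GPU and GPU memory are always present; CPU and memory are omitted when
// blank, which Kubernetes treats as "no limit for that resource".
func HardLimits(q UserQuota) map[string]string {
	hard := map[string]string{
		"requests.nvidia.com/gpu":    strconv.FormatInt(orZero(q.GPU), 10),
		"requests.nvidia.com/gpumem": strconv.FormatInt(orZero(q.GPUMemMB), 10),
	}
	if q.CPU != nil && *q.CPU != "" {
		hard["requests.cpu"] = *q.CPU
	}
	if q.Memory != nil && *q.Memory != "" {
		hard["requests.memory"] = *q.Memory
	}
	return hard
}

func main() {
	// Mirrors the ivanwu setup above: CPU/memory blank, GPU 1, GPU Mem 10000.
	gpu, gpuMem := int64(1), int64(10000)
	fmt.Println(HardLimits(UserQuota{GPU: &gpu, GPUMemMB: &gpuMem}))
}
```

Using pointers keeps "blank" distinguishable from an explicit 0, which matters here because the two cases mean opposite things for CPU/memory and the same thing for GPU.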

Restart docs and user management overflow 2026-05-18

Inspect current Docker Compose service lifecycle and identify why frontend/backend feel disconnected.
Update Makefile/README so one clear command starts the whole platform, with explicit rebuild/restart/status/log commands.
Restart the full stack through the documented command and verify health endpoints.
Reproduce User Management overflow with Playwright at desktop/tablet/mobile widths.
Fix User Management layout so action buttons and quota controls stay inside the viewport.
Run backend/frontend builds plus Playwright layout smoke.
Record Review summary and lessons.

Review:

make up is now the single documented platform start command. It runs docker compose up --build -d for the whole stack, and old commands (run-2, docker-dev, docker-prod, docker-up) are compatibility aliases.
docker-compose.yml now keeps nginx under restart: unless-stopped; make docker-ps and make up show docker compose ps -a, so the expected frontend-build Exited (0) state is visible and less confusing.
README now explains that frontend-build is a one-shot build job and the actual frontend runtime is nginx, which also proxies /api.
User Management layout was changed from a fixed four-column row to a responsive card layout with a wrapping action area. The app shell content column also has min-w-0 so wide children cannot force browser overflow.
Verification passed: go test ./..., npm run build, make up, health checks for backend/nginx/web, test/user_management_layout_playwright.py across 1440/1280/1024/900/768 widths, test/frontend-playwright-smoke.py, and test/instance_card_action_layout_playwright.py.
