ocdp-go/tasks/todo.md
Ivan087 33ddaf97db fix: scale replicas in response, K8s metrics client, quota precheck, auth tests
- Add GetMetrics method to MetricsClient interface and implement cluster metrics API
- Add QuotaPrecheck service for validating resource quotas before deployment
- Add auth DTO with role/permission models and auth handler tests
- Add instance diagnostics: mounted NFS volumes, labels, annotations in pod diagnostics
- Update workspace handler with GetWorkspace endpoint and shared-user list
- Fix monitoring handler to use correct service method name
- Add tail_lines fallback in instance handler for snake_case query params
- Update nginx config for SSE log streaming support (no buffering)
- Add comprehensive test coverage: auth_service_test, auth_handler_test,
  auth_dto_test, metrics_client_test, quota_precheck_test
- Update error messages for quota validation and instance operations
- ModifyModal: fix YAML lineWidth:0, modified keys summary, delta-only submit
- InstanceCard: correctly disable scale-minus when replicas <= 0
- SidebarLayout: add hover transition for sidebar items
- Update todo.md and lessons.md with latest fixes
2026-05-20 16:56:29 +08:00


OCDP final documentation structure

Quota lifecycle monitoring implementation 2026-05-14

  • Main: integrate per-user per-cluster quota semantics and final verification.
    • Treat ordinary-user empty CPU/memory/GPU/GPU memory quotas as explicit zero.
    • Make create/update/scale quota checks use the selected cluster binding and sync ResourceQuota first.
    • Reject GPU=0 user vllm deployment on k3s before DB instance/release creation.
  • Worker A: implement backend quota evaluator/resource quota sync without touching frontend.
  • Worker B: implement user lifecycle cleanup, snake_case DTO normalization, and safe admin/user role transitions.
    • Add auth DTO alternate snake_case fields plus Normalize() for register/update requests, and call it in auth handler before service mapping.
    • Make admin-to-user role transitions create or safely reuse the username-derived workspace; detect namespace ownership conflicts and return an explicit domain conflict error.
    • Extend workspace binding repository to list/delete all bindings for a workspace so user deletion can clean every cluster binding.
    • Extend tenant kube client with idempotent tenant cleanup for namespace/service account/role binding/resource quota, refusing system namespaces such as default and kube-system.
    • Extend AuthService dependencies for instance/cluster/binding/tenant cleanup, preserving existing callers and avoiding frontend changes.
    • Update DeleteUser to reject deletion when the user owns instances; when safe, clean exclusive user workspace cluster bindings and OCDP tenant resources before deleting the user.
    • Add focused Go tests for DTO normalization, role downgrade workspace reuse/conflict, delete-with-instances conflict, cleanup path, and protected namespace cleanup.
    • Run targeted Go tests, review diff, and add Worker B Review summary here.
    • Review: changed auth DTO normalization, auth handler normalization calls, auth service workspace reuse/delete cleanup logic, workspace binding repository ports/adapters, tenant kube cleanup, domain errors, mock/test coverage, and API wiring. Namespace conflicts now return ErrWorkspaceNamespaceConflict/HTTP 409; deleting users with owned/workspace instances returns ErrUserHasInstances/HTTP 409; protected tenant namespaces return forbidden-style ErrProtectedNamespace. Validation passed with targeted Worker B tests and full backend go test ./....
  • Worker C: implement monitoring resource aggregation and instance owner username fields.
  • Worker D: implement frontend user management, instance card, and monitoring UI changes.
    • Inspect current API/generated/UI type contracts for owner and monitoring resource fields without changing backend.
    • Rework User Management accounts area into a wider operations layout with quota chips/split columns and actions that do not squeeze quota content.
    • Change admin-to-user downgrade flow to open/reuse the tenant resource limit editor and submit role plus namespace/cluster/quota fields together.
    • Show instance owner as ownerUsername when present, otherwise a shortened ownerId.
    • Extend monitoring frontend types/adapters as needed for GPU allocation, GPU memory, and per-user resource rows returned by the backend.
    • Update Cluster Monitoring cards/page to render GPU allocation/GPU Mem and per-user resource tables while respecting backend-scoped data for normal users.
    • Check responsive behavior for the touched UI and avoid obvious desktop/mobile overflow.
    • Run targeted frontend type/build tests available in the repo and review diff.
    • Add Worker D Review summary with changed files and verification results.
    • Review: changed frontend/src/features/configuration/users/pages/UserManagementPage.tsx, frontend/src/features/artifact/instances/components/InstanceCard.tsx, frontend/src/features/monitoring/clusters/components/ClusterMonitorCard.tsx, frontend/src/features/monitoring/clusters/pages/MonitoringClustersPage.tsx, frontend/src/core/types/index.ts, and frontend/src/api/index.ts. User Management now uses wider operation rows with quota chips and admin-to-user downgrade saves role plus tenant limits. Instance cards show owner username or short owner ID. Cluster monitoring renders GPU allocation, GPU memory, and backend-returned per-user resource rows. Validation: npm run build passed; targeted npx eslint ... on changed frontend source files passed; full npm run lint remains blocked by pre-existing generated/cache and legacy lint errors; Playwright viewport check passed for /configuration/users and /monitoring/clusters at 390x844 and 1440x1000 with mocked API data and no horizontal overflow detected.
  • Worker E: add API/Playwright/k3s regression tests for this plan.
  • Worker F: read-only review for quota bypass, namespace deletion safety, and monitoring privacy.
  • Run go test ./... and npm run build.
  • Run Docker Compose smoke plus API/Playwright regression scripts.
  • Run real k3s negative vllm quota deployment test and clean up test users.
  • Run positive GPU=1 k3s vllm deployment when cluster resources are available.
  • Add Review summary and lessons.
    • Review: ordinary users are now forced into username-derived private workspace/namespace so per-cluster bindings behave as per-user/per-cluster quota buckets. Empty user CPU/memory/GPU/GPU Mem default to explicit 0; create/update/scale use rendered Helm resource estimates, live ResourceQuota usage minus current release delta, and synced tenant ResourceQuota before persistence/Helm mutation. User deletion blocks on owned/workspace instances, then cleans tenant bindings/namespaces and deletes the exclusive workspace record. Monitoring now returns per-user resource rows and strips cluster-wide node/total metrics for ordinary users. Frontend renders resourceUsageByUser, instance owners, less cramped user quotas, and wraps InstanceCard actions to keep Delete inside the viewport. Validation passed: backend go test ./..., frontend npm run build, Docker Compose health checks, test/unresolved_bugs_security_gateway_contract.py, test/unresolved_bugs_api_contract.py, test/user_namespace_quota_api_contract.py, test/frontend-playwright-smoke.py, and test/instance_card_action_layout_playwright.py. Positive GPU=1 vllm deployment was not run because the cluster resource constraint remains external; negative GPU=0 vllm quota rejection on k3s passed.
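
The delta rule in the Review above (live ResourceQuota usage minus the current release, plus the newly rendered estimate) can be sketched as follows. This is a minimal illustration, not the project's `QuotaPrecheckService`: the `Resources` type, the `Exceeded` function, and the chosen units (millicores, MiB, integer MB) are all hypothetical names and assumptions introduced here.

```go
package main

import "fmt"

// Resources is a hypothetical bundle of the four quota dimensions
// mentioned in the notes. Units are assumptions for this sketch.
type Resources struct {
	CPUMilli  int64 // CPU in millicores
	MemoryMiB int64 // memory in MiB
	GPU       int64 // whole GPUs
	GPUMemMB  int64 // GPU memory in integer MB
}

// Exceeded returns the dimensions where the projected usage
// (used - current + requested) would go over the hard limit.
// For a brand-new instance, current is the zero value.
func Exceeded(hard, used, current, requested Resources) []string {
	dims := []struct {
		name                           string
		hard, used, current, requested int64
	}{
		{"cpu", hard.CPUMilli, used.CPUMilli, current.CPUMilli, requested.CPUMilli},
		{"memory", hard.MemoryMiB, used.MemoryMiB, current.MemoryMiB, requested.MemoryMiB},
		{"gpu", hard.GPU, used.GPU, current.GPU, requested.GPU},
		{"gpumem", hard.GPUMemMB, used.GPUMemMB, current.GPUMemMB, requested.GPUMemMB},
	}
	var over []string
	for _, d := range dims {
		// Subtract what this release already holds so an update or
		// scale is not charged twice for its own existing Pods.
		if d.used-d.current+d.requested > d.hard {
			over = append(over, d.name)
		}
	}
	return over
}

func main() {
	// GPU and GPU memory are left at zero, i.e. an explicit zero quota.
	hard := Resources{CPUMilli: 4000, MemoryMiB: 16384}

	// Create: a GPU workload under a GPU=0 quota is rejected.
	create := Resources{CPUMilli: 1000, MemoryMiB: 8192, GPU: 1, GPUMemMB: 10000}
	fmt.Println("create:", Exceeded(hard, Resources{}, Resources{}, create))

	// Scale: the release already accounts for part of the live usage.
	used := Resources{CPUMilli: 3000, MemoryMiB: 4096}
	current := Resources{CPUMilli: 2000, MemoryMiB: 2048}
	scaled := Resources{CPUMilli: 3000, MemoryMiB: 3072}
	fmt.Println("scale:", Exceeded(hard, used, current, scaled))
}
```

Running the check before any DB or Helm mutation, as the notes describe, means a rejected request leaves no instance or release record behind.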

Unresolved bugs implementation 2026-05-14

  • Worker A: fix instance API contract: detail replicas, list values, values/valuesYaml conflict, namespace 403.
    • Inspect current instance handler/service tests and avoid touching other workers' areas.
    • Add request validation so values plus valuesYaml/values_yaml conflicts return HTTP 400, while YAML-only still populates values.
    • Enrich GetInstance replica count using the same live K8s source as list.
    • Include values in list responses for API compatibility.
    • Change normal-user tenant namespace mismatch from silent override to ErrForbidden/HTTP 403.
    • Add focused Go tests for namespace mismatch and replica enrichment/list values where practical.
    • Run targeted Go tests and review diff.
    • Review: changed only scoped instance backend files plus this task tracker; validated with go test ./internal/domain/service and go test ./internal/adapter/input/http/rest from backend/.
  • Worker B: add Helm-rendered quota pre-check helper before DB create/Helm install.
    • Inspect Helm client/service quota contracts and preserve other workers' edits.
    • Add domain quota precheck types and compare logic for CPU, memory, GPU, and integer-MB gpumem.
    • Add Helm render estimator output port and real/mock Helm implementations that render final chart values and sum Pod template requests/limits.
    • Add focused Go tests for quota comparison and rendered manifest estimation where feasible.
    • Run targeted Go tests and review diff.
    • Review: exposed QuotaPrecheckService.EstimateAndCompare plus CompareWorkspaceQuota; real Helm now dry-renders /tmp/charts/{chart}-{version}.tgz with final values and estimates Pod template requests/limits. Added quota and manifest estimator tests. Validation passed with go test ./internal/domain/service ./internal/adapter/output/helm/... and full backend go test ./....
  • Worker C: add compatibility/security backend endpoints and auth/CORS/rate-limit fixes.
    • Inspect backend route/handler/service contracts and preserve other workers' edits.
    • Add /repositories/{repo}/tags compatibility alias without changing existing artifact behavior.
    • Add /monitoring/clusters/{id}/metrics alias and /clusters/{id}/stats compatibility response.
    • Add /clusters/{id}/kubeconfig tenant kubeconfig endpoint scoped to the authenticated user's workspace and requested cluster.
    • Make login failures uniform and add a lightweight per-client login rate limit.
    • Replace permissive CORS reflection/wildcard defaults with an allowlist-driven default suitable for local dev.
    • Add focused Go tests where straightforward, then run relevant Go tests and review diff.
    • Review: changed scoped backend route/auth/CORS handlers, added CORS and login limiter tests, and removed the direct SSE CORS wildcard so global CORS applies. Validation attempted with go test ./cmd/api ./internal/adapter/input/http/rest from backend/, but it is currently blocked by concurrent Worker B compile errors in internal/domain/service/quota_precheck.go and Helm client implementations missing EstimateInstanceResources.
  • Worker D: harden Nginx gateway /health, server tokens, and security headers.
  • Worker E: align frontend/API client and Playwright coverage for conflict/namespace/scale flows.
  • Worker F: add API/security/regression test scripts and review coverage.
  • Integrate worker changes, resolve conflicts, and run Go/frontend builds.
  • Run Docker Compose smoke, API contracts, Playwright, and real k3s deploy cleanup.
  • Update tasks/lessons.md and add Review summary here.

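docs/ directory (cleaned up)

| File | Purpose | Status |
| --- | --- | --- |
| user-guide.md | User operation guide | Permanent reference |
| test-scenarios.md | 100+ test case designs | Permanent reference |
| test-users.json | Credentials for 4 test accounts | Permanent reference |
| regression-full-report.md | Latest comprehensive regression report | Can be deleted (next version) |
| UNRESOLVED-BUGS.md | List of unresolved issues (15) | Current version |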

Worker C monitoring and instance owner backend 2026-05-14

  • Inspect existing instance/monitoring permission, repository, DTO, and K8s metrics contracts without reverting other workers' changes.
  • Add ownerUsername to instance entity/DTO responses and hydrate it for detail/list via user repository while preserving ordinary-user/admin visibility rules.
  • Add K8s Pod resource allocation collection from requests/limits, including GPU and requests.nvidia.com/gpumem as integer MB.
  • Aggregate resourceUsageByUser in monitoring service by matching Pods to visible instances/workspaces/owners, with ordinary users scoped to themselves and admins seeing all visible owners.
  • Expose cluster-level GPU/GPU memory allocation fields and per-user resource usage in /monitoring/clusters, detail, and existing aliases.
  • Add focused Go tests for instance owner username and monitoring resource aggregation/privacy.
  • Run relevant Go tests, review diff, and add Review summary here.
    • Review: Instance list/detail now include ownerUsername hydrated from the user repository. Monitoring responses now include per-user resource usage plus CPU/memory/GPU/GPU-memory request/limit allocation fields derived from Kubernetes Pod resources and DB instance ownership mapping; ordinary users only see their own allocation rows/totals, admins see all visible instance owners. Validation passed with go test ./internal/domain/service, go test ./cmd/api ./internal/adapter/input/http/rest ./internal/adapter/output/k8s, and backend go test ./....
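
The per-user aggregation and its privacy rule (ordinary users see only their own rows, admins see all visible owners) could look roughly like the sketch below. The `PodAllocation` and `UserUsage` types, the `AggregateByUser` function, and the instance-ID-to-owner map are hypothetical stand-ins for the real Pod collection and DB ownership mapping.

```go
package main

import (
	"fmt"
	"sort"
)

// PodAllocation is a hypothetical view of one Pod's resource requests,
// already matched to a platform instance.
type PodAllocation struct {
	InstanceID string
	CPUMilli   int64
	MemoryMiB  int64
	GPU        int64
	GPUMemMB   int64 // integer MB
}

// UserUsage is one row of a resourceUsageByUser-style response.
type UserUsage struct {
	Username  string
	CPUMilli  int64
	MemoryMiB int64
	GPU       int64
	GPUMemMB  int64
}

// AggregateByUser sums Pod allocations per instance owner.
// owners maps instance ID to owner username; Pods whose instance is
// unknown are skipped. Non-admin viewers only ever get their own row.
func AggregateByUser(pods []PodAllocation, owners map[string]string, viewer string, isAdmin bool) []UserUsage {
	totals := make(map[string]*UserUsage)
	for _, p := range pods {
		owner, ok := owners[p.InstanceID]
		if !ok {
			continue // not a platform-managed Pod
		}
		if !isAdmin && owner != viewer {
			continue // privacy: hide other users' allocations
		}
		u := totals[owner]
		if u == nil {
			u = &UserUsage{Username: owner}
			totals[owner] = u
		}
		u.CPUMilli += p.CPUMilli
		u.MemoryMiB += p.MemoryMiB
		u.GPU += p.GPU
		u.GPUMemMB += p.GPUMemMB
	}

	rows := make([]UserUsage, 0, len(totals))
	for _, u := range totals {
		rows = append(rows, *u)
	}
	// Sort for a stable response order.
	sort.Slice(rows, func(i, j int) bool { return rows[i].Username < rows[j].Username })
	return rows
}

func main() {
	pods := []PodAllocation{
		{InstanceID: "i-1", CPUMilli: 1000, MemoryMiB: 10000, GPU: 1, GPUMemMB: 10000},
		{InstanceID: "i-2", CPUMilli: 500, MemoryMiB: 512},
		{InstanceID: "i-3", CPUMilli: 250, MemoryMiB: 256},
	}
	owners := map[string]string{"i-1": "alice", "i-2": "bob", "i-3": "alice"}

	fmt.Println("admin view:", AggregateByUser(pods, owners, "root", true))
	fmt.Println("bob's view:", AggregateByUser(pods, owners, "bob", false))
}
```

Filtering before summation, rather than trimming rows afterwards, means totals derived from the result cannot leak other users' allocations.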

Debug quota limits monitoring UI 2026-05-15

  • Inspect current runtime logs for workspace conflict/quota errors without killing other services.
  • Fix quota semantics: CPU/memory blank means unlimited; GPU/GPU Mem blank means explicit zero for ordinary users.
  • Fix admin user update so editing an existing user's quota does not recreate/reassign namespace and does not raise false workspace namespace conflict.
  • Rework User Management action controls so they wrap inside the viewport on desktop and mobile.
  • Improve monitoring for ordinary users with self-scoped useful fields instead of all N/A; make admin monitoring show the new resource allocation rows clearly.
  • Rebuild Docker Compose stack and run backend/frontend tests plus Playwright overflow smoke.
  • Use ivanwu on k3s with vllm-serve 0.6.0, CPU/memory unlimited and gpumem 10000, then verify/clean up.
  • Add Review summary and lessons.

Review:

  • Runtime logs were checked before and after changes. The only 502s observed were during intentional backend rebuild; final backend/nginx logs had no error/fatal/5xx entries.
  • Admin can now update ivanwu without workspace namespace conflict; ivanwu was migrated to workspace ivanwu, namespace ocdp-u-ivanwu, default cluster k3s, CPU/memory unlimited, GPU 1, GPU Mem 10000.
  • k3s ResourceQuota for ivanwu contains only GPU and GPU Mem hard limits; CPU/memory are omitted as unlimited. A vllm-serve 0.6.0 deployment used harbor.bwgdi.com/library/vllm-openai:v0.17.1, reached deployed, Pod 1/1 Running, then was deleted through the platform and quota usage returned to 0/1 GPU and 0/10k gpumem.
  • Monitoring now shows ordinary users self-scoped allocation rows and admin per-user rows. The vLLM deployment was visible as CPU 1.00 cores, memory 9.8 GiB, GPU 1, GPU Mem 10000.
  • Verification passed: go test ./..., npm run build, test/frontend-playwright-smoke.py, test/instance_card_action_layout_playwright.py, test/user_management_layout_playwright.py, test/user_namespace_quota_api_contract.py, and test/unresolved_bugs_api_contract.py.
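
The revised quota semantics (blank CPU/memory means unlimited and is omitted; blank GPU/GPU Mem means an explicit zero for ordinary users) can be illustrated with a small helper that builds ResourceQuota-style hard limits. This is a simplified sketch: the `UserQuota` type and `HardLimits` function are hypothetical, values are plain integers rather than Kubernetes quantities, and the `requests.nvidia.com/gpu` key is an assumption (only `requests.nvidia.com/gpumem` appears in these notes).

```go
package main

import "fmt"

// UserQuota is a hypothetical form model where nil means the field
// was left blank in the UI.
type UserQuota struct {
	CPUMilli  *int64
	MemoryMiB *int64
	GPU       *int64
	GPUMemMB  *int64
}

// HardLimits returns the hard entries to put in a tenant quota.
// Blank CPU/memory are omitted (unlimited); blank GPU and GPU memory
// become explicit zeros so ordinary users cannot claim GPUs by default.
func HardLimits(q UserQuota) map[string]int64 {
	hard := make(map[string]int64)
	if q.CPUMilli != nil {
		hard["requests.cpu"] = *q.CPUMilli
	}
	if q.MemoryMiB != nil {
		hard["requests.memory"] = *q.MemoryMiB
	}
	hard["requests.nvidia.com/gpu"] = valueOrZero(q.GPU) // key name is an assumption
	hard["requests.nvidia.com/gpumem"] = valueOrZero(q.GPUMemMB)
	return hard
}

func valueOrZero(v *int64) int64 {
	if v == nil {
		return 0
	}
	return *v
}

func main() {
	gpu, gpuMem := int64(1), int64(10000)

	// CPU/memory blank, GPU 1, GPU Mem 10000: only GPU keys are emitted.
	fmt.Println(HardLimits(UserQuota{GPU: &gpu, GPUMemMB: &gpuMem}))

	// Everything blank: GPU keys are still present, pinned to zero.
	fmt.Println(HardLimits(UserQuota{}))
}
```

The first call mirrors the ivanwu case in the Review above, where the k3s ResourceQuota contained only GPU and GPU Mem hard limits.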

Restart docs and user management overflow 2026-05-18

  • Inspect current Docker Compose service lifecycle and identify why frontend/backend feel disconnected.
  • Update Makefile/README so one clear command starts the whole platform, with explicit rebuild/restart/status/log commands.
  • Restart the full stack through the documented command and verify health endpoints.
  • Reproduce User Management overflow with Playwright at desktop/tablet/mobile widths.
  • Fix User Management layout so action buttons and quota controls stay inside the viewport.
  • Run backend/frontend builds plus Playwright layout smoke.
  • Record Review summary and lessons.

Review:

  • make up is now the single documented platform start command. It runs docker compose up --build -d for the whole stack, and old commands (run-2, docker-dev, docker-prod, docker-up) are compatibility aliases.
  • docker-compose.yml now keeps nginx under restart: unless-stopped; make docker-ps and make up show docker compose ps -a, so the expected frontend-build Exited (0) state is visible and less confusing.
  • README now explains that frontend-build is a one-shot build job and the actual frontend runtime is nginx, which also proxies /api.
  • User Management layout was changed from a fixed four-column row to a responsive card layout with a wrapping action area. The app shell content column also has min-w-0 so wide children cannot force browser overflow.
  • Verification passed: go test ./..., npm run build, make up, health checks for backend/nginx/web, test/user_management_layout_playwright.py across 1440/1280/1024/900/768 widths, test/frontend-playwright-smoke.py, and test/instance_card_action_layout_playwright.py.