Files
ocdp-go/tasks/lessons.md
Ivan087 33ddaf97db fix: scale replicas in response, K8s metrics client, quota precheck, auth tests
- Add GetMetrics method to MetricsClient interface and implement cluster metrics API
- Add QuotaPrecheck service for validating resource quotas before deployment
- Add auth DTO with role/permission models and auth handler tests
- Add instance diagnostics: mounted NFS volumes, labels, annotations in pod diagnostics
- Update workspace handler with GetWorkspace endpoint and shared-user list
- Fix monitoring handler to use correct service method name
- Add tail_lines fallback in instance handler for snake_case query params
- Update nginx config for SSE log streaming support (no buffering)
- Add comprehensive test coverage: auth_service_test, auth_handler_test,
  auth_dto_test, metrics_client_test, quota_precheck_test
- Update error messages for quota validation and instance operations
- ModifyModal: fix YAML lineWidth:0, modified keys summary, delta-only submit
- InstanceCard: correctly disable scale-minus when replicas <= 0
- SidebarLayout: add hover transition for sidebar items
- Update todo.md and lessons.md with latest fixes
2026-05-20 16:56:29 +08:00

16 lines
3.4 KiB
Markdown

# Lessons
- Do not leave real bootstrap credentials, cluster endpoints, certificates, or passwords in code fallbacks. Bootstrap defaults must be empty/disabled; real data must come only from `.env`, `BOOTSTRAP_CONFIG_JSON`, or explicit config files.
- Keep backend permission names aligned with frontend route guards. Returning legacy domain permissions like `clusters:manage:own` without UI permissions such as `configuration:clusters:manage_own` makes ordinary users appear logged in but blocked by every page.
- Treat `requests.nvidia.com/gpumem` as a vendor integer MB scalar in this project. Do not normalize it through Kubernetes memory units such as `M`, `G`, or `Gi`; use values like `10000`.
- Multi-cluster tenant resources must be scoped by `(workspace_id, cluster_id)`. Do not infer the target cluster from list order; user/workspace defaults, kubeconfig issuance, namespace creation, ResourceQuota, and deploy must all use the same selected cluster.
- For real Helm smoke tests, wait for platform instance deletion to remove the DB record before deleting the Kubernetes namespace manually. Deleting the namespace too early can make the async Helm uninstall mark the instance failed.
- When embedding Helm, setting `actionConfig.Init(..., namespace, ...)` and `Install.Namespace` is not enough. The custom `RESTClientGetter` must also override the raw kubeconfig loader namespace, or manifests without `metadata.namespace` can be created in the kubeconfig context namespace such as `default`.
- **Axios keysToSnake recursively converts ALL object keys including user-provided values map.** This silently renames Helm chart values (gpuMem → gpu_mem) causing chart to ignore user settings. Fix: skip recursion for known data fields (values, valuesYaml) while still converting field names. Backend DTOs must provide dual json tags (camelCase + snake_case) with Normalize() fallback.
- In the current two-role model, ordinary users must be forced into username-derived private workspaces/namespaces. Do not accept arbitrary `workspaceId` for role=`user`, or `(workspace_id, cluster_id)` quotas become shared across users. When editing an existing user, update the existing private workspace in place; only migrate users still attached to the default workspace.
- CPU and memory quotas are allowed to be blank, which means no platform ResourceQuota limit for that resource. GPU and `requests.nvidia.com/gpumem` should still default to explicit `0` for ordinary users unless admin sets them.
- Layout regression tests must not depend on deliberately invalid charts leaving DB instances behind. The safer behavior is to reject before DB persistence; use mocked API data for pure frontend overflow checks.
- Monitoring Pod-to-instance attribution cannot rely only on Helm standard labels. Some local charts, including `vllm-serve`, use only `app=<release>`; include that fallback before concluding allocation is zero.
- In this Compose stack the React frontend is not a long-running frontend container. `frontend-build` is a one-shot asset build and `nginx` is the frontend runtime plus API gateway; README and status commands must make that explicit or users will think the stack is partially down.
- For admin tables/cards inside the sidebar shell, fixed multi-column grids can still overflow even when individual buttons use `min-w-0`. Prefer responsive card layouts with a wrapping action region, and test at 1440/1280/1024/900/768 widths.