- Add GetMetrics method to MetricsClient interface and implement cluster metrics API
- Add QuotaPrecheck service for validating resource quotas before deployment
- Add auth DTO with role/permission models and auth handler tests
- Add instance diagnostics: mounted NFS volumes, labels, annotations in pod diagnostics
- Update workspace handler with GetWorkspace endpoint and shared-user list
- Fix monitoring handler to use correct service method name
- Add tail_lines fallback in instance handler for snake_case query params
- Update nginx config for SSE log streaming support (no buffering)
- Add comprehensive test coverage: auth_service_test, auth_handler_test, auth_dto_test, metrics_client_test, quota_precheck_test
- Update error messages for quota validation and instance operations
- ModifyModal: fix YAML lineWidth:0, modified keys summary, delta-only submit
- InstanceCard: correctly disable scale-minus when replicas <= 0
- SidebarLayout: add hover transition for sidebar items
- Update todo.md and lessons.md with latest fixes
Lessons
- Do not leave real bootstrap credentials, cluster endpoints, certificates, or passwords in code fallbacks. Bootstrap defaults must be empty/disabled; real data must come only from `.env`, `BOOTSTRAP_CONFIG_JSON`, or explicit config files.
- Keep backend permission names aligned with frontend route guards. Returning legacy domain permissions like `clusters:manage:own` without UI permissions such as `configuration:clusters:manage_own` makes ordinary users appear logged in but blocked by every page.
- Treat `requests.nvidia.com/gpumem` as a vendor integer MB scalar in this project. Do not normalize it through Kubernetes memory units such as `M`, `G`, or `Gi`; use values like `10000`.
- Multi-cluster tenant resources must be scoped by `(workspace_id, cluster_id)`. Do not infer the target cluster from list order; user/workspace defaults, kubeconfig issuance, namespace creation, ResourceQuota, and deploy must all use the same selected cluster.
- For real Helm smoke tests, wait for platform instance deletion to remove the DB record before deleting the Kubernetes namespace manually. Deleting the namespace too early can make the async Helm uninstall mark the instance failed.
- When embedding Helm, setting `actionConfig.Init(..., namespace, ...)` and `Install.Namespace` is not enough. The custom `RESTClientGetter` must also override the raw kubeconfig loader namespace, or manifests without `metadata.namespace` can be created in the kubeconfig context namespace such as `default`.
- Axios keysToSnake recursively converts ALL object keys including user-provided values map. This silently renames Helm chart values (gpuMem → gpu_mem) causing chart to ignore user settings. Fix: skip recursion for known data fields (values, valuesYaml) while still converting field names. Backend DTOs must provide dual json tags (camelCase + snake_case) with Normalize() fallback.
- In the current two-role model, ordinary users must be forced into username-derived private workspaces/namespaces. Do not accept arbitrary `workspaceId` for role=user, or `(workspace_id, cluster_id)` quotas become shared across users. When editing an existing user, update the existing private workspace in place; only migrate users still attached to the default workspace.
- CPU and memory quotas are allowed to be blank, which means no platform ResourceQuota limit for that resource. GPU and `requests.nvidia.com/gpumem` should still default to explicit `0` for ordinary users unless admin sets them.
- Layout regression tests must not depend on deliberately invalid charts leaving DB instances behind. The safer behavior is to reject before DB persistence; use mocked API data for pure frontend overflow checks.
- Monitoring Pod-to-instance attribution cannot rely only on Helm standard labels. Some local charts, including `vllm-serve`, use only `app=<release>`; include that fallback before concluding allocation is zero.
- In this Compose stack the React frontend is not a long-running frontend container. `frontend-build` is a one-shot asset build and `nginx` is the frontend runtime plus API gateway; README and status commands must make that explicit or users will think the stack is partially down.
- For admin tables/cards inside the sidebar shell, fixed multi-column grids can still overflow even when individual buttons use `min-w-0`. Prefer responsive card layouts with a wrapping action region, and test at 1440/1280/1024/900/768 widths.
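
The keysToSnake lesson above can be illustrated with a minimal sketch. This is not the project's actual interceptor: the `Json` type, the `OPAQUE_FIELDS` set, and the `toSnake` helper are assumptions made for illustration. It shows the idea of converting field names everywhere except inside known data fields, whose contents pass through untouched.

```typescript
// Hypothetical sketch; type and helper names are assumptions, not project code.
type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

// Fields whose contents are user data (for example Helm chart values).
const OPAQUE_FIELDS = new Set(["values", "valuesYaml"]);

// Convert one camelCase identifier to snake_case, e.g. "valuesYaml" -> "values_yaml".
function toSnake(key: string): string {
  return key.replace(/[A-Z]/g, (ch) => "_" + ch.toLowerCase());
}

function keysToSnake(input: Json): Json {
  if (Array.isArray(input)) {
    return input.map((item) => keysToSnake(item));
  }
  if (input !== null && typeof input === "object") {
    const out: { [key: string]: Json } = {};
    for (const [key, value] of Object.entries(input)) {
      // The field name is still converted; only the recursion into it is skipped.
      out[toSnake(key)] = OPAQUE_FIELDS.has(key) ? value : keysToSnake(value);
    }
    return out;
  }
  return input;
}

// The request field is renamed, but chart keys such as gpuMem stay as the user wrote them.
const examplePayload = keysToSnake({
  releaseName: "demo",
  values: { gpuMem: 10000 },
});
```

The skip is keyed on the original camelCase field name, so the set has to list fields as the frontend sends them.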
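
The two gpumem lessons above (plain integer MB, and an explicit `0` default for ordinary users) might be enforced at the input boundary roughly as follows. The function names `parseGpuMemMb` and `gpuMemQuotaEntry` are hypothetical, and the real validation may live in the backend instead.

```typescript
// Hypothetical helpers; only the resource key comes from the notes above.
const GPU_MEM_KEY = "requests.nvidia.com/gpumem";

// Parse a gpumem value as a plain non-negative integer number of MB.
function parseGpuMemMb(raw: string): number {
  const text = raw.trim();
  // Reject Kubernetes-style quantities such as "10G", "10Gi", or "10000M".
  if (!/^\d+$/.test(text)) {
    throw new Error(`gpumem must be a plain integer number of MB, got "${raw}"`);
  }
  const mb = Number(text);
  if (!Number.isSafeInteger(mb)) {
    throw new Error(`gpumem value is too large: "${raw}"`);
  }
  return mb;
}

// Blank or missing input falls back to an explicit 0 rather than "no limit".
function gpuMemQuotaEntry(raw: string | undefined): Record<string, string> {
  const mb = raw === undefined || raw.trim() === "" ? 0 : parseGpuMemMb(raw);
  return { [GPU_MEM_KEY]: String(mb) };
}
```

Rejecting unit suffixes outright, rather than converting them, keeps a value like `10Gi` from being silently reinterpreted.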
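
For the monitoring attribution lesson above, a label lookup with the `app=<release>` fallback could look like the sketch below. It assumes the "Helm standard label" in question is `app.kubernetes.io/instance`, and the function name `releaseForPod` is hypothetical; the actual monitoring code may match on other labels as well.

```typescript
// Hypothetical sketch of Pod-to-instance attribution by labels.
type PodLabels = { [key: string]: string | undefined };

function releaseForPod(
  labels: PodLabels,
  knownReleases: ReadonlySet<string>,
): string | undefined {
  // Assumption: charts following Helm conventions set this label to the release name.
  const standard = labels["app.kubernetes.io/instance"];
  if (standard !== undefined && knownReleases.has(standard)) {
    return standard;
  }
  // Fallback for charts that only set app=<release>.
  const fallback = labels["app"];
  if (fallback !== undefined && knownReleases.has(fallback)) {
    return fallback;
  }
  // Only now is it reasonable to treat the Pod as unattributed.
  return undefined;
}
```

Checking candidates against the set of known releases avoids attributing a Pod whose `app` label merely happens to be set to some unrelated value.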