refactor: full-stack restructure with multi-tenancy, workspace management, and K8s diagnostics

- Add Workspace domain (entity, repository, service, handler, DTO)
- Add multi-tenant K8s client with tenant binding and quota management
- Add K8s diagnostics client (instance diagnostics)
- Add authorization middleware (authz package)
- Restructure frontend to feature-based architecture (features/)
- Add User Management page in configuration
- Add AccessDenied page and route guards
- Refactor shared components (form inputs, layout, UI)
- Update Tailwind config for new design system
- Add comprehensive documentation (docs/, tasks/, plans)
- Improve cluster service with better kubeconfig handling
- Add tests for crypto, config, helm client, tenant binding
Ivan087
2026-05-12 16:15:14 +08:00
parent c5e51ed069
commit 7f238a3168
172 changed files with 15703 additions and 3162 deletions

Multi-Tenant Kubeconfig.md Normal file

@ -0,0 +1,127 @@
# Technical Specification: Multi-Tenant Kubeconfig & Auth Gateway
## 1. System Overview & Goals
- **Objective**: Develop a backend API service that automates Kubernetes multi-tenant onboarding (Namespace + Quota isolation) and securely distributes short-lived, dynamic `kubeconfig` files using the Kubernetes `TokenRequest` API.
- **Architecture Independence**: This backend service acts as a standalone control plane. It is **not** strictly bound to a BFF pattern and does **not** need to run inside the target Kubernetes cluster (it supports Out-of-Cluster execution).
- **Out of Scope**: This spec does NOT cover the frontend UI implementation or the downstream workload deployment. It focuses strictly on identity, tenant provisioning, and credential brokering.
- **Security Principles**: Adhere strictly to Zero-Knowledge architecture (no token storage in DB), Ephemeral Credentials (short-lived tokens only), and Least Privilege (the Gateway must NOT be a `cluster-admin`).
## 2. Architecture & Topology
- **Tech Stack**: Go `net/http` (or FastAPI), utilizing the official Kubernetes Client SDK (`client-go` or `kubernetes-client/python`).
- **Control Plane Flow**:
1. Client/Frontend -> Gateway: User requests environment access.
2. Gateway -> K8s API: Gateway authenticates to the target K8s cluster using its own restricted service credentials (e.g., an Out-of-Cluster `kubeconfig`; see 6.2 for the permitted scope).
3. Gateway -> K8s API: Executes Namespace/SA creation (if new) or calls `TokenRequest` API (if existing).
4. Gateway -> Client/Frontend: Returns a generated `kubeconfig` YAML string with the short-lived JWT token.
## 3. Core Business Logic Workflows
### Phase 1: Tenant Initialization (Onboarding)
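Triggered when a new user registers or requests a workspace for the first time. The Gateway must create four resources in the order below. Kubernetes has no multi-object transactions, so each step should be idempotent and a partially completed run must be safe to retry: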
1. **Namespace**: `tenant-{user_uuid}`
2. **ServiceAccount**: `sa-tenant-admin` (Created inside the tenant's namespace).
3. **RoleBinding**: Bind `sa-tenant-admin` to the `admin` (or custom) ClusterRole, strictly isolated within `tenant-{user_uuid}`.
4. **ResourceQuota**: Enforce limits (e.g., `requests.cpu: "4"`, `limits.memory: "16Gi"`) to prevent noisy neighbors.
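
The four onboarding objects can be sketched as plain manifests. The Go snippet below builds them as JSON-serializable maps using only the standard library. It is a minimal sketch, not the Gateway's actual code: the function name `TenantManifests`, the RoleBinding name `tenant-admin-binding`, and the ResourceQuota name `tenant-quota` are assumptions, and a real implementation would more likely use typed `client-go` objects.

```go
package main

import "fmt"

// Manifest is a generic Kubernetes object ready for JSON encoding.
type Manifest map[string]any

// TenantManifests returns the Namespace, ServiceAccount, RoleBinding and
// ResourceQuota for one tenant, in creation order.
// The RoleBinding and ResourceQuota names are illustrative assumptions.
func TenantManifests(userUUID, cpu, memory string) []Manifest {
	ns := fmt.Sprintf("tenant-%s", userUUID)
	const sa = "sa-tenant-admin"
	return []Manifest{
		{
			"apiVersion": "v1",
			"kind":       "Namespace",
			"metadata":   map[string]any{"name": ns},
		},
		{
			"apiVersion": "v1",
			"kind":       "ServiceAccount",
			"metadata":   map[string]any{"name": sa, "namespace": ns},
		},
		{
			// A namespaced RoleBinding to a ClusterRole grants that role's
			// permissions only inside this namespace.
			"apiVersion": "rbac.authorization.k8s.io/v1",
			"kind":       "RoleBinding",
			"metadata":   map[string]any{"name": "tenant-admin-binding", "namespace": ns},
			"subjects": []map[string]any{
				{"kind": "ServiceAccount", "name": sa, "namespace": ns},
			},
			"roleRef": map[string]any{
				"apiGroup": "rbac.authorization.k8s.io",
				"kind":     "ClusterRole",
				"name":     "admin",
			},
		},
		{
			"apiVersion": "v1",
			"kind":       "ResourceQuota",
			"metadata":   map[string]any{"name": "tenant-quota", "namespace": ns},
			"spec": map[string]any{
				"hard": map[string]any{"requests.cpu": cpu, "limits.memory": memory},
			},
		},
	}
}
```

Calling `TenantManifests("a1b2c3d4", "4", "16Gi")` yields objects that all target the namespace `tenant-a1b2c3d4`. Creating them in slice order keeps each step's dependencies in place before it runs.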
### Phase 2: Credential Distribution (Dynamic Token)
Triggered when the user requests CLI access or downloads a kubeconfig.
1. Locate the user's associated Namespace and ServiceAccount, verifying the user's ownership of the workspace.
2. Audit Logging: Record the credential issuance event (User, IP, Workspace) into the database.
3. Call the `authentication.k8s.io/v1 TokenRequest` API targeting `sa-tenant-admin` in the specific tenant's namespace.
4. Set `expirationSeconds: 7200` (2 hours). Hard limit; cannot be extended.
5. Retrieve the generated JWT token and inject it into a pre-defined `kubeconfig` text template.
### Phase 3: Automated Renewal & Emergency Suspension
- **Session Management**: If accessed via a Web UI, the Gateway intercepts requests, attaches the dynamic token, and forwards them. If the token is within 10 minutes of expiration, the Gateway automatically issues a new TokenRequest.
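- **Emergency Suspension**: If a workspace is marked compromised, the Gateway deletes its K8s `RoleBinding`. Tokens already issued for that tenant still authenticate until they expire, but they lose all authorization in the tenant namespace immediately, which cuts off access without waiting for expiry.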
## 4. API Contracts
### 4.1. Initialize Tenant Workspace
- **Route**: `POST /api/v1/workspaces/init`
- **Auth**: Gateway Session / Bearer Token
- **Rate Limit**: Strictly rate-limited per user to prevent Namespace exhaustion.
- **Request Payload** (`tier` determines the ResourceQuota template):
```json
{
  "tier": "basic"
}
```
- **Response Payload (201 Created)**:
```json
{
"namespace": "tenant-a1b2c3d4",
"status": "provisioned",
"quota": {"cpu": "4", "memory": "8Gi"}
}
```
### 4.2. Generate Dynamic Kubeconfig
- **Route**: `GET /api/v1/workspaces/credentials/kubeconfig`
- **Auth**: Gateway Session / Bearer Token
- **Response Payload (200 OK)**: Returns raw `application/x-yaml` content.
```yaml
apiVersion: v1
clusters:
- cluster:
    server: https://<k8s-api-server>
    certificate-authority-data: <ca-base64>
  name: internal-cluster
contexts:
- context:
    cluster: internal-cluster
    namespace: tenant-a1b2c3d4 # Default context locked to their namespace
    user: sa-tenant-admin
  name: tenant-context
current-context: tenant-context
kind: Config
users:
- name: sa-tenant-admin
  user:
    token: "eyJhbGciOiJSUzI1NiIs..." # Short-lived token injected here
```
### 4.3. Suspend Workspace (Emergency Kill Switch)
- **Route**: `POST /api/v1/workspaces/{id}/suspend`
- **Auth**: Admin Only
- **Behavior**: Updates the DB status to `suspended` and deletes the associated K8s `RoleBinding`.
## 5. Data Architecture & Persistence
- **Database**: PostgreSQL (Relational mapping between Users and K8s Namespaces).
- **Table**: `users`
- `id` (UUID, PK), `email`, `password_hash`, `status`
- **Table**: `workspaces`
- `id` (UUID, PK)
- `user_id` (UUID, FK to Users table)
- `k8s_namespace` (String, unique)
- `k8s_sa_name` (String)
- `tier` (String)
- `created_at` (Timestamp)
- **Table**: `audit_logs` (Security Compliance)
- `id` (UUID, PK), `user_id` (UUID), `workspace_id` (UUID), `action` (e.g., IssueKubeconfig), `ip_address`, `created_at`
- **Constraint**: We do NOT store the K8s Token in the database. Tokens are ephemeral and generated on-the-fly.
## 6. Security, Threat Mitigation & Infrastructure Constraints
### 6.1 Threat Model
| Threat | Mitigation Strategy |
| :--- | :--- |
| **Gateway Compromise** | The Gateway uses a strictly restricted K8s role. It cannot read existing `Secrets` or interfere with other tenants' running Pods. |
| **Token Theft (XSS)** | Application-level Auth must use `HttpOnly, Secure` Cookies. Generated Kubeconfigs expire in 2 hours. |
| **Resource Abuse (Mining)** | Hardcoded `ResourceQuota` per tenant upon creation. A default `LimitRange` applied in every tenant namespace (`LimitRange` is a namespaced resource, so it cannot be set once for the whole cluster). |
### 6.2 Restricted Gateway Credentials (Crucial)
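The Gateway requires a K8s credential (Out-of-Cluster `kubeconfig` or Cloud IAM Role) to operate. **This credential MUST NOT have `cluster-admin` privileges.** It should be bound to a custom `ClusterRole` with ONLY the following permissions:
- `create`, `get`, `list` on `namespaces`, `resourcequotas`.
- `create`, `get`, `list` on `serviceaccounts`, `rolebindings`, plus `delete` on `rolebindings` (required by the suspension flow in 4.3).
- `bind` on `clusterroles`, restricted via `resourceNames` to the role referenced in Phase 1 (e.g. `admin`). Without it, RBAC escalation prevention rejects the RoleBinding because the Gateway does not itself hold that role's permissions.
- `create` on `serviceaccounts/token` (CRITICAL for TokenRequest API).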
- *Strictly prohibited*: `get` or `list` on `secrets`, `pods`, or `deployments`.
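
One way to keep this allowlist honest is to encode it as data and check it in code, for example in a deployment test that compares it against the actual `ClusterRole`. The sketch below is an assumption about how such a check might look, not part of the spec. The names `allowedVerbs` and `gatewayMayUse` are hypothetical, and the table includes `delete` on `rolebindings` because the suspension flow in 4.3 deletes a RoleBinding.

```go
package main

// allowedVerbs lists, per resource, the verbs the Gateway's ClusterRole may hold.
var allowedVerbs = map[string][]string{
	"namespaces":            {"create", "get", "list"},
	"resourcequotas":        {"create", "get", "list"},
	"serviceaccounts":       {"create", "get", "list"},
	"rolebindings":          {"create", "get", "list", "delete"},
	"serviceaccounts/token": {"create"},
}

// gatewayMayUse reports whether verb on resource is inside the allowlist.
// Anything not listed (secrets, pods, deployments, wildcards) is rejected.
func gatewayMayUse(verb, resource string) bool {
	for _, v := range allowedVerbs[resource] {
		if v == verb {
			return true
		}
	}
	return false
}
```

Because the check is deny-by-default, a new resource or verb has to be added to the table explicitly before the Gateway code is allowed to rely on it.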
### 6.3 Deployment & Networking
- **Deployment Agnostic**: The application will be packaged as a Docker image and can be deployed via Docker Compose, standalone VMs, or within a Kubernetes cluster.
- **CORS/CSP**: Since this might not be a single-origin BFF, explicit CORS policies (`Access-Control-Allow-Origin`) must be tightly defined if the frontend is hosted on a separate domain. Wildcards (`*`) are prohibited.
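
A minimal sketch of an explicit-origin CORS policy with `net/http` follows. The names `allowOriginHeader` and `corsMiddleware`, the permitted methods and headers, and the choice to answer disallowed origins with `403` are assumptions for illustration; the spec only requires that origins be listed explicitly and that `*` never be used.

```go
package main

import "net/http"

// allowOriginHeader returns the value for Access-Control-Allow-Origin, or
// false if the origin is not explicitly allowlisted. A literal "*" is always
// rejected, even if it was added to the allowlist by mistake.
func allowOriginHeader(allowed map[string]bool, origin string) (string, bool) {
	if origin == "*" || !allowed[origin] {
		return "", false
	}
	return origin, true
}

// corsMiddleware enforces the allowlist on cross-origin requests and answers
// preflight (OPTIONS) requests itself.
func corsMiddleware(allowed map[string]bool, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if origin := r.Header.Get("Origin"); origin != "" {
			value, ok := allowOriginHeader(allowed, origin)
			if !ok {
				http.Error(w, "origin not allowed", http.StatusForbidden)
				return
			}
			h := w.Header()
			h.Set("Access-Control-Allow-Origin", value)
			h.Set("Access-Control-Allow-Credentials", "true")
			h.Add("Vary", "Origin") // responses differ per origin, so caches must key on it
		}
		if r.Method == http.MethodOptions {
			w.Header().Set("Access-Control-Allow-Methods", "GET, POST")
			w.Header().Set("Access-Control-Allow-Headers", "Authorization, Content-Type")
			w.WriteHeader(http.StatusNoContent)
			return
		}
		next.ServeHTTP(w, r)
	})
}
```

Echoing the matched origin instead of sending `*` is also what allows cookie-based sessions to work cross-origin, since browsers refuse credentialed requests when the header is a wildcard.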