File Storage Architecture¶
This document describes the file storage system architecture for the OpenAI Files API compatibility layer, including persistent metadata storage introduced in PR #125.
Table of Contents¶
- Overview
- Problem Statement
- Architecture
- Storage Structure
- Metadata Schema
- Storage Backends
- Authentication and Authorization
- Startup Recovery
- Orphan Detection and Cleanup
- Configuration
- API Endpoints
- Design Decisions
Overview¶
The File Storage system provides OpenAI Files API-compatible file management with persistent metadata storage. Users can upload files for fine-tuning, batch processing, and other purposes, and both the files and their metadata survive server restarts.
Key Features¶
- OpenAI Files API Compatibility: Full support for /v1/files endpoints
- Persistent Metadata: File metadata survives server restarts
- Automatic Recovery: Rebuilds metadata index from sidecar files on startup
- Orphan Management: Detects and cleans up inconsistent file states
- Pluggable Backends: Support for memory and persistent storage backends
Problem Statement¶
Before (In-Memory Storage)¶
Previously, file metadata was stored in an in-memory DashMap:
┌─────────────────────────────────────────────────────┐
│ Server │
│ ┌───────────────────────────────────────────────┐ │
│ │ DashMap<FileId, Metadata> │ │
│ │ ┌─────────┐ ┌─────────┐ ┌─────────┐ │ │
│ │ │ file-1 │ │ file-2 │ │ file-3 │ ... │ │
│ │ └─────────┘ └─────────┘ └─────────┘ │ │
│ └───────────────────────────────────────────────┘ │
│ ↓ │
│ Server Restart │
│ ↓ │
│ ┌───────────────────────────────────────────────┐ │
│ │ DashMap<FileId, Metadata> │ │
│ │ (empty) │ │ ← Data Loss!
│ └───────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────┘
Problems:
- Complete metadata loss on server restart
- Orphaned files on disk with no API access
- Inconsistent state between files and metadata
- No way to recover uploaded files
After (Persistent Storage)¶
With persistent metadata storage:
┌─────────────────────────────────────────────────────┐
│ Server │
│ ┌───────────────────────────────────────────────┐ │
│ │ In-Memory Cache (DashMap) │ │
│ │ ┌─────────┐ ┌─────────┐ ┌─────────┐ │ │
│ │ │ file-1 │ │ file-2 │ │ file-3 │ ... │ │
│ │ └────┬────┘ └────┬────┘ └────┬────┘ │ │
│ └───────┼───────────┼───────────┼───────────────┘ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌───────────────────────────────────────────────┐ │
│ │ File System │ │
│ │ ┌─────────────────────────────────────────┐ │ │
│ │ │ subdir1/ │ │ │
│ │ │ file-abc123.bin ← Data │ │ │
│ │ │ file-abc123.meta.json ← Metadata │ │ │
│ │ │ subdir2/ │ │ │
│ │ │ file-def456.bin │ │ │
│ │ │ file-def456.meta.json │ │ │
│ │ └─────────────────────────────────────────┘ │ │
│ └───────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────┘
↓
Server Restart
↓
┌─────────────────────────────────────────────────────┐
│ Server │
│ ┌───────────────────────────────────────────────┐ │
│ │ Scan .meta.json files → Rebuild Cache │ │ ← Auto Recovery!
│ └───────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────┘
Architecture¶
Component Diagram¶
┌─────────────────────────────────────────────────────────────────┐
│ HTTP Layer │
│ ┌─────────────────────────────────────────────────────────────┐│
│ │ /v1/files Handlers ││
│ │ POST /v1/files GET /v1/files GET /v1/files/:id DELETE ││
│ └──────────────────────────┬──────────────────────────────────┘│
└─────────────────────────────┼───────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ Services Layer │
│ ┌─────────────────────────────────────────────────────────────┐│
│ │ FileService ││
│ │ ┌─────────────┐ ┌─────────────┐ ┌─────────────────────┐ ││
│ │ │ Upload │ │ Delete │ │ List / Retrieve │ ││
│ │ │ Handler │ │ Handler │ │ Handler │ ││
│ │ └──────┬──────┘ └──────┬──────┘ └──────────┬──────────┘ ││
│ │ │ │ │ ││
│ │ ▼ ▼ ▼ ││
│ │ ┌───────────────────────────────────────────────────────┐ ││
│ │ │ Arc<dyn MetadataBackend> │ ││
│ │ └───────────────────────────────────────────────────────┘ ││
│ └──────────────────────────┬──────────────────────────────────┘│
└─────────────────────────────┼───────────────────────────────────┘
│
┌─────────────────┴─────────────────┐
▼ ▼
┌───────────────────────┐ ┌───────────────────────────┐
│ MetadataStore │ │ PersistentMetadataStore │
│ (In-Memory) │ │ (Sidecar JSON) │
├───────────────────────┤ ├───────────────────────────┤
│ • DashMap storage │ │ • DashMap cache │
│ • Fast operations │ │ • JSON file persistence │
│ • No durability │ │ • Startup recovery │
│ │ │ • Orphan detection │
└───────────────────────┘ └───────────────────────────┘
Layer Responsibilities¶
| Layer | Component | Responsibility |
|---|---|---|
| HTTP | Handlers | Request parsing, validation, response formatting |
| Services | FileService | Business logic, coordination |
| Services | MetadataBackend | Metadata storage abstraction |
| Infrastructure | LocalFileStorage | Physical file I/O |
Storage Structure¶
Directory Layout¶
storage/
├── a1b2c/ # Subdirectory (first 5 chars of the file ID after the "file-" prefix)
│ ├── file-a1b2c3d4e5f6.bin # Binary data file
│ └── file-a1b2c3d4e5f6.meta.json # Metadata sidecar file
├── x9y8z/
│ ├── file-x9y8z7w6v5u4.bin
│ └── file-x9y8z7w6v5u4.meta.json
└── ...
File Naming Convention¶
| File Type | Extension | Pattern | Description |
|---|---|---|---|
| Data | .bin | file-{id}.bin | Raw file content |
| Metadata | .meta.json | file-{id}.meta.json | JSON metadata sidecar |
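The layout and naming rules above can be sketched as a small path helper. This is a minimal sketch rather than the project's code: the function name `sidecar_paths` is hypothetical, and it assumes the subdirectory is the first five characters after the `file-` prefix, as the example layout suggests. The alphanumeric check is also an assumption, added so a crafted ID cannot point outside the storage root.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical helper: returns (data path, metadata path) for a file ID
/// such as "file-a1b2c3d4e5f6", or None if the ID is malformed.
pub fn sidecar_paths(root: &Path, file_id: &str) -> Option<(PathBuf, PathBuf)> {
    // Assumption: the subdirectory is named after the first 5 characters
    // of the random part of the ID (the part after "file-").
    let random = file_id.strip_prefix("file-")?;
    // Reject IDs that are too short or contain path separators or dots.
    if random.len() < 5 || !random.chars().all(|c| c.is_ascii_alphanumeric()) {
        return None;
    }
    let dir = root.join(&random[..5]);
    Some((
        dir.join(format!("{file_id}.bin")),
        dir.join(format!("{file_id}.meta.json")),
    ))
}
```

Both paths share one directory, which is what keeps the data file and its sidecar together during backup or manual inspection.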
Sidecar Pattern Benefits¶
- Co-location: Data and metadata are stored together
- Atomic Operations: Metadata writes use atomic rename pattern
- Easy Backup: Simple directory copy preserves everything
- Debug Friendly: Human-readable JSON metadata
- No External Dependencies: No database required
Metadata Schema¶
FileMetadata Structure¶
{
  "id": "file-abc123def456",
  "object": "file",
  "filename": "training_data.jsonl",
  "bytes": 1048576,
  "purpose": "fine-tune",
  "created_at": 1699574400,
  "content_type": "application/jsonl",
  "storage_path": "abc12/file-abc123def456.bin"
}
Field Descriptions¶
| Field | Type | Description |
|---|---|---|
| id | string | Unique file identifier (OpenAI format: file-{random}) |
| object | string | Always "file" for API compatibility |
| filename | string | Original uploaded filename |
| bytes | integer | File size in bytes |
| purpose | string | File purpose: fine-tune, batch, assistants, etc. |
| created_at | integer | Unix timestamp of creation |
| content_type | string | MIME type of the file |
| storage_path | string | Relative path to data file |
Supported Purposes¶
| Purpose | Description |
|---|---|
| fine-tune | Training data for fine-tuning |
| batch | Batch API input files |
| assistants | Files for Assistants API |
| vision | Image files for vision models |
| user_data | General user uploads |
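One way to keep the `purpose` field restricted to the values in this table is to parse it into an enum at the API boundary. The sketch below is illustrative only: the type name `Purpose` and the error message are assumptions, not taken from the codebase.

```rust
use std::fmt;
use std::str::FromStr;

/// The purposes listed in the table above (type name is hypothetical).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Purpose {
    FineTune,
    Batch,
    Assistants,
    Vision,
    UserData,
}

impl Purpose {
    /// String form used in the `purpose` metadata field.
    pub fn as_str(self) -> &'static str {
        match self {
            Purpose::FineTune => "fine-tune",
            Purpose::Batch => "batch",
            Purpose::Assistants => "assistants",
            Purpose::Vision => "vision",
            Purpose::UserData => "user_data",
        }
    }
}

impl FromStr for Purpose {
    type Err = String;

    /// Accepts exactly the documented spellings and rejects everything else.
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "fine-tune" => Ok(Purpose::FineTune),
            "batch" => Ok(Purpose::Batch),
            "assistants" => Ok(Purpose::Assistants),
            "vision" => Ok(Purpose::Vision),
            "user_data" => Ok(Purpose::UserData),
            other => Err(format!("unsupported purpose: {other}")),
        }
    }
}

impl fmt::Display for Purpose {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        f.write_str(self.as_str())
    }
}
```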
Storage Backends¶
MetadataBackend Trait¶
#[async_trait]
pub trait MetadataBackend: Send + Sync {
    async fn insert(&self, metadata: FileMetadata) -> Result<(), FileError>;
    async fn get(&self, id: &str) -> Option<FileMetadata>;
    async fn remove(&self, id: &str) -> Option<FileMetadata>;
    async fn list(&self, query: &FileListQuery) -> Vec<FileMetadata>;
    async fn len(&self) -> usize;
    async fn is_empty(&self) -> bool;
}
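To show how a backend plugs in behind this abstraction, here is a deliberately simplified, synchronous sketch that uses only the standard library. Everything in it is an assumption made for illustration: `SyncMetadataBackend` drops `async_trait`, `list`, and the `FileError` result; `FileMetadata` is trimmed to three fields; and `RwLock<HashMap>` stands in for `DashMap`.

```rust
use std::collections::HashMap;
use std::sync::RwLock;

/// Trimmed-down stand-in for the real `FileMetadata`.
#[derive(Debug, Clone, PartialEq)]
pub struct FileMetadata {
    pub id: String,
    pub filename: String,
    pub bytes: u64,
}

/// Synchronous simplification of `MetadataBackend` (hypothetical name).
pub trait SyncMetadataBackend: Send + Sync {
    fn insert(&self, metadata: FileMetadata);
    fn get(&self, id: &str) -> Option<FileMetadata>;
    fn remove(&self, id: &str) -> Option<FileMetadata>;
    fn len(&self) -> usize;
    fn is_empty(&self) -> bool {
        self.len() == 0
    }
}

/// In-memory backend: fast, but nothing survives a restart.
#[derive(Default)]
pub struct MemoryStore {
    entries: RwLock<HashMap<String, FileMetadata>>,
}

impl SyncMetadataBackend for MemoryStore {
    fn insert(&self, metadata: FileMetadata) {
        // Keyed by file ID; inserting the same ID again overwrites the entry.
        self.entries.write().unwrap().insert(metadata.id.clone(), metadata);
    }
    fn get(&self, id: &str) -> Option<FileMetadata> {
        self.entries.read().unwrap().get(id).cloned()
    }
    fn remove(&self, id: &str) -> Option<FileMetadata> {
        self.entries.write().unwrap().remove(id)
    }
    fn len(&self) -> usize {
        self.entries.read().unwrap().len()
    }
}
```

Because callers hold the store as a trait object (`Arc<dyn ...>` in the component diagram), swapping the in-memory store for a persistent one does not change the service code.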
Backend Comparison¶
| Feature | MetadataStore (Memory) | PersistentMetadataStore |
|---|---|---|
| Persistence | No | Yes |
| Startup Recovery | No | Yes |
| Performance | Fastest | Fast (cached) |
| Orphan Detection | No | Yes |
| Use Case | Development/Testing | Production |
Write Path (Persistent)¶
1. Generate file ID
2. Store data file: storage/{subdir}/file-{id}.bin
3. Create metadata JSON
4. Write to temp file: file-{id}.meta.json.tmp
5. Atomic rename: file-{id}.meta.json.tmp → file-{id}.meta.json
6. Update in-memory cache
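Steps 4 and 5 are the part that makes metadata writes crash-safe, so they are worth a concrete sketch. The function name `write_metadata_atomically` is hypothetical and the JSON is passed in as an already-serialized string; the real implementation may differ, for example in whether it calls `sync_all` before renaming.

```rust
use std::fs::{self, File};
use std::io::{self, Write};
use std::path::{Path, PathBuf};

/// Writes `json` to `meta_path` through a temp file and a rename.
pub fn write_metadata_atomically(meta_path: &Path, json: &str) -> io::Result<()> {
    // Step 4: write to `<name>.meta.json.tmp` in the same directory, so the
    // rename below never crosses a filesystem boundary.
    let mut tmp_name = meta_path.as_os_str().to_owned();
    tmp_name.push(".tmp");
    let tmp_path = PathBuf::from(tmp_name);

    let mut tmp = File::create(&tmp_path)?;
    tmp.write_all(json.as_bytes())?;
    // Flush file contents to disk before the rename makes them visible.
    tmp.sync_all()?;
    drop(tmp);

    // Step 5: the rename swaps the complete file into place in one step,
    // so readers see either the old sidecar or the new one, never a partial write.
    fs::rename(&tmp_path, meta_path)
}
```

A crash before the rename leaves at most a stray `.tmp` file and a data file without a sidecar, which is the "orphaned data" case handled by orphan detection.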
Read Path (Persistent)¶
1. Check in-memory cache (DashMap)
2. If hit → return cached metadata
3. If miss → rebuild from disk (happens only during startup recovery, before the cache is populated)
   a. Scan directory for .meta.json files
   b. Parse and validate each file
   c. Populate cache
Authentication and Authorization¶
The Files API authenticates every request and then authorizes it against file ownership before any file operation runs.
Authentication Methods¶
| Method | Description | Use Case |
|---|---|---|
| api_key (default) | Bearer token authentication | Production environments |
| none | No authentication | Development/testing only |
Authorization Model¶
┌─────────────────────────────────────────────────────────────────┐
│ Files API Request │
└──────────────────────────┬──────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ Authentication Layer │
│ ┌─────────────────────────────────────────────────────────────┐│
│ │ Extract Bearer Token → Validate API Key → Check Scope ││
│ └─────────────────────────────────────────────────────────────┘│
└──────────────────────────┬──────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ Authorization Layer │
│ ┌─────────────────────────────────────────────────────────────┐│
│ │ Check File Ownership → Admin Override → Allow/Deny ││
│ └─────────────────────────────────────────────────────────────┘│
└──────────────────────────┬──────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ File Operation │
└─────────────────────────────────────────────────────────────────┘
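The authentication layer in the diagram (extract Bearer token, validate API key, check scope) can be sketched as a single function. This is an illustration under stated assumptions: the `ApiKey` record, the `AuthError` variants, and the plain `HashMap` key table are hypothetical, and a production lookup would normally compare hashed keys rather than raw strings.

```rust
use std::collections::HashMap;

#[derive(Debug, PartialEq)]
pub enum AuthError {
    MissingToken,
    InvalidKey,
    MissingScope,
}

/// Hypothetical key record: the caller's user ID and the scopes granted to the key.
pub struct ApiKey {
    pub user_id: String,
    pub scopes: Vec<String>,
}

/// Extract Bearer token, validate the API key, then check the required scope.
pub fn authenticate<'a>(
    authorization: Option<&str>,
    keys: &'a HashMap<String, ApiKey>,
    required_scope: &str,
) -> Result<&'a ApiKey, AuthError> {
    // A missing header, a non-Bearer scheme, and an empty token all fail here.
    let token = authorization
        .and_then(|header| header.strip_prefix("Bearer "))
        .map(str::trim)
        .filter(|token| !token.is_empty())
        .ok_or(AuthError::MissingToken)?;
    let key = keys.get(token).ok_or(AuthError::InvalidKey)?;
    if key.scopes.iter().any(|scope| scope == required_scope) {
        Ok(key)
    } else {
        Err(AuthError::MissingScope)
    }
}
```

The returned `ApiKey` carries the `user_id` that the authorization layer then compares against the file's owner.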
File Ownership¶
When enforce_ownership is enabled (default):
| Operation | Owner | Admin | Other Users |
|---|---|---|---|
| Upload | Creates owned file | Creates owned file | Creates owned file |
| List | Own files only | All files | Own files only |
| Get | Own files only | All files | 403 Forbidden |
| Download | Own files only | All files | 403 Forbidden |
| Delete | Own files only | All files | 403 Forbidden |
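The ownership table reduces to one predicate plus a filter for listing. The sketch below is a hedged illustration: `Caller`, `can_access`, and `visible_files` are hypothetical names, `is_admin` is assumed to already reflect whether the admin override is configured, and files without an owner are treated as accessible to any authenticated user, following the legacy-file rule under Security Considerations.

```rust
/// Authenticated caller (hypothetical shape).
pub struct Caller<'a> {
    pub user_id: &'a str,
    pub is_admin: bool,
}

/// Returns true if `caller` may get, download, or delete a file owned by `owner_id`.
/// A `false` result corresponds to the 403 Forbidden cells in the table.
pub fn can_access(caller: &Caller, owner_id: Option<&str>, enforce_ownership: bool) -> bool {
    if !enforce_ownership || caller.is_admin {
        return true;
    }
    match owner_id {
        // Legacy files carry no owner and stay accessible to every authenticated user.
        None => true,
        Some(owner) => owner == caller.user_id,
    }
}

/// List semantics: instead of failing, listing returns only the files the caller may access.
/// Each entry is (file ID, optional owner ID).
pub fn visible_files<'f>(
    caller: &Caller,
    files: &'f [(String, Option<String>)],
    enforce_ownership: bool,
) -> Vec<&'f str> {
    files
        .iter()
        .filter(|(_, owner)| can_access(caller, owner.as_deref(), enforce_ownership))
        .map(|(id, _)| id.as_str())
        .collect()
}
```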
Metadata Fields for Authorization¶
The FileMetadata structure includes ownership fields:
{
  "id": "file-abc123def456",
  "owner_id": "user-xyz789",
  "organization_id": "org-abc123",
  "source_ip": "192.168.1.100",
  "created_at": 1699574400
}
| Field | Description |
|---|---|
| owner_id | User ID who uploaded the file |
| organization_id | Organization the user belongs to |
| source_ip | IP address of the upload request (for audit) |
Audit Logging¶
All file operations are logged with authentication context:
INFO file_uploaded file_id="file-abc123" user_id="user-xyz" org_id="org-abc" client_ip="192.168.1.1"
INFO file_downloaded file_id="file-abc123" user_id="user-xyz"
INFO file_deleted file_id="file-abc123" user_id="user-xyz" client_ip="192.168.1.1"
WARN file_access_denied file_id="file-abc123" user_id="user-xyz" file_owner="user-other"
Security Considerations¶
- Development Keys: Only available when CONTINUUM_DEV_MODE is set or in debug builds
- Scope Requirements: API keys must have the configured scope (default: "files")
- Legacy Files: Files without owner_id are accessible by all authenticated users
- Admin Override: Users with "admin" scope bypass ownership checks if configured
Startup Recovery¶
Recovery Process¶
┌─────────────────────────────────────────────────────┐
│ Server Startup │
└──────────────────────┬──────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────┐
│ Scan storage directory recursively │
│ Find all *.meta.json files │
└──────────────────────┬──────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────┐
│ For each .meta.json file: │
│ 1. Parse JSON content │
│ 2. Validate schema │
│ 3. Check corresponding .bin exists │
│ 4. Add to in-memory cache │
└──────────────────────┬──────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────┐
│ Log recovery statistics: │
│ - Files recovered: N │
│ - Orphans detected: M │
└─────────────────────────────────────────────────────┘
Recovery Guarantees¶
- Idempotent: Safe to run multiple times
- Non-destructive: Never deletes files during recovery
- Partial Success: Continues even if some files are corrupted
- Logged: All recovery actions are logged for debugging
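The recovery pass and its guarantees can be illustrated with a standard-library sketch. It is not the project's implementation: `recover` and `RecoveryStats` are hypothetical names, the returned paths stand in for the cache entries, and JSON parsing is replaced by a crude placeholder check because the standard library has no JSON parser.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

#[derive(Debug, Default, PartialEq)]
pub struct RecoveryStats {
    pub recovered: Vec<PathBuf>, // sidecars that would be loaded into the cache
    pub orphaned_metadata: usize,
    pub corrupted: usize,
}

/// Scans `root` recursively for `*.meta.json` files. Never deletes anything.
pub fn recover(root: &Path) -> io::Result<RecoveryStats> {
    let mut stats = RecoveryStats::default();
    let mut pending = vec![root.to_path_buf()];
    while let Some(dir) = pending.pop() {
        for entry in fs::read_dir(&dir)? {
            let path = entry?.path();
            if path.is_dir() {
                pending.push(path);
                continue;
            }
            let name = path.file_name().and_then(|n| n.to_str()).unwrap_or_default().to_owned();
            let Some(stem) = name.strip_suffix(".meta.json") else { continue };
            // Placeholder for "parse JSON" and "validate schema" (steps 1 and 2).
            let valid = fs::read_to_string(&path)
                .map(|s| s.trim_start().starts_with('{') && s.contains("\"id\""))
                .unwrap_or(false);
            if !valid {
                stats.corrupted += 1; // partial success: count it and keep going
            } else if !path.with_file_name(format!("{stem}.bin")).exists() {
                stats.orphaned_metadata += 1; // step 3 failed: no data file
            } else {
                stats.recovered.push(path); // step 4: add to the cache
            }
        }
    }
    stats.recovered.sort(); // deterministic order regardless of directory iteration
    Ok(stats)
}
```

Because the function only reads, running it twice over an unchanged directory yields the same result, which is the idempotence guarantee listed above.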
Orphan Detection and Cleanup¶
Orphan Types¶
| Type | Description | Cause |
|---|---|---|
| Orphaned Data | .bin file without .meta.json | Crash during upload, manual deletion |
| Orphaned Metadata | .meta.json without .bin | Crash during deletion, disk corruption |
Detection Algorithm¶
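use std::path::PathBuf;

// Simplified: assumes `self.storage_path: PathBuf` and `FileError: From<std::io::Error>`.
pub async fn detect_orphans(&self) -> Result<(Vec<PathBuf>, Vec<PathBuf>), FileError> {
    let mut orphaned_data = Vec::new();
    let mut orphaned_metadata = Vec::new();
    // Scan all files in the storage directory, descending into subdirectories
    let mut pending = vec![self.storage_path.clone()];
    while let Some(dir) = pending.pop() {
        for entry in std::fs::read_dir(&dir)? {
            let path = entry?.path();
            if path.is_dir() {
                pending.push(path);
                continue;
            }
            let name = path.file_name().and_then(|n| n.to_str()).unwrap_or_default().to_owned();
            if let Some(stem) = name.strip_suffix(".meta.json") {
                // Check if corresponding .bin exists
                if !path.with_file_name(format!("{stem}.bin")).exists() {
                    orphaned_metadata.push(path);
                }
            } else if let Some(stem) = name.strip_suffix(".bin") {
                // Check if corresponding .meta.json exists
                if !path.with_file_name(format!("{stem}.meta.json")).exists() {
                    orphaned_data.push(path);
                }
            }
        }
    }
    Ok((orphaned_data, orphaned_metadata))
}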
Cleanup Options¶
| Option | cleanup_orphans_on_startup | Effect |
|---|---|---|
| Disabled (default) | false | Only detect and log orphans |
| Enabled | true | Auto-delete orphaned metadata files |
Warning: Data file cleanup requires manual intervention to prevent accidental data loss.
TOCTOU Safety Note¶
Orphan cleanup is NOT safe during active file operations due to Time-of-Check-Time-of-Use race conditions:
Thread A: detect_orphans() → finds file-X as orphan
Thread B: upload() → creates metadata for file-X
Thread A: cleanup() → deletes "orphan" (now valid!)
Recommendation: Only run cleanup during server startup or maintenance windows.
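If cleanup ever had to run outside a maintenance window, one possible mitigation is to make cleanup and uploads mutually exclusive inside the process. The sketch below is purely illustrative and not something the source describes: `MaintenanceGate` is a hypothetical type built on `std::sync::RwLock`, and an async server would need an async-aware lock instead. It also only protects against races within one process, not against other processes touching the directory.

```rust
use std::sync::RwLock;

/// Hypothetical gate: uploads share it, cleanup needs it exclusively.
#[derive(Default)]
pub struct MaintenanceGate {
    lock: RwLock<()>,
}

impl MaintenanceGate {
    /// Runs an upload while holding a shared guard; many uploads may overlap.
    pub fn during_upload<T>(&self, upload: impl FnOnce() -> T) -> T {
        let _guard = self.lock.read().unwrap();
        upload()
    }

    /// Runs cleanup only if no upload is in flight; returns None otherwise.
    /// Detection and deletion must both happen inside `cleanup`, so no upload
    /// can slip in between the check and the delete.
    pub fn try_cleanup<T>(&self, cleanup: impl FnOnce() -> T) -> Option<T> {
        let _guard = self.lock.try_write().ok()?;
        Some(cleanup())
    }
}
```

Running cleanup only at startup, as recommended above, achieves the same exclusion without any locking because no uploads can be in progress yet.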
Configuration¶
YAML Configuration¶
files:
  enabled: true
  max_file_size: 536870912          # 512 MiB
  storage_path: "./data/files"      # Supports ~ expansion
  retention_days: 0                 # 0 = keep forever
  metadata_storage: persistent      # "memory" or "persistent"
  cleanup_orphans_on_startup: false
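Two of these settings have semantics that are easy to get wrong: `retention_days: 0` means "never expire", and `max_file_size` is a byte count. A minimal sketch of the corresponding checks follows. The function names are hypothetical, and the boundary choices (a file expires once it is at least `retention_days` old; a file exactly at the size limit is accepted) are assumptions rather than documented behavior.

```rust
const SECONDS_PER_DAY: u64 = 86_400;

/// True if a file created at `created_at` (Unix seconds) has outlived `retention_days`.
/// A retention of 0 disables expiry entirely.
pub fn is_expired(created_at: u64, now: u64, retention_days: u64) -> bool {
    if retention_days == 0 {
        return false;
    }
    // Saturating arithmetic keeps clock skew and huge settings from overflowing.
    now.saturating_sub(created_at) >= retention_days.saturating_mul(SECONDS_PER_DAY)
}

/// True if an upload of `size` bytes fits within `max_file_size` bytes.
pub fn within_size_limit(size: u64, max_file_size: u64) -> bool {
    size <= max_file_size
}
```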
Environment Variables¶
| Variable | Default | Description |
|---|---|---|
| CONTINUUM_FILES_ENABLED | true | Enable/disable Files API |
| CONTINUUM_FILES_MAX_SIZE | 536870912 | Max file size (bytes) |
| CONTINUUM_FILES_STORAGE_PATH | ./data/files | Storage directory |
| CONTINUUM_FILES_RETENTION_DAYS | 0 | Auto-delete after N days |
| CONTINUUM_FILES_METADATA_STORAGE | persistent | Backend type |
| CONTINUUM_FILES_CLEANUP_ORPHANS | false | Auto-cleanup on startup |
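As an illustration of how these variables and defaults fit together, the sketch below builds a config struct from a lookup function. The struct and method names are hypothetical, and falling back to the default on an unparsable value is an assumption; the real loader may reject bad values instead.

```rust
#[derive(Debug, Clone, PartialEq)]
pub struct FilesConfig {
    pub enabled: bool,
    pub max_file_size: u64,
    pub storage_path: String,
    pub retention_days: u64,
    pub metadata_storage: String,
    pub cleanup_orphans_on_startup: bool,
}

impl FilesConfig {
    /// Builds the config from a lookup function, so it can be exercised
    /// without touching the process environment.
    pub fn from_lookup(get: impl Fn(&str) -> Option<String>) -> Self {
        // Assumption: unset or unparsable values fall back to the documented default.
        let parse_bool = |key: &str, default: bool| {
            get(key).and_then(|v| v.parse::<bool>().ok()).unwrap_or(default)
        };
        let parse_u64 = |key: &str, default: u64| {
            get(key).and_then(|v| v.parse::<u64>().ok()).unwrap_or(default)
        };
        FilesConfig {
            enabled: parse_bool("CONTINUUM_FILES_ENABLED", true),
            max_file_size: parse_u64("CONTINUUM_FILES_MAX_SIZE", 536_870_912),
            storage_path: get("CONTINUUM_FILES_STORAGE_PATH")
                .unwrap_or_else(|| "./data/files".to_string()),
            retention_days: parse_u64("CONTINUUM_FILES_RETENTION_DAYS", 0),
            metadata_storage: get("CONTINUUM_FILES_METADATA_STORAGE")
                .unwrap_or_else(|| "persistent".to_string()),
            cleanup_orphans_on_startup: parse_bool("CONTINUUM_FILES_CLEANUP_ORPHANS", false),
        }
    }

    /// Reads the same settings from the real process environment.
    pub fn from_env() -> Self {
        Self::from_lookup(|key| std::env::var(key).ok())
    }
}
```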
Storage Backend Selection¶
| Backend | When to Use |
|---|---|
| memory | Development, testing, ephemeral workloads |
| persistent | Production, data durability required |
API Endpoints¶
POST /v1/files¶
Upload a new file.
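# $API_KEY is a placeholder; omit the header when the authentication method is "none".
# -F makes curl set the multipart Content-Type (with boundary) itself.
curl -X POST http://localhost:8080/v1/files \
  -H "Authorization: Bearer $API_KEY" \
  -F "file=@training.jsonl" \
  -F "purpose=fine-tune"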
GET /v1/files¶
List all files.
GET /v1/files/:id¶
Get file metadata.
GET /v1/files/:id/content¶
Download file content.
DELETE /v1/files/:id¶
Delete a file.
Design Decisions¶
Why Sidecar JSON Files?¶
Alternatives Considered:
| Option | Pros | Cons |
|---|---|---|
| SQLite | ACID, queries | Additional dependency, complexity |
| Single JSON file | Simple | Concurrency issues, large file problems |
| RocksDB/LevelDB | Fast, durable | Heavy dependency |
| Sidecar JSON | Simple, no deps, co-located | Many small files |
Decision: Sidecar JSON files were chosen because:
1. Fits existing file-based architecture
2. No additional dependencies (uses existing serde_json)
3. Files and metadata are co-located for easy backup/restore
4. Atomic writes possible with rename pattern
5. Human-readable for debugging
Why In-Memory Cache + Disk?¶
Pattern: Write-through cache with disk persistence
Benefits:
- Sub-millisecond read latency
- Durable writes
- Automatic recovery on restart
Why Not Database?¶
For the Files API use case:
- Typically hundreds to thousands of files, not millions
- Simple key-value access pattern
- No complex queries needed
- Filesystem already provides atomicity guarantees

A database would add:
- Operational complexity
- Additional dependency
- Potential single point of failure
Related Documentation¶
- Configuration Guide - Full configuration reference
- API Documentation - Complete API reference
- Architecture Guide - Overall system architecture