
File Storage Architecture

This document describes the file storage system architecture for the OpenAI Files API compatibility layer, including persistent metadata storage introduced in PR #125.

Overview

The File Storage system provides OpenAI Files API compatible file management with persistent metadata storage. It allows users to upload files for fine-tuning, batch processing, and other purposes while ensuring data durability across server restarts.

Key Features

  • OpenAI Files API Compatibility: Full support for /v1/files endpoints
  • Persistent Metadata: File metadata survives server restarts
  • Automatic Recovery: Rebuilds metadata index from sidecar files on startup
  • Orphan Management: Detects and cleans up inconsistent file states
  • Pluggable Backends: Support for memory and persistent storage backends

Problem Statement

Before (In-Memory Storage)

Previously, file metadata was stored in an in-memory DashMap:

┌─────────────────────────────────────────────────────┐
│                    Server                           │
│  ┌───────────────────────────────────────────────┐  │
│  │           DashMap<FileId, Metadata>           │  │
│  │  ┌─────────┐ ┌─────────┐ ┌─────────┐        │  │
│  │  │ file-1  │ │ file-2  │ │ file-3  │  ...   │  │
│  │  └─────────┘ └─────────┘ └─────────┘        │  │
│  └───────────────────────────────────────────────┘  │
│                       ↓                             │
│              Server Restart                         │
│                       ↓                             │
│  ┌───────────────────────────────────────────────┐  │
│  │           DashMap<FileId, Metadata>           │  │
│  │                  (empty)                      │  │  ← Data Loss!
│  └───────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────┘

Problems:

  • Complete metadata loss on server restart
  • Orphaned files on disk with no API access
  • Inconsistent state between files and metadata
  • No way to recover uploaded files

After (Persistent Storage)

With persistent metadata storage:

┌─────────────────────────────────────────────────────┐
│                    Server                           │
│  ┌───────────────────────────────────────────────┐  │
│  │        In-Memory Cache (DashMap)              │  │
│  │  ┌─────────┐ ┌─────────┐ ┌─────────┐        │  │
│  │  │ file-1  │ │ file-2  │ │ file-3  │  ...   │  │
│  │  └────┬────┘ └────┬────┘ └────┬────┘        │  │
│  └───────┼───────────┼───────────┼───────────────┘  │
│          │           │           │                  │
│          ▼           ▼           ▼                  │
│  ┌───────────────────────────────────────────────┐  │
│  │              File System                      │  │
│  │  ┌─────────────────────────────────────────┐  │  │
│  │  │ subdir1/                                │  │  │
│  │  │   file-abc123.bin      ← Data           │  │  │
│  │  │   file-abc123.meta.json ← Metadata      │  │  │
│  │  │ subdir2/                                │  │  │
│  │  │   file-def456.bin                       │  │  │
│  │  │   file-def456.meta.json                 │  │  │
│  │  └─────────────────────────────────────────┘  │  │
│  └───────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────┘
              Server Restart
┌─────────────────────────────────────────────────────┐
│                    Server                           │
│  ┌───────────────────────────────────────────────┐  │
│  │   Scan .meta.json files → Rebuild Cache      │  │  ← Auto Recovery!
│  └───────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────┘

Architecture

Component Diagram

┌─────────────────────────────────────────────────────────────────┐
│                         HTTP Layer                               │
│  ┌─────────────────────────────────────────────────────────────┐│
│  │                    /v1/files Handlers                       ││
│  │  POST /v1/files  GET /v1/files  GET /v1/files/:id  DELETE  ││
│  └──────────────────────────┬──────────────────────────────────┘│
└─────────────────────────────┼───────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│                       Services Layer                             │
│  ┌─────────────────────────────────────────────────────────────┐│
│  │                      FileService                            ││
│  │  ┌─────────────┐  ┌─────────────┐  ┌─────────────────────┐ ││
│  │  │   Upload    │  │   Delete    │  │   List / Retrieve   │ ││
│  │  │   Handler   │  │   Handler   │  │      Handler        │ ││
│  │  └──────┬──────┘  └──────┬──────┘  └──────────┬──────────┘ ││
│  │         │                │                    │            ││
│  │         ▼                ▼                    ▼            ││
│  │  ┌───────────────────────────────────────────────────────┐ ││
│  │  │              Arc<dyn MetadataBackend>                 │ ││
│  │  └───────────────────────────────────────────────────────┘ ││
│  └──────────────────────────┬──────────────────────────────────┘│
└─────────────────────────────┼───────────────────────────────────┘
            ┌─────────────────┴─────────────────┐
            ▼                                   ▼
┌───────────────────────┐           ┌───────────────────────────┐
│    MetadataStore      │           │  PersistentMetadataStore  │
│    (In-Memory)        │           │    (Sidecar JSON)         │
├───────────────────────┤           ├───────────────────────────┤
│ • DashMap storage     │           │ • DashMap cache           │
│ • Fast operations     │           │ • JSON file persistence   │
│ • No durability       │           │ • Startup recovery        │
│                       │           │ • Orphan detection        │
└───────────────────────┘           └───────────────────────────┘

Layer Responsibilities

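| Layer          | Component        | Responsibility                                   |
|----------------|------------------|--------------------------------------------------|
| HTTP           | Handlers         | Request parsing, validation, response formatting |
| Services       | FileService      | Business logic, coordination                     |
| Services       | MetadataBackend  | Metadata storage abstraction                     |
| Infrastructure | LocalFileStorage | Physical file I/O                                |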

Storage Structure

Directory Layout

storage/
├── a1b2c/                          # Subdirectory (first 5 chars of the ID after "file-")
│   ├── file-a1b2c3d4e5f6.bin       # Binary data file
│   └── file-a1b2c3d4e5f6.meta.json # Metadata sidecar file
├── x9y8z/
│   ├── file-x9y8z7w6v5u4.bin
│   └── file-x9y8z7w6v5u4.meta.json
└── ...

File Naming Convention

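| File Type | Extension  | Pattern             | Description           |
|-----------|------------|---------------------|-----------------------|
| Data      | .bin       | file-{id}.bin       | Raw file content      |
| Metadata  | .meta.json | file-{id}.meta.json | JSON metadata sidecar |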
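
The layout and naming convention above can be sketched as a small path helper. This is a minimal illustration, not the project's actual code: the function name sidecar_paths is hypothetical, and it assumes the shard directory is the first 5 characters of the ID after the "file-" prefix, as the directory layout example suggests.

```rust
use std::path::{Path, PathBuf};

/// Returns (data_path, metadata_path) for an ID such as "file-a1b2c3d4e5f6".
/// Assumption: the shard directory is the first 5 characters after "file-".
fn sidecar_paths(root: &Path, file_id: &str) -> Option<(PathBuf, PathBuf)> {
    let suffix = file_id.strip_prefix("file-")?;
    // Reject IDs that are too short to shard or that contain path characters.
    if suffix.len() < 5 || !suffix.chars().all(|c| c.is_ascii_alphanumeric()) {
        return None;
    }
    let dir = root.join(&suffix[..5]);
    Some((
        dir.join(format!("{file_id}.bin")),
        dir.join(format!("{file_id}.meta.json")),
    ))
}
```

Restricting the ID to ASCII alphanumerics is a defensive choice made for this sketch so that a crafted ID cannot escape the storage root.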

Sidecar Pattern Benefits

  1. Co-location: Data and metadata are stored together
  2. Atomic Operations: Metadata writes use atomic rename pattern
  3. Easy Backup: Simple directory copy preserves everything
  4. Debug Friendly: Human-readable JSON metadata
  5. No External Dependencies: No database required

Metadata Schema

FileMetadata Structure

{
  "id": "file-abc123def456",
  "object": "file",
  "filename": "training_data.jsonl",
  "bytes": 1048576,
  "purpose": "fine-tune",
  "created_at": 1699574400,
  "content_type": "application/jsonl",
  "storage_path": "abc12/file-abc123def456.bin"
}

Field Descriptions

| Field        | Type    | Description                                           |
|--------------|---------|-------------------------------------------------------|
| id           | string  | Unique file identifier (OpenAI format: file-{random}) |
| object       | string  | Always "file" for API compatibility                   |
| filename     | string  | Original uploaded filename                            |
| bytes        | integer | File size in bytes                                    |
| purpose      | string  | File purpose: fine-tune, batch, assistants, etc.      |
| created_at   | integer | Unix timestamp of creation                            |
| content_type | string  | MIME type of the file                                 |
| storage_path | string  | Relative path to data file                            |

Supported Purposes

| Purpose    | Description                    |
|------------|--------------------------------|
| fine-tune  | Training data for fine-tuning  |
| batch      | Batch API input files          |
| assistants | Files for Assistants API       |
| vision     | Image files for vision models  |
| user_data  | General user uploads           |
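
One way to validate the purpose field at the API boundary is to parse it into an enum. The sketch below is illustrative only: the Purpose type and its methods are hypothetical names, and rejecting unknown values is an assumption (the real service may accept purposes beyond the five listed).

```rust
use std::str::FromStr;

/// Hypothetical enum mirroring the purposes listed in the table above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Purpose {
    FineTune,
    Batch,
    Assistants,
    Vision,
    UserData,
}

impl Purpose {
    /// Wire representation, as it appears in the `purpose` JSON field.
    fn as_str(self) -> &'static str {
        match self {
            Purpose::FineTune => "fine-tune",
            Purpose::Batch => "batch",
            Purpose::Assistants => "assistants",
            Purpose::Vision => "vision",
            Purpose::UserData => "user_data",
        }
    }
}

impl FromStr for Purpose {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "fine-tune" => Ok(Purpose::FineTune),
            "batch" => Ok(Purpose::Batch),
            "assistants" => Ok(Purpose::Assistants),
            "vision" => Ok(Purpose::Vision),
            "user_data" => Ok(Purpose::UserData),
            // Assumption: anything else is rejected rather than passed through.
            other => Err(format!("unsupported purpose: {other}")),
        }
    }
}
```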

Storage Backends

MetadataBackend Trait

#[async_trait]
pub trait MetadataBackend: Send + Sync {
    async fn insert(&self, metadata: FileMetadata) -> Result<(), FileError>;
    async fn get(&self, id: &str) -> Option<FileMetadata>;
    async fn remove(&self, id: &str) -> Option<FileMetadata>;
    async fn list(&self, query: &FileListQuery) -> Vec<FileMetadata>;
    async fn len(&self) -> usize;
    async fn is_empty(&self) -> bool;
}
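
To make the shape of the trait concrete, here is a synchronous, std-only stand-in for the in-memory backend. It is a sketch under several assumptions: it uses RwLock<HashMap> instead of DashMap, drops async and FileError, and the FileMetadata and FileListQuery definitions are trimmed-down, hypothetical versions (the query fields mirror the purpose and limit parameters of GET /v1/files).

```rust
use std::collections::HashMap;
use std::sync::RwLock;

/// Trimmed-down metadata record (hypothetical subset of the real schema).
#[derive(Debug, Clone, PartialEq)]
struct FileMetadata {
    id: String,
    bytes: u64,
    purpose: String,
    created_at: i64,
}

/// Hypothetical query shape with an optional purpose filter and result limit.
#[derive(Default)]
struct FileListQuery {
    purpose: Option<String>,
    limit: Option<usize>,
}

/// In-memory store: fast, but everything is lost on restart.
#[derive(Default)]
struct MemoryStore {
    files: RwLock<HashMap<String, FileMetadata>>,
}

impl MemoryStore {
    fn insert(&self, metadata: FileMetadata) {
        let mut files = self.files.write().unwrap();
        files.insert(metadata.id.clone(), metadata);
    }

    fn get(&self, id: &str) -> Option<FileMetadata> {
        self.files.read().unwrap().get(id).cloned()
    }

    fn remove(&self, id: &str) -> Option<FileMetadata> {
        self.files.write().unwrap().remove(id)
    }

    fn list(&self, query: &FileListQuery) -> Vec<FileMetadata> {
        let files = self.files.read().unwrap();
        let mut out: Vec<FileMetadata> = files
            .values()
            .filter(|m| query.purpose.as_deref().map_or(true, |p| m.purpose == p))
            .cloned()
            .collect();
        // HashMap order is arbitrary; sort newest first (ties by ID) for stable output.
        out.sort_by(|a, b| b.created_at.cmp(&a.created_at).then_with(|| a.id.cmp(&b.id)));
        out.truncate(query.limit.unwrap_or(usize::MAX));
        out
    }

    fn len(&self) -> usize {
        self.files.read().unwrap().len()
    }

    fn is_empty(&self) -> bool {
        self.len() == 0
    }
}
```

Returning owned clones from get and list keeps lock scopes short, which matches the trait's by-value return types.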

Backend Comparison

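| Feature          | MetadataStore (Memory) | PersistentMetadataStore |
|------------------|------------------------|-------------------------|
| Persistence      | No                     | Yes                     |
| Startup Recovery | No                     | Yes                     |
| Performance      | Fastest                | Fast (cached)           |
| Orphan Detection | No                     | Yes                     |
| Use Case         | Development/Testing    | Production              |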

Write Path (Persistent)

1. Generate file ID
2. Store data file: storage/{subdir}/file-{id}.bin
3. Create metadata JSON
4. Write to temp file: file-{id}.meta.json.tmp
5. Atomic rename: file-{id}.meta.json.tmp → file-{id}.meta.json
6. Update in-memory cache
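
Steps 4 and 5 of the write path (temp file, then rename) can be sketched with the standard library alone. This is an illustration rather than the project's implementation: the function name write_sidecar_atomically is hypothetical, the JSON is passed in as an already serialized string, and the atomicity of the rename relies on both paths being in the same directory on a POSIX-style filesystem.

```rust
use std::fs::{self, File};
use std::io::{self, Write};
use std::path::Path;

/// Writes `json` to `<dir>/<file_id>.meta.json` via a temp file and a rename,
/// so readers never observe a partially written sidecar.
fn write_sidecar_atomically(dir: &Path, file_id: &str, json: &str) -> io::Result<()> {
    fs::create_dir_all(dir)?;
    let tmp = dir.join(format!("{file_id}.meta.json.tmp"));
    let dst = dir.join(format!("{file_id}.meta.json"));

    let mut file = File::create(&tmp)?;
    file.write_all(json.as_bytes())?;
    // Flush contents to disk before the rename makes them visible.
    file.sync_all()?;
    drop(file);

    // Rename replaces any existing sidecar in a single step.
    fs::rename(&tmp, &dst)
}
```

If the process crashes before the rename, only the .tmp file is left behind and the previous sidecar (if any) stays intact.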

Read Path (Persistent)

1. Check in-memory cache (DashMap)
2. If hit → return cached metadata
3. If miss → the cache is rebuilt from disk, which happens only during startup recovery:
   a. Scan directory for .meta.json files
   b. Parse and validate each file
   c. Populate cache

Authentication and Authorization

Every Files API request is authenticated first and then authorized against file ownership before the file operation runs.

Authentication Methods

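| Method            | Description                 | Use Case                 |
|-------------------|-----------------------------|--------------------------|
| api_key (default) | Bearer token authentication | Production environments  |
| none              | No authentication           | Development/testing only |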

Authorization Model

┌─────────────────────────────────────────────────────────────────┐
│                    Files API Request                             │
└──────────────────────────┬──────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│                   Authentication Layer                           │
│  ┌─────────────────────────────────────────────────────────────┐│
│  │  Extract Bearer Token → Validate API Key → Check Scope      ││
│  └─────────────────────────────────────────────────────────────┘│
└──────────────────────────┬──────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│                   Authorization Layer                            │
│  ┌─────────────────────────────────────────────────────────────┐│
│  │  Check File Ownership → Admin Override → Allow/Deny         ││
│  └─────────────────────────────────────────────────────────────┘│
└──────────────────────────┬──────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│                      File Operation                              │
└─────────────────────────────────────────────────────────────────┘
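
The first and last steps of the authentication layer (extract the Bearer token, check the scope) can be sketched as two small functions. The names bearer_token and has_scope are hypothetical, and key validation itself is omitted because the document does not describe how keys are stored.

```rust
/// Extracts the token from an `Authorization` header value such as "Bearer sk-123".
/// Returns None for other schemes or for an empty token.
fn bearer_token(header_value: &str) -> Option<&str> {
    let (scheme, token) = header_value.trim().split_once(' ')?;
    let token = token.trim();
    if scheme.eq_ignore_ascii_case("bearer") && !token.is_empty() {
        Some(token)
    } else {
        None
    }
}

/// Checks that the scopes attached to a validated key include the required one
/// (the default required scope in this document is "files").
fn has_scope(scopes: &[&str], required: &str) -> bool {
    scopes.iter().any(|s| *s == required)
}
```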

File Ownership

When enforce_ownership is enabled (default):

| Operation | Owner              | Admin              | Other Users        |
|-----------|--------------------|--------------------|--------------------|
| Upload    | Creates owned file | Creates owned file | Creates owned file |
| List      | Own files only     | All files          | Own files only     |
| Get       | Own files only     | All files          | 403 Forbidden      |
| Download  | Own files only     | All files          | 403 Forbidden      |
| Delete    | Own files only     | All files          | 403 Forbidden      |

Metadata Fields for Authorization

The FileMetadata structure includes ownership fields:

{
  "id": "file-abc123def456",
  "owner_id": "user-xyz789",
  "organization_id": "org-abc123",
  "source_ip": "192.168.1.100",
  "created_at": 1699574400
}

| Field           | Description                                  |
|-----------------|----------------------------------------------|
| owner_id        | User ID who uploaded the file                |
| organization_id | Organization the user belongs to             |
| source_ip       | IP address of the upload request (for audit) |

Audit Logging

All file operations are logged with authentication context:

INFO file_uploaded file_id="file-abc123" user_id="user-xyz" org_id="org-abc" client_ip="192.168.1.1"
INFO file_downloaded file_id="file-abc123" user_id="user-xyz"
INFO file_deleted file_id="file-abc123" user_id="user-xyz" client_ip="192.168.1.1"
WARN file_access_denied file_id="file-abc123" user_id="user-xyz" file_owner="user-other"

Security Considerations

  1. Development Keys: Only available when CONTINUUM_DEV_MODE is set or in debug builds
  2. Scope Requirements: API keys must have the configured scope (default: "files")
  3. Legacy Files: Files without owner_id are accessible by all authenticated users
  4. Admin Override: Users with "admin" scope bypass ownership checks if configured
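
The ownership table and the legacy-file and admin-override rules combine into a small decision function for Get, Download, and Delete. This is a sketch: all type and field names here are hypothetical, and treating a disabled enforce_ownership as "allow every authenticated user" is an assumption, since the document only describes the enabled case.

```rust
#[derive(Debug, PartialEq)]
enum Access {
    Allow,
    Deny, // surfaces as 403 Forbidden
}

/// Hypothetical view of the authenticated caller.
struct Requester<'a> {
    user_id: &'a str,
    is_admin: bool, // key carries the "admin" scope
}

/// Hypothetical subset of the authorization settings.
struct AuthzConfig {
    enforce_ownership: bool,
    admin_override: bool,
}

fn authorize(cfg: &AuthzConfig, who: &Requester, file_owner: Option<&str>) -> Access {
    // Assumption: with ownership enforcement off, any authenticated user may access.
    if !cfg.enforce_ownership {
        return Access::Allow;
    }
    match file_owner {
        // Legacy files without owner_id are open to all authenticated users.
        None => Access::Allow,
        Some(owner) if owner == who.user_id => Access::Allow,
        // Admins bypass the ownership check only when the override is configured.
        Some(_) if cfg.admin_override && who.is_admin => Access::Allow,
        Some(_) => Access::Deny,
    }
}
```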

Startup Recovery

Recovery Process

┌─────────────────────────────────────────────────────┐
│                 Server Startup                       │
└──────────────────────┬──────────────────────────────┘
┌─────────────────────────────────────────────────────┐
│         Scan storage directory recursively          │
│         Find all *.meta.json files                  │
└──────────────────────┬──────────────────────────────┘
┌─────────────────────────────────────────────────────┐
│         For each .meta.json file:                   │
│         1. Parse JSON content                       │
│         2. Validate schema                          │
│         3. Check corresponding .bin exists          │
│         4. Add to in-memory cache                   │
└──────────────────────┬──────────────────────────────┘
┌─────────────────────────────────────────────────────┐
│         Log recovery statistics:                    │
│         - Files recovered: N                        │
│         - Orphans detected: M                       │
└─────────────────────────────────────────────────────┘

Recovery Guarantees

  1. Idempotent: Safe to run multiple times
  2. Non-destructive: Never deletes files during recovery
  3. Partial Success: Continues even if some files are corrupted
  4. Logged: All recovery actions are logged for debugging
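
The recovery loop and its guarantees can be illustrated with a std-only scan. This is a simplified sketch, not the actual recovery code: the recover function and RecoveryStats type are hypothetical, and JSON parsing plus schema validation are represented by a caller-supplied parse closure that returns the file ID on success.

```rust
use std::fs;
use std::path::Path;

#[derive(Debug, Default, PartialEq)]
struct RecoveryStats {
    recovered: usize,
    orphaned_metadata: usize,
    corrupted: usize,
}

/// Walks `root` recursively and returns the IDs of all valid sidecars.
/// Nothing is deleted, and bad entries are counted instead of aborting the scan.
fn recover(root: &Path, parse: impl Fn(&str) -> Option<String>) -> (Vec<String>, RecoveryStats) {
    let mut ids = Vec::new();
    let mut stats = RecoveryStats::default();
    let mut pending = vec![root.to_path_buf()];

    while let Some(dir) = pending.pop() {
        // Unreadable directories are skipped so recovery can continue.
        let Ok(entries) = fs::read_dir(&dir) else { continue };
        for entry in entries.flatten() {
            let path = entry.path();
            if path.is_dir() {
                pending.push(path);
                continue;
            }
            let Some(stem) = path
                .file_name()
                .and_then(|n| n.to_str())
                .and_then(|n| n.strip_suffix(".meta.json"))
            else {
                continue; // not a sidecar
            };
            let parsed = fs::read_to_string(&path).ok().and_then(|text| parse(&text));
            match parsed {
                None => stats.corrupted += 1,
                // Valid metadata whose data file is missing is an orphan.
                Some(_) if !path.with_file_name(format!("{stem}.bin")).exists() => {
                    stats.orphaned_metadata += 1
                }
                Some(id) => {
                    ids.push(id);
                    stats.recovered += 1;
                }
            }
        }
    }
    (ids, stats)
}
```

Because the function only reads from disk and builds its result from scratch, running it again on the same directory yields the same IDs and counts.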

Orphan Detection and Cleanup

Orphan Types

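| Type              | Description                  | Cause                                  |
|-------------------|------------------------------|----------------------------------------|
| Orphaned Data     | .bin file without .meta.json | Crash during upload, manual deletion   |
| Orphaned Metadata | .meta.json without .bin      | Crash during deletion, disk corruption |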

Detection Algorithm

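use std::path::PathBuf;

// Sketch only. Assumes a Tokio runtime, a `self.storage_path: PathBuf` field,
// and `FileError: From<std::io::Error>`; none of these are confirmed here.
pub async fn detect_orphans(&self) -> Result<(Vec<PathBuf>, Vec<PathBuf>), FileError> {
    let mut orphaned_data = Vec::new();
    let mut orphaned_metadata = Vec::new();

    // Walk each shard subdirectory under the storage root.
    let mut subdirs = tokio::fs::read_dir(&self.storage_path).await?;
    while let Some(subdir) = subdirs.next_entry().await? {
        if !subdir.file_type().await?.is_dir() {
            continue;
        }
        let mut files = tokio::fs::read_dir(subdir.path()).await?;
        while let Some(entry) = files.next_entry().await? {
            let path = entry.path();
            let Some(name) = path.file_name().and_then(|n| n.to_str()) else { continue };
            if let Some(stem) = name.strip_suffix(".meta.json") {
                // Metadata whose .bin data file is missing
                if !path.with_file_name(format!("{stem}.bin")).exists() {
                    orphaned_metadata.push(path);
                }
            } else if let Some(stem) = name.strip_suffix(".bin") {
                // Data file whose .meta.json sidecar is missing
                if !path.with_file_name(format!("{stem}.meta.json")).exists() {
                    orphaned_data.push(path);
                }
            }
        }
    }
    Ok((orphaned_data, orphaned_metadata))
}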

Cleanup Options

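| Option             | cleanup_orphans_on_startup | Effect                              |
|--------------------|----------------------------|-------------------------------------|
| Disabled (default) | false                      | Only detect and log orphans         |
| Enabled            | true                       | Auto-delete orphaned metadata files |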

Warning: Data file cleanup requires manual intervention to prevent accidental data loss.
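
The two cleanup modes can be sketched as a function that only ever removes metadata sidecars. The name cleanup_orphaned_metadata is hypothetical, and logging through eprintln! is a stand-in for whatever logger the server actually uses.

```rust
use std::fs;
use std::path::PathBuf;

/// Removes orphaned `.meta.json` files when `enabled` is true and returns the
/// number deleted. Data files are never touched, even if passed in by mistake.
fn cleanup_orphaned_metadata(orphans: &[PathBuf], enabled: bool) -> usize {
    let mut removed = 0;
    for path in orphans {
        let is_sidecar = path
            .file_name()
            .and_then(|n| n.to_str())
            .map_or(false, |n| n.ends_with(".meta.json"));
        if !is_sidecar {
            continue; // data file cleanup stays a manual step
        }
        if !enabled {
            eprintln!("orphaned metadata detected (kept): {}", path.display());
            continue;
        }
        match fs::remove_file(path) {
            Ok(()) => removed += 1,
            Err(err) => eprintln!("failed to remove {}: {err}", path.display()),
        }
    }
    removed
}
```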

TOCTOU Safety Note

Orphan cleanup is NOT safe during active file operations due to Time-of-Check-Time-of-Use race conditions:

Thread A: detect_orphans() → finds file-X as orphan
Thread B: upload() → creates metadata for file-X
Thread A: cleanup() → deletes "orphan" (now valid!)

Recommendation: Only run cleanup during server startup or maintenance windows.

Configuration

YAML Configuration

files:
  enabled: true
  max_file_size: 536870912        # 512 MiB
  storage_path: "./data/files"    # Supports ~ expansion
  retention_days: 0               # 0 = keep forever
  metadata_storage: persistent    # "memory" or "persistent"
  cleanup_orphans_on_startup: false
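
The retention_days setting, where 0 means "keep forever", can be expressed as a simple expiry check. This is a sketch under an assumption the document does not state: that a file's age is measured from its created_at timestamp. The function name is_expired is hypothetical.

```rust
const SECONDS_PER_DAY: u64 = 86_400;

/// Returns true when a file created at `created_at` (Unix seconds) has outlived
/// the retention window at time `now`. A retention of 0 disables expiry.
fn is_expired(created_at: u64, now: u64, retention_days: u64) -> bool {
    if retention_days == 0 {
        return false;
    }
    let max_age = retention_days.saturating_mul(SECONDS_PER_DAY);
    // saturating_sub guards against clock skew where now < created_at.
    now.saturating_sub(created_at) >= max_age
}
```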

Environment Variables

| Variable                         | Default      | Description              |
|----------------------------------|--------------|--------------------------|
| CONTINUUM_FILES_ENABLED          | true         | Enable/disable Files API |
| CONTINUUM_FILES_MAX_SIZE         | 536870912    | Max file size (bytes)    |
| CONTINUUM_FILES_STORAGE_PATH     | ./data/files | Storage directory        |
| CONTINUUM_FILES_RETENTION_DAYS   | 0            | Auto-delete after N days |
| CONTINUUM_FILES_METADATA_STORAGE | persistent   | Backend type             |
| CONTINUUM_FILES_CLEANUP_ORPHANS  | false        | Auto-cleanup on startup  |
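
Loading these variables with their defaults might look like the sketch below. The FilesConfig struct and from_lookup constructor are hypothetical; the lookup is injected as a closure so the parsing logic can be exercised without touching the process environment. Boolean values are assumed to be the literal strings "true" or "false".

```rust
use std::str::FromStr;

#[derive(Debug, PartialEq)]
struct FilesConfig {
    enabled: bool,
    max_file_size: u64,
    storage_path: String,
    retention_days: u64,
    metadata_storage: String,
    cleanup_orphans_on_startup: bool,
}

/// Parses an optional raw value, falling back to `default` when unset.
fn parse_or<T: FromStr>(name: &str, raw: Option<String>, default: T) -> Result<T, String> {
    match raw {
        None => Ok(default),
        Some(v) => v
            .trim()
            .parse::<T>()
            .map_err(|_| format!("invalid value for {name}: {v:?}")),
    }
}

impl FilesConfig {
    fn from_lookup(get: impl Fn(&str) -> Option<String>) -> Result<Self, String> {
        let metadata_storage = get("CONTINUUM_FILES_METADATA_STORAGE")
            .unwrap_or_else(|| "persistent".to_string());
        if metadata_storage != "memory" && metadata_storage != "persistent" {
            return Err(format!("unknown metadata backend: {metadata_storage:?}"));
        }
        Ok(Self {
            enabled: parse_or("CONTINUUM_FILES_ENABLED", get("CONTINUUM_FILES_ENABLED"), true)?,
            max_file_size: parse_or(
                "CONTINUUM_FILES_MAX_SIZE",
                get("CONTINUUM_FILES_MAX_SIZE"),
                536_870_912,
            )?,
            storage_path: get("CONTINUUM_FILES_STORAGE_PATH")
                .unwrap_or_else(|| "./data/files".to_string()),
            retention_days: parse_or(
                "CONTINUUM_FILES_RETENTION_DAYS",
                get("CONTINUUM_FILES_RETENTION_DAYS"),
                0,
            )?,
            metadata_storage,
            cleanup_orphans_on_startup: parse_or(
                "CONTINUUM_FILES_CLEANUP_ORPHANS",
                get("CONTINUUM_FILES_CLEANUP_ORPHANS"),
                false,
            )?,
        })
    }
}
```

In a real process the lookup would typically be `|name| std::env::var(name).ok()`.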

Storage Backend Selection

| Backend    | When to Use                               |
|------------|-------------------------------------------|
| memory     | Development, testing, ephemeral workloads |
| persistent | Production, data durability required      |

API Endpoints

POST /v1/files

Upload a new file.

curl -X POST http://localhost:8080/v1/files \
  -H "Content-Type: multipart/form-data" \
  -F "file=@training.jsonl" \
  -F "purpose=fine-tune"

GET /v1/files

List all files.

curl "http://localhost:8080/v1/files?purpose=fine-tune&limit=10"

GET /v1/files/:id

Get file metadata.

curl http://localhost:8080/v1/files/file-abc123

GET /v1/files/:id/content

Download file content.

curl http://localhost:8080/v1/files/file-abc123/content -o downloaded.jsonl

DELETE /v1/files/:id

Delete a file.

curl -X DELETE http://localhost:8080/v1/files/file-abc123

Design Decisions

Why Sidecar JSON Files?

Alternatives Considered:

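| Option           | Pros                        | Cons                                    |
|------------------|-----------------------------|-----------------------------------------|
| SQLite           | ACID, queries               | Additional dependency, complexity       |
| Single JSON file | Simple                      | Concurrency issues, large file problems |
| RocksDB/LevelDB  | Fast, durable               | Heavy dependency                        |
| Sidecar JSON     | Simple, no deps, co-located | Many small files                        |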

Decision: Sidecar JSON files were chosen because:

  1. Fits existing file-based architecture
  2. No additional dependencies (uses existing serde_json)
  3. Files and metadata are co-located for easy backup/restore
  4. Atomic writes possible with rename pattern
  5. Human-readable for debugging

Why In-Memory Cache + Disk?

Pattern: Write-through cache with disk persistence

Write: Disk, then Cache (synchronous)
Read:  Cache (fast path)

Benefits:

  • Sub-millisecond read latency
  • Durable writes
  • Automatic recovery on restart

Why Not Database?

For the Files API use case:

  • Typically hundreds to thousands of files, not millions
  • Simple key-value access pattern
  • No complex queries needed
  • Filesystem already provides atomicity guarantees

A database would add:

  • Operational complexity
  • Additional dependency
  • Potential single point of failure