Designing a Planet-Scale Distributed Object Storage System: Engineering Deep Dive

• 45 min read
Distributed Systems

Building a planet-scale distributed object storage system from scratch requires making fundamental engineering decisions at every layer. Rather than describing an existing system, we will explore multiple approaches to each problem, understand their trade-offs, and make informed choices based on requirements. This deep dive examines the engineering challenges, design alternatives, and architectural decisions needed to build such a system.

Core Design Challenges

Before diving into solutions, we must understand the fundamental challenges:

Challenge 1: Durability vs Cost: How do we ensure data survives multiple failures without excessive storage overhead?

Challenge 2: Consistency vs Availability: How do we balance correctness with system availability during failures?

Challenge 3: Metadata Scalability: How do we scale metadata operations to handle billions of objects?

Challenge 4: Data Placement: How do we distribute data across nodes while maintaining load balance and fault tolerance?

Challenge 5: Performance vs Consistency: How do we optimize for low latency while maintaining correctness guarantees?

Each challenge has multiple solutions, each with distinct trade-offs. Let’s explore them systematically.

Replication Strategies: A Deep Dive

Replication is fundamental to durability and availability. Different strategies offer different guarantees, performance characteristics, and complexity.

Approach 1: Synchronous Replication

How It Works: Write is acknowledged only after data is replicated to N nodes.

Client          Coordinator      Node 1      Node 2      Node 3
------          ----------      ------      ------      ------
  |--Write------->|                |            |            |
  |                |--Replicate---->|            |            |
  |                |--Replicate------------------>|            |
  |                |--Replicate------------------------------>|
  |                |                |            |            |
  |                |<--ACK----------|            |            |
  |                |<--ACK-----------------------|            |
  |                |<--ACK-------------------------------------|
  |                |                |            |            |
  |                | Wait for N ACKs (e.g., N=3) |            |
  |<--Success------|                |            |            |

Pros:

  • Strong Durability: Data guaranteed to be on N nodes before acknowledgment
  • No Data Loss: If client receives success, data is safely stored
  • Consistent View: All replicas have same data immediately

Cons:

  • High Write Latency: Must wait for slowest replica (tail latency problem)
  • Lower Availability: System unavailable if N nodes unavailable
  • Network Sensitivity: Performance degrades with network issues

When to Use:

  • Critical data requiring strong durability guarantees
  • Low write throughput acceptable
  • Strong consistency requirements

Real-World Example: Financial transaction logs, critical metadata.
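
To make the write path concrete, here is a minimal sketch of the wait-for-all-ACKs behavior. In-memory dictionaries stand in for replica nodes and a thread pool stands in for parallel replication RPCs; all class and method names are illustrative, not any real system's API.

from concurrent.futures import ThreadPoolExecutor

class SyncReplicatedStore:
    def __init__(self, num_replicas=3):
        self.replicas = [dict() for _ in range(num_replicas)]
        self.pool = ThreadPoolExecutor(max_workers=num_replicas)

    def _replicate(self, replica, key, value):
        replica[key] = value           # stand-in for a network write
        return True                    # stand-in for the replica's ACK

    def put(self, key, value):
        # Fan out to every replica in parallel, then block until all N ACK.
        futures = [self.pool.submit(self._replicate, r, key, value)
                   for r in self.replicas]
        if not all(f.result() for f in futures):
            raise IOError("replication failed")
        return "OK"                    # acknowledged only after N ACKs

store = SyncReplicatedStore(num_replicas=3)
print(store.put("object-1", b"payload"))   # OK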

Approach 2: Asynchronous Replication

How It Works: Write acknowledged immediately, replication happens in background.

Client          Primary Node    Secondary 1    Secondary 2
------          -----------    -----------    -----------
  |--Write------->|                |                |
  |                | Write locally  |                |
  |<--Success------|                |                |
  |                |                |                |
  |                |--Async replicate-------------->|
  |                |--Async replicate----------------------->|
  |                |                |                |
  |                | (Background replication)        |

Pros:

  • Low Write Latency: Immediate acknowledgment
  • High Throughput: Not limited by replication speed
  • Better Availability: System remains available even if replicas lag

Cons:

  • Potential Data Loss: If primary fails before replication completes
  • Eventual Consistency: Replicas may lag behind primary
  • Complex Failure Handling: Must handle partial replication on failure

When to Use:

  • High write throughput requirements
  • Acceptable to lose recent writes in failure scenarios
  • Eventual consistency acceptable

Real-World Example: Log aggregation, metrics collection, non-critical data.
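
A minimal sketch of the same flow in code, with an in-process queue and a background thread standing in for the replication pipeline (illustrative names only). Note that a primary crash before the queue drains would lose the queued writes, which is exactly the trade-off described above.

import queue, threading

class AsyncReplicatedStore:
    def __init__(self, num_secondaries=2):
        self.primary = {}
        self.secondaries = [dict() for _ in range(num_secondaries)]
        self.backlog = queue.Queue()
        threading.Thread(target=self._replicator, daemon=True).start()

    def _replicator(self):
        while True:
            key, value = self.backlog.get()
            for secondary in self.secondaries:   # stand-in for shipping over the network
                secondary[key] = value
            self.backlog.task_done()

    def put(self, key, value):
        self.primary[key] = value
        self.backlog.put((key, value))   # replicate in the background
        return "OK"                      # acknowledged before replication finishes

store = AsyncReplicatedStore()
store.put("object-1", b"payload")
store.backlog.join()                     # wait here only to make the demo deterministic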

Approach 3: Chain Replication

How It Works: Replicas form a chain: writes flow from head to tail, and reads are served by the tail.

Client          Head Node      Middle Node    Tail Node
------          ---------      -----------    ---------
  |--Write------->|                |                |
  |                |--Forward------>|                |
  |                |                |--Forward------>|
  |                |                |                |
  |                |                |<--ACK---------|
  |                |<--ACK----------|                |
  |<--Success------|                |                |
  |                |                |                |
  |--Read------------------------------------------------->|
  |                |                |                |
  |<--Data--------------------------------------------------|

Pros:

  • Strong Consistency: Tail always has committed data
  • Read Optimization: Reads from tail, no coordination needed
  • Ordering Guarantee: Writes processed in order

Cons:

  • Write Latency: Must traverse entire chain
  • Failure Sensitivity: A failure of any node in the chain blocks writes until the chain is reconfigured
  • Recovery Complexity: Must rebuild chain on failure

When to Use:

  • Strong consistency requirements
  • Read-heavy workloads
  • Acceptable write latency

Real-World Example: Strongly consistent key-value stores, configuration management.

Approach 4: CRAQ (Chain Replication with Apportioned Queries)

How It Works: Chain replication where all nodes can serve reads (with version tracking).

Client          Head Node      Middle Node    Tail Node
------          ---------      -----------    ---------
  |--Write------->|                |                |
  |                |--Forward------>|                |
  |                |                |--Forward------>|
  |                |                |                |
  |                |                |<--ACK---------|
  |                |<--ACK----------|                |
  |<--Success------|                |                |
  |                |                |                |
  |--Read--------->| (Can read from any node)        |
  |<--Data---------|                |                |

Pros:

  • Read Scalability: All nodes serve reads
  • Strong Consistency: Writes still strongly consistent
  • Load Distribution: Reads distributed across chain

Cons:

  • Version Tracking Overhead: Must track committed vs uncommitted versions
  • Complex Read Logic: Must check version before serving read
  • Write Latency: Still limited by chain traversal

When to Use:

  • Read-heavy workloads requiring strong consistency
  • Need to scale reads independently
  • Acceptable write latency

Real-World Example: CRAQ paper implementation, some distributed databases.
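
A rough sketch of the clean/dirty version tracking that makes apportioned reads safe. Class and field names are made up, and CRAQ's chain maintenance and failure handling are omitted entirely.

class CraqNode:
    def __init__(self, tail=None):
        self.versions = {}      # key -> {version: value}
        self.clean = {}         # key -> highest committed version
        self.tail = tail        # None means this node is the tail

    def write(self, key, version, value):
        self.versions.setdefault(key, {})[version] = value   # dirty until committed

    def commit(self, key, version):
        self.clean[key] = version                             # mark this version clean

    def read(self, key):
        latest = max(self.versions[key])
        if latest == self.clean.get(key):
            return self.versions[key][latest]                 # clean: serve locally
        committed = self.tail.clean[key] if self.tail else self.clean.get(key)
        return self.versions[key][committed]                  # dirty: ask the tail

tail = CraqNode()
mid = CraqNode(tail=tail)
for node in (tail, mid):
    node.write("k", 1, "v1")
    node.commit("k", 1)
mid.write("k", 2, "v2")      # version 2 has not been committed by the tail yet
print(mid.read("k"))         # "v1": the middle node serves the committed version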

Approach 5: Quorum-Based Replication

How It Works: Write to W nodes, read from R nodes, where W + R > N (total replicas).

Client          Node 1      Node 2      Node 3      Node 4      Node 5
------          ------      ------      ------      ------      ------
  |--Write------->|            |            |            |            |
  |                |--Write---->|            |            |            |
  |                |--Write------------------>|            |            |
  |                |--Write------------------------------>|            |
  |                |            |            |            |            |
  |                |<--ACK------|            |            |            |
  |                |<--ACK-------------------|            |            |
  |                |<--ACK--------------------------------|            |
  |                |            |            |            |            |
  |                | Wait for W=3 ACKs       |            |            |
  |<--Success------|            |            |            |            |
  |                |            |            |            |            |
  |--Read--------->|            |            |            |            |
  |                |--Read----->|            |            |            |
  |                |--Read------------------>|            |            |
  |                |<--Data-----|            |            |            |
  |                |<--Data------------------|            |            |
  |                |            |            |            |            |
  |                | Read from R=3 nodes, return latest  |            |
  |<--Data---------|            |            |            |            |

Pros:

  • Flexible Consistency: Adjust W and R for consistency/availability trade-off
  • Fault Tolerance: Can tolerate (N-W) write failures, (N-R) read failures
  • Tunable Performance: Lower W for faster writes, lower R for faster reads

Cons:

  • Consistency Complexity: Must resolve conflicts when R < N
  • Coordination Overhead: Requires coordination for reads and writes
  • Stale Reads Possible: If R < N, may read stale data

When to Use:

  • Need flexibility in consistency/availability trade-off
  • Variable workload (read-heavy vs write-heavy)
  • Acceptable to tune W and R based on requirements

Real-World Example: DynamoDB, Cassandra (tunable consistency), Riak.
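
A minimal sketch of the quorum condition in code, assuming in-memory replicas and a single coordinator-assigned version counter (a real system would use per-key versions or vector clocks and would handle replica failures).

import random

class QuorumStore:
    def __init__(self, n=5, w=3, r=3):
        assert w + r > n, "read and write quorums must overlap"
        self.n, self.w, self.r = n, w, r
        self.replicas = [dict() for _ in range(n)]
        self.version = 0

    def put(self, key, value):
        self.version += 1
        for replica in random.sample(self.replicas, self.w):   # any W replicas ACK
            replica[key] = (self.version, value)
        return "OK"

    def get(self, key):
        targets = random.sample(self.replicas, self.r)          # any R replicas answer
        copies = [rep[key] for rep in targets if key in rep]
        return max(copies)[1]                                    # highest version wins

store = QuorumStore()
store.put("k", "v1")
store.put("k", "v2")
print(store.get("k"))   # "v2": the overlap guarantees at least one fresh copy is read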

Approach 6: Primary-Backup with Log Shipping

How It Works: Primary writes to log, backups replay log asynchronously.

Client          Primary        Backup 1        Backup 2
------          -------        --------        --------
  |--Write------->|                |                |
  |                | Append to log  |                |
  |                | Apply locally  |                |
  |<--Success------|                |                |
  |                |                |                |
  |                |--Ship log---------------------->|
  |                |--Ship log------------------------------>|
  |                |                |                |
  |                |                | Replay log      |
  |                |                | Replay log      |

Pros:

  • Simple Model: Easy to understand and implement
  • Low Primary Overhead: Primary not blocked by backups
  • Point-in-Time Recovery: Can replay log to any point

Cons:

  • Lag Risk: Backups may lag significantly
  • Log Management: Must manage log growth and cleanup
  • Recovery Time: Must replay entire log on recovery

When to Use:

  • Simple replication requirements
  • Acceptable backup lag
  • Need point-in-time recovery

Real-World Example: PostgreSQL streaming replication, MySQL replication.
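
A minimal sketch of log shipping, assuming an in-memory log and a pull-style catch-up call; a real system would persist the log and stream it over the network.

class Primary:
    def __init__(self):
        self.state, self.log = {}, []          # log entries: (seqno, key, value)

    def write(self, key, value):
        seqno = len(self.log) + 1
        self.log.append((seqno, key, value))   # append to the write-ahead log
        self.state[key] = value                # apply locally, then ACK the client
        return seqno

class Backup:
    def __init__(self):
        self.state, self.applied = {}, 0       # applied = last seqno replayed

    def catch_up(self, log):
        for seqno, key, value in log[self.applied:]:
            self.state[key] = value            # replay in log order
            self.applied = seqno

primary, backup = Primary(), Backup()
primary.write("a", 1)
primary.write("b", 2)
backup.catch_up(primary.log)                   # shipped asynchronously in practice
print(backup.state)                            # {'a': 1, 'b': 2}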

Hybrid Approaches

Hot-Cold Replication: Synchronous for hot data, asynchronous for cold data.

Tiered Replication: Different replication factors for different data tiers.

Geographic Replication: Synchronous within region, asynchronous across regions.

Decision Matrix:

Strategy          Write Latency   Read Latency   Consistency   Availability   Complexity
--------          -------------   ------------   -----------   ------------   ----------
Synchronous       High            Low            Strong        Medium         Low
Asynchronous      Low             Low            Eventual      High           Medium
Chain             High            Low            Strong        Medium         Medium
CRAQ              High            Low            Strong        Medium         High
Quorum            Medium          Medium         Tunable       High           High
Primary-Backup    Low             Low            Eventual      High           Low

Erasure Coding: Engineering Trade-offs

Erasure coding provides durability with lower storage overhead than replication, but introduces computational complexity.

Reed-Solomon Encoding

How It Works: Divide data into k fragments, generate m parity fragments. Can reconstruct from any k fragments.

Storage Overhead: (k+m)/k - 1 = m/k

Example: RS(10,4):

  • 10 data fragments + 4 parity = 14 total fragments
  • Can lose any 4 fragments
  • Storage overhead: 40% (vs 200% for 3x replication)
  • Need any 10 of the 14 fragments (≈ 71%) to reconstruct
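
As a quick sanity check on these numbers, here is a small, illustrative helper that computes the overhead and reconstruction threshold for any RS(k, m) layout using the formula above.

def erasure_profile(k, m):
    total = k + m
    return {
        "fragments": total,
        "tolerates_failures": m,                  # any m fragments can be lost
        "storage_overhead_pct": 100.0 * m / k,
        "fraction_needed_to_rebuild": k / total,
    }

print(erasure_profile(10, 4))   # RS(10,4): 40% overhead, survives any 4 losses
print(erasure_profile(6, 3))    # RS(6,3): 50% overhead, survives any 3 losses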

Pros:

  • Low Storage Overhead: Much better than replication
  • High Durability: Can tolerate multiple failures
  • Mature Technology: Well-understood algorithms

Cons:

  • Computational Cost: Encoding/decoding requires CPU
  • Reconstruction Overhead: Must read k fragments to reconstruct
  • Fixed Overhead: Cannot adjust durability without re-encoding

When to Use:

  • Cold data (low access frequency)
  • Cost-sensitive scenarios
  • Acceptable reconstruction latency

Local Reconstruction Codes (LRC)

How It Works: Combines local and global parity. Local parity enables fast reconstruction of single failures.

Example: LRC(12,2,2):

  • 12 data fragments
  • 2 local parity fragments (each protects 6 data fragments)
  • 2 global parity fragments (protect all data)
  • Can reconstruct single failure using only local parity (fast)
  • Can reconstruct multiple failures using global parity (slower)

Pros:

  • Fast Single Failure Recovery: Only need local fragments
  • Lower Storage Overhead: Than replication
  • Flexible: Can optimize for common case (single failure)

Cons:

  • More Complex: Encoding layout and repair paths are more involved than Reed-Solomon
  • Limited Multi-Failure Recovery: Failures beyond what local parity covers fall back to the global parity, which bounds how many failures can be tolerated

When to Use:

  • Expect mostly single failures
  • Need fast recovery for common case
  • Acceptable complexity

Real-World Example: Azure Storage uses LRC.

Regenerating Codes

How It Works: Reduces bandwidth during reconstruction by allowing partial reconstruction.

Pros:

  • Bandwidth Efficient: Less data transfer during reconstruction
  • Network Optimized: Reduces network usage

Cons:

  • High Complexity: Very complex to implement
  • Computational Overhead: More CPU intensive
  • Limited Adoption: Less mature than RS

When to Use:

  • Network bandwidth is primary concern
  • Acceptable complexity and CPU cost

Decision: Replication vs Erasure Coding

Use Replication When:

  • Hot data (frequent access)
  • Low latency requirements
  • Simple operations needed
  • Acceptable storage cost

Use Erasure Coding When:

  • Cold data (infrequent access)
  • Cost-sensitive
  • Acceptable reconstruction latency
  • High durability needed with lower cost

Hybrid Approach:

  • Hot data: 3x replication (fast access)
  • Warm data: Erasure coding with local reconstruction (balanced)
  • Cold data: Erasure coding with high parity (cost-optimized)
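
In code, such a tiering policy can be as simple as a lookup keyed on access temperature. The thresholds and scheme labels below are made-up placeholders, not recommendations.

def redundancy_scheme(reads_per_day: float) -> str:
    if reads_per_day > 100:        # hot: optimize for access latency
        return "replication(3x)"
    if reads_per_day > 1:          # warm: balance cost and repair speed
        return "lrc(12,2,2)"
    return "reed_solomon(10,4)"    # cold: optimize for storage cost

for rate in (500, 10, 0.01):
    print(rate, "->", redundancy_scheme(rate))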

Metadata Architecture: Multiple Approaches

Metadata management is often the bottleneck. Different architectures offer different scalability and consistency characteristics.

Approach 1: Centralized Metadata Service

Architecture: Single metadata service (possibly replicated) handles all metadata operations.

Clients          Metadata Service (Replicated)
------          -----------------------------
  |--Get metadata->|                            |
  |                | Query metadata DB          |
  |<--Metadata-----|                            |

Pros:

  • Strong Consistency: Single source of truth
  • Simple Model: Easy to reason about
  • Transaction Support: Can support transactions easily

Cons:

  • Scalability Limit: Single service becomes bottleneck
  • Single Point of Failure: If service fails, system unavailable
  • Geographic Limitations: Latency for distant clients

When to Use:

  • Small to medium scale
  • Strong consistency required
  • Acceptable single service limits

Real-World Example: Early versions of some storage systems.

Approach 2: Sharded Metadata Service

Architecture: Metadata partitioned across multiple shards based on object key hash.

Clients          Metadata Shard 1    Metadata Shard 2    Metadata Shard N
------          ----------------    ----------------    ----------------
  |--Get metadata->| (hash % N = 1)                      |
  |                | Query shard 1   |                    |
  |<--Metadata-----|                 |                    |

Sharding Strategies:

Hash-Based Sharding:

  • Hash object key, modulo number of shards
  • Pros: Even distribution, simple
  • Cons: Resharding expensive, hotspots possible

Range-Based Sharding:

  • Partition by key ranges
  • Pros: Easy range queries, simple resharding
  • Cons: Uneven distribution possible, hotspots

Directory-Based Sharding:

  • Partition by bucket or key prefix
  • Pros: Natural isolation, easy management
  • Cons: Uneven distribution, complex routing

Pros:

  • Horizontal Scalability: Add shards to scale
  • Load Distribution: Spread load across shards
  • Fault Isolation: Failure of one shard doesn’t affect others

Cons:

  • Cross-Shard Operations: Complex for operations spanning shards
  • Resharding Complexity: Moving data between shards is expensive
  • Consistency Challenges: Maintaining consistency across shards

When to Use:

  • Large scale
  • Acceptable eventual consistency
  • Need horizontal scalability

Real-World Example: DynamoDB (hash-based), MongoDB (range-based).
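
A minimal sketch contrasting the first two sharding strategies above, hash-based and range-based routing; the shard count and range boundaries are arbitrary examples.

import bisect, hashlib

NUM_SHARDS = 8

def hash_shard(key: str) -> int:
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS            # even spread, but resharding moves most keys

RANGE_BOUNDS = ["g", "n", "t"]                     # shard 0: keys < "g", shard 1: < "n", ...

def range_shard(key: str) -> int:
    return bisect.bisect_right(RANGE_BOUNDS, key)  # keeps adjacent keys together for range scans

print(hash_shard("bucket-42/object.png"))
print(range_shard("bucket-42/object.png"))         # "b..." falls in the first range (shard 0)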

Approach 3: Distributed Hash Table (DHT)

Architecture: Metadata distributed using DHT (like Chord, Pastry, Kademlia).

Clients          DHT Node 1      DHT Node 2      DHT Node 3
------          ----------      ----------      ----------
  |--Get metadata->|                |                |
  |                | Route via DHT  |                |
  |                |--Forward------>|                |
  |                |                |--Forward------>|
  |                |                |                |
  |                |                |<--Metadata-----|
  |                |<--Metadata-----|                |
  |<--Metadata-----|                |                |

Pros:

  • Fully Distributed: No central coordinator
  • Self-Organizing: Nodes join/leave automatically
  • Fault Tolerant: Handles node failures gracefully

Cons:

  • Complex Routing: DHT routing can be complex
  • Eventual Consistency: May have temporary inconsistencies
  • Lookup Latency: Multiple hops for lookups

When to Use:

  • Very large scale
  • Acceptable lookup latency
  • Need fully distributed system

Real-World Example: Ceph (CRUSH algorithm), Cassandra (consistent hashing).

Approach 4: Metadata Co-located with Data

Architecture: Each storage node maintains metadata for objects it stores.

Clients          Storage Node 1    Storage Node 2    Storage Node 3
------          --------------    --------------    --------------
  |--Get object--->| (has metadata)                    |
  |                | Check local metadata              |
  |                | Serve object                      |
  |<--Object-------|                                   |

Pros:

  • No Separate Metadata Service: Simpler architecture
  • Locality: Metadata near data
  • Scalability: Scales with storage nodes

Cons:

  • Discovery Complexity: Must find which node has object
  • Consistency Challenges: Maintaining metadata consistency
  • Query Limitations: Cross-node queries expensive

When to Use:

  • Object-to-node mapping is stable
  • Acceptable discovery overhead
  • Need simplicity

Real-World Example: Ceph OSDs maintain object metadata locally.

Approach 5: Hierarchical Metadata

Architecture: Metadata organized hierarchically (bucket → object → versions).

Metadata Hierarchy:
Root
├── Bucket 1
│   ├── Object 1
│   │   ├── Version 1
│   │   └── Version 2
│   └── Object 2
└── Bucket 2

Pros:

  • Natural Organization: Matches logical structure
  • Efficient Queries: Can query by bucket efficiently
  • Isolation: Bucket-level isolation

Cons:

  • Hierarchy Maintenance: Must maintain hierarchy
  • Cross-Bucket Operations: Complex
  • Scalability: Hierarchy depth limits scalability

When to Use:

  • Natural hierarchy exists
  • Need efficient bucket-level operations
  • Acceptable hierarchy limitations

Decision: Choosing Metadata Architecture

Factors to Consider:

  1. Scale: How many objects? (millions vs billions)
  2. Consistency: Strong vs eventual?
  3. Query Patterns: Point lookups vs range queries?
  4. Update Frequency: How often does metadata change?
  5. Geographic Distribution: Single region vs global?

Recommendation Matrix:

Scale        Consistency   Query Pattern   Best Approach
-----        -----------   -------------   -------------
Small        Strong        Point           Centralized
Medium       Strong        Point/Range     Sharded
Large        Eventual      Point           DHT
Large        Eventual      Range           Sharded (range-based)
Very Large   Eventual      Point           Metadata co-located

Data Placement Algorithms

How we place data across nodes determines load balance, fault tolerance, and performance.

Approach 1: Consistent Hashing

How It Works: Map nodes and objects to a hash ring. Object placed on first node clockwise.

Hash Ring (0x0000 to 0xFFFF, wrapping back to 0x0000):

              0x0000
                |
    0xC000 ---- + ---- 0x4000
                |
              0x8000

Nodes: A(0x2000), B(0x6000), C(0xA000)
Object hash: 0x5000 → placed on Node B (first node clockwise)

Pros:

  • Minimal Rebalancing: Only objects near added/removed node move
  • Even Distribution: With good hash function
  • Simple: Easy to understand and implement

Cons:

  • Uneven Load: Without virtual nodes, load can be uneven
  • Hotspots: Popular objects create hotspots
  • Fixed Mapping: Cannot easily move objects

Improvements:

  • Virtual Nodes: Each physical node has multiple virtual nodes on ring
  • Weighted Nodes: Assign more virtual nodes to powerful nodes

Real-World Example: DynamoDB, Riak, Cassandra.
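
A minimal consistent-hash ring with virtual nodes, assuming MD5 as the ring hash and 64 virtual nodes per physical node (both arbitrary choices for illustration).

import bisect, hashlib

class HashRing:
    def __init__(self, nodes, vnodes=64):
        self.ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes for i in range(vnodes)
        )
        self.points = [point for point, _ in self.ring]

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def node_for(self, key: str) -> str:
        idx = bisect.bisect(self.points, self._hash(key)) % len(self.ring)
        return self.ring[idx][1]        # first virtual node clockwise from the key

ring = HashRing(["node-a", "node-b", "node-c"])
print(ring.node_for("bucket-1/photo.jpg"))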

Approach 2: CRUSH (Controlled Replication Under Scalable Hashing)

How It Works: Pseudo-random placement function considering node weights, failure domains, and placement rules.

Key Features:

  • Deterministic: Same input always produces same output
  • Weighted: Considers node capacity/performance
  • Failure Domain Aware: Ensures replicas in different failure domains
  • Rule-Based: Flexible placement rules

Example Rule: “Place 3 replicas across 3 different racks in same data center”

Pros:

  • Flexible: Can express complex placement policies
  • Deterministic: No need to store placement information
  • Failure Domain Aware: Better fault tolerance
  • Weighted: Can handle heterogeneous nodes

Cons:

  • Complex: More complex than consistent hashing
  • Rule Complexity: Must design placement rules carefully
  • Rebalancing: Still requires rebalancing on topology changes

Real-World Example: Ceph uses CRUSH extensively.
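
The sketch below is not the actual CRUSH algorithm; it only illustrates its two core ideas: placement is computed deterministically from the object name plus a cluster map (nothing is stored in a lookup table), and replicas are forced into distinct failure domains, racks in this made-up topology.

import hashlib

TOPOLOGY = {                      # failure domain (rack) -> nodes in that rack
    "rack-1": ["n1", "n2"],
    "rack-2": ["n3", "n4"],
    "rack-3": ["n5", "n6"],
}

def _score(obj: str, item: str) -> int:
    return int(hashlib.md5(f"{obj}:{item}".encode()).hexdigest(), 16)

def place(obj: str, replicas: int = 3):
    # Deterministically pick `replicas` distinct racks, then one node per rack.
    racks = sorted(TOPOLOGY, key=lambda r: _score(obj, r), reverse=True)[:replicas]
    return [max(TOPOLOGY[r], key=lambda n: _score(obj, n)) for r in racks]

print(place("bucket-1/video.mp4"))   # three nodes, one from each rack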

Approach 3: Rendezvous Hashing (Highest Random Weight)

How It Works: For each object, compute hash(node_id + object_key) for all nodes. Place on node with highest hash value.

Pros:

  • No Virtual Nodes: Simpler than consistent hashing with virtual nodes
  • Even Distribution: Good load distribution
  • Minimal Rebalancing: Only objects on added/removed node move

Cons:

  • O(N) Lookup: Must check all nodes (can cache)
  • Topology Changes: Every key's winning node must be recomputed when nodes join or leave, even though only about 1/N of objects actually move

When to Use:

  • Small to medium number of nodes
  • Need even distribution
  • Acceptable O(N) lookup
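
A minimal sketch of highest-random-weight selection, with MD5 standing in for the scoring hash; the top-k scorers give k replica locations.

import hashlib

def hrw_nodes(key: str, nodes, k: int = 1):
    def score(node):
        return int(hashlib.md5(f"{node}:{key}".encode()).hexdigest(), 16)
    return sorted(nodes, key=score, reverse=True)[:k]

nodes = ["node-a", "node-b", "node-c", "node-d"]
print(hrw_nodes("bucket-1/doc.pdf", nodes))        # primary placement
print(hrw_nodes("bucket-1/doc.pdf", nodes, k=3))   # three replicas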

Approach 4: Distributed Hash Table (DHT) Placement

How It Works: Objects placed on nodes responsible for their key in DHT.

Pros:

  • Fully Distributed: No central coordinator
  • Self-Organizing: Handles node joins/leaves
  • Scalable: Scales to very large number of nodes

Cons:

  • Routing Overhead: Multiple hops for placement
  • Churn Sensitivity: Frequent node changes cause churn
  • Complexity: DHT algorithms are complex

When to Use:

  • Very large scale
  • Highly dynamic environment
  • Acceptable routing overhead

Approach 5: Manual Placement with Load Balancing

How It Works: Administrator or system assigns objects to nodes, load balancer distributes requests.

Pros:

  • Full Control: Complete control over placement
  • Optimization: Can optimize for specific workloads
  • Predictable: Know exactly where data is

Cons:

  • Manual Effort: Requires manual management
  • Not Scalable: Doesn’t scale to large systems
  • Static: Doesn’t adapt to changes

When to Use:

  • Small systems
  • Special requirements
  • Need full control

Decision: Choosing Placement Algorithm

Factors:

  1. Scale: Number of nodes
  2. Churn: How often nodes join/leave
  3. Heterogeneity: Are nodes identical or different?
  4. Failure Domains: Need to consider racks, data centers?
  5. Rebalancing Cost: How expensive is moving data?

Recommendation:

  • Small Scale (<100 nodes): Rendezvous hashing
  • Medium Scale (100-1000 nodes): Consistent hashing with virtual nodes
  • Large Scale (>1000 nodes): CRUSH or DHT
  • Heterogeneous Nodes: CRUSH (weighted)
  • Failure Domain Awareness: CRUSH

Consistency Models: Engineering Choices

Different consistency models offer different guarantees and performance characteristics.

Strong Consistency

Guarantee: All clients see same data at same time.

Implementation:

  • Synchronous replication
  • Quorum reads and writes
  • Linearizable operations

Pros:

  • Correctness: Always correct data
  • Simple Programming Model: No need to handle inconsistencies

Cons:

  • Lower Availability: Unavailable during partitions
  • Higher Latency: Must coordinate across nodes
  • Lower Throughput: Coordination limits throughput

When to Use:

  • Critical data correctness
  • Acceptable availability trade-off
  • Low write throughput acceptable

Eventual Consistency

Guarantee: System will eventually converge to consistent state.

Implementation:

  • Asynchronous replication
  • Conflict resolution (last-write-wins, vector clocks)
  • Read repair

Pros:

  • High Availability: Available during partitions
  • Low Latency: No coordination needed
  • High Throughput: No coordination overhead

Cons:

  • Temporary Inconsistencies: Clients may see stale data
  • Complex Programming: Must handle inconsistencies
  • Conflict Resolution: Must resolve conflicts

When to Use:

  • High availability required
  • Acceptable temporary inconsistencies
  • High throughput needed

Causal Consistency

Guarantee: Causally related operations seen in order by all clients.

Implementation:

  • Vector clocks to track causality
  • Order causally related operations
  • Allow concurrent operations to be seen in different orders

Pros:

  • Stronger than Eventual: Preserves causality
  • Weaker than Strong: More available than strong consistency

Cons:

  • Vector Clock Overhead: Must maintain vector clocks
  • Complexity: More complex than eventual consistency

When to Use:

  • Need causality preservation
  • Acceptable complexity
  • Need better availability than strong consistency
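
A minimal vector-clock sketch showing increment, merge, and the happens-before check that underpins causal ordering; replica names are illustrative.

def vc_increment(clock: dict, replica: str) -> dict:
    out = dict(clock)
    out[replica] = out.get(replica, 0) + 1
    return out

def vc_merge(a: dict, b: dict) -> dict:
    return {k: max(a.get(k, 0), b.get(k, 0)) for k in set(a) | set(b)}

def vc_happens_before(a: dict, b: dict) -> bool:
    keys = set(a) | set(b)
    return all(a.get(k, 0) <= b.get(k, 0) for k in keys) and a != b

v1 = vc_increment({}, "r1")                 # {'r1': 1}
v2 = vc_increment(v1, "r2")                 # causally after v1
v3 = vc_increment(v1, "r1")                 # concurrent with v2
print(vc_happens_before(v1, v2))            # True
print(vc_happens_before(v2, v3), vc_happens_before(v3, v2))   # False False: concurrent
v_sync = vc_merge(v2, v3)                   # replication merges the two histories
print(vc_happens_before(v2, v_sync), vc_happens_before(v3, v_sync))   # True True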

Read-Your-Writes Consistency

Guarantee: Client always sees its own writes.

Implementation:

  • Track client’s write history
  • Route reads to nodes with client’s writes
  • Session affinity

Pros:

  • User Experience: Users see their own updates immediately
  • Weaker than Strong: More available

Cons:

  • Per-Client State: Must track client state
  • Routing Complexity: Must route based on client

When to Use:

  • User-facing applications
  • Need users to see their updates
  • Acceptable per-client tracking
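
A minimal sketch of the session-token approach, assuming a single writer per session and a simple counter standing in for real version numbers.

class Replica:
    def __init__(self):
        self.data, self.version = {}, 0

    def apply(self, key, value, version):
        self.data[key], self.version = value, version

class Session:
    def __init__(self, replicas):
        self.replicas, self.last_written = replicas, 0

    def write(self, key, value):
        self.last_written += 1                                  # remember this write's version
        self.replicas[0].apply(key, value, self.last_written)   # primary applies it

    def read(self, key):
        for replica in self.replicas:          # skip replicas that lag this session
            if replica.version >= self.last_written:
                return replica.data.get(key)
        raise RuntimeError("no replica has caught up to this session yet")

replicas = [Replica(), Replica()]
session = Session(replicas)
session.write("profile", "v2")
print(session.read("profile"))   # "v2": served by a replica at or beyond the session token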

Monotonic Reads

Guarantee: Client never sees older data after seeing newer data.

Implementation:

  • Track client’s read history
  • Ensure reads are from nodes with at least same version

Pros:

  • Prevents Regression: Users don’t see data go backwards
  • Weaker than Strong: More available

Cons:

  • Per-Client Tracking: Must track client read history

When to Use:

  • User-facing applications
  • Need to prevent seeing older data
  • Acceptable tracking overhead

Decision: Choosing Consistency Model

Requirements Analysis:

  1. Correctness Requirements: How critical is correctness?
  2. Availability Requirements: How important is availability?
  3. Latency Requirements: What latency is acceptable?
  4. Throughput Requirements: What throughput is needed?
  5. Application Complexity: Can application handle inconsistencies?

Recommendation:

  • Critical Financial Data: Strong consistency
  • User Content (Social Media): Eventual consistency
  • User Profiles: Read-your-writes or causal consistency
  • Configuration Data: Strong consistency
  • Logs/Metrics: Eventual consistency

System Architecture Comparison

Let’s compare how different real-world systems make these engineering decisions:

Amazon S3 Architecture

  • Metadata: DynamoDB-like sharded service
  • Replication: 3x synchronous within region
  • Placement: Consistent hashing
  • Consistency: Read-after-write for new objects, eventual for overwrites
  • Erasure Coding: Not used (replication only)

Design Philosophy: Simplicity, durability, eventual consistency acceptable.

Ceph Architecture

  • Metadata: Distributed across OSDs (Object Storage Daemons)
  • Replication: Configurable (typically 3x)
  • Placement: CRUSH algorithm
  • Consistency: Strong consistency with configurable replication
  • Erasure Coding: Supported via erasure-coded pools

Design Philosophy: Flexibility, self-healing, no single point of failure.

Google Cloud Storage

  • Metadata: Spanner (strongly consistent)
  • Replication: Multi-regional or regional
  • Placement: Google’s internal placement
  • Consistency: Strong consistency options
  • Erasure Coding: Used for durability

Design Philosophy: Strong consistency, global scale, integration with Google services.

Azure Blob Storage

  • Metadata: Azure Table Storage
  • Replication: Geo-redundant options
  • Placement: Azure’s internal placement
  • Consistency: Strong consistency
  • Erasure Coding: Local Reconstruction Codes (LRC)

Design Philosophy: Strong consistency, cost optimization via LRC, Azure integration.

MinIO Architecture

  • Metadata: Embedded (simpler architecture)
  • Replication: Erasure coding (Reed-Solomon)
  • Placement: Consistent hashing
  • Consistency: Strong consistency
  • Erasure Coding: Primary mechanism (not replication)

Design Philosophy: Simplicity, erasure coding focus, S3 API compatibility.

Building Our System: Design Decisions

Given the exploration above, here’s how we might design our system:

Phase 1: Core Requirements

  • Durability: 11 nines (99.999999999%)
  • Availability: 99.99% (4 nines)
  • Consistency: Read-after-write for new objects, eventual for overwrites
  • Scale: Exabytes of data, millions of requests per second

Phase 2: Architecture Decisions

Metadata Architecture: Sharded metadata service

  • Rationale: Need horizontal scalability, eventual consistency acceptable
  • Sharding: Hash-based sharding for even distribution
  • Replication: 3x replication for metadata service itself

Replication Strategy: Hybrid approach

  • Hot Data: 3x synchronous replication (low latency, high durability)
  • Warm Data: 3x asynchronous replication (balance)
  • Cold Data: Erasure coding RS(10,4) (cost optimization)

Placement Algorithm: CRUSH-like algorithm

  • Rationale: Need failure domain awareness, weighted nodes
  • Failure Domains: Rack → Data Center → Region
  • Placement Rules: 3 replicas across 3 different racks

Consistency Model: Read-after-write + eventual

  • New Objects: Read-after-write consistency (synchronous replication)
  • Overwrites: Eventual consistency (asynchronous replication)
  • Rationale: Balance correctness and availability

Phase 3: Component Design

API Gateway:

  • Authentication/authorization
  • Rate limiting
  • Request routing to appropriate shard

Metadata Service:

  • Sharded key-value store
  • Consistent hashing for sharding
  • 3x replication per shard
  • Caching layer for hot metadata

Data Storage Service:

  • Storage nodes organized in failure domains
  • CRUSH-like placement
  • Hybrid replication/erasure coding
  • Self-healing (detect and repair failures)

Monitoring:

  • Comprehensive metrics
  • Alerting on SLA violations
  • Capacity planning

Trade-offs Summary

Every engineering decision involves trade-offs:

Replication vs Erasure Coding:

  • Replication: Simpler, faster, more expensive
  • Erasure Coding: Complex, slower reconstruction, cheaper

Strong vs Eventual Consistency:

  • Strong: Correct but less available
  • Eventual: Available but may be inconsistent

Centralized vs Distributed Metadata:

  • Centralized: Simple but doesn’t scale
  • Distributed: Complex but scales

Synchronous vs Asynchronous Replication:

  • Synchronous: Durable but slower
  • Asynchronous: Fast but may lose data

Consistent Hashing vs CRUSH:

  • Consistent Hashing: Simple but less flexible
  • CRUSH: Complex but more flexible

Conclusion

Designing a planet-scale distributed object storage system requires making fundamental engineering decisions at every layer. There are no universally correct answers - each decision depends on requirements, constraints, and trade-offs.

The key is understanding:

  1. Multiple Approaches: For each problem, there are multiple solutions
  2. Trade-offs: Each solution has pros and cons
  3. Requirements: Decisions must align with requirements
  4. Real-World Constraints: Consider operational complexity, cost, performance

By exploring different approaches, understanding their trade-offs, and making informed decisions based on requirements, we can design systems that meet our specific needs while understanding the implications of our choices.

The systems we examined (S3, Ceph, GCS, Azure, MinIO) each made different choices based on their priorities. There’s no single “best” architecture - only architectures that best fit specific requirements and constraints.