Designing a Planet-Scale Distributed Object Storage System: Engineering Deep Dive

• 45 min read
Distributed Systems

Building a planet-scale distributed object storage system from scratch requires making fundamental engineering decisions at every layer. Rather than describing an existing system, we will explore multiple approaches to each problem, understand their trade-offs, and make informed choices based on requirements. This deep dive examines the engineering challenges, design alternatives, and architectural decisions needed to build such a system.

Core Design Challenges

Before diving into solutions, we must understand the fundamental challenges:

Challenge 1: Durability vs Cost: How do we ensure data survives multiple failures without excessive storage overhead?

Challenge 2: Consistency vs Availability: How do we balance correctness with system availability during failures?

Challenge 3: Metadata Scalability: How do we scale metadata operations to handle billions of objects?

Challenge 4: Data Placement: How do we distribute data across nodes while maintaining load balance and fault tolerance?

Challenge 5: Performance vs Consistency: How do we optimize for low latency while maintaining correctness guarantees?

Each challenge has multiple solutions, each with distinct trade-offs. Let’s explore them systematically.

Replication Strategies: A Deep Dive

Replication is fundamental to durability and availability. Different strategies offer different guarantees, performance characteristics, and complexity.

Approach 1: Synchronous Replication

How It Works: Write is acknowledged only after data is replicated to N nodes.

Client          Coordinator      Node 1      Node 2      Node 3
------          ----------      ------      ------      ------
  |--Write------->|                |            |            |
  |                |--Replicate---->|            |            |
  |                |--Replicate------------------>|            |
  |                |--Replicate------------------------------>|
  |                |                |            |            |
  |                |<--ACK----------|            |            |
  |                |<--ACK-----------------------|            |
  |                |<--ACK-------------------------------------|
  |                |                |            |            |
  |                | Wait for N ACKs (e.g., N=3) |            |
  |<--Success------|                |            |            |

Pros:

  • Strong Durability: Data guaranteed to be on N nodes before acknowledgment
  • No Data Loss: If client receives success, data is safely stored
  • Consistent View: All replicas have same data immediately

Cons:

  • High Write Latency: Must wait for slowest replica (tail latency problem)
  • Lower Availability: System unavailable if N nodes unavailable
  • Network Sensitivity: Performance degrades with network issues

When to Use:

  • Critical data requiring strong durability guarantees
  • Low write throughput acceptable
  • Strong consistency requirements

Real-World Example: Financial transaction logs, critical metadata.
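
To make the write path concrete, here is a minimal sketch of the wait-for-all-ACKs behavior. In-memory dictionaries stand in for replica nodes and a thread pool stands in for parallel replication RPCs; all class and method names are illustrative, not any real system's API.

from concurrent.futures import ThreadPoolExecutor

class SyncReplicatedStore:
    def __init__(self, num_replicas=3):
        self.replicas = [dict() for _ in range(num_replicas)]
        self.pool = ThreadPoolExecutor(max_workers=num_replicas)

    def _replicate(self, replica, key, value):
        replica[key] = value           # stand-in for a network write
        return True                    # stand-in for the replica's ACK

    def put(self, key, value):
        # Fan out to every replica in parallel, then block until all N ACK.
        futures = [self.pool.submit(self._replicate, r, key, value)
                   for r in self.replicas]
        if not all(f.result() for f in futures):
            raise IOError("replication failed")
        return "OK"                    # acknowledged only after N ACKs

store = SyncReplicatedStore(num_replicas=3)
print(store.put("object-1", b"payload"))   # OK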

Approach 2: Asynchronous Replication

How It Works: Write acknowledged immediately, replication happens in background.

Client          Primary Node    Secondary 1    Secondary 2
------          -----------    -----------    -----------
  |--Write------->|                |                |
  |                | Write locally  |                |
  |<--Success------|                |                |
  |                |                |                |
  |                |--Async replicate-------------->|
  |                |--Async replicate----------------------->|
  |                |                |                |
  |                | (Background replication)        |

Pros:

  • Low Write Latency: Immediate acknowledgment
  • High Throughput: Not limited by replication speed
  • Better Availability: System remains available even if replicas lag

Cons:

  • Potential Data Loss: If primary fails before replication completes
  • Eventual Consistency: Replicas may lag behind primary
  • Complex Failure Handling: Must handle partial replication on failure

When to Use:

  • High write throughput requirements
  • Acceptable to lose recent writes in failure scenarios
  • Eventual consistency acceptable

Real-World Example: Log aggregation, metrics collection, non-critical data.
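
A minimal sketch of the same flow in code, with an in-process queue and a background thread standing in for the replication pipeline (illustrative names only). Note that a primary crash before the queue drains would lose the queued writes, which is exactly the trade-off described above.

import queue, threading

class AsyncReplicatedStore:
    def __init__(self, num_secondaries=2):
        self.primary = {}
        self.secondaries = [dict() for _ in range(num_secondaries)]
        self.backlog = queue.Queue()
        threading.Thread(target=self._replicator, daemon=True).start()

    def _replicator(self):
        while True:
            key, value = self.backlog.get()
            for secondary in self.secondaries:   # stand-in for shipping over the network
                secondary[key] = value
            self.backlog.task_done()

    def put(self, key, value):
        self.primary[key] = value
        self.backlog.put((key, value))   # replicate in the background
        return "OK"                      # acknowledged before replication finishes

store = AsyncReplicatedStore()
store.put("object-1", b"payload")
store.backlog.join()                     # wait here only to make the demo deterministic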

Approach 3: Chain Replication

How It Works: Replicas form a chain: writes flow from head to tail, and reads are served by the tail.

Client          Head Node      Middle Node    Tail Node
------          ---------      -----------    ---------
  |--Write------->|                |                |
  |                |--Forward------>|                |
  |                |                |--Forward------>|
  |                |                |                |
  |                |                |<--ACK---------|
  |                |<--ACK----------|                |
  |<--Success------|                |                |
  |                |                |                |
  |--Read------------------------------------------------->|
  |                |                |                |
  |<--Data--------------------------------------------------|

Pros:

  • Strong Consistency: Tail always has committed data
  • Read Optimization: Reads from tail, no coordination needed
  • Ordering Guarantee: Writes processed in order

Cons:

  • Write Latency: Must traverse entire chain
  • Failure Sensitivity: A failure of any node in the chain blocks writes until the chain is reconfigured
  • Recovery Complexity: Must rebuild chain on failure

When to Use:

  • Strong consistency requirements
  • Read-heavy workloads
  • Acceptable write latency

Real-World Example: Strongly consistent key-value stores, configuration management.

Approach 4: CRAQ (Chain Replication with Apportioned Queries)

How It Works: Chain replication where all nodes can serve reads (with version tracking).

Client          Head Node      Middle Node    Tail Node
------          ---------      -----------    ---------
  |--Write------->|                |                |
  |                |--Forward------>|                |
  |                |                |--Forward------>|
  |                |                |                |
  |                |                |<--ACK---------|
  |                |<--ACK----------|                |
  |<--Success------|                |                |
  |                |                |                |
  |--Read--------->| (Can read from any node)        |
  |<--Data---------|                |                |

Pros:

  • Read Scalability: All nodes serve reads
  • Strong Consistency: Writes still strongly consistent
  • Load Distribution: Reads distributed across chain

Cons:

  • Version Tracking Overhead: Must track committed vs uncommitted versions
  • Complex Read Logic: Must check version before serving read
  • Write Latency: Still limited by chain traversal

When to Use:

  • Read-heavy workloads requiring strong consistency
  • Need to scale reads independently
  • Acceptable write latency

Real-World Example: CRAQ paper implementation, some distributed databases.
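
A rough sketch of the clean/dirty version tracking that makes apportioned reads safe. Class and field names are made up, and CRAQ's chain maintenance and failure handling are omitted entirely.

class CraqNode:
    def __init__(self, tail=None):
        self.versions = {}      # key -> {version: value}
        self.clean = {}         # key -> highest committed version
        self.tail = tail        # None means this node is the tail

    def write(self, key, version, value):
        self.versions.setdefault(key, {})[version] = value   # dirty until committed

    def commit(self, key, version):
        self.clean[key] = version                             # mark this version clean

    def read(self, key):
        latest = max(self.versions[key])
        if latest == self.clean.get(key):
            return self.versions[key][latest]                 # clean: serve locally
        committed = self.tail.clean[key] if self.tail else self.clean.get(key)
        return self.versions[key][committed]                  # dirty: ask the tail

tail = CraqNode()
mid = CraqNode(tail=tail)
for node in (tail, mid):
    node.write("k", 1, "v1")
    node.commit("k", 1)
mid.write("k", 2, "v2")      # version 2 has not been committed by the tail yet
print(mid.read("k"))         # "v1": the middle node serves the committed version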

Approach 5: Quorum-Based Replication

How It Works: Write to W nodes, read from R nodes, where W + R > N (total replicas).

Client          Node 1      Node 2      Node 3      Node 4      Node 5
------          ------      ------      ------      ------      ------
  |--Write------->|            |            |            |            |
  |                |--Write---->|            |            |            |
  |                |--Write------------------>|            |            |
  |                |--Write------------------------------>|            |
  |                |            |            |            |            |
  |                |<--ACK------|            |            |            |
  |                |<--ACK-------------------|            |            |
  |                |<--ACK--------------------------------|            |
  |                |            |            |            |            |
  |                | Wait for W=3 ACKs       |            |            |
  |<--Success------|            |            |            |            |
  |                |            |            |            |            |
  |--Read--------->|            |            |            |            |
  |                |--Read----->|            |            |            |
  |                |--Read------------------>|            |            |
  |                |<--Data-----|            |            |            |
  |                |<--Data------------------|            |            |
  |                |            |            |            |            |
  |                | Read from R=3 nodes, return latest  |            |
  |<--Data---------|            |            |            |            |

Pros:

  • Flexible Consistency: Adjust W and R for consistency/availability trade-off
  • Fault Tolerance: Can tolerate (N-W) write failures, (N-R) read failures
  • Tunable Performance: Lower W for faster writes, lower R for faster reads

Cons:

  • Consistency Complexity: Must resolve conflicts when R < N
  • Coordination Overhead: Requires coordination for reads and writes
  • Stale Reads Possible: If R < N, may read stale data

When to Use:

  • Need flexibility in consistency/availability trade-off
  • Variable workload (read-heavy vs write-heavy)
  • Acceptable to tune W and R based on requirements

Real-World Example: DynamoDB, Cassandra (tunable consistency), Riak.
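
A minimal sketch of the quorum condition in code, assuming in-memory replicas and a single coordinator-assigned version counter (a real system would use per-key versions or vector clocks and would handle replica failures).

import random

class QuorumStore:
    def __init__(self, n=5, w=3, r=3):
        assert w + r > n, "read and write quorums must overlap"
        self.n, self.w, self.r = n, w, r
        self.replicas = [dict() for _ in range(n)]
        self.version = 0

    def put(self, key, value):
        self.version += 1
        for replica in random.sample(self.replicas, self.w):   # any W replicas ACK
            replica[key] = (self.version, value)
        return "OK"

    def get(self, key):
        targets = random.sample(self.replicas, self.r)          # any R replicas answer
        copies = [rep[key] for rep in targets if key in rep]
        return max(copies)[1]                                    # highest version wins

store = QuorumStore()
store.put("k", "v1")
store.put("k", "v2")
print(store.get("k"))   # "v2": the overlap guarantees at least one fresh copy is read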

Approach 6: Primary-Backup with Log Shipping

How It Works: Primary writes to log, backups replay log asynchronously.

Client          Primary        Backup 1        Backup 2
------          -------        --------        --------
  |--Write------->|                |                |
  |                | Append to log  |                |
  |                | Apply locally  |                |
  |<--Success------|                |                |
  |                |                |                |
  |                |--Ship log---------------------->|
  |                |--Ship log------------------------------>|
  |                |                |                |
  |                |                | Replay log      |
  |                |                | Replay log      |

Pros:

  • Simple Model: Easy to understand and implement
  • Low Primary Overhead: Primary not blocked by backups
  • Point-in-Time Recovery: Can replay log to any point

Cons:

  • Lag Risk: Backups may lag significantly
  • Log Management: Must manage log growth and cleanup
  • Recovery Time: Must replay entire log on recovery

When to Use:

  • Simple replication requirements
  • Acceptable backup lag
  • Need point-in-time recovery

Real-World Example: PostgreSQL streaming replication, MySQL replication.
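
A minimal sketch of log shipping, assuming an in-memory log and a pull-style catch-up call; a real system would persist the log and stream it over the network.

class Primary:
    def __init__(self):
        self.state, self.log = {}, []          # log entries: (seqno, key, value)

    def write(self, key, value):
        seqno = len(self.log) + 1
        self.log.append((seqno, key, value))   # append to the write-ahead log
        self.state[key] = value                # apply locally, then ACK the client
        return seqno

class Backup:
    def __init__(self):
        self.state, self.applied = {}, 0       # applied = last seqno replayed

    def catch_up(self, log):
        for seqno, key, value in log[self.applied:]:
            self.state[key] = value            # replay in log order
            self.applied = seqno

primary, backup = Primary(), Backup()
primary.write("a", 1)
primary.write("b", 2)
backup.catch_up(primary.log)                   # shipped asynchronously in practice
print(backup.state)                            # {'a': 1, 'b': 2}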

Hybrid Approaches

Hot-Cold Replication: Synchronous for hot data, asynchronous for cold data.

Tiered Replication: Different replication factors for different data tiers.

Geographic Replication: Synchronous within region, asynchronous across regions.

Decision Matrix:

Strategy          Write Latency   Read Latency   Consistency   Availability   Complexity
--------          -------------   ------------   -----------   ------------   ----------
Synchronous       High            Low            Strong        Medium         Low
Asynchronous      Low             Low            Eventual      High           Medium
Chain             High            Low            Strong        Medium         Medium
CRAQ              High            Low            Strong        Medium         High
Quorum            Medium          Medium         Tunable       High           High
Primary-Backup    Low             Low            Eventual      High           Low

Erasure Coding: Engineering Trade-offs

Erasure coding provides durability with lower storage overhead than replication, but introduces computational complexity.

Reed-Solomon Encoding

How It Works: Divide data into k fragments, generate m parity fragments. Can reconstruct from any k fragments.

Storage Overhead: (k+m)/k - 1 = m/k

Example: RS(10,4):

  • 10 data fragments + 4 parity = 14 total fragments
  • Can lose any 4 fragments
  • Storage overhead: 40% (vs 200% for 3x replication)
  • Need any 10 of the 14 fragments (≈ 71%) to reconstruct
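
As a quick sanity check on these numbers, here is a small, illustrative helper that computes the overhead and reconstruction threshold for any RS(k, m) layout using the formula above.

def erasure_profile(k, m):
    total = k + m
    return {
        "fragments": total,
        "tolerates_failures": m,                  # any m fragments can be lost
        "storage_overhead_pct": 100.0 * m / k,
        "fraction_needed_to_rebuild": k / total,
    }

print(erasure_profile(10, 4))   # RS(10,4): 40% overhead, survives any 4 losses
print(erasure_profile(6, 3))    # RS(6,3): 50% overhead, survives any 3 losses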

Pros:

  • Low Storage Overhead: Much better than replication
  • High Durability: Can tolerate multiple failures
  • Mature Technology: Well-understood algorithms

Cons:

  • Computational Cost: Encoding/decoding requires CPU
  • Reconstruction Overhead: Must read k fragments to reconstruct
  • Fixed Overhead: Cannot adjust durability without re-encoding

When to Use:

  • Cold data (low access frequency)
  • Cost-sensitive scenarios
  • Acceptable reconstruction latency

Local Reconstruction Codes (LRC)

How It Works: Combines local and global parity. Local parity enables fast reconstruction of single failures.

Example: LRC(12,2,2):

  • 12 data fragments
  • 2 local parity fragments (each protects 6 data fragments)
  • 2 global parity fragments (protect all data)
  • Can reconstruct single failure using only local parity (fast)
  • Can reconstruct multiple failures using global parity (slower)

Pros:

  • Fast Single Failure Recovery: Only need local fragments
  • Lower Storage Overhead: Than replication
  • Flexible: Can optimize for common case (single failure)

Cons:

  • More Complex: Encoding layout and repair paths are more involved than Reed-Solomon
  • Limited Multi-Failure Recovery: Failures beyond what local parity covers fall back to the global parity, which bounds how many failures can be tolerated

When to Use:

  • Expect mostly single failures
  • Need fast recovery for common case
  • Acceptable complexity

Real-World Example: Azure Storage uses LRC.

Regenerating Codes

How It Works: Reduces bandwidth during reconstruction by allowing partial reconstruction.

Pros:

  • Bandwidth Efficient: Less data transfer during reconstruction
  • Network Optimized: Reduces network usage

Cons:

  • High Complexity: Very complex to implement
  • Computational Overhead: More CPU intensive
  • Limited Adoption: Less mature than RS

When to Use:

  • Network bandwidth is primary concern
  • Acceptable complexity and CPU cost

Decision: Replication vs Erasure Coding

Use Replication When:

  • Hot data (frequent access)
  • Low latency requirements
  • Simple operations needed
  • Acceptable storage cost

Use Erasure Coding When:

  • Cold data (infrequent access)
  • Cost-sensitive
  • Acceptable reconstruction latency
  • High durability needed with lower cost

Hybrid Approach:

  • Hot data: 3x replication (fast access)
  • Warm data: Erasure coding with local reconstruction (balanced)
  • Cold data: Erasure coding with high parity (cost-optimized)
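
In code, such a tiering policy can be as simple as a lookup keyed on access temperature. The thresholds and scheme labels below are made-up placeholders, not recommendations.

def redundancy_scheme(reads_per_day: float) -> str:
    if reads_per_day > 100:        # hot: optimize for access latency
        return "replication(3x)"
    if reads_per_day > 1:          # warm: balance cost and repair speed
        return "lrc(12,2,2)"
    return "reed_solomon(10,4)"    # cold: optimize for storage cost

for rate in (500, 10, 0.01):
    print(rate, "->", redundancy_scheme(rate))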

Metadata Architecture: Multiple Approaches

Metadata management is often the bottleneck. Different architectures offer different scalability and consistency characteristics.

Approach 1: Centralized Metadata Service

Architecture: Single metadata service (possibly replicated) handles all metadata operations.

Clients          Metadata Service (Replicated)
------          -----------------------------
  |--Get metadata->|                            |
  |                | Query metadata DB          |
  |<--Metadata-----|                            |

Pros:

  • Strong Consistency: Single source of truth
  • Simple Model: Easy to reason about
  • Transaction Support: Can support transactions easily

Cons:

  • Scalability Limit: Single service becomes bottleneck
  • Single Point of Failure: If service fails, system unavailable
  • Geographic Limitations: Latency for distant clients

When to Use:

  • Small to medium scale
  • Strong consistency required
  • Acceptable single service limits

Real-World Example: Early versions of some storage systems.

Approach 2: Sharded Metadata Service

Architecture: Metadata partitioned across multiple shards based on object key hash.

Clients          Metadata Shard 1    Metadata Shard 2    Metadata Shard N
------          ----------------    ----------------    ----------------
  |--Get metadata->| (hash % N = 1)                      |
  |                | Query shard 1   |                    |
  |<--Metadata-----|                 |                    |

Sharding Strategies:

Hash-Based Sharding:

  • Hash object key, modulo number of shards
  • Pros: Even distribution, simple
  • Cons: Resharding expensive, hotspots possible

Range-Based Sharding:

  • Partition by key ranges
  • Pros: Easy range queries, simple resharding
  • Cons: Uneven distribution possible, hotspots

Directory-Based Sharding:

  • Partition by bucket or key prefix
  • Pros: Natural isolation, easy management
  • Cons: Uneven distribution, complex routing

Pros:

  • Horizontal Scalability: Add shards to scale
  • Load Distribution: Spread load across shards
  • Fault Isolation: Failure of one shard doesn’t affect others

Cons:

  • Cross-Shard Operations: Complex for operations spanning shards
  • Resharding Complexity: Moving data between shards is expensive
  • Consistency Challenges: Maintaining consistency across shards

When to Use:

  • Large scale
  • Acceptable eventual consistency
  • Need horizontal scalability

Real-World Example: DynamoDB (hash-based), MongoDB (range-based).
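
A minimal sketch contrasting the first two sharding strategies above, hash-based and range-based routing; the shard count and range boundaries are arbitrary examples.

import bisect, hashlib

NUM_SHARDS = 8

def hash_shard(key: str) -> int:
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS            # even spread, but resharding moves most keys

RANGE_BOUNDS = ["g", "n", "t"]                     # shard 0: keys < "g", shard 1: < "n", ...

def range_shard(key: str) -> int:
    return bisect.bisect_right(RANGE_BOUNDS, key)  # keeps adjacent keys together for range scans

print(hash_shard("bucket-42/object.png"))
print(range_shard("bucket-42/object.png"))         # "b..." falls in the first range (shard 0)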

Approach 3: Distributed Hash Table (DHT)

Architecture: Metadata distributed using DHT (like Chord, Pastry, Kademlia).

Clients          DHT Node 1      DHT Node 2      DHT Node 3
------          ----------      ----------      ----------
  |--Get metadata->|                |                |
  |                | Route via DHT  |                |
  |                |--Forward------>|                |
  |                |                |--Forward------>|
  |                |                |                |
  |                |                |<--Metadata-----|
  |                |<--Metadata-----|                |
  |<--Metadata-----|                |                |

Pros:

  • Fully Distributed: No central coordinator
  • Self-Organizing: Nodes join/leave automatically
  • Fault Tolerant: Handles node failures gracefully

Cons:

  • Complex Routing: DHT routing can be complex
  • Eventual Consistency: May have temporary inconsistencies
  • Lookup Latency: Multiple hops for lookups

When to Use:

  • Very large scale
  • Acceptable lookup latency
  • Need fully distributed system

Real-World Example: Ceph (CRUSH algorithm), Cassandra (consistent hashing).

Approach 4: Metadata Co-located with Data

Architecture: Each storage node maintains metadata for objects it stores.

Clients          Storage Node 1    Storage Node 2    Storage Node 3
------          --------------    --------------    --------------
  |--Get object--->| (has metadata)                    |
  |                | Check local metadata              |
  |                | Serve object                      |
  |<--Object-------|                                   |

Pros:

  • No Separate Metadata Service: Simpler architecture
  • Locality: Metadata near data
  • Scalability: Scales with storage nodes

Cons:

  • Discovery Complexity: Must find which node has object
  • Consistency Challenges: Maintaining metadata consistency
  • Query Limitations: Cross-node queries expensive

When to Use:

  • Object-to-node mapping is stable
  • Acceptable discovery overhead
  • Need simplicity

Real-World Example: Ceph OSDs maintain object metadata locally.

Approach 5: Hierarchical Metadata

Architecture: Metadata organized hierarchically (bucket → object → versions).

Metadata Hierarchy:
Root
├── Bucket 1
│   ├── Object 1
│   │   ├── Version 1
│   │   └── Version 2
│   └── Object 2
└── Bucket 2

Pros:

  • Natural Organization: Matches logical structure
  • Efficient Queries: Can query by bucket efficiently
  • Isolation: Bucket-level isolation

Cons:

  • Hierarchy Maintenance: Must maintain hierarchy
  • Cross-Bucket Operations: Complex
  • Scalability: Hierarchy depth limits scalability

When to Use:

  • Natural hierarchy exists
  • Need efficient bucket-level operations
  • Acceptable hierarchy limitations

Decision: Choosing Metadata Architecture

Factors to Consider:

  1. Scale: How many objects? (millions vs billions)
  2. Consistency: Strong vs eventual?
  3. Query Patterns: Point lookups vs range queries?
  4. Update Frequency: How often does metadata change?
  5. Geographic Distribution: Single region vs global?

Recommendation Matrix:

Scale        Consistency   Query Pattern   Best Approach
-----        -----------   -------------   -------------
Small        Strong        Point           Centralized
Medium       Strong        Point/Range     Sharded
Large        Eventual      Point           DHT
Large        Eventual      Range           Sharded (range-based)
Very Large   Eventual      Point           Metadata co-located

Data Placement Algorithms

How we place data across nodes determines load balance, fault tolerance, and performance.

Approach 1: Consistent Hashing

How It Works: Map nodes and objects to a hash ring. Object placed on first node clockwise.

Hash Ring (0x0000 to 0xFFFF, wrapping back to 0x0000):

              0x0000
                |
    0xC000 ---- + ---- 0x4000
                |
              0x8000

Nodes: A(0x2000), B(0x6000), C(0xA000)
Object hash: 0x5000 → placed on Node B (first node clockwise)

Pros:

  • Minimal Rebalancing: Only objects near added/removed node move
  • Even Distribution: With good hash function
  • Simple: Easy to understand and implement

Cons:

  • Uneven Load: Without virtual nodes, load can be uneven
  • Hotspots: Popular objects create hotspots
  • Fixed Mapping: Cannot easily move objects

Improvements:

  • Virtual Nodes: Each physical node has multiple virtual nodes on ring
  • Weighted Nodes: Assign more virtual nodes to powerful nodes

Real-World Example: DynamoDB, Riak, Cassandra.
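
A minimal consistent-hash ring with virtual nodes, assuming MD5 as the ring hash and 64 virtual nodes per physical node (both arbitrary choices for illustration).

import bisect, hashlib

class HashRing:
    def __init__(self, nodes, vnodes=64):
        self.ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes for i in range(vnodes)
        )
        self.points = [point for point, _ in self.ring]

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def node_for(self, key: str) -> str:
        idx = bisect.bisect(self.points, self._hash(key)) % len(self.ring)
        return self.ring[idx][1]        # first virtual node clockwise from the key

ring = HashRing(["node-a", "node-b", "node-c"])
print(ring.node_for("bucket-1/photo.jpg"))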

Approach 2: CRUSH (Controlled Replication Under Scalable Hashing)

How It Works: Pseudo-random placement function considering node weights, failure domains, and placement rules.

Key Features:

  • Deterministic: Same input always produces same output
  • Weighted: Considers node capacity/performance
  • Failure Domain Aware: Ensures replicas in different failure domains
  • Rule-Based: Flexible placement rules

Example Rule: “Place 3 replicas across 3 different racks in same data center”

Pros:

  • Flexible: Can express complex placement policies
  • Deterministic: No need to store placement information
  • Failure Domain Aware: Better fault tolerance
  • Weighted: Can handle heterogeneous nodes

Cons:

  • Complex: More complex than consistent hashing
  • Rule Complexity: Must design placement rules carefully
  • Rebalancing: Still requires rebalancing on topology changes

Real-World Example: Ceph uses CRUSH extensively.
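
The sketch below is not the actual CRUSH algorithm; it only illustrates its two core ideas: placement is computed deterministically from the object name plus a cluster map (nothing is stored in a lookup table), and replicas are forced into distinct failure domains, racks in this made-up topology.

import hashlib

TOPOLOGY = {                      # failure domain (rack) -> nodes in that rack
    "rack-1": ["n1", "n2"],
    "rack-2": ["n3", "n4"],
    "rack-3": ["n5", "n6"],
}

def _score(obj: str, item: str) -> int:
    return int(hashlib.md5(f"{obj}:{item}".encode()).hexdigest(), 16)

def place(obj: str, replicas: int = 3):
    # Deterministically pick `replicas` distinct racks, then one node per rack.
    racks = sorted(TOPOLOGY, key=lambda r: _score(obj, r), reverse=True)[:replicas]
    return [max(TOPOLOGY[r], key=lambda n: _score(obj, n)) for r in racks]

print(place("bucket-1/video.mp4"))   # three nodes, one from each rack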

Approach 3: Rendezvous Hashing (Highest Random Weight)

How It Works: For each object, compute hash(node_id + object_key) for all nodes. Place on node with highest hash value.

Pros:

  • No Virtual Nodes: Simpler than consistent hashing with virtual nodes
  • Even Distribution: Good load distribution
  • Minimal Rebalancing: Only objects on added/removed node move

Cons:

  • O(N) Lookup: Must check all nodes (can cache)
  • Topology Changes: Every key's winning node must be recomputed when nodes join or leave, even though only about 1/N of objects actually move

When to Use:

  • Small to medium number of nodes
  • Need even distribution
  • Acceptable O(N) lookup
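
A minimal sketch of highest-random-weight selection, with MD5 standing in for the scoring hash; the top-k scorers give k replica locations.

import hashlib

def hrw_nodes(key: str, nodes, k: int = 1):
    def score(node):
        return int(hashlib.md5(f"{node}:{key}".encode()).hexdigest(), 16)
    return sorted(nodes, key=score, reverse=True)[:k]

nodes = ["node-a", "node-b", "node-c", "node-d"]
print(hrw_nodes("bucket-1/doc.pdf", nodes))        # primary placement
print(hrw_nodes("bucket-1/doc.pdf", nodes, k=3))   # three replicas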

Approach 4: Distributed Hash Table (DHT) Placement

How It Works: Objects placed on nodes responsible for their key in DHT.

Pros:

  • Fully Distributed: No central coordinator
  • Self-Organizing: Handles node joins/leaves
  • Scalable: Scales to very large number of nodes

Cons:

  • Routing Overhead: Multiple hops for placement
  • Churn Sensitivity: Frequent node changes cause churn
  • Complexity: DHT algorithms are complex

When to Use:

  • Very large scale
  • Highly dynamic environment
  • Acceptable routing overhead

Approach 5: Manual Placement with Load Balancing

How It Works: Administrator or system assigns objects to nodes, load balancer distributes requests.

Pros:

  • Full Control: Complete control over placement
  • Optimization: Can optimize for specific workloads
  • Predictable: Know exactly where data is

Cons:

  • Manual Effort: Requires manual management
  • Not Scalable: Doesn’t scale to large systems
  • Static: Doesn’t adapt to changes

When to Use:

  • Small systems
  • Special requirements
  • Need full control

Decision: Choosing Placement Algorithm

Factors:

  1. Scale: Number of nodes
  2. Churn: How often nodes join/leave
  3. Heterogeneity: Are nodes identical or different?
  4. Failure Domains: Need to consider racks, data centers?
  5. Rebalancing Cost: How expensive is moving data?

Recommendation:

  • Small Scale (<100 nodes): Rendezvous hashing
  • Medium Scale (100-1000 nodes): Consistent hashing with virtual nodes
  • Large Scale (>1000 nodes): CRUSH or DHT
  • Heterogeneous Nodes: CRUSH (weighted)
  • Failure Domain Awareness: CRUSH

Consistency Models: Engineering Choices

Different consistency models offer different guarantees and performance characteristics.

Strong Consistency

Guarantee: All clients see same data at same time.

Implementation:

  • Synchronous replication
  • Quorum reads and writes
  • Linearizable operations

Pros:

  • Correctness: Always correct data
  • Simple Programming Model: No need to handle inconsistencies

Cons:

  • Lower Availability: Unavailable during partitions
  • Higher Latency: Must coordinate across nodes
  • Lower Throughput: Coordination limits throughput

When to Use:

  • Critical data correctness
  • Acceptable availability trade-off
  • Low write throughput acceptable

Eventual Consistency

Guarantee: System will eventually converge to consistent state.

Implementation:

  • Asynchronous replication
  • Conflict resolution (last-write-wins, vector clocks)
  • Read repair

Pros:

  • High Availability: Available during partitions
  • Low Latency: No coordination needed
  • High Throughput: No coordination overhead

Cons:

  • Temporary Inconsistencies: Clients may see stale data
  • Complex Programming: Must handle inconsistencies
  • Conflict Resolution: Must resolve conflicts

When to Use:

  • High availability required
  • Acceptable temporary inconsistencies
  • High throughput needed

Causal Consistency

Guarantee: Causally related operations seen in order by all clients.

Implementation:

  • Vector clocks to track causality
  • Order causally related operations
  • Allow concurrent operations to be seen in different orders

Pros:

  • Stronger than Eventual: Preserves causality
  • Weaker than Strong: More available than strong consistency

Cons:

  • Vector Clock Overhead: Must maintain vector clocks
  • Complexity: More complex than eventual consistency

When to Use:

  • Need causality preservation
  • Acceptable complexity
  • Need better availability than strong consistency
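
A minimal vector-clock sketch showing increment, merge, and the happens-before check that underpins causal ordering; replica names are illustrative.

def vc_increment(clock: dict, replica: str) -> dict:
    out = dict(clock)
    out[replica] = out.get(replica, 0) + 1
    return out

def vc_merge(a: dict, b: dict) -> dict:
    return {k: max(a.get(k, 0), b.get(k, 0)) for k in set(a) | set(b)}

def vc_happens_before(a: dict, b: dict) -> bool:
    keys = set(a) | set(b)
    return all(a.get(k, 0) <= b.get(k, 0) for k in keys) and a != b

v1 = vc_increment({}, "r1")                 # {'r1': 1}
v2 = vc_increment(v1, "r2")                 # causally after v1
v3 = vc_increment(v1, "r1")                 # concurrent with v2
print(vc_happens_before(v1, v2))            # True
print(vc_happens_before(v2, v3), vc_happens_before(v3, v2))   # False False: concurrent
v_sync = vc_merge(v2, v3)                   # replication merges the two histories
print(vc_happens_before(v2, v_sync), vc_happens_before(v3, v_sync))   # True True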

Read-Your-Writes Consistency

Guarantee: Client always sees its own writes.

Implementation:

  • Track client’s write history
  • Route reads to nodes with client’s writes
  • Session affinity

Pros:

  • User Experience: Users see their own updates immediately
  • Weaker than Strong: More available

Cons:

  • Per-Client State: Must track client state
  • Routing Complexity: Must route based on client

When to Use:

  • User-facing applications
  • Need users to see their updates
  • Acceptable per-client tracking
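
A minimal sketch of the session-token approach, assuming a single writer per session and a simple counter standing in for real version numbers.

class Replica:
    def __init__(self):
        self.data, self.version = {}, 0

    def apply(self, key, value, version):
        self.data[key], self.version = value, version

class Session:
    def __init__(self, replicas):
        self.replicas, self.last_written = replicas, 0

    def write(self, key, value):
        self.last_written += 1                                  # remember this write's version
        self.replicas[0].apply(key, value, self.last_written)   # primary applies it

    def read(self, key):
        for replica in self.replicas:          # skip replicas that lag this session
            if replica.version >= self.last_written:
                return replica.data.get(key)
        raise RuntimeError("no replica has caught up to this session yet")

replicas = [Replica(), Replica()]
session = Session(replicas)
session.write("profile", "v2")
print(session.read("profile"))   # "v2": served by a replica at or beyond the session token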

Monotonic Reads

Guarantee: Client never sees older data after seeing newer data.

Implementation:

  • Track client’s read history
  • Ensure reads are from nodes with at least same version

Pros:

  • Prevents Regression: Users don’t see data go backwards
  • Weaker than Strong: More available

Cons:

  • Per-Client Tracking: Must track client read history

When to Use:

  • User-facing applications
  • Need to prevent seeing older data
  • Acceptable tracking overhead

Decision: Choosing Consistency Model

Requirements Analysis:

  1. Correctness Requirements: How critical is correctness?
  2. Availability Requirements: How important is availability?
  3. Latency Requirements: What latency is acceptable?
  4. Throughput Requirements: What throughput is needed?
  5. Application Complexity: Can application handle inconsistencies?

Recommendation:

  • Critical Financial Data: Strong consistency
  • User Content (Social Media): Eventual consistency
  • User Profiles: Read-your-writes or causal consistency
  • Configuration Data: Strong consistency
  • Logs/Metrics: Eventual consistency

System Architecture Comparison

Let’s compare how different real-world systems make these engineering decisions:

Amazon S3 Architecture

  • Metadata: DynamoDB-like sharded service
  • Replication: 3x synchronous within region
  • Placement: Consistent hashing
  • Consistency: Read-after-write for new objects, eventual for overwrites
  • Erasure Coding: Not used (replication only)

Design Philosophy: Simplicity, durability, eventual consistency acceptable.

Ceph Architecture

  • Metadata: Distributed across OSDs (Object Storage Daemons)
  • Replication: Configurable (typically 3x)
  • Placement: CRUSH algorithm
  • Consistency: Strong consistency with configurable replication
  • Erasure Coding: Supported via erasure-coded pools

Design Philosophy: Flexibility, self-healing, no single point of failure.

Google Cloud Storage

  • Metadata: Spanner (strongly consistent)
  • Replication: Multi-regional or regional
  • Placement: Google’s internal placement
  • Consistency: Strong consistency options
  • Erasure Coding: Used for durability

Design Philosophy: Strong consistency, global scale, integration with Google services.

Azure Blob Storage

  • Metadata: Azure Table Storage
  • Replication: Geo-redundant options
  • Placement: Azure’s internal placement
  • Consistency: Strong consistency
  • Erasure Coding: Local Reconstruction Codes (LRC)

Design Philosophy: Strong consistency, cost optimization via LRC, Azure integration.

MinIO Architecture

  • Metadata: Embedded (simpler architecture)
  • Replication: Erasure coding (Reed-Solomon)
  • Placement: Consistent hashing
  • Consistency: Strong consistency
  • Erasure Coding: Primary mechanism (not replication)

Design Philosophy: Simplicity, erasure coding focus, S3 API compatibility.

Building Our System: Design Decisions

Given the exploration above, here’s how we might design our system:

Phase 1: Core Requirements

  • Durability: 11 nines (99.999999999%)
  • Availability: 99.99% (4 nines)
  • Consistency: Read-after-write for new objects, eventual for overwrites
  • Scale: Exabytes of data, millions of requests per second

Phase 2: Architecture Decisions

Metadata Architecture: Sharded metadata service

  • Rationale: Need horizontal scalability, eventual consistency acceptable
  • Sharding: Hash-based sharding for even distribution
  • Replication: 3x replication for metadata service itself

Replication Strategy: Hybrid approach

  • Hot Data: 3x synchronous replication (low latency, high durability)
  • Warm Data: 3x asynchronous replication (balance)
  • Cold Data: Erasure coding RS(10,4) (cost optimization)

Placement Algorithm: CRUSH-like algorithm

  • Rationale: Need failure domain awareness, weighted nodes
  • Failure Domains: Rack → Data Center → Region
  • Placement Rules: 3 replicas across 3 different racks

Consistency Model: Read-after-write + eventual

  • New Objects: Read-after-write consistency (synchronous replication)
  • Overwrites: Eventual consistency (asynchronous replication)
  • Rationale: Balance correctness and availability

Phase 3: Component Design

API Gateway:

  • Authentication/authorization
  • Rate limiting
  • Request routing to appropriate shard

Metadata Service:

  • Sharded key-value store
  • Consistent hashing for sharding
  • 3x replication per shard
  • Caching layer for hot metadata

Data Storage Service:

  • Storage nodes organized in failure domains
  • CRUSH-like placement
  • Hybrid replication/erasure coding
  • Self-healing (detect and repair failures)

Monitoring:

  • Comprehensive metrics
  • Alerting on SLA violations
  • Capacity planning

Trade-offs Summary

Every engineering decision involves trade-offs:

Replication vs Erasure Coding:

  • Replication: Simpler, faster, more expensive
  • Erasure Coding: Complex, slower reconstruction, cheaper

Strong vs Eventual Consistency:

  • Strong: Correct but less available
  • Eventual: Available but may be inconsistent

Centralized vs Distributed Metadata:

  • Centralized: Simple but doesn’t scale
  • Distributed: Complex but scales

Synchronous vs Asynchronous Replication:

  • Synchronous: Durable but slower
  • Asynchronous: Fast but may lose data

Consistent Hashing vs CRUSH:

  • Consistent Hashing: Simple but less flexible
  • CRUSH: Complex but more flexible

Conclusion

Designing a planet-scale distributed object storage system requires making fundamental engineering decisions at every layer. There are no universally correct answers - each decision depends on requirements, constraints, and trade-offs.

The key is understanding:

  1. Multiple Approaches: For each problem, there are multiple solutions
  2. Trade-offs: Each solution has pros and cons
  3. Requirements: Decisions must align with requirements
  4. Real-World Constraints: Consider operational complexity, cost, performance

By exploring different approaches, understanding their trade-offs, and making informed decisions based on requirements, we can design systems that meet our specific needs while understanding the implications of our choices.

The systems we examined (S3, Ceph, GCS, Azure, MinIO) each made different choices based on their priorities. There’s no single “best” architecture - only architectures that best fit specific requirements and constraints.