Designing a Planet-Scale Distributed Object Storage System: Engineering Deep Dive
Building a planet-scale distributed object storage system from scratch requires making fundamental engineering decisions at every layer. Rather than describing an existing system, we must explore multiple approaches, understand their trade-offs, and make informed choices based on requirements. This deep dive examines the engineering challenges, design alternatives, and architectural decisions needed to build such a system.
Core Design Challenges
Before diving into solutions, we must understand the fundamental challenges:
Challenge 1: Durability vs Cost: How do we ensure data survives multiple failures without excessive storage overhead?
Challenge 2: Consistency vs Availability: How do we balance correctness with system availability during failures?
Challenge 3: Metadata Scalability: How do we scale metadata operations to handle billions of objects?
Challenge 4: Data Placement: How do we distribute data across nodes while maintaining load balance and fault tolerance?
Challenge 5: Performance vs Consistency: How do we optimize for low latency while maintaining correctness guarantees?
Each challenge has multiple solutions, each with distinct trade-offs. Let’s explore them systematically.
Replication Strategies: A Deep Dive
Replication is fundamental to durability and availability. Different strategies offer different guarantees, performance characteristics, and complexity.
Approach 1: Synchronous Replication
How It Works: Write is acknowledged only after data is replicated to N nodes.
Client Coordinator Node 1 Node 2 Node 3
------ ---------- ------ ------ ------
|--Write------->| | | |
| |--Replicate---->| | |
| |--Replicate------------------>| |
| |--Replicate------------------------------>|
| | | | |
| |<--ACK----------| | |
| |<--ACK-----------------------| |
| |<--ACK-------------------------------------|
| | | | |
| | Wait for N ACKs (e.g., N=3) | |
|<--Success------| | | |
Pros:
- Strong Durability: Data guaranteed to be on N nodes before acknowledgment
- No Data Loss: If client receives success, data is safely stored
- Consistent View: All replicas have same data immediately
Cons:
- High Write Latency: Must wait for slowest replica (tail latency problem)
- Lower Availability: System unavailable if N nodes unavailable
- Network Sensitivity: Performance degrades with network issues
When to Use:
- Critical data requiring strong durability guarantees
- Low write throughput acceptable
- Strong consistency requirements
Real-World Example: Financial transaction logs, critical metadata.
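To make the fan-out-and-wait pattern concrete, here is a minimal Python sketch, assuming a hypothetical `send_to_replica` RPC helper: the coordinator acknowledges the client only after every replica has confirmed the write, which is exactly where the tail-latency problem comes from.

```python
import concurrent.futures
from dataclasses import dataclass

@dataclass
class WriteResult:
    node: str
    ok: bool

def send_to_replica(node: str, key: str, value: bytes) -> WriteResult:
    """Hypothetical transport call: persist the value on one replica node."""
    # ... a real network RPC would go here ...
    return WriteResult(node=node, ok=True)

def synchronous_write(nodes: list[str], key: str, value: bytes,
                      timeout_s: float = 2.0) -> bool:
    """Acknowledge the client only after all N replicas have persisted the write."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        futures = [pool.submit(send_to_replica, n, key, value) for n in nodes]
        try:
            results = [f.result(timeout=timeout_s) for f in futures]
        except concurrent.futures.TimeoutError:
            return False  # one slow replica blocks the whole write (tail latency)
    return all(r.ok for r in results)
```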
Approach 2: Asynchronous Replication
How It Works: Write acknowledged immediately, replication happens in background.
Client Primary Node Secondary 1 Secondary 2
------ ----------- ----------- -----------
|--Write------->| | |
| | Write locally | |
|<--Success------| | |
| | | |
| |--Async replicate-------------->|
| |--Async replicate----------------------->|
| | | |
| | (Background replication) |
Pros:
- Low Write Latency: Immediate acknowledgment
- High Throughput: Not limited by replication speed
- Better Availability: System remains available even if replicas lag
Cons:
- Potential Data Loss: If primary fails before replication completes
- Eventual Consistency: Replicas may lag behind primary
- Complex Failure Handling: Must handle partial replication on failure
When to Use:
- High write throughput requirements
- Acceptable to lose recent writes in failure scenarios
- Eventual consistency acceptable
Real-World Example: Log aggregation, metrics collection, non-critical data.
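A minimal sketch of the same idea, with assumed in-memory storage and a placeholder `_send` RPC: the primary acknowledges as soon as its local write succeeds and drains a replication backlog in the background, so a primary failure can lose whatever is still queued.

```python
import queue
import threading

class AsyncReplicator:
    """Sketch of a primary that acknowledges writes immediately and
    replicates to secondaries from a background queue (helpers are assumed)."""

    def __init__(self, secondaries: list[str]):
        self.secondaries = secondaries
        self.local_store: dict[str, bytes] = {}
        self.backlog: "queue.Queue[tuple[str, bytes]]" = queue.Queue()
        threading.Thread(target=self._replicate_loop, daemon=True).start()

    def write(self, key: str, value: bytes) -> bool:
        self.local_store[key] = value      # durable local write (simplified)
        self.backlog.put((key, value))     # replication happens later
        return True                        # acknowledged before replicas have the data

    def _replicate_loop(self) -> None:
        while True:
            key, value = self.backlog.get()
            for node in self.secondaries:
                self._send(node, key, value)  # hypothetical RPC; may lag or retry

    def _send(self, node: str, key: str, value: bytes) -> None:
        pass  # placeholder for the network call to a secondary
```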
Approach 3: Chain Replication
How It Works: Replicas form a chain. Writes flow down the chain, reads from tail.
Client Head Node Middle Node Tail Node
------ --------- ----------- ---------
|--Write------->| | |
| |--Forward------>| |
| | |--Forward------>|
| | | |
| | |<--ACK---------|
| |<--ACK----------| |
|<--Success------| | |
| | | |
|--Read------------------------------------------------->|
| | | |
|<--Data---------| | |
Pros:
- Strong Consistency: Tail always has committed data
- Read Optimization: Reads from tail, no coordination needed
- Ordering Guarantee: Writes processed in order
Cons:
- Write Latency: Must traverse entire chain
- Failure Sensitivity: A single node failure breaks the chain and blocks writes until it is reconfigured
- Recovery Complexity: Must rebuild chain on failure
When to Use:
- Strong consistency requirements
- Read-heavy workloads
- Acceptable write latency
Real-World Example: Strongly consistent key-value stores, configuration management.
Approach 4: CRAQ (Chain Replication with Apportioned Queries)
How It Works: Chain replication where all nodes can serve reads (with version tracking).
Client Head Node Middle Node Tail Node
------ --------- ----------- ---------
|--Write------->| | |
| |--Forward------>| |
| | |--Forward------>|
| | | |
| | |<--ACK---------|
| |<--ACK----------| |
|<--Success------| | |
| | | |
|--Read--------->| (Can read from any node) |
|<--Data---------| | |
Pros:
- Read Scalability: All nodes serve reads
- Strong Consistency: Writes still strongly consistent
- Load Distribution: Reads distributed across chain
Cons:
- Version Tracking Overhead: Must track committed vs uncommitted versions
- Complex Read Logic: Must check version before serving read
- Write Latency: Still limited by chain traversal
When to Use:
- Read-heavy workloads requiring strong consistency
- Need to scale reads independently
- Acceptable write latency
Real-World Example: CRAQ paper implementation, some distributed databases.
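A sketch of the version-tracking read path, assuming a simplified per-key version list and a placeholder tail RPC: clean (committed) versions are served locally, while a dirty version forces a round trip to the tail to learn which version is committed.

```python
from dataclasses import dataclass, field

@dataclass
class VersionedValue:
    version: int
    value: bytes
    clean: bool  # True once the tail has acknowledged this version

@dataclass
class CraqNode:
    """Minimal CRAQ-style read logic: any node may serve reads, but a node
    holding an uncommitted (dirty) version must ask the tail which version
    is committed before answering."""
    store: dict[str, list[VersionedValue]] = field(default_factory=dict)

    def read(self, key: str) -> bytes | None:
        versions = self.store.get(key, [])
        if not versions:
            return None
        latest = versions[-1]
        if latest.clean:
            return latest.value  # safe: the tail has committed this version
        committed = self._ask_tail_for_committed_version(key)
        for v in versions:
            if v.version == committed:
                return v.value
        return None

    def _ask_tail_for_committed_version(self, key: str) -> int:
        return 0  # placeholder for the version query sent to the tail node
```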
Approach 5: Quorum-Based Replication
How It Works: Write to W nodes, read from R nodes, where W + R > N (total replicas).
Client Node 1 Node 2 Node 3 Node 4 Node 5
------ ------ ------ ------ ------ ------
|--Write------->| | | | |
| |--Write---->| | | |
| |--Write------------------>| | |
| |--Write------------------------------>| |
| | | | | |
| |<--ACK------| | | |
| |<--ACK-------------------| | |
| |<--ACK--------------------------------| |
| | | | | |
| | Wait for W=3 ACKs | | |
|<--Success------| | | | |
| | | | | |
|--Read--------->| | | | |
| |--Read----->| | | |
| |--Read------------------>| | |
| |<--Data-----| | | |
| |<--Data------------------| | |
| | | | | |
| | Read from R=3 nodes, return latest | |
|<--Data---------| | | | |
Pros:
- Flexible Consistency: Adjust W and R for consistency/availability trade-off
- Fault Tolerance: Can tolerate (N-W) write failures, (N-R) read failures
- Tunable Performance: Lower W for faster writes, lower R for faster reads
Cons:
- Consistency Complexity: Must detect and reconcile divergent replica versions (e.g., via version numbers or vector clocks)
- Coordination Overhead: Every read and write must contact multiple nodes
- Stale Reads Possible: If W + R ≤ N is chosen, or sloppy quorums are used during failures, reads may return stale data
When to Use:
- Need flexibility in consistency/availability trade-off
- Variable workload (read-heavy vs write-heavy)
- Acceptable to tune W and R based on requirements
Real-World Example: DynamoDB, Cassandra (tunable consistency), Riak.
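The sketch below, with simplified in-process replicas standing in for real nodes, shows the two invariants that matter: rejecting configurations where W + R ≤ N, and returning the highest version among the R replies (failure handling and read repair are omitted).

```python
from dataclasses import dataclass

@dataclass
class Versioned:
    version: int
    value: bytes

class Replica:
    def __init__(self) -> None:
        self.data: dict[str, Versioned] = {}
    def write(self, key: str, item: Versioned) -> bool:
        self.data[key] = item
        return True
    def read(self, key: str) -> Versioned | None:
        return self.data.get(key)

class QuorumStore:
    """Sketch of quorum replication: writes succeed with W acks, reads gather R
    replies and return the highest version seen."""

    def __init__(self, replicas: list[Replica], w: int, r: int):
        if w + r <= len(replicas):
            raise ValueError("need W + R > N for overlapping quorums")
        self.replicas, self.w, self.r = replicas, w, r

    def put(self, key: str, item: Versioned) -> bool:
        acks = sum(1 for rep in self.replicas if rep.write(key, item))
        return acks >= self.w

    def get(self, key: str) -> Versioned | None:
        # A real system contacts all N and takes the first R responses;
        # here we simply query R replicas for brevity.
        replies = [rep.read(key) for rep in self.replicas[: self.r]]
        replies = [x for x in replies if x is not None]
        return max(replies, key=lambda v: v.version, default=None)
```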
Approach 6: Primary-Backup with Log Shipping
How It Works: Primary writes to log, backups replay log asynchronously.
Client Primary Backup 1 Backup 2
------ ------- -------- --------
|--Write------->| | |
| | Append to log | |
| | Apply locally | |
|<--Success------| | |
| | | |
| |--Ship log---------------------->|
| |--Ship log------------------------------>|
| | | |
| | | Replay log |
| | | Replay log |
Pros:
- Simple Model: Easy to understand and implement
- Low Primary Overhead: Primary not blocked by backups
- Point-in-Time Recovery: Can replay log to any point
Cons:
- Lag Risk: Backups may lag significantly
- Log Management: Must manage log growth and cleanup
- Recovery Time: Must replay entire log on recovery
When to Use:
- Simple replication requirements
- Acceptable backup lag
- Need point-in-time recovery
Real-World Example: PostgreSQL streaming replication, MySQL replication.
Hybrid Approaches
Hot-Cold Replication: Synchronous for hot data, asynchronous for cold data.
Tiered Replication: Different replication factors for different data tiers.
Geographic Replication: Synchronous within region, asynchronous across regions.
Decision Matrix:
| Strategy | Write Latency | Read Latency | Consistency | Availability | Complexity |
|---|---|---|---|---|---|
| Synchronous | High | Low | Strong | Medium | Low |
| Asynchronous | Low | Low | Eventual | High | Medium |
| Chain | High | Low | Strong | Medium | Medium |
| CRAQ | High | Low | Strong | Medium | High |
| Quorum | Medium | Medium | Tunable | High | High |
| Primary-Backup | Low | Low | Eventual | High | Low |
Erasure Coding: Engineering Trade-offs
Erasure coding provides durability with lower storage overhead than replication, but introduces computational complexity.
Reed-Solomon Encoding
How It Works: Divide data into k fragments, generate m parity fragments. Can reconstruct from any k fragments.
Storage Overhead: (k+m)/k - 1 = m/k
Example: RS(10,4):
- 10 data fragments + 4 parity = 14 total fragments
- Can lose any 4 fragments
- Storage overhead: 40% (vs 200% for 3x replication)
- Need 10/14 = 71% of fragments to reconstruct
Pros:
- Low Storage Overhead: Much better than replication
- High Durability: Can tolerate multiple failures
- Mature Technology: Well-understood algorithms
Cons:
- Computational Cost: Encoding/decoding requires CPU
- Reconstruction Overhead: Must read k fragments to reconstruct
- Fixed Overhead: Cannot adjust durability without re-encoding
When to Use:
- Cold data (low access frequency)
- Cost-sensitive scenarios
- Acceptable reconstruction latency
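The arithmetic behind the RS(10,4) figures generalizes to any RS(k, m) layout; a small illustrative helper makes the trade-off easy to tabulate.

```python
def erasure_overhead(k: int, m: int) -> dict[str, float]:
    """Storage characteristics of an RS(k, m) layout:
    k data fragments plus m parity fragments, any m losses tolerated."""
    total = k + m
    return {
        "total_fragments": total,
        "tolerated_losses": m,
        "storage_overhead": m / k,            # extra bytes stored per data byte
        "fraction_needed_to_rebuild": k / total,
    }

# RS(10, 4): 14 fragments, 40% overhead, 10/14 ≈ 71% of fragments needed to rebuild
print(erasure_overhead(10, 4))
# For comparison, 3x replication stores 200% extra and tolerates losing 2 copies.
```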
Local Reconstruction Codes (LRC)
How It Works: Combines local and global parity. Local parity enables fast reconstruction of single failures.
Example: LRC(12,2,2):
- 12 data fragments
- 2 local parity fragments (each protects 6 data fragments)
- 2 global parity fragments (protect all data)
- Can reconstruct single failure using only local parity (fast)
- Can reconstruct multiple failures using global parity (slower)
Pros:
- Fast Single-Failure Recovery: Only the local group's fragments are needed
- Lower Storage Overhead: Far cheaper than replication
- Flexible: Can optimize for the common case (a single failure)
Cons:
- Implementation Complexity: More moving parts than Reed-Solomon, since local and global parity must both be maintained
- Limited Multi-Failure Tolerance: For the same parity count, not every combination of multiple failures is recoverable, unlike Reed-Solomon
When to Use:
- Expect mostly single failures
- Need fast recovery for common case
- Acceptable complexity
Real-World Example: Azure Storage uses LRC.
Regenerating Codes
How It Works: Reduces bandwidth during reconstruction by allowing partial reconstruction.
Pros:
- Bandwidth Efficient: Less data transfer during reconstruction
- Network Optimized: Reduces network usage
Cons:
- High Complexity: Very complex to implement
- Computational Overhead: More CPU intensive
- Limited Adoption: Less mature than RS
When to Use:
- Network bandwidth is primary concern
- Acceptable complexity and CPU cost
Decision: Replication vs Erasure Coding
Use Replication When:
- Hot data (frequent access)
- Low latency requirements
- Simple operations needed
- Acceptable storage cost
Use Erasure Coding When:
- Cold data (infrequent access)
- Cost-sensitive
- Acceptable reconstruction latency
- High durability needed with lower cost
Hybrid Approach:
- Hot data: 3x replication (fast access)
- Warm data: Erasure coding with local reconstruction (balanced)
- Cold data: Erasure coding with high parity (cost-optimized)
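A tiering policy like this can be expressed as a simple decision function; the thresholds below are illustrative placeholders, not recommendations.

```python
from enum import Enum

class Redundancy(Enum):
    REPLICATION_3X = "3x replication"
    EC_LOCAL = "erasure coding with local reconstruction (e.g., LRC)"
    EC_HIGH_PARITY = "erasure coding with high parity (e.g., RS(10,4))"

def choose_redundancy(reads_per_day: float, days_since_last_access: int) -> Redundancy:
    """Illustrative hot/warm/cold tiering: hot data is replicated, warm data uses
    locally repairable codes, cold data uses wide erasure coding."""
    if reads_per_day > 100 or days_since_last_access < 7:
        return Redundancy.REPLICATION_3X   # hot: optimize for access latency
    if days_since_last_access < 90:
        return Redundancy.EC_LOCAL         # warm: balanced cost and repair speed
    return Redundancy.EC_HIGH_PARITY       # cold: optimize for storage cost
```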
Metadata Architecture: Multiple Approaches
Metadata management is often the bottleneck. Different architectures offer different scalability and consistency characteristics.
Approach 1: Centralized Metadata Service
Architecture: Single metadata service (possibly replicated) handles all metadata operations.
Clients Metadata Service (Replicated)
------ -----------------------------
|--Get metadata->| |
| | Query metadata DB |
|<--Metadata-----| |
Pros:
- Strong Consistency: Single source of truth
- Simple Model: Easy to reason about
- Transaction Support: Can support transactions easily
Cons:
- Scalability Limit: Single service becomes bottleneck
- Single Point of Failure: If service fails, system unavailable
- Geographic Limitations: Latency for distant clients
When to Use:
- Small to medium scale
- Strong consistency required
- Acceptable single service limits
Real-World Example: Early versions of some storage systems.
Approach 2: Sharded Metadata Service
Architecture: Metadata partitioned across multiple shards based on object key hash.
Clients Metadata Shard 1 Metadata Shard 2 Metadata Shard N
------ ---------------- ---------------- ----------------
|--Get metadata->| (hash % N = 1) |
| | Query shard 1 | |
|<--Metadata-----| | |
Sharding Strategies:
Hash-Based Sharding:
- Hash object key, modulo number of shards
- Pros: Even distribution, simple
- Cons: Resharding expensive, hotspots possible
Range-Based Sharding:
- Partition by key ranges
- Pros: Easy range queries, simple resharding
- Cons: Uneven distribution possible, hotspots
Directory-Based Sharding:
- Partition by bucket or key prefix
- Pros: Natural isolation, easy management
- Cons: Uneven distribution, complex routing
Pros:
- Horizontal Scalability: Add shards to scale
- Load Distribution: Spread load across shards
- Fault Isolation: Failure of one shard doesn’t affect others
Cons:
- Cross-Shard Operations: Complex for operations spanning shards
- Resharding Complexity: Moving data between shards is expensive
- Consistency Challenges: Maintaining consistency across shards
When to Use:
- Large scale
- Acceptable eventual consistency
- Need horizontal scalability
Real-World Example: DynamoDB (hash-based), MongoDB (range-based).
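As a sketch of hash-based shard routing (the shard count and hash choice are illustrative assumptions), every client can compute the owning shard locally, with no lookup service in the path:

```python
import hashlib

def shard_for_key(object_key: str, num_shards: int) -> int:
    """Hash-based sharding: a stable hash of the key, modulo the shard count.
    Changing num_shards remaps most keys, which is why resharding is expensive."""
    digest = hashlib.sha256(object_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards

# Every client computes the same shard for the same key:
shard = shard_for_key("bucket-a/photos/2024/img001.jpg", num_shards=64)
```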
Approach 3: Distributed Hash Table (DHT)
Architecture: Metadata distributed using DHT (like Chord, Pastry, Kademlia).
Clients DHT Node 1 DHT Node 2 DHT Node 3
------ ---------- ---------- ----------
|--Get metadata->| | |
| | Route via DHT | |
| |--Forward------>| |
| | |--Forward------>|
| | | |
| | |<--Metadata-----|
| |<--Metadata-----| |
|<--Metadata-----| | |
Pros:
- Fully Distributed: No central coordinator
- Self-Organizing: Nodes join/leave automatically
- Fault Tolerant: Handles node failures gracefully
Cons:
- Complex Routing: DHT routing can be complex
- Eventual Consistency: May have temporary inconsistencies
- Lookup Latency: Multiple hops for lookups
When to Use:
- Very large scale
- Acceptable lookup latency
- Need fully distributed system
Real-World Example: Ceph (CRUSH algorithm), Cassandra (consistent hashing).
Approach 4: Metadata Co-located with Data
Architecture: Each storage node maintains metadata for objects it stores.
Clients Storage Node 1 Storage Node 2 Storage Node 3
------ -------------- -------------- --------------
|--Get object->| (has metadata) |
| | Check local metadata |
| | Serve object |
|<--Object-------| |
Pros:
- No Separate Metadata Service: Simpler architecture
- Locality: Metadata near data
- Scalability: Scales with storage nodes
Cons:
- Discovery Complexity: Must find which node has object
- Consistency Challenges: Maintaining metadata consistency
- Query Limitations: Cross-node queries expensive
When to Use:
- Object-to-node mapping is stable
- Acceptable discovery overhead
- Need simplicity
Real-World Example: Ceph OSDs maintain object metadata locally.
Approach 5: Hierarchical Metadata
Architecture: Metadata organized hierarchically (bucket → object → versions).
Metadata Hierarchy:
Root
├── Bucket 1
│ ├── Object 1
│ │ ├── Version 1
│ │ └── Version 2
│ └── Object 2
└── Bucket 2
Pros:
- Natural Organization: Matches logical structure
- Efficient Queries: Can query by bucket efficiently
- Isolation: Bucket-level isolation
Cons:
- Hierarchy Maintenance: Must maintain hierarchy
- Cross-Bucket Operations: Complex
- Scalability: Hierarchy depth limits scalability
When to Use:
- Natural hierarchy exists
- Need efficient bucket-level operations
- Acceptable hierarchy limitations
Decision: Choosing Metadata Architecture
Factors to Consider:
- Scale: How many objects? (millions vs billions)
- Consistency: Strong vs eventual?
- Query Patterns: Point lookups vs range queries?
- Update Frequency: How often does metadata change?
- Geographic Distribution: Single region vs global?
Recommendation Matrix:
| Scale | Consistency | Query Pattern | Best Approach |
|---|---|---|---|
| Small | Strong | Point | Centralized |
| Medium | Strong | Point/Range | Sharded |
| Large | Eventual | Point | DHT |
| Large | Eventual | Range | Sharded (range-based) |
| Very Large | Eventual | Point | Metadata co-located |
Data Placement Algorithms
How we place data across nodes determines load balance, fault tolerance, and performance.
Approach 1: Consistent Hashing
How It Works: Map nodes and objects to a hash ring. Object placed on first node clockwise.
Hash Ring (clockwise): 0x0000 → 0x4000 → 0x8000 → 0xC000 → 0xFFFF (wraps back to 0x0000)
Nodes: A(0x2000), B(0x6000), C(0xA000)
Object hash: 0x5000 → Placed on Node B (first node clockwise, at 0x6000)
Pros:
- Minimal Rebalancing: Only objects near added/removed node move
- Even Distribution: With good hash function
- Simple: Easy to understand and implement
Cons:
- Uneven Load: Without virtual nodes, load can be uneven
- Hotspots: Popular objects create hotspots
- Fixed Mapping: Cannot easily move objects
Improvements:
- Virtual Nodes: Each physical node has multiple virtual nodes on ring
- Weighted Nodes: Assign more virtual nodes to powerful nodes
Real-World Example: DynamoDB, Riak, Cassandra.
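A compact sketch of consistent hashing with virtual nodes, using SHA-256 as an assumed hash function: each physical node owns many points on the ring, which evens out load and limits how much data moves when membership changes.

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Consistent hashing with virtual nodes: an object is owned by the first
    virtual node clockwise from its hash."""

    def __init__(self, nodes: list[str], vnodes_per_node: int = 100):
        self.ring: list[tuple[int, str]] = []
        for node in nodes:
            for i in range(vnodes_per_node):
                self.ring.append((self._hash(f"{node}#{i}"), node))
        self.ring.sort()
        self.points = [p for p, _ in self.ring]

    @staticmethod
    def _hash(key: str) -> int:
        return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")

    def node_for(self, object_key: str) -> str:
        h = self._hash(object_key)
        idx = bisect.bisect_right(self.points, h) % len(self.ring)  # wrap around
        return self.ring[idx][1]

ring = ConsistentHashRing(["node-a", "node-b", "node-c"])
owner = ring.node_for("bucket-a/objects/42")
```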
Approach 2: CRUSH (Controlled Replication Under Scalable Hashing)
How It Works: Pseudo-random placement function considering node weights, failure domains, and placement rules.
Key Features:
- Deterministic: Same input always produces same output
- Weighted: Considers node capacity/performance
- Failure Domain Aware: Ensures replicas in different failure domains
- Rule-Based: Flexible placement rules
Example Rule: “Place 3 replicas across 3 different racks in the same data center”
Pros:
- Flexible: Can express complex placement policies
- Deterministic: No need to store placement information
- Failure Domain Aware: Better fault tolerance
- Weighted: Can handle heterogeneous nodes
Cons:
- Complex: More complex than consistent hashing
- Rule Complexity: Must design placement rules carefully
- Rebalancing: Still requires rebalancing on topology changes
Real-World Example: Ceph uses CRUSH extensively.
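The real CRUSH algorithm is considerably more involved; the toy sketch below only illustrates the flavor of deterministic, weighted, failure-domain-aware placement by ranking racks and nodes with per-object hashes. The scoring scheme is a simplification for illustration, not Ceph's actual straw2 bucket selection.

```python
import hashlib

def _score(object_key: str, candidate: str, weight: float) -> float:
    """Deterministic pseudo-random score for (object, candidate), scaled by weight."""
    h = int.from_bytes(
        hashlib.sha256(f"{object_key}|{candidate}".encode()).digest()[:8], "big")
    return (h / 2**64) * weight

def place_replicas(object_key: str,
                   racks: dict[str, dict[str, float]],
                   replicas: int = 3) -> list[str]:
    """Toy CRUSH-flavoured placement: spread `replicas` copies across distinct
    racks, choosing one weighted node inside each selected rack.
    `racks` maps rack name -> {node name: capacity weight}."""
    ranked_racks = sorted(racks, key=lambda r: _score(object_key, r, 1.0), reverse=True)
    chosen = []
    for rack in ranked_racks[:replicas]:
        nodes = racks[rack]
        best = max(nodes, key=lambda n: _score(object_key, n, nodes[n]))
        chosen.append(best)
    return chosen

placement = place_replicas(
    "bucket-a/objects/42",
    racks={"rack1": {"n1": 1.0, "n2": 2.0},
           "rack2": {"n3": 1.0},
           "rack3": {"n4": 1.0, "n5": 1.0}},
)
```

Because the function is deterministic, any client with the same cluster map computes the same placement, so no placement table needs to be stored or looked up.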
Approach 3: Rendezvous Hashing (Highest Random Weight)
How It Works: For each object, compute hash(node_id + object_key) for all nodes. Place on node with highest hash value.
Pros:
- No Virtual Nodes: Simpler than consistent hashing with virtual nodes
- Even Distribution: Good load distribution
- Minimal Rebalancing: Only objects on added/removed node move
Cons:
- O(N) Lookup: Must check all nodes (can cache)
- Topology Changes: Every placement must be re-evaluated against the new node set, even though only objects owned by the affected node actually move
When to Use:
- Small to medium number of nodes
- Need even distribution
- Acceptable O(N) lookup
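Rendezvous hashing is short enough to show in full; the sketch below assumes SHA-256 as the weight function.

```python
import hashlib

def rendezvous_node(object_key: str, nodes: list[str]) -> str:
    """Highest Random Weight: hash (node, key) for every node and pick the maximum.
    Removing a node only reassigns the keys that node was winning."""
    def weight(node: str) -> int:
        return int.from_bytes(
            hashlib.sha256(f"{node}:{object_key}".encode()).digest()[:8], "big")
    return max(nodes, key=weight)

owner = rendezvous_node("bucket-a/objects/42", ["node-a", "node-b", "node-c"])
```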
Approach 4: Distributed Hash Table (DHT) Placement
How It Works: Objects placed on nodes responsible for their key in DHT.
Pros:
- Fully Distributed: No central coordinator
- Self-Organizing: Handles node joins/leaves
- Scalable: Scales to very large number of nodes
Cons:
- Routing Overhead: Multiple hops for placement
- Churn Sensitivity: Frequent node changes cause churn
- Complexity: DHT algorithms are complex
When to Use:
- Very large scale
- Highly dynamic environment
- Acceptable routing overhead
Approach 5: Manual Placement with Load Balancing
How It Works: Administrator or system assigns objects to nodes, load balancer distributes requests.
Pros:
- Full Control: Complete control over placement
- Optimization: Can optimize for specific workloads
- Predictable: Know exactly where data is
Cons:
- Manual Effort: Requires manual management
- Not Scalable: Doesn’t scale to large systems
- Static: Doesn’t adapt to changes
When to Use:
- Small systems
- Special requirements
- Need full control
Decision: Choosing Placement Algorithm
Factors:
- Scale: Number of nodes
- Churn: How often nodes join/leave
- Heterogeneity: Are nodes identical or different?
- Failure Domains: Need to consider racks, data centers?
- Rebalancing Cost: How expensive is moving data?
Recommendation:
- Small Scale (<100 nodes): Rendezvous hashing
- Medium Scale (100-1000 nodes): Consistent hashing with virtual nodes
- Large Scale (>1000 nodes): CRUSH or DHT
- Heterogeneous Nodes: CRUSH (weighted)
- Failure Domain Awareness: CRUSH
Consistency Models: Engineering Choices
Different consistency models offer different guarantees and performance characteristics.
Strong Consistency
Guarantee: All clients see same data at same time.
Implementation:
- Synchronous replication
- Quorum reads and writes
- Linearizable operations
Pros:
- Correctness: Always correct data
- Simple Programming Model: No need to handle inconsistencies
Cons:
- Lower Availability: May become unavailable during network partitions
- Higher Latency: Must coordinate across nodes
- Lower Throughput: Coordination limits throughput
When to Use:
- Critical data correctness
- Acceptable availability trade-off
- Low write throughput acceptable
Eventual Consistency
Guarantee: System will eventually converge to consistent state.
Implementation:
- Asynchronous replication
- Conflict resolution (last-write-wins, vector clocks)
- Read repair
Pros:
- High Availability: Available during partitions
- Low Latency: No coordination needed
- High Throughput: No coordination overhead
Cons:
- Temporary Inconsistencies: Clients may see stale data
- Complex Programming: Must handle inconsistencies
- Conflict Resolution: Must resolve conflicts
When to Use:
- High availability required
- Acceptable temporary inconsistencies
- High throughput needed
Causal Consistency
Guarantee: Causally related operations seen in order by all clients.
Implementation:
- Vector clocks to track causality
- Order causally related operations
- Allow concurrent operations to be seen in different orders
Pros:
- Stronger than Eventual: Preserves causality
- Weaker than Strong: More available than strong consistency
Cons:
- Vector Clock Overhead: Must maintain vector clocks
- Complexity: More complex than eventual consistency
When to Use:
- Need causality preservation
- Acceptable complexity
- Need better availability than strong consistency
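A minimal vector-clock sketch shows the machinery causal consistency relies on: per-replica counters, a merge on receipt, and a comparison that distinguishes "happens-before" from "concurrent".

```python
from collections import defaultdict

class VectorClock:
    """Minimal vector clock: one counter per replica, used to decide whether one
    update causally precedes another or the two are concurrent."""

    def __init__(self) -> None:
        self.counters: dict[str, int] = defaultdict(int)

    def tick(self, replica_id: str) -> None:
        self.counters[replica_id] += 1          # local event or write

    def merge(self, other: "VectorClock") -> None:
        for rid, c in other.counters.items():   # on receiving a remote update
            self.counters[rid] = max(self.counters[rid], c)

    def happens_before(self, other: "VectorClock") -> bool:
        keys = set(self.counters) | set(other.counters)
        return (all(self.counters[k] <= other.counters[k] for k in keys)
                and any(self.counters[k] < other.counters[k] for k in keys))

    def concurrent_with(self, other: "VectorClock") -> bool:
        return not self.happens_before(other) and not other.happens_before(self)
```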
Read-Your-Writes Consistency
Guarantee: Client always sees its own writes.
Implementation:
- Track client’s write history
- Route reads to nodes with client’s writes
- Session affinity
Pros:
- User Experience: Users see their own updates immediately
- Weaker than Strong: More available
Cons:
- Per-Client State: Must track client state
- Routing Complexity: Must route based on client
When to Use:
- User-facing applications
- Need users to see their updates
- Acceptable per-client tracking
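One common implementation is a session token carrying the client's last write version; the sketch below assumes a hypothetical replica API (`get_with_version`) that returns a value together with the version that replica has applied.

```python
class ReplicaStub:
    """Stand-in for a real replica RPC client."""
    def get_with_version(self, key: str) -> tuple[int, bytes | None]:
        return (0, None)  # placeholder: (applied version, value)

class Session:
    """Read-your-writes via a session token: the client remembers the highest
    version it wrote and rejects replies from replicas that lag behind it."""

    def __init__(self) -> None:
        self.last_written_version = 0

    def record_write(self, version: int) -> None:
        self.last_written_version = max(self.last_written_version, version)

    def read(self, key: str, replicas: list[ReplicaStub]) -> bytes | None:
        for replica in replicas:
            version, value = replica.get_with_version(key)
            if version >= self.last_written_version:
                return value  # this replica has seen everything the client wrote
        return None  # fall back: retry, or route the read to the primary
```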
Monotonic Reads
Guarantee: Client never sees older data after seeing newer data.
Implementation:
- Track client’s read history
- Ensure reads are from nodes with at least same version
Pros:
- Prevents Regression: Users don’t see data go backwards
- Weaker than Strong: More available
Cons:
- Per-Client Tracking: Must track client read history
When to Use:
- User-facing applications
- Need to prevent seeing older data
- Acceptable tracking overhead
Decision: Choosing Consistency Model
Requirements Analysis:
- Correctness Requirements: How critical is correctness?
- Availability Requirements: How important is availability?
- Latency Requirements: What latency is acceptable?
- Throughput Requirements: What throughput is needed?
- Application Complexity: Can application handle inconsistencies?
Recommendation:
- Critical Financial Data: Strong consistency
- User Content (Social Media): Eventual consistency
- User Profiles: Read-your-writes or causal consistency
- Configuration Data: Strong consistency
- Logs/Metrics: Eventual consistency
System Architecture Comparison
Let’s compare how different real-world systems make these engineering decisions:
Amazon S3 Architecture
Metadata: DynamoDB-like sharded service
Replication: 3x synchronous within a region
Placement: Consistent hashing
Consistency: Historically read-after-write for new objects and eventual for overwrites; S3 now provides strong read-after-write consistency for all operations
Erasure Coding: Not used (replication only)
Design Philosophy: Simplicity, durability, eventual consistency acceptable.
Ceph Architecture
Metadata: Distributed across OSDs (Object Storage Daemons)
Replication: Configurable (typically 3x)
Placement: CRUSH algorithm
Consistency: Strong consistency with configurable replication
Erasure Coding: Supported via erasure-coded pools
Design Philosophy: Flexibility, self-healing, no single point of failure.
Google Cloud Storage
Metadata: Spanner (strongly consistent)
Replication: Multi-regional or regional
Placement: Google’s internal placement
Consistency: Strong consistency options
Erasure Coding: Used for durability
Design Philosophy: Strong consistency, global scale, integration with Google services.
Azure Blob Storage
Metadata: Azure Table Storage
Replication: Geo-redundant options
Placement: Azure’s internal placement
Consistency: Strong consistency
Erasure Coding: Local Reconstruction Codes (LRC)
Design Philosophy: Strong consistency, cost optimization via LRC, Azure integration.
MinIO Architecture
Metadata: Embedded (simpler architecture)
Replication: Erasure coding (Reed-Solomon)
Placement: Consistent hashing
Consistency: Strong consistency
Erasure Coding: Primary mechanism (not replication)
Design Philosophy: Simplicity, erasure coding focus, S3 API compatibility.
Building Our System: Design Decisions
Given the exploration above, here’s how we might design our system:
Phase 1: Core Requirements
Durability: 11 nines (99.999999999%)
Availability: 99.99% (4 nines)
Consistency: Read-after-write for new objects, eventual for overwrites
Scale: Exabytes of data, millions of requests per second
Phase 2: Architecture Decisions
Metadata Architecture: Sharded metadata service
- Rationale: Need horizontal scalability, eventual consistency acceptable
- Sharding: Hash-based sharding for even distribution
- Replication: 3x replication for metadata service itself
Replication Strategy: Hybrid approach
- Hot Data: 3x synchronous replication (fast reads and strong durability, at the cost of higher write latency)
- Warm Data: 3x asynchronous replication (balance)
- Cold Data: Erasure coding RS(10,4) (cost optimization)
Placement Algorithm: CRUSH-like algorithm
- Rationale: Need failure domain awareness, weighted nodes
- Failure Domains: Rack → Data Center → Region
- Placement Rules: 3 replicas across 3 different racks
Consistency Model: Read-after-write + eventual
- New Objects: Read-after-write consistency (synchronous replication)
- Overwrites: Eventual consistency (asynchronous replication)
- Rationale: Balance correctness and availability
Phase 3: Component Design
API Gateway:
- Authentication/authorization
- Rate limiting
- Request routing to appropriate shard
Metadata Service:
- Sharded key-value store
- Consistent hashing for sharding
- 3x replication per shard
- Caching layer for hot metadata
Data Storage Service:
- Storage nodes organized in failure domains
- CRUSH-like placement
- Hybrid replication/erasure coding
- Self-healing (detect and repair failures)
Monitoring:
- Comprehensive metrics
- Alerting on SLA violations
- Capacity planning
Trade-offs Summary
Every engineering decision involves trade-offs:
Replication vs Erasure Coding:
- Replication: Simpler, faster, more expensive
- Erasure Coding: Complex, slower reconstruction, cheaper
Strong vs Eventual Consistency:
- Strong: Correct but less available
- Eventual: Available but may be inconsistent
Centralized vs Distributed Metadata:
- Centralized: Simple but doesn’t scale
- Distributed: Complex but scales
Synchronous vs Asynchronous Replication:
- Synchronous: Durable but slower
- Asynchronous: Fast but may lose data
Consistent Hashing vs CRUSH:
- Consistent Hashing: Simple but less flexible
- CRUSH: Complex but more flexible
Conclusion
Designing a planet-scale distributed object storage system requires making fundamental engineering decisions at every layer. There are no universally correct answers - each decision depends on requirements, constraints, and trade-offs.
The key is understanding:
- Multiple Approaches: For each problem, there are multiple solutions
- Trade-offs: Each solution has pros and cons
- Requirements: Decisions must align with requirements
- Real-World Constraints: Consider operational complexity, cost, performance
By exploring different approaches, understanding their trade-offs, and making informed decisions based on requirements, we can design systems that meet our specific needs while understanding the implications of our choices.
The systems we examined (S3, Ceph, GCS, Azure, MinIO) each made different choices based on their priorities. There’s no single “best” architecture - only architectures that best fit specific requirements and constraints.