Spotify's Rate Limiting Journey with Envoy: Design, Implementation, and Challenges

Introduction

Rate limiting is a critical component of modern distributed systems, ensuring fair resource allocation and preventing abuse. As Spotify scales to serve millions of users with high traffic volumes, the need for a robust, scalable rate limiting solution becomes paramount. This article explores Spotify's journey in developing a custom rate limiting solution using Envoy, focusing on the design, implementation, and challenges faced during the process.

Technical Background and Existing Solutions

Rate Limiting Fundamentals

Rate limiting involves controlling the number of requests a user can make within a specified time window. It is typically applied at the L7 HTTP layer and can be based on attributes such as IP addresses, user IDs, or request paths. Two key concepts are:

  • Static Buckets: Predefined attributes such as URLs or upstream clusters.
  • Dynamic Buckets: Runtime-determined attributes such as client IP addresses or user IDs (a per-key limiter sketch follows this list).
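
To make the distinction concrete, here is a minimal Go sketch of a per-key token bucket, the kind of structure a dynamic bucket implies: one bucket per runtime key such as a user ID. The capacity and refill rate are arbitrary illustration values, not Spotify's configuration.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// tokenBucket is a simple limiter: up to `capacity` tokens, refilled at `rate` tokens/second.
type tokenBucket struct {
	tokens   float64
	capacity float64
	rate     float64
	last     time.Time
}

func (b *tokenBucket) allow(now time.Time) bool {
	// Refill based on elapsed time, capped at capacity.
	b.tokens += now.Sub(b.last).Seconds() * b.rate
	if b.tokens > b.capacity {
		b.tokens = b.capacity
	}
	b.last = now
	if b.tokens >= 1 {
		b.tokens--
		return true
	}
	return false
}

// limiter holds one bucket per dynamic key (e.g. a user ID).
type limiter struct {
	mu      sync.Mutex
	buckets map[string]*tokenBucket
}

func newLimiter() *limiter {
	return &limiter{buckets: map[string]*tokenBucket{}}
}

func (l *limiter) Allow(key string) bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	b, ok := l.buckets[key]
	if !ok {
		b = &tokenBucket{tokens: 10, capacity: 10, rate: 5, last: time.Now()}
		l.buckets[key] = b
	}
	return b.allow(time.Now())
}

func main() {
	l := newLimiter()
	for i := 0; i < 12; i++ {
		fmt.Println("user-42 allowed:", l.Allow("user-42"))
	}
}
```

Note that the buckets map grows with the number of distinct keys; this is exactly the high-cardinality problem that, as discussed below, makes purely local approaches difficult at Spotify's scale.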

Evaluation of Existing Solutions

Spotify evaluated several existing rate limiting approaches, each with distinct trade-offs:

  • Local Rate Limiting: Offers flexibility but struggles with high-cardinality dynamic buckets (e.g., IP addresses).
  • Global Rate Limiting: Provides cross-Envoy state sharing but requires external storage (e.g., Redis) and suffers from performance and cost limitations.
  • RLQS (Rate Limit Quota Service): Assigns quotas off the request path, which keeps per-request latency low, but it is still in development and introduces significant internal request overhead.
  • Cloud Services/CDN: Managed services simplify deployment but lack support for Envoy-layer attributes such as upstream clusters.

Spotify's Use Cases and Challenges

Key Requirements

Spotify's rate limiting solution must address:

  • High-scale traffic: 1 billion requests daily (roughly 11,500 requests per second on average), with peaks during major events.
  • Dynamic bucket support: Essential for attributes like user IDs.
  • Low latency and high throughput: Critical for maintaining user experience.

Why Existing Solutions Fall Short

  • Local rate limiting cannot handle high-cardinality dynamic buckets.
  • Global rate limiting is operationally complex and costly.
  • RLQS is not production-ready and introduces excessive internal requests.
  • Cloud services lack support for Envoy-layer attributes.

Time Cop: A Custom Rate Limiting Solution

Design Goals

Time Cop was designed to:

  • Reduce system rigidity: Enforce limits only on abnormal traffic rather than imposing strict quotas on every request.
  • Enhance scalability: Avoid hard limits that could impact normal users during legitimate traffic spikes.

Core Architecture

  • Aggregator: Uses consistent hashing to distribute dynamic buckets across aggregators, minimizing data migration (see the sketch after this list).
  • Counting Mechanism: Reports request counts periodically rather than per request, reducing internal request volume.
  • Blocking Actions: Distributes blocking actions via Google Cloud Pub/Sub to all Envoy instances, eliminating direct aggregator connections.
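
The sketch below illustrates the first two points under stated assumptions: bucket keys are mapped to aggregators with a small consistent-hash ring, and counts are flushed on a timer rather than per request. The aggregator addresses, flush interval, and report format are hypothetical, and where the real system would issue an RPC this sketch only prints the routing decision.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
	"sync"
	"time"
)

// ring is a minimal consistent-hash ring mapping bucket keys to aggregator addresses.
type ring struct {
	hashes []uint32
	nodes  map[uint32]string
}

func newRing(aggregators []string) *ring {
	r := &ring{nodes: map[uint32]string{}}
	for _, a := range aggregators {
		h := hash(a)
		r.hashes = append(r.hashes, h)
		r.nodes[h] = a
	}
	sort.Slice(r.hashes, func(i, j int) bool { return r.hashes[i] < r.hashes[j] })
	return r
}

func hash(s string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(s))
	return h.Sum32()
}

// pick returns the aggregator responsible for a bucket key.
func (r *ring) pick(key string) string {
	h := hash(key)
	i := sort.Search(len(r.hashes), func(i int) bool { return r.hashes[i] >= h })
	if i == len(r.hashes) {
		i = 0 // wrap around the ring
	}
	return r.nodes[r.hashes[i]]
}

// counter accumulates per-bucket counts locally and flushes them periodically.
type counter struct {
	mu     sync.Mutex
	counts map[string]int
}

func (c *counter) inc(key string) {
	c.mu.Lock()
	c.counts[key]++
	c.mu.Unlock()
}

// flush drains the accumulated counts and reports each bucket to its aggregator.
func (c *counter) flush(r *ring) {
	c.mu.Lock()
	snapshot := c.counts
	c.counts = map[string]int{}
	c.mu.Unlock()
	for key, n := range snapshot {
		fmt.Printf("report %d requests for %q to %s\n", n, key, r.pick(key))
	}
}

func main() {
	r := newRing([]string{"aggregator-0:8080", "aggregator-1:8080", "aggregator-2:8080"})
	c := &counter{counts: map[string]int{}}
	go func() {
		for range time.Tick(2 * time.Second) {
			c.flush(r)
		}
	}()
	for _, user := range []string{"user-1", "user-2", "user-1"} {
		c.inc(user)
	}
	time.Sleep(3 * time.Second) // allow one flush to run in this demo
}
```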

Comparison with RLQS

  • RLQS requires per-request communication, leading to high internal request volume and data migration challenges.
  • Time Cop mitigates these issues by periodic reporting and consistent hashing.

Envoy Extension Selection and Go Implementation

Language Evaluation

Spotify evaluated several options for Envoy extensions:

  • C++: Mature API but high development barrier and complex build process.
  • Lua: Simple but lacks background processing and custom metrics support.
  • Wasm: Experimental with unclear performance and documentation.
  • Go: Offers background processing, native Envoy metrics, and faster development cycles.

Choosing Go

  • Background Processing: Goroutines enable efficient counting and aggregation off the request path (a sketch follows this list).
  • Native Metrics: Emits native Envoy metrics, which fits the team's monitoring requirements.
  • Development Efficiency: Leverages Spotify's existing Go ecosystem.
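
A minimal sketch of the background-processing pattern mentioned above, assuming a simple in-process event type: the request path performs a non-blocking send onto a buffered channel, and a goroutine aggregates and flushes on a timer, so counting never adds latency to user traffic.

```go
package main

import (
	"fmt"
	"time"
)

// event is a single observed request for a bucket key (e.g. a user ID).
type event struct{ key string }

// startAggregation launches a background goroutine that drains events from a
// buffered channel and aggregates them, keeping the request path non-blocking.
func startAggregation(events <-chan event, flushEvery time.Duration) {
	go func() {
		counts := map[string]int{}
		ticker := time.NewTicker(flushEvery)
		defer ticker.Stop()
		for {
			select {
			case e := <-events:
				counts[e.key]++
			case <-ticker.C:
				// Hand the snapshot off (e.g. to an aggregator) and reset.
				fmt.Println("flushing:", counts)
				counts = map[string]int{}
			}
		}
	}()
}

// onRequest is what the per-request code path would call: a non-blocking send,
// so a full buffer never adds latency to user traffic (the count is dropped instead).
func onRequest(events chan<- event, key string) {
	select {
	case events <- event{key: key}:
	default:
	}
}

func main() {
	events := make(chan event, 1024)
	startAggregation(events, time.Second)
	for i := 0; i < 5; i++ {
		onRequest(events, "user-42")
	}
	time.Sleep(1500 * time.Millisecond) // allow one flush in this demo
}
```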

Implementation Details

  • Envoy Filter: Written in Go, using buffered Go channels for communication with the aggregators.
  • Pub/Sub Integration: Propagates blocking actions to all Envoy instances promptly (a subscriber sketch follows this list).
  • Fault Tolerance: Edge cases are handled by redirecting buckets to other aggregators via consistent hashing.
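
As a rough illustration of the Pub/Sub integration, the sketch below uses the cloud.google.com/go/pubsub client to receive blocking actions and apply them to an in-memory blocklist consulted by the filter. The project ID, subscription name, and message format (a bare bucket key as the payload) are assumptions for illustration, not Spotify's actual schema.

```go
package main

import (
	"context"
	"log"
	"sync"

	"cloud.google.com/go/pubsub"
)

// blocklist is the in-memory set of bucket keys that the filter should block.
type blocklist struct {
	mu   sync.RWMutex
	keys map[string]bool
}

func (b *blocklist) add(key string) {
	b.mu.Lock()
	b.keys[key] = true
	b.mu.Unlock()
}

// Blocked is what the request path consults; it is a cheap read-locked lookup.
func (b *blocklist) Blocked(key string) bool {
	b.mu.RLock()
	defer b.mu.RUnlock()
	return b.keys[key]
}

func main() {
	ctx := context.Background()

	// Hypothetical project and subscription names, for illustration only.
	client, err := pubsub.NewClient(ctx, "example-project")
	if err != nil {
		log.Fatalf("pubsub client: %v", err)
	}
	defer client.Close()

	bl := &blocklist{keys: map[string]bool{}}
	sub := client.Subscription("timecop-block-actions")

	// Receive blocks until ctx is cancelled; each message is assumed to carry
	// a single bucket key to block as its payload.
	err = sub.Receive(ctx, func(ctx context.Context, m *pubsub.Message) {
		bl.add(string(m.Data))
		m.Ack()
	})
	if err != nil {
		log.Fatalf("receive: %v", err)
	}
}
```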

System Characteristics and Trade-offs

Scalability

  • Aggregator Sharding: Spreads bucket state across aggregators, while periodic counting keeps reporting load low.
  • Consistent Hashing: Minimizes data migration during scaling.

Fault Tolerance

  • Aggregator Failover: Other aggregators handle most traffic during failures.

Performance Optimization

  • Reduced Internal Requests: From per-request to periodic reporting.
  • Low Latency: Minimal impact on user experience.

User Experience

  • Lenient Limits: Avoids blocking normal users during transient spikes.

Technical Evaluation and Language Selection

Language Comparison

  • C++: Reliable but high development barrier.
  • Lua: Simple but lacks advanced features.
  • Wasm: Experimental with unclear maturity.
  • Go: Fast development with native metrics support.

Key Requirements

  • Background Processing: Essential for real-time counting.
  • Custom Metrics: Required for monitoring.
  • Development Efficiency: Critical for rapid iteration.

Implementation Challenges and Production Deployment

Performance Testing

  • Go Filter Overhead: Comparable to C++ with negligible differences.
  • Latency Impact: P50 increased by 3ms, P99 by 11ms.
  • Global Rate Limiting Timeouts: Occasional timeouts in the rate limiting path affected user experience.

Technical Bottlenecks

  • Metric Naming: Inconsistent labels such as timecop_request_count{}.
  • Envoy Go API Limitations: No direct access to listeners or upstream clusters from the filter.
  • Aggregator Communication: Requires indirect calls over a Unix domain socket (a sketch follows this list).
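
As a sketch of what such an indirect call can look like in Go (not Spotify's code), the snippet below sends an HTTP request to a local process over a Unix domain socket by overriding the client's dialer; the socket path and endpoint are placeholders.

```go
package main

import (
	"context"
	"fmt"
	"io"
	"net"
	"net/http"
	"time"
)

func main() {
	// Route all requests from this client through a local Unix domain socket
	// instead of TCP; the socket path is a placeholder.
	socketPath := "/tmp/aggregator.sock"
	client := &http.Client{
		Timeout: 2 * time.Second,
		Transport: &http.Transport{
			DialContext: func(ctx context.Context, network, addr string) (net.Conn, error) {
				return (&net.Dialer{}).DialContext(ctx, "unix", socketPath)
			},
		},
	}

	// The host in the URL is ignored by the custom dialer; only the path matters.
	resp, err := client.Get("http://unix/v1/report")
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	defer resp.Body.Close()

	body, _ := io.ReadAll(resp.Body)
	fmt.Println("aggregator response:", string(body))
}
```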

Production Deployment

  • Initial Success: Validated basic functionality in production.
  • Post-Deployment Adjustments: Switched to C++ due to API limitations and technical debt.
  • C++ Migration: Roughly 90% of the code was rewritten, with improved maintainability.

Key Technical Details

Aggregator Fault Tolerance

  • Health Checks: Envoy's cluster health checks enable smooth failover.
  • Data Loss Tolerance: Prioritizes service availability over perfect data consistency.

Cross-Platform Compilation

  • ARM vs. x86_64: Manual flag adjustments required for consistent builds.

Metrics and Monitoring

  • Go Filter Metrics: Poorly named, requiring external systems for post-processing.
  • C++ Improvements: Better metric management is expected from future C++ iterations.

Conclusion

Core Takeaways

  1. Scalable Rate Limiting: No off-the-shelf solution met Spotify's requirements, necessitating a custom approach.
  2. Envoy Extension Limitations: Existing options (Go, Lua, Wasm) lack maturity, with C++ as the only stable choice despite high development costs.
  3. Community Needs: There is demand for stable, non-C++ filter development options to improve developer productivity.

Final Thoughts

Spotify's journey highlights the complexities of implementing rate limiting at scale. While the initial Go-based solution provided a viable starting point, the decision to migrate to C++ underscores the trade-offs between development speed and long-term maintainability. Future improvements may leverage Envoy's evolving features to simplify such implementations.