Introduction
Rate limiting is a critical component of modern distributed systems, ensuring fair resource allocation and preventing abuse. As Spotify scales to serve millions of users with high traffic volumes, the need for a robust, scalable rate limiting solution becomes paramount. This article explores Spotify's journey in developing a custom rate limiting solution using Envoy, focusing on the design, implementation, and challenges faced during the process.
Technical Background and Existing Solutions
Rate Limiting Fundamentals
Rate limiting involves controlling the number of requests a user can make within a specified time window. It is typically applied at the L7 HTTP layer and can be based on attributes such as IP addresses, user IDs, or request paths. Two key concepts, illustrated in the sketch after this list, are:
- Static Buckets: Predefined attributes like URLs or upstream clusters.
- Dynamic Buckets: Runtime-determined attributes like IP addresses or user IDs.
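To make the distinction concrete, here is a minimal Go sketch of a fixed-window limiter keyed by a dynamic bucket such as a user ID. The names (`FixedWindowLimiter`, `Allow`) and the parameters are illustrative assumptions, not Spotify's implementation:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// FixedWindowLimiter is a minimal illustration of per-bucket rate limiting.
// Each key (e.g. a user ID or IP address, i.e. a dynamic bucket) gets its
// own counter, and all counters reset when the window elapses.
type FixedWindowLimiter struct {
	mu     sync.Mutex
	window time.Duration
	limit  int
	counts map[string]int
	start  time.Time
}

func NewFixedWindowLimiter(limit int, window time.Duration) *FixedWindowLimiter {
	return &FixedWindowLimiter{
		window: window,
		limit:  limit,
		counts: make(map[string]int),
		start:  time.Now(),
	}
}

// Allow reports whether a request for the given bucket key is under the limit.
func (l *FixedWindowLimiter) Allow(key string) bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	// Reset all counters once the current window has elapsed.
	if time.Since(l.start) > l.window {
		l.counts = make(map[string]int)
		l.start = time.Now()
	}
	l.counts[key]++
	return l.counts[key] <= l.limit
}

func main() {
	limiter := NewFixedWindowLimiter(3, time.Minute)
	for i := 0; i < 5; i++ {
		fmt.Printf("request %d allowed: %v\n", i+1, limiter.Allow("user-42"))
	}
}
```

Note that the counts map grows with the number of distinct keys, which previews why purely local approaches struggle with high-cardinality dynamic buckets.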
Evaluation of Existing Solutions
Spotify evaluated several existing rate limiting approaches, each with distinct trade-offs:
- Local Rate Limiting: Offers flexibility but struggles with high-cardinality dynamic buckets (e.g., IP addresses).
- Global Rate Limiting: Provides cross-Envoy state sharing but requires external storage (e.g., Redis) and suffers from performance and cost limitations.
- RLQS (Rate Limit Quota Service): Enforces limits off the request path, reducing added latency, but is still in development and generates significant internal request overhead.
- Cloud Services/CDN: Managed services simplify deployment but lack support for Envoy-layer attributes like upstream clusters.
Spotify's Use Cases and Challenges
Key Requirements
Spotify's rate limiting solution must address:
- High-scale traffic: 1 billion requests daily, with peaks during major events.
- Dynamic bucket support: Essential for attributes like user IDs.
- Low latency and high throughput: Critical for maintaining user experience.
Why Existing Solutions Fall Short
- Local rate limiting cannot handle high-cardinality dynamic buckets.
- Global rate limiting is operationally complex and costly.
- RLQS is not production-ready and introduces excessive internal requests.
- Cloud services lack support for Envoy-layer attributes.
Time Cop: A Custom Rate Limiting Solution
Design Goals
Time Cop was designed to:
- Reduce system rigidity: Enforce limits only on abnormal traffic.
- Enhance scalability: Avoid hard limits that could impact normal users.
Core Architecture
- Aggregator: Uses consistent hashing to distribute dynamic buckets across aggregators, minimizing data migration (see the sketch after this list).
- Counting Mechanism: Reports request counts periodically rather than per request, reducing internal request volume.
- Blocking Actions: Distributes blocking actions via Google Cloud PubSub to all Envoy instances, eliminating direct aggregator connections.
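Spotify's code is not public, but the hashing idea can be illustrated with a small consistent-hash ring in Go. Everything here (the `Ring` type, FNV hashing, 100 virtual nodes) is an assumption for illustration:

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// Ring is a minimal consistent-hash ring that maps bucket keys (e.g. user
// IDs) to aggregator instances. Virtual nodes smooth the distribution.
type Ring struct {
	points []uint32          // sorted hash points on the ring
	owner  map[uint32]string // hash point -> aggregator name
}

func hashKey(s string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(s))
	return h.Sum32()
}

func NewRing(aggregators []string, vnodes int) *Ring {
	r := &Ring{owner: make(map[uint32]string)}
	for _, agg := range aggregators {
		for v := 0; v < vnodes; v++ {
			p := hashKey(fmt.Sprintf("%s#%d", agg, v))
			r.points = append(r.points, p)
			r.owner[p] = agg
		}
	}
	sort.Slice(r.points, func(i, j int) bool { return r.points[i] < r.points[j] })
	return r
}

// Lookup returns the aggregator that owns a bucket key: the first point on
// the ring at or after the key's hash, wrapping around at the end.
func (r *Ring) Lookup(key string) string {
	h := hashKey(key)
	i := sort.Search(len(r.points), func(i int) bool { return r.points[i] >= h })
	if i == len(r.points) {
		i = 0
	}
	return r.owner[r.points[i]]
}

func main() {
	ring := NewRing([]string{"agg-0", "agg-1", "agg-2"}, 100)
	for _, user := range []string{"user-a", "user-b", "user-c"} {
		fmt.Printf("%s -> %s\n", user, ring.Lookup(user))
	}
}
```

Because each aggregator owns only the ring points it hashes to, adding or removing an instance moves only the keys adjacent to those points, which is what keeps data migration small during scaling.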
Comparison with RLQS
- RLQS requires per-request communication, leading to high internal request volume and data migration challenges.
- Time Cop mitigates these issues by periodic reporting and consistent hashing.
Envoy Extension Selection and Go Implementation
Language Evaluation
Spotify evaluated several options for Envoy extensions:
- C++: Mature API but high development barrier and complex build process.
- Lua: Simple but lacks background processing and custom metrics support.
- Wasm: Experimental with unclear performance and documentation.
- Go: Offers background processing, native Envoy metrics, and faster development cycles.
Choosing Go
- Background Processing: Goroutines enable efficient counting and aggregation off the request path (see the sketch after this list).
- Native Metrics: The filter can emit native Envoy metrics, satisfying monitoring and DevOps requirements.
- Development Efficiency: Leverages Spotify's existing Go ecosystem.
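As a sketch of how goroutines support the periodic-reporting pattern described earlier, the following illustrative code accumulates per-bucket counts on the request path and flushes them on an interval. `Counter`, the one-second interval, and the `report` callback are assumptions, not Spotify's actual code:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Counter accumulates per-bucket request counts in memory and flushes them
// on a fixed interval, instead of reporting to an aggregator per request.
type Counter struct {
	mu     sync.Mutex
	counts map[string]int64
}

func NewCounter() *Counter { return &Counter{counts: make(map[string]int64)} }

// Incr is called on the request path; it only touches local memory.
func (c *Counter) Incr(bucket string) {
	c.mu.Lock()
	c.counts[bucket]++
	c.mu.Unlock()
}

// Run flushes the accumulated counts every interval. The report callback
// stands in for the call that ships counts to the owning aggregator.
func (c *Counter) Run(interval time.Duration, report func(map[string]int64)) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for range ticker.C {
		c.mu.Lock()
		batch := c.counts
		c.counts = make(map[string]int64)
		c.mu.Unlock()
		if len(batch) > 0 {
			report(batch)
		}
	}
}

func main() {
	c := NewCounter()
	go c.Run(time.Second, func(batch map[string]int64) {
		fmt.Println("reporting batch:", batch)
	})
	for i := 0; i < 10; i++ {
		c.Incr("user-42")
		time.Sleep(200 * time.Millisecond)
	}
	time.Sleep(1500 * time.Millisecond) // let the final flush happen
}
```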
Implementation Details
- Envoy Filter: Written in Go using GBC (Go Buffer Channel) for communication with aggregators.
- PubSub Integration: Ensures timely propagation of blocking actions (a subscriber sketch follows this list).
- Fault Tolerance: Aggregators can handle edge cases via consistent hashing redirection.
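The article does not show the subscriber code; a minimal sketch using the official Google Cloud PubSub Go client might look as follows. The project ID, subscription name, and message format (a raw bucket key) are placeholders, and details such as unblocking and entry expiry are omitted:

```go
package main

import (
	"context"
	"log"
	"sync"

	"cloud.google.com/go/pubsub"
)

// blocked is the local set of currently blocked bucket keys. Each Envoy
// instance keeps its own copy, so no direct aggregator connection is needed.
var (
	mu      sync.RWMutex
	blocked = map[string]bool{}
)

// IsBlocked is what the request path would consult.
func IsBlocked(key string) bool {
	mu.RLock()
	defer mu.RUnlock()
	return blocked[key]
}

func main() {
	ctx := context.Background()
	// Project and subscription IDs are placeholders.
	client, err := pubsub.NewClient(ctx, "my-project")
	if err != nil {
		log.Fatal(err)
	}
	sub := client.Subscription("timecop-blocking-actions")

	// Receive blocks, dispatching each message to the callback; here the
	// message payload is assumed to be the bucket key to block.
	err = sub.Receive(ctx, func(ctx context.Context, m *pubsub.Message) {
		mu.Lock()
		blocked[string(m.Data)] = true
		mu.Unlock()
		m.Ack()
	})
	if err != nil {
		log.Fatal(err)
	}
}
```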
System Characteristics and Trade-offs
Scalability
- Aggregator Sharding: Distributes buckets across aggregators, while periodic counting keeps per-aggregator load low.
- Consistent Hashing: Minimizes data migration during scaling.
Fault Tolerance
- Aggregator Failover: Other aggregators handle most traffic during failures.
Performance Optimization
- Reduced Internal Requests: From per-request to periodic reporting.
- Low Latency: Minimal impact on user experience.
User Experience
- Lenient Limits: Avoids blocking normal users during transient spikes.
Implementation Challenges and Production Deployment
Performance Testing
- Go Filter Overhead: Comparable to C++ with negligible differences.
- Latency Impact: P50 increased by 3ms, P99 by 11ms.
- Global Rate Limiting Timeout: Occasional impact on user experience.
Technical Bottlenecks
- Metric Naming: Inconsistent labels like `timecop_request_count{}`.
- Envoy API Limitations: No direct access to listeners/clusters.
- Aggregator Communication: Requires indirect calls over a Unix domain socket (see the sketch after this list).
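For illustration, a Unix-domain-socket exchange in Go looks like the sketch below; the socket path and the wire format are invented for the example, and a stand-in listener is included so it runs end to end:

```go
package main

import (
	"bufio"
	"fmt"
	"log"
	"net"
	"os"
	"path/filepath"
)

func main() {
	// The socket path is a placeholder; a real deployment would use a
	// well-known path shared by the filter and the receiving endpoint.
	sock := filepath.Join(os.TempDir(), "timecop-demo.sock")
	os.Remove(sock)

	// Stand-in for the receiving side, so the example is self-contained.
	ln, err := net.Listen("unix", sock)
	if err != nil {
		log.Fatal(err)
	}
	defer ln.Close()

	done := make(chan struct{})
	go func() {
		defer close(done)
		conn, err := ln.Accept()
		if err != nil {
			return
		}
		defer conn.Close()
		line, _ := bufio.NewReader(conn).ReadString('\n')
		fmt.Print("received report: ", line)
	}()

	// The filter side: dial the socket and send a count report. The wire
	// format ("<bucket> <count>") is invented purely for illustration.
	conn, err := net.Dial("unix", sock)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Fprintf(conn, "user-42 17\n")
	conn.Close()
	<-done
}
```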
Production Deployment
- Initial Success: Validated basic functionality in production.
- Post-Deployment Adjustments: Switched to C++ due to API limitations and technical debt.
- C++ Migration: 90% code rewritten with improved maintainability.
Key Technical Details
Aggregator Fault Tolerance
- Health Checks: Envoy's cluster health checks enable smooth failover.
- Data Loss Tolerance: Prioritizes service availability over perfect data consistency.
Cross-Platform Compilation
- ARM vs. x86_64: Manual flag adjustments required for consistent builds.
Metrics and Monitoring
- Go Filter Metrics: Poorly named, requiring external systems for processing.
- C++ Improvements: Better metric management is expected in future iterations.
Conclusion
Core Takeaways
- Scalable Rate Limiting: No off-the-shelf solution met Spotify's requirements, necessitating a custom approach.
- Envoy Extension Limitations: Existing options (Go, Lua, Wasm) lack maturity, with C++ as the only stable choice despite high development costs.
- Community Needs: Demand for non-C++ stable filter development options to improve developer productivity.
Final Thoughts
Spotify's journey highlights the complexities of implementing rate limiting at scale. While the initial Go-based solution provided a viable starting point, the decision to migrate to C++ underscores the trade-offs between development speed and long-term maintainability. Future improvements may leverage Envoy's evolving features to simplify such implementations.