Multi-stage ranking is a system design pattern used in large-scale recommendation and search systems to rank millions or billions of items while maintaining both computational efficiency and high relevance. The key insight is to structure the ranking process as a cascade of increasingly complex models, where each stage narrows the candidate set while applying more sophisticated ranking criteria. This pattern is essential for systems that must select from extremely large item pools (billions of posts, videos, products, etc.) while serving results with low latency (typically under 100ms end-to-end).

## Pattern Structure

A typical multi-stage ranking architecture consists of 3-4 stages:

### 1. Retrieval/Candidate Generation

- **Purpose**: Efficiently retrieve a subset of potentially relevant items (hundreds to thousands) from a corpus of millions or billions
- **Characteristics**:
    - Optimized for high recall (not precision)
    - Extremely fast execution (milliseconds)
    - Relatively simple models or heuristics
- **Common approaches**:
    - Embedding-based retrieval using approximate nearest neighbors (ANN)
    - Inverted indices for keyword or metadata matching
    - Graph-based approaches (social connections, co-engagement)
    - Heuristic-based selection (recent, popular, etc.)

### 2. Light Ranking/Candidate Scoring

- **Purpose**: Quickly score candidates from the retrieval stage to reduce to hundreds of items
- **Characteristics**:
    - Balanced speed and accuracy
    - Lightweight models that can run in milliseconds
    - More features than retrieval but fewer than final ranking
- **Common approaches**:
    - Gradient Boosted Decision Trees (GBDTs)
    - Shallow neural networks
    - Linear or logistic regression
    - Simple two-tower models
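As a concrete illustration of embedding-based candidate generation, here is a minimal sketch in which a brute-force dot-product scan stands in for a real ANN index (in production, a library such as FAISS or ScaNN would replace the linear scan); the corpus size, embedding dimension, and function names are illustrative:

```python
import heapq
import random

random.seed(7)
DIM = 16

# Pre-computed item embeddings, as would be stored in an ANN index.
ITEM_EMBEDDINGS = {item_id: [random.gauss(0, 1) for _ in range(DIM)]
                   for item_id in range(10_000)}

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def retrieve(user_embedding, k=500):
    """Stage 1: return the top-k items by embedding similarity.

    A linear scan is O(corpus size); real systems use approximate
    nearest-neighbor search to make this step sub-linear.
    """
    return heapq.nlargest(
        k, ITEM_EMBEDDINGS,
        key=lambda item_id: dot(user_embedding, ITEM_EMBEDDINGS[item_id]))

user_emb = [random.gauss(0, 1) for _ in range(DIM)]
candidates = retrieve(user_emb, k=500)
print(len(candidates))  # 500
```

Note that this stage only needs to be roughly right: its job is to keep recall high so that later, more precise stages have good material to work with.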
### 3. Heavy Ranking/Re-ranking

- **Purpose**: Apply sophisticated models to the reduced candidate set for final ranking
- **Characteristics**:
    - Optimized for precision and relevance
    - Complex models with rich feature interactions
    - Higher computational cost per item (acceptable because the candidate set is small)
- **Common approaches**:
    - Deep neural networks (DNNs)
    - Transformers or attention-based models
    - DLRM (Deep Learning Recommendation Model)
    - Feature-rich ensemble models

### 4. Post-Processing & Diversification (Optional)

- **Purpose**: Ensure the final result set meets business objectives beyond pure relevance
- **Characteristics**:
    - Rule-based adjustments
    - Diversity injection
    - Business constraints application
- **Common approaches**:
    - Re-ordering for diversity (content type, source)
    - Interleaving of different content types
    - Applying business rules and constraints
    - Deduplication

## Implementation Details

### Feature Progression Across Stages

A key principle is that feature complexity and richness increase with each stage:

1. **Retrieval features** (minimal but effective):
    - Pre-computed embeddings
    - Basic user and item IDs
    - Simple categorical features
    - Historical engagement signals
2. **Light ranking features** (more detailed but computationally efficient):
    - Basic user-item interactions
    - More granular categorical features
    - Recent user behavior
    - Simple contextual signals
3. **Heavy ranking features** (rich, detailed, and computationally intensive):
    - Cross-features and interactions
    - Full user history
    - Complete context
    - Text/image/video features
    - Temporal patterns
    - Fine-grained engagement predictions

### Performance Considerations

- **Caching strategy**: Cache results from each stage strategically
    - Retrieval results might be cached for minutes
    - Re-use intermediate results for similar queries
    - Implement consistent cache invalidation when user state changes
- **Model serving infrastructure**:
    - Dedicated serving clusters for each stage
    - Optimized inference engines tailored to model complexity
    - Batch processing where applicable to amortize computational costs
- **Monitoring and debugging**:
    - Track performance at each stage independently
    - Log cutoffs, scores, and feature values between stages
    - Monitor stage-specific metrics (recall, precision, latency)

## Real-World Example: News Feed Ranking at Meta

Meta's News Feed uses a multi-stage ranking approach:

1. **Retrieval (~10,000 items)**:
    - Candidate posts from connections and followed pages
    - Candidates from groups the user belongs to
    - Recommended content based on interests
    - Ads matching basic targeting criteria
2. **Light Ranking (~500 items)**:
    - Fast ML models score posts based on:
        - Post type and recency
        - Basic connection strength
        - Historical engagement with similar content
        - Basic relevance signals
3. **Heavy Ranking (~50 items)**:
    - Deep neural networks incorporating:
        - Detailed user-post interaction predictions
        - Content understanding features
        - Session context and user state
        - Relevance, meaningfulness, and engagement prediction
4. **Post-Processing**:
    - Diversity injection to prevent content homogeneity
    - Ad load determination and placement
    - Integrity filters to catch problematic content
    - Final ordering and presentation

## Key Tradeoffs and Decisions

### Candidate Set Size Between Stages

|Stage Transition|Typical Size|Considerations|
|---|---|---|
|Corpus → Retrieval|Millions to thousands|Too few: missing relevant items<br>Too many: increased computational load|
|Retrieval → Light Ranking|Thousands to hundreds|Too few: sub-optimal final results<br>Too many: higher latency|
|Light → Heavy Ranking|Hundreds to dozens|Too few: limited selection for final display<br>Too many: wasted computation|

### Model Complexity vs. Latency

|Stage|Model Complexity|Computation Budget|Latency Target|
|---|---|---|---|
|Retrieval|Very low|Nanoseconds per item|~10-20ms total|
|Light Ranking|Low to medium|Microseconds per item|~20-30ms total|
|Heavy Ranking|High|Milliseconds per item|~30-50ms total|

### Feature Freshness

|Stage|Feature Freshness|Update Frequency|
|---|---|---|
|Retrieval|Can use somewhat stale data|Hourly/daily updates|
|Light Ranking|Mixed stale and fresh|Minutes/hourly updates|
|Heavy Ranking|Requires fresh data|Real-time/near-real-time|

## Case Studies from Research Papers

### 1. Deep Neural Networks for YouTube Recommendations (2016)

This influential paper described YouTube's two-stage approach:

1. **Candidate Generation**: Used a deep neural network to retrieve a few hundred videos from millions
2. **Ranking**: Applied a separate ranking model with richer features to score and order candidates

Key insights:

- Separate models allowed optimizing for different objectives (recall vs. precision)
- Demonstrated the effectiveness of embeddings for retrieval
- Showed how watch-time prediction could be incorporated into ranking
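The narrowing funnel in the candidate-set-size table can be sketched end to end. The stage sizes match the table's orders of magnitude, but the scoring functions are illustrative placeholders (a real system would call an ANN index, a GBDT, and a DNN respectively):

```python
import heapq
import random

random.seed(0)
# Toy corpus: each item has a cheap popularity prior and a hidden quality score.
CORPUS = [{"id": i, "popularity": random.random(), "quality": random.random()}
          for i in range(100_000)]

def retrieval(corpus, k=1_000):
    # Stage 1: cheap, recall-oriented selection (stand-in for ANN / indices).
    return heapq.nlargest(k, corpus, key=lambda it: it["popularity"])

def light_ranking(candidates, k=200):
    # Stage 2: lightweight scorer with slightly richer features than retrieval.
    score = lambda it: 0.6 * it["popularity"] + 0.4 * it["quality"]
    return heapq.nlargest(k, candidates, key=score)

def heavy_ranking(candidates, k=50):
    # Stage 3: expensive per-item model (placeholder for a DNN forward pass),
    # affordable only because the candidate set is now small.
    score = lambda it: it["quality"] ** 0.5 * (1.0 + it["popularity"])
    return heapq.nlargest(k, candidates, key=score)

def rank(corpus):
    funnel = retrieval(corpus)       # 100,000 -> 1,000
    funnel = light_ranking(funnel)   # 1,000 -> 200
    return heavy_ranking(funnel)     # 200 -> 50

results = rank(CORPUS)
print(len(results))  # 50
```

The design point to notice is that per-item cost rises at each stage while the number of items falls, keeping total compute per request roughly bounded.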
### 2. Meta's DLRM (Deep Learning Recommendation Model)

Meta developed DLRM specifically for the ranking stage in their multi-stage system:

- Efficiently handles both dense and sparse features
- Optimized for specific hardware accelerators
- Scales to large embedding tables
- Allows complex interaction modeling through feature crosses

This model is primarily used in the heavy ranking phase, where computation per item can be more intensive.

## Common Pitfalls and Challenges

### 1. Stage Misalignment

**Problem**: Each stage optimizes for different metrics, potentially creating inconsistencies.

**Solutions**:

- Use consistent engagement signals across stages
- Train later stages on data selected by earlier stages
- Implement joint optimization techniques
- Regular end-to-end evaluations

### 2. Cold Start Items

**Problem**: New items lack the historical data needed for effective multi-stage ranking.

**Solutions**:

- Exploration mechanisms in the retrieval stage
- Content-based features in early stages
- Explicit boosting of new items
- Separate pipelines for cold-start content

### 3. Feedback Loops

**Problem**: Items selected by the system get more data, which reinforces their selection.

**Solutions**:

- Counterfactual learning techniques
- Explore-exploit strategies
- Positional debiasing
- Periodic model retraining

### 4. Debugging Complexity

**Problem**: When overall performance degrades, identifying which stage caused the issue is challenging.

**Solutions**:

- Comprehensive logging between stages
- Stage-isolated A/B tests
- Clear metrics for each stage's performance
- Regular stage-specific evaluations

## Implementation Best Practices

### 1. Stage-Specific Optimization

Each stage should be optimized for its specific role:

- **Retrieval**: Optimize for recall and speed
- **Light Ranking**: Balance precision and computational efficiency
- **Heavy Ranking**: Maximize relevance and engagement prediction
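One of the cold-start mitigations listed above, exploration in the retrieval stage, can be sketched as an epsilon-greedy mix that reserves a fraction of retrieval slots for items with no engagement history. The 10% exploration rate and the item representation are illustrative assumptions:

```python
import random

def retrieve_with_exploration(ranked_items, new_items, k=500,
                              explore_frac=0.1, rng=random):
    """Fill most slots with normally-retrieved items, but reserve a fraction
    for cold-start items that would otherwise never surface."""
    n_explore = int(k * explore_frac)
    n_exploit = k - n_explore
    exploit = ranked_items[:n_exploit]
    # Sample uniformly from new items; a real system might weight by
    # content-based similarity instead of sampling uniformly.
    explore = rng.sample(new_items, min(n_explore, len(new_items)))
    return exploit + explore

rng = random.Random(0)
ranked = list(range(1000))           # items with engagement history, pre-ranked
fresh = list(range(10_000, 10_100))  # cold-start items, no history yet
slate = retrieve_with_exploration(ranked, fresh, k=500, rng=rng)
print(len(slate))                            # 500
print(sum(1 for i in slate if i >= 10_000))  # 50 exploration slots
```

The engagement data collected on the exploration slots then feeds back into training, which is also one way to dampen the feedback loops described above.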
### 2. Feature Engineering Strategy

- **Pre-computed features**: Calculate expensive features offline when possible
- **Real-time features**: Prioritize the most impactful ones for online computation
- **Progressive feature computation**: Add more complex features in later stages

### 3. Model Serving Infrastructure

- **Tiered architecture**: Different serving systems for each stage
- **Caching strategy**: Cache intermediate results where appropriate
- **Fallback mechanisms**: Ensure degraded experiences still function if a stage fails

### 4. Experimentation Framework

- **Stage-isolated experiments**: Test changes to individual stages
- **End-to-end evaluation**: Measure overall system impact
- **Counterfactual analysis**: Estimate offline what would happen with changes

## Variants and Extensions

### 1. Hybrid Multi-Stage Ranking

Combines multiple types of models in each stage:

- Content-based and collaborative filtering
- Different model architectures in ensemble
- Specialized models for different content types

### 2. Adaptive Multi-Stage Ranking

Dynamically adjusts the pipeline based on context:

- Skips stages for certain query types
- Varies candidate set sizes based on query difficulty
- Allocates different computational budgets based on user importance
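The fallback idea above can be made concrete with a small sketch: if the heavy-ranking call fails, serve the candidates in light-ranking order rather than failing the whole request. The scorer signatures and candidate shapes are illustrative; a production system would also enforce a latency budget (e.g. an RPC deadline), not just catch exceptions:

```python
def rank_with_fallback(candidates, heavy_scorer):
    """Score candidates with the heavy model; on failure, return them in
    their incoming (light-ranking) order as a degraded experience."""
    try:
        scored = [(heavy_scorer(c), c) for c in candidates]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [c for _, c in scored]
    except Exception:
        # Stage failed: fall back to the light-ranking order.
        return list(candidates)

def working_scorer(candidate):
    return candidate["score"]

def broken_scorer(candidate):
    raise RuntimeError("model server unavailable")

cands = [{"id": 1, "score": 0.2}, {"id": 2, "score": 0.9}]
print(rank_with_fallback(cands, working_scorer))  # id 2 ranked first
print(rank_with_fallback(cands, broken_scorer))   # original order preserved
```

This is one reason the stages are kept loosely coupled: each stage's output is already a usable ranking, so any later stage can be skipped under load or failure.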
### 3. Personalized Multi-Stage Ranking

Customizes each stage based on user preferences:

- User-specific retrieval sources
- Personalized scoring functions
- Custom post-processing rules

## Evaluation and Metrics

### Stage-Specific Metrics

|Stage|Primary Metrics|Secondary Metrics|
|---|---|---|
|Retrieval|Recall@K, QPS|Diversity, Coverage|
|Light Ranking|NDCG@K, AUC|Calibration, Group fairness|
|Heavy Ranking|Engagement prediction, Click-through rate|Session success, Long-term metrics|
|Post-Processing|Diversity measures, Business KPIs|User satisfaction, Retention|

### Holistic System Metrics

- **Engagement metrics**: CTR, session length, return rate
- **Relevance metrics**: Explicit feedback, surveys
- **Efficiency metrics**: End-to-end latency, computational cost
- **Business metrics**: Revenue, retention, growth

## When to Use Multi-Stage Ranking

This pattern is best suited for:

- Very large corpus sizes (millions to billions of items)
- Strict latency requirements (typically <100ms)
- Complex ranking criteria requiring rich features
- Systems where ranking quality directly impacts business metrics

It may be unnecessary for:

- Small candidate pools (thousands of items or fewer)
- Simple ranking criteria
- Systems where latency is not critical
- Offline or batch processing scenarios

## Questions

1. **Conceptual Understanding**
    - How do you decide the optimal number of candidates to pass between stages?
    - Why use a multi-stage ranking architecture instead of a single complex model?
    - What are the key differences in optimization objectives between the retrieval and ranking stages?
    - How would you handle the trade-off between relevance and diversity in a multi-stage system?
    - Explain how latency requirements influence the design of each stage.
2. **System Design**
    - How would you design the monitoring system for a multi-stage ranking pipeline?
    - What caching strategies would you implement at different stages?
    - How would you handle a scenario where one stage in your pipeline fails or slows down?
    - Describe how you would evolve a single-stage system to a multi-stage architecture with minimal disruption.
    - How would you handle model updates in a production multi-stage system?
3. **Implementation Details**
    - What metrics would you use to evaluate each stage independently?
    - How would you debug a situation where overall relevance decreases after adding a new stage?
    - How do you ensure consistency in scoring between different stages?
    - What approaches can mitigate position bias in the training data for ranking models?
    - How would you implement feature re-use across stages to improve efficiency?
4. **Specialized Scenarios**
    - How would you adapt a multi-stage ranking system to handle viral content that rapidly gains popularity?
    - What modifications would you make to handle recommendation of rare or long-tail content?
    - How would you design a multi-stage system that optimizes for multiple, potentially competing objectives?
    - What approaches would you use to personalize each stage of the ranking pipeline?
    - How would you incorporate real-time context (like the user's current session behavior) into a multi-stage system?
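As a companion to the stage-specific metrics discussed earlier, here is a small sketch of two of the listed offline metrics, Recall@K (retrieval) and NDCG@K (light ranking); the toy inputs are illustrative:

```python
import math

def recall_at_k(retrieved, relevant, k):
    # Fraction of relevant items that appear in the top-k retrieved list.
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)

def ndcg_at_k(ranked_relevances, k):
    # Normalized discounted cumulative gain over graded relevance labels,
    # listed in the order the system ranked the items.
    def dcg(rels):
        return sum(r / math.log2(i + 2) for i, r in enumerate(rels))
    ideal = sorted(ranked_relevances, reverse=True)
    denom = dcg(ideal[:k])
    return dcg(ranked_relevances[:k]) / denom if denom > 0 else 0.0

print(recall_at_k([5, 3, 9, 1], relevant=[3, 7], k=3))  # 0.5
print(round(ndcg_at_k([3, 2, 3, 0, 1], k=5), 3))
```

Recall@K deliberately ignores ordering, which matches the retrieval stage's recall-not-precision objective, while NDCG@K rewards placing the most relevant items earliest.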