## Pattern Structure

### High-Level Architecture

```
 User Features             Item Features
      │                         │
      ▼                         ▼
┌─────────────────┐    ┌─────────────────┐
│   User Tower    │    │   Item Tower    │
│(Embedding Model)│    │(Embedding Model)│
└─────────────────┘    └─────────────────┘
      │                         │
      ▼                         ▼
 User Embedding            Item Embedding
      │                         │
      └────────────┬────────────┘
                   │
                   ▼
          Similarity Score
 (dot product or cosine similarity)
                   │
                   ▼
                Output
```

### Components

#### 1. User Tower

- **Input**: User features (demographics, historical interactions, contextual signals)
- **Architecture**: Neural network (MLP, CNN, or Transformer-based)
- **Output**: User embedding vector (typically 128-1024 dimensions)
- **Characteristics**:
  - Must run efficiently at inference time
  - Often includes both static and dynamic user features
  - Optimized for real-time computation

#### 2. Item Tower

- **Input**: Item features (content attributes, engagement statistics, metadata)
- **Architecture**: Neural network (often similar to the user tower, but can differ)
- **Output**: Item embedding vector (same dimensionality as user embeddings)
- **Characteristics**:
  - Can be pre-computed and indexed offline
  - Updated periodically (hourly, daily, or weekly)
  - Optimized for comprehensive representation

#### 3. Similarity Computation

- **Method**: Typically dot product or cosine similarity between embeddings
- **Properties**:
  - Must be computationally efficient
  - Should correlate with user engagement probability
  - Often normalized to produce interpretable scores

#### 4. Training Objective

- **Common approaches**:
  - Contrastive learning (positive vs. negative samples)
  - Softmax cross-entropy loss
  - Sampled softmax (for efficiency with large item catalogs)
  - Triplet loss and margin-based approaches

## Implementation Details

### Embedding Dimensions and Representation

The choice of embedding dimension is crucial for balancing expressiveness against computational cost:

|Dimension|Typical Use Case|Tradeoffs|
|---|---|---|
|32-64|Mobile, latency-sensitive|Fast inference, smaller index, less expressive|
|128-256|Standard recommendation|Good balance, common in production systems|
|512-1024|Complex content (videos, long text)|More expressive, higher computational cost|

### Training Strategies

#### Negative Sampling Approaches

Effective training requires carefully selected negative examples:

1. **Random negatives**: Simple but often ineffective for large catalogs
2. **Popular negatives**: Oversampling popular items that aren't interacted with
3. **Hard negatives**: Items similar to positive examples but not interacted with
4. **In-batch negatives**: Using other examples in the training batch as negatives
5. **Mined negatives**: Finding items that are mistakenly ranked highly

#### Batch Size Considerations

|Batch Size|Advantages|Disadvantages|
|---|---|---|
|Small (32-128)|Faster iterations, less memory|Limited negative examples, noisy gradients|
|Large (1024+)|Better convergence, more in-batch negatives|More compute resources, diminishing returns|

### Approximate Nearest Neighbor (ANN) Integration

Two-tower architectures typically use ANN search for efficient retrieval:

**1. Index Building**:

- Pre-compute all item embeddings
- Build an index structure (HNSW, IVF, etc.) for fast retrieval
- Update the index periodically as items change

**2. Query Process**:

- Compute the user embedding in real time
- Query the ANN index to retrieve the top-k nearest items
- Filter results based on business rules
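The query process can be sketched in a few lines, with exact (brute-force) search standing in for a real ANN index. All names and shapes here are illustrative, not a production API:

```python
import numpy as np

def build_item_index(item_embeddings: np.ndarray) -> np.ndarray:
    """Stand-in for an ANN index: L2-normalize and keep the raw matrix.
    A production system would build an HNSW or IVF index here instead."""
    norms = np.linalg.norm(item_embeddings, axis=1, keepdims=True)
    return item_embeddings / np.clip(norms, 1e-12, None)

def retrieve_top_k(user_embedding: np.ndarray, index: np.ndarray, k: int) -> np.ndarray:
    """Score every item by dot product and return indices of the top-k."""
    scores = index @ user_embedding              # (num_items,)
    top_k = np.argpartition(-scores, k - 1)[:k]  # unordered top-k, O(n)
    return top_k[np.argsort(-scores[top_k])]     # sort only the k results

# Toy example: 1000 items with 64-dim embeddings
rng = np.random.default_rng(0)
index = build_item_index(rng.normal(size=(1000, 64)))
user = rng.normal(size=64)
candidates = retrieve_top_k(user, index, k=10)   # item ids to pass to filtering/ranking
```

The exact scan is O(num_items); swapping in a real ANN index changes only `build_item_index` and the lookup inside `retrieve_top_k`, which is why the two-stage structure above is kept separate.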
**3. Common ANN Libraries**:

- Faiss (Meta): High-performance C++ library with Python bindings
- ScaNN (Google): Optimized for maximum inner-product search
- NMSLIB/hnswlib: Graph-based approximate search
- Annoy: Simple implementation with memory-mapping capabilities

## Real-World Example: Meta's Two-Tower Model for Post Retrieval

Meta uses two-tower architectures extensively in its recommendation systems. One example implementation for post retrieval:

### User Tower

- **Features**:
  - User embeddings based on historical engagement
  - Recently engaged topics/entities
  - Device and session context
  - Time-aware signals (time of day, day of week)
- **Architecture**:
  - Sequential component (for recent history)
  - Dense component (for user attributes)
  - Cross-feature component (for interactions)

### Post Tower

- **Features**:
  - Text embeddings from language models
  - Image/video embeddings from visual encoders
  - Creator features
  - Engagement statistics
  - Topic and entity tags
- **Architecture**:
  - Multi-modal processing layers
  - Regularization to prevent overfitting
  - Normalization layers for stable embeddings

### Production Configuration

- **Embedding dimension**: 256 (balancing expressiveness and efficiency)
- **Index updates**: Continuous updates for new content, full refresh daily
- **Negative sampling**: Mix of in-batch negatives, hard negatives, and diversified sampling
- **Serving infrastructure**: Distributed embedding servers with specialized hardware

## Key Tradeoffs and Decisions

### Tower Symmetry vs. Asymmetry

|Approach|Description|When to Use|
|---|---|---|
|Symmetric|Similar architectures for both towers|Similar feature complexity on both sides|
|Asymmetric|Different architectures for each tower|One side has more complex features (e.g., images vs. user actions)|
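Before weighing training setups, it helps to see the core pattern in code. The following is a minimal sketch, assuming two plain linear towers scored by dot product and trained with an in-batch softmax loss (all dimensions and the temperature value are illustrative):

```python
import numpy as np

rng = np.random.default_rng(42)
DIM = 32  # shared embedding dimension (illustrative)

# Two independent "towers": here just linear projections of raw features.
W_user = rng.normal(scale=0.1, size=(100, DIM))  # 100-dim user features -> DIM
W_item = rng.normal(scale=0.1, size=(50, DIM))   # 50-dim item features  -> DIM

def tower(features, W):
    """Project features and L2-normalize, so dot product = cosine similarity."""
    emb = features @ W
    return emb / np.linalg.norm(emb, axis=-1, keepdims=True)

def in_batch_softmax_loss(user_feats, item_feats, temperature=0.05):
    """Each (user, item) row is a positive pair; every other item in the
    batch serves as a negative (the 'in-batch negatives' strategy)."""
    u = tower(user_feats, W_user)                # (B, DIM)
    v = tower(item_feats, W_item)                # (B, DIM)
    logits = (u @ v.T) / temperature             # (B, B) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))          # positives lie on the diagonal

batch_users = rng.normal(size=(8, 100))
batch_items = rng.normal(size=(8, 50))
loss = in_batch_softmax_loss(batch_users, batch_items)
```

In a real system each tower would be a deep network and the loss would be minimized with an autodiff framework; the (B, B) logits matrix is the key structure, because it gives B-1 negatives per example at no extra feature cost.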
### Joint vs. Separate Training

|Approach|Description|Tradeoffs|
|---|---|---|
|Joint Training|Both towers trained simultaneously|Better alignment, more complex training|
|Separate Training|Each tower trained independently, aligned later|More flexibility, potentially weaker alignment|

### Embedding Normalization

|Approach|Description|Impact|
|---|---|---|
|No Normalization|Raw embedding vectors|Magnitude carries information, sensitive to outliers|
|L2 Normalization|Unit vectors|Only direction matters, improves training stability|
|Learned Normalization|Trainable scaling factors|Balance between magnitude and direction|

## Case Studies from Research Papers

### 1. Google's YouTube Recommendations

Google's seminal paper on YouTube recommendations describes a two-tower approach where:

- The user tower incorporates watch history and search history
- The candidate tower encodes video content and metadata
- The system serves billions of users with latency requirements under 10ms

Key innovations included:

- Multi-task training objectives to balance different engagement signals
- Efficient serving infrastructure for real-time embedding computation
- Integration with multiple candidate sources beyond embedding similarity

### 2. Meta's DHEN (Deep and Hierarchical Ensemble Network)

Meta developed DHEN to handle heterogeneous data types in its recommendation systems:

- Specialized encoders for different feature types (text, image, categorical)
- Hierarchical embedding computation for efficiency
- Calibration techniques to handle different engagement patterns

The system demonstrates how two-tower architectures can be extended to handle complex multi-modal content while maintaining serving efficiency.

## Common Pitfalls and Challenges

### 1. Embedding Space Drift

**Problem**: As content and user behavior evolve, embeddings become outdated and less effective.
**Solutions**:

- Regular retraining schedules (daily/weekly)
- Online learning components to adapt quickly
- Monitoring of embedding distribution shifts
- Time-aware embeddings that incorporate recency

### 2. Cold Start Items

**Problem**: New items have no historical engagement data from which to learn effective embeddings.

**Solutions**:

- Content-based initial embeddings
- Zero-shot embeddings based on item metadata
- Exploration strategies for new items
- Transfer learning from similar items

### 3. Popularity Bias

**Problem**: Two-tower models often favor popular items due to training data imbalance.

**Solutions**:

- Balanced sampling during training
- Popularity-aware regularization
- Explicit diversity objectives
- Post-retrieval correction factors

### 4. Training-Serving Skew

**Problem**: Discrepancies between offline training and online serving can degrade performance.

**Solutions**:

- Mimic serving conditions during training
- Calibration layers to address distribution shifts
- Online evaluation and adaptation
- Consistent feature normalization between training and serving

## Implementation Best Practices

### 1. Feature Engineering

- **Common features for the user tower**:
  - Historical interactions (watch, click, purchase history)
  - Explicit preferences and profile information
  - Session context (platform, time, query if applicable)
  - Recent behavior sequences
- **Common features for the item tower**:
  - Content attributes (text, image, video features)
  - Engagement statistics (normalized and binned)
  - Creator/source information
  - Category and taxonomy information

### 2. Model Architecture Considerations

- Start with simpler architectures (MLPs) before adding complexity
- Consider transformer blocks for sequential user data
- Use residual connections for deeper towers
- Add bottleneck layers before final embeddings for efficiency
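One way to read these guidelines: a tower is often a plain MLP whose final layer is a narrow bottleneck emitting the embedding. A hedged numpy sketch, with purely illustrative layer sizes and no training logic:

```python
import numpy as np

rng = np.random.default_rng(7)

def relu(x):
    return np.maximum(x, 0.0)

class TowerMLP:
    """Illustrative user/item tower: wide hidden layers with a residual
    connection, then a bottleneck producing the L2-normalized embedding."""
    def __init__(self, in_dim, hidden=256, bottleneck=64):
        self.W1 = rng.normal(scale=np.sqrt(2 / in_dim), size=(in_dim, hidden))
        self.W2 = rng.normal(scale=np.sqrt(2 / hidden), size=(hidden, hidden))
        self.W3 = rng.normal(scale=np.sqrt(2 / hidden), size=(hidden, bottleneck))

    def __call__(self, x):
        h1 = relu(x @ self.W1)
        h2 = relu(h1 @ self.W2) + h1   # residual connection for depth
        emb = h2 @ self.W3             # bottleneck -> final embedding
        return emb / np.linalg.norm(emb, axis=-1, keepdims=True)

user_tower = TowerMLP(in_dim=100)
emb = user_tower(rng.normal(size=(4, 100)))  # batch of 4 users -> (4, 64)
```

The bottleneck keeps the ANN index small and the dot products cheap, while the wider hidden layers carry the representational load.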
### 3. Training Pipeline Design

- Implement data balancing to avoid popularity bias
- Use curriculum learning (start simple, increase difficulty)
- Mix historical and fresh data to balance stability and recency
- Implement robust evaluation with held-out data

### 4. Serving Infrastructure

- Separate serving paths for the user and item towers
- Cache user embeddings for session consistency
- Implement fallback strategies for ANN service disruptions
- Use hardware acceleration (GPUs, TPUs) strategically

## Variants and Extensions

### 1. Multi-Tower Architecture

An extension that uses more than two towers to handle different entity types:

- User tower
- Item tower
- Context tower
- Query tower (for search)

This approach enables more complex relationships while maintaining the core efficiency benefits.

### 2. Hybrid Two-Tower Models

Combines embedding similarity with additional ranking signals:

- Base retrieval uses the two-tower architecture
- Additional features feed into a separate scoring component
- Final score combines embedding similarity with explicit rules

### 3. Sequential Two-Tower Models

Extends the architecture to better capture sequential user behavior:

- User tower incorporates recurrent or transformer layers
- Historical interactions are encoded in a sequence-aware manner
- Temporal decay is applied to older interactions
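The temporal-decay idea above can be sketched as an exponentially weighted average of a user's past interaction embeddings. The half-life value and shapes are assumptions for illustration:

```python
import numpy as np

def decayed_user_embedding(interaction_embs, ages_days, half_life_days=7.0):
    """Weight each past interaction embedding by exp(-age * ln2 / half_life),
    so an interaction loses half its influence every `half_life_days` days."""
    weights = np.exp(-np.asarray(ages_days) * np.log(2.0) / half_life_days)
    emb = (weights[:, None] * np.asarray(interaction_embs)).sum(axis=0) / weights.sum()
    return emb / np.linalg.norm(emb)  # L2-normalize for dot-product scoring

rng = np.random.default_rng(1)
history = rng.normal(size=(5, 32))   # 5 past interactions, 32-dim embeddings
ages = [0.5, 1.0, 3.0, 14.0, 30.0]   # days since each interaction
user_emb = decayed_user_embedding(history, ages)
```

In practice the decay would usually sit inside the user tower (e.g., as attention weights) rather than as a fixed post-hoc average, but the effect, down-weighting stale behavior, is the same.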
### 4. Hierarchical Two-Tower Models

Uses a cascade of increasingly specific two-tower models:

- First level: Broad category matching
- Second level: Sub-category-specific models
- Final level: Fine-grained item matching

## Evaluation and Metrics

### Offline Metrics

|Metric|Description|When to Use|
|---|---|---|
|Recall@K|Percentage of relevant items in the top K|Evaluating retrieval performance|
|Hit Rate|Whether any relevant item appears in results|Quick quality check|
|Mean Average Precision|Precision considering rank positions|When rank order matters|
|NDCG|Discounted gain based on position|When both relevance and position matter|
|AUC|Area under the ROC curve|Classification accuracy|

### Online Metrics

|Metric|Description|Considerations|
|---|---|---|
|Click-Through Rate|Percentage of recommendations clicked|Easy to game with clickbait|
|Engagement Time|Time spent with recommended content|Better quality signal than CTR|
|Long-term Retention|User return rate over time|Ultimate success metric but slow feedback|
|A/B Test Lift|Performance compared to a baseline|Most reliable but requires experimentation|

### Embedding Quality Metrics

|Metric|Description|Use Case|
|---|---|---|
|Embedding Norm Distribution|Distribution of embedding magnitudes|Detect training issues|
|Nearest Neighbor Coherence|Semantic similarity of nearest neighbors|Qualitative evaluation|
|Embedding Space Coverage|How well embeddings cover the space|Detect mode collapse|
|Clustering Coefficient|How embeddings cluster|Understand content grouping|

## When to Use Two-Tower Architecture

This pattern is best suited for:

- Large-scale retrieval systems (millions to billions of items)
- Systems with strict latency requirements
- Cases where user-item interactions are the primary signal
- First-stage retrieval in multi-stage ranking systems

It may be less effective for:

- Systems requiring complex user-item interactions that can't be captured by a dot product
- Very sparse interaction data without good content features
- When
  business logic dominates retrieval decisions
- Small-scale applications where full comparison is feasible

## Questions

1. **Architecture and Design**
   - Why choose a two-tower architecture over a more complex interaction model for recommendations?
   - How do you determine the optimal embedding dimension for a two-tower model?
   - What are the tradeoffs between symmetric and asymmetric tower architectures?
   - How would you decide between dot product, cosine similarity, or other similarity metrics?
   - Explain how a two-tower model fits into a multi-stage recommendation system.
2. **Training and Optimization**
   - What negative sampling strategies are most effective for training two-tower models?
   - How do you handle the cold start problem for new users or items in a two-tower system?
   - What approaches can mitigate popularity bias in two-tower models?
   - How would you implement efficient training for a two-tower model with billions of items?
   - What strategies help prevent embedding space drift over time?
3. **Implementation and Serving**
   - How would you integrate a two-tower model with an approximate nearest neighbor (ANN) search system?
   - What approaches can optimize inference latency for the user tower?
   - How would you design the update strategy for the item tower in a production system?
   - What caching strategies are appropriate for user and item embeddings?
   - How does batch size affect training of two-tower models, and what are the tradeoffs?
4. **Evaluation and Iteration**
   - What metrics best evaluate the quality of embeddings in a two-tower model?
   - How would you diagnose and address poor performance in a deployed two-tower system?
   - What A/B testing approaches are effective for evaluating embedding model improvements?
   - How can you evaluate the quality of a two-tower model offline?
   - What monitoring is necessary for a production two-tower recommendation system?
5. **Advanced Applications**
   - How would you extend the two-tower architecture to handle sequential user behavior?
   - What modifications would you make to support multi-modal content (text, images, video)?
   - How could you incorporate contextual information that varies per recommendation request?
   - What approaches would enable cross-domain recommendations using two-tower architectures?
   - How would you design a two-tower system that optimizes for multiple objectives simultaneously?