How to Design a Recommendation System That Actually Works

DS
LDS Team
Let's Data Science
31 min readAudio
Listen Along
0:00 / 0:00
AI voice

The difference between a successful platform and a ghost town often comes down to one algorithm: the recommendation system. Netflix estimates its recommendation engine saves over 1 billion USD annually by reducing churn. TikTok built a global empire not on social graphs, but on an algorithmic feed that knows you better than you know yourself.

But building a system that scales to 100 million users while returning results in under 200 milliseconds is a massive engineering challenge. It requires balancing mathematical rigor with heavy infrastructure constraints.

This guide moves beyond simple collaborative filtering tutorials. We will design a production-grade recommendation system capable of serving video content at the scale of YouTube or Netflix. We will cover the "funnel" architecture, vector databases, two-tower neural networks, and the critical trade-offs that senior engineers face daily.

1. Clarifying Requirements

In a system design interview or a real-world scoping meeting, you never start coding immediately. You start by defining the boundaries.

Interviewer: "Design a recommendation system for a video streaming platform."

Candidate: "Let's narrow the scope. Are we designing the homepage feed (discovery) or the 'Up Next' sidebar (related content)?"

Interviewer: "Focus on the homepage feed. We want users to discover new content they'll love."

Candidate: "What is the scale? Are we a startup or a tech giant?"

Interviewer: "Assume we are at scale. 100 million daily active users (DAU) and 10 million videos."

Candidate: "What are the latency constraints? Does the feed update in real-time as I click things?"

Interviewer: "Latency must be under 200ms. The feed should refresh near real-time, but slightly delayed updates (minutes) are acceptable for user profiles."

Candidate: "How do we handle new users with no watch history?"

Interviewer: "Good question. We need a fallback strategy for cold start scenarios."

System Requirements Summary

Functional Requirements:

  • Generate Feed: Users see a personalized list of videos upon login.
  • Record Interactions: System tracks clicks, likes, watches, and "not interested" signals.
  • Handle Cold Start: System must handle new users and new videos gracefully.

Non-Functional Requirements (The Numbers):

  • DAU: 100 Million.
  • Content Library: 10 Million videos.
  • QPS (Queries Per Second): Assuming 5 visits/user/day roughly distributed, peak traffic could hit ~50,000 QPS.
  • Latency: P99 latency < 200ms.
  • Availability: 99.99% (The feed must always load, even if recommendations are slightly stale).
  • Storage:
    • Metadata: 10M videos × 1KB ≈ 10GB (Small).
    • Interaction Logs: 100M users × 50 events/day × 100 bytes ≈ 500GB/day. This is the heavy lifter.

Interview Tip: Always calculate QPS and storage on the whiteboard. For 500GB/day, you immediately know you need a distributed log system like Kafka, not a direct database write.

2. High-Level Architecture

At 50,000 QPS, you cannot simply query a database for "videos user X likes." You need a decoupled architecture that separates Online Serving (fast) from Offline Training (slow). Many production systems also add a Nearline Layer between the two — processes that run in minutes (not milliseconds or hours). For example, updating user embeddings from Kafka events every few minutes via Flink or Spark Streaming, so recommendations reflect recent behavior without the latency cost of true real-time inference.

Loading diagram...

Understanding the Architecture:

The diagram shows three distinct layers that work together to serve recommendations in under 200ms:

Online Serving Layer (Top Row): The request flows left-to-right through five components:

  1. Client App sends a request for personalized recommendations
  2. API Gateway handles authentication, rate limiting, and request validation
  3. Rec Service (Recommendation Service) orchestrates the entire flow—it doesn't compute recommendations itself but coordinates between other services
  4. Inference serves the ML models (both retrieval and ranking)
  5. Vector DB (Faiss/Milvus) stores pre-computed video embeddings for fast similarity search

Data Storage Layer (Middle Row):

  • Kafka captures all user events (clicks, watches, scrolls) in real-time
  • PostgreSQL stores structured metadata (user profiles, video details)
  • Redis acts as the Feature Store—pre-computed features ready for instant lookup during inference

Offline Training Layer (Bottom Row): This runs daily/weekly, not during user requests:

  1. Data Lake (S3/HDFS) stores historical event logs
  2. Spark processes logs into training features
  3. ML Training trains the recommendation models on GPU clusters
  4. Model Registry versions and stores trained models, which are then deployed to the Inference service

Key Data Flows:

  • User actions flow from Client → Kafka → Data Lake (for training)
  • Trained models flow from Model Registry → Inference (for serving)
  • Pre-computed features flow from Spark → Redis (for real-time lookup)

The Recommendation Service is the brain. It doesn't do the heavy lifting itself; it orchestrates the retrieval and ranking steps.

3. Data Model & Storage

Data is the fuel for recommendation engines. We need to store three types of data efficiently.

Loading diagram...

Understanding the Data Model:

The diagram shows the three core entities and their relationships:

User Data (Left):

  • Primary key: user_id
  • Fields: demographic information (age, location, language, device preferences)
  • Storage: PostgreSQL (~10GB total for 100M users)
  • Why PostgreSQL? User profiles are read-heavy with occasional updates—perfect for relational databases

Interaction Logs (Center):

  • Primary key: event_id
  • Fields: event_type (click, watch, like, skip), timestamp, context
  • Storage: Kafka for ingestion → S3/Parquet for long-term storage
  • Volume: ~500GB/day at our scale (100M users × 50 events/day × 100 bytes)
  • This is the heaviest data and powers all model training

Video Data (Right):

  • Primary key: video_id
  • Fields: title, tags, duration, uploader, upload_time
  • Storage: PostgreSQL (~10GB for 10M videos)
  • Updated infrequently (video metadata rarely changes)

The Relationships:

  • User 1:N Interaction — One user generates many interactions
  • Interaction N:1 Video — Many interactions point to each video

This schema enables two critical queries:

  1. "What has this user interacted with?" (personalization)
  2. "Which users interacted with this video?" (collaborative filtering)

1. User & Video Metadata (Relational)

We need fast random access to profile data and video details. Database: PostgreSQL or Cassandra (if write-heavy).

  • User: user_id, age, location, language, device_type.
  • Video: video_id, title, tags, duration, uploader_id, upload_time.

2. User Interaction Logs (Time-Series / Stream)

Every click, scroll, and hover is a signal. Storage: Apache Kafka (ingestion) → Parquet on S3 (long-term).

json
{
  "event_id": "evt_98765",
  "user_id": "u_12345",
  "video_id": "v_555",
  "event_type": "watch_75_percent",
  "timestamp": 1672531200,
  "context": {"connection": "4g", "device": "iphone13"}
}

3. Derived Features (Key-Value Store)

During inference, we can't compute "average watch time last 30 days" on the fly. We pre-compute these features. Storage: Redis or Cassandra.

  • Key: u_12345
  • Value: {"avg_watch_time": 450s, "top_genre": "scifi", "recent_clicks": [...]}

Common Pitfall: Beginners often query the raw interaction logs during the recommendation request. This kills latency. Always use a Feature Store (like Redis) with pre-aggregated stats for real-time inference.

Loading diagram...

Understanding the Feature Store:

The Feature Store solves a critical problem: you cannot compute "user's average watch time over 30 days" during a 200ms request. Instead, you pre-compute features offline and serve them instantly.

Offline Path (Left Side):

  1. Data Lake stores all historical events
  2. Spark/Flink runs batch jobs to compute aggregate features (e.g., "user watched 45% sci-fi content last month")
  3. Offline Store (Parquet/Hive) stores these features for model training
  4. ML Training uses these features to train models

Sync Process (Center):

  • Spark pushes computed features to Redis hourly
  • This ensures the online store has fresh features without real-time computation

Online Path (Right Side):

  1. User Request arrives with user_id
  2. Redis looks up pre-computed features in <5ms
  3. Features (avg_watch_time, top_genre, recent_clicks) are returned
  4. Inference uses these features to run the model

Why This Matters:

  • Computing features in real-time: ~500ms (too slow)
  • Looking up pre-computed features: ~5ms (acceptable)
  • The Feature Store trades storage for latency

4. The Funnel Architecture (Serving Pipeline)

This is the most critical concept in recommendation system design. You cannot rank 10 million videos for every user request—it's computationally impossible. Instead, we use a multi-stage funnel.

Loading diagram...

Stage 1: Retrieval (Candidate Generation)

Goal: Quickly eliminate 99.99% of irrelevant videos. Method: Collaborative Filtering, Matrix Factorization, or Two-Tower Embeddings. Output: ~500 raw candidates.

Stage 2: Ranking (Scoring)

Goal: Precision. Order the 500 candidates by probability of engagement. Method: Complex Deep Learning models (e.g., Wide & Deep, DLRM) using dense features. Output: Sorted list of 500 videos with scores.

Stage 3: Re-Ranking (Business Logic)

Goal: Optimization and Policy. Logic:

  • Diversity: Don't show 10 cat videos in a row.
  • Freshness: Boost newly uploaded content.
  • Fairness: Ensure creator diversity.
  • Deduplication: Remove videos already watched. Output: Final 10-20 videos sent to the user.

5. Core Algorithm: The Two-Tower Model

Modern retrieval uses "Two-Tower" Neural Networks to create embeddings. An embedding is a vector (a list of numbers, e.g., [0.1, -0.5, 0.8]) that represents the "essence" of a user or video.

Architecture

Loading diagram...

The Math

The similarity score S(u,v)S(u, v) is usually the dot product of the user vector uu and video vector vv:

S(u,v)=i=1duivi=uTvS(u, v) = \sum_{i=1}^{d} u_i \cdot v_i = u^T v

To turn this into a probability (e.g., probability of a click), we often use the Sigmoid function:

P(clicku,v)=11+e(uTv)P(\text{click} | u, v) = \frac{1}{1 + e^{-(u^T v)}}

In Plain English: The "Two Towers" effectively map users and videos into the same geometric space. The model learns to place a user close to the videos they will like. The dot product is just a mathematical ruler measuring "how aligned are these two vectors?" If the user vector points North and the video vector points North, the score is high. If they point in opposite directions, the score is low.

Interview Tip: Why two separate towers? Because the Video Tower can be run offline. We compute vectors for all 10M videos once and store them. The User Tower runs online (real-time) to capture their latest mood, producing a query vector to search against the stored video vectors.

Training: Loss Function & Negative Sampling

The model learns from positive pairs (user clicked video) and negative pairs (user did NOT click). Without negatives, the model would think every video is relevant.

Loss Function: Binary Cross-Entropy

L=1Ni=1N[yilog(σ(uiTvi))+(1yi)log(1σ(uiTvi))]L = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log(\sigma(u_i^T v_i)) + (1 - y_i) \log(1 - \sigma(u_i^T v_i)) \right]

Where:

  • yi=1y_i = 1 for positive interactions (clicks, watches)
  • yi=0y_i = 0 for negative samples
  • σ\sigma = sigmoid function
  • uiTviu_i^T v_i = dot product of user and item embeddings

In Plain English: The loss function penalizes the model when it predicts "high relevance" for videos the user ignored, and "low relevance" for videos the user loved. Over millions of examples, the model learns to distinguish good recommendations from bad ones.

Negative Sampling Strategies

StrategyDescriptionProsCons
Random NegativesSample random videos as negativesSimple, fastToo easy—model doesn't learn edge cases
Hard NegativesUse videos user almost clicked but didn'tStrong learning signalCan destabilize training
In-Batch NegativesOther users' positives become your negativesComputationally efficientPopularity bias (popular items over-sampled)
Mixed StrategyCombine 80% random + 20% hardBalanced learningMore complex to implement

Interview Tip: When discussing negative sampling, mention that "in-batch negatives" is the industry standard at scale (used by Google, YouTube, Pinterest) because it's computationally free—you reuse the other positive examples in the same training batch as negatives.

Training Pipeline Architecture

How do embeddings get updated as user behavior changes?

Loading diagram...

Understanding the Training Pipeline:

This pipeline runs daily (incremental) and weekly (full retrain) to keep models fresh.

Data Pipeline (Top Row):

  1. Kafka ingests ~5 billion events per day from user interactions
  2. S3 (Raw Logs) stores events in compressed Parquet format
  3. Spark transforms raw logs into training features:
    • Joins user features with video features
    • Creates positive pairs (user clicked) and negative samples (user didn't click)
    • Outputs training-ready datasets
  4. Training (GPU Cluster) trains the Two-Tower and ranking models

Model Deployment (Bottom Row):

  1. Model Artifacts are saved after training (weights, config)
  2. Validate runs offline metrics (AUC, NDCG) on held-out data
  3. Export Embeddings generates video vectors from the trained Video Tower
  4. Faiss Index builds a searchable index of 10M video embeddings

Timeline:

  • Events → Features: 2-4 hours (Spark job)
  • Features → Trained Model: 4-8 hours (GPU training)
  • Model → Deployed Index: 1-2 hours (export + indexing)
  • Total: ~8-14 hours for a full refresh

Key Decisions:

  • User embeddings: Update frequently (hourly or real-time) to capture current session intent.
  • Video embeddings: Update daily (content doesn't change as fast as user mood).
  • Full retraining: Weekly, to incorporate new videos and decaying old patterns.

6. Deep Dive: Vector Search (The Retrieval Engine)

Once we have 10 million video vectors, how do we find the 500 closest ones to our user? A standard loop for video in all_videos: compute_score() is too slow (O(N)O(N)).

We use Approximate Nearest Neighbor (ANN) algorithms.

Technology Choices

TechnologyLatencyScaleBest For
Faiss (Meta)<1msBillionsHigh-performance, self-hosted clusters.
ScaNN (Google)<1msBillionsState-of-the-art accuracy/speed trade-off on CPUs.
Pinecone/Milvus~10ms100M+Managed services if you don't want to maintain infrastructure.

How It Works (HNSW Index)

HNSW (Hierarchical Navigable Small World) builds a multi-layered graph. It's like a highway system for data.

  1. Start at the top layer (interstates) to get to the general neighborhood.
  2. Drop down to local roads to find the exact destination.
  3. This reduces search from O(N)O(N) to O(logN)O(\log N).

Real-World Example: Pinterest uses HNSW to search billions of images. When you click a photo, they convert it to a vector and find visually similar images in milliseconds using this graph traversal.

7. Ranking Architecture (The Precision Layer)

After retrieval gives us ~500 candidates, we need precision. The ranking model is heavier because it only runs on 500 items, not 10 million.

Model: Deep Learning Recommendation Model (DLRM) or Wide & Deep.

Loading diagram...

Understanding Wide & Deep:

Google's Wide & Deep model (2016) combines two complementary learning approaches in a single architecture:

Wide Component (Left Path): A simple linear model that takes sparse features and cross-product features as input. It excels at memorization — learning specific rules like "users who watched Video A also clicked Video B." Think of it as a lookup table of known patterns.

Deep Component (Right Path): A multi-layer neural network that takes dense embeddings as input. It excels at generalization — discovering new patterns like "Sci-Fi fans tend to enjoy Tech documentaries" even if that exact combination hasn't been seen before. The embedding layer compresses high-dimensional sparse features into dense 32-dimensional vectors.

Why Both? Memorization alone leads to overfitting on historical patterns. Generalization alone misses important specific associations. The combined output captures both.

  • Wide Part: Memorizes specific interactions (e.g., "User 123 clicked Video 456"). Good for exceptions.
  • Deep Part: Generalizes patterns (e.g., "Sci-Fi fans usually like Tech reviews"). Good for exploration.

Feature Crossing

We don't just feed raw data; we combine features.

Common Feature Engineering Examples:

Feature TypeRaw DataEngineered FeatureWhy It Matters
TemporalUser timestamptime_since_last_watchUser returning after 5 min acts differently than after 5 days.
InteractionUser Historycategory_affinity_score"User watched 5 Sci-Fi videos in last 10 attempts" = 0.5 affinity.
ContextualDevice + Timeis_commutingShort videos work better on mobile during 8am-9am.
StatisticalVideo Clicksctr_at_position_kNormalizes click rate by position bias.
  • User_Country × Video_Language (Does the user speak the video's language?)
  • Time_of_Day × Video_Category (Cartoons in the morning, Horror at night?)

Key Insight: The ranking stage is where you optimize for business objectives. You might have one model predicting "Probability of Click" (CTR) and another predicting "Probability of Completion" (CVR).

Final Score=w1P(Click)+w2P(WatchTime)\text{Final Score} = w_1 \cdot P(\text{Click}) + w_2 \cdot P(\text{WatchTime})

If you only optimize for clicks, you get clickbait. If you optimize for watch time, you get quality.

For a deeper dive into the statistical metrics behind these predictions, check out our guide on The Bias-Variance Tradeoff.

8. API Design

A well-designed recommendation API balances simplicity for clients with the diagnostic information engineers need for debugging and iteration.

Core Endpoints

Feed Recommendations:

http
GET /v1/recommendations/feed?user_id=u_12345&limit=10

Response:
{
  "items": [
    {
      "video_id": "v_555",
      "title": "System Design Interview Guide",
      "thumbnail_url": "...",
      "score": 0.98,
      "source": "retrieval_algo_v2",
      "tracking_token": "token_abc123"
    },
    ...
  ],
  "next_page_token": "page_2_xyz",
  "latency_ms": 145
}

Key Design Decisions

FieldPurposeWhy It Matters
scoreRelevance score (0-1)Useful for debugging ranking issues
sourceWhich algorithm generated this itemHelps track A/B test performance
tracking_tokenOpaque token linking this impression to model stateCritical for joining impressions to clicks
next_page_tokenCursor for paginationAvoids offset-based pagination problems
latency_msServer-side latencyClient can log for performance monitoring

Why These Fields Matter

The tracking_token Pattern:

When a user sees a recommendation, you need to record which model version, which features, and which ranking produced that specific item. Later, when the user clicks (or doesn't click), you join this event with the impression data. Without this join, you cannot train your models properly.

The source Field:

When running A/B tests or using multiple retrieval algorithms, knowing which algorithm produced each recommendation lets you:

  • Debug why certain items are showing up
  • Measure per-algorithm performance
  • Roll back specific algorithms without affecting others

Interview Tip: Always include a tracking_token in the response. When the user clicks the video, the client sends this token back. This allows the backend to join the "Click" event with the exact model version and features that generated the recommendation, which is crucial for debugging and training.

Most recommendation systems also need these additional endpoints:

  • POST /v1/events/impression - Log when items are shown to users
  • POST /v1/events/click - Log when items are clicked
  • GET /v1/recommendations/similar?item_id=v_555 - Similar item recommendations
  • GET /v1/recommendations/search?query=ml+tutorials - Search with personalization

9. Cold Start Handling

The cold start problem is one of the most common interview questions for recommendation systems. It occurs when we have no interaction history for a user or an item.

Loading diagram...

The Decision Tree Logic:

  1. Is the User New? If yes, we can't use history. We fallback to Geographic Popularity (what's trending in their city) or Demographic Groups (if age/gender is known).
  2. Is the Item New? If yes, it has no interaction data. We rely on Content-Based Filtering (using BERT to analyze video title/description) or Bandits (force-showing it to a few users to gather data).

Understanding the Cold Start Decision Flow:

The diagram shows the decision tree that runs for every recommendation request:

Step 1: User Check When a request arrives with user context, the first question is "Is this user new?" The system checks if there's any interaction history.

If User is New (Top Branch): Two fallback strategies kick in:

  • Popularity: Show globally trending content that most users enjoy
  • Demographics: Use region/language to find similar users and show what they watch

Step 2: Item Check For users with history, the next question is "Are we considering new items?" Items without engagement data need special handling.

If Item is New (Middle Branch):

  • Content-Based: Use the video's title, tags, and thumbnail to create embeddings—no user interaction needed
  • Exploration Budget: Reserve 5% of impressions for new content to gather initial signals

Standard Path (Bottom): If both user and item have history, use the full Two-Tower pipeline with collaborative filtering.

The Key Insight: Cold start isn't one problem—it's a decision tree with different solutions at each branch.

New User Strategies

StrategyHow It WorksProsCons
Global PopularityShow trending/most-watched videosSimple, always worksNo personalization
Demographic SimilarityFind similar users by age/locationSome personalizationPrivacy concerns
Onboarding QuizAsk explicit preferencesHigh-quality signalUser friction
ExplorationShow diverse content, learn fastBuilds profile quicklyInitial experience may be poor

New Item Strategies

StrategyHow It WorksProsCons
Content FeaturesUse title, tags, thumbnail embeddingsImmediate availabilityIgnores taste
Creator TransferNew video inherits creator's audienceLeverages existing dataUnfair to new creators
Freshness BoostTemporarily boost new content scoreEnsures visibilityMay show low-quality content
Exploration BudgetReserve 5-10% of impressions for new itemsGathers signal fastSlight quality hit

Interview Tip: When asked about cold start, don't just say "use popularity." Show you understand the trade-off: popularity works but creates a "rich get richer" problem. Mention exploration/exploitation (Multi-Armed Bandits, Thompson Sampling) to show depth.

Exploration vs Exploitation: The Core Trade-off

The cold start interview tip above mentions exploration/exploitation — this concept is so important it deserves its own discussion.

Loading diagram...

Understanding the Trade-off:

Every recommendation is a choice: show what you know the user likes (exploit) or show something new to learn their preferences (explore). Pure exploitation creates filter bubbles. Pure exploration feels random.

The Three Main Strategies:

StrategyHow It WorksWhen to Use
Epsilon-GreedyShow the best item (1-epsilon)% of the time; random item epsilon% of the timeSimple baseline; works well when epsilon = 5-10%
UCB (Upper Confidence Bound)Score = estimated_reward + uncertainty_bonus. Items with fewer impressions get a higher bonus.When you need deterministic exploration
Thompson SamplingSample from each item's reward distribution. Items with high uncertainty get sampled more often.Best theoretical guarantees; production-proven at Spotify, Netflix

Thompson Sampling in Practice:

For each candidate item, maintain a Beta distribution of its click probability:

  • Start with Beta(1, 1) — uniform prior (no opinion)
  • After each click: Beta(alpha + 1, beta)
  • After each skip: Beta(alpha, beta + 1)
  • To rank: sample from each item's distribution and sort

Items with few impressions have wide distributions (high variance), so they occasionally sample high values and get shown — this is exploration. Items with many impressions have narrow distributions centered on their true CTR — this is exploitation.

Interview Tip: If asked "How do you balance exploration and exploitation?", mention Thompson Sampling by name. It's used at Netflix, Spotify, and LinkedIn in production. The key advantage over epsilon-greedy is that it explores intelligently — items that might be good get explored more than items that are likely bad.

10. Evaluation Metrics

You cannot improve what you cannot measure. Recommendation systems require BOTH offline and online metrics.

Offline Metrics (Before Deployment)

These are computed on historical data before pushing a model to production.

MetricFormulaWhat It MeasuresTarget
Precision@K(Relevant in Top K) / KOf the K shown, how many were relevant?> 0.3
Recall@K(Relevant in Top K) / Total RelevantOf ALL relevant items, how many did we find?> 0.5
NDCG@KDCG / IDCGRanking quality (rewards good items at top positions)> 0.7
AUCArea Under ROCClick prediction accuracy> 0.8

Online Metrics (A/B Testing in Production)

These are the real business metrics measured after deployment.

MetricWhat It MeasuresWhy It Matters
CTRClick-through rateImmediate engagement signal
Watch TimeTotal minutes watchedQuality of recommendations
Completion Rate% of videos watched to endContent-match quality
D1/D7 RetentionUsers returning after 1/7 daysLong-term platform health
DiversityUnique categories in feedPrevents filter bubbles
Loading diagram...

Understanding the Metrics Hierarchy:

The diagram shows how different metrics relate and why you need all of them.

North Star Metric (Top):

  • Watch Time is the primary metric for video platforms
  • It represents genuine user value—users don't watch content they don't enjoy
  • All other metrics should ultimately drive this number

Leading Indicators (Left - "Gameable"): These metrics move quickly but can be manipulated:

  • CTR (Click-Through Rate): How often users click recommendations
  • Completion Rate: What percentage of videos are watched to the end

Warning: Optimizing CTR alone leads to clickbait. High CTR with low completion means users clicked but didn't enjoy the content.

Lagging Indicators (Right - "Slow"): These metrics are the true measure of success but take time to measure:

  • D7 Retention: Do users return after 7 days?
  • Revenue/LTV: Long-term user value

The Balance: You need both types:

  • Leading indicators give fast feedback for iteration
  • Lagging indicators confirm you're building real value

Common Pitfall: Optimizing only for CTR leads to clickbait. Optimizing only for watch time leads to addictive content. The best systems use a weighted combination:

Score=w1P(Click)+w2P(Complete)+w3P(Return)\text{Score} = w_1 \cdot P(\text{Click}) + w_2 \cdot P(\text{Complete}) + w_3 \cdot P(\text{Return})

11. Trade-offs & Alternatives

Every design decision has a cost. Here are the major ones for this architecture.

Architectural Decisions

DecisionOption AOption BWe ChoseWhy
Embedding UpdateReal-timeBatch (Daily)HybridUser embeddings update near real-time (to catch immediate interests), Video embeddings update daily (content changes slowly).
DatabaseSQL (Postgres)NoSQL (Cassandra)BothPostgres for user metadata (ACID compliance), Cassandra/DynamoDB for interaction logs (massive write throughput).
ServingPre-compute (Cache)On-the-flyOn-the-flyPre-computing fails for active users whose context changes fast. We only cache the "Head" (popular) queries.
Model FormatTensorFlow SavedModelONNXONNXFramework-agnostic, works with Triton Inference Server, easier to optimize.
Vector DBSelf-hosted FaissManaged PineconeFaiss on K8sCost savings at scale. Pinecone gets expensive at 10M+ vectors.

Model Architecture Trade-offs

ApproachLatencyAccuracyCold StartInterpretability
Matrix FactorizationVery FastGoodPoorMedium
Two-Tower DNNFastVery GoodGoodLow
Transformer (Sequential)SlowExcellentExcellentVery Low
Graph Neural NetworkMediumVery GoodGoodLow

Interview Tip: When asked "why not use Transformers everywhere?", explain the latency constraint. Transformers are O(n2)O(n^2) in sequence length. For retrieval over 10M items, even a small Transformer per item is prohibitive. We use Transformers only in ranking (500 items) or for encoding text features offline.

Case Study: Netflix vs. TikTok

FeatureNetflixTikTok
Context"Lean Back" (TV, 2-hour movies)"Lean Forward" (Mobile, 15s clips)
Signal DensityLow (1 movie decision per night)Extremely High (100 swipes per session)
ExplorationLow Risk (Users stick to safe choices)High Risk (Must show new viral trends instantly)
ArchitectureHeavy reliance on long-term historyHeavy reliance on immediate session context (RNNs/Transformers)

Netflix focuses on accuracy because starting a bad 2-hour movie is painful. TikTok focuses on speed and adaptation because skipping a bad 15-second video is costless.

12. Real-World Company Architectures

Theory is valuable, but seeing how industry leaders solve these problems at scale is invaluable. Let's examine four companies that have publicly shared details about their recommendation systems.

Twitter/X: The For You Timeline

In March 2023, Twitter open-sourced their recommendation algorithm, providing unprecedented insight into a production system serving hundreds of millions of users.

Twitter/X Recommendation Algorithm ArchitectureTwitter/X Recommendation Algorithm Architecture

Source: Twitter Engineering Blog

Key Components:

ComponentWhat It DoesScale
GraphJetReal-time bipartite graph engine maintaining user-tweet interactions. Uses SALSA random walks to find relevant out-of-network tweets.Serves ~15% of timeline tweets
SimClustersMatrix factorization creating 145k community clusters. Users and tweets are embedded into this community space.Updated every 3 weeks
TwHINTwitter's Heterogeneous Information Network embeddings for content understandingBillions of edges
Heavy Ranker48 million parameter neural network optimizing for Likes, Retweets, and RepliesScores ~1,500 candidates per request

The Funnel in Practice:

  1. Candidate Sources generate ~1,500 tweets from In-Network (people you follow) and Out-of-Network (GraphJet, SimClusters)
  2. Heavy Ranker scores all candidates with the 48M-param model
  3. Heuristics & Filters apply business rules (diversity, author limits, content policy)
  4. Mixing blends different candidate sources into the final timeline

Interview Insight: When Twitter engineers mention "In-Network" vs "Out-of-Network," they're distinguishing between tweets from accounts you follow (easy) versus discovering tweets from accounts you don't follow (hard). The hard part—Out-of-Network—is where GraphJet and SimClusters shine.

TikTok/ByteDance: Monolith

TikTok's recommendation system is arguably the most sophisticated in the world. Their 2022 paper "Monolith: Real Time Recommendation System With Collisionless Embedding Table" reveals key innovations.

TikTok Monolith Online Training ArchitectureTikTok Monolith Online Training Architecture

Source: Monolith arXiv Paper - Figure 1

The Problem TikTok Solved:

Traditional recommendation systems have a fatal flaw: the model serving predictions was trained hours or days ago. User interests shift in minutes on TikTok. If you suddenly start watching cooking videos, the old model doesn't know.

Monolith's Solution: Real-Time Training

Key Innovations:

InnovationHow It WorksWhy It Matters
Collisionless Embedding TableUses Cuckoo Hashing instead of standard hash tables. Each ID gets a unique embedding without collisions.Eliminates quality degradation from hash collisions at scale
Dual Parameter ServersTraining-PS learns from live data; Serving-PS handles inference. They sync every few minutes.Model adapts in near real-time
Expirable EmbeddingsOld user/video embeddings automatically expire after N daysKeeps memory bounded, removes stale data
Frequency FilteringUsers must have X logins, videos must have Y views before getting embeddingsReduces noise, focuses on engaged users

The Math Behind Cuckoo Hashing:

Standard embedding tables use hash(user_id) % table_size which causes collisions—two different users mapping to the same embedding. At TikTok's scale (billions of users), collisions degrade recommendation quality.

Cuckoo hashing uses two hash functions. If position h1(key) is occupied, the existing item is "kicked out" to its alternate position h2(key). This continues until everyone finds a home, guaranteeing no collisions.

Interview Insight: When asked "How would you make recommendations real-time?", mention TikTok's Training-PS / Serving-PS architecture with minute-level parameter synchronization. This shows you understand that "real-time" doesn't mean training on every click—it means fast feedback loops.

Pinterest: PinSage and Pinnability

Pinterest faces a unique challenge: their content is primarily visual, and user intent is often aspirational (planning a wedding, remodeling a kitchen) rather than immediate.

Pinterest PinSage Graph Neural Network ArchitecturePinterest PinSage Graph Neural Network Architecture

Source: PinSage KDD 2018 Paper

PinSage: Graph Neural Networks at Scale

Pinterest developed PinSage, a Graph Convolutional Network operating on a graph with 3 billion nodes and 18 billion edges—10,000x larger than typical GCN applications.

How PinSage Works:

  1. Graph Construction: Pins (images) and boards are nodes. Edges connect pins that appear on the same board.
  2. Random Walk Sampling: Instead of processing the entire graph, PinSage samples neighborhoods via random walks.
  3. Importance Pooling: Neighbors are weighted by visit frequency from random walks, not treated equally.
  4. Aggregation: Each pin's embedding is computed by aggregating its neighbors' features through neural network layers.

Pinnability: The Ranking Model

The Pinnability model predicts personalized pin relevance using:

  • Pin Signals: PinSage embedding, visual features, text annotations
  • User Signals: Long-term interests (yearly), short-term context (session-level)
  • Context Signals: Time of day, device, surface (home feed vs search)

Dual Timeframe Architecture:

Pinterest separates user modeling into two distinct timeframes:

TimeframeModelUpdatesCaptures
Long-termYearly Transformer predicting 14-day engagementWeeklyStable interests (home decor style, cuisine preferences)
Short-termSession-level sequence modelReal-timeImmediate intent (currently planning a birthday party)

Interview Insight: Pinterest's dual-timeframe approach is brilliant for interview discussions. It shows you understand that "user preferences" aren't monolithic—someone's decade-long love of minimalist design is different from their current obsession with planning a tropical vacation.

YouTube: Multi-Objective Optimization

YouTube's 2016 paper "Deep Neural Networks for YouTube Recommendations" introduced the two-tower architecture. Their 2019 follow-up introduced multi-objective optimization.

YouTube Deep Neural Network Recommendation ArchitectureYouTube Deep Neural Network Recommendation Architecture

Source: Google Research RecSys 2016

The Problem with Single Objectives:

Early YouTube optimized purely for click-through rate (CTR). Result? Clickbait. Thumbnails promised drama that videos didn't deliver. Users clicked but didn't watch.

Then they optimized for watch time. Result? Addictive, often low-quality content that kept users watching but left them feeling worse.

The Solution: Multi-Objective Ranking

YouTube now uses Multiple Objectives with the MMoE (Multi-gate Mixture-of-Experts) architecture:

Loading diagram...

Understanding the MMoE Architecture:

The diagram above shows the key innovation: instead of a single neural network trying to predict everything, MMoE uses multiple expert sub-networks that each learn different patterns in the data. Two gating networks (one per task) learn which experts to weight for each objective.

  • The Engagement Gate might weight Expert 1 and Expert 3 heavily (they learned CTR patterns)
  • The Satisfaction Gate might weight Expert 2 and Expert 4 (they learned quality patterns)
  • This allows shared learning (experts are shared) while task-specific specialization (gates are separate)

Objective Categories:

CategoryMetricsWhat They Capture
EngagementClicks, Watch Time, Completion RateUser attention (but not necessarily satisfaction)
SatisfactionLikes, Shares, Survey ResponsesUser value (did they enjoy it?)
QualityAuthoritative sources, E-A-T signalsContent trustworthiness

The MMoE Architecture:

Instead of a single neural network, MMoE uses multiple "expert" sub-networks, each specializing in different patterns. A gating network learns which experts to consult for each objective.

Final Score=w1P(Click)+w2P(LongWatch)+w3P(Like)+w4P(Share)w5P(Regret)\text{Final Score} = w_1 \cdot P(\text{Click}) + w_2 \cdot P(\text{LongWatch}) + w_3 \cdot P(\text{Like}) + w_4 \cdot P(\text{Share}) - w_5 \cdot P(\text{Regret})

The weights (w₁, w₂, ...) are tuned based on business priorities and A/B test results.

Handling Position Bias:

Users tend to click items shown at the top regardless of relevance. YouTube adds a "shallow tower" that learns position bias separately, then subtracts it from the main prediction. This prevents the model from learning "position 1 is always good."

Interview Insight: If asked "How do you balance engagement vs quality?", reference YouTube's multi-objective approach. The key insight is that you cannot optimize for a single metric—you need a weighted combination, and the weights are a business decision informed by A/B tests.

13. Beyond Two-Tower: Sequential & Transformer Models

The Two-Tower architecture excels at capturing static preferences, but users aren't static. Their interests evolve within a single session. Sequential models capture this dynamic behavior.

Loading diagram...

The Limitation of Two-Tower

Two-Tower creates a single user embedding from their history. But consider:

  • User watches: [Action Movie → Action Movie → Cooking Show → Cooking Show → Cooking Show]
  • Two-Tower embedding: "50% action, 50% cooking"
  • Reality: User clearly shifted from action to cooking!

SASRec: Self-Attentive Sequential Recommendation

SASRec (2018) applies self-attention (the mechanism behind Transformers) to sequential recommendation.

Architecture:

SASRec Pipeline:

  1. Input: User interaction sequence [item₁, item₂, item₃, ..., itemₙ]
  2. Embedding Layer: Item embeddings + positional encodings
  3. Self-Attention Blocks: L transformer layers
  4. Output: Prediction for next item (itemₙ₊₁)

Key Insight: Self-attention allows each position to attend to all previous positions, learning which past interactions are most relevant for predicting the next one. Recent items often matter more, but not always—a user might return to an interest from weeks ago.

Complexity Trade-off:

ModelComplexitySequential PatternsLong-Range Dependencies
Markov ChainO(1)Only last itemNone
RNN/LSTMO(n)YesWeak (vanishing gradient)
SASRec (Transformer)O(n²)YesStrong

BERT4Rec: Bidirectional Understanding

BERT4Rec (2019) adapts BERT's masked language modeling to recommendations.

Training Approach:

Instead of predicting the next item, BERT4Rec randomly masks items in the sequence and predicts them:

SequenceContent
OriginalMovie A → Movie B → Movie C → Movie D → Movie E
MaskedMovie A → [MASK] → Movie C → [MASK] → Movie E
TaskPredict Movie B and Movie D from surrounding context

Why Bidirectional?

SASRec only looks backward (causal attention). BERT4Rec looks both ways, understanding that Movie E might inform what Movie B should be. This captures richer patterns but requires careful handling at inference time.

When to Use Sequential Models

ScenarioBest ApproachRationale
Long-form content (Netflix movies)Two-Tower + light sequence featuresUsers make few decisions; each is deliberate
Short-form content (TikTok, YouTube Shorts)Full sequential modelRapid interactions; intent changes fast
E-commerce browsingHybridSession intent matters, but so do long-term preferences
Search rankingMostly non-sequentialQuery intent dominates session history

Interview Tip: When discussing Transformers for recommendations, always mention the latency constraint. You might say: "We'd use a Transformer-based model like SASRec in the ranking stage (500 candidates) but not in retrieval (10M candidates) due to O(n²) complexity. For retrieval, we'd encode the sequence offline into a single query vector."

14. What Breaks in Production

Recommendation systems that work perfectly in offline evaluation often fail spectacularly in production. Here's what goes wrong and how to fix it.

Loading diagram...

System Failures & Solutions Matrix:

Failure ModeSymptomRoot CauseSolution
Position BiasTop results get 90% clicksUsers are lazy; trust the ranking implicitlyShallow Tower: Learn bias as a feature and subtract it during serving.
Feedback LoopUser only sees Political NewsModel over-optimizes for engagement on past clicksExploration Budget: Force 5% random/new content into feed.
StalenessRecommendations feel "yesterday"Embeddings not updating fast enoughReal-time Updates: Move User Profile to Feature Store (Redis).
Cold StartNew users bounce immediatelyNo history to base rank onHeuristics: Serve "Global Best" or "Geographic Trending".

Understanding Production Failure Modes:

The diagram shows four categories of problems and how they interconnect:

Bias Issues (Top Row): Three types of bias that corrupt recommendations:

  • Position Bias: Items at position 1 get clicked regardless of quality—the model learns "top = good"
  • Popularity Bias: Popular items get more impressions → more clicks → higher scores → even more impressions
  • Echo Chambers: Users only see content similar to what they've seen before, narrowing their experience

The Feedback Loop (Middle): This is the core problem. Notice the circular flow:

  1. User clicks on recommendations
  2. Rec System uses those clicks to rank future items
  3. Training Data is updated with click history
  4. Cycle repeats → biases amplify over time

The arrows from the bias row show how each bias type feeds into this loop.

Operational Issues (Third Row): Technical problems that break systems in production:

  • Feature Drift: User preferences change, but features computed yesterday don't reflect today
  • Latency Issues: When the ML service times out, what fallback do you show?
  • Cold Start: New users/items without data need special handling

Mitigation Strategies (Bottom Row): Solutions that feed back into the system:

  • Debiasing: Techniques like IPW (Inverse Propensity Weighting) or position-aware models
  • Diversification: Algorithms like MMR or DPP that force variety in results
  • Monitoring: Drift detection to catch when model predictions diverge from reality

Position Bias

The Problem: Users click items at the top of the list regardless of relevance. Your model learns "position 1 = good" instead of "this item is relevant."

How to Detect: Randomly shuffle recommendations for a small percentage of traffic. If position 1's CTR drops dramatically while position 5's rises, you have position bias.

Solutions:

ApproachHow It WorksTrade-off
Position as FeatureInclude position in training; zero it out at inferenceSimple but imperfect
Inverse Propensity WeightingWeight training examples by 1/P(position)Theoretically sound; high variance
Shallow Tower (YouTube)Separate network learns position effect; subtract from main predictionMore complex; very effective

Popularity Bias (The Rich Get Richer)

The Problem: Popular items get more impressions → more clicks → higher predicted scores → more impressions. New and niche content never surfaces.

How to Detect: Track coverage metrics—what percentage of your catalog gets recommended? If 1% of items get 90% of impressions, you have a problem.

Solutions:

ApproachHow It WorksTrade-off
Exploration BudgetReserve 5-10% of impressions for random/new itemsSlight quality hit; gathers data
Popularity PenaltySubtract α × log(popularity) from scoresHelps niche items; may hurt user experience
Multi-Armed BanditsThompson Sampling balances exploration/exploitation dynamicallyMore complex; theoretically optimal
Creator-Side FairnessEnsure minimum exposure for all creators meeting quality barPolicy decision; may reduce engagement

Feedback Loops (Echo Chambers)

The Problem: Model recommends X → User engages with X (it was shown!) → Model learns "user likes X" → Recommends more X. The user never sees alternatives.

Example: User watches one political video → Algorithm shows more → User's feed becomes entirely political, regardless of their actual breadth of interests.

Solutions:

ApproachHow It WorksTrade-off
Diversity ConstraintsForce top-K to include N different categoriesMay feel random to users
Counterfactual TrainingTrain on "what would user do if shown Y instead of X?"Requires causal inference; complex
Periodic ExplorationOccasionally reset user profiles or inject diverse contentDisrupts personalization

Cold Start Edge Cases

Even with cold start strategies, edge cases break systems:

New User + New Item: Neither has history. Fall back to global popularity, but this creates homogeneous experiences.

Returning User After Long Absence: Their old preferences may be stale. TikTok's expirable embeddings address this—old data automatically ages out.

Seasonal Interests: User searches for "Halloween costumes" every October. Simple models don't capture this periodicity.

Feature Drift and Staleness

The Problem: User embeddings computed from yesterday's data don't reflect today's interests. Video embeddings don't update when engagement patterns change.

How to Detect: Monitor prediction calibration over time. If P(click) = 0.1 predictions have actual CTR of 0.05 after a week, your features are drifting.

Solutions:

ComponentUpdate FrequencyRationale
User short-term featuresReal-time (Kafka → Feature Store)Session intent changes fast
User long-term embeddingsHourlyStable but should reflect recent activity
Item embeddingsDailyContent doesn't change; engagement patterns do
Model weightsWeekly full retrain + daily incrementalBalance freshness vs stability

Production Wisdom: The most common production issue isn't algorithmic—it's data pipeline failures. A broken Kafka consumer means stale features. A failed embedding job means missing vectors. Build monitoring for data freshness, not just model accuracy.

Data Pipeline Monitoring Checklist

Production-grade recommendation systems need monitoring at every layer of the data pipeline, not just the model:

MonitorWhat to TrackAlert ThresholdWhy It Matters
Event IngestionKafka consumer lag, events/secLag > 10 min or throughput drops 50%Stale features lead to stale recommendations
Feature FreshnessTime since last Redis update per featureAny feature > 1 hour staleOld user embeddings miss recent behavior
Embedding Coverage% of active users/items with valid embeddingsCoverage drops below 95%Missing embeddings fall back to cold start
Model Prediction DriftKL divergence between predicted and actual CTR distributionsKL > 0.1 over 24h windowModel is losing calibration
Serving LatencyP50, P95, P99 of inference timeP99 > 150msUsers see loading spinners or timeouts
Null Feature Rate% of inference requests with missing featuresAny feature > 5% nullSignals a broken pipeline upstream

Pro Tip: Set up a "data SLA dashboard" that shows feature freshness for each pipeline stage. When recommendations feel stale, the first question should be "is the data fresh?" — not "is the model wrong?"

15. A/B Testing: The Complete Guide

Deep Dive Available: For a comprehensive treatment of A/B testing fundamentals, statistical foundations, and practical implementation, see our dedicated article: A/B Testing Design and Analysis: How to Prove Causality with Data

"We shipped the model and engagement went up 5%!" means nothing without proper A/B testing. This section covers A/B testing specifically for recommendation systems.

Experiment Design

Random Assignment:

Loading diagram...

Users must be randomly assigned to control (A) or treatment (B). This ensures the only difference between groups is the change you're testing.

Bucket Assignment:

  • User ID → Hash Function → Bucket (0-99)
  • Buckets 0-49: Control (existing model)
  • Buckets 50-99: Treatment (new model)

Sample Size Calculation:

Before running the test, calculate required sample size:

ParameterTypical ValueImpact
Baseline metrice.g., 5% CTRHigher baseline → fewer samples needed
Minimum Detectable Effect (MDE)e.g., 2% relative liftSmaller MDE → more samples needed
Statistical Power80% (standard)Higher power → more samples needed
Significance Level5% (α = 0.05)Lower α → more samples needed

Rule of Thumb: For a 2% relative lift on 5% CTR with 80% power and 95% confidence, you need roughly 400,000 users per group.

What Netflix Does

Netflix shared their A/B testing philosophy:

  1. Hypothesis First: Every test starts with a hypothesis: "Showing Top 10 will help members find content, increasing satisfaction."

  2. Local + Global Metrics: They track both the specific feature metric (Top 10 clicks) AND global metrics (overall viewing hours). An improvement in local metrics that hurts global metrics is a red flag.

  3. Holdout Groups: Even after shipping, Netflix maintains a small percentage of users on the old experience to measure long-term effects.

  4. Skepticism of Large Effects: If a test shows 20% improvement, they're suspicious. Large effects often indicate bugs, not breakthroughs. They re-run such tests.

What Spotify Does

Spotify runs hundreds of experiments simultaneously:

  1. Precise Sample Sizing: They calculate exact sample sizes to avoid underpowered tests that waste traffic without delivering answers.

  2. Pre-Registration: Hypotheses and success metrics are documented before the test runs. This prevents cherry-picking positive results.

  3. Minimum Duration: Tests run at least 1-2 weeks to capture weekly patterns (weekend vs weekday behavior differs).

  4. Learning from Failures: Failed tests are documented. Understanding WHY a hypothesis was wrong often yields more insight than confirming it was right.

Common A/B Testing Mistakes

MistakeWhy It's WrongHow to Avoid
Peeking at results earlyMultiple comparisons inflate false positive ratePre-commit to sample size; use sequential testing methods if peeking is necessary
Stopping at significancep = 0.049 today might be p = 0.06 tomorrowWait for pre-calculated sample size
Ignoring segmentsOverall neutral effect may hide +10% for new users, -5% for power usersPre-register key segments; analyze heterogeneous effects
Only measuring short-termA feature that boosts CTR might hurt 30-day retentionInclude long-term holdouts; track delayed metrics
Network effectsUser A's treatment affects User B's control (e.g., in social features)Use cluster randomization (randomize by friend groups, not individuals)

Metrics Hierarchy for Recommendations

Metrics Hierarchy:

LevelCategoryMetrics
North StarBusinessMonthly Active Users, Revenue
PrimaryEngagementSessions, Watch Time, Clicks, Impressions
PrimarySatisfactionRatings, Surveys, Likes, Shares
PrimaryQualityDiversity, Freshness, Coverage, Creator Mix
ModelOfflineAUC, NDCG@K, Recall@K, Precision@K

Interview Insight: When discussing A/B testing, mention the distinction between "guardrail metrics" (things that must not get worse, like app crashes) and "success metrics" (things you're trying to improve, like engagement). A test can improve success metrics but fail if it hurts guardrails.

16. Cost Estimation at Scale

Understanding the cost of recommendation infrastructure helps you make architectural decisions and defend them in interviews.

Reference Architecture Costs

For a system serving 100M DAU with 10M items:

ComponentServiceScaleEstimated Monthly Cost
Event IngestionAWS MSK (Kafka)500GB/day, 50K events/sec$3,000 - $5,000
Feature StoreRedis Cluster100GB hot data, 100K QPS$2,000 - $4,000
Vector DatabaseSelf-hosted Faiss on EC210M vectors, 50K QPS$1,500 - $3,000
Model ServingTriton on GPU (g4dn.xlarge)50K QPS, P99 < 50ms$5,000 - $10,000
TrainingSageMaker / EC2 GPUDaily training, weekly full retrain$3,000 - $8,000
Interaction LogsS3 + Athena500GB/day, 30-day retention$500 - $1,000
Metadata DBRDS PostgreSQL10M items, high-read$500 - $1,000
CDN / API GatewayCloudFront + API Gateway50K QPS$1,000 - $2,000

Total Estimated Range: 17,000 - 37,000 USD/month

Cost Drivers and Optimizations

Cost DriverOptimizationSavings
GPU ServingQuantization (FP32 → INT8), batching requests50-70%
Vector SearchProduct Quantization (PQ) reduces memory 4-8x60-80%
Feature StoreTiered storage (hot/warm/cold), TTLs on old data40-60%
TrainingSpot instances, incremental training instead of full retrains50-70%
StorageParquet compression, lifecycle policies to Glacier60-80%

Build vs Buy: Vector Database

Option10M Vectors Cost100M Vectors CostProsCons
Pinecone~$70/month (Starter)$700+/monthFully managed, easyExpensive at scale, vendor lock-in
Self-hosted Faiss~$500/month (3x r5.xlarge)~$2,000/monthFull control, cheapest at scaleOperational burden
AWS OpenSearch~$300/month~$1,500/monthManaged, AWS integrationNot specialized for vectors
Milvus (Self-hosted)~$600/month~$2,500/monthFeature-rich, open sourceComplex to operate

Interview Insight: When asked about costs, don't just quote numbers—explain the trade-offs. "Pinecone is great for startups at ~70 USD/month, but at our scale of 100M vectors, self-hosted Faiss saves ~30K USD/month with ~2 engineer-weeks of setup. That's a clear ROI."

Scaling Economics

Sublinear Scaling (Good):

  • Vector search: 10x more vectors ≠ 10x more cost (indexing amortizes)
  • Model serving: Batching requests improves GPU utilization

Linear Scaling (Expected):

  • Event ingestion: 10x more users = ~10x more events = ~10x cost
  • Storage: 10x more logs = ~10x storage cost

Superlinear Scaling (Dangerous):

  • Model complexity: Adding features can increase serving cost quadratically
  • A/B testing: More experiments = more traffic splits = reduced power per test

17. Conclusion

Designing a recommendation system at scale is about managing the funnel. You start with millions of items, filter them down with fast approximate algorithms (retrieval), and then carefully rank the survivors with heavy deep learning models.

Loading diagram...

The Complete Picture:

This summary diagram shows all three layers working together at scale.

Data Layer (Left Column): The foundation that captures and stores all information:

  • Kafka: Real-time event streaming (clicks, watches, scrolls)
  • S3: Long-term storage for training data
  • Redis: Feature Store for instant feature lookup
  • PostgreSQL: Metadata for users and videos

ML Layer (Center Column): The intelligence that powers recommendations:

  • Two-Tower Model: Fast retrieval of candidate videos (500 from 10M)
  • DLRM (Deep Learning Ranking Model): Precise ranking of candidates
  • Rules Engine: Business logic (diversity, freshness, content policy)
  • Cold Start: Fallback strategies for new users/items

Serving Layer (Right Column): The infrastructure that delivers recommendations in <200ms:

  • API Gateway: Authentication, rate limiting, request routing
  • Load Balancer: Distributes traffic across servers
  • Triton: GPU-optimized model inference
  • Faiss: Sub-millisecond vector similarity search

The Flow:

  1. User request hits API Gateway
  2. Rec Service fetches user features from Redis
  3. Two-Tower model retrieves 500 candidates via Faiss
  4. DLRM ranks the 500 candidates
  5. Rules Engine applies business logic
  6. Top 10-20 videos returned to user

At Scale:

  • 100M Daily Active Users
  • 50,000 Queries Per Second
  • <200ms P99 latency
  • 10M videos in the catalog

Looking Ahead: LLMs Meet Recommendations

The next frontier in recommendation systems is the integration of Large Language Models (LLMs). Several trends are emerging:

LLMs as Feature Extractors: Instead of training custom content encoders, companies are using LLM embeddings (from models like GPT-4 or Gemini) to represent item content. These embeddings capture nuanced semantic meaning — a video about "beginner-friendly Python data visualization" and one about "matplotlib tutorial for newcomers" are understood as deeply similar, something traditional tag-based systems miss.

Conversational Recommendations: Platforms like Spotify ("DJ" feature) and Netflix are experimenting with chat-based interfaces where users describe what they want in natural language: "Something funny but not too silly, maybe a heist movie." An LLM interprets the intent and maps it to recommendation candidates.

Reasoning over User History: LLMs can analyze a user's watch history and generate explanations: "Because you watched three documentaries about space this week, you might enjoy this new series about the Mars mission." This combines recommendation with interpretability.

Where This Is Heading: Expect hybrid systems where traditional Two-Tower/DLRM models handle the high-throughput retrieval and ranking (50K QPS), while LLMs handle the "last mile" — re-ranking the final 20-50 items with richer contextual understanding, or powering conversational discovery features that complement the algorithmic feed.

Key Takeaways for Interviews

  1. Always start with requirements. Clarify scale, latency, and cold start before drawing boxes.

  2. Separate retrieval from ranking. This is the most important architectural decision. Retrieval prioritizes recall (fast, approximate). Ranking prioritizes precision (slow, accurate).

  3. Don't forget cold start. New users and new items break collaborative filtering. Always have a fallback (popularity, content-based features, exploration).

  4. Measure both offline AND online metrics. High offline NDCG doesn't guarantee high online watch time. Always A/B test.

  5. Negative sampling matters. Without negatives, the model thinks everything is relevant. In-batch negatives are the industry standard.

The "secret sauce" isn't just the algorithm—it's the infrastructure that feeds it:

  • Kafka for real-time data ingestion at 500GB/day.
  • Vector Databases (Faiss) for sub-millisecond search over 10M embeddings.
  • Feature Stores (Redis) to provide instant user context without querying raw logs.
  • Model Serving (Triton) for low-latency GPU inference at 50K QPS.

Further Reading

To master the concepts powering these models:

Practice Problems

  1. Trending Now: Design a "Trending Now" section. Does it fit into the main funnel, or is it a separate cache?

  2. Diversity vs. Relevance: How would you ensure users see content from at least 5 different categories? Where in the pipeline do you enforce this?

  3. Creator Fairness: Small creators complain their videos never get recommended. How would you address this without hurting engagement metrics?

  4. Real-Time Personalization: A user just watched 3 cooking videos in a row. How fast can you update their recommendations?

These questions test whether you understand not just the "what" but the "why" behind each architectural decision.