Part 2: Finding 'Same Nutrition, Different Meal' with Cosine Similarity

Introduction

In Part 1, I built a master dataset of 19,312 recipes, each with a complete nutritional profile. Now the question is: what can we actually do with all that data?

The first problem I wanted to solve was this: “I want something different for dinner, but with the same nutritional balance.”

If you have ever tried to eat healthily while cooking every day, you know the tension. You find a meal that hits the right nutritional targets, but after eating it three times in a week, nobody wants to see it again. What you need is a way to find alternatives — dishes that are nutritionally equivalent but taste completely different.

This is a textbook case for cosine similarity.


Why Cosine Similarity?

When comparing recipes by nutrition, the naive approach would be to calculate the Euclidean distance between their nutritional values. But this has a problem: it is dominated by the absolute scale of each nutrient.

For example, calories might range from 50 to 800, while vitamin B2 might range from 0.01 to 0.5. In Euclidean distance, a difference of 100 calories would completely overshadow a difference of 0.1 mg of vitamin B2, even though both differences might be equally significant nutritionally.
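To make this concrete, here is a tiny sketch with made-up numbers for just two nutrients. Recipe B differs sharply from A in vitamin B2, recipe C differs sharply in calories, yet Euclidean distance sees only the calorie gap:

```python
import numpy as np

# Two dimensions: (energy_kcal, vitamin_b2_mg). Values are illustrative.
a = np.array([500.0, 0.30])
b = np.array([500.0, 0.05])   # very different vitamin B2
c = np.array([600.0, 0.30])   # very different calories

print(np.linalg.norm(a - b))  # 0.25  -- the B2 gap barely registers
print(np.linalg.norm(a - c))  # 100.0 -- the calorie gap dominates
```

Nutritionally, both gaps may matter equally, but on the raw scale the calorie axis drowns out everything else.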

Cosine similarity solves this by comparing the direction of vectors rather than their magnitude. Two recipes with similar nutritional proportions will have a high cosine similarity, regardless of portion size.

cosine_similarity = cos(θ) = (A · B) / (||A|| × ||B||)

A value of 1.0 means the nutritional profiles point in exactly the same direction (identical proportions). A value of 0.0 means the profiles are orthogonal, with no relationship at all. After standardization, negative values are also possible; they indicate profiles that deviate from the average in opposite directions.
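The scale-invariance is easy to verify directly from the formula. In this sketch (with made-up macro values), tripling the portion leaves the similarity at exactly 1.0:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two 1-D vectors: (A . B) / (||A|| * ||B||)."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# The same dish at two portion sizes: identical proportions
small = np.array([200.0, 20.0, 5.0])   # kcal, protein g, fat g (illustrative)
large = small * 3

print(cosine(small, large))  # ~1.0: direction matters, magnitude does not
```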


Setting Up the Vectors

I started with recipe-level similarity. Each recipe is represented as a 19-dimensional vector, with one dimension for each nutritional attribute:

import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.metrics.pairwise import cosine_similarity

# Load the master dataset
df = pd.read_csv("recipe_master.csv")

# Select nutritional columns (19 dimensions)
nutrition_cols = [
    "energy_kcal", "protein_g", "fat_g", "carbohydrate_g",
    "sodium_mg", "potassium_mg", "calcium_mg", "magnesium_mg",
    "iron_mg", "zinc_mg", "vitamin_a_ug", "vitamin_b1_mg",
    "vitamin_b2_mg", "vitamin_c_mg", "dietary_fiber_g",
    "salt_equivalent_g", "cholesterol_mg", "saturated_fat_g",
    "sugar_g"
]

# Drop recipes missing most of their nutritional values
# (the threshold of 15 non-null columns here is illustrative)
df = df.dropna(thresh=15, subset=nutrition_cols)

# Extract nutritional data, filling the remaining gaps with 0
X = df[nutrition_cols].fillna(0).values
print(f"Recipe matrix shape: {X.shape}")  # (18174, 19)

After filtering out recipes with too many missing nutritional values, I had 18,174 recipes across 19 dimensions.

Why StandardScaler Matters

Before computing similarity, normalization is critical. Without it, nutrients measured in hundreds (like calories) would dominate over nutrients measured in fractions (like vitamins):

# Normalize with StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Verify: each column now has mean=0, std=1
print(f"Mean after scaling: {X_scaled.mean(axis=0).round(4)}")
print(f"Std after scaling:  {X_scaled.std(axis=0).round(4)}")

StandardScaler transforms each nutritional dimension to have a mean of 0 and a standard deviation of 1. This ensures that every nutrient contributes equally to the similarity calculation.
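For intuition, StandardScaler is just the z-score transform applied column by column. A toy check with synthetic numbers (one column in the hundreds, one in fractions) confirms the equivalence:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data: calories in the hundreds, a vitamin in fractions
X_toy = np.array([[500.0, 0.10],
                  [300.0, 0.40],
                  [700.0, 0.25]])

X_scaled_toy = StandardScaler().fit_transform(X_toy)

# StandardScaler is the z-score transform: (x - mean) / std, with ddof=0
manual = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)
print(np.allclose(X_scaled_toy, manual))  # True
```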

Computing Similarity

# Compute pairwise cosine similarity matrix
similarity_matrix = cosine_similarity(X_scaled)
print(f"Similarity matrix shape: {similarity_matrix.shape}")
# (18174, 18174)

def find_similar_recipes(recipe_index, top_n=30):
    """Find the top-N most nutritionally similar recipes."""
    similarities = similarity_matrix[recipe_index]

    # Get indices sorted by similarity (descending), excluding self
    similar_indices = np.argsort(similarities)[::-1][1:top_n+1]

    results = []
    for idx in similar_indices:
        results.append({
            "title": df.iloc[idx]["recipe_title"],
            "category": df.iloc[idx]["category"],
            "similarity": round(similarities[idx], 4),
            "calories": df.iloc[idx]["energy_kcal"]
        })

    return results
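A side note on performance: argsort sorts the full 18,174-wide row for every query, which is O(n log n). A possible optimization, sketched here with random stand-in scores since the real matrix is not reproduced, is np.argpartition, which pulls out the top-N candidates in O(n) and then sorts only those N:

```python
import numpy as np

rng = np.random.default_rng(42)
similarities = rng.random(18_174)  # stand-in for one row of the similarity matrix
top_n = 10

# Partition so the top_n largest values land at the end, then sort only those
candidates = np.argpartition(similarities, -top_n)[-top_n:]
top_indices = candidates[np.argsort(similarities[candidates])[::-1]]

print(top_indices.shape)  # (10,)
```

At this matrix size a full sort is still fast enough in practice; the partition trick matters more as the recipe count grows.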

Results: Surprising Alternatives

Let me show you what this looks like in practice. Searching for recipes similar to a standard chicken stir-fry:

# Example: Find recipes similar to recipe at index 42
target = df.iloc[42]
print(f"Target recipe: {target['recipe_title']}")
print(f"Category: {target['category']}")
print(f"Calories: {target['energy_kcal']} kcal\n")

similar = find_similar_recipes(42, top_n=10)
for i, r in enumerate(similar, 1):
    print(f"{i}. {r['title']} ({r['category']}) "
          f"- similarity: {r['similarity']}, {r['calories']} kcal")

The results were genuinely useful. A chicken stir-fry would match with dishes like a pork and vegetable simmered dish, a tofu-based casserole, or a fish with root vegetables — all nutritionally similar but completely different in taste and preparation. The system was surfacing substitutions that a human might never think of, precisely because the nutritional similarity was hidden behind very different ingredient lists.


Menu-Level Similarity

Recipe-level search is useful, but real meal planning works at the menu level — a full meal with a main dish, side dishes, soup, and rice. I extended the similarity search to work with complete menus.

Building Menu Vectors

A menu is the sum of its component dishes. Each menu’s nutritional vector is the aggregation of its recipes’ vectors:

# Load menu composition data (which recipes make up each menu)
menu_df = pd.read_csv("menu_composition.csv")

# Aggregate nutrition per menu (27 dimensions at menu level)
menu_nutrition_cols = [
    "energy_kcal", "protein_g", "fat_g", "carbohydrate_g",
    "sodium_mg", "potassium_mg", "calcium_mg", "magnesium_mg",
    "iron_mg", "zinc_mg", "vitamin_a_ug", "vitamin_d_ug",
    "vitamin_e_mg", "vitamin_b1_mg", "vitamin_b2_mg",
    "vitamin_b6_mg", "vitamin_b12_ug", "vitamin_c_mg",
    "folate_ug", "pantothenic_acid_mg", "dietary_fiber_g",
    "salt_equivalent_g", "cholesterol_mg", "saturated_fat_g",
    "n3_fatty_acid_g", "n6_fatty_acid_g", "sugar_g"
]

menu_vectors = menu_df.groupby("menu_id")[menu_nutrition_cols].sum()
print(f"Menu matrix shape: {menu_vectors.shape}")  # (1526, 27)

At the menu level, I used 27 nutritional dimensions — more granular than the recipe-level search, because when you are planning a full day’s meals, details like folate, pantothenic acid, and fatty acid ratios become relevant.

# Normalize menu vectors
menu_scaler = StandardScaler()
menu_scaled = menu_scaler.fit_transform(menu_vectors.values)

# Compute menu-level cosine similarity
menu_similarity = cosine_similarity(menu_scaled)
print(f"Menu similarity matrix: {menu_similarity.shape}")
# (1526, 1526)

def find_similar_menus(menu_index, top_n=10):
    """Find menus with similar overall nutritional profiles."""
    similarities = menu_similarity[menu_index]
    similar_indices = np.argsort(similarities)[::-1][1:top_n+1]

    results = []
    for idx in similar_indices:
        mid = menu_vectors.index[idx]
        results.append({
            "menu_id": mid,
            "similarity": round(similarities[idx], 4),
            "total_calories": menu_vectors.iloc[idx]["energy_kcal"]
        })
    return results

With 1,526 menus in the database, the system could now answer questions like: “Last Tuesday’s dinner was nutritionally great. What else can I cook this week that hits the same targets but feels completely fresh?”


Analysis: What Worked and What Did Not

What Worked Well

  1. Surprising discoveries: The system consistently surfaced non-obvious alternatives. A grilled salmon dinner would match with a completely different cuisine — say, a simmered tofu and vegetable set — because their nutritional profiles aligned closely.

  2. StandardScaler was essential: Without normalization, the results were dominated by calorie differences and ignored micronutrient patterns. After scaling, the similarity scores became much more meaningful.

  3. Menu-level search outperformed recipe-level: Individual recipe similarity was interesting but less practical. Menu-level similarity, where the full nutritional picture of a meal is compared, produced more actionable results.

Limitations

  1. No temporal awareness: Cosine similarity is stateless. It does not know what you ate yesterday, so it might suggest the same style of cooking three days in a row. This is the problem I address in Part 3 with LSTM.

  2. No taste or preference modeling: Two recipes can be nutritionally identical but one might use ingredients you dislike. The system has no way to account for personal preferences.

  3. Portion size ambiguity: While cosine similarity is scale-invariant (which is usually a strength), it means a recipe serving 1 person and a recipe serving 4 people could appear equally similar, even though the actual nutritional intake per person would differ.
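One way to mitigate the portion-size problem, sketched below under the assumption that a servings count is available (this dataset may not actually carry one; the "servings" column and all numbers here are hypothetical), is to divide nutrition by servings before scaling, so vectors compare per-person intake:

```python
import pandas as pd

# Hypothetical 'servings' column -- not part of the original dataset
df = pd.DataFrame({
    "recipe_title": ["solo stir-fry", "family stew"],
    "energy_kcal": [650.0, 2600.0],
    "protein_g": [30.0, 120.0],
    "servings": [1, 4],
})

nutrient_cols = ["energy_kcal", "protein_g"]
per_serving = df[nutrient_cols].div(df["servings"], axis=0)
print(per_serving)
# The family stew becomes 650 kcal / 30 g protein per person, now comparable
```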


The Bigger Picture

Cosine similarity gave me a powerful tool for finding nutritional equivalents, but meal planning is more than nutrition matching. The next challenge was temporal: how do you suggest menus that avoid repetition over time?

That is fundamentally a sequence prediction problem — given what someone ate over the past week, predict what they should eat next. And sequence prediction is where recurrent neural networks shine.


Previous: Part 1: Solving the Daily Meal Planning Problem with Data

Next: Part 3: Predicting “Non-Boring” Menus with LSTM Time Series
