Part 1: Solving the Daily Meal Planning Problem with Data

Introduction

“What should we cook tonight?”

If you cook at home regularly, you know this question is far more complicated than it sounds. You need to consider nutrition, avoid repeating the same meals, work with what’s in the fridge, keep things interesting for everyone at the table, and somehow do all of this every single day.

Professional nutritionists spend years learning to balance all these factors. For the rest of us, it is an exhausting daily puzzle with no clear solution.

This series is about my attempt to solve that problem with machine learning — and the journey spans over a decade, from classical data science all the way to modern LLMs.


Why I Started This Before the AI Boom

I began this project more than ten years ago, well before “AI” became a household word. At the time, deep learning was still an academic curiosity, and GPT did not exist. What I had were traditional tools: pandas, scikit-learn, and a lot of CSV files.

The motivation was simple. I was cooking every day, and the mental load of planning nutritionally balanced, non-repetitive meals was real. I thought: if I could represent recipes as numerical vectors, maybe I could use math to find better meal combinations.

That instinct turned out to be right — though the path from idea to working system was anything but straightforward.


Three Data Sources

The foundation of any ML project is data. I started with three datasets:

| Dataset | File | Records | Contents |
|---|---|---|---|
| Recipes | recipe.csv | 19,902 | Recipe names, categories, serving sizes, cooking instructions |
| Ingredients | material.csv | 196,126 | Ingredient names, quantities, and units linked to each recipe |
| Nutrition | nutrition.csv | | Per-recipe nutritional breakdown (calories, protein, fat, vitamins, minerals) |

The recipes came from a publicly available Japanese recipe database. Each recipe had a unique ID, and the ingredient and nutrition tables linked back to that ID — a classic relational structure.


Step 1: Cleansing the Recipe Data

Raw data is never clean. The recipe dataset had missing values, inconsistent formatting, and duplicate entries. Here is how I approached the cleansing:

import pandas as pd

# Load raw recipe data
recipe = pd.read_csv("recipe.csv", encoding="cp932")

# Check the shape and missing values
print(recipe.shape)  # (19902, 8)
print(recipe.isnull().sum())

The first thing I noticed was that the recipe_id column had some null values, and certain text fields contained inconsistent encodings (a common problem with Japanese text data in CSV format).
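As an illustrative aside (not the exact fix from my original pipeline), the most common width inconsistencies in Japanese text — half-width katakana mixed with full-width digits — can be folded into a consistent form with Unicode NFKC normalization:

```python
import unicodedata

import pandas as pd

# Toy examples of width inconsistencies typical of Japanese CSV data:
# half-width katakana ("ｸﾞﾗﾑ") and full-width digits ("２") mixed in.
raw = pd.Series(["ｸﾞﾗﾑ", "大さじ２", "カップ1"])

# NFKC folds half-width katakana to full-width and
# full-width digits/ASCII to half-width.
normalized = raw.map(lambda s: unicodedata.normalize("NFKC", s))

print(normalized.tolist())  # ['グラム', '大さじ2', 'カップ1']
```

NFKC is a blunt instrument (it also rewrites some symbols), but for ingredient quantities and units it usually does exactly what you want.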

# Drop rows with missing recipe IDs
recipe = recipe.dropna(subset=["recipe_id"])

# Convert recipe_id to integer for consistent joining
recipe["recipe_id"] = recipe["recipe_id"].astype(int)

# Strip whitespace from string columns
str_cols = recipe.select_dtypes(include="object").columns
recipe[str_cols] = recipe[str_cols].apply(lambda x: x.str.strip())

print(f"After cleansing: {len(recipe)} recipes")

After removing entries with missing IDs and cleaning up the text fields, I had a solid base of recipes to work with.
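The duplicate entries mentioned earlier deserve a word too. A minimal sketch of the idea, on toy data (the real pipeline keyed on the actual columns):

```python
import pandas as pd

# Toy recipe data with a duplicated recipe_id (hypothetical titles).
recipe = pd.DataFrame({
    "recipe_id": [101, 102, 102, 103],
    "recipe_title": ["Miso Soup", "Curry Rice", "Curry Rice", "Oyakodon"],
})

# Keep the first occurrence of each recipe_id and report how many rows go.
before = len(recipe)
recipe = recipe.drop_duplicates(subset=["recipe_id"], keep="first")
print(f"Dropped {before - len(recipe)} duplicate row(s)")
```

Deduplicating on the ID (rather than the whole row) also catches near-duplicates where only whitespace or formatting differs.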


Step 2: Converting Recipes to JSON

For downstream ML processing, I needed each recipe as a self-contained JSON object rather than a flat CSV row. This made it much easier to work with nested ingredient lists later.

import json

def recipe_to_json(row):
    """Convert a single recipe row to a structured JSON object."""
    return {
        "recipe_id": int(row["recipe_id"]),
        "title": row["recipe_title"],
        "category": row["category"],
        "servings": row["servings"],
        "instructions": row["instructions"]
    }

recipes_json = recipe.apply(recipe_to_json, axis=1).tolist()

# Save as JSON for inspection
with open("recipes_structured.json", "w", encoding="utf-8") as f:
    json.dump(recipes_json, f, ensure_ascii=False, indent=2)

print(f"Converted {len(recipes_json)} recipes to JSON")

Step 3: Aggregating Ingredient Data

The ingredient table had multiple rows per recipe — one row per ingredient. I needed to aggregate these into a single record per recipe using groupby.

# Load ingredient data
material = pd.read_csv("material.csv", encoding="cp932")

print(material.shape)  # (196126, 5)
print(material.head())

Each row contained a recipe_id, ingredient name, quantity, and unit. To create a per-recipe summary:

# Count ingredients per recipe
ingredient_counts = material.groupby("recipe_id").agg(
    ingredient_count=("material_name", "count"),
    ingredients=("material_name", lambda x: list(x))
).reset_index()

print(f"Ingredient data aggregated for {len(ingredient_counts)} recipes")
print(ingredient_counts.head())

This gave me a DataFrame where each row represented one recipe with its full ingredient list and a count of how many ingredients it required.
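To make concrete how these aggregates slot into the JSON objects from Step 2, here is an illustrative sketch on toy data (field names assumed from the snippets above, not my exact code):

```python
import pandas as pd

# Toy versions of a Step 2 JSON object and the Step 3 aggregate.
recipes_json = [{"recipe_id": 101, "title": "Miso Soup"}]
ingredient_counts = pd.DataFrame({
    "recipe_id": [101],
    "ingredient_count": [3],
    "ingredients": [["tofu", "wakame", "miso"]],
})

# Index the aggregate by recipe_id for fast lookup, then nest the list.
lookup = ingredient_counts.set_index("recipe_id")
for r in recipes_json:
    r["ingredients"] = lookup.loc[r["recipe_id"], "ingredients"]

print(recipes_json[0]["ingredients"])  # ['tofu', 'wakame', 'miso']
```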


Step 4: Aggregating Nutrition Data

The nutrition data was similarly structured — multiple nutritional values per recipe that needed to be consolidated.

# Load nutrition data
nutrition = pd.read_csv("nutrition.csv", encoding="cp932")

# Group by recipe_id and aggregate nutritional values
nutrition_agg = nutrition.groupby("recipe_id").agg({
    "energy_kcal": "sum",
    "protein_g": "sum",
    "fat_g": "sum",
    "carbohydrate_g": "sum",
    "sodium_mg": "sum",
    "calcium_mg": "sum",
    "iron_mg": "sum",
    "vitamin_a_ug": "sum",
    "vitamin_b1_mg": "sum",
    "vitamin_b2_mg": "sum",
    "vitamin_c_mg": "sum",
    "dietary_fiber_g": "sum",
    "salt_equivalent_g": "sum"
}).reset_index()

print(f"Nutrition data aggregated for {len(nutrition_agg)} recipes")

Each recipe now had a single row with its complete nutritional profile — the foundation for the cosine similarity search I would build in Part 2.
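As a quick preview of what that means: cosine similarity is just the dot product of two vectors divided by the product of their norms. A toy sketch with made-up nutrition values:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two nutrient vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical (energy_kcal, protein_g, fat_g) vectors for two recipes.
recipe_a = [520.0, 21.5, 18.2]
recipe_b = [510.0, 22.0, 17.8]

print(round(cosine_similarity(recipe_a, recipe_b), 4))
```

Two recipes with nearly proportional nutrient profiles score close to 1.0 even if the dishes themselves are completely different — which is exactly the property Part 2 exploits.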


Step 5: The Final Merge

With all three datasets cleaned and aggregated, I merged them into a single master dataset:

# Merge recipe + ingredients
master = pd.merge(recipe, ingredient_counts, on="recipe_id", how="inner")

# Merge with nutrition
master = pd.merge(master, nutrition_agg, on="recipe_id", how="inner")

print(f"Final master dataset: {master.shape}")
# (19312, 22)

The inner join dropped recipes that lacked either ingredient or nutrition data — a reasonable trade-off for data quality. The final result: 19,312 recipes with complete information.
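The dropped recipes are worth auditing before committing to the inner join. One way to see exactly which rows fail to match — an illustrative sketch on toy frames, not my original code — is pandas' `indicator=True` option on an outer merge:

```python
import pandas as pd

# Toy frames: recipe 103 has no nutrition record.
recipe = pd.DataFrame({"recipe_id": [101, 102, 103],
                       "title": ["A", "B", "C"]})
nutrition_agg = pd.DataFrame({"recipe_id": [101, 102],
                              "energy_kcal": [350, 520]})

# indicator=True adds a _merge column: "both", "left_only", or "right_only".
audit = pd.merge(recipe, nutrition_agg, on="recipe_id",
                 how="outer", indicator=True)
dropped = audit[audit["_merge"] == "left_only"]

print(f"{len(dropped)} recipe(s) would be dropped by an inner join")
print(dropped["recipe_id"].tolist())  # [103]
```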

# Save the master dataset
master.to_csv("recipe_master.csv", index=False, encoding="utf-8")
print(f"Saved recipe_master.csv with {len(master)} recipes and {master.shape[1]} columns")

What This Dataset Enables

With recipe_master.csv, each recipe is now a rich data point containing:

  • Metadata: title, category, servings, instructions
  • Ingredients: full ingredient list with counts
  • Nutrition: 13+ nutritional dimensions (calories, protein, fat, vitamins, minerals, etc.)

This is the foundation for everything that follows in this series:

  • Part 2: Using nutritional vectors and cosine similarity to find “same nutrition, different meal” substitutions
  • Part 3: Training an LSTM to predict non-repetitive menus over time
  • Part 4: Leveraging ChatGPT to transform recipes into simpler, weeknight-friendly versions

The data pipeline may seem straightforward, but getting it right was crucial. Inconsistent IDs, missing values, and encoding issues in the raw data would have propagated errors through every downstream model. The unglamorous work of data cleansing is what makes everything else possible.


Looking Back: Data Quality as the Foundation

One lesson that has stayed with me across a decade of working on this project: no amount of algorithmic sophistication can compensate for poor data quality. The time I spent on cleansing and validation in this first phase paid dividends in every subsequent step.

In the next installment, I will show how I turned these 19,000+ nutritional vectors into a similarity search engine — finding recipes that are nutritionally equivalent but completely different meals.


Next: Part 2: Finding “Same Nutrition, Different Meal” with Cosine Similarity
