Coffee Shop LSTM Training
Training an LSTM neural network to predict daily revenue for a coffee shop using real-world data.
Training Code
The following Python code demonstrates the process of training an LSTM model on the coffee shop revenue dataset:
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, LSTM, Dense, Dropout, Bidirectional, LayerNormalization
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.losses import Huber
import tensorflow as tf
# Cyclic Learning Rate scheduler (triangular policy)
class CyclicLR(tf.keras.callbacks.Callback):
    def __init__(self, base_lr=0.0003, max_lr=0.006, step_size=4000., mode='triangular'):
        super(CyclicLR, self).__init__()
        self.base_lr = base_lr        # lower bound of the learning-rate cycle
        self.max_lr = max_lr          # upper bound of the learning-rate cycle
        self.step_size = step_size    # iterations per half-cycle
        self.mode = mode
        self.clr_iterations = 0.
        self.trn_iterations = 0.
        self.history = {}

    def clr(self):
        # Triangular schedule: ramp linearly from base_lr to max_lr and back
        # over 2 * step_size iterations.
        cycle = np.floor(1 + self.clr_iterations / (2 * self.step_size))
        x = np.abs(self.clr_iterations / self.step_size - 2 * cycle + 1)
        if self.mode == 'triangular':
            return self.base_lr + (self.max_lr - self.base_lr) * max(0, (1 - x))
        return self.base_lr

    def on_train_begin(self, logs=None):
        logs = logs or {}
        self.clr_iterations = 0
        self.model.optimizer.learning_rate.assign(self.base_lr)

    def on_batch_end(self, batch, logs=None):
        logs = logs or {}
        self.trn_iterations += 1
        self.clr_iterations += 1
        lr = self.clr()
        self.model.optimizer.learning_rate.assign(lr)
        self.history.setdefault('lr', []).append(lr)
        for k, v in logs.items():
            self.history.setdefault(k, []).append(v)
# Load the coffee shop data
data = pd.read_csv("coffee_shop_revenue.csv")
# Handle outliers in Daily_Revenue
revenue_lower, revenue_upper = np.percentile(data["Daily_Revenue"], [1, 99])
data["Daily_Revenue"] = np.clip(data["Daily_Revenue"], revenue_lower, revenue_upper)
# Add temporal features
dates = pd.date_range(start='2024-01-01', periods=len(data), freq='D')
data["Day_of_Week"] = dates.dayofweek
data["Month_of_Year"] = dates.month
data["Is_Holiday"] = ((data["Month_of_Year"] == 12) & (dates.day.isin([24, 25, 31])) |
(data["Month_of_Year"] == 1) & (dates.day == 1)).astype(int)
# Define features and target
features = ["Number_of_Customers_Per_Day", "Average_Order_Value", "Operating_Hours_Per_Day",
"Number_of_Employees", "Marketing_Spend_Per_Day", "Location_Foot_Traffic",
"Day_of_Week", "Month_of_Year", "Is_Holiday"]
target = "Daily_Revenue"
# Add interaction terms
data["Customer_Order_Interaction"] = data["Number_of_Customers_Per_Day"] * data["Average_Order_Value"]
data["Marketing_Foot_Traffic_Interaction"] = data["Marketing_Spend_Per_Day"] * data["Location_Foot_Traffic"]
features.extend(["Customer_Order_Interaction", "Marketing_Foot_Traffic_Interaction"])
# Add promotion indicator
marketing_threshold = np.percentile(data["Marketing_Spend_Per_Day"], 75)
data["High_Marketing_Spend"] = (data["Marketing_Spend_Per_Day"] > marketing_threshold).astype(int)
features.append("High_Marketing_Spend")
# Add lagged and moving average features
data["lagged_revenue_1"] = data["Daily_Revenue"].shift(1).fillna(data["Daily_Revenue"].mean())
data["lagged_revenue_7"] = data["Daily_Revenue"].shift(7).fillna(data["Daily_Revenue"].mean())
data["lagged_revenue_14"] = data["Daily_Revenue"].shift(14).fillna(data["Daily_Revenue"].mean())
data["lagged_revenue_30"] = data["Daily_Revenue"].shift(30).fillna(data["Daily_Revenue"].mean())
data["ma_revenue_7"] = data["Daily_Revenue"].rolling(window=7).mean().fillna(data["Daily_Revenue"].mean())
features.extend(["lagged_revenue_1", "lagged_revenue_7", "lagged_revenue_14", "lagged_revenue_30", "ma_revenue_7"])
# Initialize the scaler
scaler = MinMaxScaler()
scaled_data = scaler.fit_transform(data[features + [target]])
# Debug scaler shape
print(f"Scaled data shape: {scaled_data.shape}")
print(f"Scaler min_ shape: {scaler.min_.shape}")
print(f"Scaler scale_ shape: {scaler.scale_.shape}")
# Prepare training sequences
lookback = 30  # 30-day lookback window
n_features = len(features)  # 17 engineered feature columns
X, y = [], []
for i in range(lookback, len(scaled_data)):
    X.append(scaled_data[i - lookback:i, :n_features])  # all feature columns in the window
    y.append(scaled_data[i, n_features])  # target is the last column
X = np.array(X)
y = np.array(y)
# Debug shapes
print(f"X shape: {X.shape}")
print(f"y shape: {y.shape}")
# Build the LSTM model
model = Sequential()
model.add(Input(shape=(lookback, n_features)))  # 30 timesteps x 17 features
model.add(Bidirectional(LSTM(64, activation='tanh', return_sequences=True)))
model.add(LayerNormalization())
model.add(Dropout(0.1))
model.add(LSTM(64, activation='tanh', return_sequences=False))
model.add(LayerNormalization())
model.add(Dropout(0.1))
model.add(Dense(32, activation='relu'))
model.add(Dense(1))
# Compile with Huber loss and gradient clipping
optimizer = Adam(learning_rate=0.01, clipvalue=0.5)  # initial rate; the CyclicLR callback takes over at train start
model.compile(optimizer=optimizer, loss=Huber())
# Callbacks
clr = CyclicLR(base_lr=0.0003, max_lr=0.006, step_size=4000., mode='triangular')
early_stopping = EarlyStopping(monitor='val_loss', patience=20, restore_best_weights=True)
# Train the model with validation split
model.fit(X, y, epochs=150, batch_size=64, verbose=1, callbacks=[clr, early_stopping], validation_split=0.2)
# Save the model weights
model.save_weights("lstm_weights_coffee.weights.h5")
print("Training completed and weights saved to lstm_weights_coffee.weights.h5.")
Code Explanation
Data Preprocessing
For the coffee shop dataset, we implement domain-specific preprocessing:
- Temporal Features: Deriving day-of-week and month-of-year indicators from a daily date index (a synthetic range starting 2024-01-01); quarter and weekend flags are natural extensions (a sketch follows the holiday snippet below).
- Holiday Features: Flagging key dates (Christmas Eve/Day, New Year's Eve, New Year's Day) in the training script; the snippet below shows how a full US holiday calendar, including day-before and day-after indicators, can be added.
- Special Events: Using a High_Marketing_Spend flag (spend above the 75th percentile) as a proxy for promotions or local events that might affect coffee shop revenue.
- Outlier Handling: Clipping Daily_Revenue to its 1st-99th percentile range to reduce the impact of unusual days.
- Log Transformation: Optionally log-transforming revenue to stabilize variance (not applied in the script above; a sketch appears under Time Series Features below).
# Add richer holiday features with the `holidays` package (pip install holidays)
import holidays
us_holidays = holidays.US()
data["is_holiday"] = pd.Series(dates).apply(lambda d: d in us_holidays).astype(int).values
data["is_day_before_holiday"] = pd.Series(dates).apply(lambda d: (d + pd.Timedelta(days=1)) in us_holidays).astype(int).values
data["is_day_after_holiday"] = pd.Series(dates).apply(lambda d: (d - pd.Timedelta(days=1)) in us_holidays).astype(int).values
# (add these columns to `features` before scaling if they are used)
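The quarter and weekend flags mentioned above are just as easy to derive from the same dates index. This is a minimal sketch rather than part of the training script, and the column names are illustrative:
# Optional extensions: weekend and quarter indicators (illustrative column names)
data["Is_Weekend"] = (dates.dayofweek >= 5).astype(int)  # Saturday=5, Sunday=6
data["Quarter"] = dates.quarter
# If used, remember to add these columns to `features` before scaling.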
Feature Engineering
We create several types of features to help the model capture different aspects of the time series:
- Lagged Features: Revenue from 1, 7, 14, and 30 days ago to capture short- and medium-term dependencies.
- Rolling Window Features: A 7-day moving average to capture the recent trend; rolling standard deviations over longer windows are a natural extension for capturing volatility (a sketch follows this list).
- Interaction Features: Customers × average order value and marketing spend × foot traffic, which approximate gross sales and promotion reach.
- Calendar Features: Day-of-week and month indicators to capture cyclical patterns.
- Event Features: Holiday and high-marketing-spend indicators to capture irregular but predictable effects.
In total, we create 17 features to help the model understand the various factors affecting coffee shop revenue.
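A sketch of the rolling-volatility extension mentioned above; the std_revenue_* column names are illustrative and this is not part of the training script:
# Optional rolling-volatility features (add before scaling if used)
for window in (7, 14, 30):
    col = f"std_revenue_{window}"
    data[col] = data["Daily_Revenue"].rolling(window=window).std().fillna(data["Daily_Revenue"].std())
    features.append(col)  # n_features and the scaler input then grow accordingly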
Model Architecture
For the coffee shop dataset, we use a specialized LSTM architecture:
- Bidirectional LSTM: A first layer with 64 units that processes the sequence in both directions.
- Layer Normalization: After each LSTM layer to stabilize training.
- Dropout Layer: 10% dropout after each LSTM layer to prevent overfitting.
- Second LSTM Layer: Another LSTM layer with 64 units to further process the features.
- Dense Layers: A 32-unit ReLU layer followed by a single output neuron.
# Build the LSTM model
model = Sequential()
model.add(Input(shape=(lookback, n_features)))  # 30 timesteps x 17 features
model.add(Bidirectional(LSTM(64, activation='tanh', return_sequences=True)))
model.add(LayerNormalization())
model.add(Dropout(0.1))
model.add(LSTM(64, activation='tanh', return_sequences=False))
model.add(LayerNormalization())
model.add(Dropout(0.1))
model.add(Dense(32, activation='relu'))
model.add(Dense(1))
# Compile with Huber loss and gradient clipping
optimizer = Adam(learning_rate=0.01, clipvalue=0.5)  # initial rate; the CyclicLR callback takes over at train start
model.compile(optimizer=optimizer, loss=Huber())
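Because the Input layer pins down the tensor shapes, the stack can be inspected right after compiling, which is a quick way to confirm the layer sizes described above:
# Print layer output shapes and parameter counts
model.summary()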
Training Approach
We use a specialized training approach for the coffee shop data:
- Huber Loss: Using a loss function that's less sensitive to outliers than mean squared error.
- Gradient Clipping: Limiting gradient values to 0.5 to prevent exploding gradients.
- Cyclic Learning Rate: Using a custom scheduler that varies the learning rate between 0.0003 and 0.006.
- Early Stopping: Halting training when validation loss stops improving to prevent overfitting.
- Larger Batch Size: Using a batch size of 64 (compared to 32 for the synthetic dataset) to improve training stability.
- 30-day Lookback: Using 30 days of history to capture longer-term patterns in the coffee shop data.
# Callbacks
clr = CyclicLR(base_lr=0.0003, max_lr=0.006, step_size=4000., mode='triangular')
early_stopping = EarlyStopping(monitor='val_loss', patience=20, restore_best_weights=True)
# Train the model with validation split
model.fit(X, y, epochs=150, batch_size=64, verbose=1, callbacks=[clr, early_stopping], validation_split=0.2)
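The CyclicLR callback stores every learning rate it applies in clr.history['lr'], so the triangular schedule can be verified after training; a minimal sketch:
# Inspect the learning-rate schedule recorded by the callback
lrs = clr.history.get('lr', [])
print(f"Batches seen: {len(lrs)}")
print("First 5 learning rates:", [round(float(v), 6) for v in lrs[:5]])
print("Last 5 learning rates:", [round(float(v), 6) for v in lrs[-5:]])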
Key Training Features
Model Architecture
- Bidirectional LSTM
- Layer normalization after each LSTM layer
- 10% dropout for regularization
- Second LSTM layer with 64 units
- Huber loss for robustness to outliers
Training Approach
- Cyclic learning rate (0.0003-0.006)
- Gradient clipping at 0.5
- Early stopping with patience of 20 epochs
- Batch size of 64
- 30-day sequence length
Domain-Specific Features
- Holiday indicators (US calendar)
- Pre/post-holiday indicators
- Special event markers (a sketch follows this list)
- Weekend vs. weekday indicators
- Seasonal and monthly features
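Special event markers could also be encoded from a hand-maintained list of dates; the dates below are placeholders for illustration only, not real events:
# Hypothetical local-event dates (placeholders, not real data)
special_event_dates = pd.to_datetime(["2024-03-16", "2024-07-04"])
data["Is_Special_Event"] = pd.Series(dates).isin(special_event_dates).astype(int).values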
Time Series Features
- Lagged revenue (1, 7, 14, 30 days)
- 7-day rolling average
- Outlier clipping (1st-99th percentiles)
- Optional extensions: rolling standard deviations and log-transformed revenue (sketch below)
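If the log-transform extension is used, the target is transformed before scaling and predictions are mapped back with the inverse; a minimal sketch, not part of the training script:
# Optional: stabilize variance with log1p and invert with expm1
data["Log_Revenue"] = np.log1p(data["Daily_Revenue"])
# expm1 recovers the original values (up to floating-point error)
print(np.allclose(np.expm1(data["Log_Revenue"]), data["Daily_Revenue"]))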
Differences from Synthetic Dataset Approach
The training approach for the coffee shop dataset differs from the synthetic dataset in several key ways:
- Domain-Specific Features: We incorporate more business-specific features like holidays and special events, which weren't present in the synthetic data.
- Simpler Model Architecture: We use a simpler model without self-attention, as real-world data often benefits from less complex models that are less prone to overfitting.
- Longer Sequence Length: We use a 30-day lookback to capture longer-term patterns, such as monthly cycles, in the coffee shop data.
- Robust Optimization: We pair Adam with gradient clipping, Huber loss, and a triangular cyclic learning-rate schedule to cope with the noisier real-world signal.
- No Data Augmentation: We don't augment the training data with noise, as this could potentially distort the natural patterns in the real-world data.
These differences reflect the adaptations necessary when moving from a controlled synthetic environment to real-world data with its inherent complexity and unpredictability.