Synthetic Data Generation

Creating a controlled synthetic dataset with known patterns for LSTM training and evaluation.

Data Generation Code

The following Python script generates a synthetic daily cash flow dataset of 1,095 records (about three years, from January 1, 2022 to December 30, 2024) with several controlled patterns:

generate_synthetic_data.py
import numpy as np
import pandas as pd
from datetime import datetime, timedelta

# Set random seed for reproducibility
np.random.seed(42)

# Parameters
n_days = 1095  # 3 years
start_date = datetime(2022, 1, 1)
dates = [start_date + timedelta(days=i) for i in range(n_days)]

# Initialize arrays
income = np.zeros(n_days)
expenses = np.zeros(n_days)
net_cash_flow = np.zeros(n_days)
categories = []
seasonal_factors = np.zeros(n_days)

# Generate synthetic data
for i in range(n_days):
    date = dates[i]
    # Weekly seasonality: higher income on Fridays (4) and Saturdays (5)
    weekday = date.weekday()
    weekly_factor = 1.5 if weekday in [4, 5] else 1.0
    # Yearly seasonality: holiday spike in December
    monthly_factor = 2.0 if date.month == 12 else 1.0
    seasonal_factors[i] = weekly_factor * monthly_factor
    
    # Base income with trend and noise
    base_income = 1000 + i * 0.5  # Linear trend
    income[i] = base_income * seasonal_factors[i] * np.random.normal(1.0, 0.1)
    
    # Expenses: fixed (e.g., rent) + variable
    fixed_expense = 300 if date.day == 1 else 0  # Monthly rent
    variable_expense = 500 * np.random.normal(1.0, 0.15)
    expenses[i] = fixed_expense + variable_expense
    
    # Net cash flow
    net_cash_flow[i] = income[i] - expenses[i]
    
    # Categories (simplified): income is effectively always positive here, so each row is labelled "Sales"
    categories.append("Sales" if income[i] > 0 else "Expense")

# Create DataFrame
data = pd.DataFrame({
    "date": dates,
    "income": income,
    "expenses": expenses,
    "net_cash_flow": net_cash_flow,
    "category": categories,
    "seasonal_factor": seasonal_factors
})

# Save to CSV
data.to_csv("synthetic_cashflow_data.csv", index=False)
print("Synthetic data generated and saved to synthetic_cashflow_data.csv")

Code Explanation

The snippets in this section operate on a DataFrame df indexed by date and differ in some details from the script above: weekends are Saturday and Sunday rather than Friday and Saturday, the December factor is 1.4 rather than 2.0, and expenses follow their own trend instead of a flat $500 plus monthly rent. Key Dataset Features summarizes the values used in these snippets.

Date Generation

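We create a daily date range of 1,095 records starting on January 1, 2022. Because 2024 is a leap year, 1,095 days end on December 30, 2024; a range through December 31, 2024 would contain 1,096 records.

start_date = datetime(2022, 1, 1)
end_date = datetime(2024, 12, 30)  # 1,095 days inclusive; Dec 31 would be day 1,096
date_range = pd.date_range(start=start_date, end=end_date, freq='D')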
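
The record count can be checked directly. The sketch below compares a range bounded by an end date with one bounded by a record count; the variable names through_dec_31 and fixed_count are illustrative and do not appear in the original script.

```python
import pandas as pd

start = pd.Timestamp("2022-01-01")

# Range bounded by an explicit end date (2024 is a leap year)
through_dec_31 = pd.date_range(start=start, end="2024-12-31", freq="D")

# Range bounded by a record count, matching n_days = 1095 in the script
fixed_count = pd.date_range(start=start, periods=1095, freq="D")

print(len(through_dec_31))     # 1096
print(fixed_count[-1].date())  # 2024-12-30
```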

Linear Income Trend

We establish a base income that starts at $1,000 and increases by $0.50 each day, a linear trend that reaches $1,547 on the last of the 1,095 days. In the actual implementation, we also handle outliers by clipping values to the 1st and 99th percentiles and apply a log transformation to stabilize variance.

# df is indexed by date_range, so the subtraction yields whole days since start_date
days = (df.index - start_date).days.values
base_income = 1000 + days * 0.5  # Starting at $1000 with $0.50 increase per day
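
To make the trend concrete, here is a minimal self-contained sketch. It assumes df is an otherwise empty DataFrame indexed by 1,095 consecutive days, which the snippet above implies but does not show.

```python
import numpy as np
import pandas as pd

start_date = pd.Timestamp("2022-01-01")
index = pd.date_range(start=start_date, periods=1095, freq="D")
df = pd.DataFrame(index=index)  # assumed layout: one row per day

# Days elapsed since the first record: 0, 1, ..., 1094
days = (df.index - start_date).days.values
base_income = 1000 + days * 0.5

print(base_income[0])   # 1000.0
print(base_income[-1])  # 1547.0
```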

Weekly Seasonality

We add a weekly pattern in which weekend days (Saturday and Sunday) have 50% higher income than weekdays.

# pandas numbers Monday as 0, so 5 and 6 are Saturday and Sunday
weekly_factor = np.where(df['day_of_week'].isin([5, 6]), 1.5, 1.0)
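
The snippet relies on a day_of_week column that is not created in the code shown. A minimal sketch, assuming the column is derived from the date index, for one week running Monday to Sunday:

```python
import numpy as np
import pandas as pd

index = pd.date_range("2022-01-03", periods=7, freq="D")  # Monday to Sunday
df = pd.DataFrame(index=index)
df["day_of_week"] = df.index.dayofweek  # assumed derivation; Monday=0, Sunday=6

weekly_factor = np.where(df["day_of_week"].isin([5, 6]), 1.5, 1.0)
print(weekly_factor.tolist())  # [1.0, 1.0, 1.0, 1.0, 1.0, 1.5, 1.5]
```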

Monthly Seasonality

We create a monthly pattern in which income peaks at 1.3x on the 1st of each month and decays exponentially toward 1.0x over the following days.

monthly_factor = 1.0 + 0.3 * np.exp(-0.3 * (df.index.day - 1))
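
Evaluating the formula on a few days of the month shows how quickly the boost fades: roughly 1.22x on the 2nd, about 1.02x by the 10th, and essentially 1.0x at month end. The array day_of_month is an illustrative stand-in for df.index.day.

```python
import numpy as np

day_of_month = np.array([1, 2, 5, 10, 31])  # illustrative sample of days
monthly_factor = 1.0 + 0.3 * np.exp(-0.3 * (day_of_month - 1))

for day, factor in zip(day_of_month, monthly_factor):
    print(int(day), round(float(factor), 3))
```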

Quarterly Seasonality

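We add quarterly spikes at the end of each quarter (from the 28th onward in March, June, September, and December) to simulate end-of-quarter business activity.

quarterly_months = [3, 6, 9, 12]
quarterly_factor = np.ones(len(df))
for month in quarterly_months:
    # Use element-wise &, since Python's `and` raises a ValueError on boolean arrays
    is_end_of_quarter = (df.index.month == month) & (df.index.day >= 28)
    quarterly_factor[is_end_of_quarter] = 1.3  # 1.3x end-of-quarter factor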
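
The quarter-end mask can also be built without a loop. This is a sketch rather than the original implementation; it assumes the spike is the 1.3x factor listed under Key Dataset Features and that it applies from the 28th onward.

```python
import numpy as np
import pandas as pd

index = pd.date_range("2022-01-01", periods=1095, freq="D")
df = pd.DataFrame(index=index)

# Boolean arrays are combined with element-wise &, not Python's `and`
is_end_of_quarter = df.index.month.isin([3, 6, 9, 12]) & (df.index.day >= 28)
quarterly_factor = np.where(is_end_of_quarter, 1.3, 1.0)  # 1.3x is taken from the feature list

print(int(is_end_of_quarter.sum()))  # number of end-of-quarter days in the dataset
```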

Annual Seasonality

We incorporate an annual pattern with higher income during the holiday season in December.

annual_factor = np.where(df.index.month == 12, 1.4, 1.0)

Random Noise

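We multiply income and expenses by random noise to simulate real-world variability. For income, the noise is drawn from a normal distribution with mean 1 and standard deviation 0.1.

noise = np.random.normal(1, 0.1, len(df))  # 10% random variation
# seasonal_factor is the combined multiplier built from the seasonal factors above
df['income'] = base_income * seasonal_factor * noise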
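
The snippets do not show how seasonal_factor is assembled. One plausible reading, sketched below, is that the four factors are multiplied together; that multiplicative combination, the 1.3x quarter-end value, and the seeded generator rng are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # hypothetical seeded generator
start_date = pd.Timestamp("2022-01-01")
df = pd.DataFrame(index=pd.date_range(start_date, periods=1095, freq="D"))

days = (df.index - start_date).days.values
dow = df.index.dayofweek.to_numpy()
dom = df.index.day.to_numpy()
month = df.index.month.to_numpy()

base_income = 1000 + days * 0.5
weekly = np.where(np.isin(dow, [5, 6]), 1.5, 1.0)
monthly = 1.0 + 0.3 * np.exp(-0.3 * (dom - 1))
quarterly = np.where(np.isin(month, [3, 6, 9, 12]) & (dom >= 28), 1.3, 1.0)
annual = np.where(month == 12, 1.4, 1.0)

# Assumption: the individual factors combine multiplicatively
seasonal_factor = weekly * monthly * quarterly * annual

noise = rng.normal(1, 0.1, len(df))  # 10% random variation
df["income"] = base_income * seasonal_factor * noise
```

Under this reading the factors compound, so a weekend day late in December carries the weekend, quarter-end, and December multipliers at once.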

Expenses Generation

We generate expenses with their own patterns: a linear trend starting at $800 and rising by $0.30 per day, a sinusoidal variation of plus or minus 20% with a 60-day period, and 15% random noise. In the snippet below, expenses relate to income only through the shared upward trend.

base_expenses = 800 + days * 0.3  # Starting at $800 with $0.30 increase per day
expense_noise = np.random.normal(1, 0.15, len(df))  # 15% random variation
df['expenses'] = base_expenses * (1 + 0.2 * np.sin(days / 30 * np.pi)) * expense_noise
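
The sinusoidal term completes one cycle every 60 days, because days / 30 * np.pi advances by 2π over that span. The sketch below isolates it; the name cycle and the seeded generator are illustrative additions.

```python
import numpy as np

days = np.arange(1095)
base_expenses = 800 + days * 0.3
cycle = 1 + 0.2 * np.sin(days / 30 * np.pi)  # 60-day period, range 0.8 to 1.2

rng = np.random.default_rng(0)  # hypothetical seed, for reproducibility only
expense_noise = rng.normal(1, 0.15, len(days))
expenses = base_expenses * cycle * expense_noise

print(round(float(cycle[15]), 3))  # 1.2 (peak, 15 days into the cycle)
print(round(float(cycle[45]), 3))  # 0.8 (trough, 45 days into the cycle)
```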

Net Cash Flow

Finally, we calculate the net cash flow as the difference between income and expenses.

df['net_cash_flow'] = df['income'] - df['expenses']

Key Dataset Features

  • Size: 1,095 daily records (3 years)
  • Features: Date, day of week, income, expenses, net cash flow
  • Patterns: Linear trends, weekly seasonality, monthly seasonality, quarterly seasonality, annual seasonality, random noise
  • Starting values: $1,000 daily income, $800 daily expenses
  • Growth rates: $0.50/day for income, $0.30/day for expenses
  • Seasonality factors: Weekend (1.5x), beginning of month (up to 1.3x), end of quarter (1.3x), December (1.4x)
  • Noise levels: 10% for income, 15% for expenses

Implementation Details

In the actual implementation, we perform additional preprocessing steps:

  • Outlier Handling: Clipping values to the 1st and 99th percentiles to reduce the impact of extreme values
  • Log Transformation: Applying log1p transformation to net cash flow to stabilize variance
  • Interaction Terms: Creating interaction features like income × expenses to capture non-linear relationships
  • Binary Indicators: Adding flags for high expense days (above 75th percentile)
  • Temporal Features: Extracting day of week and month from dates for better seasonality modeling
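
The steps above could be sketched as follows. This is not the original preprocessing code: the function name preprocess, the derived column names, and the choice of columns to clip are hypothetical. The signed log1p is also an assumption, made because plain log1p is undefined for values at or below -1 and net cash flow can be negative.

```python
import numpy as np
import pandas as pd

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Illustrative preprocessing; input columns follow the generated CSV."""
    out = df.copy()
    out["date"] = pd.to_datetime(out["date"])

    # Outlier handling: clip to the 1st and 99th percentiles
    for col in ["income", "expenses", "net_cash_flow"]:
        low, high = out[col].quantile([0.01, 0.99])
        out[col] = out[col].clip(lower=low, upper=high)

    # Log transformation (assumed signed variant so negative values stay defined)
    ncf = out["net_cash_flow"]
    out["net_cash_flow_log"] = np.sign(ncf) * np.log1p(np.abs(ncf))

    # Interaction term and binary indicator for high expense days
    out["income_x_expenses"] = out["income"] * out["expenses"]
    threshold = out["expenses"].quantile(0.75)
    out["high_expense"] = (out["expenses"] > threshold).astype(int)

    # Temporal features
    out["day_of_week"] = out["date"].dt.dayofweek
    out["month"] = out["date"].dt.month
    return out

# Usage on a small made-up frame that includes negative net cash flow
demo = pd.DataFrame({
    "date": pd.date_range("2022-01-01", periods=200, freq="D"),
    "income": np.linspace(900, 1500, 200),
    "expenses": np.linspace(1200, 700, 200),
})
demo["net_cash_flow"] = demo["income"] - demo["expenses"]
features = preprocess(demo)
```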