Synthetic Data Generation
Creating a controlled synthetic dataset with known patterns for LSTM training and evaluation.
Data Generation Code
The following Python code generates a synthetic cash flow dataset spanning 3 years (2022-2024) with various controlled patterns:
import numpy as np
import pandas as pd
from datetime import datetime, timedelta
# Set random seed for reproducibility
np.random.seed(42)
# Parameters
n_days = 1095 # 3 x 365 days; the last date is 2024-12-30 because 2024 is a leap year
start_date = datetime(2022, 1, 1)
dates = [start_date + timedelta(days=i) for i in range(n_days)]
# Initialize arrays
income = np.zeros(n_days)
expenses = np.zeros(n_days)
net_cash_flow = np.zeros(n_days)
categories = []
seasonal_factors = np.zeros(n_days)
# Generate synthetic data
for i in range(n_days):
    date = dates[i]
    # Weekly seasonality: higher income on Fridays (4) and Saturdays (5)
    weekday = date.weekday()
    weekly_factor = 1.5 if weekday in [4, 5] else 1.0
    # Yearly seasonality: holiday spike in December
    monthly_factor = 2.0 if date.month == 12 else 1.0
    seasonal_factors[i] = weekly_factor * monthly_factor
    # Base income with trend and noise
    base_income = 1000 + i * 0.5 # Linear trend
    income[i] = base_income * seasonal_factors[i] * np.random.normal(1.0, 0.1)
    # Expenses: fixed (e.g., rent) + variable
    fixed_expense = 300 if date.day == 1 else 0 # Monthly rent
    variable_expense = 500 * np.random.normal(1.0, 0.15)
    expenses[i] = fixed_expense + variable_expense
    # Net cash flow
    net_cash_flow[i] = income[i] - expenses[i]
    # Categories (simplified)
    categories.append("Sales" if income[i] > 0 else "Expense")
# Create DataFrame
data = pd.DataFrame({
"date": dates,
"income": income,
"expenses": expenses,
"net_cash_flow": net_cash_flow,
"category": categories,
"seasonal_factor": seasonal_factors
})
# Save to CSV
data.to_csv("synthetic_cashflow_data.csv", index=False)
print("Synthetic data generated and saved to synthetic_cashflow_data.csv")
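Because the patterns are built in deliberately, one way to check a generated file is to compare group means against the factors used in the listing. The sketch below assumes a frame with the `date` and `income` columns written above and the same trend (1000 + 0.5 per day); the helper name `pattern_ratios` is hypothetical, and the demonstration frame is a noise-free stand-in rather than the saved CSV.

```python
import numpy as np
import pandas as pd


def pattern_ratios(data: pd.DataFrame) -> dict:
    """Recover the weekly and December factors from mean income.

    Hypothetical helper: expects 'date' and 'income' columns.
    """
    dates = pd.to_datetime(data["date"])
    day_index = (dates - dates.min()).dt.days
    # Divide out the known linear trend so only seasonal factors and noise remain
    detrended = data["income"] / (1000 + 0.5 * day_index)
    fri_sat = dates.dt.weekday.isin([4, 5])
    december = dates.dt.month == 12
    # Compare within non-December days so the two factors do not mix
    weekly = detrended[fri_sat & ~december].mean() / detrended[~fri_sat & ~december].mean()
    # Compare within the remaining weekdays for the same reason
    yearly = detrended[december & ~fri_sat].mean() / detrended[~december & ~fri_sat].mean()
    return {"weekly": weekly, "december": yearly}


# Noise-free stand-in with the same trend and factors as the listing
d = pd.date_range("2022-01-01", periods=1095, freq="D")
factor = np.where(d.weekday.isin([4, 5]), 1.5, 1.0) * np.where(d.month == 12, 2.0, 1.0)
clean = pd.DataFrame({"date": d, "income": (1000 + 0.5 * np.arange(1095)) * factor})
ratios = pattern_ratios(clean)
```

On the noisy data from the listing, the two ratios should land near 1.5 and 2.0 rather than matching them exactly.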
Code Explanation
Date Generation
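We create a daily date range from January 1, 2022, to December 31, 2024. Because 2024 is a leap year, the inclusive range holds 1,096 records (365 + 365 + 366); the listing above uses n_days = 1095 and therefore stops at December 30, 2024.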
start_date = datetime(2022, 1, 1)
end_date = datetime(2024, 12, 31)
date_range = pd.date_range(start=start_date, end=end_date, freq='D')
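The length of the inclusive range can be confirmed directly. This is a small check using the same `start_date` and `end_date`; the names `n_records` and `has_leap_day` are introduced here only for illustration.

```python
from datetime import datetime

import pandas as pd

start_date = datetime(2022, 1, 1)
end_date = datetime(2024, 12, 31)
date_range = pd.date_range(start=start_date, end=end_date, freq="D")

# 365 days (2022) + 365 days (2023) + 366 days (2024, a leap year)
n_records = len(date_range)
# The leap day is part of the range
has_leap_day = pd.Timestamp("2024-02-29") in date_range
```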
Linear Income Trend
We establish a base income that starts at $1,000 and increases by $0.50 each day, creating a linear growth trend over time. In the actual implementation, we also clip outliers to the 1st and 99th percentiles and apply a log transformation to stabilize variance.
df = pd.DataFrame(index=date_range) # Assumed setup: an empty frame indexed by date
days = (df.index - start_date).days.values
base_income = 1000 + days * 0.5 # Starting at $1000 with $0.50 increase per day
Weekly Seasonality
We add a weekly pattern where weekends (Saturday and Sunday) have 50% higher income compared to weekdays.
df['day_of_week'] = df.index.dayofweek # Monday=0 ... Sunday=6
weekly_factor = np.where(df['day_of_week'].isin([5, 6]), 1.5, 1.0)
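pandas numbers weekdays from Monday (0) to Sunday (6), so `[5, 6]` selects Saturday and Sunday. A one-week sketch makes the mapping visible; the `idx` and `boosted_days` names are illustrative only.

```python
import numpy as np
import pandas as pd

# One week starting on Saturday, January 1, 2022
idx = pd.date_range("2022-01-01", periods=7, freq="D")
weekly_factor = np.where(idx.dayofweek.isin([5, 6]), 1.5, 1.0)

# Names of the days that receive the 1.5x factor
boosted_days = list(idx.day_name()[weekly_factor == 1.5])
```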
Monthly Seasonality
We create a monthly pattern where income is 30% higher on the first day of each month and decays exponentially toward the baseline over the following days.
monthly_factor = 1.0 + 0.3 * np.exp(-0.3 * (df.index.day - 1))
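To see how quickly the start-of-month boost fades, the factor can be evaluated for each day of a 31-day month. This is a sketch of the same formula on a plain array; `day_of_month` and `first_below_1pct` are names introduced here.

```python
import numpy as np

day_of_month = np.arange(1, 32)
monthly_factor = 1.0 + 0.3 * np.exp(-0.3 * (day_of_month - 1))

# First day on which the boost over the baseline drops below 1%
first_below_1pct = int(day_of_month[monthly_factor < 1.01][0])
```

The factor is 1.3 on day 1, the boost halves roughly every 2.3 days (ln 2 / 0.3), and it falls below 1% on day 13.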
Quarterly Seasonality
We add quarterly spikes at the end of each quarter (March, June, September, December) to simulate end-of-quarter business activities.
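quarterly_months = [3, 6, 9, 12]
quarterly_factor = np.ones(len(df))
for month in quarterly_months:
    # Element-wise & is required here; `and` raises ValueError on arrays
    is_end_of_quarter = (df.index.month == month) & (df.index.day >= 28)
    # 1.3x end-of-quarter multiplier, as listed under Key Dataset Features
    quarterly_factor[is_end_of_quarter] = 1.3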
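The same end-of-quarter mask can be built without a loop. The sketch below is one possible vectorized form, not necessarily the original implementation; it assumes the 1.3x multiplier listed under Key Dataset Features and uses a standalone index named `idx`.

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2022-01-01", "2024-12-31", freq="D")

# True on day 28 or later of March, June, September, and December
is_end_of_quarter = idx.month.isin([3, 6, 9, 12]) & (idx.day >= 28)
quarterly_factor = np.where(is_end_of_quarter, 1.3, 1.0)
```

This flags 14 days per year: four each in March and December, three each in June and September.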
Annual Seasonality
We incorporate an annual pattern with higher income during the holiday season in December.
annual_factor = np.where(df.index.month == 12, 1.4, 1.0)
Random Noise
We add random noise to both income and expenses to simulate real-world variability and unpredictability.
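# Assumption: the factors above are combined by multiplication, as in the main listing
seasonal_factor = weekly_factor * monthly_factor * quarterly_factor * annual_factor
noise = np.random.normal(1, 0.1, len(df)) # 10% random variation
df['income'] = base_income * seasonal_factor * noise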
Expenses Generation
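We generate expenses with their own patterns: a linear trend, a sinusoidal variation of ±20% that repeats every 60 days, and 15% multiplicative noise. Expenses are not derived from income, so the two series are related only through their shared upward trend.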
base_expenses = 800 + days * 0.3 # Starting at $800 with $0.30 increase per day
expense_noise = np.random.normal(1, 0.15, len(df)) # 15% random variation
df['expenses'] = base_expenses * (1 + 0.2 * np.sin(days / 30 * np.pi)) * expense_noise
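The sinusoidal term `np.sin(days / 30 * np.pi)` completes a full cycle every 60 days, which the following sketch verifies on the noise-free part of the formula. The names `cycle`, `period_days`, and `expected_expenses` are introduced here for illustration.

```python
import numpy as np

days = np.arange(1096)
base_expenses = 800 + days * 0.3
cycle = 1 + 0.2 * np.sin(days / 30 * np.pi)

# The argument advances by 2*pi when days advances by 60
period_days = 60
repeats = bool(np.allclose(cycle[:-period_days], cycle[period_days:]))

# Expected expenses before noise is applied
expected_expenses = base_expenses * cycle
```

Within each cycle the multiplier peaks at 1.2 on day 15 and bottoms out at 0.8 on day 45.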
Net Cash Flow
Finally, we calculate the net cash flow as the difference between income and expenses.
df['net_cash_flow'] = df['income'] - df['expenses']
Key Dataset Features
- Size: 1,095 daily records in the listing above (1,096 when the range runs through December 31, 2024)
- Features: Date, day of week, income, expenses, net cash flow
- Patterns: Linear trends, weekly seasonality, monthly seasonality, quarterly seasonality, annual seasonality, random noise
- Starting values: $1,000 daily income, $800 daily expenses
- Growth rates: $0.50/day for income, $0.30/day for expenses
- Seasonality factors: Weekend (1.5x), beginning of month (up to 1.3x), end of quarter (1.3x), December (1.4x)
- Noise levels: 10% for income, 15% for expenses
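Putting the listed parameters together, a vectorized generator might look like the sketch below. It is not the original implementation: the function name `generate_cashflow` is hypothetical, the seasonal factors are assumed to combine by multiplication, and it uses `np.random.default_rng` instead of `np.random.seed`, so its values differ from those produced by the listing.

```python
import numpy as np
import pandas as pd


def generate_cashflow(start="2022-01-01", end="2024-12-31", seed=42):
    """Sketch of a generator using the factors listed above (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    idx = pd.date_range(start, end, freq="D")
    days = np.arange(len(idx))

    # Seasonal factors (assumed to combine multiplicatively)
    weekly = np.where(idx.dayofweek.isin([5, 6]), 1.5, 1.0)
    monthly = 1.0 + 0.3 * np.exp(-0.3 * (idx.day.to_numpy() - 1))
    quarterly = np.where(idx.month.isin([3, 6, 9, 12]) & (idx.day >= 28), 1.3, 1.0)
    annual = np.where(idx.month == 12, 1.4, 1.0)
    seasonal = weekly * monthly * quarterly * annual

    # Linear trends with multiplicative noise (10% income, 15% expenses)
    income = (1000 + 0.5 * days) * seasonal * rng.normal(1.0, 0.10, len(idx))
    expenses = (
        (800 + 0.3 * days)
        * (1 + 0.2 * np.sin(days / 30 * np.pi))
        * rng.normal(1.0, 0.15, len(idx))
    )

    df = pd.DataFrame(
        {
            "day_of_week": idx.dayofweek.to_numpy(),
            "income": income,
            "expenses": expenses,
        },
        index=idx,
    )
    df.index.name = "date"
    df["net_cash_flow"] = df["income"] - df["expenses"]
    return df


df = generate_cashflow()
```

Passing the same `seed` reproduces the same frame, which keeps training and evaluation runs comparable.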
Implementation Details
In the actual implementation, we perform additional preprocessing steps:
- Outlier Handling: Clipping values to the 1st and 99th percentiles to reduce the impact of extreme values
- Log Transformation: Applying log1p transformation to net cash flow to stabilize variance
- Interaction Terms: Creating interaction features like income × expenses to capture non-linear relationships
- Binary Indicators: Adding flags for high expense days (above 75th percentile)
- Temporal Features: Extracting day of week and month from dates for better seasonality modeling
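The preprocessing steps above could be sketched as follows. Several details are assumptions rather than the original implementation: the function name `preprocess`, the choice of which columns are clipped, the new column names, and the signed form of `log1p` (plain `log1p` is undefined for values at or below -1, which negative net cash flow can reach).

```python
import numpy as np
import pandas as pd


def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical preprocessing sketch; expects a DatetimeIndex and the
    columns 'income', 'expenses', and 'net_cash_flow'."""
    out = df.copy()

    # Outlier handling: clip to the 1st and 99th percentiles (column choice assumed)
    for col in ["income", "expenses", "net_cash_flow"]:
        lo, hi = out[col].quantile([0.01, 0.99])
        out[col] = out[col].clip(lower=lo, upper=hi)

    # Log transformation: signed log1p so negative cash flow stays defined (assumption)
    ncf = out["net_cash_flow"]
    out["net_cash_flow_log"] = np.sign(ncf) * np.log1p(ncf.abs())

    # Interaction term
    out["income_x_expenses"] = out["income"] * out["expenses"]

    # Binary indicator for high-expense days (above the 75th percentile)
    out["high_expense"] = (out["expenses"] > out["expenses"].quantile(0.75)).astype(int)

    # Temporal features
    out["day_of_week"] = out.index.dayofweek
    out["month"] = out.index.month
    return out


# Small demonstration frame with random values (not the generated dataset)
rng = np.random.default_rng(0)
demo_index = pd.date_range("2022-01-01", periods=200, freq="D")
demo = pd.DataFrame(
    {"income": rng.normal(1000, 100, 200), "expenses": rng.normal(900, 150, 200)},
    index=demo_index,
)
demo["net_cash_flow"] = demo["income"] - demo["expenses"]
processed = preprocess(demo)
```

If the model predicts the log-transformed target, predictions would need the matching inverse, `np.sign(y) * np.expm1(np.abs(y))`, before being read as currency amounts.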