MagicInput Dataset Builder

Overview

The MagicInput Dataset Builder prepares labeled training data for AI-based crypto trading models. It simulates trade outcomes across a wide range of strategy presets and historical market conditions, then writes the results as .parquet files for efficient loading during model training.

🛠 Configuration (config.yaml)

daysBack: 30
exportFolder: exports
categories: 
  - meme
datasetDir: datasets
baseDir: presets
direction: Both     # Long, Short, Both
strategy: scalp     # balance_midterm, long_term, scalp, swing
maxVariations: 100  # 0: Skip / test cap
isDryRun: false
writeThreshold: 500
maxMemoryMB: 8000

database:
  provider: sqlite
  connectionString: Data Source=trade.db

This configuration guides the simulator on:

Target symbol and timeframe (e.g., BTC, 1m)
Backtest range in days
Categories & strategies to include
Directional simulation: Long / Short / Both
Memory caps and thresholds
Optional dry-run testing
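
As an illustration of how these settings might be checked before a run, here is a minimal Python sketch. It assumes the YAML has already been parsed into a plain dict; BuilderConfig and validate_config are hypothetical names, not part of the builder, and the specific checks are assumptions based on the comments in the config above.

```python
from dataclasses import dataclass

# Allowed values taken from the comments in config.yaml above.
VALID_DIRECTIONS = {"Long", "Short", "Both"}
VALID_STRATEGIES = {"balance_midterm", "long_term", "scalp", "swing"}


@dataclass
class BuilderConfig:
    """Hypothetical typed view of the config.yaml keys shown above."""
    daysBack: int
    categories: list
    direction: str
    strategy: str
    maxVariations: int
    isDryRun: bool
    writeThreshold: int
    maxMemoryMB: int


def validate_config(raw: dict) -> BuilderConfig:
    """Validate a parsed config mapping and return it as a BuilderConfig."""
    cfg = BuilderConfig(
        daysBack=int(raw["daysBack"]),
        categories=list(raw["categories"]),
        direction=str(raw["direction"]),
        strategy=str(raw["strategy"]),
        maxVariations=int(raw["maxVariations"]),
        isDryRun=bool(raw["isDryRun"]),
        writeThreshold=int(raw["writeThreshold"]),
        maxMemoryMB=int(raw["maxMemoryMB"]),
    )
    errors = []
    if cfg.daysBack <= 0:
        errors.append("daysBack must be positive")
    if not cfg.categories:
        errors.append("categories must not be empty")
    if cfg.direction not in VALID_DIRECTIONS:
        errors.append(f"unknown direction: {cfg.direction}")
    if cfg.strategy not in VALID_STRATEGIES:
        errors.append(f"unknown strategy: {cfg.strategy}")
    if cfg.maxVariations < 0:
        errors.append("maxVariations must be >= 0")
    if cfg.writeThreshold <= 0 or cfg.maxMemoryMB <= 0:
        errors.append("writeThreshold and maxMemoryMB must be positive")
    if errors:
        raise ValueError("; ".join(errors))
    return cfg


EXAMPLE = {
    "daysBack": 30, "categories": ["meme"], "direction": "Both",
    "strategy": "scalp", "maxVariations": 100, "isDryRun": False,
    "writeThreshold": 500, "maxMemoryMB": 8000,
}
```

Calling validate_config(EXAMPLE) returns a BuilderConfig, while a typo such as direction: both raises a ValueError that lists every problem at once.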

🧩 Category / Direction / Strategy Hierarchy

Each dataset is grouped using a 3-level folder hierarchy:

Category: e.g. meme, layer1, AI
Direction: Long, Short, Both
Strategy: balance_midterm, long_term, scalp, swing

Each preset YAML defines parametric input ranges for dataset generation.
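
The hierarchy above can be walked with a few lines of Python. This is only a sketch: it assumes presets live at {baseDir}/{category}/{direction}/{strategy}/ and use a .yaml extension, and preset_dir and find_presets are hypothetical helper names.

```python
from pathlib import Path


def preset_dir(base_dir, category, direction, strategy):
    """Build the assumed 3-level folder: category / direction / strategy."""
    return Path(base_dir) / category / direction / strategy


def find_presets(base_dir, categories, direction, strategy):
    """Collect preset YAML files for every configured category.

    Folders that do not exist are skipped rather than treated as errors.
    """
    found = []
    for category in categories:
        folder = preset_dir(base_dir, category, direction, strategy)
        if folder.is_dir():
            found.extend(sorted(folder.glob("*.yaml")))
    return found
```

With the config shown earlier, find_presets("presets", ["meme"], "Both", "scalp") would look in presets/meme/Both/scalp/.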

🧠 Example Preset

name: Scalping Strategy - Long
description: High-frequency short-term trading profile with tight SL/TP and high leverage

leverage: [25, 50, 75]
strategy: [0, 1, 2]
virtualBalance: [100, 250]
riskPercent: [3, 5]

stopLoss: [0.3, 0.5, 1.0]
takeProfit: [0.5, 1.0, 2.0]
trailingSLOffset: [0.2, 0.5]
breakevenActivation: [0.5, 1.0]
breakevenBuffer: [0.1, 0.2]
trailingTPTrigger: [1.0, 2.0]
trailingTPOffset: [0.5]

timeTriggerEnabled: [true]
timeTriggerMinutes: [1, 3, 5]
timeTriggerModes: [2]

change: [0.3, 0.5, 1.0]
direction: [0, 1]
interval: [1, 3, 5]
match: [0, 1, 2]
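
To see why a variation cap matters, the sketch below counts the combinations in this preset. It assumes the builder expands the list-valued fields as a full Cartesian product and that maxVariations: 0 means "no cap"; neither is confirmed here, and count_variations and iter_variations are hypothetical names.

```python
import itertools
import math

# List-valued fields copied from the example preset above.
PRESET = {
    "leverage": [25, 50, 75],
    "strategy": [0, 1, 2],
    "virtualBalance": [100, 250],
    "riskPercent": [3, 5],
    "stopLoss": [0.3, 0.5, 1.0],
    "takeProfit": [0.5, 1.0, 2.0],
    "trailingSLOffset": [0.2, 0.5],
    "breakevenActivation": [0.5, 1.0],
    "breakevenBuffer": [0.1, 0.2],
    "trailingTPTrigger": [1.0, 2.0],
    "trailingTPOffset": [0.5],
    "timeTriggerEnabled": [True],
    "timeTriggerMinutes": [1, 3, 5],
    "timeTriggerModes": [2],
    "change": [0.3, 0.5, 1.0],
    "direction": [0, 1],
    "interval": [1, 3, 5],
    "match": [0, 1, 2],
}


def count_variations(ranges):
    """Number of combinations in the full Cartesian product."""
    return math.prod(len(values) for values in ranges.values())


def iter_variations(ranges, max_variations=0):
    """Lazily yield one dict per combination.

    Assumption: max_variations == 0 means no cap; any positive value
    stops the iteration after that many combinations.
    """
    keys = list(ranges)
    combos = itertools.product(*(ranges[key] for key in keys))
    if max_variations > 0:
        combos = itertools.islice(combos, max_variations)
    for values in combos:
        yield dict(zip(keys, values))


print(count_variations(PRESET))  # 839808
```

Under the full-product assumption, eight three-valued fields and seven two-valued fields give 3^8 × 2^7 = 839,808 variations for one symbol. Generating them lazily and capping them keeps memory use bounded.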

📦 Dataset Output

The final datasets are written to disk under: /datasets/parquet_{YYYYMMDD}/{category}/{direction}/{symbol}.parquet

Example:

/datasets/parquet_20250713/meme/Both/WIF.parquet
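
The path template can be expressed as a small Python helper. This is a sketch of the naming scheme only; dataset_path is a hypothetical name and the root default simply mirrors the example above.

```python
from datetime import date
from pathlib import PurePosixPath


def dataset_path(run_date, category, direction, symbol, root="/datasets"):
    """Fill in the template parquet_{YYYYMMDD}/{category}/{direction}/{symbol}.parquet."""
    folder = f"parquet_{run_date:%Y%m%d}"
    return PurePosixPath(root) / folder / category / direction / f"{symbol}.parquet"


print(dataset_path(date(2025, 7, 13), "meme", "Both", "WIF"))
# /datasets/parquet_20250713/meme/Both/WIF.parquet
```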

🚀 Simulation Pipeline Features

✔️ Preset loading & filtering
✔️ Directional control (Long / Short / Both)
✔️ Memory-safe parallel simulation
✔️ Smart batching, ETA logging, run ID tracking
✔️ Optional dry-run mode for preview
✔️ CSV export support

🧼 Dataset Validation & Repair

✔️ Header structure validation
✔️ Auto-skip rows with null/NaN values or zero trades
✔️ .bak backup before overwriting bad files
✔️ Compressed re-save as .gz
✔️ Debug logs + relocation of bad files to __bad__/
✔️ Thread-safe processing for large datasets
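
The header check and row-skipping rules above might look roughly like this in Python. The sketch works on plain dicts rather than Parquet files, and the column name "trades" and the function names are assumptions made for illustration.

```python
import math


def is_valid_row(row, trades_column="trades"):
    """Return False for rows with a null/NaN value or a zero trade count.

    The "trades" column name is hypothetical.
    """
    for value in row.values():
        if value is None:
            return False
        if isinstance(value, float) and math.isnan(value):
            return False
    return row.get(trades_column, 0) != 0


def validate_rows(rows, expected_header):
    """Check each row's columns against the expected header, then filter.

    Returns (kept_rows, skipped_count). A header mismatch raises ValueError
    so the caller can move the file aside (for example into __bad__/).
    """
    kept, skipped = [], 0
    for index, row in enumerate(rows):
        if list(row.keys()) != list(expected_header):
            raise ValueError(f"row {index}: unexpected columns {list(row.keys())}")
        if is_valid_row(row):
            kept.append(row)
        else:
            skipped += 1
    return kept, skipped
```

Raising on a header mismatch while silently skipping bad rows keeps the two failure modes separate: a structurally broken file is rejected as a whole, and a mostly good file loses only its unusable rows.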

📚 Supported Categories

These categories represent logical market sectors or token types. List any combination of them under categories in your config.yaml.

AI
bitcoin-layer2
bluechip / bluechip-alt
defin / defin-alt
enterprise-alt
gaming
highcap-alt
high-volatility-alt
identity
infrastructure
layer1 / layer2 / layer1-highcap
legacy / legacy-alt
meme
metaverse
new-alt
ordinals
pow-alt
stablecoin
storage
video
zero-knowledge

💡 Notes

Dataset building is CPU/memory intensive. Configure maxMemoryMB and writeThreshold accordingly.
Use isDryRun: true to validate config and preview variations without writing files.
Set maxVariations to cap the number of simulated variations and prevent memory overflow on extremely large permutation sets.
All simulations are reproducible using the same RunId and configuration.
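
One way to picture how writeThreshold limits memory use is a buffer that flushes once it holds that many rows. The sketch below assumes writeThreshold counts buffered result rows, which is not stated above; BufferedWriter and its sink callback are hypothetical.

```python
class BufferedWriter:
    """Collect rows in memory and hand them to `sink` in batches.

    Assumption: write_threshold is the number of buffered rows that
    triggers a write. `sink` is any callable that persists a list of rows.
    """

    def __init__(self, write_threshold, sink):
        if write_threshold <= 0:
            raise ValueError("write_threshold must be positive")
        self.write_threshold = write_threshold
        self.sink = sink
        self.buffer = []

    def add(self, row):
        """Buffer one row and flush when the threshold is reached."""
        self.buffer.append(row)
        if len(self.buffer) >= self.write_threshold:
            self.flush()

    def flush(self):
        """Write out whatever is buffered; call once more at the end of a run."""
        if self.buffer:
            self.sink(list(self.buffer))
            self.buffer.clear()


batches = []
writer = BufferedWriter(write_threshold=3, sink=batches.append)
for i in range(7):
    writer.add({"row": i})
writer.flush()
print([len(batch) for batch in batches])  # [3, 3, 1]
```

A lower threshold means more frequent, smaller writes and a smaller memory footprint; a higher one trades memory for fewer writes.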