MagicInput Dataset Builder

Overview

The MagicInput Dataset Builder prepares labeled training data for AI-based crypto trading models. It simulates trade outcomes across a wide range of strategy presets and historical market conditions, then writes the results as .parquet files suited to high-performance model training.

🛠 Configuration (config.yaml)

daysBack: 30
exportFolder: exports
categories: 
  - meme
datasetDir: datasets
baseDir: presets
direction: Both     # Long, Short, Both
strategy: scalp     # balance_midterm, long_term, scalp, swing
maxVariations: 100  # 0: Skip / test cap
isDryRun: false
writeThreshold: 500
maxMemoryMB: 8000

database:
  provider: sqlite
  connectionString: Data Source=trade.db

This configuration guides the simulator on:

  • Target symbol and timeframe (e.g., BTC, 1m)
  • Backtest range in days
  • Categories & strategies to include
  • Directional simulation: Long / Short / Both
  • Memory caps and thresholds
  • Optional dry-run testing
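As a minimal sketch of how these settings might be checked before a run, the Python snippet below validates a config mapping with the same keys as config.yaml. The BuilderConfig class and parse_config function are hypothetical names used for illustration only and are not part of MagicInput. The allowed direction and strategy values come from the comments in the config above.

```python
from dataclasses import dataclass

# Allowed values, taken from the comments in config.yaml.
DIRECTIONS = {"Long", "Short", "Both"}
STRATEGIES = {"balance_midterm", "long_term", "scalp", "swing"}


@dataclass(frozen=True)
class BuilderConfig:
    """Hypothetical typed view of the config.yaml settings."""
    days_back: int
    categories: tuple[str, ...]
    direction: str
    strategy: str
    max_variations: int
    is_dry_run: bool
    write_threshold: int
    max_memory_mb: int


def parse_config(raw: dict) -> BuilderConfig:
    """Validate a mapping that mirrors config.yaml and return a typed config."""
    if raw["direction"] not in DIRECTIONS:
        raise ValueError(f"unknown direction: {raw['direction']!r}")
    if raw["strategy"] not in STRATEGIES:
        raise ValueError(f"unknown strategy: {raw['strategy']!r}")
    if raw["daysBack"] <= 0:
        raise ValueError("daysBack must be positive")
    if not raw["categories"]:
        raise ValueError("at least one category is required")
    return BuilderConfig(
        days_back=raw["daysBack"],
        categories=tuple(raw["categories"]),
        direction=raw["direction"],
        strategy=raw["strategy"],
        max_variations=raw["maxVariations"],
        is_dry_run=raw["isDryRun"],
        write_threshold=raw["writeThreshold"],
        max_memory_mb=raw["maxMemoryMB"],
    )


# The values below repeat the sample config shown earlier.
sample = {
    "daysBack": 30,
    "categories": ["meme"],
    "direction": "Both",
    "strategy": "scalp",
    "maxVariations": 100,
    "isDryRun": False,
    "writeThreshold": 500,
    "maxMemoryMB": 8000,
}
config = parse_config(sample)
print(config.direction, config.strategy)
```

Rejecting unknown direction or strategy values up front avoids starting a long simulation with a mistyped setting.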

🧩 Category / Direction / Strategy Hierarchy

Each dataset is grouped using a 3-level folder hierarchy:

  • Category: e.g. meme, layer1, AI
  • Direction: Long, Short, Both
  • Strategy: balance_midterm, long_term, scalp, swing

Each preset YAML file lists the candidate values for every input parameter used in dataset generation.
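The hierarchy can be pictured as a path built from the three levels. The sketch below assumes the levels are nested under baseDir in the order Category / Direction / Strategy. The preset_dir helper is a hypothetical name, and the real on-disk layout may differ.

```python
from pathlib import PurePosixPath


def preset_dir(base_dir: str, category: str, direction: str, strategy: str) -> PurePosixPath:
    """Join the three hierarchy levels under base_dir (assumed layout)."""
    return PurePosixPath(base_dir) / category / direction / strategy


# Values taken from the sample config.yaml.
path = preset_dir("presets", "meme", "Both", "scalp")
print(path)  # presets/meme/Both/scalp
```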

🧠 Example Preset

name: Scalping Strategy - Long
description: High-frequency short-term trading profile with tight SL/TP and high leverage

leverage: [25, 50, 75]
strategy: [0, 1, 2]
virtualBalance: [100, 250]
riskPercent: [3, 5]

stopLoss: [0.3, 0.5, 1.0]
takeProfit: [0.5, 1.0, 2.0]
trailingSLOffset: [0.2, 0.5]
breakevenActivation: [0.5, 1.0]
breakevenBuffer: [0.1, 0.2]
trailingTPTrigger: [1.0, 2.0]
trailingTPOffset: [0.5]

timeTriggerEnabled: [true]
timeTriggerMinutes: [1, 3, 5]
timeTriggerModes: [2]

change: [0.3, 0.5, 1.0]
direction: [0, 1]
interval: [1, 3, 5]
match: [0, 1, 2]
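To see why a cap such as maxVariations matters, it helps to count the combinations this preset implies. The sketch below assumes that variations are the Cartesian product of all parameter lists, which is an assumption about the builder rather than something stated here. The count_variations and iter_variations helpers are hypothetical.

```python
import itertools
import math

# Parameter lists copied from the example preset.
PRESET = {
    "leverage": [25, 50, 75],
    "strategy": [0, 1, 2],
    "virtualBalance": [100, 250],
    "riskPercent": [3, 5],
    "stopLoss": [0.3, 0.5, 1.0],
    "takeProfit": [0.5, 1.0, 2.0],
    "trailingSLOffset": [0.2, 0.5],
    "breakevenActivation": [0.5, 1.0],
    "breakevenBuffer": [0.1, 0.2],
    "trailingTPTrigger": [1.0, 2.0],
    "trailingTPOffset": [0.5],
    "timeTriggerEnabled": [True],
    "timeTriggerMinutes": [1, 3, 5],
    "timeTriggerModes": [2],
    "change": [0.3, 0.5, 1.0],
    "direction": [0, 1],
    "interval": [1, 3, 5],
    "match": [0, 1, 2],
}


def count_variations(preset: dict) -> int:
    """Size of the full Cartesian product of all parameter lists."""
    return math.prod(len(values) for values in preset.values())


def iter_variations(preset: dict, limit=None):
    """Yield parameter dicts lazily, stopping after `limit` items if given."""
    keys = list(preset)
    combos = itertools.product(*(preset[key] for key in keys))
    for values in itertools.islice(combos, limit):
        yield dict(zip(keys, values))


print(count_variations(PRESET))  # 839808
capped = list(iter_variations(PRESET, limit=100))
print(len(capped))  # 100
```

Under that assumption, this single preset expands to 839,808 combinations, so generating them lazily and capping the count keeps memory use bounded. Taking the first N items of a product mostly varies the last parameters, so a real builder might prefer random sampling. That choice is not specified here.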

📦 Dataset Output

The final datasets are written to disk under: /datasets/parquet_{YYYYMMDD}/{category}/{direction}/{symbol}.parquet

Example:

/datasets/parquet_20250713/meme/Both/WIF.parquet
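The path template can be filled in as follows. This is an illustrative sketch, and the dataset_path helper is a hypothetical name rather than part of the builder.

```python
from datetime import date
from pathlib import PurePosixPath


def dataset_path(dataset_dir: str, run_date: date, category: str,
                 direction: str, symbol: str) -> PurePosixPath:
    """Fill the documented template: parquet_{YYYYMMDD}/{category}/{direction}/{symbol}.parquet."""
    folder = f"parquet_{run_date:%Y%m%d}"
    return PurePosixPath(dataset_dir) / folder / category / direction / f"{symbol}.parquet"


out = dataset_path("/datasets", date(2025, 7, 13), "meme", "Both", "WIF")
print(out)  # /datasets/parquet_20250713/meme/Both/WIF.parquet
```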

🚀 Simulation Pipeline Features

  • ✔️ Preset loading & filtering
  • ✔️ Directional control (Long / Short / Both)
  • ✔️ Memory-safe parallel simulation
  • ✔️ Smart batching, ETA logging, run ID tracking
  • ✔️ Optional dry-run mode for preview
  • ✔️ CSV export support
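One plausible reading of batching with writeThreshold is a buffer that flushes to disk once it holds that many rows. The sketch below illustrates that idea only. The BatchWriter class and its sink callback are hypothetical, and the actual semantics of writeThreshold may differ.

```python
class BatchWriter:
    """Buffer rows in memory and hand them to `sink` in batches (assumed behavior)."""

    def __init__(self, write_threshold: int, sink):
        self.write_threshold = write_threshold
        self.sink = sink          # callable that persists one batch of rows
        self.buffer = []

    def add(self, row: dict) -> None:
        self.buffer.append(row)
        if len(self.buffer) >= self.write_threshold:
            self.flush()

    def flush(self) -> None:
        """Write any buffered rows and start a fresh buffer."""
        if self.buffer:
            self.sink(self.buffer)
            self.buffer = []


# Collect batches in a list instead of writing files, for demonstration.
batches = []
writer = BatchWriter(write_threshold=500, sink=batches.append)
for i in range(1200):
    writer.add({"id": i})
writer.flush()  # write the final partial batch
print([len(batch) for batch in batches])  # [500, 500, 200]
```

A lower threshold reduces peak memory at the cost of more frequent writes.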

🧼 Dataset Validation & Repair

  • ✔️ Header structure validation
  • ✔️ Auto-skip rows with null/NaN values or zero trades
  • ✔️ .bak backup before overwriting bad files
  • ✔️ Compressed re-save as .gz
  • ✔️ Debug logs + relocation of bad files to __bad__/
  • ✔️ Thread-safe processing for large datasets
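The header and row checks listed above might look like the following sketch. The column names, including "trades", are hypothetical placeholders, since the real dataset schema is not listed here.

```python
import math


def header_matches(columns: list, expected: list) -> bool:
    """Header validation: same column names in the same order."""
    return list(columns) == list(expected)


def is_valid_row(row: dict, trades_column: str = "trades") -> bool:
    """Reject rows containing None/NaN or reporting zero trades."""
    for value in row.values():
        if value is None:
            return False
        if isinstance(value, float) and math.isnan(value):
            return False
    # A missing trades column is treated like zero trades.
    return row.get(trades_column, 0) != 0


rows = [
    {"pnl": 1.5, "trades": 12},
    {"pnl": float("nan"), "trades": 4},   # NaN value
    {"pnl": None, "trades": 7},           # null value
    {"pnl": 0.0, "trades": 0},            # zero trades
]
clean = [row for row in rows if is_valid_row(row)]
print(clean)  # [{'pnl': 1.5, 'trades': 12}]
```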

📚 Supported Categories

These categories represent logical market sectors or token types. You can configure any combination in your config.yaml.

  • AI
  • bitcoin-layer2
  • bluechip / bluechip-alt
  • defin / defin-alt
  • enterprise-alt
  • gaming
  • highcap-alt
  • high-volatility-alt
  • identity
  • infrastructure
  • layer1 / layer2 / layer1-highcap
  • legacy / legacy-alt
  • meme
  • metaverse
  • new-alt
  • ordinals
  • pow-alt
  • stablecoin
  • storage
  • video
  • zero-knowledge

💡 Notes

  • Dataset building is CPU/memory intensive. Configure maxMemoryMB and writeThreshold accordingly.
  • Use isDryRun: true to validate config and preview variations without writing files.
  • Setting maxVariations helps prevent memory overflow on extremely large permutation sets.
  • All simulations are reproducible using the same RunId and configuration.
