Overview
The MagicInput Dataset Builder prepares labeled training data for AI-based crypto trading models.
It simulates trade outcomes across a wide range of strategy presets and historical conditions,
then writes the results as .parquet files suited to high-performance model training.
🛠 Configuration (config.yaml)
daysBack: 30
exportFolder: exports
categories:
  - meme
datasetDir: datasets
baseDir: presets
direction: Both # Long, Short, Both
strategy: scalp # balance_midterm, long_term, scalp, swing
maxVariations: 100 # 0: Skip / test cap
isDryRun: false
writeThreshold: 500
maxMemoryMB: 8000
database:
  provider: sqlite
  connectionString: Data Source=trade.db
This configuration guides the simulator on:
- Target symbol and timeframe (e.g., BTC, 1m)
- Backtest range in days
- Categories & strategies to include
- Directional simulation: Long / Short / Both
- Memory caps and thresholds
- Optional dry-run testing
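A quick sanity check of these settings before a long run can save time. The sketch below is only an illustration: it assumes the YAML has already been parsed into a plain dict, and the function name validate_config is hypothetical, not part of the builder. The allowed direction and strategy values are taken from the inline comments in the configuration above.

```python
# Allowed values, copied from the inline comments in config.yaml.
DIRECTIONS = {"Long", "Short", "Both"}
STRATEGIES = {"balance_midterm", "long_term", "scalp", "swing"}


def validate_config(cfg: dict) -> list:
    """Return a list of problems found in cfg; an empty list means it looks fine."""
    problems = []
    if cfg.get("direction") not in DIRECTIONS:
        problems.append(f"direction must be one of {sorted(DIRECTIONS)}")
    if cfg.get("strategy") not in STRATEGIES:
        problems.append(f"strategy must be one of {sorted(STRATEGIES)}")
    # Numeric settings that only make sense as positive integers.
    for key in ("daysBack", "writeThreshold", "maxMemoryMB"):
        value = cfg.get(key)
        if isinstance(value, bool) or not isinstance(value, int) or value <= 0:
            problems.append(f"{key} must be a positive integer")
    if not cfg.get("categories"):
        problems.append("categories must list at least one category")
    return problems


# The example configuration from this README, as a parsed dict.
example_config = {
    "daysBack": 30,
    "categories": ["meme"],
    "direction": "Both",
    "strategy": "scalp",
    "maxVariations": 100,
    "isDryRun": False,
    "writeThreshold": 500,
    "maxMemoryMB": 8000,
}

if __name__ == "__main__":
    for problem in validate_config(example_config):
        print("config error:", problem)
```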
🧩 Category / Direction / Strategy Hierarchy
Each dataset is grouped using a 3-level folder hierarchy:
- Category: e.g. meme, layer1, AI
- Direction: Long, Short, Both
- Strategy: balance_midterm, long_term, scalp, swing
Each preset YAML defines parametric input ranges for dataset generation.
🧠 Example Preset
name: Scalping Strategy - Long
description: High-frequency short-term trading profile with tight SL/TP and high leverage
leverage: [25, 50, 75]
strategy: [0, 1, 2]
virtualBalance: [100, 250]
riskPercent: [3, 5]
stopLoss: [0.3, 0.5, 1.0]
takeProfit: [0.5, 1.0, 2.0]
trailingSLOffset: [0.2, 0.5]
breakevenActivation: [0.5, 1.0]
breakevenBuffer: [0.1, 0.2]
trailingTPTrigger: [1.0, 2.0]
trailingTPOffset: [0.5]
timeTriggerEnabled: [true]
timeTriggerMinutes: [1, 3, 5]
timeTriggerModes: [2]
change: [0.3, 0.5, 1.0]
direction: [0, 1]
interval: [1, 3, 5]
match: [0, 1, 2]
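To get a feel for how quickly a preset grows, the sketch below counts and enumerates parameter combinations. It assumes variations are the full Cartesian product of every list in the preset, and that a maxVariations of 0 means "no cap"; neither is confirmed by this README, and the function names are hypothetical.

```python
import itertools
import math

# List-valued fields from the example preset above (name/description omitted).
preset = {
    "leverage": [25, 50, 75],
    "strategy": [0, 1, 2],
    "virtualBalance": [100, 250],
    "riskPercent": [3, 5],
    "stopLoss": [0.3, 0.5, 1.0],
    "takeProfit": [0.5, 1.0, 2.0],
    "trailingSLOffset": [0.2, 0.5],
    "breakevenActivation": [0.5, 1.0],
    "breakevenBuffer": [0.1, 0.2],
    "trailingTPTrigger": [1.0, 2.0],
    "trailingTPOffset": [0.5],
    "timeTriggerEnabled": [True],
    "timeTriggerMinutes": [1, 3, 5],
    "timeTriggerModes": [2],
    "change": [0.3, 0.5, 1.0],
    "direction": [0, 1],
    "interval": [1, 3, 5],
    "match": [0, 1, 2],
}


def count_variations(preset: dict) -> int:
    """Size of the full Cartesian product, without materializing it."""
    return math.prod(len(values) for values in preset.values())


def iter_variations(preset: dict, max_variations: int = 0):
    """Lazily yield one dict per combination; stop early if a cap is set."""
    keys = list(preset)
    combos = itertools.product(*(preset[k] for k in keys))
    if max_variations > 0:  # assumption: 0 disables the cap
        combos = itertools.islice(combos, max_variations)
    for combo in combos:
        yield dict(zip(keys, combo))


if __name__ == "__main__":
    print(count_variations(preset))
    print(next(iter_variations(preset, max_variations=100)))
```

Under the full-product assumption, the example preset expands to 839,808 combinations (3^8 × 2^7), which is why a cap such as maxVariations matters.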
📦 Dataset Output
The final datasets are written to disk under:
/datasets/parquet_{YYYYMMDD}/{category}/{direction}/{symbol}.parquet
Example:
/datasets/parquet_20250713/meme/Both/WIF.parquet
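The path pattern above can be reproduced with a small helper. This is a sketch only; dataset_path is a hypothetical name and the builder's own path logic may differ.

```python
from datetime import date
from pathlib import PurePosixPath


def dataset_path(root: str, run_date: date, category: str,
                 direction: str, symbol: str) -> PurePosixPath:
    """Build {root}/parquet_{YYYYMMDD}/{category}/{direction}/{symbol}.parquet."""
    folder = f"parquet_{run_date:%Y%m%d}"
    return PurePosixPath(root) / folder / category / direction / f"{symbol}.parquet"


if __name__ == "__main__":
    print(dataset_path("/datasets", date(2025, 7, 13), "meme", "Both", "WIF"))
```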
🚀 Simulation Pipeline Features
- ✔️ Preset loading & filtering
- ✔️ Directional control (Long / Short / Both)
- ✔️ Memory-safe parallel simulation
- ✔️ Smart batching, ETA logging, run ID tracking
- ✔️ Optional dry-run mode for preview
- ✔️ CSV export support
🧼 Dataset Validation & Repair
- ✔️ Header structure validation
- ✔️ Auto-skip rows with null/NaN or zero-trades
- ✔️ .bak backup before overwriting bad files
- ✔️ Compressed re-save as .gz
- ✔️ Debug logs + relocation of bad files to __bad__/
- ✔️ Thread-safe processing for large datasets
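The row-skipping rule can be illustrated on plain Python dicts. This is a minimal sketch, not the builder's validator: the column name "trades" is an assumption made for the example, and a row missing that column is treated as having zero trades.

```python
import math


def is_valid_row(row: dict, trades_key: str = "trades") -> bool:
    """True if the row has no null/NaN values and a non-zero trade count."""
    for value in row.values():
        if value is None:
            return False
        if isinstance(value, float) and math.isnan(value):
            return False
    # "trades" is a hypothetical column name used only for this example.
    return row.get(trades_key, 0) != 0


def clean_rows(rows: list) -> list:
    """Keep only the rows that pass is_valid_row."""
    return [row for row in rows if is_valid_row(row)]


if __name__ == "__main__":
    sample = [
        {"symbol": "WIF", "pnl": 1.2, "trades": 4},
        {"symbol": "WIF", "pnl": float("nan"), "trades": 3},
        {"symbol": "WIF", "pnl": 0.0, "trades": 0},
        {"symbol": None, "pnl": 0.4, "trades": 2},
    ]
    print(clean_rows(sample))
```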
📚 Supported Categories
These categories represent logical market sectors or token types.
You can configure any combination in your config.yaml.
- AI
- bitcoin-layer2
- bluechip / bluechip-alt
- defin / defin-alt
- enterprise-alt
- gaming
- highcap-alt
- high-volatility-alt
- identity
- infrastructure
- layer1 / layer2 / layer1-highcap
- legacy / legacy-alt
- meme
- metaverse
- new-alt
- ordinals
- pow-alt
- stablecoin
- storage
- video
- zero-knowledge
💡 Notes
- Dataset building is CPU/memory intensive. Configure maxMemoryMB and writeThreshold accordingly.
- Use isDryRun: true to validate config and preview variations without writing files.
- maxVariations helps prevent memory overflow on extremely large permutation sets.
- All simulations are reproducible using the same RunId and configuration.
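One way to picture the role of writeThreshold is a buffer that hands rows to a writer once enough have accumulated, so memory use stays bounded. The sketch below assumes writeThreshold is a row count; the README does not state its unit, and BatchBuffer is a hypothetical class, not the builder's implementation.

```python
class BatchBuffer:
    """Collect rows in memory and flush them to a sink in fixed-size batches."""

    def __init__(self, write_threshold: int, sink):
        self.write_threshold = write_threshold  # assumed to be a row count
        self.sink = sink                        # callable that receives a list of rows
        self._rows = []

    def add(self, row) -> None:
        self._rows.append(row)
        if len(self._rows) >= self.write_threshold:
            self.flush()

    def flush(self) -> None:
        """Send any buffered rows to the sink and start a fresh buffer."""
        if self._rows:
            self.sink(self._rows)
            self._rows = []


if __name__ == "__main__":
    batches = []
    buffer = BatchBuffer(write_threshold=500, sink=batches.append)
    for i in range(1200):
        buffer.add({"row": i})
    buffer.flush()  # write the final partial batch
    print([len(batch) for batch in batches])
```

A lower threshold means more frequent, smaller writes and less memory held at once; a higher one trades memory for fewer writes.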