1 Project Overview

- Goal: Real-person LoRA training
- Architecture: Blackwell FP8
- Environment: Windows 11 + StabilityMatrix
- Target dataset: ~29 curated images
2 Reference: djdante's Validated Settings

djdante's Key Insights

1. Klein 9B = organic realism: better than Z-Image at lens distortion, skin texture, and overall realism.
2. Has structural weaknesses: extra limbs, helmet structure errors, etc.; avoidable via a chest-up strategy.
3. 29 chest-up images = best consistency: the most consistent results were achieved with all 29 images chest-up.
4. Short, concise captions: roughly "djdanteman Portrait photo, smiling at the camera wearing a white collared shirt".
5. Face only = 29 images suffice: for face + body, far more images are required.
djdante's Original Config (FluxTraining.yaml)

```yaml
trigger_word: "djdanteman"
images: 29                  # all chest-up
num_repeats: 1
batch_size: 1
gradient_accumulation: 2    # effective batch = 2
steps: 4000                 # ~276 epochs
lr: 0.00008
optimizer: adamw
dtype: bf16
quantize: true              # qfloat8
model: black-forest-labs/FLUX.2-klein-base-9B
ema: true                   # decay 0.99
save_every: 100
sample_every: 100
sample_steps: 50
walk_seed: true
resolution: [1024, 768]
network: lora               # linear 16, conv 8
```
Sample Structure for Overfit Detection

- Prompt A: seed 42, network_multiplier 1.0 (default) vs 0.8 (compare)
- Prompt B: seed 43, network_multiplier 1.0 (default) vs 0.8 (compare)
- Prompt C: seed 44, network_multiplier 1.0 (default) vs 0.8 (compare)
The moment the 0.8 samples start looking better than the 1.0 samples, the overfit threshold has been crossed; the checkpoint just before that point is optimal.
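One way to encode this pairing in a training config is to render every prompt twice at the same seed, once per multiplier. The fragment below is only an illustrative sketch of that pairing, not the exact AI-Toolkit sample schema (field names may differ by version; check your installed version's config reference):

```yaml
# Illustrative sketch: three prompts, each sampled twice at the same
# seed, at network_multiplier 1.0 and 0.8, for side-by-side comparison.
samples:
  - { prompt: "Prompt A", seed: 42, network_multiplier: 1.0 }
  - { prompt: "Prompt A", seed: 42, network_multiplier: 0.8 }
  - { prompt: "Prompt B", seed: 43, network_multiplier: 1.0 }
  - { prompt: "Prompt B", seed: 43, network_multiplier: 0.8 }
  - { prompt: "Prompt C", seed: 44, network_multiplier: 1.0 }
  - { prompt: "Prompt C", seed: 44, network_multiplier: 0.8 }
```

Keeping the seed fixed within each pair isolates the multiplier as the only variable, which is what makes the 1.0-vs-0.8 comparison meaningful.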
3 Bulk Photos → 29 Curated Images Pipeline

1. Receive bulk photos
2. ArcFace identity check
3. YOLOv8 chest-up crop
4. MagFace quality score
5. imagededup deduplication
6. Manual final selection
7. Florence-2 captioning
8. AI-Toolkit training

Tool roles:

- ArcFace: drop images with cosine similarity < 0.4
- YOLOv8: face detection → chest-up crop
- imagededup: cut images with 90%+ similarity
- Florence-2: auto-caption + trigger word
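Step 2 (the ArcFace identity check) reduces, in essence, to a cosine-similarity gate over face embeddings. A minimal sketch, assuming embeddings have already been extracted by an ArcFace model; the helper names (`cosine_similarity`, `filter_by_identity`) are illustrative, while the 0.4 cutoff mirrors the threshold above:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def filter_by_identity(candidates, reference, threshold=0.4):
    """Keep only candidates whose embedding is at least `threshold`
    cosine-similar to the reference identity embedding."""
    return {name: emb for name, emb in candidates.items()
            if cosine_similarity(emb, reference) >= threshold}
```

The same similarity function also explains why a stricter threshold trims the dataset harder: raising it toward 1.0 keeps only near-identical matches to the reference face.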
4 djdante Original vs Our Settings

| Parameter | djdante original | Our settings (v3) |
| --- | --- | --- |
| trigger_word | djdanteman | joseonnamja / mzyeoja |
| Image count | 29 | 16 (male) / 12 (female) |
| num_repeats | 1 | 1 (male) / 2 (female) |
| batch_size | 1 | 1 |
| gradient_accumulation | 2 | 2 |
| steps | 4000 | 3000 (scaled to image count) |
| lr | 8e-5 | 8e-5 |
| optimizer | adamw | adamw |
| dtype | bf16 | bf16 |
| quantize | qfloat8 | qfloat8 |
| EMA | true (0.99) | true (0.99) |
| save_every | 100 | 100 |
| sample_every | 100 | 100 |
| sample_steps | 50 | 50 |
| resolution | [1024, 768] | [1024, 768] |
| network | lora (L16, C8) | lora (L16, C8) |
| Sample count | 3 (x2 mult) | 6 (1.0 + 0.8 compare) |
| cache_latents_to_disk | true | false (2x faster) |
5 Lessons (From the First Training Failure)
Captions Decide LoRA Quality
Good (djdante style): "trigger Portrait photo, smiling wearing a white collared shirt"
Bad: "trigger, a young man, front view, upper body, soft lighting, high quality photo"
Avoid quality tags like "soft lighting" or "high quality photo" (djdante doesn't use them either). Write natural-language sentences, not tag lists.
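The good/bad contrast above can be enforced mechanically with a small helper that assembles djdante-style captions from a trigger word and a few free-text attributes. The function and its fields are illustrative, not part of any tool in this project:

```python
def build_caption(trigger, subject="Portrait photo", action="", clothing=""):
    """Assemble a short natural-language caption in the form
    '<trigger> Portrait photo, <action> wearing <clothing>'.
    Quality tags ('soft lighting', 'high quality photo') are
    deliberately not supported."""
    parts = [f"{trigger} {subject}"]
    wearing = f"wearing {clothing}" if clothing else ""
    detail = " ".join(p for p in (action, wearing) if p)
    if detail:
        parts.append(detail)
    return ", ".join(parts)
```

For example, `build_caption("djdanteman", action="smiling at the camera", clothing="a white collared shirt")` reproduces the reference caption above.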
Steps Must Scale With Image Count
djdante baseline: 29 images / 4000 steps = ~276 epochs
Recommended steps formula: steps = image_count × 138
Below 29 images, scale steps down proportionally to prevent overfit.
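Spelled out: each optimizer step consumes batch_size × gradient_accumulation = 2 images, so 4000 steps over 29 images is 4000 × 2 / 29 ≈ 276 epochs, and holding that epoch count fixed yields the image_count × 138 rule. A quick sketch (helper names are illustrative):

```python
def recommended_steps(image_count, steps_per_image=138):
    """Scale total steps linearly with dataset size (image_count x 138)."""
    return image_count * steps_per_image

def effective_epochs(steps, images, batch_size=1, grad_accum=2, num_repeats=1):
    """Epochs = images consumed per optimizer step, divided by the
    (repeat-expanded) dataset size, times the step count."""
    return steps * batch_size * grad_accum / (images * num_repeats)
```

At the djdante baseline, `recommended_steps(29)` gives 4002 (≈ the 4000 used) and `effective_epochs(4000, 29)` comes out to roughly 276.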
ArcFace vs AdaFace: ArcFace Is Right for LoRA Training
AdaFace handles low-resolution faces better but lets low-quality images through, risking overfitting to noise.
ArcFace is stricter, so only high-quality faces pass and the training data stays cleaner. ComfyUI IP-Adapter / FaceID is also ArcFace-based.
Don't Skip the "Manual Reinforcement" Step in the Pipeline
dataset_pipeline.py prints a "manually reinforce captions with clothing/background" warning; ignoring it leads to overfitting.
Florence-2 auto-captioning was designed but never implemented, so substitute manual captions or an AI vision model.
6 Gemini Suggestions Filtered Out

| Suggestion | Decision | Reason |
| --- | --- | --- |
| Disable walk_seed | Filtered | djdante used true with strong results. Track overfit via the network_multiplier 0.8 comparison instead. |
| Use the same seed for all samples | Filtered | djdante uses seeds 42/43/44 deliberately, to evaluate across diverse compositions. |
7 Resolution History vs djdante Original

- cache_latents_to_disk: true → false (root cause of the 2x speed gap)
- Samples: 3 → 6 (added network_multiplier 0.8 comparisons for overfit detection)
- batch_size: 2 → 1 + gradient_accumulation 2 (matches the djdante original)
- steps: 4000 → 3000 (scaled to ~234 epochs at 16 images)
- Captions reworked: template tags → djdante-style natural language + explicit clothing (resolved the first-run failure)
- Removed 3 duplicate suit photos: 19 → 16 images (clothing skew 52.6% → 43.75%)
8 File Locations
📁 G:\Work\JoseonPrince\Actor LoRA Dataset\ Project root
📁 G:\Work\JoseonPrince\Actor LoRA Dataset\Male\ Male dataset
📁 G:\Work\JoseonPrince\Actor LoRA Dataset\Female\ Female dataset
📁 G:\StabilityMatrix\Packages\AI-Toolkit\output\ Training output
📄 face_similarity.py ArcFace-based face similarity
Priority Principles
"No errors, clean, quality" — quality over speed. Use qfloat8 quantization (RTX 5090 Blackwell native FP8 → minimal quality loss, large speedup).