LoRA Training Strategy

Flux.2 Klein 9B LoRA dataset curation strategy based on djdante's validated settings

Flux.2 Klein 9B · RTX 5090 (32GB) · AI-Toolkit · ArcFace verification

1 Project Overview

- Goal: Real-person LoRA training
- Male actor: joseonnamja (trigger word)
- Female actor: mzyeoja (trigger word)
- GPU: RTX 5090, 32GB VRAM, Blackwell architecture (FP8)
- Environment: Windows 11 + StabilityMatrix
- Target dataset: ~29 curated images

2 Reference: djdante's Validated Settings

djdante's Key Insights

1. Klein 9B = organic realism: better than Z-Image at lens distortion, skin texture, and overall realism.
2. Has structural weaknesses: extra limbs, helmet structure errors, etc. Avoidable via a chest-up strategy.
3. 29 chest-up images = best consistency: the most consistent results came from keeping all 29 images chest-up.
4. Short, concise captions: roughly "djdanteman Portrait photo, smiling at the camera wearing a white collared shirt".
5. Face only = 29 images suffice: for face + body, far more images are required.

djdante's Original Config

FluxTraining.yaml

```yaml
trigger_word: "djdanteman"
images: 29                 # all chest-up
num_repeats: 1
batch_size: 1
gradient_accumulation: 2   # effective batch = 2
steps: 4000                # ~276 epochs
lr: 0.00008
optimizer: adamw
dtype: bf16
quantize: true             # qfloat8
model: black-forest-labs/FLUX.2-klein-base-9B
ema: true                  # decay 0.99
save_every: 100
sample_every: 100
sample_steps: 50
walk_seed: true
resolution: [1024, 768]
network: lora              # linear 16, conv 8
```

Sample Structure for Overfit Detection

| Sample | Prompt | Seed | network_multiplier |
|---|---|---|---|
| Default | Prompt A | 42 | 1.0 |
| Compare | Prompt A | 42 | 0.8 |
| Default | Prompt B | 43 | 1.0 |
| Compare | Prompt B | 43 | 0.8 |
| Default | Prompt C | 44 | 1.0 |
| Compare | Prompt C | 44 | 0.8 |

The moment the 0.8 samples start looking better than the 1.0 samples marks the overfit threshold; the checkpoint just before that point is optimal.
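The detection rule above can be sketched as a small helper. The per-checkpoint scores here are hypothetical (e.g. manual likeness ratings of the 1.0 vs 0.8 samples); AI-Toolkit does not emit them, you supply them after eyeballing the sample grid.

```python
def overfit_checkpoint(records):
    """Given (step, score_at_1.0, score_at_0.8) tuples sorted by step,
    return the last checkpoint step where multiplier 1.0 still wins.
    Returns None if 0.8 never overtakes 1.0 (no overfit detected)."""
    prev_step = None
    for step, score_full, score_reduced in records:
        if score_reduced > score_full:  # 0.8 looks better -> overfit began
            return prev_step
        prev_step = step
    return None

# Hypothetical ratings: at step 3000 the 0.8 sample overtakes,
# so step 2000 is the pick.
samples = [(1000, 7.0, 6.5), (2000, 8.2, 7.9), (3000, 8.0, 8.4)]
print(overfit_checkpoint(samples))  # -> 2000
```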

3 Bulk Photos → 29 Curated Images Pipeline

1. Receive bulk photos
2. ArcFace identity check
3. YOLOv8 chest-up crop
4. MagFace quality score
5. imagededup deduplication
6. Manual final selection
7. Florence-2 captioning
8. AI-Toolkit training
| Stage | Rule |
|---|---|
| ArcFace | Drop images with cosine similarity < 0.4 |
| YOLOv8 | Face detection → chest-up crop |
| MagFace | Drop the bottom 30% by quality score |
| imagededup | Cut pairs with ≥90% similarity |
| Florence-2 | Auto-caption + trigger word |
| Target count | ~29 images |
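The two embedding-based filters (identity gate and dedup) can be sketched as follows, assuming precomputed ArcFace embeddings as plain arrays. For brevity both filters reuse the same hypothetical embeddings here; the real pipeline's step 5 uses imagededup's perceptual hashing rather than face embeddings.

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def filter_identity(embs, ref, thresh=0.4):
    """Keep indices whose embedding matches the reference identity
    (cosine similarity >= 0.4, the pipeline's drop threshold)."""
    return [i for i, e in enumerate(embs) if cosine(e, ref) >= thresh]

def dedup(embs, keep, thresh=0.90):
    """Greedy dedup: drop an image if it is >=90% similar to one already kept."""
    kept = []
    for i in keep:
        if all(cosine(embs[i], embs[j]) < thresh for j in kept):
            kept.append(i)
    return kept

# Toy embeddings: index 2 is a different identity, index 1 near-duplicates 0.
ref = np.array([1.0, 0.0])
embs = [np.array([1.0, 0.0]), np.array([0.95, 0.30]), np.array([0.0, 1.0])]
keep = filter_identity(embs, ref)  # drops the off-identity face
final = dedup(embs, keep)          # drops the near-duplicate
print(keep, final)  # -> [0, 1] [0]
```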

4 djdante Original vs Our Settings

| Parameter | djdante original | Our settings (v3) |
|---|---|---|
| trigger_word | djdanteman | joseonnamja / mzyeoja |
| Image count | 29 | 16 (male) / 12 (female) |
| num_repeats | 1 | 1 (male) / 2 (female) |
| batch_size | 1 | 1 |
| gradient_accumulation | 2 | 2 |
| steps | 4000 | 3000 (scaled to image count) |
| lr | 8e-5 | 8e-5 |
| optimizer | adamw | adamw |
| dtype | bf16 | bf16 |
| quantize | qfloat8 | qfloat8 |
| EMA | true (0.99) | true (0.99) |
| save_every | 100 | 100 |
| sample_every | 100 | 100 |
| sample_steps | 50 | 50 |
| resolution | [1024, 768] | [1024, 768] |
| network | lora (L16, C8) | lora (L16, C8) |
| Sample count | 3 (x2 mult) | 6 (1.0 + 0.8 compare) |
| cache_latents_to_disk | true | false (2x faster) |

5 Lessons (From the First Training Failure)

Captions Decide LoRA Quality

Good (djdante style): "trigger Portrait photo, smiling wearing a white collared shirt"
Bad example: "trigger, a young man, front view, upper body, soft lighting, high quality photo"
Don't use quality tags like "soft lighting" or "high quality photo" (djdante doesn't either). Write natural-language sentences, not tag lists.
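A trivial sketch of the caption shape described above. The function name and fields are hypothetical, not dataset_pipeline.py's actual API; the point is trigger word + shot type + expression + clothing, with no quality tags.

```python
def make_caption(trigger, pose="Portrait photo", expression=None, clothing=None):
    """Build a short djdante-style natural-language caption."""
    parts = [f"{trigger} {pose}"]
    if expression:
        parts.append(expression)
    if clothing:
        parts.append(f"wearing {clothing}")
    return ", ".join(parts)

print(make_caption("joseonnamja",
                   expression="smiling at the camera",
                   clothing="a white collared shirt"))
# -> joseonnamja Portrait photo, smiling at the camera, wearing a white collared shirt
```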

Steps Must Scale With Image Count

- djdante baseline: 29 images / 4000 steps = ~276 epochs
- Recommended steps formula: image_count × 138
- Below 29 images, scale steps down proportionally to prevent overfitting.
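The baseline arithmetic as a quick check, assuming epochs are counted as images seen per optimizer step (effective batch = batch_size × gradient_accumulation = 2) divided into the dataset:

```python
def recommended_steps(image_count, steps_per_image=138):
    # djdante baseline: 29 images x 138 ~= 4000 steps
    return image_count * steps_per_image

def epochs(steps, image_count, effective_batch=2):
    # batch_size 1 x gradient_accumulation 2 = 2 images consumed per step
    return steps * effective_batch / image_count

print(recommended_steps(29))    # -> 4002 (~4000)
print(round(epochs(4000, 29)))  # -> 276
print(recommended_steps(16))    # -> 2208
```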

ArcFace vs AdaFace: ArcFace Is Right for LoRA Training

AdaFace: handles low-resolution faces better, but lets low-quality images through, causing noise overfit.
ArcFace: stricter, so only high-quality faces pass, giving cleaner training data. ComfyUI's IP-Adapter / FaceID is also ArcFace-based.

Don't Skip the "Manual Reinforcement" Step in the Pipeline

dataset_pipeline.py prints a warning to "manually reinforce captions with clothing/background details"; ignoring it leads to overfitting.
Florence-2 auto-captioning was designed but never implemented, so substitute manual captioning or an AI vision model.

6 Gemini Suggestions Filtered Out

| Suggestion | Decision | Reason |
|---|---|---|
| Disable walk_seed | Filtered out | djdante used true with strong results; track overfit via the network_multiplier 0.8 comparison instead. |
| Use the same seed for all samples | Filtered out | djdante deliberately uses seeds 42/43/44; evaluate across diverse compositions. |

7 Resolution History vs djdante Original

- cache_latents_to_disk: true → false (root cause of the 2x speed gap)
- Samples: 3 → 6 (added network_multiplier 0.8 for overfit detection)
- batch_size: 2 → 1 + gradient_accumulation 2 (matches djdante's original)
- steps: 4000 → 3000 (scaled to ~234 epochs at 16 images)
- Captions reworked: template tags → djdante-style natural language + explicit clothing (resolved the first-run failure)
- Removed 3 duplicate suit photos: 19 → 16 images (clothing skew 52.6% → 43.75%)

8 File Locations

📁 G:\Work\JoseonPrince\Actor LoRA Dataset\ Project root
📁 G:\Work\JoseonPrince\Actor LoRA Dataset\Male\ Male dataset
📁 G:\Work\JoseonPrince\Actor LoRA Dataset\Female\ Female dataset
📁 G:\StabilityMatrix\Packages\AI-Toolkit\output\ Training output
📄 face_similarity.py ArcFace-based face similarity

Priority Principles

"No errors, clean, quality": quality over speed. Use qfloat8 quantization; the RTX 5090's Blackwell architecture has native FP8 support, so quality loss is minimal and the speedup is large.