1 Project Overview

- Goal: Real-person LoRA training
- Architecture: Blackwell FP8
- Environment: Windows 11 + StabilityMatrix
- Target dataset: ~29 curated images
2 Reference: djdante's Validated Settings

djdante's Key Insights

1. Klein 9B = organic realism: better than Z-Image at lens distortion, skin texture, and overall realism.
2. Has structural weaknesses: extra limbs, helmet structure errors, etc.; avoidable via a chest-up strategy.
3. 29 chest-up images = best consistency: the most consistent results were achieved with all 29 images chest-up.
4. Short, concise captions: roughly "djdanteman Portrait photo, smiling at the camera wearing a white collared shirt".
5. Face only = 29 images suffice: for face + body, far more images are required.
djdante's Original Config (FluxTraining.yaml)

```yaml
trigger_word: "djdanteman"
images: 29                  # all chest-up
num_repeats: 1
batch_size: 1
gradient_accumulation: 2    # effective batch = 2
steps: 4000                 # ~276 epochs
lr: 0.00008
optimizer: adamw
dtype: bf16
quantize: true              # qfloat8
model: black-forest-labs/FLUX.2-klein-base-9B
ema: true                   # decay 0.99
save_every: 100
sample_every: 100
sample_steps: 50
walk_seed: true
resolution: [1024, 768]
network: lora               # linear 16, conv 8
```
Sample Structure for Overfit Detection

- Prompt A: seed 42, network_multiplier 1.0 (default) vs 0.8 (compare)
- Prompt B: seed 43, network_multiplier 1.0 (default) vs 0.8 (compare)
- Prompt C: seed 44, network_multiplier 1.0 (default) vs 0.8 (compare)
The moment the 0.8 samples start looking better than the 1.0 samples, the overfit threshold has been crossed; the checkpoint just before that point is optimal.
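One way to encode this pairing in a training config is to render every prompt twice at the same seed, once per multiplier. The fragment below is only an illustrative sketch of that pairing, not the exact AI-Toolkit sample schema (field names may differ by version; check your installed version's config reference):

```yaml
# Illustrative sketch: three prompts, each sampled twice at the same
# seed, at network_multiplier 1.0 and 0.8, for side-by-side comparison.
samples:
  - { prompt: "Prompt A", seed: 42, network_multiplier: 1.0 }
  - { prompt: "Prompt A", seed: 42, network_multiplier: 0.8 }
  - { prompt: "Prompt B", seed: 43, network_multiplier: 1.0 }
  - { prompt: "Prompt B", seed: 43, network_multiplier: 0.8 }
  - { prompt: "Prompt C", seed: 44, network_multiplier: 1.0 }
  - { prompt: "Prompt C", seed: 44, network_multiplier: 0.8 }
```

Keeping the seed fixed within each pair isolates the multiplier as the only variable, which is what makes the 1.0-vs-0.8 comparison meaningful.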
3 Bulk Photos → 29 Curated Images Pipeline

1. Receive bulk photos
2. ArcFace identity check
3. YOLOv8 chest-up crop
4. MagFace quality score
5. imagededup deduplication
6. Manual final selection
7. Florence-2 captioning
8. AI-Toolkit training

Tool roles:

- ArcFace: drop images with cosine similarity < 0.4
- YOLOv8: face detection → chest-up crop
- imagededup: cut images with 90%+ similarity
- Florence-2: auto-caption + trigger word
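Step 2 (the ArcFace identity check) reduces, in essence, to a cosine-similarity gate over face embeddings. A minimal sketch, assuming embeddings have already been extracted by an ArcFace model; the helper names (`cosine_similarity`, `filter_by_identity`) are illustrative, while the 0.4 cutoff mirrors the threshold above:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def filter_by_identity(candidates, reference, threshold=0.4):
    """Keep only candidates whose embedding is at least `threshold`
    cosine-similar to the reference identity embedding."""
    return {name: emb for name, emb in candidates.items()
            if cosine_similarity(emb, reference) >= threshold}
```

The same similarity function also explains why a stricter threshold trims the dataset harder: raising it toward 1.0 keeps only near-identical matches to the reference face.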
4 djdante Original vs Our Settings

| Parameter | djdante original | Our settings (v3) |
| --- | --- | --- |
| trigger_word | djdanteman | joseonnamja / mzyeoja |
| Image count | 29 | 16 (male) / 12 (female) |
| num_repeats | 1 | 1 (male) / 2 (female) |
| batch_size | 1 | 1 |
| gradient_accumulation | 2 | 2 |
| steps | 4000 | 3000 (scaled to image count) |
| lr | 8e-5 | 8e-5 |
| optimizer | adamw | adamw |
| dtype | bf16 | bf16 |
| quantize | qfloat8 | qfloat8 |
| EMA | true (0.99) | true (0.99) |
| save_every | 100 | 100 |
| sample_every | 100 | 100 |
| sample_steps | 50 | 50 |
| resolution | [1024, 768] | [1024, 768] |
| network | lora (L16, C8) | lora (L16, C8) |
| Sample count | 3 (x2 mult) | 6 (1.0 + 0.8 compare) |
| cache_latents_to_disk | true | false (2x faster) |
5 Lessons (From the First Training Failure)
Captions Decide LoRA Quality
Good (djdante style): "trigger Portrait photo, smiling wearing a white collared shirt"
Bad: "trigger, a young man, front view, upper body, soft lighting, high quality photo"
Avoid quality tags like "soft lighting" or "high quality photo" (djdante doesn't use them either). Write natural-language sentences, not tag lists.
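The good/bad contrast above can be enforced mechanically with a small helper that assembles djdante-style captions from a trigger word and a few free-text attributes. The function and its fields are illustrative, not part of any tool in this project:

```python
def build_caption(trigger, subject="Portrait photo", action="", clothing=""):
    """Assemble a short natural-language caption in the form
    '<trigger> Portrait photo, <action> wearing <clothing>'.
    Quality tags ('soft lighting', 'high quality photo') are
    deliberately not supported."""
    parts = [f"{trigger} {subject}"]
    wearing = f"wearing {clothing}" if clothing else ""
    detail = " ".join(p for p in (action, wearing) if p)
    if detail:
        parts.append(detail)
    return ", ".join(parts)
```

For example, `build_caption("djdanteman", action="smiling at the camera", clothing="a white collared shirt")` reproduces the reference caption above.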
Steps Must Scale With Image Count
djdante baseline: 29 images / 4000 steps = ~276 epochs
Recommended steps formula: steps = image_count × 138
Below 29 images, scale steps down proportionally to prevent overfit.
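Spelled out: each optimizer step consumes batch_size × gradient_accumulation = 2 images, so 4000 steps over 29 images is 4000 × 2 / 29 ≈ 276 epochs, and holding that epoch count fixed yields the image_count × 138 rule. A quick sketch (helper names are illustrative):

```python
def recommended_steps(image_count, steps_per_image=138):
    """Scale total steps linearly with dataset size (image_count x 138)."""
    return image_count * steps_per_image

def effective_epochs(steps, images, batch_size=1, grad_accum=2, num_repeats=1):
    """Epochs = images consumed per optimizer step, divided by the
    (repeat-expanded) dataset size, times the step count."""
    return steps * batch_size * grad_accum / (images * num_repeats)
```

At the djdante baseline, `recommended_steps(29)` gives 4002 (≈ the 4000 used) and `effective_epochs(4000, 29)` comes out to roughly 276.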
ArcFace vs AdaFace: ArcFace Is Right for LoRA Training
AdaFace handles low-resolution faces better but lets low-quality images through, risking overfitting to noise.
ArcFace is stricter, so only high-quality faces pass and the training data stays cleaner. ComfyUI IP-Adapter / FaceID is also ArcFace-based.
Don't Skip the "Manual Reinforcement" Step in the Pipeline
dataset_pipeline.py prints a "manually reinforce captions with clothing/background" warning; ignoring it leads to overfitting.
Florence-2 auto-captioning was designed but never implemented, so substitute manual captions or an AI vision model.
6 Gemini Suggestions Filtered Out

| Suggestion | Decision | Reason |
| --- | --- | --- |
| Disable walk_seed | Filtered | djdante used true with strong results. Track overfit via the network_multiplier 0.8 comparison instead. |
| Use the same seed for all samples | Filtered | djdante uses seeds 42/43/44 deliberately, to evaluate across diverse compositions. |
7 Resolution History vs djdante Original

- cache_latents_to_disk: true → false (root cause of the 2x speed gap)
- Samples: 3 → 6 (added network_multiplier 0.8 comparisons for overfit detection)
- batch_size: 2 → 1 + gradient_accumulation 2 (matches the djdante original)
- steps: 4000 → 3000 (scaled to ~234 epochs at 16 images)
- Captions reworked: template tags → djdante-style natural language + explicit clothing (resolved the first-run failure)
- Removed 3 duplicate suit photos: 19 → 16 images (clothing skew 52.6% → 43.75%)
8 File Locations
📁 G:\Work\JoseonPrince\Actor LoRA Dataset\ Project root
📁 G:\Work\JoseonPrince\Actor LoRA Dataset\Male\ Male dataset
📁 G:\Work\JoseonPrince\Actor LoRA Dataset\Female\ Female dataset
📁 G:\StabilityMatrix\Packages\AI-Toolkit\output\ Training output
📄 face_similarity.py ArcFace-based face similarity
Priority Principles
"No errors, clean, quality" — quality over speed. Use qfloat8 quantization (RTX 5090 Blackwell native FP8 → minimal quality loss, large speedup).