Dataset Bible
"The dataset is 95%, the parameters are 5%."
FPHam (500+ LoRAs of experience) — synthesis of CivitAI, HuggingFace, and Reddit/SD community orthodoxy
01 — Optimal Image Count
- 5 or fewer — insufficient. Cannot secure variety in angle/expression.
- 10–30 — optimal. Flux-family orthodoxy. Variety + quality balance.
- 30–50 — caution. Compromising quality just to add count backfires.
- 50+ — risky. Reports of degraded performance (one CivitAI report: a 110-image set did worse than a 16-image set).
Principle: 25 great images > 75 bad. Quality over quantity.
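As a rough illustration, the count bands above can be encoded in a small helper. This is only a sketch: the function name `image_count_band` is hypothetical, the guide does not rate 6–9 images (labelled "unrated" here), and the overlapping boundaries at 30 and 50 are resolved by assumption.

```python
def image_count_band(n: int) -> str:
    """Map a dataset size to the guide's rough image-count bands."""
    if n <= 5:
        return "insufficient"
    if n < 10:
        # Assumption: the guide does not rate 6-9 images.
        return "unrated"
    if n <= 30:
        # Assumption: exactly 30 is treated as optimal, not caution.
        return "optimal"
    if n < 50:
        return "caution"
    # Assumption: "50+" includes exactly 50.
    return "risky"
```

The band is advisory only; the principle above (quality over quantity) still decides borderline cases.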
02 — Golden Rule
Subject (Identity) — bone structure, features, body type → keep consistent. Do not describe in captions.
Everything else — angle, distance, expression, background, lighting, clothing → maximize variety. Describe specifically in captions.
Variety must come from real images. Captions alone cannot break statistical bias.
03 — Variety Checklist (5 Axes)
1. Angle: 5+ front / 3+ side / 1–2 back-quarter
2. Distance: 5+ close-up / 5+ bust / 3–5 full body
3. Expression: 3+ types (smile / neutral / serious)
4. Background: 3+ types (indoor / outdoor / studio)
5. Lighting: 2+ types. Lighting is the #1 cause of the "AI look"!
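The five-axis checklist could be verified mechanically once each image has been tagged. The sketch below assumes a hypothetical per-image dict with the keys `angle`, `distance`, `expression`, `background`, and `lighting`; the tag vocabulary and the function name `variety_gaps` are illustrative, and only the lower bounds from the checklist are checked.

```python
from collections import Counter

# Minimum count per named label (lower bounds taken from the checklist).
MIN_PER_LABEL = {
    "angle": {"front": 5, "side": 3, "back-quarter": 1},
    "distance": {"close-up": 5, "bust": 5, "full body": 3},
}
# Minimum number of distinct types per axis.
MIN_DISTINCT = {"expression": 3, "background": 3, "lighting": 2}


def variety_gaps(images):
    """Return a list of unmet checklist items; an empty list means all axes pass."""
    gaps = []
    for axis, minimums in MIN_PER_LABEL.items():
        counts = Counter(img.get(axis) for img in images)
        for label, minimum in minimums.items():
            if counts[label] < minimum:
                gaps.append(f"{axis}: {label} {counts[label]}/{minimum}")
    for axis, minimum in MIN_DISTINCT.items():
        distinct = {img[axis] for img in images if img.get(axis)}
        if len(distinct) < minimum:
            gaps.append(f"{axis}: {len(distinct)}/{minimum} distinct types")
    return gaps
```

Each reported gap points at something to shoot or collect; as the golden rule says, captions cannot make up for a missing axis.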
04 — Absolute Exclude List
- Low resolution (below 512px): lacks information.
- JPEG compression: block noise is learned as skin texture.
- Beauty filter: destroys high-frequency detail; the model learns a wax-figure look.
- Other faces: identity confusion; the model doesn't know whom to learn.
- Watermarks: semi-transparent patterns that cannot be separated from the image.
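A first-pass filter for these five exclusions might look like the sketch below. Everything here is illustrative: `ImageRecord` and its fields are hypothetical metadata that would have to be filled in by a face detector, an artifact detector, or manual review, and applying the 512px threshold to the shorter side is an assumption (the guide only says "below 512px").

```python
from dataclasses import dataclass

MIN_SIDE_PX = 512  # assumption: threshold applies to the shorter side


@dataclass
class ImageRecord:
    """Hypothetical per-image metadata gathered before filtering."""
    path: str
    width: int
    height: int
    face_count: int = 1          # number of faces found by any detector
    heavy_jpeg: bool = False     # visible block artifacts
    beauty_filter: bool = False  # smoothing or reshaping filter applied
    watermark: bool = False


def exclusion_reasons(rec: ImageRecord) -> list[str]:
    """Return every exclude-list rule the image violates (empty list = keep)."""
    reasons = []
    if min(rec.width, rec.height) < MIN_SIDE_PX:
        reasons.append("low resolution")
    if rec.heavy_jpeg:
        reasons.append("JPEG compression")
    if rec.beauty_filter:
        reasons.append("beauty filter")
    if rec.face_count > 1:
        reasons.append("other faces")
    if rec.watermark:
        reasons.append("watermark")
    return reasons
```

Returning the reasons rather than a bare yes/no makes it easier to see which rule is removing most of the source material.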
05 — Makeup Ratios (for 20 images)
- Bare face: 5–8
- Natural makeup: 3–5
- Full makeup: 2–3
- 100% bare face → the model learns "always bare-faced" and stops responding to makeup prompts.
- 100% makeup → a specific makeup pattern gets baked in.
- Beauty filters → absolutely never, at any ratio.
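The three failure modes above are easy to check from simple counts. The following is a minimal sketch with a hypothetical function name and arguments; it only flags the hard failures named in the guide and does not try to scale the 20-image ranges to other dataset sizes.

```python
def makeup_warnings(bare: int, natural: int, full: int,
                    beauty_filtered: int = 0) -> list[str]:
    """Flag the makeup-mix failure modes described in the guide."""
    warnings = []
    total = bare + natural + full
    if beauty_filtered > 0:
        warnings.append("beauty-filtered images present: remove all of them")
    if total > 0 and bare == total:
        warnings.append("100% bare face: makeup prompts may be ignored")
    if total > 0 and bare == 0:
        warnings.append("100% makeup: a specific pattern may be baked in")
    return warnings
```

For example, a 20-image set with 7 bare-face, 4 natural, and 3 full-makeup shots raises no warning, while an all-bare-face set does.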
06 — Captioning Rules
Caption these (variable elements)
- Gaze direction
- Angle / pose
- Clothing (specifically)
- Background (specifically)
- Lighting (specifically)
- Expression
- Makeup level
- Hairstyle / accessories
Do not caption (fixed identity)
- Eye shape, nose shape
- Face shape, jawline
- Bone structure, body proportions
- Skin tone (natural)
Forbidden Caption Words
Abstract quality: high quality, 8k, masterpiece, detailed
Abstract lighting: soft lighting, cinematic lighting
Subjective evaluation: beautiful, attractive, pretty
Medium quality: realistic, photorealistic, sharp
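A caption linter for the forbidden words could be as small as the sketch below. The term list is copied from the four groups above; the function name `find_forbidden` and the whole-word, case-insensitive matching are assumptions rather than anything the guide prescribes.

```python
import re

# Terms taken from the four forbidden groups listed above.
FORBIDDEN_TERMS = [
    "high quality", "8k", "masterpiece", "detailed",   # abstract quality
    "soft lighting", "cinematic lighting",             # abstract lighting
    "beautiful", "attractive", "pretty",               # subjective evaluation
    "realistic", "photorealistic", "sharp",            # medium quality
]

# Whole-word, case-insensitive match for any forbidden term.
_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(t) for t in FORBIDDEN_TERMS) + r")\b",
    re.IGNORECASE,
)


def find_forbidden(caption: str) -> list[str]:
    """Return the forbidden terms found in a caption, lowercased and sorted."""
    return sorted({m.group(0).lower() for m in _PATTERN.finditer(caption)})
```

Running it over every caption file before training gives a quick answer to the "no quality words in captions" check.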
Caption Detail Level
Insufficient: "wearing gray suit"
Adequate: "wearing fitted charcoal three-piece suit with white dress shirt"
Excessive: "wearing Ermenegildo Zegna 95% wool charcoal suit with 1.5cm pinstripes"
Principle: describe up to what a human looking at the image could distinguish.
07 — Dataset Composition Workflow
- Source collection — real photos (camera/phone), unretouched
- Layer 1 filter — remove "absolute X" (low-res, JPEG, beauty filter, multi-face, watermark)
- Quality score — Laplacian variance, face detection, face ratio, resolution → drop bottom 30%
- Tone consistency — remove LAB-color z-score > 2.0 outliers
- Variety verification — 5-axis checklist → if missing, shoot/collect more
- Captioning — djdante format: trigger + concrete description of variable elements
- Final verification — run dataset_validator.py → iterate if issues
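The quality-score and tone-consistency steps can be sketched in pure Python as follows. This is an illustration under stated assumptions, not the contents of dataset_validator.py: images are assumed to be already decoded into 2D grayscale grids and per-image mean (L, a, b) triples, the function names are hypothetical, only the Laplacian-variance part of the quality score is shown, and the z-score is applied per channel using its absolute value.

```python
from statistics import mean, pstdev, pvariance


def laplacian_variance(gray):
    """Sharpness proxy: variance of the 4-neighbour Laplacian over interior pixels.

    `gray` is a 2D list of intensities and must be at least 3x3.
    """
    responses = [
        gray[y - 1][x] + gray[y + 1][x] + gray[y][x - 1] + gray[y][x + 1]
        - 4 * gray[y][x]
        for y in range(1, len(gray) - 1)
        for x in range(1, len(gray[0]) - 1)
    ]
    return pvariance(responses)


def drop_bottom(scores, percent=30):
    """Return image names sorted by score with the lowest `percent` removed."""
    ranked = sorted(scores, key=scores.get)
    n_drop = len(ranked) * percent // 100
    return ranked[n_drop:]


def tone_outliers(lab_means, z_max=2.0):
    """Return names whose mean L, a, or b lies more than z_max std devs from the set mean."""
    outliers = set()
    if not lab_means:
        return outliers
    for channel in range(3):
        values = [lab[channel] for lab in lab_means.values()]
        mu, sigma = mean(values), pstdev(values)
        if sigma == 0:
            continue  # every image agrees on this channel
        for name, lab in lab_means.items():
            if abs(lab[channel] - mu) / sigma > z_max:
                outliers.add(name)
    return outliers
```

In practice the grids and LAB means would come from an image library; note that with very small datasets a z-score threshold of 2.0 can rarely be exceeded, so this filter is more useful once the candidate pool is larger than the final 10–30 images.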
08 — Pre-Training Verification Checklist
- ☐ Image count in 10–30 range?
- ☐ All 5 absolute-exclude types removed?
- ☐ All 5 variety axes satisfied?
- ☐ No quality words in captions?
- ☐ No fixed-identity descriptions in captions?
- ☐ Bare-face / makeup ratio reasonable? (Not 100% bare face?)
- ☐ num_repeats x image count x steps ≈ 100,000?
- ☐ 5+ close-up shots? (face detail)
Parameters
Optimal LoRA Training Parameters
For Flux.2 family — djdante / FPHam community orthodoxy
- Total exposure: ≈ 100,000
- Formula: images x repeats x steps
- Images: 10–30
- Num repeats: minimum (1–2)
- LoRA strategy: 1 LoRA (unified)
- Supplementary: IP-Adapter / ControlNet
Understanding num_repeats: 30 images x 1 repeat = 30 exposures/step (best, maximum variety). 15 x 2 is OK (second best). 10 x 5 carries overfit risk. Repeating is "a trick to compensate for data scarcity with quantity"; it does not increase variety.
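Taking the total-exposure formula at face value, the step count for a given dataset falls out by simple division. The helper names below are hypothetical, and the 100,000 target is the guide's rule of thumb rather than a property of any particular trainer.

```python
TARGET_EXPOSURE = 100_000  # guide's rule-of-thumb total


def total_exposure(images: int, repeats: int, steps: int) -> int:
    """images x repeats x steps, as defined in the guide."""
    return images * repeats * steps


def steps_for_target(images: int, repeats: int,
                     target: int = TARGET_EXPOSURE) -> int:
    """Steps needed so that images x repeats x steps is close to the target."""
    return round(target / (images * repeats))
```

For instance, `steps_for_target(30, 1)` and `steps_for_target(15, 2)` both give 3333, while `steps_for_target(10, 5)` gives 2000; the totals match, but only the 30-image set supplies 30 distinct images.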
Overfitting vs Underfitting
Underfitting
- Doesn't resemble the subject at certain angles
- Weak identity
- No response to prompts
- Response: add images, increase steps/repeats
Overfitting
- Copies training data
- Pose/background fixed
- Ignores prompts
- Response: add variety, decrease steps, compare checkpoints
Decomposing the "AI Look"
- Lighting mismatch — highest impact. Light that doesn't match the environment (passport flash outdoors?).
- Waxy skin — exclude beauty filters, use unretouched originals.
- Stiff expression — expression variety to learn facial-muscle combos.
- Lifeless eyes — diverse gaze directions + pupil reflections.
- Unnatural pose — pose variety.
- Background mismatch — background variety.