1 Image Count Guide
| Image Count | Rating | Note |
| 5 or fewer | X (insufficient) | Lacking angle/expression |
| 10–30 | Optimal | Flux-family orthodoxy: variety + quality balance |
| 30–50 | Caution | Backfires if quality is compromised |
| 50+ | Risky | Performance degradation (110 < 16) |
Principle: 25 great images > 75 bad. Quality over quantity.
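As a rough illustration, the count bands above can be turned into a small lookup. This is only a sketch: the function name is hypothetical, the guide gives no rating for 6–9 images (returned here as "unspecified"), and treating the shared boundaries 30 and 50 as belonging to the lower band is an assumption.

```python
def rate_image_count(n: int) -> str:
    """Map a dataset size to the guide's rating bands (boundary handling is assumed)."""
    if n < 0:
        raise ValueError("image count cannot be negative")
    if n <= 5:
        return "insufficient"
    if n < 10:
        return "unspecified"  # the guide gives no rating for 6-9 images
    if n <= 30:
        return "optimal"      # assumption: 30 counts as the upper end of "optimal"
    if n <= 50:
        return "caution"      # assumption: 50 counts as the upper end of "caution"
    return "risky"


print(rate_image_count(25))
```

A rating of "optimal" says nothing about quality; the principle above still applies to every image in the set.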
2 Master Principle: Consistency vs Variety (Golden Rule)
Golden Rule of Dataset Composition
Subject (Identity) consistent
Everything else varied
Keep consistent (NO caption)
- Bone structure, features, body type
- Face shape, jawline
- Skin tone (natural)
- Eye shape, nose shape
Maximize variety (caption ✓)
- Angle / distance / expression
- Background / lighting / clothing
- Hairstyle / accessories
- Gaze direction / pose
Caption mechanics: Uncaptioned features are learned as Identity (always reproduced). Captioned features are reproduced only when the matching text is present. Captions cannot break a statistical bias, though: if all 20 images have a white background, captioning "white background" does not unbind it. The dataset itself needs images with diverse backgrounds.
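One way to spot the bias described above is to count how often each caption phrase recurs across the dataset. The sketch below assumes captions are comma-separated phrases. The function name, the `ignore` parameter (for the trigger word, which is expected in every caption), and the 0.8 threshold are all illustrative choices, not values from the guide.

```python
from collections import Counter


def overrepresented_phrases(captions, ignore=(), threshold=0.8):
    """Return {phrase: share} for phrases present in at least `threshold` of captions."""
    if not captions:
        return {}
    skip = {p.lower() for p in ignore}
    counts = Counter()
    for caption in captions:
        # Use a set so a phrase is counted once per caption.
        phrases = {p.strip().lower() for p in caption.split(",") if p.strip()}
        counts.update(phrases - skip)
    total = len(captions)
    return {p: c / total for p, c in counts.items() if c / total >= threshold}


captions = [
    "tok, front view, white background",
    "tok, side view, white background",
    "tok, smiling, white background",
    "tok, front view, park",
]
print(overrepresented_phrases(captions, ignore={"tok"}, threshold=0.75))
```

A phrase flagged here is a hint that the images themselves lack variety on that element, which more captioning will not fix.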
3 Variety Checklist (5 Axes)
| # | Axis | Minimum | Symptom If Missed |
| 1 | Angle | 5+ front / 3+ side / 1–2 back-quarter | Face collapses at certain angles |
| 2 | Distance | 5+ close-up / 5+ bust / 3–5 full body | Lacking face detail or body type unlearned |
| 3 | Expression | 3+ types | Wax-figure expression (facial muscles unlearned) |
| 4 | Background | 3+ types | Identity locked to a specific background |
| 5 | Lighting | 2+ types | #1 cause of "AI look"! Lighting that doesn't match the environment |
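The five-axis minimums can be checked mechanically once each image carries a label per axis. The following is a minimal sketch under stated assumptions: the per-image dictionaries, the label strings ("front", "close-up", and so on), and the function name are hypothetical, and the labels would have to come from manual tagging or a separate tool.

```python
from collections import Counter

# Minimum image counts per label, taken from the checklist.
MIN_PER_LABEL = {
    "angle": {"front": 5, "side": 3, "back-quarter": 1},
    "distance": {"close-up": 5, "bust": 5, "full body": 3},
}
# Minimum number of distinct types, taken from the checklist.
MIN_DISTINCT = {"expression": 3, "background": 3, "lighting": 2}


def check_variety(images):
    """Return a list of human-readable problems; an empty list means all axes pass."""
    problems = []
    for axis, minimums in MIN_PER_LABEL.items():
        counts = Counter(img[axis] for img in images)
        for label, needed in minimums.items():
            if counts[label] < needed:
                problems.append(f"{axis}: {label} has {counts[label]}, needs {needed}+")
    for axis, needed in MIN_DISTINCT.items():
        distinct = len({img[axis] for img in images})
        if distinct < needed:
            problems.append(f"{axis}: {distinct} type(s), needs {needed}+")
    return problems
```

Each reported problem maps directly to an underfilled axis, so the fix is to shoot or collect more images of that type rather than to adjust captions.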
Angle Detail Guide
| Angle | Priority | Coverage |
| 0–30° | Mandatory, primary | Front to slight angle (face detail) |
| 30–45° | Mandatory | Side (profile) |
| 45–70° | 1–2 OK | Back-quarter (skull / ears / neck) |
| 70–90° | Avoid | Pure back of head (zero face info, wasted ROI) |
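If an angle estimate is available per image, the bands above can be applied with a simple lookup. This sketch uses the guide's own degree scale as given; the function name is hypothetical, and assigning each shared boundary (30, 45, 70) to the lower band is an assumption.

```python
def angle_band(degrees: float):
    """Return (band, guidance) for an angle on the guide's 0-90 degree scale."""
    a = abs(degrees)  # treat left and right turns the same way
    if a > 90:
        raise ValueError("the guide only covers 0 to 90 degrees")
    if a <= 30:
        return ("front", "mandatory, primary")
    if a <= 45:
        return ("side", "mandatory")
    if a <= 70:
        return ("back-quarter", "1-2 OK")
    return ("back of head", "avoid")


print(angle_band(60))
```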
4 Absolute Exclude List
Five types that no caption or parameter can rescue.
| # | Type | Why It Cannot Be Rescued |
| 1 | Low resolution (<512px) | Physically lacks information; captions cannot fix |
| 2 | Heavy JPEG compression | Block noise gets learned as skin texture |
| 3 | Beauty filter / skin smoothing | Destroys high-frequency detail → wax-figure learning. Laplacian variance extremely low (12–25 vs normal hundreds–thousands) |
| 4 | Other people's faces in frame | Identity confusion (model doesn't know whom to learn) |
| 5 | Watermarks / logos | Semi-transparent patterns blend into skin/background pixels; inseparable, artifacts get learned |
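The Laplacian-variance figure cited for smoothed skin can be illustrated with a dependency-free sketch. It applies the 4-neighbour Laplacian to the interior pixels of a grayscale grid and takes the variance of the responses. In practice this is more commonly computed with OpenCV (`cv2.Laplacian(gray, cv2.CV_64F).var()`), whose border handling differs, so absolute values will not match exactly. The function names and the reading of "<512px" as the shorter side are assumptions.

```python
from statistics import pvariance


def laplacian_variance(gray):
    """Variance of the 4-neighbour Laplacian over interior pixels of a 2D grayscale grid."""
    height, width = len(gray), len(gray[0])
    if height < 3 or width < 3:
        raise ValueError("need at least a 3x3 image")
    responses = []
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            lap = (gray[y - 1][x] + gray[y + 1][x]
                   + gray[y][x - 1] + gray[y][x + 1]
                   - 4 * gray[y][x])
            responses.append(lap)
    return pvariance(responses)


def too_small(width, height, min_side=512):
    """Assumption: '<512px' refers to the shorter side of the image."""
    return min(width, height) < min_side


flat = [[128] * 5 for _ in range(5)]                                   # no detail at all
checker = [[255 * ((x + y) % 2) for x in range(5)] for y in range(5)]  # maximal detail
print(laplacian_variance(flat), laplacian_variance(checker))
```

A perfectly flat patch scores zero, which is the extreme case of what skin smoothing does; any cutoff between "smoothed" and "normal" depends on resolution and content and has to be calibrated on the actual dataset.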
5 Retouch / Makeup Decision Tree
Does this photo have "retouching"?
Digital retouch (beauty apps / Insta filters)
- Skin smoothing → Always exclude (info destroyed)
- Face reshaping → Always exclude (bone distortion)
- Color/contrast → OK if mild
- Eye enlarge / chin shave → Always exclude
Physical makeup (real makeup)
- Bare face → OK (base face learning)
- Natural makeup → OK + caption "with natural makeup"
- Full makeup → OK + caption "with full makeup, styled hair"
Decision rule: "Did this actually exist in front of the camera?"
Yes → OK (real makeup, real outfit, real lighting) | No → Risky (digital post-processing)
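The decision tree can be written as a small function that returns whether to keep an image and which caption fragment to add. This is a sketch only: the function name and the edit labels are hypothetical, and treating a non-mild color/contrast edit as an exclusion is an assumption, since the guide only says mild edits are OK.

```python
# Digital edits the guide says to always exclude.
ALWAYS_EXCLUDE = {"skin smoothing", "face reshaping", "eye enlarge", "chin shave", "beauty filter"}

# Caption fragment per physical makeup level (bare face gets no fragment).
MAKEUP_CAPTIONS = {
    "bare face": "",
    "natural makeup": "with natural makeup",
    "full makeup": "with full makeup, styled hair",
}


def retouch_decision(digital_edits, makeup="bare face", color_edit_is_mild=True):
    """Return (keep, caption_fragment) following the retouch / makeup tree."""
    edits = {e.lower() for e in digital_edits}
    if edits & ALWAYS_EXCLUDE:
        return (False, "")
    if "color/contrast" in edits and not color_edit_is_mild:
        return (False, "")  # assumption: non-mild grading is treated as an exclusion
    return (True, MAKEUP_CAPTIONS[makeup])


print(retouch_decision([], makeup="natural makeup"))
```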
Recommended Makeup Ratios (for 20 images)
! 100% bare face: "this person = always bare-faced" → no response to hair/makeup prompts
! 100% makeup: specific makeup pattern baked-in → bare face / different makeup impossible
! Beauty filters: always exclude, at any ratio
6 Captioning Rules
"{trigger}, {gaze}, {angle}, {specific clothing}, {specific background}"
// Examples
"joseonnamja, looking at viewer, front view, wearing fitted charcoal suit with white shirt, office lobby with glass walls"
"mzyeoja, smiling, three-quarter view, with natural makeup, wearing cream blouse, studio with gray backdrop"
Caption ✓ (variable elements)
- Gaze direction
- Angle / pose
- Clothing (specific)
- Background (specific)
- Lighting (specific)
- Expression
- Makeup level
- Hairstyle
- Accessories
NO caption (fixed identity)
- Eye shape, nose shape
- Face shape, jawline
- Bone structure, body proportions
- Skin tone (natural)
Boundary cases
- Dyed hair: caption ✓ "with blonde dyed hair"
- Color contacts: caption ✓ "with blue contact lenses"
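The caption template can be assembled with a small helper so that every caption follows the same field order. This is a minimal sketch: the function name and the `extras` parameter (for optional variable elements such as makeup level, placed after the angle as in the second example) are assumptions.

```python
def build_caption(trigger, gaze, angle, clothing, background, extras=()):
    """Join caption fields in template order, skipping empty ones."""
    parts = [trigger, gaze, angle, *extras, clothing, background]
    return ", ".join(p.strip() for p in parts if p and p.strip())


print(build_caption(
    "mzyeoja", "smiling", "three-quarter view",
    "wearing cream blouse", "studio with gray backdrop",
    extras=["with natural makeup"],
))
```

Only variable elements are passed in; fixed-identity features such as face shape are deliberately absent from the signature.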
Forbidden Caption Words
| Type | Examples | Reason |
| Abstract quality | high quality, 8k, masterpiece | Base model already knows; no visual mapping |
| Abstract lighting | soft lighting, cinematic lighting | Quality judgement, not concrete description |
| Subjective evaluation | beautiful, attractive | Subjective, no visual mapping |
| Medium quality | realistic, photorealistic | Resolution is handled by the image itself |
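Captions can be linted against the forbidden words in the table. The sketch below uses only the example words listed there; the function name is hypothetical, and a real word list would likely need extending.

```python
import re

# Example words from the forbidden-words table.
FORBIDDEN = [
    "high quality", "8k", "masterpiece",
    "soft lighting", "cinematic lighting",
    "beautiful", "attractive",
    "realistic", "photorealistic",
]

# Longest first so multi-word phrases win; \b keeps "realistic" from matching
# inside "photorealistic".
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(w) for w in sorted(FORBIDDEN, key=len, reverse=True)) + r")\b",
    re.IGNORECASE,
)


def find_forbidden(caption):
    """Return the forbidden words found in a caption, lowercased, in order of appearance."""
    return [m.group(0).lower() for m in _PATTERN.finditer(caption)]


print(find_forbidden("tok, front view, photorealistic, 8k, wearing red coat"))
```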
Caption Detail Level
Insufficient: "wearing gray suit" — lumps thousands of variants into one, overfits
Adequate: "wearing fitted charcoal three-piece suit with white dress shirt" — visually distinguishable level
Excessive: "wearing Ermenegildo Zegna 95% wool charcoal suit with 1.5cm pinstripes" — brand/fabric = invisible information
Principle: describe up to "what a human looking at the image could distinguish"
7 Overfit vs Undertrain
Undertrain
- Doesn't resemble at certain angles
- Weak Identity
- No prompt response
Response: add images, increase steps/repeat
Optimal
- ID held + diverse outputs
- Responsive to prompts
Right balance
Overfit
- Copies training data
- Pose/background fixed
- Ignores prompts
Response: add variety, decrease steps
Understanding num_repeats
| Scenario | Setup | Note |
| Ideal | 30 images x 1 repeat = 30 exposures/step | Max variety |
| Realistic | 15 images x 2 repeats = 30 exposures/step | Second best |
| Risky | 10 images x 5 repeats = 50 exposures/step | Overfit risk, same image 5x |
djdante reference: total exposure = images x repeats x steps ≈ 100,000
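Taking the reference formula at face value (total exposure = images x repeats x steps), the step count that lands near the 100,000 target can be worked out for each scenario. The function names are hypothetical, and the sketch does not account for batch size or other trainer-specific settings.

```python
TARGET_EXPOSURE = 100_000  # the guide's reference value


def total_exposure(images, repeats, steps):
    """Total exposure as defined in the guide: images x repeats x steps."""
    return images * repeats * steps


def steps_for_target(images, repeats, target=TARGET_EXPOSURE):
    """Steps needed so that images x repeats x steps is approximately `target`."""
    per_step = images * repeats
    if per_step <= 0:
        raise ValueError("images and repeats must be positive")
    return round(target / per_step)


for images, repeats in [(30, 1), (15, 2), (10, 5)]:
    print(images, repeats, steps_for_target(images, repeats))
```

The first two scenarios reach the target with the same step count; the difference is that the second shows each image twice as often, which is why it ranks lower for variety.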
8 Decomposing the "AI Look"
| Cause | Impact | Dataset Response |
| Lighting mismatch | Highest | Lighting variety (natural / indoor / studio) |
| Waxy skin | High | Exclude beauty filters; use unretouched originals |
| Stiff expression | High | Expression variety (learn facial muscle combos) |
| Lifeless eyes | Medium | Diverse gaze directions + pupil reflections |
| Unnatural pose | Medium | Pose variety |
| Background mismatch | Medium | Background variety |
9 1 LoRA vs 2 LoRA
1 LoRA (unified) — orthodox
Pros
- Natural cohesion
- Simple workflow
Cons
- Face resolution drops at full body
Consistency: face 90%+, body 70–80%
2 LoRA (face + body split)
Pros
- Each can learn at optimal resolution
Cons
- Composite seams unnatural
- Tone / lighting mismatch
- Complex workflow
In practice: 1 LoRA + distance variety + IP-Adapter / ControlNet support. No method achieves 100% "single-person" reproduction with LoRA alone.
10 Common Mistakes / Misconceptions
Mistake 1: "If captions are everything, I don't need variety, right?"
X Captions = magical tags
O Captions = weak hints. Images = strong signal. Both needed. Image variety is primary, captions secondary.
Mistake 2: "Bare face only is best"
X 100% bare face = best
O 100% unretouched originals = best (mix of bare face + real makeup). "Bare face" and "unretouched" are different concepts!
Mistake 3: "Forbidden quality words = I can include bad images"
X Forbidden captions = bad images allowed
O Image quality filter (Layer 1) + caption rules (Layer 2) are separate. Only include good images, but don't write quality-judgment words in captions.
Mistake 4: "Always exclude back-of-head"
X 45°+ all out
O 45–70° back-quarter, 1–2 OK (skull / ears / neck = identity). 70°+ pure back-of-head = avoid (wasted ROI).
Mistake 5: "Resolution can be patched with captions"
X Low-res + "8k" caption = OK
O Resolution is a physical limit. Missing pixels don't materialize. One of the 5 "absolute X" types.
Mistake 6: "Passport-photo strategy is best"
X 20 passport-style photos only
O 5–8 passport + 10–15 in varied environments. Passport-only risks lighting/background lock-in.
Mistake 7: "AI look = skin issue"
X Only cause of AI-look = waxy skin
O Lighting mismatch is often the #1 cause. Face ID matches and skin is fine but "something's off" = highly likely lighting.
11 Dataset Composition Workflow
1. Source collection: real photos (camera / phone), unretouched.
2. Layer 1 filter (remove "absolute X"): low-res, JPEG compression, beauty filter, multi-face, watermarks.
3. Quality-score filter: score each image on Laplacian variance (sharpness), face detection confidence, face ratio, and resolution, then drop the bottom 30%.
4. Tone consistency filter: remove outliers with a LAB-color z-score > 2.0.
5. Variety verification (5-axis checklist): angle / distance / expression / background / lighting. For underfilled axes, shoot or collect more of that type.
6. Captioning: ai-vision-mcp or manual. djdante format: trigger + concrete description of variable elements. No fixed-identity descriptions, no quality words.
7. Final verification: run dataset_validator.py and check the report; if there are issues, repeat steps 2–6.
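Steps 3 and 4 of the workflow can be sketched in a few lines. This is an illustration under stated assumptions, not the logic of dataset_validator.py: the function names are hypothetical, the quality score is assumed to be precomputed per image, and the tone filter assumes each image is summarised by its mean (L, a, b) values and is flagged when any channel's z-score exceeds the limit.

```python
from statistics import mean, pstdev


def drop_bottom_percent(scores, percent=30):
    """scores: {name: quality score}. Return the names kept after dropping the lowest `percent`."""
    ranked = sorted(scores, key=scores.get)   # lowest score first
    n_drop = len(ranked) * percent // 100     # integer math avoids float rounding issues
    return sorted(ranked[n_drop:])


def tone_outliers(lab_means, z_max=2.0):
    """lab_means: {name: (L, a, b)}. Return names whose z-score exceeds z_max on any channel."""
    names = list(lab_means)
    flagged = set()
    for channel in range(3):
        values = [lab_means[n][channel] for n in names]
        mu, sigma = mean(values), pstdev(values)
        if sigma == 0:
            continue  # every image is identical on this channel
        for name, value in zip(names, values):
            if abs(value - mu) / sigma > z_max:
                flagged.add(name)
    return sorted(flagged)


lightness = [50, 51, 49, 50, 52, 48, 50, 51, 49, 90]
lab = {f"img{i}": (float(v), 0.0, 0.0) for i, v in enumerate(lightness)}
print(tone_outliers(lab))
```

With small datasets a single extreme image inflates the standard deviation, so the z-score cutoff becomes less sensitive as the outlier gets worse; it is worth reviewing flagged and borderline images by eye.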
12 Pre-Training Verification Checklist
✓ Image count in 10–30 range?
✓ All 5 "absolute X" types removed?
✓ All 5 variety axes satisfied?
✓ No quality words in captions?
✓ No fixed-identity descriptions in captions?
✓ Bare-face / makeup ratio reasonable? (Not 100% bare face?)
✓ num_repeats x image count x steps ≈ 100,000?
✓ 5+ close-up shots? (face detail)