How 468 Facial Landmarks Decide If You're Passport-Ready
I built a system that checks passport photos against real government requirements for 6 countries using face mesh landmarks, background segmentation, and more rules than I ever wanted to learn about chin positioning.
Why passport photos are harder than you think
Here's a fun experiment. Go take a selfie right now and try to use it as a passport photo. It won't pass. Your head is tilted 3 degrees. The background isn't white enough. Your eyes are slightly closed. The face is 2% too small for Singapore's requirements. And if you used your iPhone front camera, the image is mirrored.
I found this out the hard way when I needed passport photos for multiple countries and kept getting rejected. Each country has different rules: different dimensions, different face-to-frame ratios, different background requirements. Singapore wants 400x514 pixels with your eyes at 42% from the top. The US wants a perfect 600x600 square. The UK wants 900x1200. None of them agree on anything.
So I built Photo ID Studio to automate the whole thing. Upload a photo, pick your country, and the system runs 25+ compliance checks using computer vision, then tells you exactly what's wrong and how to fix it. If you're in "assist" mode, it also crops, straightens, and whitens the background for you.
Beyond solving my own problem, this was a great project to learn practical computer vision: face detection, landmark extraction, image segmentation, color space manipulation, and the art of making algorithms work on messy real-world photos.
Try it yourself
Head to studio.kooexperience.com and upload any photo. The system supports:
- 6 countries: Singapore, United States, United Kingdom, Canada, Australia, and India
- 2 modes: Strict (just tells you what's wrong) and Assist (fixes what it can)
- Beauty options: Color correction and soft-light smoothing
- Mirror handling: Auto-detects and corrects iPhone selfie mirroring
Your photo never leaves memory. Nothing is stored on disk or in a database. Once the response is sent, the image is gone. I take this seriously because nobody wants their face sitting on some random server.
Face detection: 468 points on your face
What is MediaPipe FaceMesh? MediaPipe is Google's open-source framework for on-device ML. FaceMesh is one of its models that detects 468 3D landmarks on a human face in real time. Each landmark is an (x, y, z) coordinate representing a specific point: the tip of your nose, the corner of your left eye, the edge of your jaw, etc.
Why 468 points? Because passport compliance isn't just "is there a face?" It's "are the eyes open? Is the head tilted? Is the mouth closed? How far apart are the eyes? Where exactly is the chin?" You need dense landmarks to answer these questions precisely.
Here's what I extract from those 468 points:
- Eye aspect ratio – the ratio of eye height to width. If it drops below a threshold, the eyes are closed or squinting. Passport offices reject closed eyes.
- Head roll – the angle of the line connecting both eyes. If your head is tilted left or right, this angle deviates from 0. Singapore allows up to 8 degrees.
- Head yaw – is your face turned left or right? Measured by the horizontal offset between your nose tip and the midpoint of your eyes. Beyond a 0.22 ratio, you're too far off-center.
- Head pitch – chin up or down? Measured by the nose-to-mouth distance ratio. Looking up or down beyond a 0.20 ratio fails the check.
- Mouth closure – the distance between upper and lower lip landmarks. Open mouth = neutral expression violation.
- Inter-eye distance – the pixel distance between eye centers. This is the baseline for all crop calculations. If it's too small, your face is too far from the camera.
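To make the first two measurements concrete, here is a minimal sketch of the eye-aspect-ratio and head-roll math, assuming the landmark coordinates have already been pulled out of the mesh. The helper names are mine, not MediaPipe's:

```python
import math

def eye_aspect_ratio(upper, lower, inner, outer):
    """Ratio of vertical eye opening to horizontal eye width.
    Low values indicate a closed or squinting eye.
    Each argument is an (x, y) landmark coordinate."""
    height = math.dist(upper, lower)
    width = math.dist(inner, outer)
    return height / width if width else 0.0

def head_roll_degrees(left_eye_center, right_eye_center):
    """Angle of the inter-eye line relative to horizontal, in degrees.
    0 means the head is perfectly level."""
    dx = right_eye_center[0] - left_eye_center[0]
    dy = right_eye_center[1] - left_eye_center[1]
    return math.degrees(math.atan2(dy, dx))
```

In practice an open eye usually lands around 0.25 or higher on the aspect ratio, but the exact threshold has to be tuned against real photos.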
Why MediaPipe over other options? I considered dlib/face_recognition, OpenCV Haar cascades, and heavier models like RetinaFace. MediaPipe won because: it runs on CPU in 40 to 80 ms (no GPU needed), it gives 468 landmarks (dlib gives 68), it includes a refined iris model, and it's actively maintained by Google. Haar cascades are fast but too brittle for varied poses and lighting. RetinaFace is more accurate but overkill for this use case and much heavier to deploy.
Figure: face landmark mesh, 468 points detected by MediaPipe FaceMesh.
The 25+ things that can go wrong
Every check returns a structured result: pass, fail, warn, or manual review. Each includes a human-readable message and an action telling you what to do. Here's the full checklist grouped by category:
File and metadata
- Is the file format supported? (JPG, PNG, HEIC, WebP)
- Is the file under the size limit? (8 MB for Singapore, 10 MB for most others)
- Is the resolution high enough? (Singapore needs at least 800x1200 input pixels)
- Is the photo recent enough? (Checked via EXIF date if available)
Face geometry
- Was exactly one face detected? (Zero = no face; two or more = group photo)
- Is the face large enough in the frame? (Measured by inter-eye distance)
- Is the face height within the required range for the output crop?
Head pose
- Roll (tilt): within 8 degrees?
- Yaw (turning): within 0.22 ratio?
- Pitch (chin angle): within 0.20 ratio?
Expression and visibility
- Are both eyes clearly open?
- Is the mouth closed? (Neutral expression required)
Image quality
- Sharpness – measured using the Laplacian variance (explained below)
- Lighting uniformity – are there harsh shadows or uneven illumination?
- Text or watermarks – detected via contour analysis
Background and framing
- Is the background white/light enough?
- Is the background uniform (low standard deviation)?
- Are head and shoulders properly visible in the frame?
iPhone-specific
- Is the image mirrored? (Front camera EXIF detection)
- Was mirror correction applied?
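As a sketch of what those structured results could look like in code. The field names and the example threshold are illustrative, not the app's actual schema:

```python
from dataclasses import dataclass
from typing import Literal

Status = Literal["pass", "fail", "warn", "manual"]

@dataclass
class CheckResult:
    check_id: str
    status: Status
    message: str   # human-readable explanation
    action: str    # what the user should do about it

def check_roll(roll_degrees: float, max_roll: float = 8.0) -> CheckResult:
    """One check from the list above: is the head tilt within limits?"""
    if abs(roll_degrees) <= max_roll:
        return CheckResult("head_roll", "pass", "Head is level.", "None")
    return CheckResult(
        "head_roll", "fail",
        f"Head tilted {abs(roll_degrees):.1f} deg (limit {max_roll} deg).",
        "Straighten your head and retake the photo.",
    )
```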
Background removal: the hardest easy problem
What is image segmentation? It's the process of separating the "person" pixels from the "background" pixels. Sounds simple until you try it on a photo with patterned wallpaper, a cat in the background, and hair strands catching the light. Segmentation models produce a mask: each pixel gets a confidence score from 0 (definitely background) to 1 (definitely person).
Photo ID Studio uses a dual-backend approach:
- Primary: rembg with the u2net_human_seg model (~170 MB). This is a neural network specifically trained for human segmentation. It handles complex backgrounds well: bookshelves, outdoor scenes, cluttered rooms. The tradeoff is RAM: it uses about 800 MB when loaded.
- Fallback: MediaPipe SelfieSegmentation (~20 MB). Lighter and faster, but struggles with non-uniform backgrounds. Used when rembg is disabled or the server is low on memory.
Why two backends? Because one size doesn't fit all. rembg is better but heavier. On a 2 GB Railway instance, you can't keep it loaded all the time. So I implemented lazy loading: rembg loads on first use and auto-unloads after 15 minutes of idle. This keeps memory usage manageable while still giving good results when someone actually uses the app. If rembg is unavailable, MediaPipe kicks in seamlessly.
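The lazy-load and idle-unload pattern can be sketched roughly like this. The class, loader, and timings are illustrative, not the actual implementation:

```python
import threading
import time

class LazyModel:
    """Loads a heavy model on first use and drops it after idle_seconds
    without requests, so RAM is only occupied while the app is in use."""

    def __init__(self, loader, idle_seconds=15 * 60):
        self._loader = loader          # e.g. a rembg session factory
        self._idle = idle_seconds
        self._model = None
        self._last_used = 0.0
        self._lock = threading.Lock()

    def get(self):
        """Return the model, loading it on first use."""
        with self._lock:
            if self._model is None:
                self._model = self._loader()
            self._last_used = time.monotonic()
            return self._model

    def reap_if_idle(self):
        """Call periodically from a background timer; unloads when idle."""
        with self._lock:
            idle_for = time.monotonic() - self._last_used
            if self._model is not None and idle_for > self._idle:
                self._model = None   # drop the reference; GC frees the weights
```

A background thread calling reap_if_idle() once a minute is enough; the first request after an unload simply pays the reload cost.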
The background whitening pipeline
Once you have the person mask, making the background white sounds trivial: just set non-person pixels to (255, 255, 255). In practice, the edges are where everything goes wrong. Hair strands, ear edges, and collar boundaries create a transition zone where the mask is uncertain. Naive replacement creates ugly halos.
The actual pipeline is a multi-stage process:
- Mask refinement – Gaussian blur + bilateral filter + morphological operations to smooth the mask edges
- Edge guard computation – gradient-based detection of high-frequency regions (hair detail, fabric texture) that need protection
- Color decontamination – unmixing the old background color from edge pixels so they don't carry a color cast
- Shadow lifting – boosting brightness in the HSV V channel for shadow regions near the person boundary
- Confidence-based blending – pixels with high background confidence get hard-overridden to near-white (RGB 252, 252, 252). Uncertain pixels get a weighted blend.
- Edge artifact suppression – clamping the outer pixel border to prevent seaming artifacts
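Step 5, the confidence-based blend, is the heart of it. Here is a stripped-down NumPy sketch with illustrative thresholds and none of the refinement, decontamination, or shadow stages:

```python
import numpy as np

def blend_to_white(image, mask, hi=0.9, lo=0.4, white=252):
    """Confidence-based background whitening (sketch).
    image: HxWx3 uint8; mask: HxW float in [0, 1], where 1 = person.
    Confident background pixels become near-white; uncertain edge
    pixels (hair, collars) get an alpha blend to avoid halos."""
    img = image.astype(np.float32)
    bg = np.full_like(img, float(white))
    # alpha 1 = keep the person pixel, alpha 0 = full white
    alpha = np.clip((mask - lo) / (hi - lo), 0.0, 1.0)[..., None]
    out = alpha * img + (1.0 - alpha) * bg
    return out.astype(np.uint8)
```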
This might sound over-engineered, but each step exists because of a real failure case I encountered. The first version had green halos on photos taken against grass. The second version had dark shadows along hair boundaries. The third version had visible seams at the image edge. Each bug added a stage to the pipeline.
Country rules: why your Singapore photo won't work in the US
Every country has its own passport photo specification. These aren't suggestions; they're hard requirements enforced by immigration offices. Here's a comparison:
| Country | Output Size | Max File | Min Input | Eye Position | Max Roll |
|---------|-------------|----------|-----------|--------------|----------|
| SG | 400 x 514 | 8 MB | 800 x 1200 | 42% from top | 8 deg |
| US | 600 x 600 | 10 MB | 900 x 900 | varies | varies |
| UK | 900 x 1200 | 10 MB | 1100 x 1400 | varies | varies |
| CA | 826 x 1063 | 10 MB | 1000 x 1300 | varies | varies |
| AU | 826 x 1063 | 10 MB | 1000 x 1300 | varies | varies |
| IN | 413 x 531 | 8 MB | 800 x 1100 | varies | varies |
Notice that the US wants a square photo while everyone else wants a rectangle. Singapore has specific requirements for eye positioning (42% from the top of the frame). The UK needs the highest resolution output. India has the smallest output dimensions.
All of this is stored in a countries.yaml config file. Adding a new country means adding a new YAML block with its requirements. No code changes needed.
# Example: Singapore profile (countries.yaml)
SG:
  output_width: 400
  output_height: 514
  max_file_size_mb: 8
  min_input_width: 800
  min_input_height: 1200
  min_eye_distance_px: 90
  min_face_height_px: 420
  eye_height_fraction_of_height: 0.42
  max_roll_degrees: 8
  max_yaw_ratio: 0.22
  max_pitch_ratio: 0.20
  min_background_brightness: 208
  max_background_saturation: 40
  min_blur_score: 55
  min_even_lighting_score: 0.62
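Consuming a profile is then just dictionary lookups. Here is a sketch of a config-driven pose check, with the profile hardcoded as it would arrive from yaml.safe_load; the function and key names are illustrative:

```python
# Profile dict as it would come from yaml.safe_load("countries.yaml");
# hardcoded here so the sketch is self-contained.
SG = {
    "max_roll_degrees": 8,
    "max_yaw_ratio": 0.22,
    "max_pitch_ratio": 0.20,
}

def run_pose_checks(measurements: dict, profile: dict) -> dict:
    """Compare measured pose values against a country profile.
    Returns {check_name: "pass" | "fail"}."""
    rules = {
        "roll": ("roll_degrees", profile["max_roll_degrees"]),
        "yaw": ("yaw_ratio", profile["max_yaw_ratio"]),
        "pitch": ("pitch_ratio", profile["max_pitch_ratio"]),
    }
    return {
        name: "pass" if abs(measurements[key]) <= limit else "fail"
        for name, (key, limit) in rules.items()
    }
```

Swapping SG for another country's dict changes the thresholds without touching the check logic, which is the whole point of keeping rules in config.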
Design lesson: Putting country rules in config instead of code was one of the best early decisions. When I added India support, it took 10 minutes of YAML editing and zero code changes. Configuration-driven design scales better than hardcoded conditionals.
The iterative crop algorithm
Getting the crop right is surprisingly tricky. The goal: position the person's eyes at exactly the right height in the output image, keep the head centered horizontally, and include enough of the shoulders. Here's how it works:
- Compute initial crop using inter-eye distance as the baseline. The crop width is calculated from the country's output aspect ratio.
- Iterative eye-line recentering (up to 4 passes): extract the face in the provisional crop, recompute the eye-line position, and shift the crop to center the eyes. Stop when the error is less than 1.25 pixels or max iterations are hit.
- Segmentation-aware vertical rebalancing: if the segmentation mask is available, shift the crop up or down to keep both the crown and shoulders visible. This prevents the common issue of cropping off the top of someone's head or losing their shoulders.
- Mild roll straightening: if the head is tilted more than 0.3 degrees, apply a rotation matrix. Pad with reflected borders to avoid white corners.
- Final resize to the country's exact output dimensions using cubic interpolation.
The iterative approach matters because a single-pass crop often gets the eye position wrong by 5 to 10 pixels. That might sound small, but in a 514-pixel-tall Singapore passport photo, even a few pixels off can push the eye position outside the acceptable range. Four passes converge reliably.
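Simplified to the vertical axis, the recentering loop looks roughly like this. The detection callback and names are illustrative:

```python
def recenter_crop_top(detect_eye_y, crop_top, crop_height,
                      target_frac=0.42, tol=1.25, max_iters=4):
    """Iteratively shift a crop so the eye line lands at target_frac of
    the crop height. detect_eye_y(crop_top) stands in for re-running
    face detection on the provisional crop; it returns the eye line's
    absolute y coordinate in the source image."""
    for _ in range(max_iters):
        eye_y = detect_eye_y(crop_top)
        target_y = crop_top + target_frac * crop_height
        error = eye_y - target_y
        if abs(error) < tol:
            break
        crop_top += error  # shift the crop window toward the eye line
    return crop_top
```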
Why I didn't use generative AI
This is a question I get asked. Why not use Stable Diffusion or DALL-E to fix the background, adjust the pose, or enhance the photo? Three reasons:
- Identity fidelity. Passport photos must look exactly like you. Generative models can subtly alter facial features, skin tone, or eye shape. Even a small change could cause problems at border control. For compliance, you need deterministic, non-generative operations that preserve the original pixels.
- Explainability. Every operation in the pipeline is traceable. I can tell you exactly which pixels were changed and why. With a generative model, you get a black box that produces "a nice-looking result" with no guarantees about what was modified.
- Reproducibility. The same input always produces the same output. Generative models have randomness baked in. For a compliance tool, determinism is a feature, not a bug.
The non-generative approach uses classical CV operations: masking, alpha blending, color correction in LAB space, bilateral filtering. These are well-understood, fast, and completely transparent. Sometimes the boring solution is the right one.
Deployment on 2 GB of RAM
Running MediaPipe + rembg + OpenCV on a budget Railway instance (1 vCPU, 2 to 3 GB RAM) required careful memory management. Here's what I learned:
- Lazy loading: rembg's 170 MB model loads on first request, not at startup. This means cold start is fast and the model only occupies RAM when someone actually uses the app.
- Idle unloading: after 15 minutes of no requests, rembg unloads itself and malloc_trim() returns the heap to the OS. The next request takes ~2 seconds longer (model reload), but idle memory drops significantly.
- Processing resolution cap: incoming images are downscaled to at most 1920 px on the long side or 4 megapixels before any processing. This prevents a 48-megapixel phone photo from eating all your RAM during inference.
- Concurrency limiting: a semaphore caps in-flight analysis requests at 3. This prevents queue buildup during traffic spikes.
- Rate limiting: IP-based token bucket with 10 requests/minute burst and 200/day cap. Aggressive, but necessary on shared infrastructure.
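A token bucket is a few lines of state per IP. A minimal sketch with illustrative numbers; the real limiter would also track the daily cap as a second, larger bucket:

```python
import time

class TokenBucket:
    """Per-IP token bucket: a burst of `capacity` requests, refilled at
    a steady rate. One instance per client IP."""

    def __init__(self, capacity=10, refill_per_sec=10 / 60):
        self.capacity = capacity
        self.refill = refill_per_sec
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow(self) -> bool:
        """Spend one token if available; refill based on elapsed time."""
        now = time.monotonic()
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```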
Typical latency: under 800 ms at the 50th percentile, under 1.5 seconds at the 95th. That's fast enough that users don't feel like they're waiting, even with all 25+ checks running on a single CPU core.
Lessons learned
Image quality metrics are surprisingly simple. Blur detection is one line of OpenCV: cv2.Laplacian(gray, cv2.CV_64F).var(). The Laplacian operator detects edges; if the variance is low, the image is blurry. Lighting uniformity splits the image into four quadrants, measures brightness in each, and computes the spread. These aren't deep learning; they're signal processing fundamentals that work reliably.
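Both metrics fit in a few lines. Here is a sketch using plain NumPy so it runs without OpenCV; in the app the blur score comes from cv2.Laplacian directly:

```python
import numpy as np

def laplacian_variance(gray: np.ndarray) -> float:
    """Blur score: variance of the Laplacian response, equivalent in
    spirit to cv2.Laplacian(gray, cv2.CV_64F).var(). Low values mean
    few edges, i.e. a blurry image."""
    g = gray.astype(np.float64)
    # 4-neighbour Laplacian on the interior pixels
    lap = (g[1:-1, :-2] + g[1:-1, 2:] + g[:-2, 1:-1] + g[2:, 1:-1]
           - 4.0 * g[1:-1, 1:-1])
    return float(lap.var())

def lighting_uniformity(gray: np.ndarray) -> float:
    """Split the image into four quadrants, measure mean brightness in
    each, and return the spread (max minus min). Lower is more even."""
    h, w = gray.shape
    quads = [gray[:h // 2, :w // 2], gray[:h // 2, w // 2:],
             gray[h // 2:, :w // 2], gray[h // 2:, w // 2:]]
    means = [float(q.mean()) for q in quads]
    return max(means) - min(means)
```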
Edge cases are where the real work lives. The happy path (well-lit, centered, white background) was easy. The first 80% of the pipeline took 20% of the time. The remaining 20% (iPhone mirroring, green halos, dark hair on dark backgrounds, glasses glare, off-center framing) took the other 80%. If you're building any CV pipeline, budget your time for edge cases, not the main flow.
Configuration beats code for rules that change. Countries update their photo requirements. Having everything in YAML means I can adjust thresholds or add new countries without touching the pipeline code. This separation of rules from logic is one of the most useful patterns in software engineering.
Privacy is a feature, not a checkbox. Processing photos in memory with no persistence isn't just privacy-friendly; it's simpler to build and deploy. No database to manage, no storage to secure, no GDPR deletion requests to handle. Sometimes the most private design is also the simplest.
If you've been putting off getting a proper passport photo, go give the app a try. And if it tells you your head is tilted 9 degrees, don't argue with the math. Just straighten up and retake the shot.