
How 468 Facial Landmarks Decide If You're Passport-Ready

I built a system that checks passport photos against real government requirements for 6 countries using face mesh landmarks, background segmentation, and more rules than I ever wanted to learn about chin positioning.

Tech Stack
Python FastAPI MediaPipe OpenCV rembg Pillow Railway

Why passport photos are harder than you think

Here's a fun experiment. Go take a selfie right now and try to use it as a passport photo. It won't pass. Your head is tilted 3 degrees. The background isn't white enough. Your eyes are slightly closed. The face is 2% too small for Singapore's requirements. And if you used your iPhone front camera, the image is mirrored.

I found this out the hard way when I needed passport photos for multiple countries and kept getting rejected. Each country has different rules: different dimensions, different face-to-frame ratios, different background requirements. Singapore wants 400x514 pixels with your eyes at 42% from the top. The US wants a perfect 600x600 square. The UK wants 900x1200. None of them agree on anything.

So I built Photo ID Studio to automate the whole thing. Upload a photo, pick your country, and the system runs 25+ compliance checks using computer vision, then tells you exactly what's wrong and how to fix it. If you're in "assist" mode, it also crops, straightens, and whitens the background for you.

Beyond solving my own problem, this was a great project to learn practical computer vision: face detection, landmark extraction, image segmentation, color space manipulation, and the art of making algorithms work on messy real-world photos.

Try it yourself

Head to studio.kooexperience.com and upload any photo. The system supports six countries: Singapore, the US, the UK, Canada, Australia, and India.

Your photo never leaves memory. Nothing is stored on disk or in a database. Once the response is sent, the image is gone. I take this seriously because nobody wants their face sitting on some random server.

Pro tip: Use a well-lit photo against a plain wall. Face the camera straight on. Don't smile (most countries require a neutral expression). And for the love of all things good, don't use a bathroom mirror selfie.

Face detection: 468 points on your face

What is MediaPipe FaceMesh? MediaPipe is Google's open-source framework for on-device ML. FaceMesh is one of its models that detects 468 3D landmarks on a human face in real time. Each landmark is an (x, y, z) coordinate representing a specific point: the tip of your nose, the corner of your left eye, the edge of your jaw, etc.

Why 468 points? Because passport compliance isn't just "is there a face?" It's "are the eyes open? Is the head tilted? Is the mouth closed? How far apart are the eyes? Where exactly is the chin?" You need dense landmarks to answer these questions precisely.

Here's what I extract from those 468 points: the eye centers and inter-eye distance, the eye-line height, head pose (roll, yaw, pitch), eye openness, mouth state, and the chin and crown positions used for framing.

Why MediaPipe over other options? I considered dlib/face_recognition, OpenCV Haar cascades, and heavier models like RetinaFace. MediaPipe won because: it runs on CPU in 40 to 80 ms (no GPU needed), it gives 468 landmarks (dlib gives 68), it includes a refined iris model, and it's actively maintained by Google. Haar cascades are fast but too brittle for varied poses and lighting. RetinaFace is more accurate but overkill for this use case and much heavier to deploy.
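Most of the pose checks reduce to simple geometry on those landmark coordinates. A minimal sketch, assuming the landmarks have already been extracted and are passed in as plain (x, y) pixel tuples (the exact indices and thresholds the app uses may differ):

```python
import math

def head_roll_degrees(left_eye, right_eye):
    """Roll = angle of the line through the eye centers (0 = level head)."""
    dx = right_eye[0] - left_eye[0]
    dy = right_eye[1] - left_eye[1]
    return math.degrees(math.atan2(dy, dx))

def yaw_ratio(left_eye, right_eye, nose_tip):
    """Yaw proxy: horizontal nose-tip offset from the eye midpoint,
    normalized by inter-eye distance (0 = facing the camera straight on)."""
    mid_x = (left_eye[0] + right_eye[0]) / 2
    eye_dist = math.dist(left_eye, right_eye)
    return abs(nose_tip[0] - mid_x) / eye_dist

# Level head, nose centered: both metrics should be ~0
assert abs(head_roll_degrees((100, 200), (200, 200))) < 1e-9
assert yaw_ratio((100, 200), (200, 200), (150, 250)) == 0.0
```

A profile's `max_roll_degrees` or `max_yaw_ratio` threshold is then just a comparison against these numbers.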

[Figure: face landmark mesh, 468 points detected by MediaPipe FaceMesh]

The 25+ things that can go wrong

Every check returns a structured result: pass, fail, warn, or manual review. Each includes a human-readable message and an action telling you what to do. The checks fall into seven categories: file and metadata, face geometry, head pose, expression and visibility, image quality, background and framing, and iPhone-specific quirks (like front-camera mirroring).

Background removal: the hardest easy problem

What is image segmentation? It's the process of separating the "person" pixels from the "background" pixels. Sounds simple until you try it on a photo with patterned wallpaper, a cat in the background, and hair strands catching the light. Segmentation models produce a mask: each pixel gets a confidence score from 0 (definitely background) to 1 (definitely person).

Photo ID Studio uses a dual-backend approach: rembg (higher quality, heavier) as the primary backend, with MediaPipe's selfie segmentation as a lightweight fallback.

Why two backends? Because one size doesn't fit all. rembg is better but heavier. On a 2 GB Railway instance, you can't keep it loaded all the time. So I implemented lazy loading: rembg loads on first use and auto-unloads after 15 minutes of idle. This keeps memory usage manageable while still giving good results when someone actually uses the app. If rembg is unavailable, MediaPipe kicks in seamlessly.
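One way to sketch the lazy-load/auto-unload pattern (the 15-minute idle timeout is from the text; the wrapper itself is illustrative, not the app's actual code, and the `loader` callable stands in for whatever builds the rembg session):

```python
import threading
import time

class LazyBackend:
    """Load an expensive model on first use; drop it after an idle timeout."""

    def __init__(self, loader, idle_seconds=15 * 60):
        self._loader = loader          # e.g. a function that builds a rembg session
        self._idle = idle_seconds
        self._model = None
        self._last_used = 0.0
        self._lock = threading.Lock()

    def get(self):
        with self._lock:
            if self._model is None:
                self._model = self._loader()   # first use pays the load cost
            self._last_used = time.monotonic()
            return self._model

    def reap_if_idle(self):
        """Call periodically (e.g. from a background thread) to free memory."""
        with self._lock:
            if self._model is not None and time.monotonic() - self._last_used > self._idle:
                self._model = None             # lets the model be garbage-collected

backend = LazyBackend(loader=lambda: "rembg-session", idle_seconds=0.1)
assert backend.get() == "rembg-session"
```

The fallback then becomes a simple branch: if `get()` raises (rembg not installed or out of memory), route the request to the MediaPipe path instead.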

The background whitening pipeline

Once you have the person mask, making the background white sounds trivial: just set non-person pixels to (255, 255, 255). In practice, the edges are where everything goes wrong. Hair strands, ear edges, and collar boundaries create a transition zone where the mask is uncertain. Naive replacement creates ugly halos.

The actual pipeline is a multi-stage process:

  1. Mask refinement – Gaussian blur + bilateral filter + morphological operations to smooth the mask edges
  2. Edge guard computation – gradient-based detection of high-frequency regions (hair detail, fabric texture) that need protection
  3. Color decontamination – unmixing the old background color from edge pixels so they don't carry a color cast
  4. Shadow lifting – boosting brightness in the HSV V channel for shadow regions near the person boundary
  5. Confidence-based blending – pixels with high background confidence get hard-overridden to near-white (RGB 252, 252, 252). Uncertain pixels get a weighted blend.
  6. Edge artifact suppression – clamping the outer pixel border to prevent seaming artifacts

This might sound over-engineered, but each step exists because of a real failure case I encountered. The first version had green halos on photos taken against grass. The second version had dark shadows along hair boundaries. The third version had visible seams at the image edge. Each bug added a stage to the pipeline.
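Stage 5 is the heart of the pipeline. A minimal numpy sketch, assuming a float mask in [0, 1] where 1 means "person" (the hard-override threshold and the near-white value 252 are illustrative, matching the description above rather than the app's exact constants):

```python
import numpy as np

def blend_to_white(image, person_mask, hard_bg=0.1, white=252):
    """Composite the image over a near-white background using a soft mask.

    image:       HxWx3 uint8
    person_mask: HxW float in [0, 1], 1.0 = definitely person
    """
    alpha = person_mask[..., None].astype(np.float32)
    # Hard override: pixels the mask is confident are background snap straight
    # to white, which kills faint ghosting from the old background color.
    alpha[person_mask[..., None] < hard_bg] = 0.0
    bg = np.full_like(image, white)
    out = image.astype(np.float32) * alpha + bg.astype(np.float32) * (1 - alpha)
    return out.round().astype(np.uint8)

img = np.zeros((2, 2, 3), np.uint8)            # all-black "person" pixels
mask = np.array([[1.0, 0.0], [0.5, 0.05]])     # confident / background / uncertain / near-bg
out = blend_to_white(img, mask)
assert out[0, 0, 0] == 0      # confident person pixel kept as-is
assert out[0, 1, 0] == 252    # background snapped to near-white
assert out[1, 0, 0] == 126    # uncertain edge pixel gets a weighted blend
```

The other stages (mask refinement, decontamination, shadow lifting) all run before this blend, so by the time it executes the mask edges are already smooth.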

[Interactive figure: background whitening pipeline, 6 stages from raw mask to clean white background]

Country rules: why your Singapore photo won't work in the US

Every country has its own passport photo specification. These aren't suggestions; they're hard requirements enforced by immigration offices. Here's a comparison:

Country   Output Size    Max File   Min Input     Eye Position   Max Roll
SG        400 x 514      8 MB      800 x 1200    42% from top   8 deg
US        600 x 600     10 MB      900 x 900     varies         varies
UK        900 x 1200    10 MB     1100 x 1400    varies         varies
CA        826 x 1063    10 MB     1000 x 1300    varies         varies
AU        826 x 1063    10 MB     1000 x 1300    varies         varies
IN        413 x 531      8 MB      800 x 1100    varies         varies

Notice that the US wants a square photo while everyone else wants a rectangle. Singapore has specific requirements for eye positioning (42% from the top of the frame). The UK needs the highest resolution output. India has the smallest output dimensions.

All of this is stored in a countries.yaml config file. Adding a new country means adding a new YAML block with its requirements. No code changes needed.

# Example: Singapore profile (countries.yaml)
SG:
  output_width: 400
  output_height: 514
  max_file_size_mb: 8
  min_input_width: 800
  min_input_height: 1200
  min_eye_distance_px: 90
  min_face_height_px: 420
  eye_height_fraction_of_height: 0.42
  max_roll_degrees: 8
  max_yaw_ratio: 0.22
  max_pitch_ratio: 0.20
  min_background_brightness: 208
  max_background_saturation: 40
  min_blur_score: 55
  min_even_lighting_score: 0.62

Design lesson: Putting country rules in config instead of code was one of the best early decisions. When I added India support, it took 10 minutes of YAML editing and zero code changes. Configuration-driven design scales better than hardcoded conditionals.
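To make the pattern concrete, here is a sketch of how a profile like the SG block above might drive a single check (the function name, tolerance, and result shape are hypothetical; the real app's check structure may differ):

```python
# Hypothetical config-driven check: is the eye line at the right height?
SG_PROFILE = {
    "output_height": 514,
    "eye_height_fraction_of_height": 0.42,
}

def check_eye_position(profile, eye_y_px, tolerance_frac=0.03):
    """Compare the detected eye-line height against the country's target."""
    target = profile["eye_height_fraction_of_height"] * profile["output_height"]
    error = abs(eye_y_px - target)
    ok = error <= tolerance_frac * profile["output_height"]
    return {
        "status": "pass" if ok else "fail",
        "message": f"Eye line at {eye_y_px}px, target {target:.0f}px",
        "action": None if ok else "Recrop so the eyes sit at the required height",
    }

assert check_eye_position(SG_PROFILE, 216)["status"] == "pass"   # 0.42 * 514 ≈ 216
assert check_eye_position(SG_PROFILE, 300)["status"] == "fail"
```

Adding a country then really is just data: a new profile dict (or YAML block) flows through the same function untouched.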

The iterative crop algorithm

Getting the crop right is surprisingly tricky. The goal: position the person's eyes at exactly the right height in the output image, keep the head centered horizontally, and include enough of the shoulders. Here's how it works:

  1. Compute initial crop using inter-eye distance as the baseline. The crop width is calculated from the country's output aspect ratio.
  2. Iterative eye-line recentering (up to 4 passes): extract the face in the provisional crop, recompute the eye-line position, and shift the crop to center the eyes. Stop when the error is less than 1.25 pixels or max iterations are hit.
  3. Segmentation-aware vertical rebalancing: if the segmentation mask is available, shift the crop up or down to keep both the crown and shoulders visible. This prevents the common issue of cropping off the top of someone's head or losing their shoulders.
  4. Mild roll straightening: if the head is tilted more than 0.3 degrees, apply a rotation matrix. Pad with reflected borders to avoid white corners.
  5. Final resize to the country's exact output dimensions using cubic interpolation.

The iterative approach matters because a single-pass crop often gets the eye position wrong by 5 to 10 pixels. That might sound small, but in a 514-pixel-tall Singapore passport photo, even a few pixels off can push the eye position outside the acceptable range. Four passes converge reliably.
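The recentering loop from step 2 can be sketched like this (the `detect_eye_y` callback stands in for re-running landmark detection on the provisional crop; the 1.25 px tolerance and 4-pass cap are from the description above):

```python
def recenter_crop(crop_top, target_eye_y, detect_eye_y, tol_px=1.25, max_passes=4):
    """Shift the crop window until the detected eye line lands on the target.

    detect_eye_y(crop_top) -> eye-line y *within* the crop at that position.
    """
    for _ in range(max_passes):
        eye_y = detect_eye_y(crop_top)
        error = eye_y - target_eye_y
        if abs(error) < tol_px:
            break
        crop_top += error    # eyes too low in the crop -> move the window down
    return crop_top

# Toy detector: eyes sit at absolute y=500, so eye y in crop = 500 - crop_top.
crop_top = recenter_crop(0, target_eye_y=216, detect_eye_y=lambda top: 500 - top)
assert crop_top == 284              # 500 - 216
assert (500 - crop_top) == 216      # eyes now at the target height in the crop
```

With a well-behaved detector the loop converges in one or two passes; the extra passes absorb the cases where re-detection on the shifted crop moves the landmarks slightly.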

Why I didn't use generative AI

This is a question I get asked. Why not use Stable Diffusion or DALL-E to fix the background, adjust the pose, or enhance the photo? Three reasons:

  1. Identity fidelity. Passport photos must look exactly like you. Generative models can subtly alter facial features, skin tone, or eye shape. Even a small change could cause problems at border control. For compliance, you need deterministic, non-generative operations that preserve the original pixels.
  2. Explainability. Every operation in the pipeline is traceable. I can tell you exactly which pixels were changed and why. With a generative model, you get a black box that produces "a nice-looking result" with no guarantees about what was modified.
  3. Reproducibility. The same input always produces the same output. Generative models have randomness baked in. For a compliance tool, determinism is a feature, not a bug.

The non-generative approach uses classical CV operations: masking, alpha blending, color correction in LAB space, bilateral filtering. These are well-understood, fast, and completely transparent. Sometimes the boring solution is the right one.

Deployment on 2 GB of RAM

Running MediaPipe + rembg + OpenCV on a budget Railway instance (1 vCPU, 2 to 3 GB RAM) required careful memory management: lazy-load the heavy rembg model on first use, unload it after 15 minutes of idle, fall back to MediaPipe segmentation when rembg isn't available, and keep every photo in memory rather than touching disk.

Typical latency: under 800 ms at the 50th percentile, under 1.5 seconds at the 95th. That's fast enough that users don't feel like they're waiting, even with all 25+ checks running on a single CPU core.

Lessons learned

Image quality metrics are surprisingly simple. Blur detection is one line of OpenCV: cv2.Laplacian(gray, cv2.CV_64F).var(). The Laplacian operator detects edges; if the variance is low, the image is blurry. Lighting uniformity splits the image into four quadrants, measures brightness in each, and computes the spread. These aren't deep learning; they're signal processing fundamentals that work reliably.
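Both metrics can be sketched without OpenCV at all. A numpy-only version (the 3x3 kernel below matches the classic 4-neighbour Laplacian; the quadrant spread is the uniformity idea described above, with thresholds left to the country profile):

```python
import numpy as np

def laplacian_variance(gray):
    """Blur score: variance of the 3x3 Laplacian response.
    Equivalent in spirit to cv2.Laplacian(gray, cv2.CV_64F).var()."""
    g = gray.astype(np.float64)
    # 4-neighbour Laplacian on the interior (kernel [[0,1,0],[1,-4,1],[0,1,0]])
    lap = g[:-2, 1:-1] + g[2:, 1:-1] + g[1:-1, :-2] + g[1:-1, 2:] - 4 * g[1:-1, 1:-1]
    return lap.var()

def lighting_spread(gray):
    """Lighting uniformity: brightness spread across the four quadrants."""
    h, w = gray.shape
    quads = [gray[:h//2, :w//2], gray[:h//2, w//2:],
             gray[h//2:, :w//2], gray[h//2:, w//2:]]
    means = [float(q.mean()) for q in quads]
    return max(means) - min(means)

rng = np.random.default_rng(0)
sharp = rng.integers(0, 256, (64, 64)).astype(np.uint8)   # lots of edges
flat = np.full((64, 64), 128, np.uint8)                   # no edges at all
assert laplacian_variance(sharp) > laplacian_variance(flat)
assert laplacian_variance(flat) == 0.0
assert lighting_spread(flat) == 0.0
```

A blurry photo scores low on the first metric; a photo lit from one side scores high on the second.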

Edge cases are where the real work lives. The happy path (well-lit, centered, white background) was easy. The first 80% of the pipeline took 20% of the time. The remaining 20% (iPhone mirroring, green halos, dark hair on dark backgrounds, glasses glare, off-center framing) took the other 80%. If you're building any CV pipeline, budget your time for edge cases, not the main flow.

Configuration beats code for rules that change. Countries update their photo requirements. Having everything in YAML means I can adjust thresholds or add new countries without touching the pipeline code. This separation of rules from logic is one of the most useful patterns in software engineering.

Privacy is a feature, not a checkbox. Processing photos in memory with no persistence isn't just privacy-friendly; it's simpler to build and deploy. No database to manage, no storage to secure, no GDPR deletion requests to handle. Sometimes the most private design is also the simplest.

If you've been putting off getting a proper passport photo, go give the app a try. And if it tells you your head is tilted 9 degrees, don't argue with the math. Just straighten up and retake the shot.

