How 468 Facial Landmarks Decide If You're Passport-Ready
I built a system that checks passport photos against real government requirements for 6 countries using face mesh landmarks, background segmentation, and more rules than I ever wanted to learn about chin positioning.
Why passport photos are harder than you think
Here's a fun experiment. Go take a selfie right now and try to use it as a passport photo. It won't pass. Your head is tilted 3 degrees. The background isn't white enough. Your eyes are slightly closed. The face is 2% too small for Singapore's requirements. And if you used your iPhone front camera, the image is mirrored.
I found this out the hard way when I needed passport photos for multiple countries and kept getting rejected. Each country has different rules: different dimensions, different face-to-frame ratios, different background requirements. Singapore wants 400x514 pixels with your eyes at 42% from the top. The US wants a perfect 600x600 square. The UK wants 900x1200. None of them agree on anything.
So I built Photo ID Studio to automate the whole thing. Upload a photo, pick your country, and the system runs 25+ compliance checks using computer vision, then tells you exactly what's wrong and how to fix it. If you're in "assist" mode, it also crops, straightens, and whitens the background for you.
Beyond solving my own problem, this was a great project to learn practical computer vision: face detection, landmark extraction, image segmentation, color space manipulation, and the art of making algorithms work on messy real-world photos.
Try it yourself
Head to studio.kooexperience.com and upload any photo. The system supports:
- 6 countries: Singapore, United States, United Kingdom, Canada, Australia, and India
- 2 modes: Strict (just tells you what's wrong) and Assist (fixes what it can)
- Beauty options: Color correction and soft-light smoothing
- Mirror handling: Auto-detects and corrects iPhone selfie mirroring
Your photo never leaves memory. Nothing is stored on disk or in a database. Once the response is sent, the image is gone. I take this seriously because nobody wants their face sitting on some random server.
Face detection: 468 points on your face
What is MediaPipe FaceMesh? MediaPipe is Google's open-source framework for on-device ML. FaceMesh is one of its models that detects 468 3D landmarks on a human face in real time. Each landmark is an (x, y, z) coordinate representing a specific point: the tip of your nose, the corner of your left eye, the edge of your jaw, etc.
Why 468 points? Because passport compliance isn't just "is there a face?" It's "are the eyes open? Is the head tilted? Is the mouth closed? How far apart are the eyes? Where exactly is the chin?" You need dense landmarks to answer these questions precisely.
Here's what I extract from those 468 points:
- Eye aspect ratio – the ratio of eye height to width. If it drops below a threshold, the eyes are closed or squinting. Passport offices reject closed eyes.
- Head roll – the angle of the line connecting both eyes. If your head is tilted left or right, this angle deviates from 0. Singapore allows up to 8 degrees.
- Head yaw – is your face turned left or right? Measured by the horizontal offset between your nose tip and the midpoint of your eyes. Beyond a 0.22 ratio, you're too far off-center.
- Head pitch – chin up or down? Measured by the nose-to-mouth distance ratio. Looking up or down beyond a 0.20 ratio fails the check.
- Mouth closure – the distance between upper and lower lip landmarks. Open mouth = neutral expression violation.
- Inter-eye distance – the pixel distance between eye centers. This is the baseline for all crop calculations. If it's too small, your face is too far from the camera.
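To make the first two measurements concrete, here is a minimal sketch of the eye-aspect-ratio and head-roll math, assuming the landmark coordinates have already been pulled out of the mesh. The helper names are mine, not MediaPipe's:

```python
import math

def eye_aspect_ratio(upper, lower, inner, outer):
    """Ratio of vertical eye opening to horizontal eye width.
    Low values indicate a closed or squinting eye.
    Each argument is an (x, y) landmark coordinate."""
    height = math.dist(upper, lower)
    width = math.dist(inner, outer)
    return height / width if width else 0.0

def head_roll_degrees(left_eye_center, right_eye_center):
    """Angle of the inter-eye line relative to horizontal, in degrees.
    0 means the head is perfectly level."""
    dx = right_eye_center[0] - left_eye_center[0]
    dy = right_eye_center[1] - left_eye_center[1]
    return math.degrees(math.atan2(dy, dx))
```

In practice an open eye usually lands around 0.25 or higher on the aspect ratio, but the exact threshold has to be tuned against real photos.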
Why MediaPipe over other options? I considered dlib/face_recognition, OpenCV Haar cascades, and heavier models like RetinaFace. MediaPipe won because: it runs on CPU in 40 to 80 ms (no GPU needed), it gives 468 landmarks (dlib gives 68), it includes a refined iris model, and it's actively maintained by Google. Haar cascades are fast but too brittle for varied poses and lighting. RetinaFace is more accurate but overkill for this use case and much heavier to deploy.
Figure: face landmark mesh, 468 points detected by MediaPipe FaceMesh.
The 25+ things that can go wrong
Every check returns a structured result: pass, fail, warn, or manual review. Each includes a human-readable message and an action telling you what to do. Here's the full checklist grouped by category:
File and metadata
- Is the file format supported? (JPG, PNG, HEIC, WebP)
- Is the file under the size limit? (8 MB for Singapore, 10 MB for most others)
- Is the resolution high enough? (Singapore needs at least 800x1200 input pixels)
- Is the photo recent enough? (Checked via EXIF date if available)
Face geometry
- Was exactly one face detected? (Zero = no face; two or more = group photo)
- Is the face large enough in the frame? (Measured by inter-eye distance)
- Is the face height within the required range for the output crop?
Head pose
- Roll (tilt): within 8 degrees?
- Yaw (turning): within 0.22 ratio?
- Pitch (chin angle): within 0.20 ratio?
Expression and visibility
- Are both eyes clearly open?
- Is the mouth closed? (Neutral expression required)
Image quality
- Sharpness – measured using the Laplacian variance (explained below)
- Lighting uniformity – are there harsh shadows or uneven illumination?
- Text or watermarks – detected via contour analysis
Background and framing
- Is the background white/light enough?
- Is the background uniform (low standard deviation)?
- Are head and shoulders properly visible in the frame?
iPhone-specific
- Is the image mirrored? (Front camera EXIF detection)
- Was mirror correction applied?
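As a sketch of what those structured results could look like in code. The field names and the example threshold are illustrative, not the app's actual schema:

```python
from dataclasses import dataclass
from typing import Literal

Status = Literal["pass", "fail", "warn", "manual"]

@dataclass
class CheckResult:
    check_id: str
    status: Status
    message: str   # human-readable explanation
    action: str    # what the user should do about it

def check_roll(roll_degrees: float, max_roll: float = 8.0) -> CheckResult:
    """One check from the list above: is the head tilt within limits?"""
    if abs(roll_degrees) <= max_roll:
        return CheckResult("head_roll", "pass", "Head is level.", "None")
    return CheckResult(
        "head_roll", "fail",
        f"Head tilted {abs(roll_degrees):.1f} deg (limit {max_roll} deg).",
        "Straighten your head and retake the photo.",
    )
```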
Background removal: the hardest easy problem
What is image segmentation? It's the process of separating the "person" pixels from the "background" pixels. Sounds simple until you try it on a photo with patterned wallpaper, a cat in the background, and hair strands catching the light. Segmentation models produce a mask: each pixel gets a confidence score from 0 (definitely background) to 1 (definitely person).
Photo ID Studio uses a dual-backend approach:
- Primary: rembg with the u2net_human_seg model (~170 MB). This is a neural network specifically trained for human segmentation. It handles complex backgrounds well: bookshelves, outdoor scenes, cluttered rooms. The tradeoff is RAM: it uses about 800 MB when loaded.
- Fallback: MediaPipe SelfieSegmentation (~20 MB). Lighter and faster, but struggles with non-uniform backgrounds. Used when rembg is disabled or the server is low on memory.
Why two backends? Because one size doesn't fit all. rembg is better but heavier. On a 2 GB Railway instance, you can't keep it loaded all the time. So I implemented lazy loading: rembg loads on first use and auto-unloads after 15 minutes of idle. This keeps memory usage manageable while still giving good results when someone actually uses the app. If rembg is unavailable, MediaPipe kicks in seamlessly.
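The lazy-load and idle-unload pattern can be sketched roughly like this. The class, loader, and timings are illustrative, not the actual implementation:

```python
import threading
import time

class LazyModel:
    """Loads a heavy model on first use and drops it after idle_seconds
    without requests, so RAM is only occupied while the app is in use."""

    def __init__(self, loader, idle_seconds=15 * 60):
        self._loader = loader          # e.g. a rembg session factory
        self._idle = idle_seconds
        self._model = None
        self._last_used = 0.0
        self._lock = threading.Lock()

    def get(self):
        """Return the model, loading it on first use."""
        with self._lock:
            if self._model is None:
                self._model = self._loader()
            self._last_used = time.monotonic()
            return self._model

    def reap_if_idle(self):
        """Call periodically from a background timer; unloads when idle."""
        with self._lock:
            idle_for = time.monotonic() - self._last_used
            if self._model is not None and idle_for > self._idle:
                self._model = None   # drop the reference; GC frees the weights
```

A background thread calling reap_if_idle() once a minute is enough; the first request after an unload simply pays the reload cost.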
The background whitening pipeline
Once you have the person mask, making the background white sounds trivial: just set non-person pixels to (255, 255, 255). In practice, the edges are where everything goes wrong. Hair strands, ear edges, and collar boundaries create a transition zone where the mask is uncertain. Naive replacement creates ugly halos.
The actual pipeline is a multi-stage process:
- Mask refinement – Gaussian blur + bilateral filter + morphological operations to smooth the mask edges
- Edge guard computation – gradient-based detection of high-frequency regions (hair detail, fabric texture) that need protection
- Color decontamination – unmixing the old background color from edge pixels so they don't carry a color cast
- Shadow lifting – boosting brightness in the HSV V channel for shadow regions near the person boundary
- Confidence-based blending – pixels with high background confidence get hard-overridden to near-white (RGB 252, 252, 252). Uncertain pixels get a weighted blend.
- Edge artifact suppression – clamping the outer pixel border to prevent seaming artifacts
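Step 5, the confidence-based blend, is the heart of it. Here is a stripped-down NumPy sketch with illustrative thresholds and none of the refinement, decontamination, or shadow stages:

```python
import numpy as np

def blend_to_white(image, mask, hi=0.9, lo=0.4, white=252):
    """Confidence-based background whitening (sketch).
    image: HxWx3 uint8; mask: HxW float in [0, 1], where 1 = person.
    Confident background pixels become near-white; uncertain edge
    pixels (hair, collars) get an alpha blend to avoid halos."""
    img = image.astype(np.float32)
    bg = np.full_like(img, float(white))
    # alpha 1 = keep the person pixel, alpha 0 = full white
    alpha = np.clip((mask - lo) / (hi - lo), 0.0, 1.0)[..., None]
    out = alpha * img + (1.0 - alpha) * bg
    return out.astype(np.uint8)
```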
This might sound over-engineered, but each step exists because of a real failure case I encountered. The first version had green halos on photos taken against grass. The second version had dark shadows along hair boundaries. The third version had visible seams at the image edge. Each bug added a stage to the pipeline.
Country rules: why your Singapore photo won't work in the US
Every country has its own passport photo specification. These aren't suggestions; they're hard requirements enforced by immigration offices. Here's a comparison:
| Country | Output Size | Max File | Min Input | Eye Position | Max Roll |
|---------|-------------|----------|-----------|--------------|----------|
| SG | 400 x 514 | 8 MB | 800 x 1200 | 42% from top | 8 deg |
| US | 600 x 600 | 10 MB | 900 x 900 | varies | varies |
| UK | 900 x 1200 | 10 MB | 1100 x 1400 | varies | varies |
| CA | 826 x 1063 | 10 MB | 1000 x 1300 | varies | varies |
| AU | 826 x 1063 | 10 MB | 1000 x 1300 | varies | varies |
| IN | 413 x 531 | 8 MB | 800 x 1100 | varies | varies |
Notice that the US wants a square photo while everyone else wants a rectangle. Singapore has specific requirements for eye positioning (42% from the top of the frame). The UK needs the highest resolution output. India has the smallest output dimensions.
All of this is stored in a countries.yaml config file. Adding a new country means adding a new YAML block with its requirements. No code changes needed.
# Example: Singapore profile (countries.yaml)
SG:
  output_width: 400
  output_height: 514
  max_file_size_mb: 8
  min_input_width: 800
  min_input_height: 1200
  min_eye_distance_px: 90
  min_face_height_px: 420
  eye_height_fraction_of_height: 0.42
  max_roll_degrees: 8
  max_yaw_ratio: 0.22
  max_pitch_ratio: 0.20
  min_background_brightness: 208
  max_background_saturation: 40
  min_blur_score: 55
  min_even_lighting_score: 0.62
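Consuming a profile is then just dictionary lookups. Here is a sketch of a config-driven pose check, with the profile hardcoded as it would arrive from yaml.safe_load; the function and key names are illustrative:

```python
# Profile dict as it would come from yaml.safe_load("countries.yaml");
# hardcoded here so the sketch is self-contained.
SG = {
    "max_roll_degrees": 8,
    "max_yaw_ratio": 0.22,
    "max_pitch_ratio": 0.20,
}

def run_pose_checks(measurements: dict, profile: dict) -> dict:
    """Compare measured pose values against a country profile.
    Returns {check_name: "pass" | "fail"}."""
    rules = {
        "roll": ("roll_degrees", profile["max_roll_degrees"]),
        "yaw": ("yaw_ratio", profile["max_yaw_ratio"]),
        "pitch": ("pitch_ratio", profile["max_pitch_ratio"]),
    }
    return {
        name: "pass" if abs(measurements[key]) <= limit else "fail"
        for name, (key, limit) in rules.items()
    }
```

Swapping SG for another country's dict changes the thresholds without touching the check logic, which is the whole point of keeping rules in config.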
Design lesson: Putting country rules in config instead of code was one of the best early decisions. When I added India support, it took 10 minutes of YAML editing and zero code changes. Configuration-driven design scales better than hardcoded conditionals.
The iterative crop algorithm
Getting the crop right is surprisingly tricky. The goal: position the person's eyes at exactly the right height in the output image, keep the head centered horizontally, and include enough of the shoulders. Here's how it works:
- Compute initial crop using inter-eye distance as the baseline. The crop width is calculated from the country's output aspect ratio.
- Iterative eye-line recentering (up to 4 passes): extract the face in the provisional crop, recompute the eye-line position, and shift the crop to center the eyes. Stop when the error is less than 1.25 pixels or max iterations are hit.
- Segmentation-aware vertical rebalancing: if the segmentation mask is available, shift the crop up or down to keep both the crown and shoulders visible. This prevents the common issue of cropping off the top of someone's head or losing their shoulders.
- Mild roll straightening: if the head is tilted more than 0.3 degrees, apply a rotation matrix. Pad with reflected borders to avoid white corners.
- Final resize to the country's exact output dimensions using cubic interpolation.
The iterative approach matters because a single-pass crop often gets the eye position wrong by 5 to 10 pixels. That might sound small, but in a 514-pixel-tall Singapore passport photo, even a few pixels off can push the eye position outside the acceptable range. Four passes converge reliably.
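Simplified to the vertical axis, the recentering loop looks roughly like this. The detection callback and names are illustrative:

```python
def recenter_crop_top(detect_eye_y, crop_top, crop_height,
                      target_frac=0.42, tol=1.25, max_iters=4):
    """Iteratively shift a crop so the eye line lands at target_frac of
    the crop height. detect_eye_y(crop_top) stands in for re-running
    face detection on the provisional crop; it returns the eye line's
    absolute y coordinate in the source image."""
    for _ in range(max_iters):
        eye_y = detect_eye_y(crop_top)
        target_y = crop_top + target_frac * crop_height
        error = eye_y - target_y
        if abs(error) < tol:
            break
        crop_top += error  # shift the crop window toward the eye line
    return crop_top
```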
Why I didn't use generative AI
This is a question I get asked. Why not use Stable Diffusion or DALL-E to fix the background, adjust the pose, or enhance the photo? Three reasons:
- Identity fidelity. Passport photos must look exactly like you. Generative models can subtly alter facial features, skin tone, or eye shape. Even a small change could cause problems at border control. For compliance, you need deterministic, non-generative operations that preserve the original pixels.
- Explainability. Every operation in the pipeline is traceable. I can tell you exactly which pixels were changed and why. With a generative model, you get a black box that produces "a nice-looking result" with no guarantees about what was modified.
- Reproducibility. The same input always produces the same output. Generative models have randomness baked in. For a compliance tool, determinism is a feature, not a bug.
The non-generative approach uses classical CV operations: masking, alpha blending, color correction in LAB space, bilateral filtering. These are well-understood, fast, and completely transparent. Sometimes the boring solution is the right one.
Deployment on 2 GB of RAM
Running MediaPipe + rembg + OpenCV on a budget Railway instance (1 vCPU, 2 to 3 GB RAM) required careful memory management. Here's what I learned:
- Lazy loading: rembg's 170 MB model loads on first request, not at startup. This means cold start is fast and the model only occupies RAM when someone actually uses the app.
- Idle unloading: after 15 minutes of no requests, rembg unloads itself and malloc_trim() returns the heap to the OS. The next request takes ~2 seconds longer (model reload), but idle memory drops significantly.
- Processing resolution cap: incoming images are downscaled to at most 1920 px on the long side or 4 megapixels before any processing. This prevents a 48-megapixel phone photo from eating all your RAM during inference.
- Concurrency limiting: a semaphore caps in-flight analysis requests at 3. This prevents queue buildup during traffic spikes.
- Rate limiting: IP-based token bucket with 10 requests/minute burst and 200/day cap. Aggressive, but necessary on shared infrastructure.
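A token bucket is a few lines of state per IP. A minimal sketch with illustrative numbers; the real limiter would also track the daily cap as a second, larger bucket:

```python
import time

class TokenBucket:
    """Per-IP token bucket: a burst of `capacity` requests, refilled at
    a steady rate. One instance per client IP."""

    def __init__(self, capacity=10, refill_per_sec=10 / 60):
        self.capacity = capacity
        self.refill = refill_per_sec
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow(self) -> bool:
        """Spend one token if available; refill based on elapsed time."""
        now = time.monotonic()
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```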
Typical latency: under 800 ms at the 50th percentile, under 1.5 seconds at the 95th. That's fast enough that users don't feel like they're waiting, even with all 25+ checks running on a single CPU core.
Lessons learned
Image quality metrics are surprisingly simple. Blur detection is one line of OpenCV: cv2.Laplacian(gray, cv2.CV_64F).var(). The Laplacian operator detects edges; if the variance is low, the image is blurry. Lighting uniformity splits the image into four quadrants, measures brightness in each, and computes the spread. These aren't deep learning; they're signal processing fundamentals that work reliably.
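Both metrics fit in a few lines. Here is a sketch using plain NumPy so it runs without OpenCV; in the app the blur score comes from cv2.Laplacian directly:

```python
import numpy as np

def laplacian_variance(gray: np.ndarray) -> float:
    """Blur score: variance of the Laplacian response, equivalent in
    spirit to cv2.Laplacian(gray, cv2.CV_64F).var(). Low values mean
    few edges, i.e. a blurry image."""
    g = gray.astype(np.float64)
    # 4-neighbour Laplacian on the interior pixels
    lap = (g[1:-1, :-2] + g[1:-1, 2:] + g[:-2, 1:-1] + g[2:, 1:-1]
           - 4.0 * g[1:-1, 1:-1])
    return float(lap.var())

def lighting_uniformity(gray: np.ndarray) -> float:
    """Split the image into four quadrants, measure mean brightness in
    each, and return the spread (max minus min). Lower is more even."""
    h, w = gray.shape
    quads = [gray[:h // 2, :w // 2], gray[:h // 2, w // 2:],
             gray[h // 2:, :w // 2], gray[h // 2:, w // 2:]]
    means = [float(q.mean()) for q in quads]
    return max(means) - min(means)
```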
Edge cases are where the real work lives. The happy path (well-lit, centered, white background) was easy. The first 80% of the pipeline took 20% of the time. The remaining 20% (iPhone mirroring, green halos, dark hair on dark backgrounds, glasses glare, off-center framing) took the other 80%. If you're building any CV pipeline, budget your time for edge cases, not the main flow.
Configuration beats code for rules that change. Countries update their photo requirements. Having everything in YAML means I can adjust thresholds or add new countries without touching the pipeline code. This separation of rules from logic is one of the most useful patterns in software engineering.
Privacy is a feature, not a checkbox. Processing photos in memory with no persistence isn't just privacy-friendly; it's simpler to build and deploy. No database to manage, no storage to secure, no GDPR deletion requests to handle. Sometimes the most private design is also the simplest.
If you've been putting off getting a proper passport photo, go give the app a try. And if it tells you your head is tilted 9 degrees, don't argue with the math. Just straighten up and retake the shot.