A single video of a golf swing can tell you a lot—but it can never fully answer the question, "Where was that in space?" You see 2D motion, but depth is always a bit of a guess. Was the club truly in-to-out, or does it just look that way from this angle? How far in front of the player did the hands move? How much did the pelvis actually shift toward the target?
Stereo vision is how Penguin starts to close that gap. By using two phones as a stereo pair, we can estimate depth, reconstruct 3D poses, and track the club and body through true 3D motion instead of flat projections.
In this article, we'll unpack what stereo vision actually is, how two normal phones can behave like a calibrated stereo rig, and what kinds of insights this unlocks for real coaching, not just for demos.
What stereo vision actually does
The core idea of stereo vision is deceptively simple: if you see the same point from two slightly different viewpoints, the difference in where it lands on each image tells you how far away it is.
That difference is called disparity. Nearby objects shift noticeably between the two images (large disparity); distant objects barely shift at all (small disparity). Given:
- the separation between cameras (the baseline),
- their internal geometry (the intrinsics), and
- their relative pose (the extrinsics—how one is rotated and translated relative to the other),
we can take corresponding points in the two images and triangulate their 3D location. Do that for joints, club keypoints, and the ball over time, and you get a 3D motion trail, not just a flat drawing.
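For an idealized, rectified stereo pair, the disparity-to-depth relationship is the textbook Z = f · B / d. A minimal sketch, where the focal length, baseline, and disparity values are illustrative numbers rather than anything from Penguin's actual pipeline:

```python
def depth_from_disparity(focal_px: float, baseline_m: float, disparity_px: float) -> float:
    """Depth (meters) of a point from its stereo disparity.

    Assumes an idealized, rectified pair: both image planes parallel,
    separated by `baseline_m` along the horizontal axis.
    """
    return focal_px * baseline_m / disparity_px

# Illustrative numbers: ~1500 px is a plausible phone focal length,
# 0.5 m a plausible two-tripod baseline.
z_near = depth_from_disparity(1500.0, 0.5, 300.0)  # 2.5 m: near the player
z_far = depth_from_disparity(1500.0, 0.5, 75.0)    # 10 m: down the range
```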
Two phones as a stereo pair
Traditional stereo vision systems rely on fixed rigs: calibrated cameras bolted to a metal bar in a lab. That's not realistic for most golf coaches on a range.
Our approach is to treat two normal phones like a flexible stereo rig:
- one phone might be down-the-line,
- the other slightly offset, or angled differently,
- both capturing the same swing at the same time.
The trick is to move from "two random viewpoints" to "two calibrated pinhole cameras with known relative pose." Once we know how the phones are positioned relative to each other, they behave like any other stereo system—just with tripods instead of welded mounts.
Calibration: teaching the system how the phones are arranged
To unlock stereo, we need to estimate the extrinsics: how Camera B is rotated and offset relative to Camera A. There are a few ways to do this in practice:
- Pattern-based calibration: both phones film a known pattern (a checkerboard or calibration card). By detecting that pattern in each view, we can solve for the cameras' relative pose (sketched just after this list).
- Shared scene features: for looser setups, we can align phones using points that appear in both views (e.g., tee marker, ball, alignment sticks) if we know some real-world distances.
- Predefined rigs: in the future, we may offer simple fixtures with known geometry that drastically simplify this step for high-volume programs.
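For the pattern-based route, here's a minimal sketch of what this could look like with OpenCV, assuming each phone's intrinsics (`K_a`, `dist_a`, `K_b`, `dist_b`) are already known from standard single-camera calibration; the checkerboard dimensions are placeholders, and a real pipeline would accumulate many frame pairs rather than one:

```python
import cv2
import numpy as np

PATTERN = (9, 6)    # inner checkerboard corners (placeholder)
SQUARE_M = 0.025    # checkerboard square size in meters (placeholder)

def calibrate_pair(gray_a, gray_b, K_a, dist_a, K_b, dist_b):
    """Estimate phone B's rotation R and translation T relative to
    phone A from one synchronized grayscale frame pair in which both
    phones see the same checkerboard."""
    # Board corner positions in the board's own frame (z = 0), meters.
    obj = np.zeros((PATTERN[0] * PATTERN[1], 3), np.float32)
    obj[:, :2] = np.mgrid[0:PATTERN[0], 0:PATTERN[1]].T.reshape(-1, 2) * SQUARE_M

    ok_a, corners_a = cv2.findChessboardCorners(gray_a, PATTERN)
    ok_b, corners_b = cv2.findChessboardCorners(gray_b, PATTERN)
    assert ok_a and ok_b, "checkerboard must be visible to both phones"

    # CALIB_FIX_INTRINSIC: solve only for the relative pose, keeping
    # each phone's intrinsics fixed.
    _, _, _, _, _, R, T, _, _ = cv2.stereoCalibrate(
        [obj], [corners_a], [corners_b],
        K_a, dist_a, K_b, dist_b,
        gray_a.shape[::-1], flags=cv2.CALIB_FIX_INTRINSIC)
    return R, T
```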
Penguin's vision stack builds on the pinhole camera model: once we know each phone's intrinsics (how it projects rays) and their extrinsics (how those projections relate), we can treat the two-camera system as a single 3D measurement instrument.
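In pinhole terms, each phone maps a world point X to a pixel as x ≃ K(RX + t). A minimal numpy version of that mapping:

```python
import numpy as np

def project(X, K, R, t):
    """Pinhole projection: 3D world point X (shape (3,)) -> pixel (u, v).
    K is the phone's 3x3 intrinsic matrix; (R, t) are its extrinsics,
    the world-to-camera rotation and translation."""
    x_cam = R @ X + t        # world frame -> camera frame
    uvw = K @ x_cam          # camera frame -> homogeneous pixel coords
    return uvw[:2] / uvw[2]  # perspective divide
```

Triangulation, described next, is essentially this mapping run in reverse across two cameras at once.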
Triangulation: turning matching pixels into 3D points
Once the stereo pair is calibrated, the core operation is triangulation.
At a high level, for each point of interest (say, the left wrist):
- we find the corresponding pixel in the left and right camera views,
- we cast a ray from each camera through that pixel into 3D space, and
- we compute where those two rays come closest together—that's our best estimate of the point's 3D position.
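That last step has a small closed form: find the parameters that minimize the distance between the two rays, then take the midpoint of the shortest connecting segment. A minimal numpy sketch, assuming each camera's center and the unit ray direction through the matched pixel have already been recovered from the calibration:

```python
import numpy as np

def triangulate_midpoint(p1, d1, p2, d2):
    """Best 3D estimate of a point seen by two cameras.

    p1, p2: camera centers; d1, d2: unit ray directions through the
    matched pixel in each view (numpy arrays of shape (3,)). Returns
    the midpoint of the shortest segment between the two rays.
    """
    w0 = p1 - p2
    a, b, c = d1 @ d1, d1 @ d2, d2 @ d2
    d, e = d1 @ w0, d2 @ w0
    denom = a * c - b * b          # near zero: rays almost parallel
    s = (b * e - c * d) / denom    # distance along ray 1
    t = (a * e - b * d) / denom    # distance along ray 2
    return ((p1 + s * d1) + (p2 + t * d2)) / 2.0
```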
Do this across time and across keypoints, and we get:
- a 3D skeleton for the player,
- a 3D trajectory for the club, and
- a 3D path for the hands, hips, head, and more.
This is where stereo really shines: instead of reconstructing a 3D swing from a single 2D view plus assumptions, we let the geometry of the cameras and the data from both videos work together.
Depth: turning "it looks in-to-out" into a measurable path
One of the biggest benefits of stereo for golf is simple: honest depth.
With true 3D data, we can:
- distinguish a swing that is genuinely in-to-out from one that only appears so from certain camera angles,
- measure how far behind or in front of the player the hands travel,
- understand how the club approaches the ball in three dimensions, not just how it looks when projected onto a screen.
For coaches, this turns "I think you're too far under plane" into "your club is approaching from X degrees under the reference plane, consistently."
Club path and face: separating motion from projection
In 2D video, perceived club path is tightly tied to camera angle. If the camera drifts off alignment or sits too far inside, what looks like in-to-out might be closer to neutral in reality.
With stereo, we can:
- reconstruct the club's 3D position frame by frame,
- approximate the club's orientation using multiple keypoints (shaft, head, grip), and
- compute path components relative to some stable reference: target line, player's stance line, or a defined swing plane.
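As one illustration of those path components, here's a hedged sketch that reduces triangulated clubhead positions around impact to a horizontal path angle against the target line; the coordinate convention is an assumption for the example, not Penguin's internal one:

```python
import numpy as np

def path_angle_deg(p_before, p_after):
    """Horizontal club path angle (degrees) through impact.

    Convention assumed for this sketch: x points down the target line,
    y is up, z points to the player's 'out' side. Positive angle means
    in-to-out; negative means out-to-in. p_before and p_after are 3D
    clubhead positions one frame either side of impact.
    """
    v = p_after - p_before                     # motion through impact
    return np.degrees(np.arctan2(v[2], v[0]))  # ground-plane angle vs. target line
```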
The goal isn't to reinvent launch monitors, but to bridge the gap between what a coach sees and what the club is truly doing in space. Over time, this opens the door to:
- better explanations ("your path is good, but your low point and face control need work"),
- more precise drills (moving path without guessing), and
- tighter feedback loops when players change feels that affect the club's 3D motion.
3D body motion: hips, ribcage, and “how you’re moving through space”
Stereo isn't just about the club. With two calibrated views, we can also reconstruct a richer 3D model of the body:
- Center of mass shifts: how far the player moves toward or away from the target during the swing.
- Hip and ribcage motion: separation, rotation, and lateral movement in 3D, not just projected angles (one way to measure separation is sketched below).
- Spine and head motion: how the upper body moves relative to the ground and the ball in a more honest coordinate system.
This helps coaches ground common ideas—like "pressure shift" or "staying in your posture"—in more precise motion patterns, while still speaking in familiar language to players.
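To make "separation" concrete, here's a hedged sketch that measures the ground-plane angle between the pelvis line and the shoulder line from triangulated joints; the joint names are placeholders for whatever keypoints the pose model emits:

```python
import numpy as np

def separation_deg(l_hip, r_hip, l_shoulder, r_shoulder):
    """X-factor-style separation: the angle (degrees) between the
    pelvis line and the shoulder line, measured in the ground plane.
    Inputs are triangulated 3D joints (numpy arrays of shape (3,));
    assumes y is the vertical axis."""
    def ground_heading(a, b):
        v = b - a
        return np.arctan2(v[2], v[0])  # drop the vertical component
    diff = ground_heading(l_shoulder, r_shoulder) - ground_heading(l_hip, r_hip)
    # Wrap to (-180, 180] so headings near +/-180 degrees don't spike.
    return np.degrees((diff + np.pi) % (2 * np.pi) - np.pi)
```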
Synchronization: making sure both phones are watching the same swing
For stereo to work, both phones need to be looking at the same moments in time. If one device is a few frames ahead or behind, the 3D reconstruction falls apart.
We handle this in a few ways inside Penguin's architecture:
- Shared session control: both cameras are joined to the same session, and we coordinate capture starts/stops from a single controller.
- Timestamp alignment: we rely on precise timestamps and frame indices rather than just file durations to match pairs of frames across devices.
- Visual cross-checks (future work): in more advanced setups, we can cross-correlate motion (e.g., the clubhead path) across both views to fine-tune alignment; a sketch of the idea follows below.
The aim is to hide this complexity behind the UI: the coach should be able to tap once to capture a stereo swing and trust that the system will line everything up correctly.
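To make that cross-correlation idea concrete, here's a minimal sketch that estimates the frame offset between the two devices from any 1D per-frame motion signal (say, clubhead speed in pixels per frame in each view); the choice of signal is an assumption for illustration:

```python
import numpy as np

def estimate_offset_frames(signal_a, signal_b):
    """Frame offset between two streams via cross-correlation.

    signal_a, signal_b: 1D per-frame motion signals, one per phone.
    Returns the lag k at which the signals align best, such that
    frame n of stream A matches frame n - k of stream B (positive k
    means A's events happen k frames later than B's).
    """
    a = np.asarray(signal_a, float) - np.mean(signal_a)
    b = np.asarray(signal_b, float) - np.mean(signal_b)
    corr = np.correlate(a, b, mode="full")    # score every relative shift
    lags = np.arange(-(len(b) - 1), len(a))   # lag value for each entry
    return lags[np.argmax(corr)]
```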
Practical constraints and tradeoffs
Stereo vision in real-world coaching comes with constraints we actively design around:
- Baseline length: if the phones are too close together, depth estimates get noisy; too far apart, and it becomes hard to find overlapping views. We aim for practical baselines that work on real ranges (see the sketch after this list).
- Occlusions: sometimes one camera can't see a critical point (e.g., the clubhead disappears behind the player). In those moments, we gracefully fall back to single-view reasoning.
- Setup complexity: there's a balance between "just hit record" and "set up a lab." Our goal is to keep stereo optional and progressive: powerful when you want it, invisible when you don't.
- Compute budgets: processing two streams is heavier than one. We combine edge compute, efficient models, and staged analysis so the experience still feels snappy.
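The baseline tradeoff in the first bullet can be made quantitative with the standard rectified-stereo rule of thumb: since Z = f · B / d, a one-pixel disparity error maps to a depth error of roughly Z² / (f · B). A sketch with illustrative numbers only:

```python
def depth_error_m(z_m, focal_px, baseline_m, disparity_err_px=1.0):
    """Approximate depth uncertainty for a rectified stereo pair.

    From Z = f * B / d, an error of `disparity_err_px` pixels in
    disparity maps to roughly Z^2 / (f * B) meters of depth error.
    """
    return (z_m ** 2) / (focal_px * baseline_m) * disparity_err_px

# Illustrative numbers: a player ~3 m away, ~1500 px focal length.
print(depth_error_m(3.0, 1500.0, 0.3))  # ~2 cm with a 0.3 m baseline
print(depth_error_m(3.0, 1500.0, 1.0))  # ~6 mm with a 1.0 m baseline
```

Doubling the baseline halves the depth error at a given distance, which is exactly why "too close together" hurts.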
Stereo vision isn't a magic switch—it's a set of tools that must be deployed thoughtfully so they enhance lessons, not dominate them.
Coaching scenarios where stereo really pays off
Path and low-point work
When you're trying to fix fat/thin patterns or adjust how a player delivers the club, depth matters. Stereo gives us a clearer picture of:
- how the club travels through the impact zone in 3D,
- how far in front of or behind the ball the low point actually is,
- and how those patterns change as the player experiments with new feels.
Face-to-path relationships (within reason)
While we're not replicating radar-level measurements, 3D club motion lets us approximate how the face is moving relative to the path in more useful ways than a single 2D projection.
Motion pattern changes over time
For players making big mechanical changes (e.g., shallowing the shaft, changing pivot, or reworking their pattern out of the top), 3D motion over time provides a clearer record of what actually changed—not just how it looked on video that day.
Where we're taking stereo vision next
Stereo vision is a key piece of the long-term roadmap for Penguin. It's a stepping stone toward:
- More honest 3D overlays: not just drawing lines on top of video, but placing 3D guides and references into the reconstructed swing space.
- Two-phone simulators: combining stereo swing capture with projection math and ball-flight models to simulate shots using only phones and nets—no radar required.
- Better cross-session comparisons: aligning swings in a shared 3D frame so "then vs. now" becomes more about motion and less about camera placement.
Our north star isn't to turn every coach into a vision engineer. It's to let them benefit from stereo-level insights with the tools they already own: a couple of phones, a range, and a player who wants to get better.
Stereo vision is how we turn those phones into something closer to a 3D instrument—quietly doing the math in the background so the coaching conversation can stay front and center.