Fovea Vision for Computer Vision

The other morning I was sleeping in a little while the sky was still dark. My wife had woken up earlier than me to tend to our boy, and at some point she needed to turn on the lights in our bedroom.

Something very brief but very cool happened when she turned the lights on. While my eyes were still closed and I was half asleep, the sudden change in the room's brightness created a very interesting pattern in my vision for a moment. I saw a bright solid dot right at the center of my vision, surrounded by rings of pixel-like dots, mostly yellow with some green, blue and red. The rings alternated between thicker ones with fewer "pixels" and thinner, denser ones, repeating outward from the small dot at the center.

 

This gave me the idea of using a similar approach for my visual cortex module, where the center (fovea) has a high pixel resolution for extreme detail, and the outer overlapping rings carry less and less detail. This would ensure that compute is utilised where it's really needed, while still capturing enough detail in the surrounding space to generalise and switch focus.
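
Here's a rough sketch of that ring idea, separate from the controller module further down. The radii, downscale factors and the foveate name are illustrative assumptions, not part of the prototype itself: full detail is kept inside the innermost radius, and each outer ring is filled from a progressively coarser copy of the frame.

import cv2
import numpy as np

def foveate(frame_bgr, cx, cy, radii=(96, 192, 320)):
    """Keep full resolution inside radii[0]; each outer ring around (cx, cy)
    is filled from a progressively coarser (cheaper) copy of the frame."""
    h, w = frame_bgr.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w].astype(np.float32)
    dist = np.sqrt((xs - cx) ** 2 + (ys - cy) ** 2)

    out = frame_bgr.copy()
    for i, r in enumerate(radii):
        factor = 2 ** (i + 1)  # 2x, 4x, 8x... fewer "real" pixels per ring
        coarse = cv2.resize(frame_bgr, (max(1, w // factor), max(1, h // factor)),
                            interpolation=cv2.INTER_AREA)
        coarse = cv2.resize(coarse, (w, h), interpolation=cv2.INTER_LINEAR)
        out[dist >= r] = coarse[dist >= r]
    return out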

 

To test this theory, I built a lightweight Python prototype to start exploring how the model could shift its focus between points of interest, and how effective different approaches could be. This is still very much a prototype which I will continue to work on and push to the GPU once I've found the winning formula, but here's an early view of how the module looks for points of interest and shifts its attention to them.

 

 

As for the human eye, it moves approximately two to three times per second. These rapid movements, called saccades, shift our gaze to new points of interest. Even when focused on a single object, the eyes are constantly making tiny, vibrating movements up to around 100 times a second, so if the motion tracking in the video of my AI shifting its attention looks jittery to you, it's probably less jittery than your eyes reading this post right now!

 

If you're interested and would like to play around with the code yourself, here is the main module, "fovea_controller.py":

# MIT License - Please include reference to DJ-AI.AI - Dayyan James


import argparse
from collections import deque
import cv2
import numpy as np
import time

from face_prior import FacePrior

def clamp(v, lo, hi): return max(lo, min(hi, v))

def gaussian_prior(shape, cx, cy, sigma):
    h, w = shape
    if w == 0 or h == 0:  # guard
        return np.zeros((h, w), dtype=np.float32)
    ys = np.arange(h, dtype=np.float32)
    xs = np.arange(w, dtype=np.float32)
    xx, yy = np.meshgrid(xs, ys)
    g = np.exp(-((xx - cx)**2 + (yy - cy)**2) / (2.0 * sigma * sigma))
    m = g.max()
    if m > 0: g /= m
    return g.astype(np.float32)

class Kalman2D:
    def __init__(self, x0, y0):
        self.kf = cv2.KalmanFilter(4, 2, type=cv2.CV_32F)
        dt = 1.0
        self.kf.transitionMatrix = np.array([[1,0,dt,0],[0,1,0,dt],[0,0,1,0],[0,0,0,1]], np.float32)
        self.kf.measurementMatrix = np.array([[1,0,0,0],[0,1,0,0]], np.float32)
        self.kf.processNoiseCov = np.diag([1e-2,1e-2,1e-1,1e-1]).astype(np.float32)
        self.kf.measurementNoiseCov = np.diag([5e-2,5e-2]).astype(np.float32)
        self.kf.statePost = np.array([[x0],[y0],[0],[0]], np.float32)
        self.kf.errorCovPost = np.eye(4, dtype=np.float32)

    def predict(self):
        p = self.kf.predict()
        return float(p[0,0]), float(p[1,0])

    def correct(self, x, y):
        m = np.array([[np.float32(x)],[np.float32(y)]], np.float32)
        e = self.kf.correct(m)
        return float(e[0,0]), float(e[1,0])

class StableFoveaController:
    def __init__(self, src_w, src_h,
                 proc_width=512, flow_stride=2,
                 min_radius=96, max_radius=256,
                 dwell_frames=4, max_speed_px=40,
                 motion_w=0.8, edge_w=0.2,
                 sal_ema=0.8, zoom_k=0.6, zoom_ema=0.9,
                 stickiness=0.25, stick_sigma_frac=0.08,
                 switch_margin=0.10, roi_frac=0.35):
        self.reset_dims(src_w, src_h, proc_width)
        self.flow_stride = max(1, int(flow_stride))

        self.cx, self.cy = src_w // 2, src_h // 2
        self.min_r, self.max_r = int(min_radius), int(max_radius)
        self.r = int(0.5 * (min_radius + max_radius))

        self.dwell_frames = int(dwell_frames)
        self.max_speed = float(max_speed_px)
        self.motion_w, self.edge_w = float(motion_w), float(edge_w)
        self.sal_ema, self.zoom_k, self.zoom_ema = float(sal_ema), float(zoom_k), float(zoom_ema)

        self.prev_small = None
        self.prev_small_f = None
        self.sal_ema_map = None
        self.dwell_counter = 0
        self.trail = deque(maxlen=24)
        self.zoom_state = 0.5
        self.frame_idx = 0
        self.last_flow_n = None

        self.dis = cv2.DISOpticalFlow_create(cv2.DISOPTICAL_FLOW_PRESET_MEDIUM)

        self.stickiness = float(stickiness)
        self.stick_sigma_frac = float(stick_sigma_frac)
        self.switch_margin = float(switch_margin)
        self.roi_frac = float(roi_frac)

        self.kalman = Kalman2D(self.cx, self.cy)

        self.face_prior = FacePrior(method="haar", mouth_boost=0.9)
        self.face_weight = 0.7      # strong face bias (0.30-0.45 is a softer typical range)
        self.face_blend = "add"     # "add" or "max"
        self.debug_faces = []       # for optional overlay

    def reset_dims(self, src_w, src_h, proc_width):
        # Ensure valid, positive processing size
        self.SW, self.SH = int(max(1, src_w)), int(max(1, src_h))
        self.proc_width = int(max(32, proc_width))
        self.scale = self.proc_width / float(max(1, self.SW))
        self.proc_height = int(max(32, round(self.SH * self.scale)))

    def _to_small(self, gray_src):
        if gray_src is None or gray_src.size == 0:
            return None, None
        # Resize guards
        small = cv2.resize(gray_src, (self.proc_width, self.proc_height), interpolation=cv2.INTER_AREA)
        small_f = (small.astype(np.float32) / 255.0) if small.size else None
        return small, small_f

    def _global_motion_shift(self, prev_f, curr_f):
        try:
            if prev_f is None or curr_f is None: return 0.0, 0.0
            if prev_f.shape != curr_f.shape or prev_f.size == 0 or curr_f.size == 0:
                return 0.0, 0.0
            hann = cv2.createHanningWindow((prev_f.shape[1], prev_f.shape[0]), cv2.CV_32F)
            (dx, dy), _ = cv2.phaseCorrelate(prev_f, curr_f, hann)
            return float(dx), float(dy)
        except Exception:
            return 0.0, 0.0

    def _saliency(self, small_gray, small_float):
        if small_gray is None or small_float is None:
            return None
        # Reset prev if dims changed
        if self.prev_small is not None and self.prev_small.shape != small_gray.shape:
            self.prev_small = None
            self.prev_small_f = None
            self.sal_ema_map = None
            self.last_flow_n = None

        if self.prev_small is None:
            flow_mag_n = np.zeros_like(small_gray, np.float32)
        else:
            gdx, gdy = self._global_motion_shift(self.prev_small_f, small_float)
            if (self.frame_idx % self.flow_stride) == 0:
                flow = self.dis.calc(self.prev_small, small_gray, None)
                fx, fy = flow[..., 0] - gdx, flow[..., 1] - gdy
                flow_mag = np.sqrt(fx*fx + fy*fy).astype(np.float32)
                p1, p99 = np.percentile(flow_mag, 1.0), np.percentile(flow_mag, 99.0)
                if p99 <= p1 + 1e-6:
                    flow_mag_n = np.zeros_like(flow_mag, np.float32)
                else:
                    flow_mag_n = np.clip((flow_mag - p1) / (p99 - p1), 0, 1)
                self.last_flow_n = flow_mag_n
            else:
                flow_mag_n = self.last_flow_n if self.last_flow_n is not None else np.zeros_like(small_gray, np.float32)

        gx = cv2.Sobel(small_gray, cv2.CV_32F, 1, 0, ksize=3)
        gy = cv2.Sobel(small_gray, cv2.CV_32F, 0, 1, ksize=3)
        grad_mag = cv2.magnitude(gx, gy)
        p1, p99 = np.percentile(grad_mag, 1.0), np.percentile(grad_mag, 99.0)
        grad_mag_n = np.zeros_like(grad_mag, np.float32) if p99 <= p1 + 1e-6 else np.clip((grad_mag - p1) / (p99 - p1), 0, 1)

        S = self.motion_w * flow_mag_n + self.edge_w * grad_mag_n
        face_heat, faces_src = self.face_prior.compute_on_small(
            small_gray,
            scale_up=1.0 / self.scale,
            src_shape=(self.SH, self.SW)
        )
        self.debug_faces = faces_src  # keep for overlay/debug

        if face_heat is not None and face_heat.size:
            if self.face_blend == "max":
                S = np.maximum(S, face_heat)  # faces dominate
            else:
                S = (1.0 - self.face_weight) * S + self.face_weight * face_heat  # additive bias
        # Stickiness prior
        scx, scy = self.cx * self.scale, self.cy * self.scale
        sigma = max(4.0, self.stick_sigma_frac * min(self.proc_width, self.proc_height))
        G = gaussian_prior(S.shape, scx, scy, sigma)
        S = (1.0 - self.stickiness) * S + self.stickiness * G

        # Spatial & temporal smoothing
        S = cv2.GaussianBlur(S, (0, 0), sigmaX=1.0, sigmaY=1.0)
        self.sal_ema_map = S if self.sal_ema_map is None else self.sal_ema * self.sal_ema_map + (1.0 - self.sal_ema) * S

        self.prev_small = small_gray
        self.prev_small_f = small_float
        return self.sal_ema_map

    def _centroid(self, S):
        if S is None or S.size == 0 or S.ndim != 2:
            return self.cx, self.cy, 0.0
        scx, scy = self.cx * self.scale, self.cy * self.scale
        roi_r = max(12.0, self.roi_frac * min(self.SW, self.SH) * self.scale)
        x0 = int(clamp(scx - roi_r, 0, S.shape[1]-1)); x1 = int(clamp(scx + roi_r, 0, S.shape[1]-1))
        y0 = int(clamp(scy - roi_r, 0, S.shape[0]-1)); y1 = int(clamp(scy + roi_r, 0, S.shape[0]-1))
        roi = S[y0:y1+1, x0:x1+1]
        roi_mean = float(roi.mean()) if roi.size else 0.0

        thresh = np.percentile(S, 90.0)
        mask = (S >= thresh)
        if not mask.any():
            return self.cx, self.cy, float(S.max() if S.size else 0.0)

        ys, xs = np.nonzero(mask)
        w = S[ys, xs]
        wsum = w.sum()
        x_small = float((xs * w).sum() / max(wsum, 1e-6))
        y_small = float((ys * w).sum() / max(wsum, 1e-6))
        peak = float(S.max())

        # Hysteresis: stay local unless global is clearly better
        if roi.size and peak < (roi_mean * (1.0 + self.switch_margin)):
            ysr, xsr = np.nonzero(roi >= np.percentile(roi, 85.0))
            if ysr.size:
                wr = roi[ysr, xsr]; wrsum = wr.sum()
                xr_small = x0 + float((xsr * wr).sum() / max(wrsum, 1e-6))
                yr_small = y0 + float((ysr * wr).sum() / max(wrsum, 1e-6))
                x_small, y_small = xr_small, yr_small

        return x_small / self.scale, y_small / self.scale, peak

    def _move(self, tx, ty):
        px, py = self.kalman.predict()
        txb, tyb = 0.6 * tx + 0.4 * px, 0.6 * ty + 0.4 * py
        dx, dy = txb - self.cx, tyb - self.cy
        dist = float(np.hypot(dx, dy))
        if dist <= self.max_speed or dist == 0:
            self.cx, self.cy = int(txb), int(tyb)
        else:
            s = self.max_speed / dist
            self.cx += int(dx * s); self.cy += int(dy * s)
        self.cx = int(clamp(self.cx, 0, self.SW - 1))
        self.cy = int(clamp(self.cy, 0, self.SH - 1))
        self.kalman.correct(self.cx, self.cy)

    def _update_zoom(self, peak_sal):
        self.zoom_state = self.zoom_ema * self.zoom_state + (1.0 - self.zoom_ema) * float(peak_sal)
        t = max(0.0, min(1.0, self.zoom_state ** self.zoom_k))
        r_target = self.max_r - t * (self.max_r - self.min_r)
        self.r = int(0.85 * self.r + 0.15 * r_target)
        self.r = int(clamp(self.r, self.min_r, self.max_r))

    def step(self, frame_bgr):
        if frame_bgr is None or frame_bgr.size == 0:
            return {"cx": self.cx, "cy": self.cy, "r": self.r, "zoom_state": float(self.zoom_state),
                    "peak_sal": 0.0, "moved": False, "trail": list(self.trail),
                    "sal_small": np.zeros((80, 160), np.uint8)}
        gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
        small, small_f = self._to_small(gray)
        S = self._saliency(small, small_f)

        tx, ty, peak = self._centroid(S)
        moved = False
        if self.dwell_counter > 0:
            self.dwell_counter -= 1
            self._move(tx, ty)
        else:
            dist = float(np.hypot(tx - self.cx, ty - self.cy))
            if dist > 0.40 * self.r:
                self._move(tx, ty); self.dwell_counter = self.dwell_frames; moved = True
            else:
                self._move(tx, ty)

        self._update_zoom(peak)
        self.trail.append((self.cx, self.cy))
        self.frame_idx += 1

        # HUD: robust resize
        if S is None or S.size == 0 or S.ndim != 2 or S.shape[1] == 0:
            sal_small = np.zeros((80, 160), np.uint8)
        else:
            sh, sw = int(S.shape[0]), int(S.shape[1])
            target_w = 160
            target_h = max(1, int(round(sh * (target_w / float(max(1, sw))))))
            sal_small = cv2.resize((S * 255).astype(np.uint8), (target_w, target_h), interpolation=cv2.INTER_AREA)

        return {"cx": self.cx, "cy": self.cy, "r": self.r, "zoom_state": float(self.zoom_state),
                "peak_sal": float(peak), "moved": moved, "trail": list(self.trail), "sal_small": sal_small}

def draw_overlay(frame, state, rings=2, ring_spacing=0.75):
    cx, cy, r = state["cx"], state["cy"], state["r"]
    cv2.circle(frame, (cx, cy), r, (0, 255, 255), 2, lineType=cv2.LINE_AA)
    cv2.circle(frame, (cx, cy), max(2, r // 12), (0, 255, 255), -1, lineType=cv2.LINE_AA)
    for i in range(1, int(rings)):
        rr = int(r * (1 + i * ring_spacing))
        cv2.circle(frame, (cx, cy), rr, (220, 220, 220), 1, lineType=cv2.LINE_AA)
    for i in range(1, len(state["trail"])):
        cv2.line(frame, state["trail"][i-1], state["trail"][i], (128, 200, 255), 2, lineType=cv2.LINE_AA)

    hud = state["sal_small"]
    if hud is not None and hud.size:
        if hud.ndim == 2:
            hud = cv2.applyColorMap(cv2.cvtColor(hud, cv2.COLOR_GRAY2BGR), cv2.COLORMAP_INFERNO)
        h, w = frame.shape[:2]; sh, sw = hud.shape[:2]
        y0, x0 = max(0, h - sh - 8), 8
        if y0+sh <= h and x0+sw <= w:
            frame[y0:y0+sh, x0:x0+sw] = hud

    txt = f"({cx:4d},{cy:4d}) r={r:3d} zoom={state['zoom_state']:.2f} Smax={state['peak_sal']:.2f}"
    cv2.putText(frame, txt, (10, 26), cv2.FONT_HERSHEY_SIMPLEX, 0.6, (0,0,0), 3, cv2.LINE_AA)
    cv2.putText(frame, txt, (10, 26), cv2.FONT_HERSHEY_SIMPLEX, 0.6, (255,255,255), 1, cv2.LINE_AA)
    return frame

def process_video(input_path, output_path,
                  proc_width=512, flow_stride=2,
                  min_radius=96, max_radius=256,
                  dwell=4, max_speed=40,
                  motion_w=0.8, edge_w=0.2,
                  ring_count=2, ring_spacing=0.75,
                  stickiness=0.25, stick_sigma_frac=0.08,
                  switch_margin=0.10, roi_frac=0.35):
    cap = cv2.VideoCapture(input_path)
    if not cap.isOpened():
        raise RuntimeError(f"Could not open input video: {input_path}")
    W = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH)) or 1
    H = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT)) or 1
    fps = cap.get(cv2.CAP_PROP_FPS)
    fps = float(fps) if fps and fps > 0 else 30.0

    out = cv2.VideoWriter(output_path, cv2.VideoWriter_fourcc(*"mp4v"), fps, (W, H))
    ctrl = StableFoveaController(W, H, proc_width, flow_stride, min_radius, max_radius,
                                 dwell, max_speed, motion_w, edge_w,
                                 sal_ema=0.8, zoom_k=0.6, zoom_ema=0.9,
                                 stickiness=stickiness, stick_sigma_frac=stick_sigma_frac,
                                 switch_margin=switch_margin, roi_frac=roi_frac)


    t0 = time.time()
    frames = 0
    while True:
        ret, frame = cap.read()
        if not ret or frame is None or frame.size == 0:
            break
        # If stream dimensions change mid-video, adapt
        h, w = frame.shape[:2]
        if w != ctrl.SW or h != ctrl.SH:
            ctrl.reset_dims(w, h, proc_width)
        state = ctrl.step(frame)
        draw_overlay(frame, state, rings=ring_count, ring_spacing=ring_spacing)
        out.write(frame)
        frames += 1

    cap.release()
    out.release()

    dur = time.time() - t0
    print(f"Processed {frames} frames in {dur:.2f}s ({frames / max(dur,1e-6):.1f} FPS). Output: {output_path}")


if __name__ == "__main__":
    ap = argparse.ArgumentParser(description="Fast+Stable fovea control with robust guards.")
    ap.add_argument("--input", "-i", required=True)
    ap.add_argument("--output", "-o", required=True)
    ap.add_argument("--proc-width", type=int, default=512)
    ap.add_argument("--flow-stride", type=int, default=2)
    ap.add_argument("--min-radius", type=int, default=96)
    ap.add_argument("--max-radius", type=int, default=256)
    ap.add_argument("--dwell", type=int, default=4)
    ap.add_argument("--max-speed", type=int, default=40)
    ap.add_argument("--motion-w", type=float, default=0.8)
    ap.add_argument("--edge-w", type=float, default=0.2)
    ap.add_argument("--ring-count", type=int, default=2)
    ap.add_argument("--ring-spacing", type=float, default=0.75)
    ap.add_argument("--stickiness", type=float, default=0.25)
    ap.add_argument("--stick-sigma-frac", type=float, default=0.08)
    ap.add_argument("--switch-margin", type=float, default=0.10)
    ap.add_argument("--roi-frac", type=float, default=0.35)
    args = ap.parse_args()

    process_video(args.input, args.output,
                  proc_width=args.proc_width, flow_stride=args.flow_stride,
                  min_radius=args.min_radius, max_radius=args.max_radius,
                  dwell=args.dwell, max_speed=args.max_speed,
                  motion_w=args.motion_w, edge_w=args.edge_w,
                  ring_count=args.ring_count, ring_spacing=args.ring_spacing,
                  stickiness=args.stickiness, stick_sigma_frac=args.stick_sigma_frac,
                  switch_margin=args.switch_margin, roi_frac=args.roi_frac)

And here is an add-on I created to make faces get more focus (you'll need both files for the code above to work).

face_prior.py

 

# MIT License - please include reference to DJ-AI.AI - Dayyan James
import cv2
import numpy as np
from typing import List, Tuple, Optional

def _gaussian_2d(h: int, w: int, cx: float, cy: float, sigma: float) -> np.ndarray:
    """Return a normalized 2D Gaussian heatmap centered at (cx, cy)."""
    if h <= 0 or w <= 0:
        return np.zeros((max(0, h), max(0, w)), dtype=np.float32)
    ys = np.arange(h, dtype=np.float32)
    xs = np.arange(w, dtype=np.float32)
    xx, yy = np.meshgrid(xs, ys)
    g = np.exp(-(((xx - cx) ** 2) + ((yy - cy) ** 2)) / (2.0 * sigma * sigma))
    m = g.max()
    if m > 0:
        g /= m
    return g.astype(np.float32)

class FacePrior:
    """
    Face-prior generator to bias fovea toward faces (and optionally the mouth region).
    Works on the *downscaled* gray frame to be cheap, and returns a heatmap
    that aligns with your saliency map `S` (same HxW).

    Usage:
        prior = FacePrior(method="haar")  # or method="dnn", with model paths
        heat, faces_src = prior.compute_on_small(small_gray, scale_up=1.0/scale, src_shape=(SH, SW))
        S = (1 - w_face)*S + w_face*heat
    """
    def __init__(self,
                 method: str = "haar",
                 haar_face_cascade: Optional[str] = None,
                 haar_mouth_cascade: Optional[str] = None,
                 dnn_proto: Optional[str] = None,
                 dnn_model: Optional[str] = None,
                 dnn_conf_thresh: float = 0.6,
                 min_size_frac: float = 0.06,
                 max_size_frac: float = 0.60,
                 mouth_boost: float = 0.25):
        """
        Args:
          method: "haar" or "dnn"
          haar_face_cascade: path to Haar cascade xml for frontal face (defaults to OpenCV bundled path)
          haar_mouth_cascade: optional mouth cascade (if available)
          dnn_proto, dnn_model: paths for OpenCV DNN face detector (Res10 SSD deploy.prototxt + .caffemodel)
          dnn_conf_thresh: confidence threshold for DNN detections
          min_size_frac / max_size_frac: clamp face sizes (fraction of downscaled shorter side)
          mouth_boost: extra heat around lower half (mouth) region when face is found
        """
        self.method = method.lower().strip()
        self.dnn_conf = float(dnn_conf_thresh)
        self.min_size_frac = float(min_size_frac)
        self.max_size_frac = float(max_size_frac)
        self.mouth_boost = float(mouth_boost)

        self.face_cascade = None
        self.mouth_cascade = None
        self.net = None

        if self.method == "haar":
            # Use bundled cascade if not provided
            if haar_face_cascade is None:
                haar_face_cascade = cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
            self.face_cascade = cv2.CascadeClassifier(haar_face_cascade)
            if haar_mouth_cascade is not None:
                self.mouth_cascade = cv2.CascadeClassifier(haar_mouth_cascade)
        elif self.method == "dnn":
            if not (dnn_proto and dnn_model):
                raise ValueError("DNN method requires dnn_proto (deploy.prototxt) and dnn_model (.caffemodel)")
            self.net = cv2.dnn.readNetFromCaffe(dnn_proto, dnn_model)
        else:
            raise ValueError(f"Unknown method: {method}")

    def _detect_haar(self, small_gray: np.ndarray) -> List[Tuple[int, int, int, int]]:
        h, w = small_gray.shape[:2]
        if h == 0 or w == 0:
            return []
        if self.face_cascade is None or self.face_cascade.empty():
            return []  # cascade failed to load; skip the face bias for this frame
        min_side = min(h, w)
        min_size = max(1, int(self.min_size_frac * min_side))
        max_size = max(min_size, int(self.max_size_frac * min_side))
        faces = self.face_cascade.detectMultiScale(
            small_gray,
            scaleFactor=1.1,
            minNeighbors=4,
            flags=cv2.CASCADE_SCALE_IMAGE,
            minSize=(min_size, min_size),
            maxSize=(max_size, max_size)
        )
        return [(int(x), int(y), int(w_), int(h_)) for (x, y, w_, h_) in faces]

    def _detect_dnn(self, small_gray: np.ndarray) -> List[Tuple[int, int, int, int]]:
        # Run SSD on a BGR image; convert gray->BGR
        h, w = small_gray.shape[:2]
        if h == 0 or w == 0:
            return []
        bgr = cv2.cvtColor(small_gray, cv2.COLOR_GRAY2BGR)
        blob = cv2.dnn.blobFromImage(bgr, 1.0, (300, 300), (104.0, 177.0, 123.0), swapRB=False, crop=False)
        self.net.setInput(blob)
        det = self.net.forward()
        boxes = []
        for i in range(det.shape[2]):
            conf = float(det[0, 0, i, 2])
            if conf < self.dnn_conf:
                continue
            x1 = int(det[0, 0, i, 3] * w)
            y1 = int(det[0, 0, i, 4] * h)
            x2 = int(det[0, 0, i, 5] * w)
            y2 = int(det[0, 0, i, 6] * h)
            x1, y1 = max(0, x1), max(0, y1)
            x2, y2 = min(w - 1, x2), min(h - 1, y2)
            boxes.append((x1, y1, max(1, x2 - x1), max(1, y2 - y1)))
        return boxes

    def compute_on_small(self,
                         small_gray: np.ndarray,
                         scale_up: float,
                         src_shape: Tuple[int, int],
                         return_boxes_src: bool = True
                         ) -> Tuple[np.ndarray, Optional[List[Tuple[int, int, int, int]]]]:
        """
        Args:
          small_gray: downscaled grayscale frame (HxW), same space as your saliency map
          scale_up: factor to map small coords -> source coords (i.e., 1/scale)
          src_shape: (H_src, W_src) of original frame
        Returns:
          heatmap (HxW float32 in [0,1]), faces_src (list of boxes in source coords) if requested
        """
        if small_gray is None or small_gray.ndim != 2 or small_gray.size == 0:
            return np.zeros((0, 0), np.float32), [] if return_boxes_src else None

        if self.method == "haar":
            boxes = self._detect_haar(small_gray)
        else:
            boxes = self._detect_dnn(small_gray)

        h, w = small_gray.shape[:2]
        heat = np.zeros((h, w), dtype=np.float32)
        faces_src = []

        for (x, y, bw, bh) in boxes:
            cx = x + 0.5 * bw
            cy = y + 0.5 * bh
            sigma = 0.35 * max(bw, bh)  # broad to cover the full face
            heat = np.maximum(heat, _gaussian_2d(h, w, cx, cy, sigma))

            # Optional "mouth emphasis": boost lower half
            if self.mouth_cascade is not None or self.mouth_boost > 0:
                mouth_y = y + int(0.65 * bh)  # approximate mouth center
                mouth_sigma = 0.22 * max(bw, bh)
                heat = np.maximum(heat, self.mouth_boost * _gaussian_2d(h, w, cx, mouth_y, mouth_sigma))

            if return_boxes_src:
                # Map to source coords
                sx = int(round(x * scale_up)); sy = int(round(y * scale_up))
                sw = int(round(bw * scale_up)); sh = int(round(bh * scale_up))
                Hs, Ws = src_shape
                sx = max(0, min(Ws - 1, sx)); sy = max(0, min(Hs - 1, sy))
                sw = max(1, min(Ws - sx, sw)); sh = max(1, min(Hs - sy, sh))
                faces_src.append((sx, sy, sw, sh))

        if heat.size:
            m = heat.max()
            if m > 0:
                heat /= m
        return heat.astype(np.float32), faces_src if return_boxes_src else None

Of course, the idea is to have the AI model adjust these settings on the fly: when it wants to read fine text, for example, it will need to reduce the minimum radius for better localised detail, while increasing the edge weight and lowering the motion weight. These are controls I will leave the AI to learn for itself, with some gates in the early days while it's still learning the world and everything is novel.
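
As a minimal sketch of what such a gate could look like (the function names and exact values here are my own assumptions, built on the controller attributes defined above):

def enter_reading_mode(ctrl):
    # Tighten the fovea and favour edges over motion for fine text.
    ctrl.min_r, ctrl.max_r = 72, 192
    ctrl.edge_w, ctrl.motion_w = 0.5, 0.5
    ctrl.stickiness = 0.35        # resist wandering off the text

def exit_reading_mode(ctrl):
    # Restore the defaults used by StableFoveaController.
    ctrl.min_r, ctrl.max_r = 96, 256
    ctrl.edge_w, ctrl.motion_w = 0.2, 0.8
    ctrl.stickiness = 0.25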

 

Have fun!

 

UPDATES:

I had a couple of questions about the various args, so here's a quick helper to get you going:

 

If the fovea is jittery / ping-ponging

Increase --stickiness to 0.30–0.45
Biases toward staying near the current point.

Increase --switch-margin to 0.15–0.30
Requires the new target to be clearly better.

Increase --dwell to 6–10
Holds focus for more frames before big moves.

Lower --max-speed to 20–30
Caps per-frame motion so the center doesn’t lurch.
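
Put together, a steadier run might look something like this (the file names are just placeholders):

process_video("input.mp4", "output_stable.mp4",
              stickiness=0.40, switch_margin=0.20,
              dwell=8, max_speed=25)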

 


If tracking feels sluggish / won’t follow genuine motion

Decrease --stickiness to 0.10–0.20

Decrease --switch-margin to 0.05–0.10

Decrease --dwell to 3–4

Increase --max-speed to 40–60

Increase motion signal: --motion-w up to 0.85–0.95, --edge-w down.

 

If it misses faces / lips

Add the face prior and blend:

Weight: start self.face_weight = 0.30–0.45

Blend: "max" if you want faces to always win, or "add" for softer bias

For mouth emphasis: mouth_boost=0.30–0.50

(If still weak) increase --proc-width to 640–768 so faces survive downscale.
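
For example, a stronger face/mouth bias could be wired up like this (W and H are assumed to come from your capture, as in process_video):

ctrl = StableFoveaController(W, H, proc_width=640)
ctrl.face_prior = FacePrior(method="haar", mouth_boost=0.40)
ctrl.face_weight = 0.35   # within the 0.30-0.45 starting range
ctrl.face_blend = "max"   # faces always win over motion/edges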

 

If it chases background motion (flags, trees, traffic)

Increase --stickiness (0.35–0.50)

Increase --switch-margin (0.20–0.30)

Decrease --motion-w a bit (e.g., 0.65–0.75) and increase --edge-w (0.25–0.35)

Keep face prior on if people are present.

 

If camera shake confuses it (hand-held footage)

Increase --dwell (6–10) and stickiness (0.35–0.50)

Lower --max-speed (20–30)

Raise --switch-margin (0.20–0.30)

 

If you want tight reading / fine detail focus

Decrease --min-radius (e.g., 72–96) and --max-radius (~192–224)

Increase --edge-w (0.35–0.50) and lower --motion-w

Increase --proc-width (640+)

Slightly increase --stickiness (0.30–0.40) to avoid wandering

 

Multiple people in frame (pick one and stick)

Use face prior with "max" blend and moderate weight (0.30–0.40)

Increase --stickiness (0.35–0.45) and --switch-margin (0.15–0.25)

ROI: if you want center bias, raise --roi-frac (0.40–0.50)

 

Fast action / sports

Increase --max-speed (50–80)

Lower --dwell (2–4) and stickiness (0.15–0.25)

Increase --motion-w (0.85–0.95)

Consider --proc-width 640 for better motion detail
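
And as a rough example of a fast-action run using values from those ranges (again, file names are just placeholders):

process_video("match.mp4", "match_fovea.mp4",
              proc_width=640, max_speed=60, dwell=3,
              stickiness=0.20, motion_w=0.9, edge_w=0.1)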