Technology II — Perception

Vision AI & Perception

Multi-modal perception pipeline that fuses depth, RGB, and proprioception into structured scene understanding — the eyes, ears, and spatial awareness of embodied intelligence.

The Challenge

Seeing Is Not Understanding

A camera captures pixels. A depth sensor measures distances. But embodied intelligence needs more than raw data — it needs to understand what it sees: where objects are in 3D space, what they are, how they relate to each other, and what they mean for the task at hand.

The SYNAPEX perception system fuses multiple sensor modalities into a unified scene representation that the brain modules can reason about, plan with, and act on. This is not just computer vision — it is the perceptual foundation of autonomous existence.

Our approach is fundamentally different from single-model vision systems. Perception is decomposed into specialised stages, each producing structured output that feeds the next — creating a pipeline that is interpretable, modular, and debuggable.


From Raw Sensors to Scene Understanding

Seven layers of processing, each adding semantic richness to the raw input.

01
Sensor Fusion
RGB cameras + depth sensors + proprioceptive data aligned into a unified spatio-temporal frame. Time-synchronised multi-modal input.
02
Feature Extraction
Backbone network produces dense feature maps. Multi-scale representation capturing both fine detail and global context. Pre-trained on massive datasets, fine-tuned for embodied tasks.
03
Object Detection & Segmentation
Instance-level detection and pixel-precise segmentation. Every object identified, located in 3D, and tracked across frames. Real-time panoptic segmentation.
04
Depth Estimation & 3D Reconstruction
Dense depth maps refined with monocular depth estimation. Point cloud generation. Local 3D mesh reconstruction for manipulation targets.
05
Spatial Relationship Mapping
Scene graph construction: which objects are near which, what supports what, what occludes what. Semantic spatial reasoning for navigation and manipulation.
06
Context & Intent Estimation
What is happening in this scene? Activity recognition, human pose estimation, gesture detection. Predicting where things are going, not just where they are.
07
Structured Scene Representation
Final output: a machine-readable scene graph with 3D positions, object identities, relationships, and context. Ready for the Reasoning and Planning modules of the AI brain.
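The final stage's machine-readable scene graph can be pictured as a small data structure. The sketch below is illustrative only: the class names, fields, and predicates (`SceneObject`, `Relation`, `"on"`, `"near"`) are assumptions, not the actual SYNAPEX schema.

```python
from __future__ import annotations
from dataclasses import dataclass, field

@dataclass
class SceneObject:
    """One detected instance with its identity and 3D pose."""
    obj_id: int
    label: str                             # e.g. "cup", "table"
    position: tuple[float, float, float]   # metres, in the robot's frame
    confidence: float

@dataclass
class Relation:
    """A spatial relationship between two objects (stage 05's output)."""
    subject_id: int
    predicate: str                         # e.g. "on", "near", "occludes"
    object_id: int

@dataclass
class SceneGraph:
    """Structured scene representation handed to reasoning and planning."""
    timestamp: float
    objects: list[SceneObject] = field(default_factory=list)
    relations: list[Relation] = field(default_factory=list)

    def neighbours(self, obj_id: int, predicate: str) -> list[int]:
        """IDs of objects related to obj_id by the given predicate."""
        return [r.object_id for r in self.relations
                if r.subject_id == obj_id and r.predicate == predicate]

# Example: a cup resting on a table.
graph = SceneGraph(timestamp=12.5)
graph.objects += [
    SceneObject(1, "cup", (0.42, -0.10, 0.80), 0.97),
    SceneObject(2, "table", (0.40, 0.00, 0.75), 0.99),
]
graph.relations.append(Relation(1, "on", 2))
print(graph.neighbours(1, "on"))  # [2]
```

A planner can then query the graph ("what is the cup resting on?") without ever touching raw pixels, which is what makes the pipeline interpretable and debuggable.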

Multi-Modal Awareness

📷
RGB Vision
High-resolution stereo cameras for colour, texture, and pattern recognition. The primary source of semantic information about the external world.
🛰
Depth Sensing
Structured-light and time-of-flight sensors for precise distance measurement. Creates a 3D point cloud of the environment at every frame.
📡
Proprioception
Internal body state from the muscle system: joint positions, forces, velocities. Fused with external sensors for a complete understanding of the body in its environment.
🎙
Audio Processing
Directional microphone array for sound localisation, speech recognition, and environmental audio classification. Hearing adds context that vision alone cannot provide.
🌡
Thermal Sensing
Infrared thermal imaging for temperature awareness, human detection in low light, and material identification. Critical for safe interaction with the natural world.
🛰
IMU & Vestibular
Inertial measurement for balance, acceleration, and orientation. The vestibular analog for embodied systems — essential for locomotion and dynamic stability.
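Because these sensors run at different rates, the fusion stage must first align their readings on a common timeline. One common approach, sketched here as a minimal assumption (the text does not specify SYNAPEX's actual synchronisation method), is nearest-timestamp matching against the camera clock:

```python
import bisect

def nearest(timestamps: list[int], t: int) -> int:
    """Index of the reading whose timestamp is closest to t (sorted input)."""
    i = bisect.bisect_left(timestamps, t)
    if i == 0:
        return 0
    if i == len(timestamps):
        return len(timestamps) - 1
    # Pick whichever neighbour is closer in time.
    return i if timestamps[i] - t < t - timestamps[i - 1] else i - 1

def fuse(rgb_ts, depth_ts, proprio_ts):
    """For each RGB frame, pair it with the closest depth and proprioceptive
    readings, yielding time-aligned (rgb, depth, proprio) index triples."""
    return [(i, nearest(depth_ts, t), nearest(proprio_ts, t))
            for i, t in enumerate(rgb_ts)]

# Toy timestamps in milliseconds: RGB at ~30 Hz, depth at ~15 Hz,
# proprioception at 100 Hz.
rgb = [0, 33, 66]
depth = [0, 66]
proprio = [0, 10, 20, 30, 40, 50, 60]
print(fuse(rgb, depth, proprio))  # [(0, 0, 0), (1, 0, 3), (2, 1, 6)]
```

Each aligned triple then forms one unified spatio-temporal frame for the feature-extraction stage.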

Vision Modules in the Laboratory

Each stage of the perception pipeline is published as a separate module in the SYNAPEX Lab. Researchers can use the full pipeline or pick individual modules — face recognition, object detection, depth estimation — for their own projects. Every module earns $SYNX for its creator.
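The pick-and-compose idea can be sketched as modules threaded through a shared scene structure. Everything here is a hypothetical stand-in: the module names, the dict-based interface, and `run_pipeline` are assumptions for illustration, not the SYNAPEX Lab API.

```python
from typing import Callable

# Each lab module is modelled as a callable that takes the scene built so
# far and returns it enriched; real modules would wrap trained networks.
Stage = Callable[[dict], dict]

def detect_objects(scene: dict) -> dict:
    """Stand-in for an object-detection module."""
    return {**scene, "objects": ["cup", "table"]}

def estimate_depth(scene: dict) -> dict:
    """Stand-in for a depth-estimation module."""
    return {**scene, "has_depth": True}

def run_pipeline(frame: dict, stages: list[Stage]) -> dict:
    """Thread a frame through whichever modules the researcher picked."""
    for stage in stages:
        frame = stage(frame)
    return frame

# Full pipeline versus a single module picked from the Lab:
full = run_pipeline({"rgb": "frame-0"}, [detect_objects, estimate_depth])
solo = run_pipeline({"rgb": "frame-0"}, [detect_objects])
print(sorted(full))  # ['has_depth', 'objects', 'rgb']
```

The design choice this illustrates is that every stage consumes and produces the same structured representation, so dropping or swapping a module never breaks the ones downstream.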

🕵 Face Detection & Recognition · Phase 1
Real-time multi-face detection and identification. Extracted from the perception pipeline as a standalone deployable module.
CV · Biometric · Edge

🧹 3D Face Scan · In Dev
RGB-D face scanning with mesh generation. Landmark extraction, expression tracking. Publishable for avatar, health, and security applications.
3D · Depth · Mesh

👁 Scene Segmentation · In Dev
Panoptic segmentation: every pixel classified, every instance segmented. The foundation for spatial reasoning in embodied AI.
Panoptic · Semantic · Instance