
AI that can see and understand what is happening in a video, especially a live stream, is understandably an attractive product for many companies and organizations. Beyond acting as a security "watchdog" at sites and facilities, such an AI model could also be used to trim the most interesting parts of marketing videos and repurpose them for social media, identify inconsistencies and errors in videos and flag them for deletion, and analyze the body language and actions of participants in controlled studies or candidates applying for new roles.
While there are some AI models that offer this type of functionality today, it is far from a widespread capability. However, two-year-old startup Perceptron Inc. is looking to change that. Today the company announced the launch of its flagship proprietary video analytics reasoning model, Mk1 (short for "mark one"), priced at $0.15 per million input tokens and $1.50 per million output tokens via its application programming interface (API), a cost 80-90% lower than other leading proprietary rivals, namely Anthropic’s Claude Sonnet 4.5, OpenAI’s GPT-5, and Google’s Gemini 3.1 Pro.
Led by co-founder and CEO Armen Aghajanyan, formerly of Meta FAIR and Microsoft, the company spent 16 months developing a "multimodal recipe" from scratch to address the complexities of the physical world.
This release signals a new era in which models are expected to understand cause and effect, object dynamics, and the laws of physics with the same fluency they once applied to grammar.
Interested users and potential business customers can try it out for themselves on Perceptron's public demo site here.
Performance in spatial and video benchmarks
Model performance is supported by a set of industry-standard benchmarks focused on embodied understanding.
On spatial reasoning (embodied reasoning, or ER) benchmarks, Mk1 achieved a score of 85.1 on EmbSpatialBench, beating Google’s Robotics-ER 1.5 (78.4) and Alibaba’s Q3.5-27B (approximately 84.5).
On the specialized RefSpatialBench, Mk1’s score of 72.4 represents a big jump over competitors such as GPT-5 (9.0) and Sonnet 4.5 (2.2), highlighting a significant advantage in understanding referring expressions.
Video benchmarks show similar dominance; on the EgoSchema "Hard Subset" (where inference from the first and last frames alone is insufficient), Mk1 scored 41.4, matching Alibaba’s Q3.5-27B and significantly outperforming Gemini 3.1 Flash-Lite (25.0).
On VSI-Bench, Mk1 achieved 88.5, the highest score recorded among the compared models, further validating its ability to handle real-world temporal reasoning tasks.
Market positioning and efficiency frontier
Perceptron has explicitly positioned Mk1 on an "efficiency frontier," a chart that plots average scores across embodied reasoning and video benchmarks against the combined cost per million tokens.
The benchmark data reveals that Mk1 occupies a unique position: matching or exceeding the performance of "frontier" models like GPT-5 and Gemini 3.1 Pro while maintaining a cost profile closer to their "lite" or "Flash" versions.
Specifically, Perceptron Mk1 is priced at $0.15 per million input tokens and $1.50 per million output tokens. For comparison, the efficiency frontier chart shows GPT-5 at a significantly higher combined cost (about $2.00) and Gemini 3.1 Pro at about $3.00, while Mk1 sits at the $0.30 combined-cost mark with higher reasoning scores.
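Perceptron does not spell out how the chart's "combined cost" is blended from input and output prices. As a rough illustration only, a minimal sketch assuming an input-heavy mix typical of video workloads (a hypothetical eight input tokens for every output token) lands in the same ballpark as the figures on the chart:

```python
# Hypothetical blended-cost calculation. The 8:1 input-to-output ratio and the
# rival list prices are assumptions for illustration, not Perceptron's figures.
def blended_cost(input_price, output_price, input_share=8, output_share=1):
    """Return a blended cost per million tokens for a given input/output mix."""
    total = input_share + output_share
    return (input_price * input_share + output_price * output_share) / total

print(blended_cost(0.15, 1.50))   # Perceptron Mk1  -> ~$0.30 per million tokens
print(blended_cost(1.25, 10.00))  # assumed GPT-5 list prices -> ~$2.22 per million
```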
This aggressive pricing strategy aims to make high-end physics AI accessible for large-scale industrial use rather than just experimental research.
Architecture and temporal continuity
The technical core of Perceptron Mk1 is its ability to process video natively at up to 2 frames per second (FPS) within a 32,000-token context window.
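Perceptron has not published how many tokens each frame consumes, so the following is only a back-of-envelope sketch; assuming a hypothetical figure of roughly 256 tokens per frame, the 32K window would cover about a minute of footage at 2 FPS:

```python
# Back-of-envelope video context budget. TOKENS_PER_FRAME is an assumed
# placeholder; Perceptron has not disclosed the actual per-frame token cost.
CONTEXT_TOKENS = 32_000
FPS = 2
TOKENS_PER_FRAME = 256  # hypothetical

frames = CONTEXT_TOKENS // TOKENS_PER_FRAME   # 125 frames
seconds = frames / FPS                         # ~62 seconds of video
print(f"{frames} frames, roughly {seconds:.0f} s of footage, fit in one window")
```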
Unlike traditional vision-language models (VLMs), which often treat video as a disjointed sequence of still images, Mk1 is designed for temporal continuity.
This architecture allows the model to "look" across extended streams and maintain object identity even through occlusions, a critical requirement for robotics and surveillance applications.
Developers can query the model for specific moments in a long sequence and receive structured time codes in return, streamlining the process of video trimming and event detection.
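Perceptron has not published the exact request schema for these queries, so the snippet below is a purely illustrative sketch; the endpoint URL, field names, and response shape are assumptions, not the documented API:

```python
import requests

# Hypothetical request/response shapes for moment retrieval. The endpoint URL,
# auth header, and field names are placeholders, not Perceptron's documented API.
resp = requests.post(
    "https://api.perceptron.example/v1/video/query",
    headers={"Authorization": "Bearer <API_KEY>"},
    json={
        "model": "mk1",
        "video_url": "https://example.com/warehouse-cam.mp4",
        "prompt": "Return time codes for every moment a forklift enters the frame.",
    },
    timeout=120,
)
for event in resp.json().get("events", []):
    # e.g. {"start": "00:04:12.500", "end": "00:04:19.000", "label": "forklift enters"}
    print(event["start"], "->", event["end"], event["label"])
```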
Reasoning with the laws of physics
A key differentiator for Mk1 is its "physical reasoning" capability. Perceptron defines this as high-precision spatial awareness that allows the model to understand object dynamics and physical interactions in real-world environments.
For example, the model can analyze a scene to determine whether a basketball shot was taken before or after the buzzer rang, reasoning jointly about the position of the ball in the air and the reading on a shot clock.
This requires more than just pattern recognition; it requires an understanding of how objects move through space and time.
The model is capable of "pixel-precision" pointing and can count up to hundreds of objects within dense, complex scenes. It can also read analog gauges and clocks, which have historically been difficult for purely digital vision systems to interpret reliably.
The model also appears to have solid general historical and world knowledge. In my short test, I uploaded an old public domain film from 1906 about the construction of skyscrapers in New York City, sourced from the US Library of Congress, and Mk1 was not only able to correctly describe the content of the footage (including strange and atypical views of workers suspended by ropes), but did so quickly and even correctly identified the approximate date (early 1900s) from the look of the footage alone.
A development platform for physical AI
Accompanying the release of the model is an expanded developer platform designed to turn these high-level perception capabilities into functional applications with minimal code.
The Perceptron SDK, available in Python, introduces several specialized capabilities such as "Zoom," "Count," and in-context learning.
The Zoom feature allows users to automatically zoom into and crop specific regions of a frame based on a natural-language prompt, such as detecting and locating personal protective equipment (PPE) on a construction site. The Count function is optimized for dense scenes, such as identifying and pointing out each puppy in a group or each individual product in a display.
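The announcement does not reproduce the SDK's exact call signatures, so the sketch below is a hedged illustration of how zoom-and-crop and dense counting might be invoked; the import path, client class, method names, and arguments are assumptions, not the documented interface:

```python
# Illustrative only: the import path, class, and method names below are assumed,
# not taken from the actual Perceptron SDK documentation.
from perceptron import Client  # hypothetical import path

client = Client(api_key="<API_KEY>")

# Zoom: crop to regions matching a natural-language description.
crops = client.zoom(
    image="site-entrance.jpg",
    prompt="hard hats and high-visibility vests on workers",
)

# Count: enumerate and point to each instance in a dense scene.
result = client.count(image="pallet.jpg", prompt="individual boxed products")
print(result.count, result.points)  # e.g. 142, [(x, y), ...]
```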
Additionally, the platform supports in-context learning, allowing developers to tailor Mk1 to specific tasks by providing just a few examples, such as showing the model an image of an apple and asking it to label every instance of that category in a new scene.
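Along the same lines, few-shot adaptation might be expressed as a handful of labeled reference images passed alongside the query; again, the call below is a hypothetical sketch rather than the documented interface:

```python
# Hypothetical few-shot (in-context learning) call; names and arguments are assumed.
from perceptron import Client  # hypothetical import path, as above

client = Client(api_key="<API_KEY>")
labels = client.label(
    image="orchard.jpg",
    examples=[{"image": "apple-reference.jpg", "label": "apple"}],  # one exemplar
    prompt="Label every instance of the example category in this scene.",
)
for item in labels:
    print(item["label"], item["box"])
```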
Licensing strategy and the Isaac series
Perceptron is employing a two-track strategy for its model weights and licensing. The flagship Perceptron Mk1 is a closed-source model accessed via API, designed to deliver enterprise-grade performance and security.
However, the company also maintains its "Isaac" series of open-weights alternatives, which began with the Isaac 0.1 release in September 2025. Isaac 0.2-2b-preview, released in December 2025, is a 2-billion-parameter vision-language model with reasoning capabilities that is available for edge and low-latency deployments.
While the weights for the Isaac models are openly available on the popular AI code-sharing community Hugging Face, Perceptron offers commercial licenses for companies that require maximum control or local deployment of the weights.
This approach allows the company to support both the open-source community and specialized industrial partners who need proprietary flexibility. The documentation notes that the Isaac 0.2 models are specifically optimized for a time to first token of under 200 milliseconds, making them well suited to real-time edge devices.
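For readers unfamiliar with the metric, time to first token simply measures how long a streaming request waits before the first chunk of output arrives. A minimal way to check it against any streaming HTTP endpoint is sketched below; the URL, model name, and payload are placeholders, not Isaac's actual interface:

```python
import time
import requests

# Generic time-to-first-token measurement against a streaming endpoint.
# The URL and JSON payload are placeholders, not Perceptron's documented API.
start = time.perf_counter()
with requests.post(
    "https://edge-device.local/v1/generate",
    json={"model": "isaac-0.2-2b-preview", "prompt": "Describe the camera feed."},
    stream=True,
    timeout=30,
) as resp:
    for chunk in resp.iter_content(chunk_size=None):
        if chunk:
            ttft_ms = (time.perf_counter() - start) * 1000
            print(f"time to first token: {ttft_ms:.0f} ms")
            break
```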
Background on Perceptron’s founding and approach
Perceptron AI is a Bellevue, Washington-based physical artificial intelligence startup founded by Aghajanyan and Akshat Shrivastava, both former research scientists at Meta’s Facebook AI Research (FAIR) lab.
The company’s public materials date its founding to November 2024, while a Washington corporate filing record for Perceptron.ai Inc. shows an earlier foreign registration filing on October 9, 2024, listing Shrivastava and Aghajanyan as governors.
In the founders’ launch posts from late 2024, Aghajanyan said he had left Meta after almost six years and “joined forces” with Shrivastava to build AI for the physical world, while Shrivastava said the company emerged from his work on efficiency, multimodality, and new model architectures.
The founding appears to have emerged directly from the pair’s work on multimodal foundation models at Meta. In May 2024, Meta researchers published Chameleon, a family of early-fusion models designed to understand and generate mixed sequences of text and images, work that Perceptron later described as part of the lineage behind its own models.
A follow-up paper from July 2024, MoMa, explored more efficient early-fusion training for mixed-modal models and included Shrivastava and Aghajanyan among the authors. Perceptron’s stated thesis extends that research direction to “physical AI”: models that can process real-world video and other sensory streams for use cases like robotics, manufacturing, geospatial analysis, security, and content moderation.
Partner ecosystems and future perspectives
The real-world impact of Mk1 is already being demonstrated through Perceptron’s partner network. Early adopters are using the model for a variety of applications, such as automatically cropping live sports highlights, which leverages the model’s temporal understanding to identify key plays without human intervention.
In the robotics sector, partners are transforming teleoperation episodes into training data, effectively automating the data labeling and cleaning process for robotic arms and mobile units.
Other use cases include multimodal quality control agents on manufacturing lines, which can detect defects and verify assembly steps in real time, and wearable assistants in smart glasses that provide contextual help to users.
Aghajanyan said these launches are the culmination of research aimed at making AI work better in the physical world, moving toward a future where "physical AI" is as ubiquitous as digital AI.
