Gemma 4 VLA runs locally on NVIDIA Jetson Orin Nano Super with 8GB RAM, autonomous webcam tool-calling
NVIDIA engineer Asier Arranz demonstrated Gemma 4 running as a vision-language agent (VLA) on a Jetson Orin Nano Super with 8GB RAM. The model autonomously decides when to access a webcam based on user queries, with no hardcoded triggers—performing speech-to-text, vision analysis, and text-to-speech entirely locally.
Google's Gemma 4 vision-language model now runs entirely on NVIDIA's Jetson Orin Nano Super developer board with 8GB RAM, according to a technical demo published by NVIDIA engineer Asier Arranz. The implementation autonomously decides when to access a webcam based on conversational context—no keyword triggers or hardcoded logic.
Technical implementation
The demo uses a Q4_K_M quantized version of Gemma 4 E2B served via llama.cpp with CUDA acceleration. The full pipeline runs locally:
- Speech input: Parakeet STT (speech-to-text)
- Language model: Gemma 4 E2B at Q4_K_M quantization (4.5GB)
- Vision projector: mmproj-gemma4-e2b-f16.gguf
- Speech output: Kokoro TTS
- Context window: 2,048 tokens
- Image tokens: 70 (fixed)
The system exposes exactly one tool to Gemma 4: look_and_answer, which captures a webcam frame. When a user asks a question that requires seeing the scene ("What color is my shirt?"), the model calls this tool autonomously; the vision projector encodes the frame, and Gemma 4 answers grounded in it.
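A minimal sketch of how that single tool might be exposed to the model. The tool name look_and_answer comes from the demo; the OpenAI-style function schema and the chat-request payload shape are assumptions about how llama.cpp's tool-calling mode is typically driven, not code from Arranz's repository.

```python
# Hypothetical schema for the one tool the model sees. The name
# look_and_answer is from the article; the schema format is an assumption.
LOOK_AND_ANSWER_TOOL = {
    "type": "function",
    "function": {
        "name": "look_and_answer",
        "description": (
            "Capture one webcam frame and answer the user's question "
            "using what is visible in it."
        ),
        "parameters": {
            "type": "object",
            "properties": {
                "question": {
                    "type": "string",
                    "description": "The user's question about the scene.",
                }
            },
            "required": ["question"],
        },
    },
}

def chat_payload(user_text: str) -> dict:
    """Build a chat request that exposes exactly one tool to the model."""
    return {
        "messages": [{"role": "user", "content": user_text}],
        "tools": [LOOK_AND_ANSWER_TOOL],
        "tool_choice": "auto",  # the model decides whether vision is needed
    }
```

Because the tool list contains a single entry, the model's only decision is whether visual context is needed at all, which is exactly the behavior the demo shows.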
Hardware requirements
Arranz used:
- NVIDIA Jetson Orin Nano Super (8GB)
- Logitech C920 webcam
- USB speaker
- USB keyboard
The setup required aggressive memory management on the 8GB board. Arranz recommends:
- 8GB swap file
- Stopping Docker and containerd
- Killing background processes (tracker-miner, gnome-software)
- Closing all browser tabs and IDEs
With cleanup, Q4_K_M runs comfortably. Users experiencing memory pressure can drop to Q3_K_M quantization (lower quality, lighter memory footprint).
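The fallback rule above can be sketched as a simple policy. The 6 GB threshold is an assumption for illustration: the article only states that the Q4_K_M model file is 4.5GB, which still needs headroom for the vision projector, KV cache, and OS on an 8GB board.

```python
# Assumed headroom needed for Q4_K_M (4.5GB model file plus projector,
# KV cache, and OS overhead); the exact cutoff is an assumption.
Q4_BUDGET_GB = 6.0

def pick_quant(available_gb: float) -> str:
    """Return a GGUF quantization level for the given free memory (GB)."""
    return "Q4_K_M" if available_gb >= Q4_BUDGET_GB else "Q3_K_M"
```

On a cleaned-up board with swap enabled, the budget check passes and Q4_K_M is used; under memory pressure it degrades to Q3_K_M rather than failing outright.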
Model serving configuration
The llama-server runs with:
- -ngl 99: All model layers offloaded to GPU
- --flash-attn on: Flash attention enabled
- --no-mmproj-offload: Vision projector stays in system RAM
- --jinja: Enables Gemma's native tool-calling support
- -c 2048: Context window size
The --jinja flag activates Gemma 4's built-in function-calling capabilities. Without it, the model cannot interpret tool definitions or decide when to call look_and_answer.
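Assembled into a launch command, the configuration might look like the sketch below. The flags are the ones listed above; the binary path and GGUF filenames are assumptions (the article names the projector file but not the model filename or directory layout).

```python
import subprocess

# Hypothetical llama-server invocation; flags are from the article,
# model paths are assumptions.
LLAMA_SERVER_CMD = [
    "llama-server",
    "-m", "models/gemma4-e2b-Q4_K_M.gguf",            # assumed model path
    "--mmproj", "models/mmproj-gemma4-e2b-f16.gguf",  # projector named in article
    "-ngl", "99",           # offload all layers to GPU
    "--flash-attn", "on",   # enable flash attention
    "--no-mmproj-offload",  # keep the vision projector in system RAM
    "--jinja",              # enable native tool-calling templates
    "-c", "2048",           # 2,048-token context window
]

def launch_server() -> subprocess.Popen:
    """Start llama-server in the background (sketch; no error handling)."""
    return subprocess.Popen(LLAMA_SERVER_CMD)
```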
Vision-language agent behavior
Unlike traditional computer vision demos that describe images, this implementation answers questions using visual context. If you ask "What color is my shirt?", Gemma doesn't respond with "I see a blue shirt." It responds with "Your shirt is blue," treating the image as supporting evidence for the query.
The model determines tool usage based purely on conversational context. Questions like "What's the weather?" receive text-only responses. "How many fingers am I holding up?" triggers webcam access.
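The dispatch step this implies can be sketched as follows. Note there is no keyword matching on the user's words; the branch depends only on whether the model emitted a tool call. capture_frame() is a hypothetical stand-in for the webcam grab, and handle_turn() operates on a parsed assistant message, both named here for illustration only.

```python
# Sketch of autonomous dispatch: branch on the model's own tool_calls
# output, never on keywords in the user's text.
def capture_frame() -> bytes:
    """Hypothetical webcam grab; real code would use OpenCV or V4L2."""
    return b"\xff\xd8<jpeg bytes>"

def handle_turn(assistant_message: dict) -> str:
    calls = assistant_message.get("tool_calls") or []
    if not calls:
        # Text-only question ("What's the weather?"): pass the answer through.
        return assistant_message["content"]
    # Vision question: run the single tool; in the full loop the frame
    # would go through the vision projector before the final answer.
    frame = capture_frame()
    return f"tool:look_and_answer frame_bytes={len(frame)}"
```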
Code availability
The complete implementation is available on GitHub at asierarranz/Google_Gemma in a single Python file (Gemma4_vla.py). The script downloads STT/TTS models from Hugging Face on first run.
A text-only mode bypasses audio components for testing:
python3 Gemma4_vla.py --text
What this means
This demonstrates meaningful progress in edge AI deployment. A vision-language model with autonomous tool-calling running on an 8GB developer board ($249 retail) represents a significant compression achievement. The Q4_K_M quantization maintains functional reasoning about when visual context is necessary while fitting in consumer-grade memory constraints.
The lack of hardcoded triggers is notable—previous VLA implementations often relied on keyword detection or explicit user commands to activate vision capabilities. Gemma 4's native function-calling support allows genuine contextual reasoning about tool use.
For robotics and edge computing applications, this level of multimodal reasoning in 8GB opens deployment scenarios previously requiring datacenter hardware. The tradeoff is inference speed (Arranz doesn't disclose tokens/second), but for interactive applications where sub-second response isn't critical, local execution eliminates API costs and latency.