Gemma 4 VLA runs locally on NVIDIA Jetson Orin Nano Super with 8GB RAM, autonomous webcam tool-calling
NVIDIA engineer Asier Arranz demonstrated Gemma 4 running as a vision-language agent (VLA) on a Jetson Orin Nano Super with 8GB RAM. The model autonomously decides when to access a webcam based on user queries, with no hardcoded triggers—performing speech-to-text, vision analysis, and text-to-speech entirely locally.
Gemma 4 VLA runs locally on NVIDIA Jetson Orin Nano Super with 8GB RAM, autonomous webcam tool-calling
Google's Gemma 4 vision-language model now runs entirely on NVIDIA's Jetson Orin Nano Super developer board with 8GB RAM, according to a technical demo published by NVIDIA engineer Asier Arranz. The implementation autonomously decides when to access a webcam based on conversational context—no keyword triggers or hardcoded logic.
Technical implementation
The demo uses a Q4_K_M quantized version of Gemma 4 E2B served via llama.cpp with CUDA acceleration. The full pipeline runs locally:
- Speech input: Parakeet STT (speech-to-text)
- Language model: Gemma 4 E2B at Q4_K_M quantization (4.5GB)
- Vision projector: mmproj-gemma4-e2b-f16.gguf
- Speech output: Kokoro TTS
- Context window: 2,048 tokens
- Image tokens: 70 (fixed)
The system exposes exactly one tool to Gemma 4: look_and_answer, which captures a webcam frame. When a user asks a question requiring visual context ("What color is my shirt?"), the model calls this tool autonomously. The vision projector processes the image, and Gemma 4 answers using visual context.
Hardware requirements
Arranz used:
- NVIDIA Jetson Orin Nano Super (8GB)
- Logitech C920 webcam
- USB speaker
- USB keyboard
The setup required aggressive memory management on the 8GB board. Arranz recommends:
- 8GB swap file
- Stopping Docker and containerd
- Killing background processes (tracker-miner, gnome-software)
- Closing all browser tabs and IDEs
With cleanup, Q4_K_M runs comfortably. Users experiencing memory pressure can drop to Q3_K_M quantization (lower quality, lighter memory footprint).
Model serving configuration
The llama-server runs with:
-ngl 99: All model layers offloaded to GPU--flash-attn on: Flash attention enabled--no-mmproj-offload: Vision projector stays in system RAM--jinja: Enables Gemma's native tool-calling support-c 2048: Context window size
The --jinja flag activates Gemma 4's built-in function-calling capabilities. Without it, the model cannot interpret tool definitions or decide when to call look_and_answer.
Vision-language agent behavior
Unlike traditional computer vision demos that describe images, this implementation answers questions using visual context. If you ask "What color is my shirt?", Gemma doesn't respond with "I see a blue shirt." It responds with "Your shirt is blue," treating the image as supporting evidence for the query.
The model determines tool usage based purely on conversational context. Questions like "What's the weather?" receive text-only responses. "How many fingers am I holding up?" triggers webcam access.
Code availability
The complete implementation is available on GitHub at asierarranz/Google_Gemma in a single Python file (Gemma4_vla.py). The script downloads STT/TTS models from Hugging Face on first run.
A text-only mode bypasses audio components for testing:
python3 Gemma4_vla.py --text
What this means
This demonstrates meaningful progress in edge AI deployment. A vision-language model with autonomous tool-calling running on an 8GB developer board ($249 retail) represents a significant compression achievement. The Q4_K_M quantization maintains functional reasoning about when visual context is necessary while fitting in consumer-grade memory constraints.
The lack of hardcoded triggers is notable—previous VLA implementations often relied on keyword detection or explicit user commands to activate vision capabilities. Gemma 4's native function-calling support allows genuine contextual reasoning about tool use.
For robotics and edge computing applications, this level of multimodal reasoning in 8GB opens deployment scenarios previously requiring datacenter hardware. The tradeoff is inference speed (Arranz doesn't disclose tokens/second), but for interactive applications where sub-second response isn't critical, local execution eliminates API costs and latency.
Related Articles
NVIDIA Releases Nemotron 3.5 Content Safety: 4B-Parameter Multimodal Model with Custom Policy Enforcement and 140-Langua
NVIDIA has released Nemotron 3.5 Content Safety, a 4B-parameter model built on Google Gemma 3 4B IT that provides multimodal safety classification across approximately 140 languages. The model includes a 128K context window, custom enterprise policy enforcement, auditable reasoning traces, and is releasing its training dataset.
Nvidia Releases Free 4B-Parameter Nemotron 3.5 Content Safety Model with 128K Context
Nvidia has released Nemotron 3.5 Content Safety, a 4-billion parameter multimodal guardrail model fine-tuned from Google Gemma-3-4B. The model is available for free, supports 128K token context windows, and moderates content across 12 languages.
Nvidia releases Nemotron 3 Ultra: 550B-parameter MoE model with 1M context window for agentic workflows
Nvidia has released Nemotron 3 Ultra, a 550-billion parameter mixture-of-experts model with 55 billion active parameters and support for up to 1 million token context windows. The model uses a hybrid Transformer-Mamba architecture and is designed specifically for long-running agentic workflows including agent orchestration, coding agents, and complex enterprise tasks.
NVIDIA Releases Nemotron-3-Ultra: 550B Parameter Model with 1M Token Context and Configurable Reasoning
NVIDIA released Nemotron-3-Ultra-550B-A55B-NVFP4, a 550B parameter model with 55B active parameters, featuring a 1M token context window and configurable reasoning mode. The model uses a hybrid LatentMoE architecture combining Mamba-2, Mixture-of-Experts, and Attention layers with Multi-Token Prediction, trained with NVIDIA's NVFP4 quantization-aware approach.
Comments
Loading...