Gemma 4 VLA runs locally on NVIDIA Jetson Orin Nano Super with 8GB RAM, autonomous webcam tool-calling
NVIDIA engineer Asier Arranz demonstrated Gemma 4 running as a vision-language agent (VLA) on a Jetson Orin Nano Super with 8GB RAM. The model autonomously decides when to access a webcam based on user queries, with no hardcoded triggers—performing speech-to-text, vision analysis, and text-to-speech entirely locally.
Google's Gemma 4 vision-language model now runs entirely on NVIDIA's Jetson Orin Nano Super developer board with 8GB RAM, according to a technical demo published by NVIDIA engineer Asier Arranz. The implementation autonomously decides when to access a webcam based on conversational context—no keyword triggers or hardcoded logic.
Technical implementation
The demo uses a Q4_K_M quantized version of Gemma 4 E2B served via llama.cpp with CUDA acceleration. The full pipeline runs locally:
- Speech input: Parakeet STT (speech-to-text)
- Language model: Gemma 4 E2B at Q4_K_M quantization (4.5GB)
- Vision projector: mmproj-gemma4-e2b-f16.gguf
- Speech output: Kokoro TTS
- Context window: 2,048 tokens
- Image tokens: 70 (fixed)
The system exposes exactly one tool to Gemma 4: look_and_answer, which captures a webcam frame. When a user asks a question that requires seeing the scene ("What color is my shirt?"), the model calls this tool autonomously; the vision projector encodes the frame, and Gemma 4 answers grounded in it.
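A minimal sketch of how that single tool might be exposed to the model. The tool name look_and_answer comes from the demo; the OpenAI-style function schema and the chat-request payload shape are assumptions about how llama.cpp's tool-calling mode is typically driven, not code from Arranz's repository.

```python
# Hypothetical schema for the one tool the model sees. The name
# look_and_answer is from the article; the schema format is an assumption.
LOOK_AND_ANSWER_TOOL = {
    "type": "function",
    "function": {
        "name": "look_and_answer",
        "description": (
            "Capture one webcam frame and answer the user's question "
            "using what is visible in it."
        ),
        "parameters": {
            "type": "object",
            "properties": {
                "question": {
                    "type": "string",
                    "description": "The user's question about the scene.",
                }
            },
            "required": ["question"],
        },
    },
}

def chat_payload(user_text: str) -> dict:
    """Build a chat request that exposes exactly one tool to the model."""
    return {
        "messages": [{"role": "user", "content": user_text}],
        "tools": [LOOK_AND_ANSWER_TOOL],
        "tool_choice": "auto",  # the model decides whether vision is needed
    }
```

Because the tool list contains a single entry, the model's only decision is whether visual context is needed at all, which is exactly the behavior the demo shows.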
Hardware requirements
Arranz used:
- NVIDIA Jetson Orin Nano Super (8GB)
- Logitech C920 webcam
- USB speaker
- USB keyboard
The setup required aggressive memory management on the 8GB board. Arranz recommends:
- 8GB swap file
- Stopping Docker and containerd
- Killing background processes (tracker-miner, gnome-software)
- Closing all browser tabs and IDEs
With cleanup, Q4_K_M runs comfortably. Users experiencing memory pressure can drop to Q3_K_M quantization (lower quality, lighter memory footprint).
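The fallback rule above can be sketched as a simple policy. The 6 GB threshold is an assumption for illustration: the article only states that the Q4_K_M model file is 4.5GB, which still needs headroom for the vision projector, KV cache, and OS on an 8GB board.

```python
# Assumed headroom needed for Q4_K_M (4.5GB model file plus projector,
# KV cache, and OS overhead); the exact cutoff is an assumption.
Q4_BUDGET_GB = 6.0

def pick_quant(available_gb: float) -> str:
    """Return a GGUF quantization level for the given free memory (GB)."""
    return "Q4_K_M" if available_gb >= Q4_BUDGET_GB else "Q3_K_M"
```

On a cleaned-up board with swap enabled, the budget check passes and Q4_K_M is used; under memory pressure it degrades to Q3_K_M rather than failing outright.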
Model serving configuration
The llama-server runs with:
- -ngl 99: All model layers offloaded to GPU
- --flash-attn on: Flash attention enabled
- --no-mmproj-offload: Vision projector stays in system RAM
- --jinja: Enables Gemma's native tool-calling support
- -c 2048: Context window size
The --jinja flag activates Gemma 4's built-in function-calling capabilities. Without it, the model cannot interpret tool definitions or decide when to call look_and_answer.
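Assembled into a launch command, the configuration might look like the sketch below. The flags are the ones listed above; the binary path and GGUF filenames are assumptions (the article names the projector file but not the model filename or directory layout).

```python
import subprocess

# Hypothetical llama-server invocation; flags are from the article,
# model paths are assumptions.
LLAMA_SERVER_CMD = [
    "llama-server",
    "-m", "models/gemma4-e2b-Q4_K_M.gguf",            # assumed model path
    "--mmproj", "models/mmproj-gemma4-e2b-f16.gguf",  # projector named in article
    "-ngl", "99",           # offload all layers to GPU
    "--flash-attn", "on",   # enable flash attention
    "--no-mmproj-offload",  # keep the vision projector in system RAM
    "--jinja",              # enable native tool-calling templates
    "-c", "2048",           # 2,048-token context window
]

def launch_server() -> subprocess.Popen:
    """Start llama-server in the background (sketch; no error handling)."""
    return subprocess.Popen(LLAMA_SERVER_CMD)
```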
Vision-language agent behavior
Unlike traditional computer vision demos that describe images, this implementation answers questions using visual context. If you ask "What color is my shirt?", Gemma doesn't respond with "I see a blue shirt." It responds with "Your shirt is blue," treating the image as supporting evidence for the query.
The model determines tool usage based purely on conversational context. Questions like "What's the weather?" receive text-only responses. "How many fingers am I holding up?" triggers webcam access.
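The dispatch step this implies can be sketched as follows. Note there is no keyword matching on the user's words; the branch depends only on whether the model emitted a tool call. capture_frame() is a hypothetical stand-in for the webcam grab, and handle_turn() operates on a parsed assistant message, both named here for illustration only.

```python
# Sketch of autonomous dispatch: branch on the model's own tool_calls
# output, never on keywords in the user's text.
def capture_frame() -> bytes:
    """Hypothetical webcam grab; real code would use OpenCV or V4L2."""
    return b"\xff\xd8<jpeg bytes>"

def handle_turn(assistant_message: dict) -> str:
    calls = assistant_message.get("tool_calls") or []
    if not calls:
        # Text-only question ("What's the weather?"): pass the answer through.
        return assistant_message["content"]
    # Vision question: run the single tool; in the full loop the frame
    # would go through the vision projector before the final answer.
    frame = capture_frame()
    return f"tool:look_and_answer frame_bytes={len(frame)}"
```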
Code availability
The complete implementation is available on GitHub at asierarranz/Google_Gemma in a single Python file (Gemma4_vla.py). The script downloads STT/TTS models from Hugging Face on first run.
A text-only mode bypasses audio components for testing:
python3 Gemma4_vla.py --text
What this means
This demonstrates meaningful progress in edge AI deployment. A vision-language model with autonomous tool-calling running on an 8GB developer board ($249 retail) represents a significant compression achievement. The Q4_K_M quantization maintains functional reasoning about when visual context is necessary while fitting in consumer-grade memory constraints.
The lack of hardcoded triggers is notable—previous VLA implementations often relied on keyword detection or explicit user commands to activate vision capabilities. Gemma 4's native function-calling support allows genuine contextual reasoning about tool use.
For robotics and edge computing applications, this level of multimodal reasoning in 8GB opens deployment scenarios previously requiring datacenter hardware. The tradeoff is inference speed (Arranz doesn't disclose tokens/second), but for interactive applications where sub-second response isn't critical, local execution eliminates API costs and latency.