Qwen3.6-35B-A3B Outperforms Claude Opus 4.7 on SVG Generation Test
In an informal SVG generation benchmark, Alibaba's Qwen3.6-35B-A3B model running locally via a 20.9GB quantized version outperformed Anthropic's newly released Claude Opus 4.7. The test, which asked models to generate SVG illustrations of pelicans and flamingos on bicycles, showed the smaller local model producing more accurate bicycle frames and more creative outputs.
Qwen3.6-35B-A3B Outperforms Claude Opus 4.7 on SVG Generation Test
Alibaba's Qwen3.6-35B-A3B model running locally produced more accurate SVG illustrations than Anthropic's Claude Opus 4.7 in an informal benchmark test, according to developer Simon Willison's comparison published April 16.
The test asked both models to generate SVG code for a "pelican riding a bicycle." Qwen3.6-35B-A3B, running via a 20.9GB quantized model (Qwen3.6-35B-A3B-UD-Q4_K_S.gguf) on a MacBook Pro M5 through LM Studio, produced a correct bicycle frame with clouds and a detailed pelican pouch. Claude Opus 4.7 generated an incorrect bicycle frame shape in both standard and maximum thinking mode.
Benchmark Details
The Qwen model ran entirely locally using the quantized GGUF format from Unsloth. Opus 4.7 ran via Anthropic's API. Both models were tested on the same prompt without modification.
In a follow-up test using "flamingo riding a unicycle" to verify the results weren't due to training on the specific benchmark, Qwen3.6-35B-A3B again produced what Willison judged to be superior output, including creative details like sunglasses and a bowtie on the flamingo, along with SVG comments.
Model Specifications
Qwen3.6-35B-A3B:
- Parameter count: 35 billion
- Quantized size: 20.9GB (Q4_K_S format)
- Deployment: Local via LM Studio
- Released: April 16, 2026 (announced by Alibaba)
Claude Opus 4.7:
- Parameter count: Not disclosed
- Deployment: API only
- Released: April 16, 2026 (announced by Anthropic)
- Tested with both standard and maximum thinking levels
Analysis Limitations
Willison noted that this informal benchmark tests only a narrow capability and should not be interpreted as evidence that the quantized Qwen model is generally more capable than Opus 4.7. "I very much doubt that a 21GB quantized version of their latest model is more powerful or useful than Anthropic's latest proprietary release," he wrote.
The benchmark has historically correlated with general model capability improvements since October 2024, when early models produced poor results. Recent flagship models like Gemini 3.1 Pro have generated production-quality illustrations on this test.
What This Means
This result demonstrates that specialized performance on specific tasks can vary significantly between models regardless of overall capability or size. A 35B parameter model running locally in quantized form matched or exceeded a flagship proprietary model on SVG generation, while likely trailing in most other benchmarks.
The finding also highlights the growing sophistication of local LLMs. A model small enough to run on consumer hardware (20.9GB) can now compete with cloud-based flagship models on certain creative tasks, though general-purpose performance gaps remain significant.
For developers specifically needing SVG generation capabilities, this suggests testing multiple models on actual use cases rather than relying solely on general benchmark scores or parameter counts.
Related Articles
Frontier AI Models Score Below 50% on First Enterprise IT Benchmark for Kubernetes Incident Response
Artificial Analysis and IBM Research have released ITBench-AA, the first benchmark evaluating AI models on enterprise Site Reliability Engineering tasks. Claude Opus 4.7 leads at 47%, followed by GPT-5.5 at 46% and Qwen3.7 Max at 42%—all frontier models score below 50% on Kubernetes incident response tasks requiring root-cause diagnosis across complex infrastructure.
Gemini handles video analysis across YouTube and 1.65GB local files, Claude fails entirely
In direct testing, Google's Gemini successfully analyzed video content from YouTube links and local files up to 1.65GB, accurately understanding context without audio or metadata. Anthropic's Claude cannot process video at all, while OpenAI's ChatGPT faces a 500MB file size limit without Codex assistance.
Claude Mythos achieves 73% success rate on expert-level hacking challenges, completes full network takeover in 3 of 10 a
The UK's AI Safety Institute reports Claude Mythos Preview achieved a 73% success rate on expert-level capture-the-flag cybersecurity challenges and became the first AI model to complete a full 32-step simulated corporate network takeover, succeeding in 3 out of 10 attempts. The testing occurred in environments without active security monitoring or defenders.
IBM Research launches Open Agent Leaderboard, showing same models achieve different results based on agent architecture
IBM Research has launched the Open Agent Leaderboard, the first open benchmark that evaluates complete AI agent systems rather than just underlying models. The leaderboard reveals that agents using identical models can achieve significantly different success rates and costs depending on system architecture, with failed runs costing 20-54% more than successful ones.
Comments
Loading...