Qwen3.6-35B-A3B Outperforms Claude Opus 4.7 on SVG Generation Test

TL;DR

In an informal SVG generation benchmark, Alibaba's Qwen3.6-35B-A3B model running locally via a 20.9GB quantized version outperformed Anthropic's newly released Claude Opus 4.7. The test, which asked models to generate SVG illustrations of pelicans and flamingos on bicycles, showed the smaller local model producing more accurate bicycle frames and more creative outputs.

Alibaba's Qwen3.6-35B-A3B model, running locally, produced more accurate SVG illustrations than Anthropic's Claude Opus 4.7 in an informal benchmark, according to a comparison published April 16 by developer Simon Willison.

The test asked both models to generate SVG code for a "pelican riding a bicycle." Qwen3.6-35B-A3B, running as a 20.9GB quantized model (Qwen3.6-35B-A3B-UD-Q4_K_S.gguf) on a MacBook Pro M5 through LM Studio, produced a correct bicycle frame, complete with clouds and a detailed pelican pouch. Claude Opus 4.7 generated an incorrect bicycle frame shape in both standard and maximum thinking modes.

Benchmark Details

The Qwen model ran entirely locally using the quantized GGUF format from Unsloth. Opus 4.7 ran via Anthropic's API. Both models were tested on the same prompt without modification.

In a follow-up test, Willison used "flamingo riding a unicycle" to check that the results weren't an artifact of training on the well-known benchmark prompt. Qwen3.6-35B-A3B again produced what he judged the superior output, adding creative details such as sunglasses and a bowtie on the flamingo, along with explanatory comments in the SVG source.

Model Specifications

Qwen3.6-35B-A3B:

  • Parameter count: 35 billion total (the A3B suffix follows Qwen's naming convention for mixture-of-experts models with roughly 3 billion active parameters per token)
  • Quantized size: 20.9GB (Q4_K_S format)
  • Deployment: Local via LM Studio
  • Released: April 16, 2026 (announced by Alibaba)

Claude Opus 4.7:

  • Parameter count: Not disclosed
  • Deployment: API only
  • Released: April 16, 2026 (announced by Anthropic)
  • Tested with both standard and maximum thinking levels

Analysis Limitations

Willison noted that this informal benchmark tests only a narrow capability and should not be interpreted as evidence that the quantized Qwen model is generally more capable than Opus 4.7. "I very much doubt that a 21GB quantized version of their latest model is more powerful or useful than Anthropic's latest proprietary release," he wrote.

The benchmark has historically tracked general improvements in model capability since October 2024, when early models produced poor results. Recent flagship models such as Gemini 3.1 Pro have generated production-quality illustrations on the test.

What This Means

This result demonstrates that specialized performance on specific tasks can vary significantly between models regardless of overall capability or size. A 35B parameter model running locally in quantized form matched or exceeded a flagship proprietary model on SVG generation, while likely trailing in most other benchmarks.

The finding also highlights the growing sophistication of local LLMs. A model small enough to run on consumer hardware (20.9GB) can now compete with cloud-based flagship models on certain creative tasks, though general-purpose performance gaps remain significant.

For developers specifically needing SVG generation capabilities, this suggests testing multiple models on actual use cases rather than relying solely on general benchmark scores or parameter counts.
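One way to start such task-specific testing is to add a cheap automated sanity check on each model's output before eyeballing the renders. The heuristic below is an illustration, not Willison's judging method (he compared the rendered images): it verifies the SVG parses as XML and counts circle/ellipse elements as stand-ins for wheels.

```python
import xml.etree.ElementTree as ET

def crude_bicycle_check(svg_text: str) -> dict:
    """Deliberately simple structural check for an 'X riding a bicycle'
    SVG: must parse as XML and contain at least two wheel-like shapes.
    A heuristic sketch only -- real evaluation still needs a human (or
    vision model) looking at the rendered image."""
    try:
        root = ET.fromstring(svg_text)
    except ET.ParseError:
        return {"valid_xml": False, "wheel_like_shapes": 0}
    # Handle both namespaced and bare SVG tags.
    ns = "{http://www.w3.org/2000/svg}"
    wheels = sum(
        1 for el in root.iter()
        if el.tag in (f"{ns}circle", f"{ns}ellipse", "circle", "ellipse")
    )
    return {"valid_xml": True, "wheel_like_shapes": wheels}
```

Running every candidate model's output through a check like this filters out responses that fail on basics, so human comparison time goes to the plausible candidates.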
