Qwen3.6-35B-A3B Outperforms Claude Opus 4.7 on SVG Generation Test

TL;DR

In an informal SVG generation benchmark, Alibaba's Qwen3.6-35B-A3B model running locally via a 20.9GB quantized version outperformed Anthropic's newly released Claude Opus 4.7. The test, which asked models to generate SVG illustrations of pelicans and flamingos on bicycles, showed the smaller local model producing more accurate bicycle frames and more creative outputs.

Alibaba's Qwen3.6-35B-A3B model, running locally, produced more accurate SVG illustrations than Anthropic's Claude Opus 4.7 in an informal benchmark, according to a comparison published April 16 by developer Simon Willison.

The test asked both models to generate SVG code for a "pelican riding a bicycle." Qwen3.6-35B-A3B, running as a 20.9GB quantized model (Qwen3.6-35B-A3B-UD-Q4_K_S.gguf) on a MacBook Pro M5 through LM Studio, produced a correct bicycle frame, complete with clouds and a detailed pelican pouch. Claude Opus 4.7 generated an incorrect bicycle frame shape in both standard and maximum thinking modes.

Benchmark Details

The Qwen model ran entirely locally using the quantized GGUF format from Unsloth. Opus 4.7 ran via Anthropic's API. Both models were tested on the same prompt without modification.

In a follow-up test, Willison used "flamingo riding a unicycle" to check that the results weren't an artifact of training on the well-known benchmark prompt. Qwen3.6-35B-A3B again produced what he judged the superior output, adding creative details such as sunglasses and a bowtie on the flamingo, along with explanatory comments in the SVG source.

Model Specifications

Qwen3.6-35B-A3B:

  • Parameter count: 35 billion total (the A3B suffix follows Qwen's naming convention for mixture-of-experts models with roughly 3 billion active parameters per token)
  • Quantized size: 20.9GB (Q4_K_S format)
  • Deployment: Local via LM Studio
  • Released: April 16, 2026 (announced by Alibaba)

Claude Opus 4.7:

  • Parameter count: Not disclosed
  • Deployment: API only
  • Released: April 16, 2026 (announced by Anthropic)
  • Tested with both standard and maximum thinking levels

Analysis Limitations

Willison noted that this informal benchmark tests only a narrow capability and should not be interpreted as evidence that the quantized Qwen model is generally more capable than Opus 4.7. "I very much doubt that a 21GB quantized version of their latest model is more powerful or useful than Anthropic's latest proprietary release," he wrote.

The benchmark has historically tracked general improvements in model capability since October 2024, when early models produced poor results. Recent flagship models such as Gemini 3.1 Pro have generated production-quality illustrations on the test.

What This Means

This result demonstrates that specialized performance on specific tasks can vary significantly between models regardless of overall capability or size. A 35B parameter model running locally in quantized form matched or exceeded a flagship proprietary model on SVG generation, while likely trailing in most other benchmarks.

The finding also highlights the growing sophistication of local LLMs. A model small enough to run on consumer hardware (20.9GB) can now compete with cloud-based flagship models on certain creative tasks, though general-purpose performance gaps remain significant.

For developers specifically needing SVG generation capabilities, this suggests testing multiple models on actual use cases rather than relying solely on general benchmark scores or parameter counts.
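One way to start such task-specific testing is to add a cheap automated sanity check on each model's output before eyeballing the renders. The heuristic below is an illustration, not Willison's judging method (he compared the rendered images): it verifies the SVG parses as XML and counts circle/ellipse elements as stand-ins for wheels.

```python
import xml.etree.ElementTree as ET

def crude_bicycle_check(svg_text: str) -> dict:
    """Deliberately simple structural check for an 'X riding a bicycle'
    SVG: must parse as XML and contain at least two wheel-like shapes.
    A heuristic sketch only -- real evaluation still needs a human (or
    vision model) looking at the rendered image."""
    try:
        root = ET.fromstring(svg_text)
    except ET.ParseError:
        return {"valid_xml": False, "wheel_like_shapes": 0}
    # Handle both namespaced and bare SVG tags.
    ns = "{http://www.w3.org/2000/svg}"
    wheels = sum(
        1 for el in root.iter()
        if el.tag in (f"{ns}circle", f"{ns}ellipse", "circle", "ellipse")
    )
    return {"valid_xml": True, "wheel_like_shapes": wheels}
```

Running every candidate model's output through a check like this filters out responses that fail on basics, so human comparison time goes to the plausible candidates.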
