model releaseMoonshot AI

Moonshot AI releases Kimi K2.7 Code with 1T parameters, 256K context window, 30% lower thinking token usage

TL;DR

Moonshot AI has released Kimi K2.7 Code, a 1 trillion parameter Mixture-of-Experts model designed for long-horizon coding tasks. The model features a 256K context window and reduces thinking token usage by approximately 30% compared to its predecessor K2.6.

2 min read
0

Moonshot AI Releases Kimi K2.7 Code with 1T Parameters

Moonshot AI has released Kimi K2.7 Code, a 1 trillion parameter Mixture-of-Experts (MoE) model built on the K2.6 architecture. The model features 32 billion activated parameters, a 256K token context window, and reduces thinking token usage by approximately 30% compared to K2.6.

Architecture and Specifications

Kimi K2.7 Code uses a 384-expert MoE architecture with 8 experts selected per token, plus 1 shared expert. The model includes 61 total layers (including 1 dense layer), 64 attention heads, and a 160K vocabulary size. It employs Multi-head Latent Attention (MLA) with SwiGLU activation and integrates MoonViT, a 400M parameter vision encoder, for multimodal capabilities supporting image and video input.

The model is available in native INT4 quantization using the same method as Kimi-K2-Thinking.

Benchmark Performance

According to Moonshot AI, K2.7 Code shows substantial improvements across coding and agentic benchmarks:

Coding benchmarks:

  • Kimi Code Bench v2: 62.0 (vs 50.9 for K2.6)
  • Program Bench: 53.6 (vs 48.3 for K2.6)
  • MLS Bench Lite: 35.1 (vs 26.7 for K2.6)

Agentic benchmarks:

  • Kimi Claw 24/7 Bench: 46.9 (vs 42.9 for K2.6)
  • MCP Atlas: 76.0 (vs 69.4 for K2.6)
  • MCP Mark Verified: 81.1 (vs 72.8 for K2.6)

Moonshot AI compared K2.7 Code against GPT-5.5 and Claude Opus 4.8, though these comparison scores cannot be independently verified. In Moonshot's testing, GPT-5.5 led most benchmarks, with K2.7 Code placing second or third depending on the task.

Deployment and Availability

The model API is available at platform.moonshot.ai with OpenAI and Anthropic-compatible interfaces. Pricing per million tokens has not been disclosed. The model requires transformers version 4.57.1 or higher (but below 5.0.0) and can be deployed using vLLM, SGLang, or KTransformers inference engines.

Kimi K2.7 Code operates exclusively in thinking mode with preserve_thinking forced to True. Moonshot AI recommends a temperature of 1.0 and top_p of 0.95 for inference. Video content chat is currently experimental and only supported through the official API.

What This Means

Kimi K2.7 Code represents Moonshot AI's push into specialized coding models with extended reasoning capabilities. The 30% reduction in thinking token usage addresses a practical efficiency concern for reasoning models in production environments. The 256K context window positions it competitively for repository-level code understanding tasks, though it remains shorter than some competitors offering 1M+ token windows. The model's MoE architecture with 384 experts and INT4 quantization support suggests Moonshot is optimizing for deployment efficiency alongside raw capability.

Related Articles

model release

Nex AGI Releases Nex-N2-Pro: 17B Active Parameter MoE Model with 262K Context Window

Nex AGI has released Nex-N2-Pro, a mixture-of-experts model with 17 billion active parameters from a total of 397 billion parameters. Built on the Qwen3.5 architecture, the model offers a 262,144 token context window and is available for free through OpenRouter.

model release

Nex AGI Releases Nex-N2-Pro: 397B Parameter MoE Model With 262K Context, Available Free

Nex AGI has released Nex-N2-Pro, an agentic mixture-of-experts model with 397B total parameters and 17B active parameters. The model features a 262K token context window and is available free via OpenRouter's API.

model release

Google DeepMind releases DiffusionGemma, a 26B parameter model generating 15-20 tokens per forward pass via discrete dif

Google DeepMind released DiffusionGemma, a 26B parameter mixture-of-experts model that generates text using discrete diffusion instead of autoregression. The model processes blocks of 256 tokens in parallel, achieving generation speeds exceeding 1100 tokens per second on H100 GPUs in low-batch settings.

model release

Nvidia releases Nemotron 3 Ultra: 550B-parameter MoE model with 1M context window for agentic workflows

Nvidia has released Nemotron 3 Ultra, a 550-billion parameter mixture-of-experts model with 55 billion active parameters and support for up to 1 million token context windows. The model uses a hybrid Transformer-Mamba architecture and is designed specifically for long-running agentic workflows including agent orchestration, coding agents, and complex enterprise tasks.

Comments

Loading...