China's Best Open Coding Model Won't Show Its Work
Moonshot's Kimi K2.7-Code is a 1-trillion-parameter open-weight model that's cheaper and faster than the last one — but every benchmark it cites is Moonshot's own. That's the new pattern worth watching.
The release looks like an unambiguous win for open-weight AI. On June 12, Moonshot AI released Kimi K2.7-Code on Hugging Face: a 1-trillion-parameter Mixture-of-Experts model with 32 billion active parameters, shipped under a Modified MIT license with a 256K-token context window. It is the company's fifth major model in under a year, and on Moonshot's own numbers it is a meaningful step up: +21.8% on Kimi Code Bench v2 and +31.5% on MLS Bench Lite over the previous K2.6, while using roughly 30% fewer reasoning tokens to get there. The API runs $0.95 per million input tokens and $4 per million output tokens. It is cheaper, faster, and more capable, and the weights are free for anyone to download. The open-weight coding market has entered a phase nobody quite priced in: models getting cheaper and better at the same time.
And then you read the fine print, and the whole picture gets more interesting — and more cautionary.
Every benchmark is Moonshot's own
Here is the detail most of the coverage glided past. As of the release, every benchmark published for K2.7-Code is one of Moonshot's own proprietary evals — Kimi Code Bench v2, MLS Bench Lite, and the rest. There are, so far, no independent third-party results on the standard public suites the industry actually uses to compare models: SWE-bench Verified, SWE-bench Pro, Terminal-Bench, LiveCodeBench, GPQA Diamond, AIME, or MMLU-Pro.
The self-reported comparisons are flattering in framing but telling in substance. On Kimi Code Bench v2, K2.7-Code scores 62.0 against GPT-5.5's 69.0 and Claude Opus 4.8's 67.4, so even on Moonshot's home turf the frontier closed models lead. On the multi-language MLS Bench Lite, K2.7-Code scores 35.1, nearly matching GPT-5.5's 35.5. Those are respectable numbers. But a benchmark you built, scored yourself, and chose to publish is a marketing artifact until someone else can reproduce it. The credibility of a benchmark claim rests entirely on closing the gap between "we measured this" and "the field verified this," and right now that gap is wide open.
Why this pattern is spreading
Moonshot is not uniquely guilty here; it's an unusually clean example of a pattern sweeping the open-weight world, and the Chinese labs in particular. The incentive is structural. Public benchmarks like SWE-bench Verified have become so contested, so heavily optimized against, and so prone to contamination — where test problems leak into training data — that labs increasingly distrust them as differentiators. So they build their own. A proprietary benchmark lets a company define the test, tune to it, and present a clean upward line release over release. It is genuinely useful as an internal yardstick. It is much less useful as an external claim, because no one outside the lab can audit whether the model is good or whether the test was built to make it look good.
The result is an information environment where a model can ship with impressive-sounding numbers that are technically true and practically unverifiable. For a closed API that might be tolerable, since running the model and judging its output is the only evidence anyone expects to have. For an open-weight release it creates a strange tension: the weights are maximally transparent, downloadable and inspectable by anyone, while the evidence for how good they are is the most opaque part of the package.
The open-weight flywheel is real regardless
Step back from the benchmark question and the larger trend is undeniable and, on balance, good for everyone who builds with these tools. Open-weight coding models are improving on a steep curve while their prices fall, and that combination is rewriting the economics of AI-assisted software work. A capable, downloadable, MIT-ish-licensed model with a 256K context and sub-dollar input pricing is a serious foundation for any team that wants to run coding agents on its own infrastructure, fine-tune for a private codebase, or simply avoid metering every keystroke through a closed API.
The "30% fewer reasoning tokens" claim, if it holds up, is the most economically meaningful number in the release. Reasoning tokens are pure cost and pure latency — the model thinking out loud before it answers. Cutting them by a third without losing capability is exactly the kind of efficiency gain that compounds across the thousands of calls an autonomous coding agent makes. Cheaper thinking is what makes agentic coding affordable at scale, and it's where the open models are competing hardest.
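A back-of-the-envelope sketch in Python shows how that saving might play out at the API prices quoted above. The workload figures (calls per run, tokens per call) are hypothetical, and the sketch assumes reasoning tokens are billed at the output rate, which is an assumption rather than something the release confirms.

```python
# Prices quoted in the article, in USD per million tokens.
INPUT_PRICE_PER_M = 0.95
OUTPUT_PRICE_PER_M = 4.00


def call_cost(input_tokens, answer_tokens, reasoning_tokens):
    """USD cost of one call, ASSUMING reasoning tokens bill at the output rate."""
    output_tokens = answer_tokens + reasoning_tokens
    total = input_tokens * INPUT_PRICE_PER_M + output_tokens * OUTPUT_PRICE_PER_M
    return total / 1_000_000


def run_cost(calls, input_tokens, answer_tokens, reasoning_tokens):
    """USD cost of an agent run made of identical calls."""
    return calls * call_cost(input_tokens, answer_tokens, reasoning_tokens)


# Hypothetical agent workload; none of these figures come from Moonshot.
CALLS = 2_000
INPUT_TOKENS = 20_000
ANSWER_TOKENS = 500
REASONING_TOKENS = 3_000

baseline = run_cost(CALLS, INPUT_TOKENS, ANSWER_TOKENS, REASONING_TOKENS)
# Apply the claimed ~30% cut to reasoning tokens only.
reduced = run_cost(CALLS, INPUT_TOKENS, ANSWER_TOKENS, REASONING_TOKENS * 0.7)
saving = (baseline - reduced) / baseline

print(f"baseline: ${baseline:.2f}")  # baseline: $66.00
print(f"reduced:  ${reduced:.2f}")   # reduced:  $58.80
print(f"saving:   {saving:.1%}")     # saving:   10.9%
```

With these made-up numbers the run drops from $66.00 to $58.80, about 11%, because input tokens dominate the bill. The saving scales with the share of spend that reasoning tokens account for, so it is worth plugging in your own token mix before counting on the headline figure.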
How to read a release like this
So treat Kimi K2.7-Code the way you should treat any model that arrives with a wall of self-reported wins: as a credible, promising release whose real ranking is unknown until the independent evals land. Download it, run it on your own problems, and weight your own results far above any number Moonshot — or any lab — chose to publish about itself. The open-weight world's greatest strength is that you don't have to trust the press release; you can verify with the artifact in hand.
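In practice, "run it on your own problems" can start as a very small harness. The sketch below is one possible shape, not a standard tool: `pass_rate`, `compare`, the stub models, and the single task are all hypothetical names invented for illustration. A real setup would replace the stubs with calls to whatever models you are comparing and use tasks drawn from your own codebase.

```python
from typing import Callable, Dict, List, Tuple

# A task pairs a prompt with a checker that returns True if the output is acceptable.
Task = Tuple[str, Callable[[str], bool]]


def pass_rate(generate: Callable[[str], str], tasks: List[Task]) -> float:
    """Fraction of tasks whose generated output passes its checker."""
    if not tasks:
        raise ValueError("need at least one task")
    passed = 0
    for prompt, check in tasks:
        try:
            if check(generate(prompt)):
                passed += 1
        except Exception:
            # A crash during generation or checking counts as a failure.
            pass
    return passed / len(tasks)


def compare(models: Dict[str, Callable[[str], str]], tasks: List[Task]) -> Dict[str, float]:
    """Score every model on the same task list."""
    return {name: pass_rate(generate, tasks) for name, generate in models.items()}


# Stand-ins for real model calls (hypothetical; swap in your own clients).
def stub_model_a(prompt: str) -> str:
    return "def add(a, b):\n    return a + b"


def stub_model_b(prompt: str) -> str:
    return "def add(a, b):\n    return a - b"


def check_add(source: str) -> bool:
    """Execute the generated code and test it. Only do this inside a sandbox."""
    namespace: dict = {}
    exec(source, namespace)
    return namespace["add"](2, 3) == 5


TASKS: List[Task] = [("Write a Python function add(a, b) that returns the sum.", check_add)]
results = compare({"model_a": stub_model_a, "model_b": stub_model_b}, TASKS)
print(results)  # {'model_a': 1.0, 'model_b': 0.0}
```

The point of the design is that every model sees the same prompts and is graded by the same checkers, which you wrote and can audit. That is exactly the property a vendor-built benchmark cannot offer an outside reader.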
That's the discipline the current moment demands. The models are getting genuinely better and genuinely cheaper, and the Chinese open-weight labs are pushing that frontier as hard as anyone. But the benchmark layer has quietly become the least trustworthy part of the stack, precisely because it's the easiest to game and the hardest to audit. The weights are open. The scoreboard isn't. Until a third party confirms the numbers, the right posture toward K2.7-Code is the oldest one in engineering: don't tell me it's fast — show me, on my machine, on my problem.
