Tencent improves testing of creative AI models with new benchmark
- 1 Emmetttug [2025/08/09(Sat) 01:29]
- Getting it right, like a human would
So, how does Tencent's AI benchmark work? First, an AI is given a creative task from a catalogue of over 1,800 challenges, from building data visualisations and web apps to making interactive mini-games.
Once the AI generates the code, ArtifactsBench gets to work. It automatically builds and runs the code in a secure, sandboxed environment.
To see how the application behaves, it captures a series of screenshots over time. This allows it to check for things like animations, state changes after a button click, and other dynamic user feedback.
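As a rough illustration of that screenshot-over-time check, here is a minimal sketch in Python. It is not ArtifactsBench's actual code: the `render` callable stands in for a headless browser, and the function names, `Snapshot` type, and frame hashing are all illustrative assumptions.

```python
import hashlib
from dataclasses import dataclass

@dataclass
class Snapshot:
    t_ms: int     # when the frame was captured, in milliseconds
    digest: str   # hash of the rendered frame bytes (stand-in for an image)

def capture_timeline(render, timestamps_ms):
    """Render the app at each timestamp and record a snapshot.

    `render` plays the role of a headless browser here: a function that
    returns the frame bytes for the app's state at a given time.
    """
    shots = []
    for t in timestamps_ms:
        frame = render(t)
        shots.append(Snapshot(t, hashlib.sha256(frame).hexdigest()))
    return shots

def detect_dynamic_behaviour(shots):
    """If any two consecutive frames differ, the app is doing something
    dynamic (an animation, a state change after a click, etc.)."""
    return any(a.digest != b.digest for a, b in zip(shots, shots[1:]))
```

For example, a fake app whose frame changes 500 ms in (simulating a post-click state change) would be flagged as dynamic, while an app that renders the same frame forever would not.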
Finally, it hands all this evidence, the original request, the AI's code, and the screenshots, to a Multimodal LLM (MLLM) to act as a judge.
This MLLM judge isn't just giving a vague opinion; instead, it uses a detailed, per-task checklist to score the result across ten distinct metrics. Scoring includes functionality, user experience, and even aesthetic quality. This ensures the scoring is fair, consistent, and thorough.
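The checklist aggregation could look something like this minimal sketch. The plain mean, the 0-10 scale, and the function name are assumptions; the article only says there are ten metrics, among them functionality, user experience, and aesthetics.

```python
from statistics import mean

NUM_METRICS = 10  # the article says scoring spans ten distinct metrics

def score_artifact(checklist_scores: dict) -> float:
    """Aggregate per-metric checklist scores into a single score.

    Assumes each metric is scored 0-10 and combined with a plain mean;
    the article does not state the actual aggregation rule.
    """
    if len(checklist_scores) != NUM_METRICS:
        raise ValueError(
            f"expected {NUM_METRICS} metric scores, got {len(checklist_scores)}"
        )
    for name, s in checklist_scores.items():
        if not 0 <= s <= 10:
            raise ValueError(f"metric {name!r} score {s} out of range 0-10")
    return mean(checklist_scores.values())
```

Validating that all ten metrics are present before averaging keeps a missing checklist item from silently inflating (or deflating) the final score.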
The big question is: does this automated judge actually have good taste? The results suggest it does.
When the rankings from ArtifactsBench were compared to WebDev Arena, the gold-standard platform where real humans vote on the best AI creations, they matched up with a 94.4% consistency. This is a massive jump over older automated benchmarks, which only managed around 69.4% consistency.
On top of this, the framework's judgments showed over 90% agreement with professional human developers.
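One plausible way to read a ranking-consistency figure like 94.4% is pairwise agreement: the fraction of model pairs that two leaderboards order the same way. The article does not state the exact formula, so the sketch below is an assumption for illustration only.

```python
from itertools import combinations

def pairwise_consistency(rank_a, rank_b):
    """Fraction of item pairs ordered the same way by both rankings.

    `rank_a` and `rank_b` are lists of the same items, best first.
    Returns 1.0 for identical rankings, 0.0 for fully reversed ones.
    """
    pos_a = {m: i for i, m in enumerate(rank_a)}
    pos_b = {m: i for i, m in enumerate(rank_b)}
    pairs = list(combinations(rank_a, 2))
    agree = sum(
        (pos_a[x] < pos_a[y]) == (pos_b[x] < pos_b[y])
        for x, y in pairs
    )
    return agree / len(pairs)
```

For instance, two four-model leaderboards that differ only by one adjacent swap agree on 5 of 6 pairs, about 83% consistency by this measure.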
https://www.artificialintelligence-news.com/