[{"data":1,"prerenderedAt":779},["ShallowReactive",2],{"blog-preview-en":3},[4,356,607],{"id":5,"title":6,"body":7,"date":343,"description":344,"extension":345,"meta":346,"navigation":347,"path":348,"seo":349,"stem":350,"tags":351,"translationSlug":354,"__hash__":355},"content\u002Fblog\u002Fen\u002Farenai-genesis.md","ArenAI: Measuring the Social Intelligence of LLMs",{"type":8,"value":9,"toc":328},"minimark",[10,15,33,36,45,48,52,55,66,76,87,90,93,97,106,109,114,121,124,127,131,151,154,158,161,167,170,174,183,189,198,201,207,214,217,223,240,243,255,262,265,268,272,279,288,315,319,322,325],[11,12,14],"h2",{"id":13},"the-starting-point","The starting point",[16,17,18,19,26,27,32],"p",{},"It all started with a video. ",[20,21,25],"a",{"href":22,"rel":23},"https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=obW2nDp2ufU",[24],"nofollow","Underscore_"," recently published \"Pour qui votent les IA ?\" (Who do AIs vote for?), in which he discusses ",[20,28,31],{"href":29,"rel":30},"https:\u002F\u002Fwerewolf.foaster.ai\u002F",[24],"Foaster",", a study that makes LLMs play Werewolf. The idea: put models like ChatGPT, Claude or Gemini in a game, assign them a role (villager, werewolf), and observe how they handle lying, detecting liars, forming alliances and manipulating the group.",[16,34,35],{},"The results are fascinating. GPT-5 dominates the games. Models develop emergent strategies. Some bluff convincingly, others consistently give themselves away.",[16,37,38,39,44],{},"I wanted to dig deeper. Problem: Foaster's code isn't public. Their ",[20,40,43],{"href":41,"rel":42},"https:\u002F\u002Fgithub.com\u002FFoaster-ai\u002FWerewolf-bench",[24],"GitHub"," contains just 4 Werewolf games played by 5 models. The rules are barely explained, the game engine is missing. Impossible to reproduce the experiment or extend it.",[16,46,47],{},"And more importantly, I asked myself a question: is GPT-5's dominance due to something specific about Werewolf? 
Or would GPT-5 also dominate in other games with very different social mechanics?",[11,49,51],{"id":50},"the-games","The games",[16,53,54],{},"Having had the chance to play them with real humans, I had other social deduction games in mind. Extraordinary games to play in groups, that accommodate many players and reveal a lot about them. Very different social dynamics from one another.",[16,56,57,65],{},[58,59,60],"strong",{},[20,61,64],{"href":62,"rel":63},"https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FThe_Werewolves_of_Millers_Hollow",[24],"Werewolf",": the classic. Villagers versus Werewolves. Depending on the number of players, special roles are added: Seer, Witch, Mayor, Hunter, and many others. The wolves discuss privately at night, then must act innocent during the day. From 6 to 20 players. The core mechanic: concealment and reading behavior.",[16,67,68,75],{},[58,69,70],{},[20,71,74],{"href":72,"rel":73},"https:\u002F\u002Fwww.tuesdayknightgames.com\u002Fproducts\u002Ftwo-rooms-and-a-boom",[24],"Two Rooms and a Boom",": two teams, two rooms, one bomb. The Blues protect the President, the Reds position the Bomber. Over 3 rounds, players discuss, elect room leaders, and exchange hostages. You can show your card to prove your team, but verbal claims are unverified. Here too, dozens of special roles add complexity depending on the number of players. The core mechanic: negotiation and selective trust.",[16,77,78,86],{},[58,79,80,85],{},[20,81,84],{"href":82,"rel":83},"https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FSecret_Hitler",[24],"Secret H*****"," (Dictator)",": hidden liberals and fascists, policy votes, and a dictator trying to get elected chancellor without being identified. The core mechanic: paranoia and shifting alliances.",[16,88,89],{},"These three games test very different social skills. 
If a model dominates across all three, it's not an artifact of a single game mechanic: it's a genuine superiority in social intelligence.",[16,91,92],{},"An important point: these games are designed for humans around a table. They partly rely on perception (tone of voice, hesitations, glances). To adapt them for AI, we had to discard all rules tied to physical perception and keep only what works through text: declarations, votes, card shares, discussions. It's a necessary adaptation that slightly changes the dynamics, but preserves the essentials: bluffing, deduction and social manipulation.",[11,94,96],{"id":95},"the-platform","The platform",[16,98,99,100,105],{},"So I built ",[20,101,104],{"href":102,"rel":103},"https:\u002F\u002Fgithub.com\u002Fplduhoux\u002Farenai",[24],"ArenAI",": a platform that orchestrates multi-agent games between different LLMs. The game engine manages the rules, turns, private and public communication channels. Models receive only the game rules and their role. No coaching, no strategic hints, no \"you should bluff\" or \"protect your identity.\"",[16,107,108],{},"Werewolf is fully implemented (6 to 20 players, ~50-60k tokens per game). Two Rooms and a Boom is functional (~100k tokens per game). Secret Dictator is under development (~200k tokens per game). A built-in ELO ranking system updates each model's score after every game, based on their measured social intelligence.",[110,111,113],"h3",{"id":112},"conversational-sessions","Conversational sessions",[16,115,116,117,120],{},"An important architectural choice: each player has their own ",[58,118,119],{},"persistent conversational session",". The game rules and the player's role are sent once, at the start of the game. 
Then each turn adds only the new events to the conversation.",[16,122,123],{},"The alternative (rebuilding the complete prompt on each call) posed several problems: prohibitive token costs (the same context sent dozens of times), artificial memory (previous turns summarized to limit tokens, so the player \"forgets\" details), and above all a benchmark bias. When the game rules are re-injected 100 times during a game, you're testing the model's ability to follow repeated instructions, not its social intelligence.",[16,125,126],{},"With persistent sessions, the player remembers every word spoken, every vote, every accusation. It can spot subtle inconsistencies in other players' speech. It's a more faithful test, comparable to a human who learns the rules at the start of a game then focuses on playing.",[110,128,130],{"id":129},"private-thoughts","Private thoughts",[16,132,133,134,137,138,141,142,146,147,150],{},"Every time a player is asked to speak, they're asked to separate what they ",[58,135,136],{},"think"," from what they ",[58,139,140],{},"say publicly",". Concretely, the model first produces a ",[143,144,145],"code",{},"THOUGHT:"," block (its internal reflection), then its ",[143,148,149],{},"MESSAGE:"," (what it says to others). Thoughts are never injected into other players' context.",[16,152,153],{},"This is doubly useful. For the model, it's a decompression chamber: it must process its private information (its role, what it knows) before speaking publicly, which reduces accidental leaks. For us, it's a window into the player's reasoning: we see exactly why a werewolf decides to sacrifice its partner, or why a seer chooses not to reveal their role.",[110,155,157],{"id":156},"information-boundaries","Information boundaries",[16,159,160],{},"The real challenge for an LLM in these games is managing what it knows versus what it should say. Role assignment is known only to the player. The wolves' chat is visible only to the wolves. 
The seer's or witch's results are known only to the player concerned. Verbal claims can be lies. And private thoughts are visible only to the observer.",[16,162,163,164],{},"It's this management of information boundaries that makes the benchmark interesting. ",[58,165,166],{},"A model that \"leaks\" its private information into its public statements is a bad liar. A model that compartmentalizes perfectly is an excellent liar.",[16,168,169],{},"What we ultimately measure: the ability to lie credibly, theory of mind (reasoning about what others know and suspect), context isolation (keeping secrets), strategic inference (deducing the right strategy from the rules alone), persuasion (changing others' votes through argumentation), and coalition detection (spotting coordinated behavior).",[11,171,173],{"id":172},"early-results","Early results",[16,175,176,177,182],{},"The first games reveal interesting things. Thanks to the private thoughts system, we can observe players' internal reasoning. Here are some examples from a ",[20,178,181],{"href":179,"rel":180},"https:\u002F\u002Farenai.plduhoux.fr\u002Fgame\u002F69dc5328-84ab-436e-8bc3-32caf6103710",[24],"Werewolf game GPT-5.4 vs Opus 4.6",".",[16,184,185,188],{},[58,186,187],{},"Better models lie better."," Clara, a werewolf, senses that her partner David is about to be eliminated. Instead of defending him (which would blow her cover), she turns on him and plants seeds of doubt for the next day:",[190,191,192],"blockquote",{},[16,193,194],{},[195,196,197],"em",{},"\"I'll be honest - the case on David has grown on me. His pivot to Bruno instead of offering real reads is a fair criticism. That said, Alice and Gina have been almost perfectly aligned all day, and if David flips villager tomorrow, I want us to remember how quickly that consensus formed.\"",[16,199,200],{},"She even votes against her own partner to maintain her cover. 
A calculated sacrifice.",[16,202,203,206],{},[58,204,205],{},"Emergent strategies appear."," Nobody explained to the models how to play. Yet Gina (seer) discovers that David is a werewolf and pushes for his elimination without ever revealing her role. She builds a purely logical argument:",[190,208,209],{},[16,210,211],{},[195,212,213],{},"\"I don't like Eva opening with pressure on the 'low-profile' players when that's a very easy day-1 frame. My stronger suspicion is David: his mayor pitch was generic, and now that we know the wolves hit someone important enough for a save, I want sharper reasoning from him.\"",[16,215,216],{},"The group follows her reasoning and eliminates David. Gina never needed to say \"I'm the seer.\"",[16,218,219,222],{},[58,220,221],{},"Private communication changes everything."," At night, the wolves have a private channel. That's where you see the real strategic coordination:",[190,224,225,230,235],{},[16,226,227],{},[195,228,229],{},"Clara: \"David, I think we should avoid Eva tonight - Witch might save the new mayor. Let's go for someone quiet like Felix.\"",[16,231,232],{},[195,233,234],{},"David: \"Agreed. Quiet players often hide power roles, and the Witch is unlikely to waste her save on him.\"",[16,236,237],{},[195,238,239],{},"Clara: \"Don't defend me too hard and I won't defend you - we act independent.\"",[16,241,242],{},"The irony: Felix was indeed the witch, and she saved herself that night.",[16,244,245,248,249,254],{},[58,246,247],{},"Social deduction also works in Two Rooms."," In a ",[20,250,253],{"href":251,"rel":252},"https:\u002F\u002Farenai.plduhoux.fr\u002Fgame\u002Fa9428a8e-f003-4025-a73e-2a9f5bf5625d",[24],"Two Rooms game GPT-5.4 vs Gemini with 8 players",", Gina (Red) finds herself in a room where Bruno and Hugo have just mutually verified each other as Blue. 
When Bruno suggests everyone share their cards in pairs, Gina dodges with a strategically credible argument:",[190,256,257],{},[16,258,259],{},[195,260,261],{},"\"I'm a bit wary of creating two separate pairs. A better plan might be for Bruno to share with Hugo, and then we decide our next share based on that result, creating a chain of trust rather than isolated links.\"",[16,263,264],{},"In her private thoughts, we read: \"Bruno's plan forces me to reveal my red color to Clara, which is a terrible start.\" She derived an avoidance strategy from the rules alone, without being told to lie.",[16,266,267],{},"In the same game, Clara (Bomber) infiltrates the room where the President is located. She plays the cooperative Blue for three rounds, shares verifiable information about the other rooms to earn the group's trust, and nearly manages to stay in the President's room until the end. Her thoughts reveal constant calculation: \"Revealing I am the Bomber to a room containing the likely President would be game-losing suicide; my only chance to stay is to remain unverified.\"",[11,269,271],{"id":270},"try-it-and-explore","Try it and explore",[16,273,274,275,278],{},"The project on ",[20,276,43],{"href":102,"rel":277},[24]," allows anyone to run games locally with their own models. When running, you can watch games play out live: the different LLMs interact, you can pause, restart, observe strategies in real time.",[16,280,281,282,287],{},"Once games are finished, you can generate a static site to browse them. That's what's done on ",[20,283,286],{"href":284,"rel":285},"https:\u002F\u002Farenai.plduhoux.fr",[24],"arenai.plduhoux.fr",": games are browsable in full, exchange by exchange, vote by vote. For each player, you can individually view their session: the interactions they had with the game master, how they responded, what information they had at that moment. 
It's this level of detail that makes the analysis interesting.",[16,289,290,291,296,297,302,303,308,309,314],{},"The site also offers a ",[20,292,295],{"href":293,"rel":294},"https:\u002F\u002Farenai.plduhoux.fr\u002Fstats",[24],"statistics"," page with three views: ",[20,298,301],{"href":299,"rel":300},"https:\u002F\u002Farenai.plduhoux.fr\u002Fstats?section=statistics",[24],"general stats"," (win rates by model, by side, win reasons), the ",[20,304,307],{"href":305,"rel":306},"https:\u002F\u002Farenai.plduhoux.fr\u002Fstats?section=elo",[24],"ELO ranking"," (overall, Good side, Evil side), and the ",[20,310,313],{"href":311,"rel":312},"https:\u002F\u002Farenai.plduhoux.fr\u002Fstats?section=tokens",[24],"detailed token breakdown"," per game and per model, with cost estimates. You can filter by game to compare model performance on Werewolf or Two Rooms separately.",[11,316,318],{"id":317},"whats-next","What's next",[16,320,321],{},"Now, games need to be played. Lots of games. For each game type, with each model combination, and enough data accumulated for the ELO ranking to become statistically significant. With just a handful of games, scores are still too volatile to draw reliable conclusions.",[16,323,324],{},"There's also Secret Dictator to refine. The challenge: finding a gameplay that respects the spirit of the board game (confidence votes, betrayals, accusations) without blowing the token budget. 
At ~200k tokens per game, it's the most expensive of the three, and the balance between fidelity and viability hasn't been found yet.",[16,326,327],{},"Over hundreds of games, the real trends should emerge: which model lies best, which one detects liars, which one coordinates a team most effectively.",{"title":329,"searchDepth":330,"depth":330,"links":331},"",2,[332,333,334,340,341,342],{"id":13,"depth":330,"text":14},{"id":50,"depth":330,"text":51},{"id":95,"depth":330,"text":96,"children":335},[336,338,339],{"id":112,"depth":337,"text":113},3,{"id":129,"depth":337,"text":130},{"id":156,"depth":337,"text":157},{"id":172,"depth":330,"text":173},{"id":270,"depth":330,"text":271},{"id":317,"depth":330,"text":318},"2026-03-15","The genesis of ArenAI, a platform that makes LLMs play social deduction games. Why, how, and what it reveals.","md",{},true,"\u002Fblog\u002Fen\u002Farenai-genesis",{"title":6,"description":344},"blog\u002Fen\u002Farenai-genesis",[352,104,353],"Artificial intelligence","Side project","arenai-genese","ULuljFE3oT93ij6CUrR9vdxxE3R_4nmEYdD8q0jUcGw",{"id":357,"title":358,"body":359,"date":595,"description":596,"extension":345,"meta":597,"navigation":347,"path":598,"seo":599,"stem":600,"tags":601,"translationSlug":605,"__hash__":606},"content\u002Fblog\u002Fen\u002Fdeveloping-with-an-ai-agent.md","Developing an Application with a Dedicated AI Agent: Lessons Learned",{"type":8,"value":360,"toc":587},[361,365,368,377,380,418,421,424,427,431,434,440,446,449,456,460,463,490,493,496,500,503,506,509,512,515,519,522,548,551,557,561,564,567,584],[11,362,364],{"id":363},"the-setup","The setup",[16,366,367],{},"The initial idea was simple: give an AI agent its own development environment, isolated, with everything it needs to code, test and deliver. Not a copilot in an editor. 
A real workstation.",[16,369,370,371,376],{},"In practice, the agent runs in a Docker container with ",[20,372,375],{"href":373,"rel":374},"https:\u002F\u002Fopenclaw.ai",[24],"OpenClaw",", an AI agent framework. It communicates through private Discord channels, accessible only to team members. Each project has its own dedicated channel.",[16,378,379],{},"What it has at its disposal:",[381,382,383,391,398,405,411],"ul",{},[384,385,386,387,390],"li",{},"The project's Git repos ",[58,388,389],{},"cloned locally"," in the container",[384,392,393,394,397],{},"A ",[58,395,396],{},"dedicated GitLab account"," with a developer role: it can create branches and merge requests, but can't merge itself",[384,399,400,401,404],{},"Its own ",[58,402,403],{},"database"," with test data",[384,406,400,407,410],{},[58,408,409],{},"servers",": API, front-end web server, Ionic serve for mobile app testing",[384,412,413,414,417],{},"Everything exposed through ",[58,415,416],{},"Cloudflare tunnels",", allowing immediate testing of each feature from a browser or phone",[16,419,420],{},"In practice, within its 6 GB of RAM, the agent runs a Strapi server (API), two Ionic serve instances (for two mobile apps in parallel), a Nuxt server for the back-office and another for the application under development. The PostgreSQL database runs in a dedicated container alongside. It manages the lifecycle of its servers on its own: starting and stopping the right services at the right time to optimize its memory.",[16,422,423],{},"Everything exposed through Cloudflare tunnels: when the agent develops a feature, it's accessible within seconds on a public URL.",[16,425,426],{},"On the security side, the model is deliberately restrictive. The agent runs in an isolated Docker container. It has no project secrets: no production API keys, no access to real databases, no sensitive credentials. The only secret it holds is a GitLab key giving it a developer role, with no merge permissions. 
I can kill the container and spin it back up from scratch in minutes. It's a disposable environment by design.",[11,428,430],{"id":429},"the-daily-workflow","The daily workflow",[16,432,433],{},"The agent receives a request (feature, bug fix, refactoring), develops and tests locally. On our end, we verify functionally through the Cloudflare tunnel that everything works. Once validated, it commits. Periodically, I ask it to open a merge request to integrate the work into the project, and it keeps the MR up to date with new commits. Builds and CI are my responsibility.",[16,435,436,439],{},[58,437,438],{},"On the API side, I review everything."," Always. There are security concerns: input validation, permission management, data protection. Non-negotiable, and it will stay that way.",[16,441,442,445],{},[58,443,444],{},"On the front end, I've gradually loosened the reins."," At first, it wasn't smooth. Components reinvented when they already existed, CSS built differently from the rest of the project, labels hardcoded instead of being externalized in i18n. I reviewed every merge request line by line.",[16,447,448],{},"Then the agent's skills evolved, the project context was better documented, and the results followed. Today, the code produced respects the architecture: the same component granularity, the same patterns, the same utility classes. It almost never gets it wrong anymore. I skim front-end MRs more than I scrutinize them.",[16,450,451,452,455],{},"This is probably the most surprising aspect: the agent doesn't just produce code that works. It produces code that ",[195,453,454],{},"fits in",". Because it has access to the entire project and understands its structure.",[11,457,459],{"id":458},"bmad-starting-a-new-project-from-scratch","BMAD: starting a new project from scratch",[16,461,462],{},"For a new project, we use the BMAD method (a development framework driven by AI agents). 
The process:",[464,465,466,472,478,484],"ol",{},[384,467,468,471],{},[58,469,470],{},"Brainstorm and PRD",": an analyst agent helps formalize the need, identify user stories, define scope",[384,473,474,477],{},[58,475,476],{},"Breakdown"," into epics and features, updated regularly as new requirements emerge",[384,479,480,483],{},[58,481,482],{},"Agile development",": the agent implements story by story, with the same Cloudflare tunnel exposure for testing as we go",[384,485,486,489],{},[58,487,488],{},"Code review and iteration",": continuous feedback, corrections, adjustments",[16,491,492],{},"The advantage of this approach: it produces a living specification. Architecture documents evolve with the project, not against it.",[16,494,495],{},"I'll write a dedicated BMAD retrospective when the project entirely developed with this method is finalized and in production.",[11,497,499],{"id":498},"what-it-unblocked","What it unblocked",[16,501,502],{},"The most telling case: an application whose development was at a standstill. The project required both front-end and back-end skills, and the team didn't have the bandwidth to move forward.",[16,504,505],{},"With the agent, Romain, the project's designer who also plays the PO role, can now contribute directly. He discusses with the agent, describes what he wants, validates the result. The agent translates intentions into code, respecting the technical architecture.",[16,507,508],{},"And it changed something fundamental about feedback loops. Before, the process was: Romain provides mockups, we discuss, I implement, it goes live, and then he realizes adjustments are needed. Not necessarily implementation bugs, but user flows poorly thought out beforehand, details that don't work once you actually use them. It happens all the time. Except each round trip between detecting the problem and fixing it took considerable time. The loops piled up and development crawled.",[16,510,511],{},"Now, Romain works directly with the agent. 
Feedback is instant: he sees the result, requests an adjustment, the agent fixes it, it's live in seconds. The delivered feature immediately matches the product vision. Loops that took days now happen in minutes.",[16,513,514],{},"Result: development restarted, at a pace we'd never had. Features that had been sitting in the backlog for months were delivered. Long-standing bugs, never prioritized because \"no time,\" were finally fixed. The code produced is clean, secure, tested.",[11,516,518],{"id":517},"what-it-changes-about-my-role","What it changes about my role",[16,520,521],{},"My daily work has evolved:",[381,523,524,530,536,542],{},[384,525,526,529],{},[58,527,528],{},"I specify with agents",": I don't write specs alone, I discuss with analyst agents to formalize needs and break down the work",[384,531,532,535],{},[58,533,534],{},"I review",": deliverable quality depends on review quality, especially on security",[384,537,538,541],{},[58,539,540],{},"I orchestrate",": feature order, priorities, technical trade-offs",[384,543,544,547],{},[58,545,546],{},"I debug",": this is a significant part of the role. The agent regularly gets stuck in loops on a bug it can't fix. It's missing part of the problem, a perspective it doesn't have",[16,549,550],{},"This last point is crucial. When Romain works with the agent and things get stuck, he's blocked: he doesn't have the technical vision to explain to the agent what it's missing. My job is to identify the missing piece and give it to the agent. It's often quite simple once you understand the problem. But without someone who knows the project's architecture, the situation stays stuck.",[16,552,553,554],{},"After fifteen years of writing code, my role has shifted toward architecture, supervision and unblocking. 
",[58,555,556],{},"My value add is the global technical vision.",[11,558,560],{"id":559},"what-i-take-away","What I take away",[16,562,563],{},"Working with a dedicated AI agent isn't \"having an extra developer.\" It's a different paradigm that demands rigor in specs, discipline in review, and infrastructure designed for agent autonomy.",[16,565,566],{},"But when it's in place:",[381,568,569,572,575,578,581],{},[384,570,571],{},"Development continues even when I'm working on something else",[384,573,574],{},"Non-technical people can contribute directly",[384,576,577],{},"Stalled projects restart",[384,579,580],{},"Feedback loops go from days to minutes",[384,582,583],{},"Code quality stays consistent with the rest of the codebase",[16,585,586],{},"Six months ago, the idea of maintaining and evolving a project like this with just two people, each with other activities on the side, seemed unthinkable. Today, it feels possible. We're continuing this way. The agent is now part of the team.",{"title":329,"searchDepth":330,"depth":330,"links":588},[589,590,591,592,593,594],{"id":363,"depth":330,"text":364},{"id":429,"depth":330,"text":430},{"id":458,"depth":330,"text":459},{"id":498,"depth":330,"text":499},{"id":517,"depth":330,"text":518},{"id":559,"depth":330,"text":560},"2026-02-25","How I integrated an autonomous AI agent into my development workflow, and what it concretely changed on my projects.",{},"\u002Fblog\u002Fen\u002Fdeveloping-with-an-ai-agent",{"title":358,"description":596},"blog\u002Fen\u002Fdeveloping-with-an-ai-agent",[602,603,604],"Artificial Intelligence","Development","Workflow","developper-avec-un-agent-ia","d1H4oTloL7ZNQpElLKVKxO1PSEUOBUOJg0pzlpOrhG4",{"id":608,"title":609,"body":610,"date":768,"description":769,"extension":345,"meta":770,"navigation":347,"path":771,"seo":772,"stem":773,"tags":774,"translationSlug":777,"__hash__":778},"content\u002Fblog\u002Fen\u002Fllms-are-hardware.md","LLMs Are 
Hardware",{"type":8,"value":611,"toc":760},[612,616,619,622,629,632,636,639,646,660,663,667,673,693,699,703,706,709,712,715,719,722,725,728,734,741,745,748,751,754,757],[11,613,615],{"id":614},"the-wrong-debate","The wrong debate",[16,617,618],{},"GPT vs Claude vs Gemini vs Llama. We compare benchmarks, switch providers every three months. What if we're having the wrong debate?",[16,620,621],{},"Nobody picks their OS based on their CPU brand. The OS adapts to the hardware. The hardware is interchangeable. What matters is what runs on top.",[16,623,624,625,628],{},"LLMs are ",[58,626,627],{},"hardware",". The execution layer. The processor. And the real value has never been in the silicon: it's in the software.",[16,630,631],{},"Yet in 2025, we build entire workflows around a specific model. We hardcode prompts optimized for Claude. We use proprietary GPT features. We create voluntary lock-in on a layer that's going to be commoditized. It's like writing applications in 1997 that only ran on a Pentium II.",[11,633,635],{"id":634},"the-real-product-is-the-software","The real product is the software",[16,637,638],{},"Take an AI agent managing a development project. It needs to understand a spec, generate code, test it, fix errors, communicate the result.",[16,640,641,642,645],{},"None of these tasks are model-specific. What's specific, what has value, is everything you ",[58,643,644],{},"build around it",":",[381,647,648,651,654,657],{},[384,649,650],{},"Orchestration: how you break down work, chain steps, handle errors",[384,652,653],{},"Tooling: the tools that turn an LLM request into something useful",[384,655,656],{},"Pipelines: how you connect AI to existing systems",[384,658,659],{},"Verification: how you ensure the result is correct",[16,661,662],{},"All of this is software built on top of LLM requests. 
And that's what's durable.",[11,664,666],{"id":665},"what-changes-if-you-believe-this","What changes if you believe this",[16,668,669,672],{},[58,670,671],{},"Abstraction becomes a necessity, not a luxury."," If your workflow is glued to Claude, you're hostage to Anthropic. The day a competitor blows up the benchmarks, you can't move. A good agent framework should be able to switch models like you switch databases. With an adapter, not a rewrite.",[16,674,675,678,679,684,685,688,689,692],{},[58,676,677],{},"The moat isn't where you think."," Companies building value on \"we use GPT-4\" have no moat. The moat is in orchestration quality, proprietary data, workflow UX, business process integration. Never in model choice. And for those who doubt it: ",[20,680,683],{"href":681,"rel":682},"https:\u002F\u002Fdeathbyclawd.com",[24],"deathbyclawd.com"," literally offers to analyze your moat and replace it with an ",[143,686,687],{},".md"," file. Your competitive advantage fits in a prompt? It doesn't hold. (Granted, a skill that simulates how your app works doesn't actually ",[195,690,691],{},"replace"," your app. But the fact that it's possible should be enough to make you think.)",[16,694,695,698],{},[58,696,697],{},"Multi-model becomes natural."," A fast, cheap model for triage. A powerful one for complex reasoning. A specialized one for code. Just like we have a CPU for general compute and a GPU for graphics.",[11,700,702],{"id":701},"the-silent-monopoly-risk","The silent monopoly risk",[16,704,705],{},"If LLMs are hardware, then supplier diversity is vital.",[16,707,708],{},"Google's enterprise firepower is terrifying. Google Workspace is already everywhere. Gemini integrates natively into Docs, Sheets, Gmail, Meet. They show up with aggressive enterprise offers: zero retention, slashed prices, bundled with existing infrastructure. For a CIO, the choice is \"obvious.\"",[16,710,711],{},"And that's exactly how you end up with a monopoly. 
Not by force: by comfort.",[16,713,714],{},"If Google captures 80% of this market with Gemini, it's not just a problem for CIOs. Competing labs lose their revenue stream, cut R&D, and end up dying or getting acquired. Fewer competitors = less innovation. We've already seen this movie with search.",[11,716,718],{"id":717},"the-sovereignty-question","The sovereignty question",[16,720,721],{},"For code, the models that matter are almost all American. Claude, Codex, Gemini Code Assist. On the European side, there's Mistral. And that's about it.",[16,723,724],{},"Every line of code sent to these models crosses the Atlantic, runs on servers subject to the CLOUD Act. For a European company, this is a strategic problem. Your competitive advantage, your proprietary code, all of it flows through a pipe someone else controls.",[16,726,727],{},"The hardware analogy holds: we've already lived this with semiconductors. Europe realized too late it shouldn't have outsourced chip manufacturing. Let's not make the same mistake with code models.",[16,729,730,733],{},[58,731,732],{},"Open source models are the antidote."," Llama, Mistral, Qwen: they're not always on par with proprietary models. But they guarantee one thing: nobody can cut your access.",[16,735,736,737,740],{},"A healthy company should use ",[58,738,739],{},"at minimum two providers",", including one open source. Not out of ideology. Out of strategic hygiene. And if your architecture treats the LLM as interchangeable hardware, this multi-provider approach costs almost nothing.",[11,742,744],{"id":743},"invest-in-the-software-layer","Invest in the software layer",[16,746,747],{},"\"But models are NOT interchangeable. Claude is better at code, GPT at creativity, Gemini at long context.\"",[16,749,750],{},"True. Today. Just like in 1990, software optimized for a specific CPU ran better than portable code. And yet the world chose portability. 
Because hardware progresses faster than software can adapt.",[16,752,753],{},"The gap between models shrinks with every release. Benchmarks converge. One model's strengths become another's features six months later. Optimizing for a specific model means optimizing for an advantage that's disappearing.",[16,755,756],{},"Invest in the software, not the model. Build abstractions. Make your workflows model-agnostic. Treat the LLM for what it is: a replaceable execution layer.",[16,758,759],{},"Those who get it build durable systems. The rest rebuild every six months.",{"title":329,"searchDepth":330,"depth":330,"links":761},[762,763,764,765,766,767],{"id":614,"depth":330,"text":615},{"id":634,"depth":330,"text":635},{"id":665,"depth":330,"text":666},{"id":701,"depth":330,"text":702},{"id":717,"depth":330,"text":718},{"id":743,"depth":330,"text":744},"2025-12-03","GPT vs Claude vs Gemini: what if we're having the wrong debate? LLMs are hardware. The real value is in the software you build on top.",{},"\u002Fblog\u002Fen\u002Fllms-are-hardware",{"title":609,"description":769},"blog\u002Fen\u002Fllms-are-hardware",[602,775,776],"Architecture","Reflection","llms-are-hardware","2rPBiPWcO76wP6hAkR9MaBfDV4dqXdtPDOfT9IjrAls",1774359324871]