Revise bets on custom editor stack for AI-native writing

TLDR: New AI-native editor Revise rolls its own stack, Scale AI starts Elo-ranking voice models, and a major utility experiments with LLM-based predictive maintenance.

Revise ships AI-native document editor with custom CRDT engine

Revise, a new AI editor for documents, has launched with a fully custom word processor and rendering engine built on top of Yjs, a conflict-free replicated data type (CRDT) library, as of 2026-03-23. The founder says they used agentic coding tools throughout a 10-month build and remain deeply involved in the code and architecture.

For AI engineers, the interesting bit is not “AI in a doc editor” but the choice to own the entire editing stack instead of wrapping ProseMirror or TipTap. That gives much more control over latency, over how edits merge (operational transformation versus CRDT behavior), and over how AI suggestions show up and sync. It also means you inherit all the joys of layout bugs, IME edge cases, and collaboration conflicts yourself.
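
To make "CRDT behavior" concrete, here is a minimal sketch of one of the simplest CRDTs, a last-writer-wins register. It is not how Yjs or Revise merge text (sequence CRDTs are considerably more involved), and the class name, fields, and replica names are hypothetical. The point is only that two replicas end up in the same state no matter which order they merge in.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LWWRegister:
    """Hypothetical last-writer-wins register, for illustration only."""
    value: str
    timestamp: int   # logical clock of the write
    replica_id: str  # breaks ties when timestamps are equal

    def merge(self, other: "LWWRegister") -> "LWWRegister":
        # Keep the write with the larger (timestamp, replica_id) pair.
        # The comparison is a total order, so merge is commutative,
        # associative, and idempotent: replicas converge.
        if (other.timestamp, other.replica_id) > (self.timestamp, self.replica_id):
            return other
        return self


# A human edit and a later AI suggestion to the same field.
human = LWWRegister("Draft title", timestamp=1, replica_id="human")
agent = LWWRegister("AI-suggested title", timestamp=2, replica_id="agent")

# Both merge orders yield the same state.
merged_on_human = human.merge(agent)
merged_on_agent = agent.merge(human)
```

An operational-transformation design would instead rewrite each incoming operation against concurrent ones, typically through a central server, which is part of the trade-off a team takes on when it owns the stack.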

If you are working on multi-user AI agents that operate directly on structured documents, this is worth watching as a reference architecture or cautionary tale. The Show HN thread is also a rare, detailed build story.

Scale AI launches Elo-ranked benchmark for voice AI models

Scale AI has launched "Voice Showdown," a leaderboard that ranks speech-to-speech and audio models such as OpenAI GPT-4o Audio, Google Gemini 2.5 Flash Audio, Grok Voice, and Qwen 3 Omni with an Elo-style scoring system, as of 2026-03-23. Initial results show Gemini 2.5 Flash Audio and GPT-4o Audio effectively tied at the top of the speech-to-speech baseline, with Elo scores of 1060 and 1059 respectively, while GPT-4o Audio leads style-controlled tasks at 1102.
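
For a sense of what a one-point Elo gap means, here is a small sketch using the classic Elo formulas. Scale's exact scoring method is not described here, so the 400-point scale and the K-factor of 32 are assumptions borrowed from standard Elo, and the function names are illustrative.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the classic Elo model (400-point scale)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return new (A, B) ratings after one comparison.

    score_a is 1.0 if A won, 0.0 if A lost, 0.5 for a tie.
    k is an assumed K-factor, not a value taken from the leaderboard.
    """
    exp_a = expected_score(rating_a, rating_b)
    delta = k * (score_a - exp_a)
    return rating_a + delta, rating_b - delta


# The two top speech-to-speech scores reported above.
p_win = expected_score(1060, 1059)
print(f"Expected win rate for the 1060-rated model: {p_win:.3f}")
```

Under these assumptions the higher-rated model is expected to win only slightly more than half of head-to-head comparisons, which is why a one-point gap reads as a tie.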

This is one of the first public evaluations to rank voice models head-to-head against each other rather than scoring them on static benchmarks. For anyone building voice agents, the breakdown by task type matters: Gemini 3 Pro and Flash dominate dictation, whereas GPT Realtime reportedly lags on multilingual and short, noisy utterances. The dataset, evaluation protocol, and how Scale handles subjective judgments will determine how much you can trust these numbers.

Expect vendors to optimize directly for this leaderboard, so it may become a de facto target metric in voice stacks, much as the LMSYS Chatbot Arena did for text models.

Yarra Valley Water trials LLM-based predictive maintenance at scale

Yarra Valley Water, a major Australian water authority, is piloting an inference engine based on generative AI and large language models (LLMs) to predict failures across millions of assets serving about 2 million premises, as of 2026-03-23. The proof of concept is targeted to go operational next year and will ingest sensor data to anticipate issues and cut maintenance costs.

For infra and reliability engineers, this is a concrete example of LLMs stepping into classic predictive maintenance territory that used to belong to gradient-boosted trees and bespoke anomaly models. One architecture decision is still open: the team is weighing on-premises versus private cloud hosting, which reflects the regulatory and latency constraints around operational technology.

If this works, utilities will become a new high-value market for agentic monitoring and remediation systems. If it struggles, it will be another data point that plain time-series models still beat LLMs for narrow prediction tasks.
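
For reference, the kind of "plain time-series" baseline an LLM approach would need to beat can be very small. The sketch below is a rolling z-score detector using only the standard library. The readings, window size, and threshold are made-up illustrative values, not anything from Yarra Valley Water.

```python
from statistics import mean, stdev


def rolling_zscore_anomalies(readings, window=5, threshold=3.0):
    """Return indices whose value deviates from the trailing window
    mean by more than `threshold` sample standard deviations."""
    anomalies = []
    for i in range(window, len(readings)):
        history = readings[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            continue  # flat history: z-score is undefined, skip
        if abs(readings[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


# Hypothetical pressure readings with one obvious spike at index 7.
pressure = [50.1, 49.9, 50.0, 50.2, 49.8, 50.1, 50.0, 62.5, 50.1, 49.9]
flagged = rolling_zscore_anomalies(pressure)
```

A detector like this only flags deviations after they appear in the data. Predicting failures ahead of time is the harder problem the pilot is aiming at.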

Quick Hits

Tech Employees Are Reportedly Being Evaluated by How Fast They Burn Through LLM Tokens: Volume of LLM token usage is reportedly being used as a performance signal at firms like Meta, OpenAI, and Shopify, which is a pretty noisy proxy if you care about actual productivity.
ai @ai-sdk/[email protected]: Vercel AI SDK now exposes provider-reported cost for Perplexity in providerMetadata, which makes it easier to surface per-call spend in your own logging and dashboards.
Merge State Visualizer: Simon Willison used Claude and Pyodide to turn Bram Cohen’s CRDT version control prototype into an interactive merge visualizer, useful if you are choosing data sync strategies for collaborative agents.
JavaScript Sandboxing Research: Deep survey of Node.js sandboxing options like worker_threads, node:vm, isolated-vm, vm2, and QuickJS, relevant if your agents run untrusted user code.
Starlette 1.0 skill: Concise guide plus a task manager demo built on Starlette 1.0 that shows patterns for async routing, templating, and DB access, handy if your agent backend is Python ASGI.
They’re Vibe-Coding Spam Now: Tedium looks at AI-written, “vibe-coded” spam emails that evade traditional filters, worth a skim if your agents touch email or abuse detection.
Teaching Claude to QA a mobile app: Walkthrough of using Anthropic Claude as a mobile app QA assistant, including test design and limitations, good inspiration for product engineers.
litellm v1.82.1.dev.1: Nightly dev release of LiteLLM with incremental fixes and changes listed in the full changelog, relevant if you are on the bleeding edge of their proxy.
litellm v1.81.14.dev.3: Another dev branch update focusing on UI tests, routing refactors, and table components, a reminder to keep an eye on version drift between staging and prod.