Korean AI Beats NVIDIA’s GR00T—RealWorld’s RLDX-1 Sweeps 8 Global Benchmarks


By Global Team

Korean physical AI startup RealWorld said its robotics foundation model, developed for humanoid robots with five-finger hands, has ranked first worldwide in global evaluations.

RealWorld unveiled its proprietary robotics foundation model, “RLDX-1,” on the 7th and said it outperformed all existing state-of-the-art models across eight global public benchmarks.

The model is designed to enable humanoid robots to move their five fingers with precision. A single model handles the full pipeline: understanding human commands, perceiving objects, and executing movements directly.

The reason the company’s achievement is drawing attention is not simply because of its score. It is being read as a signal that the physical AI industry is shifting from competition among vision- and language-centric general models to a contest over “physical finesse.”

The industry has long assumed that once a robot's intelligence becomes sufficiently advanced, dexterity will naturally follow. Most humanoid robots have therefore been built with only two or three fingers, since increasing the number to five sharply raises the degrees of freedom and makes control computations more complex.

RealWorld takes the opposite view. The company argues that intelligence itself is incomplete without dexterity. It says that only when force and touch can be handled properly can precise work in industrial settings become possible. It calls this perspective “Dexterity-First.”

The core of the technology lies in a Multi-Stream Action Transformer, or MSAT. Unlike conventional vision-language-action models, which bundle visual, linguistic, action, tactile, and memory signals into a single stream, MSAT assigns each signal its own independent stream and then integrates them through joint attention across modalities. Physical signals such as torque and touch, which cannot be captured by vision alone, as well as long-term memory, are processed separately in dedicated modules.
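The multi-stream design described above can be illustrated with a minimal sketch. This is not RealWorld's implementation; the token counts, embedding width, and modality names are illustrative assumptions. The point shown is the structural one: each modality keeps its own token stream, and the streams are only integrated at a joint attention step across all modalities.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 16  # shared embedding width (illustrative)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Each modality keeps its own independent token stream
# (token counts here are hypothetical).
streams = {
    "vision":   rng.normal(size=(8, D)),
    "language": rng.normal(size=(6, D)),
    "tactile":  rng.normal(size=(4, D)),
    "torque":   rng.normal(size=(4, D)),
    "memory":   rng.normal(size=(3, D)),
}

def joint_attention(streams, Wq, Wk, Wv):
    # Concatenate all modality tokens, then attend jointly so every
    # token can draw on information from every other stream.
    x = np.concatenate(list(streams.values()), axis=0)  # (T, D)
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    attn = softmax(q @ k.T / np.sqrt(D))                # (T, T)
    return attn @ v                                     # (T, D) fused

Wq, Wk, Wv = (rng.normal(size=(D, D)) for _ in range(3))
fused = joint_attention(streams, Wq, Wk, Wv)
print(fused.shape)  # (25, 16): one fused vector per input token
```

The contrast with a conventional single-stream VLA model is that here torque and tactile tokens enter as first-class streams rather than being folded into a vision-language embedding, which is the separation the article attributes to MSAT.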

“The key to RLDX-1 is that the structure separates each modality so it can be fully expressed in its own place,” said Bae Jae-kyung, RealWorld’s chief technology officer. “Accurately capturing contact moments through torque signals and reasoning about dynamic changes over time is an area that conventional VLA systems have been structurally unable to handle.”

The performance gap is confirmed by the numbers. RLDX-1 scored 70.6 on RoboCasa Kitchen, a benchmark that evaluates whether a robot can operate in an environment similar to an actual kitchen. It is the first VLA model to break into the 70-point range. The task evaluates complex household-like actions such as opening a refrigerator and taking out a cup.

On GR-1 Tabletop, a humanoid-specific benchmark that tests whether a robot can pick up objects a person points to on a table, it scored 58.7, 10.7 percentage points higher than NVIDIA’s GR00T N1.6, the next-best model.

The results were even more impressive on a real robot. When deployed on ALLEX, the humanoid robot developed by WIRobotics, RLDX-1 achieved a 70.8% success rate on the "coffee pouring" task, which requires handling dynamic changes in weight, roughly double that of competing models, which remained in the high-30% range.

The newly released RLDX-1 consists of three 8.1-billion-parameter model variants, including a pretraining checkpoint and two platform-specific mid-training checkpoints. The model weights, training code, and technical documentation have also been made available to external researchers through GitHub and Hugging Face.

A single backbone drives not only WIRobotics' ALLEX but also the collaborative robot arm Franka Research 3 and the open-source platform OpenArm. In other words, the company has demonstrated a cross-embodiment architecture that is not tied to any specific piece of hardware.

RealWorld designed the model from the outset with industrial deployment in mind. Working with dozens of partner companies, it directly fed manufacturing and logistics data into its training pipeline.

The company also built its own benchmark, DexBench, to define the hand-manipulation tasks repeatedly encountered in industrial settings. It evaluates performance across five areas: grasp diversity, spatial precision, temporal precision, contact precision, and contextual awareness.

What makes this significant is that data from industrial settings has translated into a top ranking on global benchmarks. This contrasts with existing models trained on simulation-generated data, which have often suffered performance degradation when moved into real-world applications.

RealWorld will hold a launch event called “Dexterity Night” in the United States on the 13th local time. Humanoid hardware companies from Korea, the United States, and Japan will participate in a panel discussion on why the next inflection point in the robotics industry is the hand. Launch events will then continue in Japan and Korea.

The company’s next target is the “4D+ world model.” Almost all robotics foundation model companies are pursuing vision-based world models built on video data, but RealWorld believes that such an approach has fundamental limitations. It argues that information needed for precise hand work—such as contact torque, tactile feedback, and robot joint states—cannot be captured in camera images.

“Information that is not contained in pixels will never appear, no matter how much video you collect,” said Ryu Joong-hee, CEO of RealWorld. “RLDX-1 is only the first milestone in the direction we are heading.” He added, “It is the starting point of a long roadmap toward a 4D+ world model, built on data and technology verified in industrial settings across Korea and Japan, together with global humanoid partners.”

The physical AI market is still an area where no absolute dominant player has been established. A Korean startup has disrupted a field previously led by NVIDIA’s GR00T and Physical Intelligence’s PI-Zero.

The key question now is how far a “hand-centered” strategy, which departs from the vision- and language-centered race for general-purpose models, can go.
