This little AMD PC just ran a huge 397B AI model that required a server room full of GPUs a year ago

AMD’s Ryzen AI Halo recently went on sale for $4,000, sparking an interesting debate about how it compares to Nvidia’s slightly more expensive DGX Spark offering.

However, the configuration offered by Ryzen AI Halo has been on the market for a few months now. Most OEMs and enterprise vendors sell the same hardware in the same configuration, but Shenzhen-based memory and storage company Longsys has gone a step further.

The storage giant demonstrated an AI model with 397B parameters running locally on its own version of Ryzen AI Halo, with the same 16-core Ryzen AI Max+ 395 configuration and 128GB of RAM.

How was the Ryzen AI Max+ 395 able to run such a large model with only 128GB of RAM?

Longsys did not explicitly name the model, but it appears to be a custom version derived from Alibaba’s Qwen 3.5 397B (A17B), a multimodal base model built on a Mixture of Experts (MoE) architecture, the same approach that made the original DeepSeek such a powerful challenger.

Even with INT4 quantization, the model’s memory requirements far exceed what the demonstration device offers: in a unified 128 GB configuration, only 96 GB is available to the GPU as VRAM, compared to the estimated 200-250 GB of VRAM the model needs to run.
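The gap can be checked with back-of-the-envelope arithmetic. The sketch below counts weight storage only (no KV cache or runtime overhead) and reads the "A17B" in the model name as roughly 17B parameters active per token; the function name and constants are illustrative, not taken from Longsys or Alibaba.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (10^9 bytes) at a given precision."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


TOTAL_PARAMS_B = 397   # total parameters, from the model name
ACTIVE_PARAMS_B = 17   # assumption: "A17B" = parameters active per token
GPU_BUDGET_GB = 96     # memory available to the GPU, per the article

int4_total_gb = weight_memory_gb(TOTAL_PARAMS_B, 4)    # all experts at INT4
int4_active_gb = weight_memory_gb(ACTIVE_PARAMS_B, 4)  # active slice at INT4
shortfall_gb = int4_total_gb - GPU_BUDGET_GB

print(f"All weights at INT4:    {int4_total_gb:.1f} GB")
print(f"Active weights at INT4: {int4_active_gb:.1f} GB")
print(f"Shortfall vs. 96 GB:    {shortfall_gb:.1f} GB")
```

Under these assumptions the full set of weights comes to roughly 198.5 GB, in line with the low end of the 200-250 GB estimate, while the parameters touched for any single token are a small fraction of that. That asymmetry is what makes keeping most experts outside memory plausible.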

The secret is Longsys’ newly introduced custom iSA and SPU configuration, which compresses data in real time. The company says this lets it fit up to double the amount of data on storage drives of up to 128GB, and that a caching layer greatly reduces DRAM requirements.

The approach involves offloading experts that are not in active use to a large, fast storage buffer, from which the AI chip can reload them when needed.
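The general idea of expert offloading can be illustrated with a small cache that keeps only a few experts resident and fetches the rest from a slower tier on demand. This is a minimal sketch using a least-recently-used eviction policy, which is an assumption made for illustration; Longsys has not published its cache policy, and the `ExpertCache` class and the dictionary standing in for storage are hypothetical.

```python
from collections import OrderedDict
from typing import Callable


class ExpertCache:
    """Keeps up to `capacity` experts in fast memory, evicting the
    least recently used one when a new expert has to be loaded."""

    def __init__(self, capacity: int, load_from_storage: Callable[[int], bytes]):
        self.capacity = capacity
        self.load_from_storage = load_from_storage  # slow path (e.g. SSD read)
        self.resident: "OrderedDict[int, bytes]" = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, expert_id: int) -> bytes:
        if expert_id in self.resident:
            # Hit: mark the expert as most recently used.
            self.resident.move_to_end(expert_id)
            self.hits += 1
            return self.resident[expert_id]
        # Miss: fetch from the slow tier, then evict the oldest if over capacity.
        self.misses += 1
        weights = self.load_from_storage(expert_id)
        self.resident[expert_id] = weights
        if len(self.resident) > self.capacity:
            self.resident.popitem(last=False)
        return weights


# Hypothetical stand-in for expert weights kept on storage.
storage = {i: f"expert-{i}".encode() for i in range(8)}
cache = ExpertCache(capacity=3, load_from_storage=storage.__getitem__)

# Simulated sequence of experts chosen by the router.
for expert_id in [0, 1, 2, 0, 3, 0, 1]:
    cache.get(expert_id)

print(cache.hits, cache.misses, list(cache.resident))
```

Every miss in this sketch corresponds to a storage read during inference, which is why the I/O latency of the storage tier and the quality of the eviction policy determine how usable such a setup is in practice. Prefetching, which would load likely experts before the router asks for them, is not modeled here.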

In a press release, Longsys said its approach worked by targeting “weaknesses of MoE LLMs,” such as a large number of parameters, rapid KV cache expansion, and I/O latency that hampers inference efficiency.

“It leverages expert offloading, intelligent cache management, and predictive prefetching algorithms to efficiently solve storage scheduling challenges and comprehensively improve the smoothness of local AI inference,” the company added.

It’s important to note that while the demonstration is impressive, Longsys did not provide performance figures in tokens per second, an area where the Ryzen AI chip is relatively limited compared to most modern GPU-based AI offerings.

Still, an approach that essentially treats storage as memory suggests that local AI could run considerably larger models, and that memory capacity might not be as hard a limit as it seems for certain architectures.

It means that memory limitations can be circumvented with fast storage to run a frontier-level model that would otherwise require tens of thousands of dollars in AI hardware, which is no small feat. Models that were previously restricted to data centers can now run on a device that fits in the palm of your hand.
