NFS over RDMA · QuTS hero · TS-h1290FX

Your GPU Shouldn't Have to Wait for Your Storage

Every millisecond of I/O wait time wastes expensive GPU computing power.
The TS-h1290FX with NFS over RDMA ensures storage performance keeps pace with computing speeds.

100× Latency Reduction
85% CPU Load Reduction
100Gbps Near Line-Rate Throughput
<5% GPU Wait Time

How Much Time Do Your GPUs Spend Waiting for Data?

AI training costs are determined by GPU time, yet storage I/O bottlenecks can waste over 40% of that computing time.

01

The Hidden Tax of the TCP Stack

For every data read, the CPU must process TCP packet fragmentation, checksum calculations, and kernel context switches. This overhead generates zero AI computing value but silently consumes up to 99% of CPU resources.

CPU Usage ≥ 99%
02

The Quadruple Cost of Memory Copying

In a traditional NFS path, the same data must be copied 4-6 times between the kernel buffer and user space before reaching the GPU. Every copy adds latency, and every added microsecond of latency drains computing power.

Latency 100–500 μs
03

The Real Cost of GPU Idling

Taking an 8×H100 cluster as an example, cloud costs exceed $24 per hour. When GPU utilization drops to 60% due to I/O bottlenecks, nearly $10 per hour is completely wasted.

GPU Idle > 40%
04

The Larger the Scale, the Deeper the Bottleneck

The bottleneck may be barely noticeable with a single GPU, but once 4, 8, or 16 GPUs read from the same storage concurrently, contention on traditional NFS over TCP makes latency degrade sharply.

Multi-Node Concurrency Breaking Point

Two Paths, Completely Different Outcomes

NFS over RDMA is not a minor tweak to traditional protocols; it fundamentally reconstructs the entire data path from storage to GPU memory.

Traditional NFS over TCP: The Performance Bottleneck
1. Application Issues Read Request: the AI training task requests the next batch of data.
2. Enters Kernel Mode (Context Switch #1): the application traps into the kernel; the CPU must save and restore register state, taking 1–10 μs.
3. Full TCP/IP Stack Processing: TCP fragmentation, retransmission, and checksum calculations are executed by the CPU and cannot be offloaded.
4. NIC Transmits Data: the data is encapsulated and sent across the network.
5. Returns to Kernel Mode (Context Switch #2): the receiver enters kernel mode again, triggering a second context switch.
6. Data Copied 4–6 Times: kernel buffer → DMA buffer → user space; each copy consumes CPU cycles and memory bandwidth.
7. Application Finally Gets Data: the GPUs remain completely idle during the wait.
End-to-End Latency: 100–500 μs
CPU Usage: ≈ 99%
GPU Wait Ratio: > 40%
NFS over RDMA (RoCE): TS-h1290FX
1. Application Issues Read Request: the AI training task requests the next batch of data.
2. Kernel Bypass (Direct HCA Communication): the application bypasses the OS kernel and communicates directly with the RDMA NIC (HCA), eliminating context switches.
3. Hardware Offloads All Protocol Processing: the HCA performs all network protocol calculations in hardware, completely freeing the CPU for AI computing.
4. Zero-Copy Direct Memory Write: data is written directly from the NAS NVMe drives into the AI server's application memory, with no intermediate copies.
5. Data Ready, GPU Starts Computing Instantly: the entire data path is free of kernel switches, redundant copies, and protocol-stack CPU drain.
End-to-End Latency: 1–2 μs
CPU Usage: ≈ 15%
GPU Wait Ratio: < 5%

The Numbers Behind the TS-h1290FX

Random Read 816K
4K random read IOPS
Eliminates training data I/O wait
Max Capacity 737TB
12 × 61.44 TB NVMe U.2
PCIe Gen 4 All-Flash Array
Max Memory 1 TB
DDR4 ECC RDIMM 3200 MHz
8 slots × 128 GB
CPU 16C
AMD EPYC™ 7302P
Up to 3.3 GHz Boost
Built-in Networking 2×25G
SFP28 + 2×2.5GbE
4× PCIe Gen 4 Expansion Slots
Expandable to 100G
Install a QNAP QXG-100G2SF for full-speed RDMA connections
ZFS Snapshots
Near-limitless snapshot restore points, coupled with WORM immutability
Power Efficiency 24/7
All-flash low-power design
Supports continuous production line analysis

See the Difference Clearly

Spec Item | QNAP TS-h1290FX | Competitor A (SATA NAS) | Competitor B (Enterprise AFA)
CPU | AMD EPYC™ 7302P, 16C / up to 3.3 GHz | Intel Xeon D-1541, 8C / 2.7 GHz | High-end Intel series
Storage Interface | NVMe PCIe Gen 4 ×4, U.2 | SATA 6 Gb/s | NVMe / SAS / FC
NVMe Slots | 12 × 2.5" U.2 PCIe Gen 4 | No native support (adapter required) | 48 × 2.5" NVMe
NFS over RDMA | ✓ Fully optimized native support | ✗ Unsupported | △ Partially supported
Built-in Networking | 2× 25GbE SFP28 + 2× 2.5GbE | 2× 10GbE + 4× 1GbE | Multiple 25/100GbE (configuration-dependent)
PCIe Expansion | 4× PCIe Gen 4 | 2× PCIe Gen 3 | High-density multi-slot
Max Memory | 1 TB DDR4 ECC 3200 MHz | 64 GB DDR4 2666 MHz | 1,280 GB
ZFS File System | ✓ QuTS hero native integration | | Depends on vendor
S3 Object Storage | ✓ QuObjects (includes Object Lock) | | Depends on vendor
Multi-Tenant Isolation | ✓ NFS shares + ZFS snapshot isolation | Limited support | Supported

Who Is Using It, and the Problems It Solves

🤖

AI / LLM Model Training

Multiple GPU nodes read hundreds of gigabytes of training data in parallel. Under traditional NFS, I/O wait time can exceed compute time. RDMA ensures data delivery keeps pace with GPU demand.

GPU Utilization Boost 40% → >95%
Single Epoch Training Time Reduced by 30–60%
Storage CPU Load 99% → 15%
🏥

Smart Healthcare Imaging AI

Pathology slides and 3D DICOM images often span gigabytes. If AI-assisted diagnosis stalls on reading, clinical benefits are severely compromised. Low-latency storage empowers diagnostic AI to operate at peak efficiency.

Image Preprocessing Acceleration Multi-path parallel without slowdown
Report Generation Wait Significantly reduced response time
Data Integrity ZFS self-healing protection
🏭

Semiconductor Yield Big Data Analysis

Production lines generate massive volumes of process data every second. AI models must analyze historical data in real time to isolate the variables that drive yield. I/O latency translates directly into analysis delays, and ultimately into yield loss.

Historical Data Retrieval Speed Millisecond → Microsecond access
24/7 Continuous Analysis All-flash low power support
TCO Streamlined hardware for enterprise performance

Everything You Might Want to Ask, Right Here

Does RDMA require specialized network switches? Can I use my existing data center architecture?
NFS over RDMA (RoCE v2) operates on standard Ethernet networks but requires switches that support PFC (Priority Flow Control) to enable a lossless Ethernet environment. Most modern enterprise-grade switches (e.g., Mellanox/NVIDIA Spectrum, Cisco Nexus, Arista series) support this feature. QNAP can provide network planning advice to help confirm if your existing environment is compatible.
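As a quick first check on an existing Linux client, you can confirm that an RDMA-capable NIC is visible to the kernel before looking at switch settings. The sketch below is illustrative and assumes a standard sysfs layout; device and driver names vary by NIC vendor.

```python
from pathlib import Path

# RoCE/InfiniBand adapters register under this sysfs directory once their
# RDMA drivers (e.g. mlx5_core for NVIDIA/Mellanox NICs) are loaded.
rdma_root = Path("/sys/class/infiniband")

if rdma_root.is_dir() and any(rdma_root.iterdir()):
    print("RDMA-capable devices:", [d.name for d in rdma_root.iterdir()])
else:
    print("No RDMA devices detected: check the NIC driver and rdma-core installation.")
```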
How big is the actual latency gap between NFS over RDMA and traditional NFS over TCP?
Under laboratory conditions, the end-to-end latency for NFS over TCP typically ranges from 100–500 microseconds (μs), with bottlenecks mainly stemming from kernel context switches and memory copying. NFS over RDMA can compress latency to 1–2 μs—an improvement of about 100 times. For AI training scenarios with frequent small-batch random reads, this gap directly translates into improved GPU utilization and overall shorter training cycles.
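On the client side, opting into the RDMA path is just a change of NFS mount options. Below is a minimal sketch of the standard Linux mount invocation; the NAS address, export path, and mount point are placeholders, and the client needs an RDMA-capable NIC plus the xprtrdma kernel module.

```python
import subprocess

# Placeholders: substitute your NAS address, NFS export, and local mount point.
NAS_ADDR = "192.168.100.10"
EXPORT = "/share/TrainingData"
MOUNT_POINT = "/mnt/training"

# proto=rdma selects the RDMA transport instead of TCP; 20049 is the
# conventional NFS-over-RDMA port. Run as root.
subprocess.run(
    ["mount", "-t", "nfs",
     "-o", "vers=4.1,proto=rdma,port=20049",
     f"{NAS_ADDR}:{EXPORT}", MOUNT_POINT],
    check=True,
)
```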
How is the space efficiency of ZFS? Are compression and deduplication effective for AI training sets?
ZFS features built-in real-time LZ4/Zstandard compression and block-level deduplication. For image training sets containing massive numbers of similar samples, the compression ratio often reaches 1.3–2×; for text-based datasets (such as tokenized corpora), the gains are even larger. Deduplication is particularly well suited to storing multiple model checkpoint versions, where it can reclaim substantial space. LZ4 compression in ZFS is extremely lightweight, so it has minimal impact on I/O performance.
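As a rough illustration of what those ratios translate to on a fully populated unit (the actual ratio depends entirely on your data), the arithmetic looks like this:

```python
# Illustrative only: effective logical capacity at different compression ratios.
raw_capacity_tb = 737              # 12 × 61.44 TB NVMe, before compression
for ratio in (1.3, 1.6, 2.0):      # example LZ4/Zstandard ratios from above
    print(f"compressratio {ratio:.1f}x -> ~{raw_capacity_tb * ratio:,.0f} TB of logical data")
```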
We only have 4 GPUs. Is the TS-h1290FX worth the investment?
The hourly computing cost for 4 high-end GPUs (like H100/A100) is already substantial. Even in small-scale clusters, if storage I/O causes GPU utilization to fall below 70%, it means over 30% of your computing expenditure is wasted. The investment in a TS-h1290FX usually achieves ROI within a few months to a year, driven entirely by the performance gains from increased GPU utilization. For a specific TCO calculation, feel free to contact our sales team.
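For a rough feel of that math, here is a simple payback sketch; every figure in it is an illustrative assumption (GPU pricing, utilization, and hardware budget), not a quote.

```python
# All numbers are illustrative assumptions; plug in your own.
gpu_cost_per_hour = 24.0        # e.g. an 8×H100 cloud instance
util_storage_bound = 0.60       # GPU utilization while I/O-bound
util_with_rdma = 0.95           # GPU utilization with low-latency storage
hours_per_month = 720

wasted_now = gpu_cost_per_hour * (1 - util_storage_bound) * hours_per_month
wasted_after = gpu_cost_per_hour * (1 - util_with_rdma) * hours_per_month
monthly_savings = wasted_now - wasted_after

storage_budget = 30_000         # hypothetical all-flash NAS investment
print(f"GPU spend wasted per month today:   ${wasted_now:,.0f}")
print(f"GPU spend wasted after the upgrade: ${wasted_after:,.0f}")
print(f"Estimated payback period:           {storage_budget / monthly_savings:.1f} months")
```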
Does the TS-h1290FX support simultaneous use by multiple teams (multi-tenancy)?
Fully supported. The TS-h1290FX can be configured with multiple independent NFS shares, individual user accounts, and network isolation. Combined with the ZFS Dataset and Snapshot mechanisms, you can establish independent storage spaces, backup strategies, and access controls for each team or department, making it ideal for Managed Service Providers (MSPs) or large enterprise internal multi-department scenarios.
Compared to pure cloud AI training platforms, what are the advantages of an on-premises TS-h1290FX?
The main challenges of cloud platforms include exorbitant data transfer fees (egress costs), regulatory compliance risks for sensitive training data, and unpredictable long-term computing costs. The TS-h1290FX provides high-speed on-premises storage, ensuring data never leaves your facility while utilizing RDMA to match the I/O performance of high-end cloud storage. It acts as the perfect balance between performance, data sovereignty, and TCO.
Can the TS-h1290FX be integrated into existing MLOps workflows (e.g., Kubernetes, Kubeflow)?
Yes. The TS-h1290FX provides standard NFS v4.1 mounting, which Kubernetes can consume directly via a PersistentVolume (PV). On Kubernetes nodes with RDMA-capable NICs, specifying the RDMA transport in the PV's mount options enables full-speed NFS over RDMA connections. Additionally, through the S3-compatible endpoints provided by QuObjects, it can be integrated into MLOps toolchains that speak the S3 protocol (such as an MLflow artifact store or DVC remote storage).
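A minimal sketch of such a PersistentVolume using the official Kubernetes Python client is shown below; the server address, export path, capacity, and volume name are placeholders, and an equivalent YAML manifest works just as well.

```python
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() when running in a pod

# Placeholder address and export: point these at the TS-h1290FX NFS share.
pv = client.V1PersistentVolume(
    metadata=client.V1ObjectMeta(name="training-data-rdma"),
    spec=client.V1PersistentVolumeSpec(
        capacity={"storage": "10Ti"},
        access_modes=["ReadWriteMany"],
        persistent_volume_reclaim_policy="Retain",
        # The RDMA transport is requested through ordinary NFS mount options.
        mount_options=["vers=4.1", "proto=rdma", "port=20049"],
        nfs=client.V1NFSVolumeSource(
            server="192.168.100.10",
            path="/share/TrainingData",
        ),
    ),
)

client.CoreV1Api().create_persistent_volume(pv)
```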
How do we handle backup and disaster recovery for model checkpoints?
The TS-h1290FX offers a multi-layered protection strategy: ZFS snapshots can be scheduled to run automatically every hour, providing granular restore points; paired with another ZFS NAS, SnapSync enables real-time block-level synchronization for offsite disaster recovery; for long-term archiving, Hybrid Backup Sync (HBS 3) supports backing up data to the cloud (AWS S3, Azure Blob, B2, etc.). This triple-layered protection can be flexibly configured according to your RTO/RPO requirements.
Does the TS-h1290FX support the S3 object storage protocol?
Supported. After installing QuObjects, the TS-h1290FX acts as an on-premises S3-compatible object storage endpoint, supporting Object Lock (WORM) immutable storage. This allows for hybrid workflows in AI: high-speed dataset reading during the training phase via NFS over RDMA, and secure storage and management of model versions and analysis results during the inference phase via the S3 protocol.
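For instance, a training job could push a model checkpoint to the NAS with any standard S3 SDK. In this sketch the endpoint URL, credentials, bucket, and object names are placeholders that depend on how QuObjects is configured in your environment.

```python
import boto3

# Placeholder endpoint and credentials: use the address, port, and access keys
# configured in QuObjects on the TS-h1290FX.
s3 = boto3.client(
    "s3",
    endpoint_url="https://nas.example.local",
    aws_access_key_id="YOUR_ACCESS_KEY",
    aws_secret_access_key="YOUR_SECRET_KEY",
)

# Upload a checkpoint into an Object Lock-capable bucket.
s3.upload_file("checkpoints/epoch_42.pt", "model-registry", "llm/epoch_42.pt")
```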

Eliminate GPU Wait Times

TS-h1290FX × NFS over RDMA — The Storage Infrastructure for On-Premises AI Training

View Product Page · Contact Sales Team