NFS over RDMA · QuTS hero · TS-h1290FX

Your GPU Shouldn't Have to Wait for Your Storage

Every millisecond of I/O wait time wastes expensive GPU computing power.
The TS-h1290FX with NFS over RDMA ensures storage performance keeps pace with computing speeds.

100× Latency Reduction
85% CPU Load Reduction
100Gbps Near Line-Rate Throughput
<5% GPU Wait Time

How Much Time Do Your GPUs Spend Waiting for Data?

AI training costs are determined by GPU time, yet storage I/O bottlenecks can waste over 40% of that computing time.

01

The Hidden Tax of the TCP Stack

For every data read, the CPU must process TCP packet fragmentation, checksum calculations, and kernel context switches. This overhead generates zero AI computing value but silently consumes up to 99% of CPU resources.

CPU Usage ≥ 99%
02

The Quadruple Cost of Memory Copying

In a traditional NFS path, the same data must be copied 4-6 times between the kernel buffer and user space before reaching the GPU. Every copy adds latency, and every added microsecond of latency drains computing power.

Latency 100–500 μs
03

The Real Cost of GPU Idling

Taking an 8×H100 cluster as an example, cloud costs exceed $24 per hour. When GPU utilization drops to 60% due to I/O bottlenecks, nearly $10 per hour is completely wasted.

GPU Idle > 40%
04

The Larger the Scale, the Deeper the Bottleneck

The bottleneck may be barely noticeable with a single GPU, but once 4, 8, or 16 GPUs read from the same storage concurrently, contention on traditional NFS over TCP makes latency degrade sharply.

Multi-Node Concurrency Breaking Point

Two Paths, Completely Different Outcomes

NFS over RDMA is not a minor tweak to traditional protocols; it fundamentally reconstructs the entire data path from storage to GPU memory.

Traditional NFS over TCP: The Performance Bottleneck
1. Application Issues Read Request: the AI training task requests the next batch of data.
2. Enters Kernel Mode (Context Switch #1): the application traps into the kernel; the CPU must save and restore register state, taking 1–10 μs.
3. Full TCP/IP Stack Processing: TCP fragmentation, retransmission, and checksum calculations are executed by the CPU and cannot be offloaded.
4. NIC Transmits Data: the data is encapsulated and sent across the network.
5. Returns to Kernel Mode (Context Switch #2): the receiver enters kernel mode again, triggering a second context switch.
6. Data Copied 4–6 Times: kernel buffer → DMA buffer → user space; each copy consumes CPU cycles and memory bandwidth.
7. Application Finally Gets Data: the GPUs remain completely idle during the wait.
End-to-End Latency: 100–500 μs
CPU Usage: ≈ 99%
GPU Wait Ratio: > 40%
NFS over RDMA (RoCE): TS-h1290FX
1. Application Issues Read Request: the AI training task requests the next batch of data.
2. Kernel Bypass (Direct HCA Communication): the application bypasses the OS kernel and communicates directly with the RDMA NIC (HCA), eliminating context switches.
3. Hardware Offloads All Protocol Processing: the HCA performs all network protocol calculations in hardware, completely freeing the CPU for AI computing.
4. Zero-Copy Direct Memory Write: data is written directly from the NAS NVMe drives into the AI server's application memory, with no intermediate copies.
5. Data Ready, GPU Starts Computing Instantly: the entire data path is free of kernel switches, redundant copies, and protocol-stack CPU drain.
End-to-End Latency: 1–2 μs
CPU Usage: ≈ 15%
GPU Wait Ratio: < 5%

The Numbers Behind the TS-h1290FX

Random Read 816K
4K random read IOPS
Eliminates training data I/O wait
Max Capacity 737TB
12 × 61.44 TB NVMe U.2
PCIe Gen 4 All-Flash Array
Max Memory 1 TB
DDR4 ECC RDIMM 3200 MHz
8 slots × 128 GB
CPU 16C
AMD EPYC™ 7302P
Up to 3.3 GHz Boost
Built-in Networking 2×25G
SFP28 + 2×2.5GbE
4× PCIe Gen 4 Expansion Slots
Expandable to 100G
Install a QNAP QXG-100G2SF for full-speed RDMA connections
ZFS Snapshots
Near-limitless snapshot restore points, coupled with WORM immutability
Power Efficiency 24/7
All-flash low-power design
Supports continuous production line analysis

See the Difference Clearly

Spec Item | QNAP TS-h1290FX | Competitor A (SATA NAS) | Competitor B (Enterprise AFA)
CPU | AMD EPYC™ 7302P, 16C / up to 3.3 GHz | Intel Xeon D-1541, 8C / 2.7 GHz | High-end Intel series
Storage Interface | NVMe PCIe Gen 4 ×4, U.2 | SATA 6 Gb/s | NVMe / SAS / FC
NVMe Slots | 12 × 2.5" U.2 PCIe Gen 4 | No native support (adapter required) | 48 × 2.5" NVMe
NFS over RDMA | ✓ Fully optimized native support | ✗ Unsupported | △ Partially supported
Built-in Networking | 2× 25GbE SFP28 + 2× 2.5GbE | 2× 10GbE + 4× 1GbE | Multiple 25/100GbE (configuration-dependent)
PCIe Expansion | 4× PCIe Gen 4 | 2× PCIe Gen 3 | High-density multi-slot
Max Memory | 1 TB DDR4 ECC 3200 MHz | 64 GB DDR4 2666 MHz | 1,280 GB
ZFS File System | ✓ QuTS hero native integration | | Depends on vendor
S3 Object Storage | ✓ QuObjects (includes Object Lock) | | Depends on vendor
Multi-Tenant Isolation | ✓ NFS shares + ZFS snapshot isolation | Limited support | Supported

Who Is Using It, and the Problems It Solves

🤖

AI / LLM Model Training

Multiple GPU nodes read hundreds of gigabytes of training data in parallel. Under traditional NFS, I/O wait time can exceed compute time. RDMA ensures data delivery keeps pace with GPU demand.

GPU Utilization Boost 40% → >95%
Single Epoch Training Time Reduced by 30–60%
Storage CPU Load 99% → 15%
🏥

Smart Healthcare Imaging AI

Pathology slides and 3D DICOM images often span gigabytes. If AI-assisted diagnosis stalls on reading, clinical benefits are severely compromised. Low-latency storage empowers diagnostic AI to operate at peak efficiency.

Image Preprocessing Acceleration Multi-path parallel without slowdown
Report Generation Wait Significantly reduced response time
Data Integrity ZFS self-healing protection
🏭

Semiconductor Yield Big Data Analysis

Production lines generate massive volumes of process data every second. AI models must analyze historical data in real time to isolate the variables that drive yield. I/O latency translates directly into analysis delays, and ultimately into yield loss.

Historical Data Retrieval Speed Millisecond → Microsecond access
24/7 Continuous Analysis All-flash low power support
TCO Streamlined hardware for enterprise performance

Everything You Might Want to Ask, Right Here

Does RDMA require specialized network switches? Can I use my existing data center architecture?
NFS over RDMA (RoCE v2) operates on standard Ethernet networks but requires switches that support PFC (Priority Flow Control) to enable a lossless Ethernet environment. Most modern enterprise-grade switches (e.g., Mellanox/NVIDIA Spectrum, Cisco Nexus, Arista series) support this feature. QNAP can provide network planning advice to help confirm if your existing environment is compatible.
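As a quick first check on an existing Linux client, you can confirm that an RDMA-capable NIC is visible to the kernel before looking at switch settings. The sketch below is illustrative and assumes a standard sysfs layout; device and driver names vary by NIC vendor.

```python
from pathlib import Path

# RoCE/InfiniBand adapters register under this sysfs directory once their
# RDMA drivers (e.g. mlx5_core for NVIDIA/Mellanox NICs) are loaded.
rdma_root = Path("/sys/class/infiniband")

if rdma_root.is_dir() and any(rdma_root.iterdir()):
    print("RDMA-capable devices:", [d.name for d in rdma_root.iterdir()])
else:
    print("No RDMA devices detected: check the NIC driver and rdma-core installation.")
```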
How big is the actual latency gap between NFS over RDMA and traditional NFS over TCP?
Under laboratory conditions, the end-to-end latency for NFS over TCP typically ranges from 100–500 microseconds (μs), with bottlenecks mainly stemming from kernel context switches and memory copying. NFS over RDMA can compress latency to 1–2 μs—an improvement of about 100 times. For AI training scenarios with frequent small-batch random reads, this gap directly translates into improved GPU utilization and overall shorter training cycles.
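On the client side, opting into the RDMA path is just a change of NFS mount options. Below is a minimal sketch of the standard Linux mount invocation; the NAS address, export path, and mount point are placeholders, and the client needs an RDMA-capable NIC plus the xprtrdma kernel module.

```python
import subprocess

# Placeholders: substitute your NAS address, NFS export, and local mount point.
NAS_ADDR = "192.168.100.10"
EXPORT = "/share/TrainingData"
MOUNT_POINT = "/mnt/training"

# proto=rdma selects the RDMA transport instead of TCP; 20049 is the
# conventional NFS-over-RDMA port. Run as root.
subprocess.run(
    ["mount", "-t", "nfs",
     "-o", "vers=4.1,proto=rdma,port=20049",
     f"{NAS_ADDR}:{EXPORT}", MOUNT_POINT],
    check=True,
)
```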
How is the space efficiency of ZFS? Are compression and deduplication effective for AI training sets?
ZFS features built-in real-time LZ4/Zstandard compression and block-level deduplication. For image training sets containing massive numbers of similar samples, the compression ratio often reaches 1.3–2×; for text-based datasets (such as tokenized corpora), the gains are even larger. Deduplication is particularly well suited to storing multiple model checkpoint versions, where it can reclaim substantial space. LZ4 compression in ZFS is extremely lightweight, so it has minimal impact on I/O performance.
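As a rough illustration of what those ratios translate to on a fully populated unit (the actual ratio depends entirely on your data), the arithmetic looks like this:

```python
# Illustrative only: effective logical capacity at different compression ratios.
raw_capacity_tb = 737              # 12 × 61.44 TB NVMe, before compression
for ratio in (1.3, 1.6, 2.0):      # example LZ4/Zstandard ratios from above
    print(f"compressratio {ratio:.1f}x -> ~{raw_capacity_tb * ratio:,.0f} TB of logical data")
```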
We only have 4 GPUs. Is the TS-h1290FX worth the investment?
The hourly computing cost for 4 high-end GPUs (like H100/A100) is already substantial. Even in small-scale clusters, if storage I/O causes GPU utilization to fall below 70%, it means over 30% of your computing expenditure is wasted. The investment in a TS-h1290FX usually achieves ROI within a few months to a year, driven entirely by the performance gains from increased GPU utilization. For a specific TCO calculation, feel free to contact our sales team.
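For a rough feel of that math, here is a simple payback sketch; every figure in it is an illustrative assumption (GPU pricing, utilization, and hardware budget), not a quote.

```python
# All numbers are illustrative assumptions; plug in your own.
gpu_cost_per_hour = 24.0        # e.g. an 8×H100 cloud instance
util_storage_bound = 0.60       # GPU utilization while I/O-bound
util_with_rdma = 0.95           # GPU utilization with low-latency storage
hours_per_month = 720

wasted_now = gpu_cost_per_hour * (1 - util_storage_bound) * hours_per_month
wasted_after = gpu_cost_per_hour * (1 - util_with_rdma) * hours_per_month
monthly_savings = wasted_now - wasted_after

storage_budget = 30_000         # hypothetical all-flash NAS investment
print(f"GPU spend wasted per month today:   ${wasted_now:,.0f}")
print(f"GPU spend wasted after the upgrade: ${wasted_after:,.0f}")
print(f"Estimated payback period:           {storage_budget / monthly_savings:.1f} months")
```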
Does the TS-h1290FX support simultaneous use by multiple teams (multi-tenancy)?
Fully supported. The TS-h1290FX can be configured with multiple independent NFS shares, individual user accounts, and network isolation. Combined with the ZFS Dataset and Snapshot mechanisms, you can establish independent storage spaces, backup strategies, and access controls for each team or department, making it ideal for Managed Service Providers (MSPs) or large enterprise internal multi-department scenarios.
Compared to pure cloud AI training platforms, what are the advantages of an on-premises TS-h1290FX?
The main challenges of cloud platforms include exorbitant data transfer fees (egress costs), regulatory compliance risks for sensitive training data, and unpredictable long-term computing costs. The TS-h1290FX provides high-speed on-premises storage, ensuring data never leaves your facility while utilizing RDMA to match the I/O performance of high-end cloud storage. It acts as the perfect balance between performance, data sovereignty, and TCO.
Can the TS-h1290FX be integrated into existing MLOps workflows (e.g., Kubernetes, Kubeflow)?
Yes. The TS-h1290FX provides standard NFS v4.1 mounting, which Kubernetes can consume directly via a PersistentVolume (PV). On Kubernetes nodes with RDMA-capable NICs, specifying the RDMA transport in the PV's mount options enables full-speed NFS over RDMA connections. Additionally, through the S3-compatible endpoints provided by QuObjects, it can be integrated into MLOps toolchains that speak the S3 protocol (such as an MLflow artifact store or DVC remote storage).
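A minimal sketch of such a PersistentVolume using the official Kubernetes Python client is shown below; the server address, export path, capacity, and volume name are placeholders, and an equivalent YAML manifest works just as well.

```python
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() when running in a pod

# Placeholder address and export: point these at the TS-h1290FX NFS share.
pv = client.V1PersistentVolume(
    metadata=client.V1ObjectMeta(name="training-data-rdma"),
    spec=client.V1PersistentVolumeSpec(
        capacity={"storage": "10Ti"},
        access_modes=["ReadWriteMany"],
        persistent_volume_reclaim_policy="Retain",
        # The RDMA transport is requested through ordinary NFS mount options.
        mount_options=["vers=4.1", "proto=rdma", "port=20049"],
        nfs=client.V1NFSVolumeSource(
            server="192.168.100.10",
            path="/share/TrainingData",
        ),
    ),
)

client.CoreV1Api().create_persistent_volume(pv)
```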
How do we handle backup and disaster recovery for model checkpoints?
The TS-h1290FX offers a multi-layered protection strategy: ZFS snapshots can be scheduled to run automatically every hour, providing granular restore points; paired with another ZFS NAS, SnapSync enables real-time block-level synchronization for offsite disaster recovery; for long-term archiving, Hybrid Backup Sync (HBS 3) supports backing up data to the cloud (AWS S3, Azure Blob, B2, etc.). This triple-layered protection can be flexibly configured according to your RTO/RPO requirements.
Does the TS-h1290FX support the S3 object storage protocol?
Supported. After installing QuObjects, the TS-h1290FX acts as an on-premises S3-compatible object storage endpoint, supporting Object Lock (WORM) immutable storage. This allows for hybrid workflows in AI: high-speed dataset reading during the training phase via NFS over RDMA, and secure storage and management of model versions and analysis results during the inference phase via the S3 protocol.
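For instance, a training job could push a model checkpoint to the NAS with any standard S3 SDK. In this sketch the endpoint URL, credentials, bucket, and object names are placeholders that depend on how QuObjects is configured in your environment.

```python
import boto3

# Placeholder endpoint and credentials: use the address, port, and access keys
# configured in QuObjects on the TS-h1290FX.
s3 = boto3.client(
    "s3",
    endpoint_url="https://nas.example.local",
    aws_access_key_id="YOUR_ACCESS_KEY",
    aws_secret_access_key="YOUR_SECRET_KEY",
)

# Upload a checkpoint into an Object Lock-capable bucket.
s3.upload_file("checkpoints/epoch_42.pt", "model-registry", "llm/epoch_42.pt")
```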

Eliminate GPU Wait Times

TS-h1290FX × NFS over RDMA — The Storage Infrastructure for On-Premises AI Training

View Product Page · Contact Sales Team