L40S vs MI355X

Ada Lovelace vs CDNA 4
Updated 36 days ago

For LLM training and inference, the most common AI use case, the AMD Instinct MI355X comes out ahead on specifications: its 288 GB of VRAM, 8,000 GB/s of memory bandwidth, and up to 4,600 TFLOPS of FP8 throughput accommodate models and batch sizes that do not fit on the L40S. Availability currently favors the L40S, which rents from $0.40 per hour, while the MI355X has no live cloud offers.

L40S from $0.55/hr

Specifications Compared

Spec | L40S | MI355X
TDP | 350 W | 750 W
VRAM | 48 GB | 288 GB
CUDA Cores | 18,176 | n/a
Memory Type | GDDR6 | HBM3e
Architecture | Ada Lovelace | CDNA 4
Form Factors | PCIe | OAM
Interconnect | PCIe 4.0 | Infinity Fabric
Tensor Cores | 568 | n/a
FP8 Performance | 724 TFLOPS | 4,600 TFLOPS
FP16 Performance | 362 TFLOPS | 2,300 TFLOPS
FP32 Performance | 91 TFLOPS | 2,300 TFLOPS
FP64 Performance | 1.4 TFLOPS | 72 TFLOPS
INT8 Performance | 724 TOPS | 4,600 TOPS
Memory Bandwidth | 864 GB/s | 8,000 GB/s

Performance Analysis

On raw compute, the MI355X is far ahead. Its 2,300 TFLOPS of FP16 is more than six times the L40S's 362 TFLOPS, its listed 2,300 TFLOPS of FP32 is about 25 times the L40S's 91 TFLOPS, and its FP8 throughput reaches 4,600 TFLOPS against 724 TFLOPS. The gap matters for both training and inference: the L40S relies on its tensor cores for FP16-dominant neural network training, while the MI355X's high FP16 and FP32 throughput suits mixed-precision training and FP32-heavy scientific simulations.
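The ratios quoted above can be reproduced from the listed peak figures. The sketch below is only arithmetic on the numbers in the Specifications Compared table; the dictionary and function names are illustrative, and peak figures say nothing about sustained throughput on a real workload.

```python
# Peak throughput in TFLOPS, copied from the Specifications Compared table.
LISTED_TFLOPS = {
    "FP8": {"L40S": 724.0, "MI355X": 4600.0},
    "FP16": {"L40S": 362.0, "MI355X": 2300.0},
    "FP32": {"L40S": 91.0, "MI355X": 2300.0},
    "FP64": {"L40S": 1.4, "MI355X": 72.0},
}


def ratio(precision: str) -> float:
    """Return listed MI355X throughput divided by listed L40S throughput."""
    row = LISTED_TFLOPS[precision]
    return row["MI355X"] / row["L40S"]


for precision in LISTED_TFLOPS:
    print(f"{precision}: MI355X is {ratio(precision):.1f}x the L40S")
```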

Memory capacity determines what fits on a single GPU: the MI355X's 288 GB of HBM3e holds models that exceed the L40S's 48 GB of GDDR6, so many large LLMs can run without sharding across devices, and the extra capacity leaves room for larger batch sizes. Memory bandwidth determines how quickly weights and activations reach the compute units: 8,000 GB/s versus 864 GB/s shortens memory-bound steps such as LLM token generation and reduces data bottlenecks during training.
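As a rough illustration of the capacity point, the sketch below estimates whether a model's weights alone fit on one card. It is a back-of-the-envelope check, not a sizing tool: the 13B and 70B model sizes are hypothetical examples, the 80% usable fraction is an assumption that reserves room for activations and KV cache, and training needs considerably more memory than the weights.

```python
# VRAM per card in GB, from the Specifications Compared table.
VRAM_GB = {"L40S": 48, "MI355X": 288}

# Bytes stored per parameter at each precision.
BYTES_PER_PARAM = {"FP16": 2, "FP8": 1}


def weight_footprint_gb(params_billions: float, precision: str) -> float:
    """Approximate size of the weights in GB (1 GB = 1e9 bytes)."""
    return params_billions * BYTES_PER_PARAM[precision]


def fits_on_one_gpu(params_billions: float, precision: str, gpu: str,
                    usable_fraction: float = 0.8) -> bool:
    """True if the weights fit in the assumed usable share of VRAM.

    usable_fraction is an assumption, not a vendor figure.
    """
    budget_gb = VRAM_GB[gpu] * usable_fraction
    return weight_footprint_gb(params_billions, precision) <= budget_gb


# Hypothetical model sizes, chosen only for illustration.
for size in (13, 70):
    for gpu in VRAM_GB:
        print(size, "B FP16 on", gpu, "->", fits_on_one_gpu(size, "FP16", gpu))
```

Under these assumptions a 13B FP16 model fits on either card, while a 70B FP16 model (about 140 GB of weights) fits only on the MI355X.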

Power and interconnect pull the other way. The L40S's 350 W TDP allows denser racks than the MI355X's 750 W, and its PCIe 4.0 interface fits standard servers, whereas the MI355X's Infinity Fabric targets purpose-built multi-GPU systems. Overall, the MI355X prioritizes peak throughput for frontier workloads, while the L40S offers a lower-power option for production-scale inference.

Live Cloud Pricing

Real-time prices from 25+ providers. Updated every 60 seconds.

L40S

Provider | GPU Model | VRAM | Price | Status
TensorDock | NVIDIA L40S | 48GB VRAM | $0.55/GPU/hr | Available
RunPod | NVIDIA L40S | 48GB VRAM | $0.86/GPU/hr |
Massed Compute | 4×NVIDIA L40S | 48GB VRAM | $0.88/GPU/hr ($3.52/hr total, 4×) | Available
Massed Compute | 2×NVIDIA L40S | 48GB VRAM | $0.88/GPU/hr ($1.76/hr total, 2×) | Available
Massed Compute | NVIDIA L40S | 48GB VRAM | $0.88/GPU/hr | Available
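
The per-GPU and total prices in the listing above are related by simple multiplication. The sketch below recomputes the totals and picks the lowest per-GPU rate among the five offers shown; the `Offer` class and its field names are hypothetical, and the prices are a snapshot that will drift from the live feed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Offer:
    provider: str
    gpu_count: int
    price_per_gpu_hr: float  # USD per GPU per hour

    @property
    def total_per_hr(self) -> float:
        """Hourly cost of the whole instance, rounded to cents."""
        return round(self.gpu_count * self.price_per_gpu_hr, 2)


# Snapshot of the L40S offers listed above.
OFFERS = [
    Offer("TensorDock", 1, 0.55),
    Offer("RunPod", 1, 0.86),
    Offer("Massed Compute", 4, 0.88),
    Offer("Massed Compute", 2, 0.88),
    Offer("Massed Compute", 1, 0.88),
]

cheapest = min(OFFERS, key=lambda offer: offer.price_per_gpu_hr)

for offer in OFFERS:
    print(f"{offer.provider} x{offer.gpu_count}: ${offer.total_per_hr:.2f}/hr")
print("Lowest per-GPU rate:", cheapest.provider)
```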


When to Choose the L40S

The L40S is the better choice when immediate availability and cost matter most. Cloud pricing starts at $0.40 per hour and averages $1.10 per hour across 18 offers, whereas the MI355X currently has no live instances. Its 350 W TDP and PCIe form factor also let it drop into existing PCIe 4.0 systems without specialized OAM support.

Current workloads like Stable Diffusion generation or fine-tuning models under 48 GB VRAM benefit from the L40S's 362 TFLOPS FP16 and 724 TFLOPS FP8, delivering reliable performance without overprovisioning power or memory.

When to Choose the MI355X

The MI355X stands out for workloads that demand extreme scale and memory capacity. Its 288 GB of HBM3e holds LLMs that exceed the L40S's 48 GB limit and leaves room for large training batches, while its 8,000 GB/s of bandwidth keeps the compute units fed.

High-compute tasks leverage 2300 TFLOPS FP16, 2300 TFLOPS FP32, and 4600 TFLOPS FP8, ideal for FP32-intensive scientific computing or next-generation inference at scale, despite the 750 W TDP and OAM form factor.

Use Cases

LLM Training
MI355X

The MI355X's 288 GB of HBM3e and 8,000 GB/s of bandwidth accommodate large batch sizes and models that exceed the L40S's 48 GB limit. Its 2,300 TFLOPS of FP16 against the L40S's 362 TFLOPS shortens each training step.

LLM Inference
MI355X

4600 TFLOPS FP8 on the MI355X accelerates high-throughput inference for large models, surpassing the L40S's 724 TFLOPS. 288 GB VRAM enables deployment without multi-GPU sharding.

Fine-tuning
L40S

The L40S's 48 GB of VRAM is enough for fine-tuning jobs whose working set fits within it, and instances are available immediately from $0.40 per hour. Its 362 TFLOPS of FP16 supports efficient iteration without the MI355X's 750 W power draw.

Stable Diffusion
L40S

Stable Diffusion models fit within the L40S's 48 GB of GDDR6, and its 724 TFLOPS of FP8 delivers fast generation. The lower 350 W TDP and PCIe compatibility suit creative workflows.

Scientific Computing
MI355X

2300 TFLOPS FP32 on the MI355X excels in simulations requiring high precision, far beyond the L40S's 91 TFLOPS. Infinity Fabric aids multi-node scaling.

Frequently Asked Questions

What is the VRAM difference between L40S and MI355X?

The L40S provides 48 GB of GDDR6 VRAM, while the MI355X offers 288 GB of HBM3e. This sixfold increase lets the MI355X load much larger models without distributing them across multiple GPUs.

How do FP16 performance figures compare?

The MI355X achieves 2300 TFLOPS FP16, over six times the L40S's 362 TFLOPS. This gap accelerates AI training and inference on the MI355X for FP16-heavy workloads.

What are the current cloud prices for these GPUs?

L40S instances start at $0.40 per hour with an average of $1.10 per hour across 18 offers. The MI355X has no live cloud offers available yet.

Which GPU has higher memory bandwidth?

The MI355X delivers 8,000 GB/s from HBM3e, compared with 864 GB/s from the L40S's GDDR6. Higher bandwidth moves weights and activations to the compute units faster, which matters most for memory-bound work such as LLM token generation.
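
To give a feel for what the bandwidth gap means, the sketch below computes a common roofline-style upper bound on single-stream token generation: if every weight must be read from memory once per generated token, tokens per second cannot exceed bandwidth divided by weight size. The 13B FP16 model is a hypothetical example, and the bound assumes batch size 1 while ignoring KV cache traffic, compute limits, and software overhead, so real throughput will be lower.

```python
# Memory bandwidth in GB/s, from the Specifications Compared table.
BANDWIDTH_GB_S = {"L40S": 864.0, "MI355X": 8000.0}


def decode_tokens_per_s_bound(bandwidth_gb_s: float,
                              params_billions: float,
                              bytes_per_param: float) -> float:
    """Upper bound on batch-1 decode speed when weight reads dominate.

    Assumes each generated token streams all weights from memory once.
    """
    weight_gb = params_billions * bytes_per_param
    return bandwidth_gb_s / weight_gb


# Hypothetical 13B-parameter model stored in FP16 (2 bytes per parameter).
for gpu, bandwidth in BANDWIDTH_GB_S.items():
    bound = decode_tokens_per_s_bound(bandwidth, 13, 2)
    print(f"{gpu}: at most {bound:.1f} tokens/s")
```

The ratio between the two bounds equals the bandwidth ratio, about 9.3x, regardless of the model size chosen.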

What are the TDP ratings?

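The L40S has a 350 W TDP, less than half the MI355X's 750 W, so more cards fit within a fixed rack power budget. Measured by listed FP16 throughput per watt, however, the MI355X is ahead: about 3.1 TFLOPS/W against roughly 1.0 TFLOPS/W for the L40S.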
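
The trade-off between per-card power and efficiency can be worked through with the listed figures. The sketch below divides listed FP16 throughput by TDP and counts how many cards fit in a power budget; the 10 kW budget is a hypothetical value, and the calculation ignores host, cooling, and networking power.

```python
# TDP in watts and peak FP16 in TFLOPS, from the Specifications Compared table.
CARDS = {
    "L40S": {"tdp_w": 350, "fp16_tflops": 362},
    "MI355X": {"tdp_w": 750, "fp16_tflops": 2300},
}


def tflops_per_watt(gpu: str) -> float:
    """Listed FP16 throughput divided by TDP."""
    card = CARDS[gpu]
    return card["fp16_tflops"] / card["tdp_w"]


def cards_in_budget(gpu: str, budget_w: int) -> int:
    """Whole cards that fit in a GPU-only power budget."""
    return budget_w // CARDS[gpu]["tdp_w"]


BUDGET_W = 10_000  # hypothetical rack budget for GPUs only

for gpu, card in CARDS.items():
    count = cards_in_budget(gpu, BUDGET_W)
    total = count * card["fp16_tflops"]
    print(f"{gpu}: {tflops_per_watt(gpu):.2f} TFLOPS/W, "
          f"{count} cards, {total} TFLOPS aggregate")
```

Under these assumptions the budget holds more L40S cards, but the MI355X cards deliver roughly three times the aggregate listed FP16 throughput.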

Which is better for FP32 workloads?

The MI355X provides 2300 TFLOPS FP32, vastly superior to the L40S's 91 TFLOPS. It suits scientific computing and simulations needing high single-precision performance.

Which is cheaper to rent, the L40S or the MI355X?

Only the L40S can be priced today, because the MI355X has no live cloud offers. L40S rental prices vary by provider, configuration, and availability. This page shows live pricing from 25+ providers, updated every 60 seconds; scroll to the Live Cloud Pricing section to compare current rates.

How much VRAM does the L40S have compared to the MI355X?

The L40S has 48 GB of GDDR6 memory. The MI355X has 288 GB of HBM3e memory.

Can I find L40S and MI355X GPUs available to rent right now?

For the L40S, yes. This page shows real-time availability across 25+ cloud GPU providers, and the Live Cloud Pricing section displays only in-stock offers with current pricing. The MI355X has no live cloud offers yet.

What is the main difference between the L40S and the MI355X?

The L40S uses the Ada Lovelace architecture (2023) while the MI355X uses CDNA 4 (2025). The MI355X delivers 6.4x the FP16 throughput and 9.3x the memory bandwidth of the L40S.
