A40 vs L40S

Ampere vs Ada Lovelace
Updated 36 days ago

The L40S is the stronger choice for common AI workloads such as training and inference. On peak specs it delivers 362 TFLOPS FP16 versus the A40's 37.4 TFLOPS, and 864 GB/s of memory bandwidth versus 696 GB/s, which outweighs its higher starting price of $0.40 per hour.

A40 from $0.08/hr
L40S from $0.55/hr

Specifications Compared

| Spec | A40 | L40S |
|---|---|---|
| TDP | 300W | 350W |
| VRAM | 48 GB | 48 GB |
| CUDA Cores | 10,752 | 18,176 |
| Memory Type | GDDR6 | GDDR6 |
| Architecture | Ampere | Ada Lovelace |
| Form Factors | PCIe | PCIe |
| Interconnect | NVLink | PCIe 4.0 |
| Tensor Cores | 336 | 568 |
| FP16 Performance | 37.4 TFLOPS | 362 TFLOPS |
| FP32 Performance | 37.4 TFLOPS | 91 TFLOPS |
| FP64 Performance | 0.6 TFLOPS | 1.4 TFLOPS |
| INT8 Performance | 299 TOPS | 724 TOPS |
| Memory Bandwidth | 696 GB/s | 864 GB/s |

Performance Analysis

Compute specifications favor the L40S. Its 362 TFLOPS of peak FP16 throughput is roughly 9.7 times the A40's 37.4 TFLOPS, which shortens the time per training step for compute-bound models such as transformers; the actual speedup depends on how much of that peak a workload reaches. FP32 performance of 91 TFLOPS on the L40S is about 2.4 times the A40's 37.4 TFLOPS, which helps precision-sensitive simulations. The L40S also reaches 724 TFLOPS in FP8, enabling fast inference on quantized models.
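The ratios quoted above follow directly from the spec table. A minimal sketch of the arithmetic, assuming the table's peak figures as inputs (the `SPECS` dictionary and `ratio` helper are illustrative names, not part of any tool):

```python
# Peak throughput in TFLOPS, copied from the spec table.
SPECS = {
    "A40": {"fp16": 37.4, "fp32": 37.4},
    "L40S": {"fp16": 362.0, "fp32": 91.0},
}


def ratio(metric: str) -> float:
    """Return the L40S-to-A40 ratio for one peak metric."""
    return SPECS["L40S"][metric] / SPECS["A40"][metric]


fp16_ratio = ratio("fp16")
fp32_ratio = ratio("fp32")
print(f"FP16: {fp16_ratio:.1f}x, FP32: {fp32_ratio:.1f}x")
# prints "FP16: 9.7x, FP32: 2.4x"
```

These are ratios of peak figures only; measured speedups on a given model will usually be smaller.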

Memory bandwidth of 864 GB/s on the L40S exceeds the A40's 696 GB/s by 24 percent. The extra bandwidth helps sustain throughput at larger batch sizes and reduces the time compute units wait on data when a working set fills much of the 48 GB of VRAM, which matters most in memory-bound workloads such as fine-tuning large models. The L40S's 350W TDP, versus 300W for the A40, reflects its higher compute throughput and requires adequate power delivery and cooling.
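One way to see when the bandwidth gap matters more than the compute gap is a roofline-style "ridge point": the number of floating-point operations a kernel must perform per byte moved before peak compute, rather than memory bandwidth, becomes the limit. The sketch below is a simplification that assumes the table's peak FP16 and bandwidth figures; `ridge_point` is a hypothetical helper name.

```python
def ridge_point(peak_tflops: float, bandwidth_gbs: float) -> float:
    """FLOPs per byte at which peak compute and memory bandwidth balance."""
    return (peak_tflops * 1e12) / (bandwidth_gbs * 1e9)


a40_ridge = ridge_point(37.4, 696)    # A40: peak FP16 TFLOPS, GB/s
l40s_ridge = ridge_point(362.0, 864)  # L40S: peak FP16 TFLOPS, GB/s
bandwidth_gain = 864 / 696 - 1        # relative bandwidth advantage

print(f"A40 ridge:  {a40_ridge:.0f} FLOPs/byte")
print(f"L40S ridge: {l40s_ridge:.0f} FLOPs/byte")
print(f"Bandwidth gain: {bandwidth_gain:.0%}")
```

Under this model, a kernel whose arithmetic intensity sits below both ridge points is bandwidth-limited on either card, so its speedup on the L40S is bounded by the 24 percent bandwidth gain rather than by the compute ratio.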

In practice, the L40S is the better fit for current frameworks that exploit Ada features such as FP8, while the A40 remains adequate for existing Ampere-era codebases but is slower on raw throughput.

Live Cloud Pricing

Real-time prices from 25+ providers. Updated every 60 seconds.

A40

| Provider | GPU Model | VRAM | Price | Total | Status |
|---|---|---|---|---|---|
| TensorDock | NVIDIA RTX A4000 | 16 GB | $0.08/GPU/hr | - | Available |
| Vast.ai | 8× NVIDIA RTX A4000 | 16 GB | $0.15/GPU/hr | $1.17/hr (8×) | Available |
| Hyperstack | 4× NVIDIA RTX A4000 | 16 GB | $0.15/GPU/hr | $0.60/hr (4×) | Available |
| Hyperstack | 2× NVIDIA RTX A4000 | 16 GB | $0.15/GPU/hr | $0.30/hr (2×) | Available |
| Hyperstack | NVIDIA RTX A4000 | 16 GB | $0.15/GPU/hr | - | Available |

L40S

| Provider | GPU Model | VRAM | Price | Total | Status |
|---|---|---|---|---|---|
| TensorDock | NVIDIA L40S | 48 GB | $0.55/GPU/hr | - | Available |
| RunPod | NVIDIA L40S | 48 GB | $0.86/GPU/hr | - | - |
| Massed Compute | 4× NVIDIA L40S | 48 GB | $0.88/GPU/hr | $3.52/hr (4×) | Available |
| Massed Compute | 2× NVIDIA L40S | 48 GB | $0.88/GPU/hr | $1.76/hr (2×) | Available |
| Massed Compute | NVIDIA L40S | 48 GB | $0.88/GPU/hr | - | Available |
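Multi-GPU listings quote both a per-GPU rate and an instance total, so offers are easiest to compare on the per-GPU rate. A minimal sketch using the L40S listings above as sample data; the `Offer` class and its field names are hypothetical and do not correspond to any provider's API:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Offer:
    provider: str
    gpu_count: int
    price_per_gpu_hr: float  # USD, as listed

    @property
    def total_per_hr(self) -> float:
        """Hourly cost of the whole instance, rounded to cents."""
        return round(self.gpu_count * self.price_per_gpu_hr, 2)


# L40S offers copied from the listings above.
L40S_OFFERS = [
    Offer("TensorDock", 1, 0.55),
    Offer("RunPod", 1, 0.86),
    Offer("Massed Compute", 4, 0.88),
    Offer("Massed Compute", 2, 0.88),
    Offer("Massed Compute", 1, 0.88),
]

# Compare on the per-GPU rate so instance size does not skew the result.
cheapest = min(L40S_OFFERS, key=lambda o: o.price_per_gpu_hr)
four_gpu = next(o for o in L40S_OFFERS if o.gpu_count == 4)

print(cheapest.provider, cheapest.price_per_gpu_hr)
print(four_gpu.total_per_hr)
```

Because the listed prices change over time, the sample data would need refreshing before drawing any conclusion from it.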


When to Choose the A40

The A40 fits cost-sensitive or power-limited environments. Pricing starts at $0.24 per hour across 23 cloud offers, undercutting the L40S's $0.40 per hour minimum, and its 48 GB of GDDR6 VRAM at a 300W TDP suits older servers with tighter power budgets. NVLink support allows scalable multi-GPU training on Ampere-based software stacks.

When to Choose the L40S

The L40S targets high-performance AI pipelines. Its 362 TFLOPS FP16 and 724 TFLOPS FP8 far exceed the A40's listed figures, speeding up LLM training and inference, while 864 GB/s of bandwidth helps with large batches. At an average of $1.10 per hour across 18 offers, it delivers strong value for Ada workloads.
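The trade-off between the two sections above can be expressed as peak throughput per dollar. The sketch below assumes the starting prices quoted in the text ($0.24 and $0.40 per hour) and the table's peak FP16 figures; `GPUS` and `tflops_per_dollar` are illustrative names.

```python
# Peak FP16 TFLOPS from the spec table; starting prices in USD per GPU-hour
# as quoted in the text. Both are assumptions to replace with current values.
GPUS = {
    "A40": {"fp16_tflops": 37.4, "usd_per_hr": 0.24},
    "L40S": {"fp16_tflops": 362.0, "usd_per_hr": 0.40},
}


def tflops_per_dollar(name: str) -> float:
    """Peak FP16 TFLOPS obtained per dollar of hourly spend."""
    gpu = GPUS[name]
    return gpu["fp16_tflops"] / gpu["usd_per_hr"]


a40_value = tflops_per_dollar("A40")
l40s_value = tflops_per_dollar("L40S")
advantage = l40s_value / a40_value

print(f"A40:  {a40_value:.1f} TFLOPS per $/hr")
print(f"L40S: {l40s_value:.1f} TFLOPS per $/hr")
print(f"L40S advantage: {advantage:.1f}x")
```

On these inputs the L40S offers more peak FP16 throughput per dollar despite the higher hourly rate. The result shifts with the prices used, and a workload that cannot use the extra compute may still be cheaper to run on the A40.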

Use Cases

LLM Training
Winner: L40S

The L40S's 362 TFLOPS FP16 is 9.7 times the A40's 37.4 TFLOPS on paper, which cuts training time for large, compute-bound models. Its higher 864 GB/s bandwidth helps keep larger batches fed within the 48 GB of VRAM.

LLM Inference
Winner: L40S

The L40S reaches 724 TFLOPS in FP8 for quantized serving, well beyond the A40's listed figures. Its 362 TFLOPS FP16 helps keep response latency low.
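Quantization also affects whether a model fits in the 48 GB of VRAM at all. A rough sketch of the weight footprint at FP16 and FP8, assuming two bytes and one byte per parameter respectively and ignoring the KV cache, activations, and framework overhead; the 30-billion-parameter model is a hypothetical example, and `weight_gb` and `fits` are illustrative helpers.

```python
BYTES_PER_PARAM = {"fp16": 2, "fp8": 1}
VRAM_GB = 48  # both the A40 and the L40S


def weight_gb(params_billion: float, dtype: str) -> float:
    """Approximate weight size in GB (10^9 bytes) for a given precision."""
    return params_billion * BYTES_PER_PARAM[dtype]


def fits(params_billion: float, dtype: str) -> bool:
    """True if the weights alone fit in VRAM; real deployments need headroom."""
    return weight_gb(params_billion, dtype) <= VRAM_GB


# Hypothetical 30B-parameter model.
for dtype in ("fp16", "fp8"):
    print(dtype, weight_gb(30, dtype), "GB, fits:", fits(30, dtype))
```

For this hypothetical model the FP16 weights exceed a single card's memory while the FP8 weights leave room for the cache, which is one reason FP8 support matters for single-GPU serving.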

Fine-tuning
Winner: L40S

The L40S's 91 TFLOPS FP32 and 362 TFLOPS FP16 both exceed the A40's 37.4 TFLOPS at each precision, accelerating parameter updates. Its bandwidth advantage helps with memory-intensive tuning.

Stable Diffusion
Winner: L40S

The L40S's 362 TFLOPS FP16 gives it up to 9.7 times the A40's peak throughput (37.4 TFLOPS) for image generation. The 48 GB of VRAM handles high-resolution diffusion models.

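Scientific Computing
Winner: L40S

The L40S's 91 TFLOPS FP32 is 2.4 times the A40's 37.4 TFLOPS, which benefits single-precision simulations. FP64 throughput is low on both cards (1.4 versus 0.6 TFLOPS), so neither is well suited to codes dominated by double precision.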

Frequently Asked Questions

Do the A40 and L40S have the same VRAM?

Both GPUs provide 48 GB of VRAM, and both use GDDR6. The L40S's memory is faster, with 864 GB/s of bandwidth versus 696 GB/s on the A40.

Which GPU is cheaper in the cloud?

A40 starts at $0.24 per hour (average $1.26 per hour across 23 offers). L40S begins at $0.40 per hour (average $1.10 per hour across 18 offers).

What is the FP16 performance difference?

L40S delivers 362 TFLOPS FP16, 9.7 times the A40's 37.4 TFLOPS. This gap favors L40S for AI training.

Which has higher TDP?

L40S TDP is 350W, higher than A40's 300W. This supports greater compute but needs better cooling.

What architectures do they use?

A40 is Ampere from 2020 with NVLink. L40S is Ada Lovelace from 2023 with PCIe 4.0.

Is L40S better for inference?

Yes, L40S FP8 at 724 TFLOPS excels for quantized inference. FP16 at 362 TFLOPS also outpaces A40's 37.4 TFLOPS.

Which is cheaper to rent, the A40 or the L40S?

Cloud rental prices for both the A40 and L40S vary by provider, configuration, and availability. This page shows live pricing from 25+ providers updated every 60 seconds. Scroll to the Live Cloud Pricing section to compare current rates.

How much VRAM does the A40 have compared to the L40S?

The A40 has 48 GB of GDDR6 memory. The L40S also has 48 GB of GDDR6 memory, with higher bandwidth (864 GB/s versus 696 GB/s).

Can I find A40 and L40S GPUs available to rent right now?

Yes. This page shows real-time availability across 25+ cloud GPU providers. The Live Cloud Pricing section displays only in-stock offers with current pricing.

What is the main difference between the A40 and the L40S?

The A40 uses the Ampere architecture (2020) while the L40S uses Ada Lovelace (2023). The L40S delivers 9.7x the FP16 throughput and 1.2x the memory bandwidth of the A40.