L40S vs RTX 4090

Ada Lovelace vs Ada Lovelace
Updated 40 days ago

The RTX 4090 wins for most common cloud AI tasks on price-to-performance. At an average of $0.39 per hour versus $1.66 for the L40S, it delivers 165 TFLOPS FP16 and 1,008 GB/s of memory bandwidth, enough for training and inference on models that fit in 24 GB, and it is more widely available, with 75 offers.

L40S from $0.55/hr
RTX 4090 from $0.39/hr

Specifications Compared

Spec                L40S            RTX 4090
TDP                 350 W           450 W
VRAM                48 GB           24 GB
CUDA Cores          18,176          16,384
Memory Type         GDDR6           GDDR6X
Architecture        Ada Lovelace    Ada Lovelace
Form Factors        PCIe            PCIe
Interconnect        PCIe 4.0        PCIe 4.0
Tensor Cores        568             512
FP8 Performance     724 TFLOPS      660 TFLOPS
FP16 Performance    362 TFLOPS      165 TFLOPS
FP32 Performance    91 TFLOPS       82.6 TFLOPS
FP64 Performance    1.4 TFLOPS      1.3 TFLOPS
INT8 Performance    724 TOPS        660 TOPS
Memory Bandwidth    864 GB/s        1,008 GB/s

Performance Analysis

The L40S leads in FP16 at 362 TFLOPS versus the RTX 4090's 165 TFLOPS, which speeds up mixed-precision training, where most matrix math runs in half precision; the gap shortens the time per training step on large neural networks. Its FP32 rate of 91 TFLOPS edges out the RTX 4090's 82.6 TFLOPS, a modest benefit for single-precision scientific simulations. The L40S's 48 GB of VRAM, double the RTX 4090's 24 GB, sustains larger batch sizes in LLM training and reduces the overhead of swapping model state in and out of memory.

Memory bandwidth favors the RTX 4090 at 1,008 GB/s over 864 GB/s, which improves throughput in bandwidth-bound inference such as high-resolution image generation. For FP8 inference, the L40S's 724 TFLOPS surpasses the RTX 4090's 660 TFLOPS, which helps when serving quantized models at scale. The L40S's lower TDP of 350 W versus 450 W allows denser packing in cloud servers, though real-world efficiency depends on how memory-intensive the workload is.
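The trade-offs above can be made concrete with a little arithmetic. The following Python sketch takes the figures from the specification table and the average hourly prices quoted on this page ($1.66 for the L40S, $0.39 for the RTX 4090) and computes spec ratios and peak FP16 TFLOPS per dollar-hour. The dictionary layout and function names are illustrative assumptions, and peak TFLOPS per dollar is only a rough proxy for what a real workload costs.

```python
# Figures copied from the specification table and the average hourly
# prices quoted on this page; the dictionary layout is an assumption.
SPECS = {
    "L40S": {"fp16_tflops": 362.0, "bandwidth_gbs": 864.0, "avg_price_hr": 1.66},
    "RTX 4090": {"fp16_tflops": 165.0, "bandwidth_gbs": 1008.0, "avg_price_hr": 0.39},
}


def spec_ratio(metric: str, numerator: str = "RTX 4090", denominator: str = "L40S") -> float:
    """Return the value of `metric` on one GPU divided by its value on the other."""
    return SPECS[numerator][metric] / SPECS[denominator][metric]


def fp16_tflops_per_dollar_hour(gpu: str) -> float:
    """Peak FP16 TFLOPS obtained per dollar of hourly rental (a rough proxy only)."""
    spec = SPECS[gpu]
    return spec["fp16_tflops"] / spec["avg_price_hr"]


if __name__ == "__main__":
    print(f"FP16 throughput, RTX 4090 / L40S: {spec_ratio('fp16_tflops'):.2f}x")
    print(f"Memory bandwidth, RTX 4090 / L40S: {spec_ratio('bandwidth_gbs'):.2f}x")
    for gpu in SPECS:
        print(f"{gpu}: {fp16_tflops_per_dollar_hour(gpu):.0f} peak FP16 TFLOPS per $/hr")
```

With these inputs the RTX 4090 buys roughly twice the peak FP16 throughput per dollar (about 423 versus 218 TFLOPS per dollar-hour), even though the L40S is faster per card. The result shifts with whichever prices are current.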

Live Cloud Pricing

Real-time prices from 25+ providers. Updated every 60 seconds.

L40S

Provider          GPU Model         VRAM     Price                                Status
TensorDock        NVIDIA L40S       48 GB    $0.55/GPU/hr                         Available
RunPod            NVIDIA L40S       48 GB    $0.86/GPU/hr
Massed Compute    NVIDIA L40S       48 GB    $0.88/GPU/hr                         Available
Massed Compute    2× NVIDIA L40S    48 GB    $0.88/GPU/hr ($1.76/hr total, 2×)    Available
Massed Compute    NVIDIA L40S       48 GB    $0.88/GPU/hr                         Available

RTX 4090

Provider      GPU Model                     VRAM     Price                                Status
TensorDock    NVIDIA GeForce RTX 4090       24 GB    $0.39/GPU/hr                         Available
Vast.ai       NVIDIA GeForce RTX 4090       24 GB    $0.40/GPU/hr                         Available
TensorDock    NVIDIA GeForce RTX 4090       24 GB    $0.48/GPU/hr                         Available
Vast.ai       NVIDIA GeForce RTX 4090       24 GB    $0.53/GPU/hr                         Available
Vast.ai       4× NVIDIA GeForce RTX 4090    24 GB    $0.67/GPU/hr ($2.67/hr total, 4×)    Available
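To compare listings like the ones above programmatically, a small script can find the cheapest offer and the mean per-GPU price for each card. The sketch below hard-codes the rows shown in this section as a snapshot; the `Offer` class and helper names are hypothetical, and because it covers only these displayed rows its averages will not match the page-wide averages quoted elsewhere.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Offer:
    provider: str
    gpu: str
    price_per_gpu_hr: float  # USD per GPU per hour
    gpu_count: int = 1


# Snapshot of the rows listed in this section; live prices change constantly.
OFFERS = [
    Offer("TensorDock", "L40S", 0.55),
    Offer("RunPod", "L40S", 0.86),
    Offer("Massed Compute", "L40S", 0.88),
    Offer("Massed Compute", "L40S", 0.88, gpu_count=2),
    Offer("Massed Compute", "L40S", 0.88),
    Offer("TensorDock", "RTX 4090", 0.39),
    Offer("Vast.ai", "RTX 4090", 0.40),
    Offer("TensorDock", "RTX 4090", 0.48),
    Offer("Vast.ai", "RTX 4090", 0.53),
    Offer("Vast.ai", "RTX 4090", 0.67, gpu_count=4),
]


def cheapest(offers, gpu):
    """Return the offer with the lowest per-GPU hourly price for `gpu`."""
    return min((o for o in offers if o.gpu == gpu), key=lambda o: o.price_per_gpu_hr)


def mean_price(offers, gpu):
    """Return the unweighted mean per-GPU hourly price across offers for `gpu`."""
    return mean(o.price_per_gpu_hr for o in offers if o.gpu == gpu)


if __name__ == "__main__":
    for gpu in ("L40S", "RTX 4090"):
        best = cheapest(OFFERS, gpu)
        print(f"{gpu}: cheapest ${best.price_per_gpu_hr:.2f}/GPU/hr at {best.provider}, "
              f"mean ${mean_price(OFFERS, gpu):.2f}/GPU/hr")
```

The mean here is unweighted: a multi-GPU listing counts once, at its per-GPU price.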


When to Choose the L40S

The L40S excels in enterprise deployments that need 48 GB of VRAM: large-scale LLM training or fine-tuning of models that exceed 24 GB fits on a single card, avoiding the need to split the model across GPUs. Its higher FP16 (362 TFLOPS) and FP32 (91 TFLOPS) throughput handles compute-intensive tasks efficiently, despite the $1.66 per hour average cost.

When to Choose the RTX 4090

The RTX 4090 suits budget-conscious users, averaging $0.39 per hour across 75 offers: workloads like Stable Diffusion or smaller inference batches run well within 24 GB of VRAM and 1,008 GB/s of bandwidth. Its higher availability makes it a good fit for prototyping or high-volume parallel jobs where 165 TFLOPS FP16 suffices.

Use Cases

LLM Training
Recommended: L40S

The L40S's 48 GB of VRAM and 362 TFLOPS FP16 support larger models and batches without swapping. The RTX 4090's 24 GB limits the model and batch sizes that fit on one card.
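Whether a model fits in 24 GB or 48 GB can be estimated before renting anything. The sketch below uses two common rules of thumb, both assumptions rather than measurements: about 2 bytes per parameter for FP16 inference weights, and about 16 bytes per parameter for mixed-precision training with Adam (FP16 weights and gradients plus FP32 master weights and two optimizer moments). The 20% headroom for activations and buffers is an arbitrary placeholder, and real training activations can need far more.

```python
# Rule-of-thumb bytes per parameter (assumptions, not measurements):
#   2  -> FP16/BF16 weights only (inference)
#   16 -> mixed-precision training with Adam: FP16 weights and gradients
#         plus FP32 master weights and two FP32 optimizer moments
BYTES_INFERENCE_FP16 = 2
BYTES_TRAINING_ADAM_MIXED = 16

VRAM_GB = {"L40S": 48, "RTX 4090": 24}


def required_gb(params_billion: float, bytes_per_param: float, headroom: float = 0.2) -> float:
    """Estimate memory in GB, padded by `headroom` for activations and buffers."""
    return params_billion * bytes_per_param * (1.0 + headroom)


def fits(params_billion: float, bytes_per_param: float, vram_gb: float, headroom: float = 0.2) -> bool:
    """Return True if the padded estimate is within the card's VRAM."""
    return required_gb(params_billion, bytes_per_param, headroom) <= vram_gb


if __name__ == "__main__":
    cases = [
        ("7B FP16 inference", 7, BYTES_INFERENCE_FP16),
        ("13B FP16 inference", 13, BYTES_INFERENCE_FP16),
        ("1.3B full training", 1.3, BYTES_TRAINING_ADAM_MIXED),
    ]
    for label, params_b, bpp in cases:
        need = required_gb(params_b, bpp)
        verdict = ", ".join(f"{gpu}: {'fits' if need <= gb else 'too large'}" for gpu, gb in VRAM_GB.items())
        print(f"{label}: ~{need:.1f} GB -> {verdict}")
```

Under these assumptions a 13B model in FP16 (about 31 GB with headroom) fits on the L40S but not on the RTX 4090, while a 7B model in FP16 fits on either card.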

LLM Inference
Recommended: Either

The RTX 4090's 1,008 GB/s of bandwidth aids high-throughput serving of models under 24 GB. The L40S's 724 TFLOPS FP8 handles large quantized models efficiently.
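The bandwidth advantage can be sized with a standard back-of-the-envelope bound: in single-stream token generation every weight is read from memory about once per token, so tokens per second cannot exceed memory bandwidth divided by model size in bytes. The sketch below applies that bound to the bandwidth figures from this page. The function name is hypothetical, and the estimate ignores KV-cache traffic, batching, and kernel efficiency, so real throughput will be lower.

```python
BANDWIDTH_GBS = {"L40S": 864.0, "RTX 4090": 1008.0}  # from the specification table


def max_decode_tokens_per_s(bandwidth_gbs: float, params_billion: float, bytes_per_param: float) -> float:
    """Upper bound on single-stream decode speed if all weights are read once per token."""
    model_gb = params_billion * bytes_per_param
    return bandwidth_gbs / model_gb


if __name__ == "__main__":
    # Example: a 7B-parameter model at 2 bytes (FP16) and 1 byte (8-bit) per parameter.
    for bytes_per_param in (2, 1):
        for gpu, bandwidth in BANDWIDTH_GBS.items():
            bound = max_decode_tokens_per_s(bandwidth, 7, bytes_per_param)
            print(f"{gpu}, 7B at {bytes_per_param} B/param: at most ~{bound:.0f} tokens/s")
```

For a 7B model in FP16 the bound is about 72 tokens per second on the RTX 4090 and about 62 on the L40S, which mirrors the 1.2x bandwidth ratio.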

Fine-tuning
Recommended: L40S

The L40S's 91 TFLOPS FP32 and 48 GB of VRAM accelerate parameter-efficient tuning of large models. The RTX 4090's 24 GB becomes a constraint with memory-heavy adapters.

Stable Diffusion
Recommended: RTX 4090

The RTX 4090's 24 GB of VRAM and 1,008 GB/s of bandwidth generate images rapidly at $0.39 per hour. The L40S is overkill for typical resolutions.

Scientific Computing
Recommended: RTX 4090

The RTX 4090's 82.6 TFLOPS FP32 and low cost, starting at $0.27 per hour, fit single-precision simulations. The L40S's extra VRAM and throughput are unnecessary for standard HPC loads.

Frequently Asked Questions

Which has more VRAM, L40S or RTX 4090?

The L40S provides 48 GB of GDDR6 VRAM, twice the RTX 4090's 24 GB of GDDR6X. This advantage suits large-model training. The RTX 4090 suffices for most inference.

What is the FP16 performance difference?

The L40S delivers 362 TFLOPS FP16, more than double the RTX 4090's 165 TFLOPS. This boosts mixed-precision AI training speed. Compute-bound inference sees similar gains.

How do cloud prices compare?

The RTX 4090 starts at $0.27 per hour and averages $0.39 across 75 offers. The L40S starts at $1.65 and averages $1.66 across three offers. Cost drives most choices.

Which has higher memory bandwidth?

The RTX 4090 offers 1,008 GB/s, exceeding the L40S's 864 GB/s. Bandwidth aids data-heavy tasks like image processing. The L40S offsets this with its larger VRAM capacity.

What are the TDP ratings?

The L40S has a 350 W TDP, lower than the RTX 4090's 450 W. This improves density in multi-GPU cloud servers. Power efficiency varies by workload.
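The density point can be illustrated with a simple power-budget calculation. The sketch below uses the TDP and FP16 figures from the specification table; the 3,000 W budget is a hypothetical number chosen for illustration, and the calculation ignores CPU, cooling, and power-supply overhead.

```python
TDP_WATTS = {"L40S": 350, "RTX 4090": 450}          # from the specification table
FP16_TFLOPS = {"L40S": 362.0, "RTX 4090": 165.0}    # from the specification table


def gpus_per_power_budget(gpu: str, budget_watts: int) -> int:
    """Number of cards whose combined TDP fits within the GPU power budget."""
    return budget_watts // TDP_WATTS[gpu]


def fp16_tflops_per_watt(gpu: str) -> float:
    """Peak FP16 TFLOPS per watt of TDP (peak figures, not measured efficiency)."""
    return FP16_TFLOPS[gpu] / TDP_WATTS[gpu]


if __name__ == "__main__":
    budget = 3000  # hypothetical GPU power budget in watts
    for gpu in TDP_WATTS:
        print(f"{gpu}: {gpus_per_power_budget(gpu, budget)} cards in {budget} W, "
              f"{fp16_tflops_per_watt(gpu):.2f} peak FP16 TFLOPS/W")
```

Under that hypothetical budget, eight L40S cards fit where six RTX 4090 cards would.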

Are both on Ada Lovelace?

Yes. Both use the Ada Lovelace architecture; the RTX 4090 launched in 2022 and the L40S in 2023. The shared architecture means similar tensor core features. The L40S is optimized for datacenter use.

Which is cheaper to rent, the L40S or the RTX 4090?

Cloud rental prices for both the L40S and RTX 4090 vary by provider, configuration, and availability. This page shows live pricing from 25+ providers updated every 60 seconds. Scroll to the Live Cloud Pricing section to compare current rates.

How much VRAM does the L40S have compared to the RTX 4090?

The L40S has 48 GB of GDDR6 memory. The RTX 4090 has 24 GB of GDDR6X memory.

Can I find L40S and RTX 4090 GPUs available to rent right now?

Yes. This page shows real-time availability across 25+ cloud GPU providers. The Live Cloud Pricing section displays only in-stock offers with current pricing.

What is the main difference between the L40S and the RTX 4090?

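Both GPUs use the Ada Lovelace architecture; the RTX 4090 dates from 2022 and the L40S from 2023. The main differences are memory and throughput: the L40S has 48 GB of VRAM against the RTX 4090's 24 GB, while the RTX 4090 delivers about 0.5x the FP16 throughput and 1.2x the memory bandwidth of the L40S.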