Notes on renting a gpu
This past week I attempted to reproduce the results from the 2019 deep double descent paper and found myself in need of a beefier GPU.
The details don't matter too much here, but the paper uses the resnet18 model from 2015, so the gpu requirements aren't too crazy. My 2024 m4 pro macbook can run this model locally at ~10s per training iteration.
However, the paper wants 4000 iterations (or "epochs"), which would take about 11 hours total. On top of that, it wants 64 model sizes with 3 different configurations to compare between. Even cutting that down to just 7 model sizes (size = 1, 2, 4, ... 64), that's over 9 days of non-stop training time for my poor macbook.
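For reference, the back-of-the-envelope math (nothing here beyond the numbers above):

import math

# Laptop-only plan: ~10s per epoch on the m4 pro, 4000 epochs per run,
# 7 model sizes x 3 configs.
seconds_per_epoch = 10
epochs = 4000
model_sizes = 7
configs = 3

total_hours = seconds_per_epoch * epochs * model_sizes * configs / 3600
print(total_hours, total_hours / 24)   # ~233 hours, or ~9.7 days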
I thought "renting a GPU" would be a complicated endeavor but it turns out to be mostly the same as spinning up an EC2 machine. Here, I'll document some things I learned about how GPU workloads scale and some practical notes for doing training runs on them.
getting a machine running
I made accounts on both vast.ai and lambda labs, having taken recommendations from ARENA. Lambda feels like a more enterprise solution, whereas vast.ai seems like airbnb'ing GPUs from randos. Overall, vast.ai has both more and cheaper options.
selecting a machine
There are a lot of knobs to decide what machine to get. I ended up renting probably 10-15 machines while I figured out what was going on.
network
Machines with ~100mbps up/down took longer both to boot up initially and to download pytorch and friends, which got frustrating pretty quickly.
memory and cpu
Resnet18 is pretty small (0.5-1GB per process depending on model size), but I ideally wanted many copies running in parallel. That meant my bottleneck was going to be on compute, not memory or memory bandwidth.
If I'm understanding correctly, more expensive cards (A100, etc.) could offer more memory/parallelism but can actually run at a slower clock speed. My training process is bottlenecked by the serial nature of training, so that extra parallelism would go to waste.
I ended up going with 2x Nvidia 4090 cards, which was overkill on memory (24GB each) but could train an epoch in a reasonable amount of time.
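As a napkin check on "overkill" (just reusing the per-process estimate from above, so treat it loosely):

per_process_gb = 1.0    # upper end of the 0.5-1GB per-process estimate
card_gb = 24            # a single 4090
print(card_gb / per_process_gb)   # ~24 concurrent runs before memory becomes the problem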
sm and pytorch compatibility
Once I figured out I wanted a fast, cheap card, I tried a 5090 (successor to the 4090, released Jan 2025), but as of writing in June 2025, stable pytorch doesn't support its sm version (think: something like the ISA for the gpu). The 5090 needs sm_120, while the stable pytorch wheels I got only shipped kernels up to sm_90. This ended up being hugely frustrating for someone not well-versed in python package management, and I ended up giving up after wrestling uv and pip for an hour.
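If you want to check this up front instead of finding out an hour in, torch can tell you both sides of the mismatch (a quick sketch; the example outputs are what I'd expect, not something I captured):

import torch

# What the card is vs. what this torch build was compiled for.
print(torch.cuda.get_device_capability(0))   # e.g. (12, 0) on a 5090
print(torch.cuda.get_arch_list())            # e.g. ['sm_80', 'sm_86', 'sm_90', ...]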
other considerations
I didn't pay much attention to the CPU or memory bandwidth each machine had because they weren't bottlenecks for my model runs, but I imagine they can factor in (dataset loading etc., especially for large models) depending on the workload.
I was also kind of shocked at the price. A 4090 will run you $4.7k, but renting one is ~$0.27/hr. That's nearly 2 years of compute before you get to break even.
For my purposes, 7 model sizes x 3 configs x 4000 epochs ended up taking ~20-22 hours for a full run across 2x 4090s, costing ~$13. Not bad! The deep double descent paper doesn't disclose its training time for comparison.
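The napkin math behind both of those numbers (the gap between ~$11 and the ~$13 bill is presumably whole-machine pricing vs. the per-card rate, but I didn't itemize it):

# Rent vs. buy, using the ~$0.27/hr 4090 listing price.
purchase_price = 4700
hourly_rate = 0.27
print(purchase_price / hourly_rate / 24 / 365)   # ~2.0 years of non-stop renting

# The actual experiment: 2 cards for ~21 hours.
print(2 * 21 * hourly_rate)                      # ~$11, in the ballpark of the bill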
figuring out what I actually needed
After you actually boot a machine up, it's more similar than not to any other VM. You can ssh in, git pull your notebook down, and get things running. As a vscode enjoyer, I found the remote ssh workflow works great.
I looked at a lot of nvidia-smi (system management interface) output, which shows the current GPU % utilization, memory usage, and power draw; that was super helpful for experimenting with how to best run my workload. Later, I switched to nvitop, which has the same information but with nicer visualizations and updates over time.
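If you'd rather log those numbers next to your training output than watch them scroll by, nvidia-smi also has a query mode you can shell out to (a small sketch; the three fields are just the stats I mentioned above):

import subprocess

# Dump GPU utilization, memory use, and power draw as CSV, one line per card.
fields = "utilization.gpu,memory.used,power.draw"
result = subprocess.run(
    ["nvidia-smi", f"--query-gpu={fields}", "--format=csv,noheader"],
    capture_output=True, text=True, check=True,
)
print(result.stdout)   # e.g. "98 %, 3012 MiB, 310.45 W"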
Coming from a background of mostly running CPU-compute systems, I was surprised to find that context switching seems relatively cheap on GPUs. Intuitively, this kind of makes sense: CPUs dedicate a lot of hardware (pipelining, branch prediction, etc.) to making serialized programs fast, while GPUs specialize in fanning compute out and doing things in parallel.
To illustrate this point: at first I thought I wanted many GPUs each running a single process instead of many processes on a single GPU, but that turned out not to matter much. More runs on a single GPU scaled linear-ish-ly and didn't have obvious contention problems.
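The per-process setup is about as boring as it sounds: each training process picks a card by index and otherwise behaves as if it were alone on the machine. A minimal sketch (pick_device is a made-up helper; rank is the per-job GPU index used in the script below):

import torch

# Each worker pins itself to one GPU by index; several workers can share an index.
def pick_device(rank: int) -> torch.device:
    return torch.device(f"cuda:{rank}" if torch.cuda.is_available() else "cpu")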
scripting multiple runs
My initial training script would execute only one model config at a time. I didn't really want to manually manage 21 different scripts on the remote machine, so I ended up using the python multiprocessing module with something like:
import random
import multiprocessing as mp

# TrainingArgs, ModelArgs, train, gpu_count, and args are defined elsewhere
# in the training script.

# Round-robin the jobs across GPUs, starting from a random one.
random_start = random.randint(0, gpu_count - 1)
k_set = [2**x for x in range(0, 7)]   # model sizes k = 1, 2, 4, ..., 64
jobs = [
    TrainingArgs(
        model_args=ModelArgs(k=k),
        # ...other training args
        rank=(i + random_start) % gpu_count,
    )
    for i, k in enumerate(k_set)
]

with mp.Pool(processes=gpu_count * args.jobs_per_gpu) as pool:
    pool.map(train, jobs)
As a caveat, this code assigns jobs to GPUs up front (via "rank"), but I found that pool.map doesn't seem to actually respect that ordering when starting jobs, so if you run a configuration where not all the jobs execute from the start, it can end up scheduling several jobs onto the same GPU at the same time for some reason. I briefly looked at fixing this but gave up since I wanted all the jobs running at once anyways.
I also had some trouble with separate tqdm progress bars clobbering each other. You can mitigate this to some degree with the position argument, but it only helps so much since the separate subprocesses can't coordinate writes to stdout.
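For what it's worth, the position wiring looks roughly like this (a hypothetical sketch: slot would need to be unique per concurrent job, and the epoch count and field names are just illustrative):

from tqdm import tqdm

def train(job):
    # Give each worker its own terminal line so the bars mostly stop
    # fighting over the same row.
    slot = job.rank
    for epoch in tqdm(range(4000), position=slot, desc=f"k={job.model_args.k}"):
        ...   # actual training step goes here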
gpu compute is more familiar than not
That's about it! For whatever reason, I thought "connecting to a GPU" would be a very annoying task of connecting my local pytorch over the network and doing some driver voodoo, but it turned out to be a pretty tame experience.