Profiling hardware utilization
Suppose a job has 4 GPUs allocated, but according to jobstats only 1 GPU is actually used. In cases like this, it is worth inspecting the job’s GPU usage with tools beyond jobstats and nvidia-smi. Below, we show how to use different tools to profile hardware utilization.
The following examples work with AI_env/v2 (module load AI_env/v2).
Torch profiler
Import the profiler tools:
from torch.profiler import profile, record_function, ProfilerActivity
To record GPU and CPU activities, use a context manager as follows:
...
train_loader = DataLoader(dataset=train_dataset, batch_size=batch_size, shuffle=True, pin_memory=True)

# Profile the first training batch (forward pass, backward pass and optimizer step)
# separately from the rest of the epoch.
activities = [ProfilerActivity.CPU, ProfilerActivity.CUDA]
with profile(activities=activities, record_shapes=True) as prof:
    images, labels = next(iter(train_loader))
    images = images.to(f'cuda:{dev[0]}', non_blocking=True)
    labels = labels.to(f'cuda:{dev[1]}', non_blocking=True)
    outputs = model(images)
    loss = criterion(outputs, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
prof.export_chrome_trace('trace.json')  # save the profiler results to a JSON file

# Re-create the DataLoader so training starts from a fresh iterator,
# then continue training as usual.
train_loader = DataLoader(dataset=train_dataset, batch_size=batch_size, shuffle=True, pin_memory=True)
start = datetime.now()
for epoch in range(num_epochs):
    ...
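Besides exporting a Chrome trace, the profiler can also print an aggregated text summary directly, which is often enough for a first look. The following is a minimal, self-contained sketch of that feature; it profiles a stand-in matrix multiplication on the CPU only, so it also runs on a node without GPUs (on a GPU node you would add ProfilerActivity.CUDA to the activities list, as in the example above).

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Minimal sketch (CPU-only so it runs anywhere): profile a stand-in workload
# instead of a real training batch.
with profile(activities=[ProfilerActivity.CPU], record_shapes=True) as prof:
    x = torch.randn(128, 128)
    y = x @ x  # stand-in for the forward/backward pass

# Print per-operator statistics, sorted by total CPU time.
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))
```

The printed table lists each recorded operator (e.g. aten::mm) with its call count and CPU time, which quickly shows where a batch spends its time.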
- Visualize the results
Copy the trace.json file to your local machine with scp (see here: https://docs.hpc.dkf.hu/first-steps/ssh-key.html#copying-files-using-scp).
You need a Chrome or Chromium web browser.
Type chrome://tracing into the address bar, then use “Load” to open the trace.json file. (The same trace file can also be opened at https://ui.perfetto.dev.)
Code example available here: https://git.einfra.hu/hpc-public/AI_examples/-/tree/main/multi_GPU_tracing?ref_type=heads
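The exported trace.json follows the Chrome Trace Event format, so you can also inspect it programmatically. The sketch below writes a tiny synthetic trace (illustrative only; real traces from export_chrome_trace contain many more events and fields, with durations in microseconds) and then sums the time spent per operator name:

```python
import json

# Illustrative only: a tiny synthetic trace in the Chrome Trace Event format,
# standing in for the trace.json written by prof.export_chrome_trace().
with open('trace.json', 'w') as f:
    json.dump({"traceEvents": [
        {"name": "aten::conv2d", "ph": "X", "ts": 0,    "dur": 1200},
        {"name": "aten::conv2d", "ph": "X", "ts": 1300, "dur": 1100},
        {"name": "aten::addmm",  "ph": "X", "ts": 2500, "dur": 400},
    ]}, f)

with open('trace.json') as f:
    data = json.load(f)

# Depending on the PyTorch version, the file is either a bare list of events
# or an object with a "traceEvents" key.
events = data["traceEvents"] if isinstance(data, dict) else data

# Sum the durations of complete ("X" phase) events per operator name.
totals = {}
for ev in events:
    if ev.get("ph") == "X":
        totals[ev["name"]] = totals.get(ev["name"], 0) + ev.get("dur", 0)

for name, dur in sorted(totals.items(), key=lambda kv: -kv[1]):
    print(f'{name}: {dur} us')
```

This kind of quick aggregation can reveal, without opening a browser, whether the batch time is dominated by compute kernels or by data movement.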
Torch TensorBoard Profiler
Import the profiler tools:
from torch.profiler import profile, record_function, ProfilerActivity
To record GPU and CPU activities, use a context manager as follows:
...
train_loader = DataLoader(dataset=train_dataset, batch_size=batch_size, shuffle=True, pin_memory=True)

# Profile the first training batch.
activities = [ProfilerActivity.CPU, ProfilerActivity.CUDA]
with profile(activities=activities,
             on_trace_ready=torch.profiler.tensorboard_trace_handler('log'),  # save the log files in the 'log' folder
             record_shapes=True,
             profile_memory=True,
             with_stack=True) as prof:
    images, labels = next(iter(train_loader))
    images = images.to(f'cuda:{dev[0]}', non_blocking=True)
    labels = labels.to(f'cuda:{dev[1]}', non_blocking=True)
    outputs = model(images)
    loss = criterion(outputs, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Re-create the DataLoader so training starts from a fresh iterator.
train_loader = DataLoader(dataset=train_dataset, batch_size=batch_size, shuffle=True, pin_memory=True)
start = datetime.now()
for epoch in range(num_epochs):  # continue training as usual
    ...
- Visualize the results
Forward a port from Komondor to your local machine: ssh -L 6006:localhost:6006 <yourusername>@komondor.hpc.kifu.hu
After connecting, load the AI_env module: module load AI_env/v2
Launch TensorBoard on Komondor: tensorboard --logdir=log
On your local computer, open the following URL in the browser: localhost:6006
Code example available here: https://git.einfra.hu/hpc-public/AI_examples/-/tree/main/multi_GPU_TB_tracing?ref_type=heads
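The examples above profile a single batch; profiling every step of a long run instead would produce very large traces. The profiler accepts a schedule that records only selected steps, the approach used in the PyTorch TensorBoard profiler tutorial linked below. The sketch here is a minimal, CPU-only version so it runs anywhere; on a GPU node you would add ProfilerActivity.CUDA and on_trace_ready=torch.profiler.tensorboard_trace_handler('log') as in the example above.

```python
import torch
from torch.profiler import profile, schedule, ProfilerActivity

# Record only a few steps of a longer loop:
# wait=1 skips the first step, warmup=1 warms up for one step,
# active=2 records the next two steps, repeat=1 does one such cycle.
sched = schedule(wait=1, warmup=1, active=2, repeat=1)

# CPU-only here so the sketch runs without a GPU.
with profile(activities=[ProfilerActivity.CPU], schedule=sched) as prof:
    x = torch.randn(64, 64)
    for step in range(4):
        y = x @ x      # stand-in for a real training step
        prof.step()    # tell the profiler that one step has finished

print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=5))
```

With a schedule attached, only the "active" steps contribute events, keeping the trace small even for long training runs.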
Complete documentation of the PyTorch TensorBoard Profiler: https://docs.pytorch.org/tutorials/intermediate/tensorboard_profiler_tutorial.html
Official PyTorch documentation: https://pytorch.org/docs/stable/index.html