Parallel Training Experiments with Notebooks
One of the most popular features of proxiML Notebooks is the ability to copy them into new instances with only three clicks. This tutorial walks through an example of how to use the notebook copy feature to spin off training experiments to test different hyperparameters on the same model in parallel.
Create the Source Notebook
This tutorial uses the same data and model code as the Get Started With Notebooks tutorial. Refer to that tutorial for a more detailed walkthrough of creating and using notebooks. If you already have the notebook from that tutorial running, you can skip to the next section.
Navigate to the Notebook Dashboard and click the Create button. Input a memorable name as the job name and select an available GPU Type (the code in this tutorial assumes an RTX 3090). Expand the Data section and click Add Dataset. Select Public Dataset as the dataset type and select ImageNet. Expand the Model section. Keep git selected as the Model Type and specify the tutorial code git repository url https://github.com/proxiML/examples.git in the Model Code Location field to automatically download the tutorial's model code. Click Next to view a summary of the new notebook and click Create to start the notebook.
When the notebook reaches the running state, click the Open button to launch a new window to the Jupyter environment. Navigate to models/notebooks in the file explorer pane and double-click the pytorch-imagenet.ipynb file. Observe the code section with the header Hyperparameters. These are the settings we will experiment with in the next step. From the Run option in the menu bar, select Run All Cells to start training. Scroll down to the bottom of the notebook to see the output from the training loop.
Running a New Experiment
While the source job is training, navigate back to the Notebooks Dashboard. To fork the existing job into a new notebook, select the job from the dashboard and click the Copy button in the dashboard menu. The Copy Notebook dialog will appear. To copy the entire notebook, including any changes made to the model code or files added, select Full (Data and Configuration) as the Copy Type. You have an opportunity to change the GPU Type, GPU Count, or Disk Size before copying, but this is not necessary for this tutorial. Give the notebook a memorable name that reflects the experiment you will run with this notebook (e.g. suffix the job name with "vgg16") and click Copy.
Once the copy process completes, the notebook will start automatically. Note that the source notebook continued running during the copy process and its training was not interrupted. Navigate to models/notebooks in the file explorer pane. If the source notebook finished at least one epoch prior to the copy, you will see the checkpoint files in the directory. Open the same notebook (pytorch-imagenet.ipynb) and scroll down to the Hyperparameters section. This model accepts the architecture as a hyperparameter, so change the arch variable to vgg16. Because this architecture requires considerably more GPU memory than the default resnet18, also lower the batch size (256 is a good number for the RTX 3090).
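Assuming the notebook exposes the architecture and batch size as plain variables in the Hyperparameters cell (the names here are illustrative), the edit for this experiment amounts to:

```python
# Experiment edit in the copied notebook: swap the architecture and
# shrink the batch to fit vgg16's larger per-image memory footprint.
# Variable names are illustrative; match whatever the notebook uses.
arch = "vgg16"    # was "resnet18" in the source notebook
batch_size = 256  # a good starting point for a 24 GB RTX 3090

print(f"training {arch} with batch size {batch_size}")
```

Because each copied notebook runs on its own instance, this change affects only the copy; the source notebook keeps training resnet18 untouched.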
If you get CUDA Out of Memory errors, your batch size is too high. When adjusting the batch size inside a running notebook, be sure to restart the kernel after each trial to free the GPU memory from the previous attempt. The easiest way to do this is to open the Kernel menu and select Restart Kernel and Run All Cells....
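If you would rather have the notebook find a workable batch size on its own, a common pattern is to halve the batch size and retry until one training step succeeds. This is a framework-agnostic sketch: train_one_step is a hypothetical callable standing in for your training loop, assumed to raise a RuntimeError containing "out of memory" when the batch does not fit, as PyTorch does on CUDA OOM:

```python
def find_workable_batch_size(train_one_step, start=512, floor=1):
    """Halve the batch size until one training step completes without
    an out-of-memory error, then return that batch size.

    train_one_step: hypothetical callable taking a batch size; assumed
    to raise RuntimeError("... out of memory ...") on CUDA OOM.
    """
    batch_size = start
    while batch_size >= floor:
        try:
            train_one_step(batch_size)
            return batch_size
        except RuntimeError as err:
            if "out of memory" not in str(err).lower():
                raise  # unrelated error: surface it to the user
            batch_size //= 2  # retry with half the batch
    raise RuntimeError("no batch size fits in GPU memory")
```

In a real PyTorch notebook the retry is less reliable than the kernel restart described above, because memory from the failed attempt can remain allocated; restarting the kernel between trials is the simpler approach.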
Once you are done editing the hyperparameters, run all cells to start training. Now you can observe the training progress of the vgg16 model and the resnet18 model in parallel.
Next Steps
You can repeat the process as many times as needed to try combinations of architectures or other hyperparameters. Once you are done, be sure to stop the notebooks to stop billing. The next tutorials focus on getting familiar with proxiML Training Jobs.