Create Checkpoints and Datasets from Job Outputs

· 2 min read

Checkpoints and Datasets are now supported output destinations for proxiML Training and Inference jobs.

How It Works

Previously, the proxiML output type did not accept the output_uri property. This property can now be specified as model (the default if not provided), dataset, or checkpoint. Additionally, there is a new output_options field called save_model. This field is set to True by default when using the model output type and False when using dataset or checkpoint. Currently, this field can only be changed when using the proxiML SDK.

When save_model is set to True, the ML_OUTPUT_PATH environment variable is set to /opt/ml/models instead of /opt/ml/output, and the contents of /opt/ml/models are uploaded to the output destination. If set to False, ML_OUTPUT_PATH remains set to /opt/ml/output and only that directory is uploaded to the output destination.
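In practice, a training script only needs to write its artifacts to the directory referenced by ML_OUTPUT_PATH; the job uploads that directory when the worker finishes. Here is a minimal sketch of that pattern (the file name and its contents are hypothetical placeholders):

import json
import os

# ML_OUTPUT_PATH is /opt/ml/output when save_model is False
# (dataset/checkpoint outputs) and /opt/ml/models when it is True.
output_dir = os.environ.get("ML_OUTPUT_PATH", "/opt/ml/output")
os.makedirs(output_dir, exist_ok=True)

# Hypothetical artifact: anything written here is uploaded to the
# new dataset or checkpoint when the worker exits.
with open(os.path.join(output_dir, "metrics.json"), "w") as f:
    json.dump({"epochs": 10, "status": "complete"}, f)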

Using the Web Platform

To create a dataset or checkpoint from a job, select proxiML as the Output Type and the desired entity type in the Data section of the job form. Once each worker in the job finishes, it will save the entire directory structure of /opt/ml/output to a new dataset or checkpoint named Job - <job name> if there is one worker, or Job - <job name> Worker <worker number> if there are multiple workers.

Using the SDK

To save a training job's output to a checkpoint instead of a model, use the following syntax:

job = await proximl.jobs.create(
    "Training Checkpoint Output",
    type="training",
    ...
    data=dict(
        ...
        output_type="proximl",
        output_uri="checkpoint",
        output_options=dict(save_model=False),
    ),
    ...
)
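Saving to a dataset follows the same pattern; the only change is the output_uri value. The sketch below is illustrative, with the elided fields standing in for the rest of the job configuration:

job = await proximl.jobs.create(
    "Training Dataset Output",
    type="training",
    ...
    data=dict(
        ...
        output_type="proximl",
        output_uri="dataset",
        output_options=dict(save_model=False),
    ),
    ...
)

Because save_model already defaults to False for the dataset and checkpoint output types, the output_options argument is shown only for clarity and can be omitted.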